Ep. 010 - How Much Do GPUs Really Cost, and Where Does the Value Go? (AI Cloud TCO) | Jordan Nanos, Dan Nishball, Kang Wen Cheang, Zane Fong

SemiAnalysis Weekly

May 01, 2026 • 47 min • Season 1, Ep. 10

Episode description

This episode features Jordan Nanos (@JordanNanos) and Daniel Nishball (@dnishball) breaking down the economics of GPU clusters using real-world data and experience. Joined by Kang Wen Cheang and Zane Fong, the team moves beyond theoretical TCO models and examines how reliability differences between top-tier and lower-tier providers create significant cost disparities that simple per-GPU pricing does not capture. The discussion introduces practical frameworks for measuring goodput and for understanding how system failures cascade through entire training jobs.

Nanos walks through the mechanics of fault-tolerant frameworks, including AWS's Checkpointless Training, and explains why a single GPU failure can halt progress across hundreds of nodes. The conversation covers how hyperscalers and NeoClouds price their services and why paying premium rates for reliable infrastructure often delivers better value than chasing the lowest per-hour cost.

Subscribe to SemiAnalysis for in-depth analysis of AI hardware economics and infrastructure trends that impact the entire semiconductor ecosystem.
