The AI Chip War: Who Controls the Future of Artificial Intelligence?

Speaker 1

00:01

Welcome to the Sentient Code, where intelligence is engineered, autonomy is emerging, and the line between human and machine grows thinner. Each episode, we decode the algorithms, explore the robotics, and examine the ideas shaping the future of artificial minds.

Speaker 2

00:23

Welcome back to the Deep Dive.

Speaker 3

00:25

You know.

Speaker 2

00:27

We spend so much time obsessing over the software side of AI.

Speaker 3

00:31

Oh, we absolutely do. It is all anyone talks about, right?

Speaker 2

00:34

We look at the chatbots and the image generators, the flashy demos, the code. It is always, look what this new model can do. But today I want to physically unplug all of that.

Speaker 3

00:46

So we are going down to the basement.

Speaker 2

00:48

We are going to the basement. We are stepping away from the cloud, and we're going to talk about the actual physical machine that makes the cloud exist. We are talking about the iron.

Speaker 3

00:57

The iron. You know, it is funny you use that term, because for the first, really the first fifty years of computing history, software people looked down on hardware people.

Speaker 2

01:06

It was just plumbing to them.

Speaker 3

01:08

Exactly. Hardware was a commodity. It was dirt. It was just the physical thing you ran your brilliant code on. And if your code was running slow, you didn't rewrite it. You just waited two years for Intel to make a faster chip, and your problem was magically solved.

Speaker 2

01:21

Moore's law was basically a free lunch for developers.

Speaker 3

01:24

It was a free lunch. But that free lunch is over. The script is completely flipped. Now we have moved from this era of code dominance to an era of compute dominance.

Speaker 2

01:34

Because in the AI world today, the smartest algorithm doesn't necessarily win anymore.

Speaker 3

01:39

No, it doesn't. The team with the biggest, most specialized pile of silicon wins, period.

Speaker 2

01:44

And we are calling this the compute race. I think what is so surprising to people tuning into this is that this isn't just Apple versus Microsoft anymore. This is arguably the central geopolitical conflict of the twenty twenties.

Speaker 3

01:56

It is the absolute bottleneck of the modern world. You look at the supply chain for advanced AI chips, it is terrifyingly concentrated. We are talking about a single company in the Netherlands that makes the manufacturing machines, a single island, Taiwan, that actually manufactures the chips, and a handful of US companies that design them.

Speaker 2

02:15

And if any single link in that chain breaks...

Speaker 3

02:17

The AI revolution doesn't just slow down, it stops entirely.

Speaker 2

02:20

So today we are going to trace that chain. We are going to figure out how a piece of hardware designed to let teenagers play Call of Duty somehow became the brain of modern civilization.

Speaker 3

02:31

It is an incredible pivot.

Speaker 2

02:33

And we are going to look at why Google panic-built their own secret chip factory, and we will get into the physical limits too, because apparently we are literally running out of atoms to work with.

Speaker 3

02:42

It is a story about physics, it is a story about economics, and ultimately it is a story about war.

Speaker 2

02:49

Let us start with the origin story, because this is the part that always feels like a massive accident of history to me. If you look at the trillion dollar club today, Nvidia is sitting right there at the top. But twenty years ago they were not trying to build artificial intelligence.

Speaker 3

03:04

Not even close. If you walked into Nvidia headquarters in, say, nineteen ninety nine or two thousand and five, they were obsessed with one single thing: polygons. Video games, specifically rendering three D graphics. They wanted to make better explosions, realistic textures, dynamic lighting, shadows. And to understand why that matters for AI, we really have to get a little technical here.

Speaker 2

03:23

Okay, let's break it down. We need to distinguish between the chip in your standard laptop, which is the CPU, and the chip in a gaming card, the GPU. Because I think most people hear the word processor and they just picture a little black square. What is the actual architectural difference inside that square?

Speaker 3

03:40

Okay, let's unpack this. Imagine a CPU, the central processing unit. This is your Intel Core i9 or your AMD Ryzen. A CPU is like a team of incredibly smart mathematicians. Let us say a tight knit team of twelve geniuses.

Speaker 2

03:56

Small team, very high IQ.

Speaker 3

03:58

Extremely high IQ, and highly versatile. If you give one of these CPU cores a really complex problem, something like run this operating system, then open Excel, then calculate this complex formula, then check for incoming email, it can handle that context switching perfectly.

Speaker 2

04:12

Because it is designed for serial processing.

Speaker 3

04:14

Exactly. Step A, then step B, then step C. It has massive amounts of complex logic built in just to handle branching paths, like if this happens, then do that.

Speaker 2

04:22

So a CPU is optimized for logic and sequence.

Speaker 3

04:24

Correct. Now look at a GPU, the graphics processing unit. A GPU is not a team of twelve geniuses. It is a stadium filled with ten thousand average high school students.

Speaker 2

04:35

Okay, I like where this analogy is going.

Speaker 3

04:37

Individually, those students aren't that smart. They cannot run a modern operating system. They will completely freeze up if you give them complex branching logic chains.

Speaker 2

04:46

But they have numbers on their side.

Speaker 3

04:48

Exactly. If you give them a task that is simple and repetitive, like take these two numbers and add them together, and you tell all ten thousand of them to do it at the exact same time...

Speaker 2

04:56

They will completely obliterate the CPU.

Speaker 3

04:59

They will leave it in the dust.

Speaker 2

05:00

And this is the core concept of parallelism.

Speaker 3

05:03

Specifically data parallelism. And this is exactly where video games come into the picture. Think about your computer screen right now. It is a grid of pixels. A standard monitor is 1920 by 1080, which is roughly two million pixels. To render just one single frame of a video game, you need to calculate the exact color for every single one of those two million pixels based on the virtual lighting, the texture of the wall, the geometry of the character.

Speaker 2

05:28

And the crucial part here is that the color of the pixel in the top left corner generally does not depend on the color of the pixel in the bottom right corner.

Speaker 3

05:36

Precisely, they are mathematically independent. You don't need to calculate pixel one and then wait to calculate pixel two and then pixel three. You can calculate all two million of them simultaneously. Computer scientists actually have a great term for this. They call it an embarrassingly parallel problem.
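
The pixel independence described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the shade function and the tiny 8 by 4 screen are invented for the example, and a small thread pool stands in for the thousands of cores on a real GPU.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 8, 4  # tiny hypothetical screen so the example runs instantly


def shade(coord):
    """Made-up lighting formula: a pixel's color depends only on its own position."""
    x, y = coord
    return (x * 31 + y * 17) % 256


coords = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]

# One worker walks through every pixel in order, the way a single core would.
serial = [shade(c) for c in coords]

# Several workers take pixels at the same time. No pixel reads another pixel,
# so no coordination between workers is needed.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(shade, coords))

print(parallel == serial)
```

Both approaches produce the same image. Only the number of workers changes, which is the property that makes the problem embarrassingly parallel.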

Speaker 2

05:51

I love that term so much. It is so parallel it is actually embarrassing not to do it all at once.

Speaker 3

05:57

And that is exactly why the GPU was invented. It actively sacrifices individual core speed and complex logic in exchange for raw massive parallelism. It uses thousands of tiny, relatively dumb cores instead of a few really smart ones.

Speaker 2

06:13

Okay, so we have this chip that was fundamentally designed to run games like Doom and Quake. How do we make the jump from rendering a virtual shotgun to running ChatGPT?

Speaker 3

06:21

This is where we hit the great convergence. Around the late two thousands, AI researchers were hitting a massive wall. They had these theoretical ideas about neural networks, which are basically mathematical structures inspired by the biological human brain, but actually training them was agonizingly slow because they were trying to run them on those CPUs, the twelve geniuses. And the geniuses were just getting bogged down, because a neural network, at its very core, is just a massive

06:49

grid of numbers. In math we call them matrices. To train an AI, you have to multiply these giant grids of numbers together, adjust the results slightly, and then do it again billions of times, literally billions of times.

Speaker 2

07:01

And I'm guessing matrix multiplication is...

Speaker 3

07:03

It is embarrassingly parallel. Multiplying a massive matrix is really just performing the exact same simple multiplication operation on thousands of numbers at the exact same time. It turns out the math required to simulate a photon of light bouncing off a three D wall in a video game is almost identical to the math required to simulate a virtual neuron firing in an artificial brain.
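
To make the claim about matrix multiplication concrete, here is a minimal plain Python sketch. The helper names and the 2 by 2 matrices are assumptions made for illustration. Each output cell is computed from one row and one column, so the cells can be filled in any order, or all at once.

```python
import random


def cell(a, b, i, j):
    """One output cell: the dot product of row i of a with column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul(a, b):
    """Fill the output grid cell by cell, in reading order."""
    return [[cell(a, b, i, j) for j in range(len(b[0]))] for i in range(len(a))]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
in_order = matmul(a, b)

# Fill the same grid in a shuffled order. Each cell reads only a and b,
# never another output cell, so the order cannot change the answer.
positions = [(i, j) for i in range(2) for j in range(2)]
random.shuffle(positions)
shuffled = [[None, None], [None, None]]
for i, j in positions:
    shuffled[i][j] = cell(a, b, i, j)

print(in_order, shuffled == in_order)
```

A GPU exploits exactly this independence by handing different output cells to different cores.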

Speaker 2

07:25

That is just such a wild coincidence to me.

Speaker 3

07:28

It is the happy accident that gave us the entire modern world.

Speaker 2

07:31

So when did the industry actually realize this? Was there a specific moment where the light bulb suddenly went on for everyone?

Speaker 3

07:38

There was a big bang moment. It was twenty twelve. A competition called ImageNet.

Speaker 2

07:41

Set the scene for us. What exactly was ImageNet?

Speaker 3

07:44

ImageNet was basically the Olympics of computer vision. You had this massive data set of millions of images, pictures of cats, dogs, airplanes, strawberries, and researchers had to write software that could look at the pixels and identify what was actually in the picture.

Speaker 2

08:00

Which is incredibly hard for a computer.

Speaker 3

08:03

Very hard. For years, the best teams in the world, mostly using traditional hand coded logic techniques, were stuck getting error rates around twenty six percent.

Speaker 2

08:11

That is not great. That is missing one out of every four pictures.

Speaker 3

08:14

It was the best we had at the time. Progress was agonizingly slow. Most people thought true human level computer vision was decades away. But then in twenty twelve, this small team from the University of Toronto shows up: Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.

Speaker 2

08:31

And those names, I mean, Ilya Sutskever and Geoffrey Hinton. These are the absolute titans of AI today, but back then they were kind of the outsiders.

Speaker 3

08:40

Right, they were the crazy ones. Neural networks were widely considered a dead end by most serious computer scientists, but this team entered a neural network they called AlexNet, and it didn't just win the competition, it utterly destroyed the field. They dropped the error rate from twenty six percent down to fifteen percent.

Speaker 2

08:57

In a single year. That is unprecedented for that competition.

Speaker 3

09:00

In one year. It was a mathematical massacre. The entire conference room went completely silent when they presented. But here's the specific detail that matters for our story today. To train AlexNet, they didn't use a massive government supercomputer. They didn't use a giant server cluster.

Speaker 2

09:16

What did they use?

Speaker 3

09:16

They literally went to a consumer electronics store and bought two Nvidia GTX 580 graphics cards.

Speaker 2

09:23

Two gamer cards, the exact kind of thing you would put in a desktop PC to play Skyrim.

Speaker 3

09:29

Exactly. Two cards that cost maybe five hundred dollars each at the time. They shoved them into a standard PC. They wrote some custom code to move the math off the CPU and onto the GPU, and they suddenly realized they could train their model in a matter of days instead of months.

Speaker 2

09:43

And that is the true aha moment, because if you can iterate in days, you can actually learn and adapt. If an experiment takes six months, you are just stuck.

Speaker 3

09:53

Exactly. Speed is intelligence in this field. If you can run one hundred experiments in the time it takes your rival to run one, you get smarter incredibly fast. Honestly, Nvidia's stock price chart should basically have a little bronze statue of AlexNet next to it. But this is the deep dive nuance we need to hit. It wasn't just the physical hardware that made this possible.

Speaker 2

10:11

Right, because you cannot just plug a video card into a motherboard and tell it to learn English. It inherently speaks graphics. It doesn't speak math.

Speaker 3

10:19

Correct. And this is where Nvidia's CEO Jensen Huang showed just incredible, almost prophetic foresight. Years before AlexNet, way back in two thousand and six, Nvidia released a software platform called CUDA. C-U-D-A.

Speaker 2

10:33

I see this acronym constantly when reading about this space. It is usually described as Nvidia's massive moat. What is it actually doing under the hood?

Speaker 3

10:43

Well, before CUDA existed, if you wanted to use a GPU for scientific math, you basically had to hack it. You literally had to trick the graphics card into thinking your math problem was actually a texture or a shadow on a polygon. It was an incredibly painful process.

Speaker 2

10:57

Please render this massive spreadsheet as an explosion.

Speaker 3

11:00

Basically, yes. It was a total nightmare for researchers. CUDA changed all of that. It was a software layer that let normal programmers write standard code, like C++, that ran directly on the GPU. It exposed the raw mathematical power of the chip without all the annoying graphics baggage.

Speaker 2

11:18

So Nvidia essentially built the translation layer before anyone even really knew what language they wanted to speak.

Speaker 3

11:23

Jensen Huang practically bet the entire company on it, and Wall Street absolutely hated it at the time. Investors were furious. They said, why are you spending billions of dollars on R and D for a feature that only a few academic weirdos in universities use?

Speaker 2

11:35

And then a few years later those weirdos invented modern AI.

Speaker 3

11:39

And because all those weirdos learned to code specifically in CUDA, the entire foundation of modern AI was built on top of Nvidia's proprietary software. Now, if you are a hardware startup today and you want to build a brand new chip to beat Nvidia, you have a massive, massive problem.

Speaker 2

11:56

Because nobody knows how to program your new chip.

Speaker 3

12:00

Exactly. All the libraries, all the developer tools, all the research, they all natively speak CUDA. It is the classic ecosystem lock-in. It is like Windows in the nineties or the iPhone app store today. It is incredibly difficult to break that habit.

Speaker 2

12:14

Let us fast forward to today then, because we are obviously not using five hundred dollar GTX 580s anymore. We're using the H100. This is the chip that companies are fighting over, the one Mark Zuckerberg is supposedly buying three hundred and fifty thousand of.

Speaker 3

12:25

The H100 is a monster. It is a true marvel of human engineering.

Speaker 2

12:29

Give me the physical stats. What are we actually looking at here?

Speaker 3

12:32

It is a slab of silicon that has eighty billion individual transistors carved into it using a four nanometer manufacturing process. Just wrap your head around that: eighty billion on one chip. But honestly, the raw transistor count isn't even the most impressive part. It is how highly specialized the architecture has become. Specialized in what way? Remember how

12:53

the old GPUs were fairly general purpose for graphics? The H100 is designed specifically, from the ground up, for the math of transformers, which is the T in ChatGPT. It has specific hardware units inside it called tensor cores.

Speaker 2

13:07

Tensor cores.

Speaker 3

13:08

Think of them as dedicated calculator circuits that do nothing but matrix multiplication. They cannot render graphics, they cannot run an operating system. They just do that one specific math operation incredibly fast. The H100 can perform roughly four thousand trillion floating point operations per second if you use the right precision levels.

Speaker 2

13:25

Four thousand trillion operations per second. That is unfathomable.

Speaker 3

13:28

But here's the crazy part. Raw compute speed is actually the easy part of chip design. Now, the real bottleneck, the thing that actually keeps chip architects up at night, is memory.

Speaker 2

13:37

This is the memory wall concept I keep reading about, right?

Speaker 3

13:40

It simply does not matter if your processor brain can think a billion thoughts a second if you cannot get the data into the brain fast enough. A super fast chip with slow memory is like a Ferrari with a clogged fuel line. It just stalls out. So the H100 uses a brand new type of memory called HBM, high bandwidth memory.

Speaker 2

13:59

How does that solve the fuel line problem?

Speaker 3

14:01

It is stacked vertically. They literally build skyscrapers of memory chips directly on top of the processor itself to physically shorten the distance the electrical signals have to travel.

Speaker 2

14:12

So they are building three D towers of memory right next to the logic cores just to save the fractions of a nanosecond it takes for the signal to travel across a standard motherboard.

Speaker 3

14:21

We are actively fighting the speed of light at this point. The H100 has a memory bandwidth of over three point three terabytes per second. To put that in perspective, that is like downloading thousands of full four K movies in a single second. It is an absolute fire hose of data.
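
The memory wall can be put into rough numbers using the two figures quoted in this conversation. The sketch below is a back-of-the-envelope, roofline-style estimate. Treating the spoken figures as exact, and the simple min() model itself, are assumptions and not vendor specifications.

```python
# Figures as quoted in the conversation; treating them as exact is an assumption.
PEAK_FLOPS = 4000e12     # "four thousand trillion" operations per second
MEM_BANDWIDTH = 3.3e12   # "three point three terabytes per second", in bytes

# How many operations the chip must perform on each byte it fetches
# in order to keep its math units fully busy.
break_even = PEAK_FLOPS / MEM_BANDWIDTH


def attainable_flops(ops_per_byte):
    """Throughput is capped by compute or by memory traffic, whichever is lower."""
    return min(PEAK_FLOPS, MEM_BANDWIDTH * ops_per_byte)


print(f"break-even: about {break_even:.0f} operations per byte")
print(f"share of peak at 1 op per byte: {attainable_flops(1) / PEAK_FLOPS:.4f}")
```

Under these assumptions the chip needs roughly 1,200 operations per byte fetched before compute, rather than memory, becomes the limit. That gap is why so much engineering effort goes into memory bandwidth.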

Speaker 2

14:37

And they use something called NVLink to string them together, right?

Speaker 3

14:40

Yes, NVLink is their proprietary interconnect, because one H100 isn't enough. You need thousands of them to function as one giant unified brain. NVLink is the nervous system that lets them talk to each other fast enough to stay synchronized.

Speaker 2

14:55

And yet despite all of that insane power and the CUDA moat, Nvidia is not the only player in town anymore. Which brings us to this sleeping giant that suddenly woke up: Google.

Speaker 3

15:07

This is truly one of my favorite corporate history stories, because while everyone else in the world was just blindly buying Nvidia chips, Google looked at their internal usage data and absolutely freaked out.

Speaker 2

15:18

This was back around twenty thirteen, right?

Speaker 3

15:20

Twenty thirteen, yeah. Google engineers did a back of the napkin calculation that terrified them. They looked at the rapid rise of voice search on Android phones and they realized that if every single Android user used voice search for just three minutes a day...

Speaker 2

15:33

Just three minutes, that is like two quick searches.

Speaker 3

15:35

Exactly, almost nothing. But they calculated that those three minutes would require so much compute power to process the speech recognition that it would completely double Google's entire global data center footprint.

Speaker 2

15:45

Doubled their entire footprint.

Speaker 3

15:48

They would have had to build twice as many data centers as they had built in their entire corporate history just to support three minutes of voice search. They realized instantly that if they relied on buying standard Intel CPUs and Nvidia GPUs, they would literally go bankrupt. The economics just flat out did not work at that scale.

Speaker 2

16:07

So, in classic Google fashion, they just decided, we will build our own hardware.

Speaker 3

16:12

They launched a highly secret internal project to build the TPU, the tensor processing unit, and their design philosophy was incredibly radical compared to Nvidia, because Nvidia sells GPUs to everyone, right? They have to be good at gaming, crypto mining, self driving cars, scientific simulations. Google said, we do not care about gaming. We do not care about graphics at all. We want a chip that does deep learning and absolutely nothing else.

Speaker 2

16:38

So they stripped the sports car all the way down to the chassis. No AC, no radio, just a massive engine.

Speaker 3

16:42

Even the engine itself is totally different. They used a specific architecture called a systolic array.

Speaker 2

16:47

Systolic, like blood pressure? Like a heartbeat?

Speaker 3

16:49

Exactly like a heartbeat. In a normal CPU or GPU, the chip acts kind of like a library. You go to the shelf to get a book, which is your data. You bring it to the desk, the processor, you read it, and then you walk all the way back to put it on the shelf. That walking back and forth, accessing the memory, takes a massive amount of energy and time.

Speaker 2

17:07

And as we just established, energy and memory are the ultimate enemies here.

Speaker 3

17:11

Right. So in a systolic array, you do not put the book back on the shelf. You process it, and then you immediately hand it to the person sitting right next to you. The data physically flows through the grid of the chip in a wave. One calculation finishes and simply pushes the result directly to the next math unit. It heavily mimics a continuous flow of blood through a circulatory system.

Speaker 2

17:33

So the data just enters one side of the chip, flows through this massive grid of math units getting multiplied, and just pops out the other side as a finished result.

Speaker 3

17:41

Exactly. It drastically reduces the need to constantly access external memory, and the result was staggering. That first internal TPU they deployed was roughly fifteen to thirty times more efficient per watt than anything else available on the commercial market at the time.
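
The hand-it-to-your-neighbor idea can be simulated in plain Python. The sketch below assumes one textbook variant, an output-stationary array, where inputs flow right and down while each cell keeps a running sum. The real TPU dataflow differs in detail, and the function name and matrices are illustrative only.

```python
def systolic_matmul(a, b):
    """Simulate a grid of multiply-accumulate cells, one per output value."""
    n, inner, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # running sum held inside each cell
    a_reg = [[0] * m for _ in range(n)]  # values travelling left to right
    b_reg = [[0] * m for _ in range(n)]  # values travelling top to bottom

    for t in range(n + m + inner - 2):   # enough beats for the wave to pass
        for i in range(n):
            # Each cell takes the value its left neighbor held on the last beat.
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i                    # row i is fed with a delay of i beats
            a_reg[i][0] = a[i][k] if 0 <= k < inner else 0
        for j in range(m):
            # Each cell takes the value its upper neighbor held on the last beat.
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j                    # column j is fed with a delay of j beats
            b_reg[0][j] = b[k][j] if 0 <= k < inner else 0
        # Every cell multiplies the pair passing through it and adds it up.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each input value is read from outside the grid once, at the edge, and is then reused by every cell it passes through. That reuse is what cuts down on memory traffic.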

Speaker 2

17:57

That is an insane leap in efficiency. And Google didn't stop there. They kept iterating on it.

Speaker 3

18:02

Oh yeah. Version two came out in twenty seventeen, and that was a huge deal because the first one could only run models, it couldn't train them. V two added full training capabilities, and today we are on V four and V five. They are entirely liquid cooled now and they deploy them in massive clusters they call pods.

Speaker 2

18:19

Thousands of chips all wired together.

Speaker 3

18:21

Right, and this brings up a massive advantage Google has over almost everyone else. It is their interconnect system called ICI.

Speaker 2

18:28

How is that different from Nvidia's NVLink?

Speaker 3

18:31

Because Google completely controls their own data centers, they can wire these TPUs directly to each other in what is called a torus topology. Think of it like a giant three D donut shape. They use direct optical links between the chips. They don't have to route the data through standard bulky networking switches. It makes those thousands of TPUs act flawlessly as one single brain.
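
A torus is easy to state in code: every chip links to its neighbors along each axis, and the edges wrap around. The sketch below assumes a hypothetical 4 by 4 by 4 grid. The conversation does not give the real pod dimensions or wiring details.

```python
def torus_neighbors(coord, shape):
    """Return the directly wired neighbors of one chip in a wraparound grid."""
    neighbors = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            moved = list(coord)
            # The modulo is the wraparound: stepping off one edge lands on the other.
            moved[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(moved))
    return neighbors


SHAPE = (4, 4, 4)  # hypothetical pod: 64 chips in a three dimensional grid

# A corner chip still has six neighbors, because the grid has no real edges.
print(torus_neighbors((0, 0, 0), SHAPE))
```

Because of the wraparound, no chip sits at an edge, and the longest trip along any axis is half the grid size rather than the full size.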

Speaker 2

18:52

And this is why today when you look at Google, they don't really buy Nvidia chips for their core internal AI training. They use their own hardware.

Speaker 3

19:01

They use TPUs for almost everything. Gemini, which is their massive competitor to GPT-4, was trained entirely on TPUs. It gives Google this incredible vertical integration.

Speaker 2

19:10

They own the whole stack.

Speaker 3

19:11

They own the chip design, the physical server rack, the custom cooling systems, the software framework, which is TensorFlow or JAX, and the final AI model itself. It is essentially the Apple iPhone strategy, but applied to a warehouse sized data center. They completely control their own destiny.

Speaker 2

19:27

Which is an amazing position to be in. So we have the reigning commercial champion in Nvidia. We have the independent, vertically integrated superpower Google. But looking at the market right now, it feels like the floodgates have totally opened. Every major tech company is suddenly announcing their own custom chip.

Speaker 3

19:43

It is the great me-too wave of silicon, and it is driven by very simple, very ruthless economics. If you are Amazon AWS or Microsoft Azure, you are currently spending tens of billions of dollars a year buying chips from Nvidia. That just vaporizes your profit margins.

Speaker 2

20:02

And worse than that, you are entirely dependent on a single supplier who literally cannot manufacture the chips fast enough to meet your needs.

Speaker 3

20:10

Exactly. So let's look at the broader landscape. You have Amazon AWS, who took a very smart, bifurcated approach. They split the AI problem in half. They built a chip called Trainium specifically for training models, and a separate chip called Inferentia for running them.

Speaker 2

20:23

We actually really need to pause here and define this clearly, because the difference between training and inference comes up constantly in this space. What is the actual practical difference?

Speaker 3

20:31

The best way to think about it is to think of training as graduate school.

Speaker 2

20:36

Okay, graduate school. Years of intense work, massive amounts of coffee, incredibly expensive.

Speaker 3

20:42

Exactly. Training is the phase where the AI is actively learning from scratch. You are feeding it essentially the entire written text of the Internet. It takes months of continuous run time. You need massive, incredibly expensive compute clusters with thousands of GPUs working in perfect unison. You need high precision math so the model can learn tiny nuances. This is the graduate school phase, and this is where Nvidia absolutely dominates.

Speaker 2

21:08

But eventually the model finishes its exams, it graduates.

Speaker 3

21:11

It graduates, and it has to go get a job. That job is inference. Inference is what happens when you open an app, ask ChatGPT a question, and it types out an answer. The model is no longer learning. Its weights are frozen. It is simply applying the knowledge it had already gained in grad school to a new prompt.

Speaker 2

21:26

And that happens in real time, instantly, right?

Speaker 3

21:29

It happens in milliseconds, and it happens billions of times a day across the world. And the hardware needs for that day job are completely different than grad school. For inference, you don't need massive precision. You care about latency, how fast can I serve this answer to the user? And you care deeply about costs and power consumption.
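
One common way to trade precision for speed and cost at inference time is to store weights as small integers. The sketch below assumes a simple symmetric 8-bit scheme. The conversation does not say which scheme any particular chip uses, and the weight values are made up.

```python
def quantize_int8(weights):
    """Map floats onto integers in [-127, 127] using one shared scale factor."""
    largest = max(abs(w) for w in weights) or 1.0  # guard against all-zero input
    scale = largest / 127
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9]  # made-up "frozen" weights after training
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding moves each value by at most half of one quantization step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, worst_error)
```

Each weight now fits in one byte instead of four, which means less memory traffic per answer. The price is a small rounding error that would matter more during training than during inference.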

Speaker 2

21:47

Because you only train the model once, maybe twice a year, but you run inference on it constantly, every single second of every day.

Speaker 3

21:56

Precisely. As AI applications explode and get integrated into every piece of software, the overall market for inference chips is actually projected to grow much faster than the market for training chips. That is exactly why Amazon built Inferentia. It is designed to be a cheap, highly efficient chip just for that day job workload. Microsoft is doing the exact same thing with their Maia accelerator for Azure.

Speaker 2

22:18

But what about Meta? Because Facebook and Instagram, they aren't selling cloud server space to startups like Amazon and Microsoft do. Why are they spending billions to design their own chip?

Speaker 3

22:28

Meta is a fascinating outlier here. Their core AI problem is fundamentally different from OpenAI's or Google's. They aren't primarily building conversational text chatbots. Their entire trillion dollar business depends on recommendation engines.

Speaker 2

22:44

Right, the algorithm deciding exactly which reel or ad to show me next, so I do not close the app.

Speaker 3

22:51

Exactly. And computationally speaking, a recommendation engine is a very weird mathematical problem. It relies on something called embedding tables. These are just astronomically massive databases that map out user preferences and content features.

Speaker 2

23:04

So how does a chip process that differently?

Speaker 3

23:06

Well, when you are generating text with a language model, the math is very dense and predictable. But when you are pulling from embedding tables for an ad recommendation, the memory access pattern is random, sparse, and chaotic. You're jumping all over the place pulling bits of user history.
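
The scattered access pattern of an embedding lookup can be sketched as follows. The table size, vector width, and user history are all made up for illustration, and averaging the rows is just one simple pooling choice, not a description of Meta's models.

```python
import random

random.seed(0)
NUM_ITEMS, DIM = 1000, 4  # hypothetical: 1,000 items, 4 numbers per item

# The embedding table: one small vector of learned numbers per item.
table = [[random.random() for _ in range(DIM)] for _ in range(NUM_ITEMS)]


def user_vector(history_ids):
    """Gather the rows for items a user touched and average them column-wise."""
    rows = [table[i] for i in history_ids]  # scattered reads across the table
    return [sum(column) / len(rows) for column in zip(*rows)]


history = [917, 3, 512, 44]  # made-up item ids with no particular order
vector = user_vector(history)

# Only a sliver of the table is read, and from unpredictable places.
fraction_touched = len(set(history)) / NUM_ITEMS
print(vector, fraction_touched)
```

A dense matrix multiply streams through memory in order and uses every number it loads. Here the hardware jumps to four unrelated rows out of a thousand, which is the kind of sparse, irregular access the speakers describe.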

Speaker 2

23:21

So a standard Nvidia GPU just isn't efficient for that kind of chaotic memory access.

Speaker 3

23:26

It is massive overkill in some compute areas and horribly inefficient in memory access for others. So Meta designed their own chip, the MTIA, the Meta Training and Inference Accelerator. It is tuned specifically to handle the chaotic memory demands of massive embedding tables. It really shows that the industry is moving rapidly away from this idea of one chip fits all and moving toward the right custom chip for the specific math problem.

Speaker 2

23:53

And Apple is doing this too, right, with the Neural Engine on iPhones? But their goal is keeping the AI on the physical phone for privacy rather than sending it to a cloud data center.

Speaker 3

24:02

Exactly. On-device inference. Everyone is carving out their own specialized niche.

Speaker 2

24:06

Okay, we cannot talk about chip design without talking about the rebels in the room, the startup landscape, because honestly, it takes a certain level of sheer insanity to try to start a hardware company from scratch to compete against a giant like Nvidia, but people are actually doing it.

Speaker 3

24:20

It is notoriously the hardest game in Silicon Valley. But there are two startups right now that really highlight the extreme physical limits we are pushing in chip architecture: Cerebras and Groq.

Speaker 2

24:32

Let us start with Cerebras Systems. These are the wafer scale guys.

Speaker 3

24:35

So to understand what Cerebras did, you have to look at how chips are normally made. Normally, a factory takes a silicon wafer, which is basically a shiny disc of silicon roughly the size of a dinner plate. They print hundreds of identical small chips onto that plate, and then they slice the plate up into little squares.

Speaker 2

24:54

And then you take those little individual squares, put them in protective plastic packages, and wire them all back together on a big green motherboard.

Speaker 3

25:01

Right. But that wiring them back together part, that is the ultimate bottleneck. Moving data across copper wires between different chips is painfully slow and burns a ton of energy. So the founders of Cerebras just asked a seemingly crazy question: why are we cutting the wafer at all?

Speaker 2

25:18

They just use the entire dinner plate as one single chip.

Speaker 3

25:21

The whole plate. The Cerebras Wafer Scale Engine is a single chip roughly the size of an iPad. It contains four trillion transistors.

Speaker 2

25:29

Four trillion on one piece of silicon. Visually, it is just such a cool concept. But practically, I mean, manufacturing at the atomic level is not perfect. Usually, if a tiny speck of dust ruins one chip on a wafer, you just throw that one small square away and keep the other three hundred. If the entire wafer is the chip, doesn't one single manufacturing defect ruin the whole multi million dollar plate?

Speaker 3

25:52

You just nailed the exact reason no one ever successfully did this before. It's called the yield problem. Cerebras had to invent a completely novel networking architecture to route around the physically broken parts on the fly. If a microscopic section of the wafer has a manufacturing defect, the internal software simply ignores it and routes the data to the healthy neighbors.
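
The general idea of routing around a dead region can be shown with a shortest-path search on a small grid. This is a generic illustration only: the grid size, the defect position, and the breadth-first search are assumptions, and nothing here describes how Cerebras actually implements defect tolerance.

```python
from collections import deque


def route(grid_size, defects, start, goal):
    """Breadth-first search for a shortest path that never enters a defective cell."""
    rows, cols = grid_size
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and nxt not in defects and nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the goal is walled off by defects


# A hypothetical 3 by 3 patch of cores with one dead core in the middle.
healthy_path = route((3, 3), set(), (1, 0), (1, 2))
detour_path = route((3, 3), {(1, 1)}, (1, 0), (1, 2))
print(healthy_path, detour_path)
```

With the center core marked defective, the message still arrives. It takes two extra hops around the damage instead of failing outright.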

Speaker 2

26:11

It is like having a biological brain with a few dead neurons. The overall network just adapts and routes around the damage.

Speaker 3

26:18

Exactly. And the massive benefit of doing this is unprecedented bandwidth. Because everything, all the memory and all the compute cores, is physically located on the exact same piece of unbroken silicon, communication is nearly instantaneous. You never have to wait for data to travel across a slow external wire. It is an absolute beast for training massive models.

Speaker 2

26:38

And then on the other end of the spectrum there is Groq, spelled with a Q. I have seen their inference demos online where it just prints out paragraphs of text instantly. It genuinely feels faster than human thought.

Speaker 3

26:49

Groq took a completely different, almost philosophical approach. They looked at the modern GPU and said, there is way too much chaotic management going on inside this chip. The standard GPU dedicates a massive amount of physical hardware and energy just to scheduling, dynamically deciding which core should do which math problem next. Groq stripped all of that dynamic management out completely.

Speaker 2

27:11

They made it strictly deterministic.

Speaker 3

27:13

Yes, a deterministic architecture. In their system, called an LPU or language processing unit, the software compiler maps out exactly what every single transistor will do at every single clock cycle before the program even starts running.
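
A minimal sketch of the compile-time scheduling idea described here, under stated assumptions: every operation takes exactly one cycle, and `compile_schedule`, `OPS`, and the unit count are hypothetical names chosen for illustration. It is a toy list scheduler, not Groq's actual compiler. The point it shows is that the whole timetable exists before anything runs, so nothing needs to be decided at run time.

```python
from collections import defaultdict


def compile_schedule(ops, num_units):
    """Assign every op a fixed (cycle, unit) slot before anything runs.

    ops: list of (name, deps) given in dependency order.
    Assumption: each op takes exactly one cycle on any unit.
    """
    slot = {}                # op name -> (cycle, unit)
    used = defaultdict(int)  # cycle -> number of units already taken
    for name, deps in ops:
        # An op may start only after all of its inputs have finished.
        cycle = max((slot[d][0] + 1 for d in deps), default=0)
        # If every unit is busy in that cycle, slide to the next free cycle.
        while used[cycle] >= num_units:
            cycle += 1
        slot[name] = (cycle, used[cycle])
        used[cycle] += 1
    return slot


# Hypothetical four-step program: two loads feed a multiply, which feeds an add.
OPS = [
    ("load_a", []),
    ("load_b", []),
    ("mul", ["load_a", "load_b"]),
    ("add", ["mul"]),
]
SCHEDULE = compile_schedule(OPS, num_units=2)
```

Running the same compile twice yields the same table, which is the deterministic property being described: the hardware only has to replay the plan.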

Speaker 2

27:28

Like a perfectly choreographed dance routine where everyone knows their steps in advance, so you don't need a director shouting orders in real time.

Speaker 3

27:36

Exactly. There is zero hesitation. And crucially, they do not use the massive, slow external memory like HBM that Nvidia uses. They exclusively use something called SRAM. What is SRAM? Static RAM. It is a type of incredibly fast memory that lives directly on the processor die itself, right next to the logic gates. It is vastly more expensive to manufacture, and you physically cannot fit very much of it on a chip, but it completely eliminates the delay of fetching data.

28:01

That is exactly why Groq is so blindingly fast at generating text tokens. It is perfectly optimized for the inference day job, where raw speed is everything.

Speaker 2

28:10

But again, all these incredible hardware startups face the exact same invisible wall we talked about earlier: software.

Speaker 3

28:16

The CUDA ecosystem. You can literally build the fastest, most beautiful chip in human history, but if a researcher's standard PyTorch code does not run on it effortlessly out of the box, nobody's going to buy it. Nvidia has a fifteen-year head start on software momentum. Breaking that psychological and technical lock-in is arguably harder than breaking the laws of physics.

Speaker 2

28:39

Speaking of breaking things, let us zoom out to the global map, because up until now we've just been talking about corporate rivalries. But this isn't just about companies anymore. It is about countries.

Speaker 3

28:49

Geopolitics. This is where the story gets genuinely scary.

Speaker 2

28:53

You mentioned at the very beginning that the supply chain is highly concentrated. Walk us through just how fragile this map actually is.

Speaker 3

28:59

Imagine if all the oil in the world, every single drop, was pumped from just three buildings. That is the modern semiconductor industry. Let us start with the machine that actually prints the chips. ASML, headquartered in the Netherlands, is the sole manufacturer of extreme ultraviolet, or EUV, lithography machines.

29:17

To even begin to understand how insanely complex this machine is: these machines generate the EUV light needed to print atomic-level circuits by dropping a microscopic droplet of molten tin inside a vacuum chamber. Molten tin falling through a vacuum? Yes, and as it falls, they hit that microscopic droplet with a high-powered laser. The impact flattens the droplet into a pancake shape, and then a microsecond later they hit

29:42

it again with a second, infinitely more powerful laser. This instantly vaporizes the tin into a plasma, which emits a very specific thirteen point five nanometer wavelength of extreme ultraviolet light.

Speaker 2

29:54

That sounds like a weapon from Star Trek. And how often is it doing this laser plasma explosion?

Speaker 3

30:00

Fifty thousand times a single second.

Speaker 2

30:01

That is just incomprehensible.

Speaker 3

30:03

It does that continuously fifty thousand times a second to generate enough light to etch physical features onto silicon that are literally the size of a few strands of human DNA.

Speaker 2

30:11

And you are telling me only one single company on planet Earth knows how to build this machine.

Speaker 3

30:16

Only one: ASML. If their primary factory in the Netherlands experiences a major flood or a fire, Moore's law simply ends, period. Nobody else can make them.

Speaker 2

30:26

And then once you have that two hundred million dollar ASML machine, the chips themselves are mostly manufactured in Taiwan.

Speaker 3

30:31

By TSMC, the Taiwan Semiconductor Manufacturing Company. They manufacture upwards of ninety percent of the world's most advanced logic chips. All of Apple's chips, all of Nvidia's advanced chips, they all come out of TSMC fabs in Taiwan. This creates a massive, glaring geopolitical vulnerability for the rest of the world.

30:51

If China were to blockade the island of Taiwan, or if there was just a catastrophic earthquake there, the entire global economy would lose its primary computing engine overnight.

Speaker 2

31:00

And this sheer panic over that vulnerability is exactly why the United States initiated the chip war.

Speaker 3

31:05

Exactly. The US government looked closely at this supply chain map and realized that AI is fundamentally a dual-use technology. The exact same H one hundred chip that runs a friendly customer service chatbot can easily be used to model the aerodynamics of hypersonic nuclear missiles or orchestrate massive cyber warfare campaigns at a global scale.

Speaker 2

31:24

So the US government stepped in with heavy export controls.

Speaker 3

31:27

Starting heavily in twenty twenty two, the US Department of Commerce outright banned the sale of Nvidia's absolute top-tier frontier chips, the A one hundred and H one hundred, to any entity inside China. The explicit geopolitical goal was to freeze China's AI progress in place, keeping them permanently a generation or two behind American labs.

Speaker 2

31:48

But Nvidia is a publicly traded company. They obviously did not want to lose out on the massive Chinese tech market.

Speaker 3

31:54

No. China represents billions of dollars in revenue for them, so Nvidia engineers went back to the drawing board and quickly designed new chips, the A eight hundred and the H eight hundred. These were slightly modified versions of their flagship chips, designed specifically to comply with the exact letter of the new US law.

Speaker 2

32:10

How did they manage to cripple it enough to make it legal? Did they just turn down the clock speed and make the math slower?

Speaker 3

32:16

No, and that is what was so incredibly clever about it. They kept the raw compute speed exactly the same. The H eight hundred could crunch matrix math just as fast as the H one hundred. But what they did was cut the interconnect speed, the NVLink communication speed, completely in half.

Speaker 2

32:30

The wire speed between the chips.

Speaker 3

32:32

Exactly. Why? Because if you cannot connect thousands of chips together fast enough, you physically cannot build a cohesive supercomputer. You cannot train a massive trillion-parameter frontier model like GPT four if the chips can't talk to each other rapidly. The data just bottlenecks. You can still do basic inference, you can run smaller AI models locally, but you cannot create the next generation of frontier AI.

Speaker 2

32:58

That is such a hyper-specific, surgical restriction, just snipping the communication cables.

Speaker 3

33:03

Essentially, it was brilliant engineering. But then the US government looked at the H eight hundred, realized it was still too powerful, and abruptly tightened the rules again the following year to ban even those workaround chips. It has become this intense, high-stakes game of regulatory cat and mouse.

Speaker 2

33:19

And what is China doing in response to all this? They surely aren't just throwing their hands up and giving up on AI.

Speaker 3

33:24

Not at all. And this brings up the massive unintended consequence argument that policy experts are debating right now. By aggressively cutting China off from the best Western hardware, we basically forced them to heavily subsidize and build their own completely independent supply chain. Huawei, despite massive sanctions, has released an AI chip called the Ascend nine ten B.

Speaker 2

33:47

Is it actually as good as an Nvidia H one hundred?

Speaker 3

33:50

No, it is generally considered slower and less efficient. But is it good enough to train large language models? Yes, it absolutely is. And SMIC, which is China's state-backed semiconductor manufacturer, is figuring out brilliant ways to use older non-EUV machines to painstakingly manufacture advanced seven nanometer chips.

Speaker 2

34:09

So by trying to completely starve them of chips, we might have accidentally accelerated the exact thing we were terrified of: full Chinese independence and self-sufficiency in cutting-edge silicon.

Speaker 3

34:18

It is a very real possibility. We've pushed an economic superpower into a desperate corner and they are actively engineering their way out of it.

Speaker 2

34:24

Meanwhile, the US is desperately trying to bring manufacturing back onto domestic soil with the CHIPS Act.

Speaker 3

34:30

Which is a massive piece of industrial policy. The US government is spending over fifty two billion dollars to subsidize companies like Intel to build fabs in Ohio and TSMC to build fabs in Arizona.

Speaker 2

34:41

But fifty two billion? I mean, earlier you said one single modern fab costs twenty billion dollars to build.

Speaker 3

34:47

Exactly. The money goes fast, and it's not just about the money. It is a severe talent deficit. We haven't built leading-edge silicon factories at scale in the US for decades. We simply do not have the thousands of specialized PhDs or the massive workforce of highly trained clean room technicians required. You can pour concrete in a year, but it takes a generation to rebuild that specialized human capacity.

Speaker 2

35:10

Okay, we have covered the geopolitical wall, but there are two other massive walls we are currently slamming into at full speed: physics and energy.

Speaker 3

35:18

Let us tackle physics first. Simply put, we are running out of atoms.

Speaker 2

35:22

The famous death of Moore's law.

Speaker 3

35:24

Right. Moore's Law for decades depended on our ability to just keep physically shrinking transistors so we could pack more of them onto the same sized chip. But we are currently manufacturing at the three nanometer and two nanometer scale. To give you perspective, a single silicon atom is roughly point two nanometers wide.

Speaker 2

35:42

So the physical wires inside these new chips are, what, ten or fifteen atoms across?

Speaker 3

35:48

Exactly. We're dealing with structures that are literally counted in dozens of atoms. And when you get down to that microscopic quantum scale, classical physics breaks down. Quantum mechanics completely takes over. Electrons stop behaving like predictable solid particles and they start behaving like waves.

Speaker 2

36:03

And what does an electron wave do inside a transistor?

Speaker 3

36:06

It ignores the walls. It does something called quantum tunneling. The electron literally teleports through the physical barrier that is supposed to hold it back. You get massive electrical leakage, you get uncontrollable heat. You physically, legally cannot shrink the silicon gate much further, because the laws of the universe won't let you.

Speaker 2

36:23

So how on earth do we keep making computers faster every year if we cannot make the microscopic parts any smaller?

Speaker 3

36:27

We completely change how we package them. This is called the chiplet revolution.

Speaker 2

36:32

Chiplets. So instead of printing one giant, monolithic chip, you use lots of little pieces.

Speaker 3

36:36

Exactly. Go back to that yield problem we talked about with Cerebras. If you try to print a giant, complex chip the size of a cracker, the mathematical odds of a microscopic dust particle or a defect landing somewhere on that large surface are extremely high. Maybe only forty percent of the chips on your wafer actually work. That makes them astronomically expensive to produce.

Speaker 2

36:59

But if you intentionally print very tiny, simple chips...

Speaker 3

37:03

The odds of a defect landing on a tiny footprint are very low. You might get a ninety or ninety five percent yield. So now companies like AMD and Intel are pivoting entirely. They are printing small, specialized functional tiles, a compute tile, a separate memory tile, an input/output tile, and they are stitching them closely together after the fact.

Speaker 2

37:20

It is exactly like building with Legos.

Speaker 3

37:23

It is just like Legos, but they use incredibly advanced packaging techniques, sometimes stacking them in 3D, so that electrically, and to the software, they act exactly like one single unified chip. This modular chiplet design is realistically the only way we can keep performance scaling up now that traditional transistor shrinking is effectively dead.
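
The yield argument behind chiplets can be sketched with the textbook Poisson defect model, where yield falls off exponentially with die area. This is a rough illustration under stated assumptions: the die area and the one-tenth tile size are hypothetical, and the defect density is simply back-solved so the large die lands on the forty percent figure mentioned above. Real fabs use more refined models.

```python
import math


def poisson_yield(defects_per_cm2, area_cm2):
    """Fraction of dies with zero defects under a simple Poisson model."""
    return math.exp(-defects_per_cm2 * area_cm2)


# Assumption: a hypothetical 8 cm^2 monolithic die that yields 40%.
BIG_AREA_CM2 = 8.0
DEFECT_DENSITY = -math.log(0.40) / BIG_AREA_CM2  # back-solved, not measured

big_die_yield = poisson_yield(DEFECT_DENSITY, BIG_AREA_CM2)

# Assumption: the same design split into tiles one tenth the area each.
tile_yield = poisson_yield(DEFECT_DENSITY, BIG_AREA_CM2 / 10)

print(f"monolithic die yield: {big_die_yield:.0%}")
print(f"single tile yield:    {tile_yield:.0%}")
```

With these assumed numbers the small tile yields roughly 91 percent, inside the ninety to ninety five percent range quoted in the conversation. A defect now scraps one small tile instead of the whole large die.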

Speaker 2

37:44

And then there is the energy wall. Honestly, this one feels the most tangible and immediate to me.

Speaker 3

37:49

It is the most immediate constraint on the entire AI industry right now. The power draw statistics are genuinely frightening. A single standard server rack full of Nvidia H one hundreds can draw forty to fifty kilowatts of continuous power.

Speaker 2

38:01

Just to ground that, compare that to a standard suburban house.

Speaker 3

38:05

An average American home might draw one to two kilowatts, so one single metal rack of AI servers is using the power of an entire neighborhood. A full AI data center is drawing the power of a medium-sized city.
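
The comparison is simple division, sketched below. The inputs are the figures from the conversation, reading the rack draw as forty to fifty kilowatts (an assumption, since the transcript is garbled at that point) and the home draw as one to two kilowatts. `homes_per_rack` is a hypothetical helper name.

```python
def homes_per_rack(rack_kw, home_kw):
    """How many average homes draw the same continuous power as one rack."""
    return rack_kw / home_kw


RACK_KW = (40.0, 50.0)  # assumed reading of the per-rack figure
HOME_KW = (1.0, 2.0)    # average continuous draw of one home, as stated

# Low end: smallest rack against hungriest home. High end: the reverse.
FEWEST_HOMES = homes_per_rack(min(RACK_KW), max(HOME_KW))  # 20.0
MOST_HOMES = homes_per_rack(max(RACK_KW), min(HOME_KW))    # 50.0
```

So under these figures one rack matches somewhere between twenty and fifty homes, which is what "an entire neighborhood" amounts to here.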

Speaker 2

38:17

I have heard these wild rumors that big tech companies are literally looking into buying nuclear power plants just for AI.

Speaker 3

38:23

I can assure you it is not a rumor. Microsoft is actively hiring directors of nuclear strategy right now. Amazon just bought a data center campus in Pennsylvania that is physically plugged directly into an existing nuclear power plant. They're actively lobbying for SMRs, small modular reactors, because our aging national power grids simply cannot support the projected AI energy demand. We're talking about future training runs that will require gigawatt-scale power.

Speaker 2

38:52

Gigawatts. We are going to have to construct dedicated nuclear power plants just to train GPT six.

Speaker 3

38:58

Literally, yes. If we do not fundamentally improve the energy efficiency of these chips, the entire AI revolution will just stall out, because we will physically trip the breakers of the electric grid.

Speaker 2

39:07

That puts an unimaginable premium on completely new ideas. If traditional silicon is hitting a quantum wall and energy consumption is hitting an absolute ceiling, we clearly need new physics. What does the sci fi future of computing actually look like?

Speaker 3

39:21

There are three major alternative computing paths being researched right now that get me incredibly excited. The first is photonic computing.

Speaker 2

39:27

Computing with actual light instead of electricity.

Speaker 3

39:31

Right. Right now, to do math, we forcefully push electrons through solid copper wires. That physical friction generates heat, it creates electrical resistance. But photons, actual particles of light, have zero mass and experience zero electrical resistance. Innovative startups like Lightmatter are currently building chips that use microscopic devices called interferometers.

Speaker 2

39:52

How does an interferometer actually calculate anything?

Speaker 3

39:55

You take a single laser beam of light and you split it down two different tiny optic channels. By precisely controlling the phase of those light waves, basically shifting how the peaks and valleys of the waves line up when they recombine, you can actually perform complex matrix multiplication natively in the light itself. You are doing advanced AI math with pure light beams moving at the literal speed of light while generating almost zero heat.
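
The split, phase-shift, and recombine step can be sketched as an idealized Mach-Zehnder interferometer. This is a textbook model under stated assumptions (lossless 50/50 splitters, a single phase shifter on one arm, light entering on one port), not a description of any vendor's chip, and `mzi_output_powers` is a hypothetical name. It shows how one phase setting steers the light between two outputs, which is the tunable weight that larger meshes of interferometers build matrices from.

```python
import cmath
import math


def mzi_output_powers(theta):
    """Share of the input light leaving each output port of an ideal
    Mach-Zehnder interferometer with phase shift theta on the upper arm."""
    s = 1 / math.sqrt(2)
    # First 50/50 splitter: unit-amplitude light enters on port 0 only.
    upper, lower = s * 1, s * 1j
    # Phase shifter delays the upper arm by theta radians.
    upper *= cmath.exp(1j * theta)
    # Second 50/50 splitter recombines the arms so they interfere.
    out0 = s * (upper + 1j * lower)
    out1 = s * (1j * upper + lower)
    return abs(out0) ** 2, abs(out1) ** 2


# Sweep the phase: all light on one port, an even split, all on the other.
for theta in (0.0, math.pi / 2, math.pi):
    p0, p1 = mzi_output_powers(theta)
    print(f"theta={theta:.2f}  port0={p0:.2f}  port1={p1:.2f}")
```

In this model the two output powers always sum to one and follow sin²(θ/2) and cos²(θ/2), so the phase acts as a continuously adjustable multiplier.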

Speaker 2

40:19

That sounds like actual magic. If it is that fast and cold, why aren't we ripping out our GPUs and doing it right now?

Speaker 3

40:25

Because controlling microscopic light waves precisely on a tiny silicon chip is outrageously difficult, and the big catch is you still have to convert the light signals back into slow electricity anytime you want to store the data in standard memory. But as a coprocessor just for doing the heavy matrix math, it is a massive potential game changer.

Speaker 2

40:44

Okay, so light is path number one. What is path number two? Neuromorphic computing. Neuro, meaning modeling it after the biological brain.

Speaker 3

40:52

The human brain is, without a doubt, the most efficient computer in the known universe. Your brain runs on about twenty watts of power, basically an incandescent light bulb, and with those twenty watts it manages complex continuous learning, real-time stereoscopic vision, complex language processing, and fine motor control, all simultaneously.

Speaker 2

41:10

And meanwhile, an AI trying to do just one of those things needs its own personal nuclear reactor. What exactly is the biological brain doing so differently?

Speaker 3

41:19

It uses an event based architecture. In a normal digital computer chip, there is a global clock that ticks billions of times a second, and with every single tick, every single transistor on the chip gets flooded with power, even if it has absolutely zero useful work to do. In the biological brain, neurons stay completely dark and only consume energy when they actively need to fire. We call these spikes.

Speaker 2

41:41

So the brain is just incredibly lazy in a highly optimized good way.

Speaker 3

41:45

Efficiently lazy, yes. Neuromorphic chips like Intel's Loihi project or IBM's NorthPole are attempting to physically mimic this spiking neural network architecture in silicon. If the incoming data isn't actively changing, the relevant parts of the chip simply shut down and consume zero power. It is wildly efficient.
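
The clocked versus event-based contrast can be sketched by counting how often each style does work on the same input. This is a toy illustration under stated assumptions, not a spiking neural network and not how any particular neuromorphic chip is programmed. The function names and the sample `SIGNAL` are hypothetical.

```python
def clocked_updates(signal):
    """A clocked design does work on every tick, changed input or not."""
    return len(signal)


def event_driven_updates(signal):
    """An event-driven design does work only when the input changes."""
    updates = 0
    previous = None
    for value in signal:
        if value != previous:  # a change is the "event" that wakes the unit
            updates += 1
            previous = value
    return updates


# Hypothetical sensor trace that is mostly static, with one burst of activity.
SIGNAL = [0, 0, 0, 0, 5, 5, 5, 5, 5, 0, 0, 0]

CLOCKED = clocked_updates(SIGNAL)           # 12 ticks, 12 updates
EVENT_DRIVEN = event_driven_updates(SIGNAL)  # 3 changes, 3 updates
```

On this trace the event-driven count is a quarter of the clocked count, and the gap widens the longer the input sits still.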

Speaker 2

42:03

And what is the third alternative path?

Speaker 3

42:05

Analog computing.

Speaker 2

42:06

Analog, like going backwards to the nineteen fifties?

Speaker 3

42:09

Sort of, yes. Modern computing is purely digital. It uses strict ones and zeros. It is perfectly precise, but it takes a massive amount of physical transistors just to represent a long decimal number like zero point three four five two. In an analog system, you can represent that exact same complex number as a single continuous voltage level, just a fluid electrical wave.

Speaker 2

42:32

So you can actually do the math natively inside the continuous wave itself, rather than breaking it down into digital bits.

Speaker 3

42:39

Exactly. You can perform massive matrix multiplications instantly, simply by passing different electrical currents through a grid of physical resistors. It is incredibly fast and saves a massive amount of energy. The main downside, however, is noise. Analog systems are affected by temperature variations and slight manufacturing defects. They aren't perfectly precise. The math gets a little fuzzy.
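
The resistor-grid idea can be sketched as a simulated crossbar: each input is a voltage on a row, each weight is a conductance at a crossing, and the current collected on each column is the sum of voltage times conductance (Ohm's law plus Kirchhoff's current law). This is a software model under stated assumptions. The values in `V` and `G`, the one percent Gaussian noise level, and the function name are all hypothetical.

```python
import random


def crossbar_currents(voltages, conductances, noise=0.0, rng=None):
    """Column currents of an idealized resistive crossbar.

    Each column current is sum(V_row * G[row][col]), i.e. a
    matrix-vector product. `noise` is the relative standard deviation
    applied to every conductance to mimic analog imprecision.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    currents = []
    for col in range(len(conductances[0])):
        total = 0.0
        for row, v in enumerate(voltages):
            g = conductances[row][col] * (1.0 + rng.gauss(0.0, noise))
            total += v * g
        currents.append(total)
    return currents


V = [1.0, 0.5]                 # input voltages (hypothetical)
G = [[0.2, 0.4], [0.6, 0.8]]   # conductances acting as weights (hypothetical)

IDEAL = crossbar_currents(V, G)              # noise-free result
NOISY = crossbar_currents(V, G, noise=0.01)  # about 1% device variation
```

The noise-free result is the exact matrix-vector product, and the noisy one lands close to it without matching digit for digit, which is the "little fuzzy" trade-off being described.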

Speaker 2

42:59

But with AI neural networks, isn't a little fuzzy actually okay? They are basically probability engines anyway.

Speaker 3

43:04

That is the exact billion-dollar bet these analog startups are making. A neural network doesn't always need mathematically perfect precision to tell you it's looking at a picture of a cat. It just needs to be close enough. If we can accept just a tiny bit of analog noise in exchange for a one thousand x gain in energy efficiency, the entire industry will gladly make that trade. And ultimately that ties into the final evolution of all this, which is deep software-hardware co-design.

Speaker 2

43:31

Where the code and the silicon are actually designed together from day one.

Speaker 3

43:34

Exactly. We are moving away from writing general software for general chips. Future architectures will feature algorithms explicitly co-designed alongside the custom physical hardware, squeezing absolutely every last drop of performance out of the silicon.

Speaker 2

43:48

Wow. So, just stepping back for a second, we have gone from two kids buying a five hundred dollar GTX five eighty in a Best Buy to play video games, all the way to massive H one hundred data centers. We've gone through Google's secret TPU panic, through an escalating global trade war over tiny EUV machines, and now we are talking about computers running on actual light beams and analog waves. It has been quite a journey.

Speaker 3

44:11

It really has been, and I think the major synthesis here, the big takeaway, is crucial. We started this entire conversation by observing that software code used to be the undisputed king of tech, but today raw access to this unbelievable level of compute is all that really matters. And unfortunately that access is heavily centralizing.

Speaker 2

44:30

Right, because you and I cannot just decide to build a ten-thousand-chip H one hundred cluster in our garage this weekend. This kind of power is exclusively concentrating in the hands of the massive hyperscalers and a select few sovereign nations.

Speaker 3

44:44

Exactly. The barrier to entry has moved from needing a clever idea and a laptop to needing billions of dollars in capital and your own dedicated power plant.

Speaker 2

44:52

So, as we wrap up, what does this all mean for the listener? We always like to leave everyone with a provocative thought to chew on after the deep dive.

Speaker 3

45:00

I would say this: the compute race that we've been discussing is not just a standard market competition between a few tech giants. It is quite literally a race to harness the fundamental laws of physics. We are currently pushing atomic structures and global energy grids to their absolute breaking point, all in an attempt to build synthetic minds.

Speaker 2

45:20

And whoever gets there first, whoever wins that race, gets all the spoils.

Speaker 3

45:24

Because whoever builds the fastest, most scalable, and most efficient hardware doesn't just win in a temporary commercial sector. They gain the power to permanently decide what artificial intelligence actually looks like for the rest of human history. They will fully control the fundamental infrastructure of thought for the twenty first century. Those are the true stakes of what is being painstakingly etched onto those tiny silicon wafers in Taiwan.

Speaker 2

45:48

The infrastructure of thought. That is a wonderfully heavy concept to end on. Thank you so much for breaking all this down. We will be back next time with another deep dive. Thanks for listening.

Speaker 3

45:56

Thanks for having me.

Transcript source: Provided by creator in RSS feed