AI Computing Hardware - Past, Present, and Future

Jan 29, 2025 · 2 hr 4 min · Ep. 237

Episode description

A special one-off episode with a deep dive into the past, present, and future of how computer hardware makes AI possible.

Join our brand new Discord here! https://discord.gg/nTyezGSKwP

Read out our text newsletter and comment on the podcast at https://lastweekin.ai/.

Sponsors:

  • The Generator - An interdisciplinary AI lab empowering innovators from all fields to bring visionary ideas to life by harnessing the capabilities of artificial intelligence.

In this episode:

  • Google and Mistral sign deals with AP and AFP, respectively, to deliver up-to-date news through their AI platforms.
  • ChatGPT introduces a tasks feature for reminders and to-dos, positioning itself more as a personal assistant.
  • Synthesia raises $180 million to enhance its AI video platform for generating videos of human avatars.
  • New U.S. guidelines restrict exporting AI chips to various countries, impacting Nvidia and other tech firms.

If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form.

Timestamps:

  • 00:00:00 Introduction
  • 00:03:08 Historical Recap: Early AI and Hardware
  • 00:11:51 The Rise of GPUs and Deep Learning
  • 00:15:39 Scaling Laws and the Evolution of AI Models
  • 00:24:05 The Bitter Lesson and the Future of AI Compute
  • 00:25:58 Moore's Law and Huang's Law
  • 00:30:12 Memory and Logic in AI Hardware
  • 00:34:53 Challenges in AI Hardware: The Memory Wall
  • 00:37:08 The Role of GPUs in Modern AI
  • 00:42:27 Fitting Neural Nets in GPUs
  • 00:48:04 Batch Sizes and GPU Utilization
  • 00:52:47 Parallelism in AI Models
  • 00:55:53 Matrix Multiplications and GPUs
  • 00:59:57 Understanding B200 and GB200
  • 01:05:41 Data Center Hierarchy
  • 01:13:42 High Bandwidth Memory (HBM)
  • 01:16:45 Fabrication and Packaging
  • 01:20:17 The Complexity of Semiconductor Fabrication
  • 01:24:34 Understanding Process Nodes
  • 01:28:26 The Art of Fabrication
  • 01:33:17 The Role of Yield in Fabrication
  • 01:35:47 The Photolithography Process
  • 01:40:38 Deep Ultraviolet Lithography (DUV)
  • 01:43:58 Extreme Ultraviolet Lithography (EUV)
  • 01:51:46 Export Controls and Their Impact
  • 01:54:22 The Rise of Custom AI Hardware
  • 02:00:10 The Future of AI and Hardware

Transcript

lwiai_hardware_ep-001: Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. Unlike usual, in this episode we will not summarize or discuss some of last week's most interesting AI news. Instead, this is our long-promised episode on hardware. We'll get into a lot of detail, basically do a

deep dive unrelated to any AI news, but I guess related to the general trends we've seen this past year, with a lot of developments in hardware and crazy investments in data centers. So to recap, I am one of your hosts, Andrey Kurenkov. I studied AI and I now work at a startup.

Yeah, I'm Jeremy Harris. I'm the co-founder of Gladstone AI, an AI national security company. And I guess just by way of context on my end too, on the hardware piece: the work that we do is focused on the kind of WMD-level risks that come from advanced AI, current and increasingly future systems. So my footprint on this is I look at AI a lot through the lens of hardware, because we're so focused on things like export controls.

How do we prevent China, for example, from getting their hands on this stuff? One of the things we've been looking into recently: what kinds of attacks can people execute against highly secure data centers in the West, whether that's to exfiltrate models, whether that's to change the behavior strategically of models that are being trained, whether that's just to blow up facilities.

So a lot of our work is done these days with, you know, special forces and folks in the intelligence community, as well as increasingly some data center companies, to figure out how do you secure these sites? And obviously all the U.S. government work that we've been doing historically. So that's kind of my lens on it, and obviously the alignment stuff and all that jazz.

So I guess I, I know enough to be dangerous on the AI and compute side, but I'm not a PhD in, in AI and compute, right? My, my specialization is I know what I need to know for the security piece. And so.

To the extent possible, we'll try to flag some resources and maybe people for you to check out if you're interested in doing those deeper dives on some of the other facets of this, especially compute that doesn't have to do with AI, compute that's not national security related. So hopefully that's useful for you. I guess worth flagging too:

On my end, I studied software and AI, I trained algorithms, so I actually have relatively little understanding of how the hardware works. I just use GPUs and kind of broadly know what I do. But, you know, I'll be here listening and learning from Jeremy as well. I'm sure it'll go both ways. I mean, I'm excited for this anyway. Yeah, I think there's a lot of opportunity here for us to cross-pollinate. Let's just get into it.

So I thought to begin, before we dive into the details of what's going on today, we can do like a quick historical recap of fun details in the past of AI and hardware. there's some interesting details there, AI and hardware go back to basically the beginning, right? Turing was a super influential person within the world of computing. And then Turing game, right, is his invention to try and, I guess, measure when we'll get AI or AGI, as you might say, and that's still widely discussed today.

So even before we had actual computers that were general purpose, people were thinking about it. by the way, that imitation game piece, in a way, it's freakish how far back it goes. I've never read Dune, but I know there's a reference in there to the Butlerian Jihad, right? And so, so Butler back in the like, it was 1860s, or I'm showing off how, how little I know my dates here.

But he was the first to observe that, you know, hey, these machines seem to be popping up all around us. We're industrializing, we're building these things. What if one day we start building machines that can help us build other machines — eventually, will they need us? It wasn't with respect to computers or anything like that, but it's sort of an interesting thing.

Like when you look back at how incredibly prescient some people were about this sort of thing. Anyway, sorry, I didn't mean to derail, but you're making a great point here that it goes way, way before the days of, you know, the early 2000s, people starting to worry about loss of control. Yeah. Wow. You also reminded me that it's called the Imitation Game. The Turing Game is not a thing.

There's the Turing test, which was originally published as the Imitation Game. Anyways, so yeah, it was conceptually, of course, on people's minds for a very long time — the concept of AI, of robotics, et cetera. But even as we go into the fifties and get into actual computing, still with vacuum tubes, not even getting to semiconductors yet, there are the beginnings of AI as a field in that time.

So one of the very early initiatives that could be considered AI was a little program that played checkers. You could go as early as 1951, when someone wrote a program to do it. Then, yeah, there are a couple of examples of things like that in that decade that showcased the very first AI programs. So there was a program from Marvin Minsky actually called the Stochastic Neural Analog Reinforcement Calculator.

I actually just learned about this in doing prep for the show, and I found it quite interesting. This was actually a little newer than that — a machine that Marvin Minsky built in hardware, and it simulated rats learning in a little maze, trying to simulate reinforcement learning, as there were also theories coming out about human learning, brain learning, et cetera. And to give you some context, there were maybe 40 neurons — I forget, some small number. Each neuron had six vacuum tubes and a motor.

And the entire machine was the size of a grand piano, with 300 vacuum tubes. So you had that early example of a custom-built computer for this application. That's actually one thing, too, right? In the history of computing, everything was so custom for so long. That's something that's easy to lose sight of — the idea even of building these very scalable modules of computing, having ways to integrate all these things together.

That wasn't until really Intel came into the game. That was their big thing at first, as I recall. The thing that broke Intel in was like, hey, we'll just come up with something that's not bespoke, so it won't be as good at a specific application, but boy, can it scale. All the time before that, you have all these, like you said, ridiculously bespoke kinds of things. So it's almost more physics, in a sense, than computer science, if that makes sense. Yeah, exactly.

Yeah. It was a lot of people pulling together and building little machines, right, to demonstrate really theories about AI. There is a fun other example I found with the famous IBM 701 and 702. IBM was just starting to build these massive mainframes that were kind of the main paradigm for computing for a little while, especially in business.

So the IBM 701 was the first commercial scientific computer, and there is Arthur Samuel, who wrote a checkers program for it, and it was maybe one of the first, definitely learning, programs that was demonstrated. So it had very primitive machine learning built into it. It had memorization as one idea, but then also some learning from experience. And that's one of the very first demonstrations of something like machine learning.

Then famously, there's also the perceptron, which goes back to 1958, 1959. And that is sort of the first real demonstration, I would say, of the idea of neural nets, famously by Frank Rosenblatt — again, a custom-built machine at that point. If you look, there are photos of it online; it looks like this crazy tangle of wires that built a tiny neural net that could learn to differentiate shapes. And at the time, Rosenblatt and others were very excited about it.

And then of course, a decade later, kind of the excitement died out for a little while. And then there's some interesting history we won't be getting into later, in the 80s, with custom-built hardware. There was custom hardware for expert systems that was being sold and bought for a little while. There was this thing called Lisp machines, where Lisp was a pretty major language in AI for quite a while; it was developed kind of to write AI programs.

And then there were custom machines called Lisp machines that were utilized by, I guess, scientists and researchers doing this research going into the seventies and eighties, when there was a lot of research in the realm of logical AI and search and so on — symbolic AI. Then again, continuing with the quick recap of the history of AI and computing, we get into the eighties, nineties. The Lisp machines, the expert-system hardware, died out.

This is where sort of, as you said, I guess this was the beginning of general purpose computing proper with Intel and Apple and all these other players, making hardware that doesn't have to be these massive mainframes that you could actually buy more easily and distribute more easily. And so there's kind of fewer examples of hardware details aside from what will become Deep Blue in the late 90s. IBM was working on this massive computer specially for playing chess.

And I think a lot of people might not know this about Deep Blue: it wasn't just a program. It was like a massive investment in hardware so that it could do these ridiculously long searches. It was really not a learning algorithm, to my knowledge; basically it was doing the well-known search-with-heuristics approach to chess, with some hard-coded evaluation schemes. But to actually win at chess, the route there was to build some crazy hardware specialized for playing chess.

And that was how we got that demonstration, without any machine learning of the sort we have today. And let's finish off the historical recap. So of course we had Moore's law all throughout this; computing was getting more and more powerful. So we saw research into neural nets making a comeback in the eighties and nineties, but I believe at that point people were still using CPUs and trying to train these neural nets without any sort of parallel computing, as is the common paradigm today.

Parallel computing came into the picture with GPUs, graphics processing units that were needed to do 3D graphics, right? And so there was a lot of work starting in around the late nineties and then going into 2000s. That's how NVIDIA came to be by building these graphics processing units that were in large part for the gaming market. And then kind of throughout the 2000s Before 2010s, a few groups were finding that you could then use these GPUs for scientific applications.

You could solve, for instance, general linear algebra problems. And so this was before the idea of using them for neural nets, but it kind of bubbled up to a point that by, I think, 2009, there was some work by Andrew Ng applying it. There was the rise of CUDA, where you could actually program these NVIDIA GPUs for whatever application you want. And then of course, famously in 2012, there was the AlexNet

paper, where we had the AlexNet neural net, one of the first deep neural nets that was published, and it destroyed the other algorithms being used at the time on the ImageNet benchmark. And to do that, one of the major novelties of the paper, and why it succeeded, was that they were among the first to use GPUs to train this big network — they probably couldn't have otherwise.

They used two NVIDIA GPUs to do this, and they had to do a whole bunch of custom programming to even be able to do that. That was one of the major contributions of the students. And that was kind of when, I think, NVIDIA started to move more in the GPUs-for-AI direction; they were already going deeper into it. They wrote cuDNN — C-U-D-N-N. Yeah, yeah. And they were starting to specialize their hardware in various ways.

They started creating architectures that were better for AI, you know, the Kepler architecture, Pascal, et cetera. So again, for some historical background, maybe people don't realize that way before GPT, way before ChatGPT, the demonstrations of deep learning in the early 2010s were already kind of accelerating the trend towards investment in GPUs, towards building data centers.

Definitely by the mid-2010s, it was very clear that you would need deep learning for a lot of stuff, for things like translation, and Google was already making big, big investments in it, right? Buying DeepMind, expanding Google Brain. And of course investing in TPUs in the mid-2010s — they developed the first customized AI hardware, to my knowledge, the first custom AI chip. And so throughout the 2010s, AI was already on the rise.

Everyone was already of the mindset that bigger is better — you want bigger neural nets, bigger data sets, all of that. But then of course OpenAI realized that that should be cranked up to 11. You shouldn't just have 10 million, a hundred million parameter models. You've got to have billion-parameter models. And that was their first — well, they had many innovations, but their breakthrough was in really embracing scaling in a way that no one had before.

So I think one of the things, too, that's worth noting there is this rough intuition — and you can hear pioneers like Geoff Hinton and Andrew Ng talk about the general sense that more data is better, larger models are better, all this stuff. But what really comes with the Kaplan paper, right, that famous Scaling Laws for Neural Language Models paper, is the proof point that was GPT-3 — and GPT-2 in fairness as well, and GPT-1.

But what really comes from the GPT-3 inflection point is the actual scaling laws, right? For the first time, we can start to project with confidence how good a model will be, and that makes it an awful lot easier to spend more CapEx. All of a sudden, it's a million times easier to reach out to your CTO, your CEO, and say, hey, we need a hundred million dollars to build this massive compute cluster, because look at these straight lines on these log plots, right?
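
To make the "straight lines on log plots" point concrete, here's a minimal sketch of that kind of extrapolation, assuming a Kaplan-style power law where loss falls as a power of compute; the measurements and the fitted numbers below are invented purely for illustration, not taken from any real training run.

```python
import numpy as np

# Hypothetical (compute, loss) measurements from small training runs.
# These numbers are invented purely for illustration.
compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs
loss    = np.array([3.90, 3.30, 2.80, 2.38])   # final test loss

# Kaplan-style power law: loss ~ a * compute^(-b), which is a straight
# line in log-log space, so an ordinary least-squares fit works.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)

def predict_loss(c):
    return 10 ** (intercept + slope * np.log10(c))

# Extrapolate to a much larger (and much more expensive) training run.
print(predict_loss(1e23))  # projected loss, if the straight line holds
```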

So it kind of changed the economics because it decreased the risk associated with scaling. That's right. And I think the story of OpenAI in hindsight can almost be seen as the search for the thing that scales, right? Because for the first couple of years, they were focusing on reinforcement learning.

Some of their major kind of PR stories, you could say, but also papers, were about reinforcement learning for the video game Dota. And even at the time they were using a lot of compute, really spending a lot of money training programs, but in a way that didn't scale, because reinforcement learning is very hard and you can't simulate the world very well.

They also were investing in robotics a lot — they had this whole arm and they did a lot of robotic simulations — but again, it's hard to simulate things, so that wouldn't scale. Evolutionary algorithms were another thread, right? Yeah, they did a whole bunch of things, right, from 2015 up through 2018. And then 2017 was the Transformers paper, of course. And then around 2018, the whole idea of pre-training for natural language processing arose.

So from a very beginning, or okay, not very beginning, but pretty soon after AlexNet and around 2014, people realized that if you train a deep convolutional neural net on classification, you could then use those embeddings in a general way. So the kind of intelligence there was reusable for all sorts of vision applications, and you can basically bootstrap training from a bunch of weights that you already trained.

You don't need to start from scratch, and you don't even need as much data for your task. This didn't happen in natural language processing until around 2017, 2018. That was when language modeling was seen, or found out by a few initiatives to be, a very promising way to pre-train weights for natural language processing. BERT is one of the famous examples from around that time. And so the first GPT was developed in that context.

It was one of the first big investments in pre training a transformer on the task of language modeling. And then OpenAI, I guess it's, we don't know the exact details, but it seems like they probably were talking internally and got the idea that, well, you know, this task, you can just scrape the internet to get all the data you want. So the only question is how big can you make the transformer?

The transformer is a great architecture for scaling up because you can parallelize it on GPUs, unlike RNNs, so that was kind of necessary in a way. And yeah, then we got GPT-2 in 2019. That was like a 1.5 billion parameter model, by far the biggest that anyone had trained up to that point.

And even at the time, it was interesting because you had these early demos, like on the blog where it wrote a couple paragraphs about that unicorn story or whatever. Already at that time, there was discussion of the safety implications of GPT-2 and misinformation and so on. That was the anomaly by then, right? Because they'd open sourced GPT-2 — well, GPT-1 — and they had set this precedent of always open sourcing their models.

Hence the name, actually, OpenAI. GPT-2 was the first time they experimented with what they at the time called this staged release strategy, right? They would release incrementally larger versions of GPT-2 over time and monitor how, supposedly, they were seeing them get maliciously used — which, it was always implausible to me that you'd be able to tell if it was being used maliciously on the internet when it's an open source model. But okay.

Then ultimately, yeah, GPT-3 was closed. So yeah, they followed, as you say, that kind of smooth progression. Yeah, speaking of that lead-up to GPT-2, there's also what we know now from looking at the emails in the OpenAI versus Elon Musk case. It was never the plan. Yeah. Some of the details there are that in the conversations in 2018 about why they started to go for-profit, they did have the general belief that hardware was crucial, that Google had all the hardware.

And so Google would be the one to get to AGI. And so they needed the money to get more hardware, to invest more in training. And that's what kicked off all these for-profit discussions in 2018 and led eventually to Sam Altman somehow securing 10 billion from Microsoft. I forget when this was announced. Maybe 2019 — I think there was an initial 1 billion investment that was around 2019, and then there was the roughly 10 billion later, around 2023, something like that. Yeah, that sounds right.

It sounds like 1 billion is more reasonable. So yeah, I think OpenAI was one of the first to really embrace the idea that you need what we now know as massive data centers, and crazily parallelized training for crazily large neural nets. And they were already going down that route with the Dota agent, for instance, where they were training in very large clusters. And even at that time, it was very challenging.

Anyways, GPT-3. We get to a 175 billion parameter model, we get scaling laws, and we get in-context learning. By that point it had become clear that you could scale and you could get to very powerful language models, and the whole idea of in-context learning was kind of mind-blowing. Somehow, everyone was still not convinced enough to invest. Looking back, it's kind of interesting that Meta and Google and so on weren't training massive language models.

I think internally Google was, to some extent, but they were not trying to commercialize it. They were not trying to push forward. And then of course you had ChatGPT in 2022, with GPT-3.5 I think at the time. That blew up, and now everyone cares about massive neural nets, massive language models, and everyone wants massive data centers and is fighting over the electricity needed to fuel them. Elon Musk is, you know, buying a hundred thousand GPUs, and hardware is

like a huge, huge part of the story, clearly. Yeah, and the story of hardware is, in a sense — I mean, we are talking about the story of the physical infrastructure that very plausibly will lead to superintelligence in our lifetime. So I think there almost isn't anything more important that's physical to study and understand in the world. We're also lucky because it's a fascinating story.

Like we're not just talking about egos and, you know, billionaires' dollars chasing after this stuff. It's fascinating at a scientific level, it's fascinating at a business level. Every layer of the stack is fascinating. And that's one of the reasons I'm so excited about this episode. But you framed up really nicely, right, what this current moment is: we have the sense that scaling — in the form of scaling compute, scaling data, and scaling model size, which is relatively easier to do —

is king, right? So, the bitter lesson, right — the Rich Sutton argument that came out right before Scaling Laws for Neural Language Models, in like the 2019 era — says basically, hey, you know, all these fancy AI researchers running around coming up with fancy new architectures and thinking that's how we're going to make AGI — unfortunately, I know you want that, but human cleverness just isn't the factor we would have hoped it was. It's so sad, you know.

That's why it's the bitter lesson. Instead, what you ought to do, really, this is the the core of the bitter lesson, is get out of the way. Of your models, just let them be, let them scale, just take a dumb ass model and scale it with tons of compute and you're going to get something really impressive. And he was alluding in part to the successes of sort of early success of language modeling, also reinforcement learning. So it wasn't clear what the architecture was that would do this very soon.

Very soon, it would turn out to clearly be the transformer — but, you know, you can improve on that. Really, the way to think about models, or architectures, is that they're just a particular kind of funnel: you pour compute in at the top, and the funnel shapes it in the direction of intelligence. They're just your funnel. They're not the most important part of it. There are many different shapes of funnel that will do, many different aperture widths, and all that stuff.

And you know, if your funnel's kind of stupid, well, just wait until compute gets slashed in cost by 50 percent next year or the year after, and your same stupid architecture is going to work just fine, right? So there's this notion that even if we are very stupid at the model architecture level, as long as we have an architecture that can take advantage of what our hardware offers, we're going to get there, right? That's the fundamental idea here.

And what this means at a very deep level is that the future of AI is deeply and inextricably linked to the future of compute. And the future of compute starts having us ask questions about Moore's law, right? Like this fundamental idea — which, by the way, going historical just for a brief second here to frame this up: this was back in 1965. Moore basically comes up with this observation.

You know, he's not saying it's a physical law. It's just an observation about how the business world functions, and at least the interaction between business and science that we seem to see. He says at the time that the number of components, the number of transistors, that you can put on an integrated circuit, on a chip, seems to double every year. That was his claim at the time. Now, we know that that number actually isn't quite doubling every year.

Moore, in fact, in 1975, came back and he updated his time frame. He said, nah, it's not every year — it doubles every two years. And then there was a bunch of argument back and forth about whether it should be 18 months. The details don't really matter. The bottom line is you have this stable, reliable exponential increase — doubling every 18 months or so — in the number of

computing components, transistors, that you can put on your chip. And that means you can get more for less, right? Your same chip can do more intelligent work. Okay, that's basically the fundamental trend that we're going to ride all through the years. And it's going to take different forms, and you'll hear people talk about how Moore's law is dead and all that stuff. None of that is correct, but it's incorrect for interesting reasons. And that's going to be part of what we'll have to talk about in this episode.

And that's really the kind of landscape that we're in today. What is the jiggery-pokery? What are the games that we're playing today to try to keep Moore's Law going? And how has Moore's Law changed? We're in a world where we're specifically interested in AI chips, because now we're seeing a specific Moore's-law-for-AI trend.

That's different from the historical Moore's law that we've seen for integrated circuits over the decades and decades that, you know, made Moore famous for making this prediction. And on that point, actually, this I think is not a term that's been generally utilized, but it has been written about, and NVIDIA actually called it out: there is now the idea of Huang's law.

Where the trend in GPUs has been very much in line with Moore's law, even faster. You start seeing, again in the early 2010s, the start of the idea of using them for AI, and then the growth of AI almost goes hand in hand with the improvements in the power of GPUs. And in particular over the last few years, you just see an explosion in the power, the cost, the size of the GPUs being developed.

Once you get to the H100, it's like a thousand-fold, some big, big number, compared to what you had just a decade prior — probably more than a thousand. So yeah, there's the idea of Huang's law, where the architecture and, I guess, the development of parallel computing in particular has this exponential trend.

So even if the particulars of Moore's law — which is about the density you can achieve at the nanoscale of semiconductors — even if that might be saturating due to inherent physics, the architecture and the way you utilize the chips in parallelized computing hasn't slowed down, at least so far. And that is a big part of why we are where we are. Absolutely. And, in fact, that is a great segue into peeling the onion back one more layer, right?

So we have this general notion of Moore's law, and now Andrey is like, but there's also Huang's law. So how do you get from, you know, 2x every 18 months or so to, all of a sudden, something closer to like 4x every couple of years — depending on the metric you're tracking? And this is where we have to talk about what a chip is actually doing. What are the core functions of a chip that really performs any kind of task?

And the two core pieces that I think were worth focusing on today, because they're especially relevant for AI. Number one, you have memory. You've got to be able to store the data that you're working on. And then number two, you have logic. You got to have the ability to do shit to those bits and bytes that you're storing, right? Kind of makes sense. Put those two things together. You have a full problem solving machine. You have the ability to store information.

You have the ability to do stuff to that information, carry out mathematical operations, right? So memory storage and logic. Sort of the, yeah, the logic, the reasoning, or not the reasoning, the number, the math, the number. And so when we actually kind of tease these apart, it turns out, especially today, it's very, very different. It's a very, very different process, very, very different skill set that's required to make logic versus to make memory.

And there are a whole bunch of reasons for that, which have to do with the kind of architecture that goes into making logic cells versus memory cells and all that stuff — we can get into that later if it makes sense. For now, though, I think the important thing to flag is that logic and memory are challenging to make for different reasons, and they improve at different rates. So if you look at logic improvements over the years, right, the ability to just pump out FLOPs, right —

floating point operations per second, how quickly this chip can crunch numbers — there you see very rapid improvements. And part of the reason for that, a big part, is that if you're a fab that's building logic, you get to focus on basically just one top-line metric that matters to you, and that's generally transistor density. In other words, how many of these compute components, how many transistors, can you stuff onto a chip? That's your main metric.

You care about other things like power consumption and heat dissipation, but those are pretty secondary constraints. You've got this one clean focus area. In the meantime, if you care about memory, now you have to worry about not one key kind of KPI, you're worried about basically three main things. First off, how much can my memory hold? What is the capacity of my memory? Second, how quickly can I pull stuff from memory, which is called latency.

So basically you can imagine, right, you have a like a bucket of memory and you're like, I want to retrieve some, some bits from that memory. How long am I going to have to wait until they're available to me to do math on them? Right? That's the latency. So we have capacity, how much can the bucket hold, latency, how long does it take to get shit from the bucket, and then there's bandwidth. How much stuff can I pull from that memory at any one time?

And so if you're optimizing for memory, like you have to optimize these three things at the same time. You're not focused exclusively on one metric, and that dilutes your focus. Something's got to give and that thing is usually latency. So usually when you see memory improvements, latency hasn't really gotten much better over the years. Capacity and bandwidth have, they've gotten a lot better, a lot faster. Right.

So you can sort of start to imagine, depending on the problem you're trying to solve, you may want to optimize for really high capacity, really high bandwidth, really low latency — which is often more the case in AI — or some other combination of those things. So already we've got the elements of chip design starting to form a little bit, where we're thinking about, you know, what's the balance of these things that we want to strike.

And historically, one of the challenges that's come up from this, right, is latency. That's the thing that's tended to be kind of crappy, because when it comes to memory, people are focused on capacity and bandwidth, right? How much can I pull at once, and how big is my bucket of memory? So latency kind of sucks — it's been improving really slowly — while our logic has been improving really fast, right?

We're able to stuff a whole bunch of transistors on a chip. So what tends to happen is there's this growing disparity between your logic capability — like how fast you can number crunch on your chip — and how quickly you can pull in fresh data to do new computations on.

And so you can kind of imagine the logic part of your chip: it's just crunched all the numbers, crunched all the numbers, and then it's just sitting there twiddling its thumbs while it waits for more memory to be fetched so it can solve the next problem. And that disparity, that gap, is basically downtime, and it's become an increasing problem because, again, transistor density — logic — has been improving crazy fast, but memory latency has not; it's been improving much more slowly.

And so you've got this crazy high capacity to crunch numbers, but this relatively long delay between subsequent rounds of memory inputs. And this is what's known as the memory wall, or at least it's a big part of what's known as the memory wall in AI. So a big problem structurally in AI hardware is how do we overcome this? And there are a whole bunch of techniques people work on to do this.
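
A back-of-envelope way to see the memory wall is a roofline-style comparison of compute time versus data-movement time; the peak numbers below are placeholders chosen only to show the shape of the problem.

```python
def bound_check(flops_needed, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline-style estimate: compare time spent computing vs. time
    spent moving data. Whichever is larger is the bottleneck."""
    compute_time = flops_needed / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return "memory-bound" if memory_time > compute_time else "compute-bound"

# Placeholder accelerator: 1e15 FLOP/s of logic, 3e12 B/s of memory bandwidth.
# A matrix-vector multiply touches every weight once but only does ~2 FLOPs
# per byte read, so it comes out heavily memory-bound.
print(bound_check(flops_needed=2e9, bytes_moved=2e9,
                  peak_flops=1e15, peak_bandwidth=3e12))
```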

They try to do things like, anyway, staggering your memory accesses so that your memory is getting fetched while you're still number crunching on that previous batch of numbers, so that they overlap to the maximum extent possible — all kinds of techniques. But this is kind of the fundamental landscape: you have logic and you have memory, and logic is improving really fast.
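
As a schematic of that overlap idea (not real GPU code), here's a tiny double-buffering sketch in Python, where a background thread keeps the next batch in flight while the current one is being processed.

```python
import threading
import queue

def prefetch(batches, buf):
    # Producer: keep the next batch in flight while compute runs.
    for b in batches:
        buf.put(b)          # stand-in for the slow memory transfer
    buf.put(None)           # sentinel: no more data

def train(batches, step_fn):
    buf = queue.Queue(maxsize=2)                 # the "double buffer"
    threading.Thread(target=prefetch, args=(batches, buf), daemon=True).start()
    while (batch := buf.get()) is not None:
        step_fn(batch)      # compute on batch N while batch N+1 is loading

# Example usage with a dummy compute step:
train(range(8), lambda b: sum(range(10_000)))
```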

Memory is not improving quite as fast because of that dilution of focus, but both logic and memory have to come together on a high performance AI chip. And basically the rest of the story is going to unfold with those key ingredients in mind. So I don't know, maybe that's a good tee-up for the next step here. Yeah, and I can add a little bit, I think, on that point. It's very true: if you just look at RAM capacity over the years, it has grown very fast, but not quite as fast as Moore's law.

And one of the, I guess, interesting things about memory is it's also more complex. Well, I guess CPUs are also complex, now that they're parallelized, but memory is similarly complex, where for various reasons you don't just make the memory faster — you can have smarter memory. So you introduce caching, where, you know, this data is something you use a lot, so you have a faster memory that's smaller, and you cache important information in it so you can get it faster.

So you have these layers of memory that have different speeds, different sizes, right? And now you get to GPUs that need absurd amounts of memory. So on CPUs, right, we have RAM, which is random access memory, which is kind of the fast memory that you can use, and that's usually eight gigabytes, 16 gigabytes. Your OS is largely in charge of getting stuff from storage, from your hard drive, to RAM to then compute on, and then it gets into the cache when you do computations.

Well, for neural nets, you really don't want to be storing anything that's not in RAM, and you want as much as possible to be in cache. So I don't know the exact details, but I do know that a lot of the engineering that goes into GPUs is about those kinds of caching strategies. A lot of the optimizations in transformers are about key-value caching, and, you know, you have just ridiculous numbers on the RAM side of GPUs that you would never see on your CPU, your laptop, where it's usually just 8, 16, 32 gigabytes or something like that.
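
For a sense of scale on that key-value caching point, here's the standard back-of-envelope for KV cache size; the model shape below is a made-up example in the general vicinity of a GPT-3-class model, and the two bytes per value assumes 16-bit storage.

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_val=2):
    # Each layer stores a key vector and a value vector per token, per head.
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_val

# Made-up model shape, roughly "GPT-3-ish" in size:
print(kv_cache_bytes(layers=96, heads=96, head_dim=128,
                     seq_len=4096, batch=8) / 1e9, "GB")  # ~155 GB of cache alone
```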

Yeah, absolutely. And actually, I think you introduced an element there that really helps us move towards the next step of the conversation, which is: what happens on the floor of a data center? Like, what does the data center floor look like?

The reason is that when you think about computing, the image to have in your mind is a hierarchy — a cascading series of increasingly complex operations that get increasingly close to the bare silicon. So think about it this way: heading into a data center, you have just a gigantic amount of really, really high voltage, right, power lines that are coming in. Now, on the chip itself,

you're dealing at roughly the electron level — extraordinarily tiny voltages, extraordinarily tiny currents and all that stuff. To get that energy and those electrons to do all that good work for you, you have to do a lot of gradual step-downs, gradually bringing the memory, bringing the power, bringing the logic

all closer and closer to the place where, at almost the atomic level, the actual drama we're all after can unfold — the number crunching, the arithmetic, that actually trains the models and does inference on them. So when we think about that hierarchy, I'll identify just a couple of levels of memory for us to keep in mind — to keep in RAM — so this just starts to fold in some of these layers that we can think about as we go.

So at one of the higher levels of memory is sort of flash memory, right? This could be your solid state drives or whatever. This is very, very slow memory, but it will continue to work even if your power goes out, right? So it's this persistent memory.

It's, it's slow moving, but it's the kind of thing where, you know, if you wanted to store you know, like, like a, a data set or I don't know, some, some interesting model checkpoints that come about fairly infrequently. You might think about putting them in flash memory, right? This is like a very slow, long term thing. And. You might imagine, okay, well now I also need memory though. That's going to get updated.

For example, like, I don't know, every time there's like a batch of data that comes in, you know, and batches of data are coming in like constantly, constantly, constantly. So, okay, well then maybe that's your high bandwidth memory, right? So, so this is. Again, closer to the chip because we're always getting closer to the chip physically as we're getting closer to the interesting operations, the interesting math. So now you've got your HBM.

Your HBM — we'll talk about where exactly it sits — but it's really close to where the computations happen. It uses a technology called DRAM, which we can talk about and actually should. Anyway, it requires periodic refreshing to maintain data: it stores each bit as a charge in a tiny capacitor, and because of a bunch of physical effects, like leakage of current, that charge gradually drains away, so you have to keep updating each bit.

So if you don't intervene, the stored data can be lost within milliseconds. So you have to keep refreshing, keep refreshing. It's much lower latency than your flash memory. So in other words, way, way faster to pull data from it. That's critical because again, you're pulling those batches, you know, they're coming in pretty, pretty hot, right? And so usually that's on the order of tens of nanoseconds. And so, you know, every kind of tens of nanoseconds, you pull some data off the HBM.

Now, even closer to where the computations happen, you're going to have SRAM. All right. So SRAM is your fastest, your ridiculous, like sub nanosecond access time. Very, very expensive as well. So you can think of this as well as an expense hierarchy, right? As we get closer to where those computations happen, Oh, we got to get really, really kind of small components, very, very custom, you know, custom designed or very purpose built and very expensive, right?

So there's this kind of consistent hierarchy, typically, of size, of expense, of latency, all these things, as we get closer and closer to the leaves on our tree, those end nodes where we're going to do the interesting operations. And data centers and chips — these are all fractal structures in that sense. Really, when you think about computing, you've got to think about fractals. It's fractals all the way down.

You go from one trunk to branches, to smaller branches, smaller branches, just like our circulatory system, just like basically all complex structures. And if you play Factorio, you'll be nodding along, right? This is how the world works — in fractals, in this way: higher and higher resolution at the nodes, but you do want to benefit from big tree trunks, big arteries, that can just have high capacity in your system. Right.

And this kind of reminds me of a little fun fact. For probably a lot of people still — and certainly as a grad student in the late 2010s — a big part of what you were doing was literally just fitting a neural net in a GPU. You're like, oh, I have this GPU with eight gigabytes of memory, or 16 gigabytes.

And so I'm going to run nvidia-smi and figure out how much memory is available on it, and I'm going to run my code, and it's going to load the model into the GPU, and that's how I'm going to do my training. All right. And so for a long while, that was kind of the paradigm: you had one GPU, one model. You tried to fit the model into the GPU memory. That was it. Of course, now that doesn't work.

The models are far too big for a single GPU, especially during training when you have to do backprop — backpropagation — and deal with gradients, et cetera. During inference, people do try to, you know, scale them down, do quantization, and often fit them in a single GPU. But why do we need these massive data centers? Because you want to pack a whole bunch of GPUs or TPUs all together.

We have TPU pods from Google going back quite a while, to 2018 I think, when we had 256 TPUs, and so you can now distribute your neural net across a lot of chips. And now it gets even crazier, because the memory isn't just about loading the weights of the model into a single GPU. You need to transfer information about the gradients on some weights and do some crazy complicated orchestration just to update your weights throughout the neural net, and I really have no idea how that works.
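
A rough rule of thumb for why training doesn't fit on one GPU: with mixed-precision Adam training, people often budget on the order of 16 bytes per parameter before even counting activations (fp16 weights and gradients plus fp32 optimizer state). The sketch below just does that arithmetic; the parameter counts are examples.

```python
def training_memory_gb(n_params, bytes_per_param=16):
    # Rough mixed-precision rule of thumb (an assumption, not a spec):
    #   2 B fp16 weights + 2 B fp16 grads + ~12 B fp32 Adam state
    #   (master weights + two moments) = ~16 bytes per parameter,
    #   before activations and framework overhead.
    return n_params * bytes_per_param / 1e9

print(training_memory_gb(175e9))  # ~2800 GB for a 175B-parameter model
print(training_memory_gb(7e9))    # ~112 GB even for a 7B-parameter model
```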

Well, and you know, part of that we can get into for sure. I think, to touch on that, and just to connect this, by the way, to some of the stuff we've been seeing happen recently with reasoning models and the implications for the design of data centers and compute — this stuff really does tie in, right? So I'll circle back to this observation, right, that memory — HBM, high bandwidth memory, in particular — has been improving

more slowly than logic, right — than the ability to just number crunch. So our ability to fetch data from memory, and the bandwidth and all that, has been improving more slowly than our ability to crunch the numbers. One interesting consequence of this is that you might expect these reasoning models that make use of more inference-time compute to actually end up disproportionately running better on older chips.

And so I just want to explain and unpack that a little bit. During inference, you have to load a language model into active memory from HBM, and your batch sizes — the data that you're feeding in, those batch sizes — will tend to be pretty small. And the reason they tend to be pretty small at inference time is that you can imagine you're getting these bursts of user data that are unpredictable.

And all you know is you better send a response really quickly, or it'll start to affect the user experience. So you can't afford to sit there and wait for a whole bunch of user queries to come in and then batch them, which is what's typically done, right?

The idea with high bandwidth memory is you want to be able to batch a whole bunch of data together and amortize the delay, the latency that comes from, you know, loading that memory from the high bandwidth memory, amortize it across a whole bunch of batches, right? So sure. Like, logic is sitting there waiting for the data to come in for a little while. But when it comes in, it's this huge batch of data. So it's like, okay, that was worth the wait.

The problem is that when you have inference happening, again, you've got to send responses quickly. So you can't wait too long to create really big batches. You've got to get away with smaller batches, and as a result your memory bandwidth isn't going to be consumed by the user data itself, right? You're getting relatively small amounts of user data in.

Your memory bandwidth is disproportionately consumed by just like the model itself. And so you have this high base cost associated with loading your model in. And because the batch size is smaller, you don't need as much logic to run all those computations. You have maybe, you know, eight user queries instead of 64. So, so that's relatively easy on the flops. So, so you don't need as much hard compute, you don't need as much logic.

What you really need, though, is that baseline high memory requirement, because your model's so big anyway. So even though your user queries are not very numerous, your model's big, so you have a high baseline need for HBM but a relatively low need for FLOPs. And because memory improves more slowly than FLOPs, this means you can step back a generation of compute: you're going to lose a lot of FLOPs, but your memory is going to be about the same.

And since this is disproportionately more memory intensive than compute intensive, it tends to favor — like, inference tends to favor — older machines. Bit of a layered thing, and it's okay if you didn't follow that whole thing. But if you're interested in this, you may want to listen back on that, or ask us questions about it. I think this is actually one of the really important trends that we're going to start to see: older hardware being useful for inference-time compute.
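
Here's a hedged sketch of that small-batch argument in numbers: per decoding step the weights have to be streamed from HBM once, while the FLOPs scale with batch size, so below some batch size the memory time dominates. The hardware numbers and model size below are placeholders, not real parts.

```python
def inference_bound(n_params, batch, peak_flops=1e15, hbm_bandwidth=3e12,
                    bytes_per_param=2):
    # Per decoding step: roughly 2 FLOPs per parameter per sequence in the
    # batch, but the weights only need to be read from HBM once.
    compute_time = 2 * n_params * batch / peak_flops
    memory_time = n_params * bytes_per_param / hbm_bandwidth
    return "memory-bound" if memory_time > compute_time else "compute-bound"

print(inference_bound(70e9, batch=4))    # small batch: memory-bound
print(inference_bound(70e9, batch=512))  # big batch: compute-bound
```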

Big, big advantage to China, by the way, because they only have older hardware. So this whole pivot to reasoning and inference-time compute is actually a really interesting advantage for the Chinese ecosystem. And yeah, I think that brings up another interesting, pretty quick tangent. We'll try to get into it. So you brought up batches of data, and that's another relevant detail: you're not just loading models into GPUs.

You're also loading in batches of data, and what that means is, right, you have datasets, and datasets are pairs of inputs and outputs. And when you train a neural net, and when you do inference on it as well, instead of just doing one input, one output, you do a whole bunch together. So you have N inputs and outputs. And that is essential, because when training a neural net, you could try to do just one example at a time, but an individual example isn't very useful, right?

Because you can update your weights for it, but then the very next example might be the opposite class, so you would just not be finding the right path. And then it's also not very feasible to train on the entire data set, right? You can't feed in the entire data set and compute the average across all the inputs and outputs, because that's going to be, A, probably not possible, and B, probably not very good for learning.

So one of the sort of key miracles — almost mathematically surprising — is stochastic gradient descent, where you take batches of data: you take, you know, 25, 50, 256, whatever, inputs and outputs, and it turns out to just work really well. And, you know, theoretically you should be taking the entire data set, right? That's what gradient descent should be doing. Stochastic gradient descent, where you take batches, turns out to be

probably a good regularizer that actually improves generalization instead of overfitting. But anyway, one of the other things with OpenAI that was a little bit novel was massive batch sizes. As you increase the batch, that increases the amount of memory you need on your GPU. So batch sizes were typically relatively small during training, like 128, 256. Now, the bigger the batch, the faster you could train and the better the performance could be.
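
To make "take a batch, average the gradient, update" concrete, here's a minimal NumPy sketch of minibatch SGD on a toy linear-regression problem; the batch size and learning rate are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 32))                    # toy dataset inputs
y = X @ rng.normal(size=32) + 0.1 * rng.normal(size=10_000)  # toy targets

w = np.zeros(32)
batch_size, lr = 256, 0.1

for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)   # sample a minibatch
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / batch_size     # mean-squared-error gradient
    w -= lr * grad                                   # SGD update on the batch average
```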

But, yeah, typically you just couldn't get away with very big batches, and OpenAI — I still remember this — was one of the early organizations getting into like 2,000-example batches or something like that. And then I think one of the realizations that happened with very large models is that, especially during training, massive batches are very helpful. And so that was another reason that memory is important. And it's super, super economical too, right? This is one of the

crazy advantages that OpenAI enjoys, and anyone with really good distribution of their product in this space enjoys. I mean, if you've got a whole ton of users, you've got all these queries coming in at very, very high rates, which then allows you to do bigger batches at inference time, right? Because you may tell yourself, well, look, I've got to send a response to my users within, I don't know, like 500 milliseconds or something like that, right?

And so basically what that says is, okay, you have 500 milliseconds that you can wait to collect inputs, to collect prompts from your users, and then you've got to process them all at once. Well, the number of users that you have at any given time is going to allow you to fill up those batches really nicely if that number is large, and that allows you to amortize the cost — you're getting more use out of your GPUs by doing that.

This is one of the reasons why some of the smaller companies serving these models are at a real disadvantage. They're often serving them, by the way, at a loss, because they just can't hit the large batch sizes they need to amortize the cost of their hardware and energy and be able to turn a profit. And so a lot of the VC dollars you're seeing burned right now in the space are being burned specifically because of this low batch size phenomenon, at least at inference time.
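
That distribution advantage is really just one line of arithmetic: the batch you can fill is arrival rate times how long you're willing to wait. The traffic numbers below are invented for illustration.

```python
def max_batch(queries_per_second, latency_budget_s):
    # If you can afford to wait latency_budget_s before answering, the
    # batch you can fill is just arrival rate x wait time.
    return int(queries_per_second * latency_budget_s)

print(max_batch(2_000, 0.5))  # busy service: batches of ~1000
print(max_batch(10, 0.5))     # small provider: batches of ~5
```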

And on that point, in case it's not clear or maybe some people don't know: a batch, the way it works is, yes, you're doing N inputs and N outputs, but you're doing all of these in parallel, right? You're giving all the inputs together and you're getting all the outputs together. So that's why it's kind of filling up your GPU. And one of the essential metrics is the GPU utilization rate.

If you do one example at a time, that takes up less memory, but then you're wasting time, right? Because you need to do one at a time. Versus if you give it as many examples as your GPU can handle, then you get all those outputs together, and you're utilizing your GPU a hundred percent and getting the most use out of it. Yeah. And this ties into this dance between model architecture and hardware architecture, right?

Like CPUs, CPUs tend to have a handful of cores, right? The cores are the things that actually do the computations. They're super, super fast cores and they're super flexible. But, but they're, they're not very numerous. Whereas GPUs have, can have like thousands of cores but each individual core is very slow.

And so what that sets up is a situation where, if you have a very parallelizable task — where you can split it up into, you know, a thousand or 4,000 or 16,000 little tasks that each core can handle in parallel — it's fine if each core is relatively slow compared to a CPU core.

If they're all chugging away at those numbers at once, then they can pump out thousands and thousands of these operations in the time that a CPU core might do, you know, 20 or whatever, right? So it is slower on a per-core basis, but you have so many cores that you can amortize that and just go way, way faster. And that is at the core of what makes AI today work. It's the fact that it's so crazily parallelizable.

You can take a neural network and you can chunk it up in any number of ways. You could, for example, feed it a whole bunch of prompts at the same time. That's called data parallelism. So you send some chunks of data over to one set of GPUs, another chunk to another set. Essentially you're parallelizing the processing of that data. You can also take your neural networks

and slice them up layer-wise. So you can say layers zero to four are going to sit on these GPUs, layers five to eight will sit on these GPUs, and so on. That's called pipeline parallelism, right? For each stage of your model pipeline, you're kind of imagining chopping your model up lengthwise and farming out the different chunks of your model to different GPUs. And then there's even tensor parallelism, and this is within a particular layer:

imagine chopping that layer in half and having a GPU chew on, or process, the data going through just that part of the model. And so these three kinds of parallelism — data parallelism, pipeline parallelism, and tensor parallelism — are all used together in overlapping ways in modern high-performance AI data centers, in these big training runs. And they play out at the hardware level.

So you can actually see, like, you'll have data centers with chunks of GPUs that are all seeing one chunk of the data set, and then within those GPUs, one subset of them will be specialized in, you know, a couple of layers of the model through pipeline parallelism. And then a specific GPU within that set of GPUs will be doing a specific part of a layer, or a couple of layers, through tensor parallelism.
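
Here's a toy NumPy sketch of those three splits on a two-layer MLP. Real frameworks handle the inter-GPU communication; this just shows which piece of the computation each scheme carves up, and that all three give the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))          # a batch of 8 inputs
W1 = rng.normal(size=(16, 32))        # layer 1 weights
W2 = rng.normal(size=(32, 4))         # layer 2 weights

# Data parallelism: each "GPU" gets half the batch and the full model.
out_dp = np.concatenate([np.maximum(x @ W1, 0) @ W2 for x in np.split(X, 2)])

# Pipeline parallelism: GPU 0 owns layer 1, GPU 1 owns layer 2.
h = np.maximum(X @ W1, 0)             # stage 1 ("GPU 0")
out_pp = h @ W2                       # stage 2 ("GPU 1")

# Tensor parallelism: layer 1's weight matrix is split column-wise
# across two "GPUs"; each computes half the hidden units.
h_a = np.maximum(X @ W1[:, :16], 0)
h_b = np.maximum(X @ W1[:, 16:], 0)
out_tp = np.concatenate([h_a, h_b], axis=1) @ W2

assert np.allclose(out_dp, out_pp) and np.allclose(out_pp, out_tp)
```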

And that's how you really, you know, kind of split this model up across as many different machines as you can to benefit from the massive parallelism that comes from this stuff. Right. And, and by the way, I guess just another fun detail, why did the graphics processing units turn out to be really good for AI? Well, it all boils down to matrix multiplications. Right. It's all just a bunch of numbers. You have one vector, one set of numbers.

You need to multiply it by a set of weights and get the output. That's your typical layer, right? You have N connections and inputs going into one activation unit, so between two layers you wind up doing a matrix times a vector, and so on. So anyway, it turns out that to do 3D computations, that's also a bunch of math — also a bunch of matrices that you multiply to be able to get your rendering to happen.

And so it turns out that you can do matrix multiplications very well by parallelizing over like a thousand cores, versus if you have some kind of long equation where you need to do every step one at a time, that's better suited to a CPU. So yeah, basically 3D rendering is a bunch of linear algebra, neural nets are a bunch of linear algebra, and so the hardware that does the linear algebra for graphics also works for neural nets. That's why it turned out to be such a good fit.
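Concretely, one fully connected layer's forward pass really is just this linear algebra; a minimal NumPy sketch with made-up sizes:

```python
import numpy as np

# One fully connected layer: n_in inputs, n_out activation units.
n_in, n_out = 1024, 4096
x = np.random.randn(n_in)            # input vector (one example)
W = np.random.randn(n_out, n_in)     # weight matrix connecting the two layers
b = np.random.randn(n_out)           # bias

# The layer is a matrix-vector multiply plus a bias...
z = W @ x + b
# ...followed by the nonlinearity mentioned a bit later in the conversation (ReLU here).
a = np.maximum(z, 0)

print(a.shape)  # (4096,) -- and every output element can be computed in parallel
```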

And now with tensor processing units, well, a tensor is like a matrix but with more dimensions, right? So you do even more linear algebra. That's what it all boils down to. Excellent summary. Yeah, this is a good time, now that we've got some of the basics in place, to look at the data center floor and at some current and emerging AI hardware systems that are going to be used for the next beat of scale.

And I'm thinking here in particular of the GB200. SemiAnalysis has a great breakdown of how the GB200 is set up, and I'm pulling heavily from that in this section, with some added stuff thrown in for context and depth. But I do recommend SemiAnalysis, by the way. So yeah, SemiAnalysis is great. One of the challenges with it is that it is highly technical. So I've recommended it to a lot of people.

Sometimes they'll read it and they'll be like, I can tell this is what I need to know, but it's really hard to get underneath it and understand deeply what they're getting at. So hopefully this episode will be helpful in doing that.

Certainly whenever we cover stories that SemiAnalysis has covered, I try to do a lot of translation, at least when we're at the sharing stage. But just be warned, it's a pretty expensive newsletter and it does go into technical depth. They've got some free stuff as well that you should definitely check out if you're interested in that sort of thing.

I've got this premonition that someone wants to correct me and say it's not just linear algebra, because you famously have nonlinear activations and those are required. Yeah, those are also in there, and that's not exactly linear algebra; you have functions that aren't just matrix multiplications over values, though with modern activations you kind of try to get away with as little of that as possible. There's always some really keen douchebag out there.

I don't want to be factually incorrect, so just FYI, that's not what I mean. Well, and actually, mathematically, the fun fact there is that if you didn't have that nonlinearity, then multiplying a bunch of matrices together would be equivalent, from the linear algebra standpoint, to having just one matrix. So you could replace the whole stack with a single layer, anyway. Okay. So let's step onto the data center floor, let's talk about the GB200. Why the GB200?

Well, number one, the H100 has been around for a while; we will talk about it a little bit later. But the GB200 is the next beat, and more and more the future is oriented in that direction, so I think it is really worth looking at. And this is announced and not yet out from NVIDIA, is that right? Or is it already being sold? I believe it's already being sold, but it's only just started. So this is, yeah, the latest and greatest in GPU technology, basically. That's it.

It's got that new GPU smell. So, first thing we have to clarify: you'll see a lot of articles that say something about the B200, and then you'll see other articles that say stuff about the GB200, the DGX B200, all these things. Like, what the fuck are these things, right? So the first thing I want to call out is that there is a thing called a B200 GPU. That is a GPU, okay?

So the GPU is a very specific piece of hardware, let's say the component that is going to do the interesting computations that we care about, fundamentally, at the silicon level. But a GPU on its own is, oh man, what's a good analogy? It's like a really dumb jacked guy: he can probably lift anything you want him to lift, but you have to tell him what to lift, because he's a dumb guy. He's just jacked.

So the B200 on its own needs something to tell it what to do. It needs a conductor, right? It needs a CPU. At least, that's usually how things work here. And so there's the B200 GPU, yes, wonderful, but if you're actually going to put it in a server rack in a data center, you'd best have it paired with a CPU that can tell it what to work on and orchestrate its activity.

Even better if you can have two GPUs next to each other and a CPU between the two of them, helping them coordinate a little bit, right? Helping them do a little dance. That's good. Now, your CPU, by the way, is also going to need its own memory, so you have to imagine there's memory for that, all that good stuff. But fundamentally we have a CPU and two GPUs on this little kind of motherboard, right?

Yeah, that's like you have two jacked guys and you're moving out of an apartment and you have a supervisor. You know what? We're getting there, we're getting there. Right. Increasingly, we're going to start to replicate what the Roman army looked like: you have some colonel, and then you've got the strong soldiers or whatever, and the colonel's telling them what to do, and then there's somebody telling the colonel, I don't know.

Anyway, yeah, you've got a CPU on this motherboard and you've got these two B200 GPUs. Okay, these are the kind of atomic ingredients for now. Now, that is sitting on a motherboard. You can imagine a motherboard as one big rectangle, and we're going to put two rectangles together, two motherboards together. Each of them has one CPU and two B200 GPUs. Together, that's four GPUs and two CPUs.

Together, that's called a GB200 tray. Each one of those motherboards is called a Bianca board. So a Bianca board is one CPU and two GPUs. You put two Bianca boards together, you get a tray that's going to slot into one slot in a rack in a data center. So that's basically what it looks like.
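To keep the naming straight, here's the hierarchy just described, written out as a toy data structure; the field names are invented, only the component counts come from the discussion:

```python
# Toy representation of the naming hierarchy discussed above.
# Field names are made up; only the component counts come from the discussion.

bianca_board = {
    "cpu": "Grace CPU",
    "gpus": ["B200", "B200"],   # one CPU orchestrating two GPUs
}

gb200_tray = {
    "boards": [bianca_board, bianca_board],   # 2 Bianca boards per tray
}

# => 2 CPUs and 4 GPUs per tray, and the tray slots into a rack.
num_cpus = len(gb200_tray["boards"])
num_gpus = sum(len(b["gpus"]) for b in gb200_tray["boards"])
print(num_cpus, num_gpus)  # 2 4
```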

Out the front, you can see a bunch of special connectors for each GPU that allow those GPUs to connect to other GPUs in that same server rack, or very locally in their immediate environment, through these things called NVLink cables. Basically, these are special NVIDIA copper cables. There are alternatives too, but this is kind of the standard one in this ecosystem, and so you can think of this together as one really tightly interconnected

set of GPUs, right? So why copper? The copper interconnect also goes through a special switch called an NVSwitch that helps mediate the connections between these GPUs, but the bottom line is you just have these GPUs really tightly connected to each other through copper interconnects. And the reason you want copper interconnects is that they're crazy efficient at getting data around those GPUs. Very expensive, by the way, but very efficient too.

And so this kind of bundle of compute is going to handle your highest bandwidth requirements, typically tensor parallelism. That's basically the thing that requires the most frequent communication between GPUs, so you're going to do it over your most expensive interconnect, your NVLink.

And so the more expensive the interconnect, roughly speaking, and the more tightly bound these GPUs are together in a little local pod, the more you want to use them for applications that require frequent communication. Tensor parallelism is exactly that, because you're basically taking a layer, or a couple of layers, of your neural network and chopping them up.

But in order to get a coherent output, you need to recombine that data, because one chunk of one layer doesn't do much for you. So they need to constantly be talking to each other really, really fast, because otherwise it would just be a bunch of garbage; they need to be very coherent. At higher levels of abstraction there's pipeline parallelism, where you're talking about whole layers of your neural network.

And, you know, one pod might be working on one set of layers and another pod might be working on another set of layers. For pipeline parallelism, you still need to communicate, but it can be a bit slower and less frequent, right? Because you're not talking about chunks of a layer that need to constantly be in sync just to be remotely coherent. With tensor parallelism those chunks have to come together to form one layer; with pipeline parallelism, you're talking about coherent whole layers.

So this can happen a little bit slower. You can use interconnects like PCIe, that's one possibility, or even go between different nodes over a network fabric, over InfiniBand, which is another, slower form of networking. The pod, though, is the basic unit of pipeline parallelism that's often used here; this runs over what's called the back end network. So tensor parallelism, this idea again of slicing up just parts of a layer, lives within, say, one server rack, for example.

It's all connected through NVLink connectors, super efficient, usually called accelerator interconnect, right? So that's the very local interconnect through NVLink. Pipeline parallelism, this slightly slower traffic of different layers communicating with each other, usually runs over what's called the backend network of a data center. So you've got accelerator interconnect for the really, really fast stuff, and you've got the backend network for the somewhat slower stuff.

And then, typically at the level of the whole data center, when you're doing data parallelism you're sending a whole chunk of your data over to this bit, a whole chunk over to that bit. You send your user queries in and they get divided up that way. That's the front end network. So you've got your front end for your slowest,

typically less expensive hardware too, because you're not going as fast. You've got your backend, which is faster; it's InfiniBand, and now you're typically moving things between layers. This can vary, but I'm trying to be concrete here. And then you've got your fastest thing, which is accelerator interconnect, even faster than the backend network, for the activity that's buzzing around locally.

That's one way to set up a data center. You're always going to find some kind of hierarchy like this, and the particular instantiation can vary a lot, but this is often how it's done.
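One way to summarize that hierarchy, as a rough mental model rather than a spec (real designs vary a lot):

```python
# Rough mental model of the interconnect hierarchy described above.
# The bandwidth/latency ordering is qualitative, not measured numbers.

network_tiers = [
    {"tier": "accelerator interconnect (e.g. NVLink)",
     "scope": "GPUs within a rack / pod",
     "used_for": "tensor parallelism (constant, fine-grained sync)"},
    {"tier": "back end network (e.g. InfiniBand)",
     "scope": "between nodes / pods",
     "used_for": "pipeline parallelism (passing whole-layer activations)"},
    {"tier": "front end network",
     "scope": "whole data center",
     "used_for": "data parallelism / routing incoming queries"},
]

for tier in network_tiers:
    print(f'{tier["tier"]}: {tier["used_for"]}')
```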

And so if you're designing hardware or designing models, you're in the business of saying, okay, how can I architect my model such that I can chop it up, with a little bit of my model on this GPU here and that GPU there, such that I can chop up my layers in a way that makes maximal use of my hardware, right?

There is this dance where you're doing very hardware aware algorithm design, especially these days, because the main rate limiting thing for you is: how do I get more out of my compute? Right. And I think that's another big aspect of TPUs and Google, right? Google was a thing that OpenAI worried about partially because of TPUs, but also in big part because they had expertise in data centers. That was part of the reason Google got out ahead.

They were really good at data center creation and they were early to the game. So not only did they make TPUs, Tensor Processing Units, but pretty quickly afterwards they also worked on TPU pods, where you combine, you know, 256 or a couple thousand TPUs together, presumably with that sort of memory optimization you're talking about, to have much larger neural nets, much faster processing, et cetera. Actually, that's a great point, right?

There's this interesting notion of what counts as a coherent blob of compute. The real way to think about this is in terms of the latency, the timeline on which activities are unfolding at the level of that blob. So you think about what a coherent blob of compute is for tensor parallelism. Well, it's gotta be really, really fast, right?

Because these computations are really quick, really efficient, but then you've gotta move on really quickly.

And so one of the things that Google has done really well is that these pods can coherently link together very large numbers of chips; you're talking in some cases about hundreds of these, I think 256 for TPU v4 is one of the standard configurations. But one of the key things to highlight here, by the way, is that there is now a difference between the GPU, which is the B200, and the system, the GB200, the system in which it's embedded.

So the GB200, by definition, is this thing that has a CPU and two GPUs. It's on a tray along with a bunch of other ancillary stuff; that's your Bianca board, and there's another Bianca board right next to it, and together that's one GB200 tray, right? So we are talking about GPUs, and the basic idea behind the GB200 is to make those GPUs do useful work, but that requires a whole bunch of ancillary infrastructure that isn't just that B200 GPU, right?

And so the packaging together of those components, the B200 GPU and the CPU and all those ancillary things, is done by companies like Foxconn, for example, that put together the servers. Once NVIDIA finishes shipping out the GPUs, somebody's gotta assemble these, and NVIDIA can do some of this themselves, but companies like Foxconn can step in, and we covered a story, I think, about Foxconn looking at a factory in Mexico to do this sort of thing, right? So they're actually building

the supercomputer, in a sense, putting all these things together into servers and farming them out. Anyway, different layers of that stack are done by Foxconn and different layers by NVIDIA, but fundamentally I just want to differentiate between the GB200 system and the B200 GPU. The GB200 system can also exist in different configurations. So you can imagine a setup where you have one rack and it's got, say, 32 B200 GPUs, and they're all tightly connected.

Or you could have a version where you've got 72 of them. Often what will determine that is how much power density you can actually supply to your server racks. And if you just don't have the power infrastructure or the cooling infrastructure to keep those racks humming, then you're forced to take a hit and literally put less compute capacity in a given rack.

That's one of the classic trade-offs that you face when you're designing a data center. Yeah. And I'll shout this out in case people don't have the background: another major aspect of data center design and construction is the cooling. Because when you have a huge number of chips computing, the way semiconductors work is that you're pushing electricity around and using energy, which produces heat.

And when you're doing a ton of computation, like GPUs do, you get a lot of heat; your machine can actually warm up a fair bit if you really use your GPU. So when you get to these racks, where you really try to concentrate a ton of compute together, you get into advanced cooling, like liquid cooling. And that's why data centers consume water, for instance, and why, if you look at the climate impacts of AI, water usage is often cited as one of the metrics.

That's why you care about where you put your data center in terms of climate, and presumably that's a big part of the engineering of these systems as well. Absolutely. And in fact, that's one of the things the H100 series of chips is somewhat famous for: being the first chip to have a liquid cooled configuration. The Blackwells all need liquid cooling, right?

So for this next generation of infrastructure, the B200 and so on, you're going to have to have liquid cooling integrated into your data center. It's just a fact of life now, because these things put off so much heat, because they consume so much power. There's sort of an irreducible relationship between computation and power dissipation. Absolutely, these two things are profoundly linked. I think now it might make sense to double click on the B200, just the GPU.

So we're not talking about the Grace CPU that sits on the Bianca motherboard and helps orchestrate things, all that jazz, but specifically the B200 GPU, or let's say the GPU in general. And I think it's worth double clicking on that and seeing what its components are, because that'll start moving us into the fab and packaging story, where TSMC comes in, and introducing some of the main players. Does that make sense? Yeah, I think so. Okay.

So we're looking at the GPU, and right off the bat there are two components that are going to matter; this is going to come up again, right? We have our logic and we have our memory, the two basic things that you need to do useful shit in AI. So, okay, let's start with the memory, because we've already talked about memory, right? You care about the latency, the capacity, and the bandwidth of this memory.

Well, we're going to use this thing called high bandwidth memory, and that's going to sit on our GPU. We're going to have stacks of high bandwidth memory, stacks of HBM. Roughly speaking, one layer of the stack is like a grid that contains a whole bunch of capacitors that each store some information. And you want to be able to pull numbers off that grid really efficiently.

Now, those layers, by the way, are DRAM; DRAM is a form of memory that goes way, way back. But the innovation with HBM is stacking those layers of DRAM together and then connecting them by putting these things called through-silicon vias, or TSVs, all the way through those stacks. And TSVs are important because they basically allow you to pull data from all these layers simultaneously. Hence the massive bandwidth.

You can get a lot of data throughput through your system because you're drawing down from all of those layers in your stack at once. So, many layers of DRAM: you'll see eight layer versions and 12 layer versions; the latest versions have something like 12 layers. The companies, by the way, that manufacture HBM are different from the companies that manufacture the logic that sits on the chip.
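To see where the "high bandwidth" actually comes from, here's back-of-envelope arithmetic using commonly cited HBM3 figures; treat the exact numbers as approximate:

```python
# Back-of-envelope HBM3 bandwidth per stack (approximate, commonly cited figures).
interface_width_bits = 1024       # very wide interface across the stacked dies
data_rate_gbps_per_pin = 6.4      # gigabits per second per pin

stack_bandwidth_gb_per_s = interface_width_bits * data_rate_gbps_per_pin / 8
print(stack_bandwidth_gb_per_s)   # ~819 GB/s for a single stack

# An accelerator with several such stacks multiplies this accordingly,
# which is why you hear multi-terabyte-per-second memory bandwidth numbers.
num_stacks = 6                    # illustrative stack count
print(num_stacks * stack_bandwidth_gb_per_s / 1000)  # ~4.9 TB/s
```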

So for the memory companies, the HBM companies, basically the only two that matter are SK Hynix in South Korea and Samsung, also in South Korea. There is Micron, but they're in the U.S. and they kind of suck; they have almost none of the market right now. But yeah, fundamentally, when you're looking at NVIDIA GPUs, you're going to have HBM stacks from, say, SK Hynix, and they're just really good at pulling out massive amounts of data.

The latency is not great, but you'll pull down massive amounts of data at the same time and feed it into your logic die, right? Your main GPU die or your compute die; people use all these terms kind of interchangeably, but that refers to the logic part of your GPU that's actually going to do the computation. For the H100, that die is sometimes known as the GH100, but fundamentally this is the place where the magic happens.

So you're pulling this data from the HBM into the logic die in massive quantities, all at once. One thing to recognize about the difference between HBM and the main GPU die: the process to fabricate these things is very different. You need a very different set of expertise to make HBM, high bandwidth memory, versus to make a really good logic die, and this means that the fabs, the manufacturing facilities that actually build these things, are different.

So SK Hynix might do your HBM, but TSMC is almost certainly going to do your logic die, right? And the reason is partly process reasons, and part of it is also the effective resolution. Logic dies are these very irregular structures, right? We talked about how high bandwidth memory is basically these stacked grids; they're very regular. And as a result, a couple of things follow, like the fact that you don't need as high a resolution in your fabrication process.

So you'll typically see people use something like 10 to 14 nanometer processes to do HBM3, for example. But if you're looking at the logic die, you're building transistors that are these weird, irregular, extremely bespoke structures, and as a result you need a much higher grade process, typically four to five nanometer processes. That doesn't mean the expertise is interchangeable, though.

TSMC is usually the one doing all the truly leading edge logic processes, but they can't really turn around and just make HBM very easily; again, it's a different set of core competencies. And so what has to happen is you're going to source your HBM from one company and your logic from another, and now you need to make them dance together. Somehow you need to include both the logic and the memory in the same chip package.

And for that, nowadays, the solution people have turned to is to use an interposer. Right.

So an interposer is a structure that the logic and the memory, and a couple of other components too, are going to sit on, and the interposer essentially allows you to connect, say, from the bottom of the HBM to the bottom of the logic, to create these kinds of chip level connections that link your different components together.

And this process is called packaging. Now, TSMC famously has this CoWoS packaging process. There are two kinds of CoWoS, CoWoS-S and CoWoS-L; we don't have time to get into the details, but they are kind of fascinating. The bottom line is that this is, number one, a way of linking together your memory die and your main GPU die, your logic die.

But also, an interesting thing that happens is that as you move down the package, the resolution of the interconnects gets lower and lower; things get coarser and coarser, bigger and bigger. At the chip level you've got crazy high resolution connections happening; your pitch size, as it's sometimes called, the sort of resolution of the structure, is really, really fine, really, really small.

You actually want to deliberately decrease that resolution as quickly as you can, because it allows you to have thicker wires, which are more efficient from a power delivery standpoint and make it possible for you to use more antiquated fabrication processes and all that stuff. As quickly as possible, you want to get away from things that require really, really advanced processes. So this is basically the landscape.

You've got a bunch of stacked DRAM, in other words high bandwidth memory, those stacks of memory sitting next to a GPU die, a logic die, that's actually going to do the computations. And those are all sitting on top of an interposer, which links them together and has a bunch of really nice thermal and other properties. And on that point, you know, we mentioned TSMC and fabs and their part of the story, which I think deserves a little bit more background, right?

So fab means fabrication. That's where you take the basic building block, the raw material, and convert it into something that computes. So let's dive in a little bit to what it involves, for any less technical people. First, what is a semiconductor? It's literally a semi-conductor: a material that, due to the magic of quantum mechanics and other stuff, you can use to let current through or not. Fundamentally, that's the smallest building block of computing. And so what is a fab?

It's something that takes raw material and creates nanometer scale sculptures, structures of material that you can then give power to, that you can switch on or off, and that you can then combine in various patterns to do computations. So why is fabrication so complicated? Why is TSMC the one player that really matters?

There are a couple of organizations that can do fabrication, but TSMC is by far the best, because this is, as we mentioned before, maybe the most advanced technology that humanity has ever made. You're trying to take this raw material and literally make nanometer sized patterns in it for semiconductors, right? You need to do a little sculpture of raw material in a certain way and do that a billion times, in a way that allows for very few imperfections.

And as you might imagine, when you're dealing with nanometer sized patterns, it's pretty easy to mess up. You let one little particle of dust in, and that's bigger than, I don't know how many transistors, but it's pretty big relative to them. There are like a million things that could go wrong and mess up the chip. So it's about the most delicate, intricate thing you can attempt to do.

And the technologies that enable this, that actually do the fabrication at nanometer scales, where we're now getting to the point that quantum effects matter and so on, are incredibly complicated, incredibly advanced, and incredibly delicate. So, as we've previewed, we're now seeing TSMC trying to set up in the U.S., and it's going to take them years to set up a fab.

And that's because you have a lot of advanced equipment that you need to set up in a very delicate way. You're literally taking big blocks of raw material, these slabs of silicon, I believe, and cutting them into little circles, the wafers. You need to move those around to various machines that do various operations, and somehow you need to end up with something that has the right set of patterns.

So it's fascinating how all this works, and the advanced aspects of it, I don't even fully know. And it costs hundreds of millions of dollars, as we've covered, just to get the most advanced equipment. You have like one corporation that can make the technology required to produce these patterns at two nanometers, or whatever resolution we have nowadays. And so that's why fabrication is such a big part of the story. That's why NVIDIA farms out fabrication to TSMC. They have just

perfected the art of it, and they have the expertise and the capability to do this thing that very few organizations are capable of even attempting. And that, by the way, is also why China can't just easily catch up and make the most advanced chips. It's just incredibly advanced technology. Yeah, absolutely. And as we discuss this, by the way, we're going to talk about things called process nodes, or processes, or nodes, right?

These are the fabrication processes that fabs like TSMC use. TSMC has historically liked to identify their processes with a number in nanometers, at least up until now. So they talk about, for example, the seven nanometer process node or the five nanometer process node. And famously, there are kind of three layers of understanding when it comes to that terminology.

The first layer is to say something like: when we say 7 nanometer process node, we mean they're fabricating down to 7 nanometer resolution, right? Which sounds really impressive. Then at the next layer, people point out, oh, that's actually a lie; they'll sometimes call it marketing terminology, which I don't think is accurate, and that speaks to the third layer. The phrase seven nanometers is sometimes referred to as a piece of marketing terminology because it is true that

there's no actual feature in there that is at seven nanometer resolution; there's no piece of it that is truly, physically down to seven nanometers. But what the seven nanometer label really refers to is the performance you would get if historical trends in Moore's law had continued. You know, back when we were talking about, say, two micron resolution, the number actually did specify the real feature size.

And if you kept that trend going, the transistor density you would end up with would be the density associated with hitting the seven nanometer threshold; we're just getting there in different ways. So my kind of lukewarm take on this is that I don't know that it's actually marketing terminology so much as it is the outcome based terminology that you actually care about as a buyer, right? You care about: will this perform as if you were fabbing down to seven nanometers?

Or will it perform as if you were fabbing down to three? And that's the way you're able to get to numbers of nanometers where we're approaching the point of, like, a couple of angstroms, right? Like 10 hydrogen atoms strung together. Obviously we're not able to actually fab down to that level, and if we could, there'd be all kinds of quantum tunneling effects that would make it impossible. So anyway, that's the basic idea here.
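To illustrate that outcome-based naming logic, here's a tiny sketch of the implied density scaling; real node names only loosely track actual density, so this is the idealized version of the trend, not a claim about any specific process:

```python
# Idealized version of the naming logic: the node name implies a density
# scaling as if features really shrank by that factor.
# Real marketing node names only loosely track actual transistor density.

def implied_density_scaling(old_node_nm, new_node_nm):
    # If linear dimensions shrank by old/new, area per transistor shrinks by its square.
    return (old_node_nm / new_node_nm) ** 2

print(implied_density_scaling(14, 7))   # 4.0  -> "7nm" suggests ~4x the density of "14nm"
print(implied_density_scaling(7, 5))    # ~1.96x
```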

Today's leading node is switching over to the two nanometer node right now. What you'll tend to see is that the leading node is subsidized basically entirely by Apple. Phone companies want it small, they want it fast, and Apple is willing to spend. And so they will work with TSMC to develop the leading node each year, each cycle, right? And that's a massive partnership boost for TSMC.

Other companies, former competitors of TSMC like GlobalFoundries, suffer a lot because they need a partner to help them subsidize that next node development. So this is a big strategic moat for TSMC: they have a partner like Apple that's willing to do that. This means Apple monopolizes the most advanced node for their phones every year, and that leaves the next node up free for AI applications. The interesting thing, by the way, is that might change.

You could see that start to change as AI becomes more and more in demand, as NVIDIA potentially becomes able to compete with Apple down the line for the very same deal with TSMC, right? If AI is just fueling way more revenue than iPhone sales or whatever else.

Well, now all of a sudden NVIDIA might be able to muscle in and you might see a change in that dynamic. But at least for right now, that's how it's playing out, and so NVIDIA got to work with the five nanometer process for the H100, right? That's the process they used for it. They actually used the four nanometer process, which is really a variant of the five nanometer, but the details don't super matter there. Fundamentally, the story then is about how

TSMC achieves these sorts of effects. One part of that story is how you architect the shape of your transistors. The breakthrough before the most recent breakthrough is called the FinFET; basically a fin-like structure that they bake into their transistors, and it works really well for reasons. Then there's the gate-all-around transistor that's coming in the next cycle, which is going to be way more efficient, and blah, blah, blah.

But the bottom line is they're looking at how to tweak the shape of the structure that the transistor is made of, to make it more effective, to make it work with smaller currents, to make it better from a power density standpoint, with better thermal properties, and so on and so forth. But the separate piece is: what is the actual process of creating that structure? That process is basically a recipe.

So this is the secret sauce, the magic, that really makes TSMC work. If you are going to replicate what TSMC does, you need to follow basically the same iterative process they did to get to their current recipe. This is like a chef that's iterated over and over with their ingredients to get a really good outcome. You can think of a TSMC fab as a box with like 500 knobs on it.

And you've got PhDs tweaking every single knob; they're paid an ungodly amount of money and it takes a huge amount of time. They'll start at, say, the 7 nanometer process node, and then, based on what they've learned to get there, they iterate to get to the 5, the 3, the 2, and so on. And you really just have to do it hands on; you have to climb your way down that hierarchy, because the things you learn at seven nanometers help shape what you do at five and three and two and so on.

And this is one of the challenges with, for example, TSMC just trying to spin up a new fab starting at the leading node in North America or wherever. You can't really do that. It's best to start a couple of generations back and then work your way forward locally, because even if you try to replicate what you're doing in another location, dude, air pressure, humidity, everything's a little bit different. Things break.

This is why, by the way, Intel famously had a design philosophy for their fabs called Copy Exactly. This was famously a thing where everything, down to the color of the paint in the bathrooms, would have to be copied exactly to spec, because nobody fucking knew why the yields from one fab were great and the other one's were shit. And it was just like, I don't know, let's just not mess with anything. Right. That was the game plan.

And TSMC has their own version of that. That tells you how hard this is to do, right? This is really, really tough stuff. The actual process starts with a pure silicon wafer. So you get your wafer source; this is basically sand that has been purified into ultra-pure silicon, roughly speaking. And you put a film of oxide on top of it, grown with oxygen or water vapor, that's just meant to protect the surface and block current leakage.

And then what you're going to do is deposit on top of that a layer of a material that's meant to respond to light. It's called photoresist.

And the idea behind photoresist is that if you expose it to light, some parts of the photoresist become soluble, so you'll be able to remove them using some kind of process, or other parts might harden. You can have positive photoresist or negative photoresist, depending on whether the part that's exposed stays or is removed. But essentially the photoresist is a thing that's able to retain the imprint of the light that hits your wafer in a specific way.

So by the way, the pure silicon wafer, that is a wafer; we're ultimately going to make a whole bunch of dies on that wafer, a whole bunch of, say, B200 dies on that one wafer. So the next step is, once you've laid down your photoresist, you're going to shoot a light source

at a pattern, sometimes called a photomask, a pattern of your chip, and the light that goes through is going to encode that pattern and image it onto the photoresist. There's going to be an exposed region, and you're going to replicate that pattern all across your wafer in a sort of raster scan type of way, right?

And anyway, you're then going to etch away, get rid of, the exposed photoresist, and then you'll do steps like ion implantation, where you use a little particle accelerator to fire ions into your silicon to dope it, because semiconductors need dopants. Basically, you introduce some deliberate imperfections, and that turns out to change how the electrons move through the material, and it's all magic, honestly.

Yeah. And by the way, to that point about Copy Exactly, this is another fun detail in case you don't know: one of the fundamental reasons TSMC is so dominant, and why they rose to dominance, is yield. You can't be perfect; it's a fundamental property of fabrication that some stuff won't work out, that some percentage of your chips will be broken and not usable. That's yield. And if you get a yield of like 90%, that's really good.

If only 10 percent of what you fabricate is broken, that's great. When you go smaller, and especially as you set up a new fab, your yield starts out bad; it's basically inevitable. And TSMC is very good at getting the yield to improve rapidly. So that's a fundamental aspect of competition: if your yield is bad, you can't be economical, and you lose.
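A quick toy example of why yield is so economically brutal; every number here is invented purely for illustration:

```python
# Toy yield economics. All numbers are invented for illustration.
wafer_cost = 15_000      # dollars per processed wafer (illustrative)
dies_per_wafer = 60      # large GPU-class dies per wafer (illustrative)

def cost_per_good_die(yield_fraction):
    good_dies = dies_per_wafer * yield_fraction
    return wafer_cost / good_dies

print(round(cost_per_good_die(0.90)))  # ~278 dollars per good die
print(round(cost_per_good_die(0.30)))  # ~833 dollars -- roughly 3x the cost at 30% yield
```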

A hundred percent. And in fact, this is where it comes to SMIC, which is TSMC's competitor in China, and which, by the way, stole a bunch of TSMC's industrial secrets in a very fun way. There are some really fun details there for sure. Yeah, lawsuits and all kinds of stuff. But fundamentally, SMIC stole a lot of that information and replicated it quite successfully. They're now at the seven nanometer level, right?

And they're working on five, but their yields are suspected to be pretty bad. One of the things with China, though, is that the yields matter less, because you have massive government subsidies of the fabrication industry. So they can maybe get away with that and stay competitive, because the government of China, the CCP, has identified this as a key strategic thing, so they're willing to just shovel money into the space.

But yeah, so this fabrication process has a lot of steps. By the way, a lot of them are cleaning: polishing off surfaces, cleaning them to make sure everything's level. So there's a lot of boring stuff that goes on here. Anyway, I work with a lot of guys who are very deep in this space, so I do like to nerd out on it, but I'll contain myself.

The part of this process, though, that I think is most useful to draw your attention to is this idea of shining a light source onto a reticle, this photomask that contains the imprint of the circuit you want to print onto your wafer. That light source, and the whole set of optics around it, is a huge part of the tradecraft here. So when you think about the things that make this hard, number one, there's the recipe.

How do you do these many, many layers of photomasking and etching and ion implantation and deposition, all that jazz? That know-how is what TSMC knows really well, right? That's the thing that's really hard to copy. But even if you could copy that, you would still need the light source that allows you to do this photolithography, as it's called, the exposure of specific patterns onto your wafer.

And so those photolithography machines become absolutely critical in the AI supply chain, in the hardware supply chain. And there is really just one company that can do it well, and in a way it's a complex of companies. This is ASML, a Dutch company in the Netherlands.

They have this really interesting overlapping history with the Carl Zeiss company, and they are essentially a complex of companies, just because of ownership structure and overlapping talent and stuff like that. But it all flows through that ASML–Carl Zeiss complex.

So when we talk about photolithography, this very challenging stage of how we put light onto our chip, or onto our wafer, such that it gives us, with high fidelity, the pattern we're after, that is going to be done by photolithography machines produced by ASML. And that brings us to, I think, the final stage of the game: talking about how the photolithography machines themselves work and just why they're so important.

Does that make sense, or is there stuff you wanted to add on the TSMC bit? I think one thing we can mention real quick, since we were touching on process nodes, is where Moore's law fits into this. Well, if you look back a bit over a decade, in 2011 we were at the 28 nanometer stage. Now we're using roughly five nanometer for AI and trying to get to two nanometer. And that is not according to Moore's law, right?

Moore's law has slowed down, empirically. It's much slower now to get to smaller process sizes, at least relative to the 80s or the very early days. And that's partially why you've seen CPUs move to multiple cores, to parallelization, and why GPUs are such a huge deal: even though we can't scale down and get to smaller process nodes as easily, because it's incredibly hard,

if you just engineer your GPU better, even without a higher density of transistors, by getting those cores to work better together, by combining them in different ways, by designing your chip in a certain way, that gets you the sort of jump in compute speed and capacity that you used to get just through getting smaller transistors.

Yeah. And it is the case also that, thanks to things like FinFET and gate-all-around, we have seen a surprising robustness of even the fabrication process itself. The five nanometer process first came out in about 2020, and then we were hitting three nanometers in early 2023. So there's still some juice to be squeezed, but it's slowing down, I think it's fair to say. Yeah, I think that's true.

And you can actually look at the projections, by the way, because of the insane capital expenditure required to set up a new fab. TSMC can tell you what their schedule is for the next three nodes, going into 2028, 2029, that sort of thing. And that's worth flagging, right? We're talking tens of billions of dollars to set up a new fab, like aircraft carriers' worth of risk capital. And it really is risk capital, right?

Because, like Andrey said, you build the fab and then you just kind of hope that your yields are good, and they probably won't be at first. That's a scary time. So this is a very high risk industry; TSMC is very close to base reality in terms of unforgiving market exposure. Right. So, okay, I guess photolithography, this sort of last and final glorious step in the process, is where we're going to squeeze a lot of the high resolution

into our fabrication process. This is where a lot of that resolution comes from. So let's start with DUV, the deep ultraviolet lithography machines that allowed us to get roughly to where we are today, roughly to the, let's say, 7 nanometer node, arguably the 5 nanometer; there's some debate there.

So, when we talk about DUV, the first thing I want to draw your attention to is that there is a law in physics that says, roughly speaking, that the wavelength of your light determines the level of precision with which you can make images, or in this case, imprint a pattern. So if you have a 193 nanometer light source, you're typically going to think, oh, well, I'll be in the

hundreds of nanometers in terms of the resolution with which I can image stuff. Now, there's a whole bunch of stuff you can do to change that. You can use larger lenses; essentially what this does is collect a lot more rays of that light, and by collecting more of those rays, you can focus more tightly, or in more controlled ways, and image better. But generally speaking, the wavelength of your light

is going to be a big factor, and the size of your lens is going to be another; that's sometimes described as the numerical aperture. Anyway, those are the two key components. 193 nanometers is the wavelength that's used for deep ultraviolet light. This is a big machine that costs millions and millions of dollars; it's got a bunch of lenses and mirrors in it, and ultimately it ends up shining light onto this photomask.
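The physics rule being referenced here is usually written as the Rayleigh criterion: minimum feature size is roughly k1 times wavelength divided by numerical aperture. A rough plug-in of typical numbers, where the k1 value is an assumed illustrative figure:

```python
# Rayleigh criterion: smallest printable feature ~ k1 * wavelength / numerical aperture.
def min_feature_nm(wavelength_nm, numerical_aperture, k1=0.30):
    # k1 is a process-dependent fudge factor; 0.3 is an assumed illustrative value.
    return k1 * wavelength_nm / numerical_aperture

print(round(min_feature_nm(193, 1.35)))   # DUV immersion: ~43 nm features
print(round(min_feature_nm(13.5, 0.33)))  # EUV: ~12 nm features
```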

And there's a bunch of interesting stuff about technologies like off-axis illumination and eventually immersion lithography and so on that get used here, but fundamentally you're shining this laser and trying to be really clever about the lens work you're using to get to the feature sizes that might allow us to reach seven nanometers. You can go further than seven nanometers with DUV if you do this thing called multi-patterning.

So you essentially take your wafer and you go over it once, and then you go over it again with the same laser. That allows you to do a first pass and then, not necessarily a corrective, but a refining pass on your die during the fabrication process. The challenge is that this reduces your throughput: instead of passing over your wafer once, you've got to pass over it twice, or three times, or four times.

And that means your output is going to be slower, and because your capital expenditure is so high, you're basically amortizing the cost of these insanely expensive photolithography machines over the number of wafers you can pump out. So slowing down your output really means reducing your profit margin very significantly. And so SMIC is presumably looking at using multi-patterning like that to get to the five nanometer node.
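Roughly, the economics of multi-patterning look like this; all the numbers are made up for illustration:

```python
# Illustrative cost of multi-patterning: extra passes cut litho throughput.
# All numbers are made up for illustration.

wafers_per_hour_single_pass = 100
tool_cost_per_hour = 5_000            # amortized machine + operating cost (illustrative)

def litho_cost_per_wafer(num_passes):
    throughput = wafers_per_hour_single_pass / num_passes
    return tool_cost_per_hour / throughput

print(litho_cost_per_wafer(1))  # 50.0 per wafer
print(litho_cost_per_wafer(4))  # 200.0 per wafer -- quadruple patterning, ~4x litho cost
```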

But again, that's going to cost them in the same way that bad yield does: it's going to cost you throughput, and those things are really tricky. So that is the DUV machine; it allowed us to get to about seven nanometers. But around the five nanometer level, pretty quickly, you just need a new light source. And that's where EUV, extreme ultraviolet lithography, comes in. It is a technology that has been promised forever.

There were, I don't know, something like 10 generations of processes where they're like, ah, this is gonna be the one that uses EUV, and there's always some stupid shit that comes up, and then they can't ship it. So finally, we're at the EUV generation now. The EUV light source is 13.5 nanometers. It is really, really fucking cool; I'm just going to tell you how crazy this is. Okay. So somehow you need to create 13.5 nanometer light.

By the way, for what I'm sharing here, there's a really great explainer that goes into much of this detail and has great illustrations on the Asianometry YouTube channel. Check that out; that's another great resource. So it turns out, back in the day, people realized that you could fire a laser at a tin plate, like a flat sheet of tin, and get it to emit

13.5 nanometer light. 13.5 nanometers is extreme ultraviolet, very short wavelength, high energy light. The problem, though, is that the light is going to fly out in all different directions, and you need to find a way to collect it somehow. So people went, okay, you know what, let's experiment with concave tin plates.

We're going to shape a tin plate like a concave mirror, so that when we shine light at it, the light we get back will hopefully be more focused, not quite collimated, but more controlled. So they tried that. The problem is that when you shine light on that concave tin plate, you get a bunch of sputtering, a bunch of vaporization of the tin.

And so, yeah, you produce your 13.5 nanometer light, but that light gets absorbed by all these annoying tin particles that then get in the way. So you're like, ah shit, okay, now we're screwed, a tin plate doesn't work. But then somebody came up with this idea of using tin droplets. So here's what's actually going to happen; it's pretty fucked up inside an EUV machine. You've got a tin droplet generator. This thing fires these tiny, like 100 micron,

tin droplets at about 80 meters a second; they are flying through this thing. So tin droplets go flying. As they're flying, a pre-pulse laser gets shot at them to flatten them out, turning them into basically the little reflective plates we want, getting them into the right shape. So you're a tin droplet, you're flying through at top speed, and you get hit by laser pulse number one to get flattened.

And then in comes the main laser pulse from a CO2 laser that's gonna vaporize you into a plasma that emits the light. Now, because you're just a tiny tin droplet, there's not enough of you to throw off debris that gets in the way of that 13.5 nanometer light, so we can actually collect it. It's like hitting a bullet with another bullet twice in a row, right? This tin droplet flies by crazy fast, the pre-pulse laser flattens it out.

Then the next laser, boom, vaporizes it, and out comes the EUV light. And by the way, that has an overall conversion efficiency of about 6%, so you're losing the vast majority of your power there. Out comes the EUV light, and then it's going to start hitting a bunch of mirrors. No lenses, just mirrors. Why? Because at 13.5 nanometers, basically everything is absorbent, including air itself. So now you've gotta fucking have a vacuum chamber.

This is all, by the way, happening in a fucking vacuum, because your life now sucks, because you're making EUV light. So you've got a vacuum chamber, because air will absorb everything, and you're not allowed to use lenses. Instead, you've gotta find a way to use mirrors, because your life sucks. Everything in here is just mirrors; there are just under a dozen mirrors in an EUV system, and they're basically trying to replicate what lenses do.

You're trying to focus light with mirrors, which, speaking from my optics background, is a hard thing to do. There's a lot of interesting jiggery-pokery that gets done here, including poking holes in mirrors so you can let light go through and hopefully not get too lost. Anyway, it's a mess, but it's really cool. But it's a mess.

And so you've got these 12 or 11 or 10 mirrors, depending on the configuration, desperately trying to collect and steer this light, all happening in vacuum. Finally, it hits your photomask, and even your photomask has to be reflective, because light would just be absorbed in any kind of transmissive material. Anyway, this creates so many painful problems. You're literally not able to have any of what are called refractive elements.

In other words, lens-like elements where the light just goes through and gets focused. No, everything has to be reflective all the time, and that is a giant pain in the butt. It's a big part of the reason why these machines are a lot harder to build and a lot more expensive. But that is EUV versus DUV.

It seems like all you're doing is changing the wavelength of the light, but when you do that, all of a sudden everything changes. Even these mirrors, by the way, are only about 70 percent reflective, which means about 30 percent of the light gets absorbed at each one. And if you've got a whole chain of mirrors, then all the way through you're going to end up with just a couple percent transmission: 30 percent of the light gets lost at mirror 1, another 30 percent at mirror 2, and so on.

If you work that through with 10 mirrors, you get to about 2 to 3 percent transmission, right? So you're getting really crap efficiency on all the power you're putting into your system. By the way, the CO2 laser is so big it's got to sit under the floor of the room where you're doing all this stuff. This whole thing is a giant pain in the butt, and that's part of the challenge. That is EUV.
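That mirror-loss arithmetic, spelled out, using the roughly 70 percent reflectivity figure quoted above:

```python
# Stacking ~70%-reflective mirrors: transmission falls off geometrically.
reflectivity = 0.70
for num_mirrors in (1, 5, 10):
    print(num_mirrors, round(reflectivity ** num_mirrors, 3))
# 1  0.7
# 5  0.168
# 10 0.028   -> only ~2-3% of the EUV light survives ten bounces
```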

There's also high numerical aperture EUV, which is the next beat. That basically just involves using effectively bigger optics, tweaking your mirror configuration, since you're in EUV, to collect more rays of light so you can focus down more tightly. The problem with that is that all the semiconductor fabrication setup assumes a certain size of optics.

And so when you go about changing that, you've got to refactor a whole bunch of stuff. You can't image the whole photomask at once; the size of the photomask that you can actually image, in other words the size of the circuit you can imprint on your chip, drops by about 50%. So now, if you want to make the same chip, you've got to stitch together two photomasks, if you will. Rather than having one clean circuit that you're printing, you're going to stitch together two of them.

And how do you actually get these insanely high resolution circuits to line up in just the right way? That's its own giant pain in the butt, with a whole bunch of interesting implications for the whole supply chain. I'm going to stop talking, but the bottom line is EUV is a big leap forward from DUV, and it's what China right now is completely missing. Export controls have fully prevented China from accessing EUV machines, let alone high-NA EUV. So they're all on DUV.

They're trying to do multi-patterning to match what TSMC and other places can do with EUV. Yeah, I think you did a great job conveying just how insane these technologies are. Once you realize how absurd what's going on is in terms of precision, it's pretty mind blowing. And I think it also brings us to maybe the last point we'll get to, and a large part of why we're doing this episode: export controls.

Maybe we can dive into what they are, what is being controlled, and how it relates to fabrication, to chips and so on? Yeah, actually a great question, right? It's almost like people treat it as a given when they say we're going to do export controls, but what are you export controlling? There's a whole bunch of different things. If you go through the supply chain, you can make sense of it a bit more.

The first is, hey, let's prevent China from getting their hands on these EUV lithography machines. They can't build them domestically; they don't have a Carl Zeiss, they don't have an ASML. So we can presumably cut them off from that, and hopefully that just makes it really hard for them to, yeah, build up their own domestic photolithography industry.

Secondly, as a sort of defense in depth strategy, maybe we can also try to block them off from accessing TSMC's outputs. In other words, prevent them from designing a chip and then sending it off to TSMC for fabrication, because right now that's what happens in the West. NVIDIA, say, designs a new chip, they send the design to TSMC, TSMC fabs the chip, it gets packaged, and then they ship it off.

But what you could try to do is prevent China from accessing, essentially, TSMC's outputs. Historically, China's been able to enjoy access both to whatever machines ASML has pumped out and to whatever TSMC can do with those machines. So they could just send a design to TSMC, have it fabbed, and there you go.

But in the last couple of years, as export controls have come in, gradually the door has been closed on accessing frontier chips, and then increasingly on photolithography, such that, again, there's not a single EUV machine in China right now. By the way, these EUV machines also need to be constantly maintained.

So even if there were an EUV machine in China, one strategy you could use is just to make it illegal to send the repair crews, the 20 or so people who are needed to keep it up and running, to China. Presumably that would at least make that one machine less valuable. They could still try to reverse engineer it and all that, but the fabrication know-how is part of the magic. So those two layers are pretty standard.

And then you can also prevent companies in China from just buying the finished product, the NVIDIA GPU, for example, or the server, right? So these three layers are being targeted by export control measures. Those are maybe the three main ones that people think about: photolithography machines, TSMC's chip fab output, and then the final product from companies like, say, NVIDIA.

The interesting thing, by the way, that you're starting to see, and this bears mentioning here too, is that NVIDIA used to be really the only designer for frontier, cutting edge GPUs. What you're starting to see increasingly is that different AI companies, like Anthropic, like OpenAI, are starting to bet big on different architectures and training strategies.

Their need for specialized AI hardware is starting to evolve, such that when you look at the kinds of servers that Anthropic is going to be using, you're seeing a much more GPU-heavy set of servers than the ones that OpenAI is looking at, which veer more towards something like a two-to-one GPU-to-CPU ratio. And that's for interesting reasons that have to do with, well, maybe we need more verifiers.

We want to lean into using verifiers to validate certain outputs of chains of thought and things like that, and if we do that we're going to be more CPU heavy, and so on. So you're starting to see the need for custom ASICs, custom chips, develop with these frontier labs. Increasingly, OpenAI is developing their own chip, and obviously Microsoft has its own chip lines, and Amazon has its own chip line that they're developing with Anthropic, and so on.

And so we're going to see increasingly bespoke hardware, and that's going to result in firms like Broadcom being brought in. Broadcom specializes in basically saying, hey, you have a need for a specific new kind of chip architecture, we'll help you design it, we'll be your NVIDIA for the purpose of this chip. That's how Google got their TPU off the ground back in the day.

And it's now how OpenAI, apparently, reportedly, we talked about this last week, is building their own new generation of custom chip. So Broadcom likes to partner with folks like that, and then they'll of course ship that design out to TSMC for fabrication on whatever node they choose for that design. So anyway, that's the big design ecosystem in a nutshell. Yeah. And yet another fun, well, I guess, interesting historical detail.

I don't know if it's fun. TSMC is unique, or was unique when it was starting out, as a company that just provided fabrication. So a company could design a chip and then just ask TSMC to fabricate it, and TSMC promised not to then use your design to make a competing product. Prior to TSMC, you had companies like Intel that had fabrication technology, but Intel was making money from selling chips, from CPUs and so on, right?

TSMC's core business was taking designs from other people, fabricating them, getting them to you, and nothing else: we're not going to, you know, make GPUs or whatever. And that is why NVIDIA could even go to them, right? NVIDIA couldn't really ask a potential competitor, let's say AMD, to make their chips. These days AMD itself is fabless, doing its design in house and then contracting TSMC to make the chips.

And as you often find out, TSMC has limited capacity for who it can make chips for. So, you know, you might want to start a competitor, but you can't just call TSMC and be like, hey, can you make some chips for me? It's not that simple. And one of the advantages of NVIDIA is this very, very well established relationship going back to even the beginnings of NVIDIA, right? They very fortuitously struck a deal very early on.

That's how they got off the ground, by getting TSMC to be their fabrication partner. So they have a very deep, close relationship and a pretty significant advantage because of that. Yeah, absolutely. Actually, great point to call that out, right? TSMC is famous for being the first, as it's known, pure play foundry, right? That's kind of the term. You'll also hear about fabless chip designers, right? That's the other side of the coin, like NVIDIA.

NVIDIA doesn't fab, they design. They're a fabless designer. Whereas, yeah, TSMC is a pure play foundry, so they just fab. It kind of makes sense when you look at the insane capital expenditures and the risks involved in this stuff. You just can't focus on both things. And the classic example, to your point, is that NVIDIA can't go to AMD. AMD is fabless, but Intel isn't, and Intel tries to fab for other companies.

And that always creates this tension where, yeah, of course, NVIDIA is going to look at Intel and be like, fuck you guys, you're coming out with, you know, whatever it is, Arrow Lake or a bunch of AI optimized designs, and those ultimately are meant to compete with us on design. So of course we're not going to give you our fab business, we're going to go to our partners at TSMC. So it's almost like the economy wants these things to be separate.

And you're increasingly seeing that this is the standard state of play. GlobalFoundries is a pure play fab. SMIC is a pure play foundry. And the Huawei-SMIC partnership is kind of like the NVIDIA-TSMC partnership, where Huawei does the design and SMIC does the fabbing. All this stuff is so deep and complex, and there are webs of relationships that are crazy. And the technology, the number of steps to get from a design to an actual chip.

We haven't even fully gotten into packaging, or, well, we touched on it. We touched on it, but yeah, and then there's, you know, constructing the motherboard, which is a whole other step. So anyway, it's pretty fascinating, and I think we might have to call it with that level of detail, but hopefully we've provided a pretty good overview of

kind of a history of hardware and AI, the current state of it, and why it's such an important part of the equation, such a pivotal aspect of who gets to win, who gets to dominate in AI, and why everyone wants to build massive data centers and get, you know, a hundred thousand GPUs. The only way to scale is via more chips and more compute, and that's just the game that's being played out right now. Well, hopefully

you enjoyed this very detailed episode on just this one topic. We haven't done this kind of episode in a while, and it was a lot of fun for us, so do let us know. You can comment on YouTube, on Substack, or leave a review. We'd love to hear if you would want more of these specialized episodes. We have, you know, touched on quite a few topics we could do. We could talk about projections for AGI, robotics, which is really interesting, agentic systems, like a thousand things.

So please do comment if you found this interesting or you have other things you'd like us to talk about.
