
20VC: Why Google Will Win the AI Arms Race & OpenAI Will Not | NVIDIA vs AMD: Who Wins and Why | The Future of Inference vs Training | The Economics of Compute & Why To Win You Must Have Product, Data & Compute with Steeve Morin @ ZML

Feb 24, 2025 · 1 hr 13 min

Summary

Steeve Morin discusses the future of AI inference, challenges in AI hardware, and the economics of AI compute. He explores the differences between training and inference infrastructure needs, the roles of NVIDIA and AMD, and the importance of product, data, and compute. Morin also touches on scaling laws, model efficiency, and the potential of Retrieval Augmented Generation (RAG).

Episode description

Steeve Morin is the Founder & CEO @ ZML, a next-generation inference engine enabling peak performance on a wide range of chips. Prior to founding ZML, Steeve was the VP Engineering at Zenly for 7 years leading eng to millions of users and an acquisition by Snap. 

In Today’s Episode We Discuss:

04:17 How Will Inference Change and Evolve Over the Next 5 Years

09:17 Challenges and Innovations in AI Hardware

15:38 The Economics of AI Compute

18:01 Training vs. Inference: Infrastructure Needs

25:08 The Future of AI Chips and Market Dynamics

34:43 Nvidia's Market Position and Competitors

38:18 Challenges of Incremental Gains in the Market

39:12 The Zero Buy-In Strategy

39:34 Switching Between Compute Providers

40:40 The Importance of a Top-Down Strategy for Microsoft and Google

41:42 Microsoft's Strategy with AMD

45:50 Data Center Investments and Training

46:40 How to Succeed in AI: The Triangle of Products, Data, and Compute

48:25 Scaling Laws and Model Efficiency

49:52 Future of AI Models and Architectures

57:08 Retrieval Augmented Generation (RAG)

01:00:52 Why OpenAI’s Position is Not as Strong as People Think

01:06:47 Challenges in AI Hardware Supply

 

Transcript

The thing with NVIDIA is that they spend a lot of energy making you care about stuff you shouldn't care about. And they were very successful. Like, who gives a sh** about CUDA? OpenAI is amazing, but it's not their compute. Ultimately, if you don't own your compute, you're starting with something at your ankle.

In five years, I would say 95% inference, 5% training. You have the products, the data, and the compute. Who has all three? Google has like Android, Google Docs. They have everything they can sprinkle everywhere. This is the sleeping giant in my mind. This is 20VC with me, Harry Stebbings, and our show with Jonathan Ross at Groq went so well last week, but...

I had so many more questions on two things, the future of chips and the future of inference. So today we dig deep on both, and there's no one better to join me than Steeve Morin. Steeve is the founder of ZML, a next-generation inference engine enabling peak performance on a wide range of chips. Literally the perfect speaker for this topic, and this was a super nerdy show. It was probably the most information-dense episode we've done in a long time. So do slow it down,

pause it, get a notebook out, but wow, there is so much gold in this one. But before we dive in today: turning your back-of-a-napkin idea into a billion-dollar startup requires countless hours of collaboration and teamwork. It can be really difficult to build a team that's aligned on everything from values to workflow, but that's exactly what Coda was made to do.

Coda is an all-in-one collaborative workspace that started as a napkin sketch. Now, just five years since launching in beta, Coda has helped 50,000 teams all over the world get on the same page. Now, at 20VC, we've used Coda to bring structure to our content planning and episode prep, and it's made a huge difference. Instead of bouncing between different tools, we can keep everything from guest research to scheduling and notes all in one place, which saves us so much time.

With Coda you get the flexibility of docs, the structure of spreadsheets, and the power of applications, all built for enterprise. And it's got the intelligence of AI, which makes it even more awesome. If you're a startup team looking to increase alignment and agility, Coda can help you move from planning to execution in record time. To try it for yourself, go to coda.io slash 20VC

today and get six free months of the team plan for startups. That's coda.io slash 20VC to get started for free and get six free months of the team plan. Now that your team is aligned and collaborating, let's tackle those messy expense reports. You know, those receipts that seem to multiply like rabbits in your wallet, the endless email chains asking, can you approve this?

Don't even get me started on the month-end panic when you realize you have to reconcile it all. Well, Pleo offers smart company cards, physical, virtual, and vendor-specific, so teams can buy what they need, while finance stays in control. Automate your expense reports,

process invoices seamlessly, and manage reimbursements effortlessly, all in one platform. With integrations to tools like Xero, QuickBooks, and NetSuite, Pleo fits right into your workflow, saving time and giving you full visibility over everything: every entity, payment, and subscription. Join over 37,000 companies already using Pleo to streamline their finances. Try Pleo today. It's like magic, but with fewer rabbits. Find out more at pleo.io forward slash

20VC. And don't forget to revolutionize how your team works together: Roam. A company of tomorrow runs at hyper speed with quick drop-in meetings. A company of tomorrow is globally distributed and fully digitized. The company of tomorrow instantly connects human and AI workers. A company of tomorrow is in a Roam virtual office. See a visualization of your whole company, the live presence, the drop-in meetings, the AI summaries, the chats. It's an incredible view to see.

Roam is a breakthrough workplace experience loved by over 500 companies of tomorrow for a fraction of the cost of Zoom and Slack. Visit Roam, that's R-O dot A-M, for an instant demo of Roam today. Nobody knows what the future holds. But I do know this, it's going to be built in a Roam virtual office, hopefully by you. That's Roam, RO.AM, for an instant demo. You have now arrived at your destination.

Steeve, dude, I am so grateful to you for joining me today. I've wanted to make this one happen for a while, but when we were discussing who would be best for this topic, I was like, we've got to have Steeve on. So thank you for joining me today, man. Well, thank you. I feel humbled. I appreciate it. Thank you. Dude, I want to start. Can you just give us a quick overview of ZML and specifically your role in the infrastructure strategy today and where you sit?

So at the very bottom of things, ZML is an ML framework that runs any models on any hardware. We sit ultimately at the infrastructure layer. We enable anybody to run their model better, faster, more reliably, but on any compute whatsoever.

It doesn't really matter. It could be NVIDIA, it can be AMD, it could be TPU and whatnot, and we do all that without compromise. That's the key point, because if there's a compromise, then it's not really, you know, agnostic, right? Can I ask you then, if we think about sitting between any model and any provider there, in terms of AMD and NVIDIA, do you think then we will be existing in a world where people are using multiple models simultaneously and that is

concurrently running? Yes, you actually can see it. It's been happening for a while. Models now are not the right abstraction, at least. If you look at closed-source models, they're not really models. They're more like back-ends. And there are a lot of tricks that make you feel like you're talking to one model, but ultimately you're talking to a constellation, an assembly of back-ends that produces a response.

Probably the number one, you know, I would say obvious thing would be that if you ask a model to generate an image, then it will, you know, switch to a diffusion model, right? Not an LLM. And there's many, many more tricks. The Turbo models at OpenAI do that. There's a lot of tricks. So definitely models, in the sense of getting weights and running them,

is something that is ultimately going away, you know, in favor of full-blown back-ends, right? You feel like you're talking to a model, but ultimately you're talking to an API. The thing is, that API will be running locally, in your own cloud instances and so on. So we will have a world where we're switching between models and there's kind of this trickery around them. Okay.

Perfect. So we've got that at the top, then we've got ZML in the middle, and then you said, and then on any hardware. So will we be using multiple hardware providers at the same time, or will we be more rigid in our hardware usage? No, absolutely. You can get, like, probably an order of magnitude more efficiency, depending on the hardware you run on. That is substantial.

Not a lot of people have that problem at the moment. Things are getting built as we speak. But a simple example is if you switch from NVIDIA to AMD on a 70B model, you can get four times better efficiency in terms of spend, right? So that is substantial, that is very much substantial. Now the problem is getting some AMD GPUs, right? I'm really sorry, if there is such a cost efficiency, four times, why does everyone not do that?

So there's a few reasons. Probably the most important one is the PyTorch-CUDA, I would say, duo. And that's very, very hard to break. These two are very much intertwined. Can you just explain to us what PyTorch and CUDA are? Oh, yes, absolutely. PyTorch is the ML framework that people use to build, actually train, models, right? You can do inference with it, but by far the most successful framework for training

is PyTorch, and PyTorch was very much built on top of CUDA, which is NVIDIA software, right? Let's just say the strings of PyTorch make it ultimately very, very bound to CUDA. So, of course, it runs on, you know, it runs on AMD, it runs on, you know, even Apple and so on. But there was always, you know, the tens of little details that...

do not exactly run like you would expect, and there's work involved. But then also there's supply. So probably that's the number one thing. The second thing is there's a lot of GPUs on the market. Pretty much all of them are NVIDIA. The reason being that if you think, you know, in layers and you say, all right, I'm going to buy, let's say, GPUs and I'm going to sell them to folks to maybe not even do training, right? Just do inference.

Then most likely, if you look at it that way, you'll end up buying NVIDIA, because everybody will want to run on NVIDIA, because nobody really knows how to do otherwise, and they've trained on NVIDIA, so they're like, I can just reuse my code and so on. So there's this self-perpetuating circle of people who buy NVIDIA because they want to resell, and people who use NVIDIA because it's there, right?

But it's by far not the most efficient platform. And arguably, even in terms of software, it's not the best software platform. So those are probably, I'd wager, the two most important reasons. Can I ask, before... you know, we were chatting about NVIDIA and AMD when DeepSeek obviously happened and the stock crash that happened. Why did NVIDIA rebound, do you think, in a way that AMD didn't? Because the chips are there.

There's a lot of things, but in my opinion, there's going to be a need for inference. Very hard to say whether it will be worth everybody's money to do it on H100,

a bubble that I think will blow some time. I'm kind of afraid of that, to be honest. Why do you think that's a bubble that will blow some time? Why is that not legitimate? Because it was built on the A100, I would say, financial model, which was: at generation zero, we do training, but when it's last generation, we do inference. And it worked beautifully,

right, for the A100. Then H100 comes along, and inference... it's worth five times the price, and it maybe runs twice as fast in terms of performance on inference, that is. On training it's a lot better, but on inference it's like maybe twice as fast. When it actually came out, it ran at the same speed as the A100.

So there's a money gap that's going to have to, you know, be bridged sometime, right? And the part that worries me is that I see, you know, amortization plans in like, you know, six, seven years, with the GPUs as the collateral. And I'm like, well, I'm not sure how it's going to work, because at least when they came out, they were worth five times the price and they're just two times faster.

Something has got to give. Is speed of model development trumping chip development speed, where it's now becoming a real problem where, as we say, models are far outpacing the speed of chip deployment? Not much, ultimately. The two things that could very much shake the chip industry, in my opinion, are agents and reasoning. Number one, agents. Why does that change the chip industry?

I think this is where NVIDIA can be attacked. I mean, why agents and why reasoning? The difference is, for agents and reasoning, you need to wait until the end of the request to get whatever it is you came for. You don't really care about the speed at which the text outputs, which is what you want in a chat, right? You only care about

how much time does it take between the beginning of my request and the end? And so that fundamentally changes the incentives from throughput-bound to latency-bound. And so GPUs... let's say you're running a GPU at, let's say, 10,000 tokens per second. You'd very much like to do it as, you know, 100 times 100, right? And they can do that, but they cannot give you 10,000 tokens per second only for you. Per stream, as we say. But in terms of agents or reasoning, this is exactly what you want.
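As a rough back-of-the-envelope illustration of that shift from throughput-bound to latency-bound (the 10,000 tokens per second and the 100 concurrent streams are the figures from the conversation; the 5,000-token reasoning chain is an assumed number, purely for illustration):

```python
# Illustrative arithmetic only, not a benchmark.
aggregate_tps = 10_000        # total tokens/sec the GPU sustains across all streams
concurrent_streams = 100      # batched requests served at once ("100 times 100")
per_stream_tps = aggregate_tps / concurrent_streams   # ~100 tokens/sec per user

reasoning_tokens = 5_000      # assumed hidden "thinking" tokens for one agent/reasoning step
batched_wait = reasoning_tokens / per_stream_tps       # ~50 s before the user sees anything
single_stream_wait = reasoning_tokens / aggregate_tps  # ~0.5 s if one stream got the full rate

print(f"per-stream rate: {per_stream_tps:.0f} tok/s -> ~{batched_wait:.0f}s wait")
print(f"full rate on one stream -> ~{single_stream_wait:.1f}s wait")
```

For a chat, the 100 tokens per second per stream is plenty; for an agent or a reasoning chain, the user pays the whole 50-second wait up front, which is why the incentive flips to single-stream latency.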

Because you don't want to wait like, you know, 50 seconds for whatever thinking, right? And agents, it's the same. So these two, I think, are the shock that might make NVIDIA change its course with respect to chips. I mean, they're not idiots, right? How should agents change NVIDIA's strategy? Hard to say, because NVIDIA has a very, very vertical approach. They do more of more, right? Like if you look at Blackwell, it's actually crazy what they did for Blackwell. They assembled two chips.

But the surface was so big that the chip started to bend a bit, which further perpetuated the problem because it then didn't make contact with the heat sink and so on. So they are very much... and, you know, the power envelope, they push it to a thousand watts. It requires liquid cooling and so on. So they are very much in a very vertical, foot-to-the-pedal approach in terms of GPU scaling. But the thing is, GPUs are a good trick for AI, but they're not built for AI. It's not

a specialized chip. It is a specialization of a GPU, but it is not an AI chip. Forgive me for continuously asking stupid questions. Why are GPUs not built for AI? And if not, what is better? So the way it worked is that you can think of a screen as a matrix. And if you have to render pixels on a screen, there's a lot of pixels and everything has to happen in parallel, right? So that you don't waste time. Turns out, you know, matrices...

are a very important thing in AI. So there was this cool trick, back then, that was probably 20 years ago, where we would trick the GPU into believing it was doing graphics rendering when actually we were making it do parallel work, right? It was called GPGPU at the time, right? So it was always a cool trick, but it was not dedicated for this.

The pioneers probably were, of course, Google with TPU, which are very much more advanced on the architectural level. But essentially, the way they work, it kind of works for AI, but for LLMs that starts to, you know, crack, because they're so big and there's a lot of memory transfers and so on. Actually, that's why Groq achieves, not Grok, but Groq, Cerebras and all these folks, they achieve very high performance single stream.

It's because the data is right in the chip. They don't have to get it from memory, which is slow, which GPU has to do. So there's a lot of these things that... ultimately make it a good trick, but not a, I would say, dedicated solution per se. That said, though, the reason probably NVIDIA won, at least in the training space, is because of Mellanox.

Not because of the raw compute. Because you need to run lots of these GPUs in parallel. So the interconnect between them is ultimately what matters, right? How fast can they exchange data? Because remember, when you do a matrix multiplication, let's say, the matrix is read like hundreds of times during the multiplication.

So there's a lot of transfers going on. And so far, Mellanox with, you know, InfiniBand had the best technology. So that's why, you know, a lot of people... and when you do training, by the way, the interconnect is the name of the game. When you do inference, not so much. You don't care when you do inference. Before we move to inference, I do just want us to stay on chips and just say, okay, so we have TPUs, we have NVIDIA, we have AMD.

In terms of distribution of gains, is this a winner-take-all market? Is this like cloud, where you have several providers who are dominant? What does the distribution of gains look like in the chip market? So I would divide it in two categories. Well, three categories. The GPUs you can buy or rent. The TPUs you can rent.

And the dedicated chips you can buy. This is how the market is structured today, right? Right now, if you want to go dedicated, at least in the cloud, there's two options, TPUs and Trainium. TPUs on Google, Trainium on Amazon. So these are available chips. You can rent them today. If you want to buy GPUs or rent GPUs, well, they're GPUs, we

know them, we see them all the time. And there's this new wave of computing, which are dedicated chips you can actually buy: the Tenstorrent, the Etched, the VSORA. So I think it will be a mix of, you know... For instance, let's say you are in Google Cloud. Of course, you don't want to do NVIDIA. You get ripped off. Here's the dirty secret: TSMC sells to you at a 60% margin. NVIDIA sells to you at a 90% margin.

And on top of that, there's Amazon that takes, let's say, a 30% margin. So you are a very thin crust on a very big cake. It's a bit of a losing game if you go all in on one provider. You want optionality. With increasing competitiveness within each of those layers, do we not see margin reduction? Absolutely, yes. Yeah, yeah, yeah.
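As a rough illustration of that margin stacking (treating the quoted figures as gross margins on each layer's selling price, which is an assumption on my part, purely to show the "thin crust" effect):

```python
# How much of the end customer's dollar is left after each layer takes its margin.
def cost_share_of_spend(margins):
    """Fraction of spend remaining after stripping each layer's gross margin."""
    share = 1.0
    for m in margins:
        share *= (1.0 - m)
    return share

# Customer-facing layer first: cloud provider (~30%), NVIDIA (~90%), TSMC (~60%).
layers = [0.30, 0.90, 0.60]
print(f"~{cost_share_of_spend(layers):.1%} of each dollar reaches the underlying silicon cost")
# Roughly 2.8 cents on the dollar: everyone above you in the stack is eating the cake.
```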

Here's the problem, though. Let's say you are on Google Cloud and you run TPUs. Suddenly, you just remove that 90% chunk from the spend. The problem is that, for multiple software reasons, which we are solving at ZML, they're not really, I would say, a commercial success. They are very much successful inside of Google, but not much outside of Google. Amazon, same, is pushing very, very hard for their, you know, Trainium chips. So...

The future I see is that you use whatever your provider has, because you don't want to pay a 90% outrageous margin and try to make a profit out of that. Okay, so when we move to actually inference and training, everyone's focused so much on training. I'd love to understand, what are the fundamental differences in infrastructure needs when we think about training versus inference? So

These two obey fundamentally different, I would say, tectonic forces. So in training, more is better. You want more of everything, essentially. And the recipe for success is the speed of iteration. You change stuff, you see how it works and you do it again. Hopefully it converges. And it's like, you know, changing the wheel of a moving car, so to speak. So that is training. On inference, this is a complete reverse.

Less is better. You want less headaches. You don't want to be waking up at night, because inference is production. You could say that training is research and inference is production, and it's fundamentally different. In terms of infra, probably the number one thing that is the number one difference between these two is the need for interconnect.

If you do production, if you can avoid having interconnect between, let's say, a cluster of GPUs, of course you will avoid that if you can. And this is why models have the sizes they have. It's so that people can run them without the need to connect multiple machines together. It's very constraining in terms of the environment.

That is probably the fundamental difference, the need for interconnect. And number two is, ultimately, do you really care about what your model is running on as long as it's outputting whatever you want it to output? Can you just help me understand, sorry, why is training "more is more", and that's great, and in inference less is more? Why do we have that difference? Think of it like doing a painting and doing a million paintings.

The tools you will use, the process you will follow... If you do one painting, what you favor is the speed at which you can do a stroke and do some iteration. If you do a million, what you want is a process, a process that is reliable, that can deliver you paintings efficiently. So that is the same for training versus inference. If you run, you know, millions of instances of a model, you cannot, you know, hack your way to do that. By the way,

people do hack their way today, but this is probably the fundamental difference. How do people then put inference in production today? You know, we've seen with training, that's really where NVIDIA have dominated so heavily. How do people put inference in production? There's a lot of duct tape. Here's also probably one of the problems: training, on first principles, is actually two passes, forward and backward, right? It's called

forward pass and backward pass. Inference is running only the forward pass. So that's how things are today. There are people who are trying to specialize a bit. Because at some point, duct tape doesn't really work out. And when you're on big scales, that makes a problem. And it's a problem that's growing because a lot of people are coming on the market with the needs for inference.

That wasn't the case a year and a half ago or a year ago. OpenAI had this problem, right? Maybe Anthropic had this problem. But it wasn't a universal problem yet. And now it's becoming a universal problem. Can you articulate what problem did OpenAI and Anthropic have with regards to inference? So, for instance, probably the number one thing,

depending on how you deploy. But if you're deploying inference, the number one thing that will get you is what's called autoscaling. So as your systems get more and more loaded, you want to provision, because these things are tremendously expensive. You want to provision them as you scale, right? So you want to say, I have a thousand GPUs,

you know, 24 hours, even if there's, like, nobody on production, I will pay for them. Which is, mind you, what people are doing today. This is crazy. So what you want to do is you want to, you know,

provision compute as you grow your needs, right? And you want to do it up and you want to do it down. That's probably the number one thing that, you know, gives you a lot of efficiency in terms of spend. Like, we're talking multiples, you know, five, sometimes 10x improvement. The thing is, at least in, I would say, regular backend engineering, this is a problem everybody knows,

right? Everybody's doing it because the savings are so huge, but in AI, nobody really had the problem. So now they're coming up to it. So this is one example. So the problem is that they're not doing provisioning, they're paying a shit ton more because they are fully in production all the time, versus provisioning as needed? That's one example, yeah. Another one is choosing the right compute.
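Before moving on, here is a minimal sketch of that autoscaling idea, sizing replicas from observed load instead of keeping the whole fleet on 24/7; the function name, headroom factor, and numbers are illustrative assumptions, not anyone's actual system:

```python
import math

def desired_replicas(requests_per_sec: float,
                     tokens_per_request: int,
                     tokens_per_sec_per_replica: float,
                     headroom: float = 1.2,
                     min_replicas: int = 1,
                     max_replicas: int = 1000) -> int:
    """How many model replicas are needed to serve the current load, with some headroom."""
    required_tps = requests_per_sec * tokens_per_request * headroom
    n = math.ceil(required_tps / tokens_per_sec_per_replica)
    return max(min_replicas, min(max_replicas, n))

# Quiet overnight traffic vs. a daytime peak (made-up numbers):
print(desired_replicas(2, 500, 5_000))     # -> 1 replica
print(desired_replicas(400, 500, 5_000))   # -> 48 replicas
```

The gap between those two numbers is the spend you eat if the fleet is provisioned for peak around the clock.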

It's, like, kind of, I would say, a vicious circle, because provisioning compute is very hard. So if you lose compute, it's very bad. You are essentially incentivized to overbuy. In the case of Amazon or Google, that would be buying reserved compute, which you're not going to use, because if you buy it on demand, you will get tremendously ripped off. So that creates this fake scarcity of compute, because people buy preemptively, spend a shit ton of money, and they're not using it.

So this is a major problem too. When you buy compute preemptively, does it not become outdated by the time you use it, though? It might well be, yes. We are being spared a bit because Blackwell is late and orders are getting canceled. And so the H series, I would say, are still, you know, active. But yes, absolutely. But, you know, what choice do you have?

Will we have a moment in time where there is this massive overhang or oversupply of compute, which we've proactively bought ahead of time, but then actually the hyperscalers go, we'd rather just burn it and buy fresh, and we have the money to do that? So I might tell you that I think it already started. I'm getting cold emails for, you know, discounts, you know, from services I never heard about.

And I started getting these emails probably around October, November. Some people are left with a lot of capex that they don't know what to do with. You know, it's a different thing to build a cluster and run a training and do a training run than it is to build literally a cloud provider or hyperscaler or whatever you want to call it.

There are a lot of people who do their training runs on the regular providers, but then move to regular hyperscaler when they do production. So I very much worry there will be an oversupply of these chips. The problem is that, you know, remember... The chips are the collateral. So, you know, somewhere, you know, in the US or whatever, there's going to be a data center with like a thousand GPUs that people may buy, you know, 30 cents on the dollar. You know, this is what might happen.

What is the timeframe for that maybe happening? Probably this year. Jensen has made it very clear that inference opens up more revenue opportunity for NVIDIA. He said that 40% of their revenues today come from inference. Right. To what extent is that correct, or actually, as Jonathan at Groq said on the show, NVIDIA is not meant for inference, definitely not, and actually that market won't be won by NVIDIA?

Technically speaking, he's right. But realistically speaking, I'm not sure I agree. The thing is, these chips are on the market. They're here. Alt-tab to Chrome and get one. That is something that I don't take lightly. Availability, that is, right? I think NVIDIA is here to stay, at least if not for the H100 bubble bust.

Because these chips are going to be on the market and people will buy them and do inference with them. It remains to be seen, you know, the OPEX and the electricity, etc. But the thing is, the only chips that are really, you know, frontier in that sense are probably TPUs and then the upcoming chips. But the thing is, they're great chips, but they're not on the market, or they're at outrageous prices, like millions of dollars to run a model. So what chips are great and why aren't they on the market?

Let's say, for instance, Cerebras: incredible technology, incredibly expensive. So how will the market value the premium of having single-stream, very high tokens per second? There is a value in that, right? As we saw with Mistral and Perplexity. But I think that was done at a loss. I don't know. I don't have the details. But I think it was done at a loss when Cerebras, you know, put it out.

Today, there's three actors on the market that can deliver this. I think this will be, I would say, the pushing force for change in the inference landscape: agents and reasoning. So that is, you know, very high tokens per second, only for you. What is forcing the price of a Cerebras to be so high? And then you heard Jonathan at Groq on the show say that, hey, they're 80% cheaper than NVIDIA.

So there's this trick. Because here's the thing, there's no magic. This little trick is called SRAM. SRAM is memory on the chip directly. So that is very, very fast memory. But here's the problem with SRAM: it consumes, you know, surface on the chip, which makes it a bigger chip, which is very hard in terms of yield, right? Because the chances of problems are higher and so on. So SRAM is, I would say,

very, very, very fast memory, which gives you a lot of advantage when you do very, very high inference. But it's terribly expensive. And if you look at, for instance, Groq, on their generation, this generation, they have 230 megabytes of SRAM per chip. A 70B model is 140 gigabytes. So you do the math, right? Cerebras has 44 gigabytes of SRAM in what they call their wafer-scale engine, which is a chip the size of a wafer.
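Doing that math with the figures just quoted (rough capacity counting only, ignoring KV cache, activations, and any redundancy; the point is just the order of magnitude):

```python
model_bytes = 70e9 * 2        # ~140 GB of weights for a 70B model at FP16
groq_sram = 230e6             # ~230 MB of SRAM per Groq chip (figure quoted above)
cerebras_sram = 44e9          # ~44 GB of SRAM per wafer-scale engine (figure quoted above)

print(f"Groq chips just to hold the weights: ~{model_bytes / groq_sram:.0f}")        # ~609
print(f"Cerebras wafers just to hold the weights: ~{model_bytes / cerebras_sram:.1f}")  # ~3.2
```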

I mean, most likely it's interconnected, but it's huge, right? And it has to be water-cooled. They have copper, you know, I would say, needles that touch the chip. It's crazy stuff. Very, very impressive technology, mind you, but very, very expensive. So my bet is I think there will be chips on the market that do that at a much lower price. And there's two companies I see going in that direction. One is called Etched and the other one is called VSORA.

Those are the two I see, because if you can deliver this at, I would say, a price that is comparable to GPUs, you've won. Is minimizing SRAM the only way to reduce unit cost on these chips, really? It's hard to say. I mean, you need some SRAM, but if you can have a smaller process node, and if you can hook yourself up with external memory, then yes, you can do

a lot better. But the thing is, if you go, like, full-blown SRAM, then, you know, there's no magic, you will have to pay the price. I'm so enjoying this. I'm also learning. My notes here are just expanding by the day. If that's today, how do you think the inference market evolves over the next three to five years? Pushed by reasoning. So,

reasoning not in the sense that you see on DeepSeek and whatever, right? Reasoning in what's called latent space reasoning. Latent space reasoning and agents will push the market towards different types of compute. Can I just ask, what's latent space reasoning? So the way models reason today is they reason in tokens. So it's as if, when you think to yourself, you would, you know, say out loud what you're thinking.

So yes, it works, but it is a bit inefficient, right? And you lose information doing this. Latent space reasoning is this without going, I would say, to English or whatever, right? So staying in what's called the latent space, which is where all the information of an LLM, let's say an LLM, lives, right? So this is very much how we, you know, work as humans. And we move toward what Yann LeCun calls an energy-based model, in which

we have different types of longer or shorter, I would say, thinking times, if you will, right? So that, fundamentally, GPUs cannot deliver this, plain and simple, at scale. Why can't GPUs deliver it? Because the access to external memory prevents it. So HBM is all the rage, right? But HBM compared to SRAM is absolutely, you know, dog slow. So this is the problem you get. So HBM is like the best we can do, but it's still slow versus SRAM.
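To make the token-space versus latent-space distinction concrete, here is a toy, purely conceptual sketch; the GRU cell, the linear head, and the loop counts are stand-ins I'm assuming for illustration, not how any production model implements reasoning:

```python
import torch
import torch.nn as nn

d, vocab = 64, 1000
step = nn.GRUCell(d, d)           # stand-in for one reasoning step in latent space
to_tokens = nn.Linear(d, vocab)   # stand-in decoder head
embed = nn.Embedding(vocab, d)    # stand-in token embedding

state = torch.zeros(1, d)

# Token-space reasoning: every intermediate thought is pushed through the vocabulary
# and re-embedded, paying decode cost (and losing information) at each step.
for _ in range(8):
    tok = to_tokens(state).argmax(dim=-1)   # "say the thought out loud" as a token
    state = step(embed(tok), state)

# Latent-space reasoning: iterate directly on the hidden state, decode only once at the end.
for _ in range(8):
    state = step(state, state)
answer = to_tokens(state).argmax(dim=-1)
```

The second loop never leaves the hidden representation, which is the "staying in latent space" idea described above; the hardware point is that every one of those steps still hammers memory, which is where HBM versus SRAM comes in.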

So when I had Jonathan on, he was like, actually, NVIDIA have such a stronghold because they're one of the only buyers of HBM, and that gives them this unique position. Actually, is being a sole buyer of HBM irrelevant if the world needs SRAM instead? No, you want HBM, to be clear. No, SRAM, this will not deliver.

It's a dead end in terms of scaling. SRAM means growing the surface, which means you get, you know, yield problems, it explodes everywhere, right? So you need some SRAM, right? So we, you know, we'll have

bigger amounts of SRAM in chips and, of course, bigger what's called external memory in chips. The issue with HBM is that it's still slow, and yes, maybe NVIDIA has a stronghold and they can prevent you from getting some. So that would be, like, I call it the Nutella situation, in which, you know, Nutella owns 80% of the hazelnut market, right? So yes, you

can do a competitor, but who will you buy the nuts from, right? So there will be a need for HBM, there will be a need for SRAM. Better, more dedicated architectures, I would say, will be able to deliver these things. And then there's, like, the next frontier after that, which is called compute-in-memory. There's two companies that are on that market. One is called Rain, rain.ai. Sam Altman is one of the investors.

There's no surprise. The other one is called Fractile. So this is the next frontier. And the idea is that instead of transferring the data between external memory and the compute and doing the compute there, you actually, you know, bring the compute to the memory and you do everything there. It's crazy stuff, but it's coming, maybe not this year. But how does that change the situation? It makes it much more efficient. But what does that actually mean in reality?

It means you get maybe not SRAM-level performance, but you get a lot faster performance in terms of compute. And if you translate that to LLMs, let's say, you get much, much higher tokens per second in a single stream, which is exactly what you want when you go into reasoning. You want your model to maybe think, let's say, for like half a second and then boom. You don't want to wait 50 seconds and, you know, context-switch to some other thing, which is the problem everybody has today, mind you.

So, yeah, I think inference will be pushed, the compute landscape will be pushed to change, because of these two constraints. And, you know, I'm working on it. If you were to ascribe value between training and inference out of a pool of 100, is it 80 inference, 20 training? What does that look like in five years? I would say 95% inference with 5% training. Do you think NVIDIA owns both of those markets in five years' time?

Depends on the supply. I think that there's a shot that they don't. Because here's the thing, you know, even if we take, you know, the same amount... let's imagine we have a new chip from Amazon, right? That is the same amount. Oh, wait, we do. It's called Trainium. You know, why would I pay a 90% margin to NVIDIA if I can freely change to Trainium? My whole production runs on AWS anyway.

Like, if you run on the cloud and you're running on NVIDIA, you're getting, you know, squeezed out of your money, right? So if you're in production on dedicated chips... Of course, you know, so maybe through commoditization, but, you know, hey, I'm on AWS, I can just click and boom, it runs on AWS's chips. Who cares, right? I just

run my model like I did, you know, two minutes ago. With that realization, do you think we'll see NVIDIA move up the stack and also move into the cloud and model? They are. They have a protocol that sort of does that. The thing with NVIDIA is that they spend a lot of energy making you care about stuff you shouldn't care about. And they were very successful. Like, who gives a shit about CUDA? I'm sorry, but I don't want to care about that, right? I want to do my stuff.

And NVIDIA got me into saying, hey, you should care about this because there's nothing else on the market. Well, that's not true. But ultimately, this is the GPU I have in my machine, so off I go. If tomorrow that changes, why would I pay 90% margin on my compute? That's insane. This is why I believe it ultimately goes through the software. This is my entry point to the ecosystem.

So if the software abstracts away those idiosyncrasies, as they do on CPUs, then the providers will compete on specs and not on fake moats or circumstantial moats. So this is where I think the market is going. And of course, there's the availability problem. If you piss off Jensen, you might need to kiss the ring to get back in line, right? I mean, ultimately, I don't see this as being sustainable.

When we chatted before, you said about AMD... and I said, hey, I bought NVIDIA and I bought AMD, and NVIDIA, thanks Jensen, I've made a ton of money, and AMD, I think I'm up one percent versus the 20 percent gain I've had. You said that AMD basically sold everything to Microsoft and Meta and had a GTM problem. Can you just unpack that for me? So all, I would say, chip makers have a GTM problem. All of them.

Whether, you know, it's Google, whether it's AMD, whether it's Tenstorrent. The problem is that there's, I would say, probably two fundamental problems. The number one is, maintaining multiple stacks today is very, very, very hard. So, let's say I buy, you know, AMD. I want to buy AMD, right? That means I'm going to abandon NVIDIA. Oh crap, you know, I have a six-year amortization plan on that. Oh man, what do I do?

So do I need to support both stacks? Unclear. Maybe until AMD tells me, hey, you know, you have, I don't know, let's say a thousand NVIDIA GPUs, you're about to buy a hundred thousand of AMD. I mean, come on, right? And I'm like, okay, that, you know, makes it worth my while, right? But that is ultimately the fundamental problem: the stakes are very high, right? I need to have a lot of incentives to buy into that ecosystem, so I need to buy a lot of them.

So if you're AMD, that is already a problem. But then Microsoft comes along and buys it all, which, by the way, puts OpenAI, or at least OpenAI's inference, in the green. Because of the efficiency gains. I'm just trying to understand. So are you saying the switching costs are really high from one provider to another? Oh, yeah, absolutely. Which is why you don't? Or are you saying that to get into one of these buy processes, you have to buy so much

that it prohibits you? It's actually both. So the buy-in is very high, so to make it worth it, you have to buy a lot. And if you buy a lot... This is, you know, we talk to all of them. They always have the same questions, and it's completely understandable. They say, this is great, but who's the customer? Because on the other side, let's take Amazon, for instance, with Trainium.

Apple just came and said, hey, we're going to buy 100,000 of them. So you want to buy 10,000, you feel like the big shot, right? Yeah, but go back to the queue because there's Apple before you, right? So they have to have very high commitments. You cannot be incrementally better. It's very hard, right? And also very hard, I can give you one metric if you want. I know for a fact that being seven times better, take whatever metric you want.

whether it's spend, whether it's whatever, it's not enough to get people to switch. People will choose nothing over something. So this is a very hard market to enter into, because you cannot compete on incremental gains either. It's very hard, right? So you have to convince a lot of people. Maybe you can go the, um, Middle East route, in which, you know, they sprinkle everything and they, you know, evaluate everything.

That's not, you know, a very sustainable, I would say, strategy in the long term. What is the right sustainable strategy then? You don't want to go so heavy that you can't ever get out and you have that switching cost. Right. But you also don't want to sprinkle it around and do, as you said. Absolutely. The right approach to me is making the buy-in zero. If the buy-in is zero, you don't worry about this. You just buy whatever is best today. How do you do that? By renting?

Oh, because this is what we do. This is our promise. Our thesis is that if the buy-in is zero, you know, you completely unlock that value, because you're free. When you say the buy-in is zero, what does that actually mean? It means that you can freely switch, you know, compute to compute. Like, freely, right? You just say, hey, now it's AMD, boom, it runs. You just say, oh, it's Tenstorrent, and boom, it runs, right? How do you do that then? Do you have

agreements with all the different providers? Oh yeah, yeah, yeah. Not agreements, but, like, we work with them to support their chips. But the thing is, at least as, you know, I would say, a user myself of our tech, if it's free for me to switch or to choose whichever provider I want in terms of compute, right, AMD, NVIDIA, whatever, then I can take whatever is best today.

And I can take whatever is best tomorrow, and I can run both. I can run three different platforms at the same time. I don't care. I only run, you know, what is good at the moment. And that unlocks, to me, a very cool thing, which is incremental improvement. If you are 30% better, I'll switch to you.

So are you taking the risk on that hardware then, if you're the one providing them? To turn off and on, on-demand provisioning, you name it, who takes the risk? This is actually a great question. I think that if you are doing it bottom-up, infra to applications, you will lose, because nobody will care, as they don't today, right? If you look at TPUs, they're available, they're great, nobody cares. Why does nobody care about TPUs, sorry? Because of the cost of buying in.

It's always the same, right? You have to spend six months of engineering to switch to TPUs. And mind you, TPUs do training. They're the only ones. Well, Trainium now. And AMD can do training, but it's also... But in terms of maturity, by far the most mature software and compute is TPUs, and then it's NVIDIA, right? So the buy-in is so high that people are like,

We'll see, right? I'm not on Google Cloud. I have to sign up. Oh my God, right? So these are tremendous chips. These are tremendous assets. Now, in terms of the risk, I think if you want to do it, you have to do it top to bottom. You have to start with whatever it is you're going to build and then permeate downwards into the infrastructure. Take for example Microsoft with OpenAI.

They just bought all of AMD's supply and they run, you know, ChatGPT on it. That's it. And that puts them in the green. That's actually what makes them, you know, profitable on inference, or at least, let's say, not lose money, right? I'm sorry, how does Microsoft buying all of AMD's supply make them not lose money on inference? Just help me understand that. Because I can give you, like, actual numbers. If you run eight H100s,

you can put two 70B models on them, because of the RAM, right? That's number one. Number two is, if you go from one GPU to two, you don't get twice the performance. Maybe you get 10% better performance. Yeah, that's the dirty secret nobody talks about. I'm talking inference, right? So you go from, let's say, 100 to 110 by doubling the amount of GPUs. That is insane. So you'd rather have two by one than one by two,

right? So with one machine of 8 H100s, you kind of run two 70B models, if you do, you know, four GPUs and four GPUs, right? That's number one. If you run on AMD... Well, there's enough memory inside the GPU to run one model per card. So you get eight GPUs, eight times the throughput, while on the other hand, you get eight GPUs, maybe two and a half times the throughput.

So that is, you know, a 4x right there. Just, you know, by virtue of this. So that is, you know, the compute part. But if you look at... all of these things, there are a tremendous amount of, you know, we talk to companies who have chips upcoming with almost 300 gigabytes of memory on it, right? So that is, you know, a model, like one chip per model. This is the best thing you want.
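A back-of-the-envelope version of those numbers (the per-card memory figures are assumptions, roughly 80 GB for an H100 and roughly 192 GB for an MI300X-class AMD part; the throughput multipliers are the ones quoted in the conversation, not benchmarks):

```python
H100_MEM_GB, AMD_MEM_GB, MODEL_GB = 80, 192, 140   # 70B model at FP16 ~ 140 GB
assert H100_MEM_GB < MODEL_GB < AMD_MEM_GB          # doesn't fit on one H100, fits on one AMD card

# NVIDIA 8-GPU node: the model is sharded across cards, and per the conversation
# eight GPUs buy you roughly 2.5x one GPU's throughput.
nvidia_node_throughput = 2.5

# AMD 8-GPU node: each card holds a whole replica, so throughput scales with card count.
amd_node_throughput = 8 * 1.0

print(f"AMD node vs NVIDIA node: ~{amd_node_throughput / nvidia_node_throughput:.1f}x")
# ~3.2x, the same ballpark as the "4x right there" above.
```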

That's if you're on 70Bs, right, which is, I would say, not the state of the art, but this is the regular stuff people will use for serving. So if you look, you know, top to bottom, and you know what you're going to build with them,

then it's a lot better to do the efficiency gains, because four times is a big deal, right? And mind you, these chips are 30% cheaper than NVIDIA's. It's like a no-brainer. But if you go bottom-up and say, I'm going to rent them out, people will not rent them. Simple. So that's why, you know, I think it's a good way to attack it from the software, because ultimately, do you really care whether your MacBook, let's say, is an M2 or an M3? It's the better one,

and that's it, right? And imagine if you had to care about these things. That would be insane. When I listen to you now, I'm like, shit, I should sell my NVIDIA and buy more AMD. If you were forced to buy one, I'm not saying sell the other, I'm not saying, like, short the other. If you had to

buy one, which would you buy and why? Stock? Yeah. I used to think the market was efficient. So probably I would go, today at least, I would go with NVIDIA still, because of the supply. But, you know, if we play our cards right, we ship our stuff, hopefully I will come back and tell you to buy AMD as much as you can. Or Tenstorrent, you know, if they go public, or whoever else. These chips are amazing, by the way.

What does everyone think they know about inference that they actually don't? Or what does everyone get wrong about inference? Probably not a lot of people are accustomed

to what it entails to run production. So that inference is production, and production is hard. Somebody has to wake up at night, and I used to be that guy, right? I don't want to do it again. So production is hard. Thankfully, we have a lot of software nowadays to do that a lot better, but there's not a lot of reuse, because the AI field, at least, is not really accustomed to that yet. It's changing, but, you know, the discussions I had,

you know, a year ago, and the discussions I have today are not the same. They're going in the right direction, but they're not exactly there yet. So probably that would be the number one thing: that it is only, you know, training code running only the forward pass, right?

This is not what it is. Can I ask, how do you evaluate the data center investments that we're seeing being made? When you look at Facebook doing 60 to 65, Microsoft doing 80, and some of the intense capex expenditure that you're seeing, how do you think about that on the data center side? I mean, they're still going after training. So there's still this frontier. Probably it's why also NVIDIA is the better buy right now.

Because on the NVIDIA side, if you do training, it's incremental. If you have bought a thousand NVIDIA GPUs and you buy a thousand new NVIDIA GPUs, that gives you 2,000 GPUs, right? But if you buy a thousand NVIDIA and a thousand AMD, that gives you twice a thousand, right? It's a bit different. So they're still going after training, definitely, and they're very pragmatic in doing so. But, I mean, they have the capex to spend. They're not making their money out of it, probably.

The only ones, by the way, that own their compute are Google. There's, like, this triangle of, I would say, winning. This is my mental model, mind you. You have the products, the data, and the compute. Who has all three? And everything flows from there. Product, data, compute. Who has all three? Google, Amazon. Amazon, they don't have products. They have Amazon, right? They have AWS, but they don't have actual products. Google, that's like, you know, Android,

Google Docs, whatever, they have everything, they can sprinkle it everywhere. This is the sleeping giant in my mind. If they're not busy doing a reorg, they might. It's fascinating, because everyone, if you're a shallow thinker, you think that OpenAI challenges their golden goose, which is search, and Google is threatened more than ever now.

OpenAI is amazing, but it's not their compute. It is Microsoft's compute. And if you own your compute, you own your margin, is essentially what you're saying. Yeah. Even Microsoft, when they were running NVIDIA, they bought NVIDIA at some outrageous margins. I talk to a lot of people that build data centers, and, mind you, these people buy tens of thousands of GPUs. And I ask them, hey, do you get at least a discount or something? And they're like, no. The only thing we get

is the supply. So, I mean, ultimately, if you don't own your compute, you're starting with, you know, something at your ankle. Definitely. And so this is why I like to think in this triangle: product, data, compute. And you can see where everybody

sits, and their weaknesses and their strengths. Can I ask you, if we move a little bit, you said it's totally rational that everyone's focusing on training still. When we think about that, it's rational if you think that efficiency and scaling laws continue to place such emphasis on it. How do you think about model scaling and scaling laws coming into place? How do you think about that? There's, like, a brute force approach to this. It is a very American approach.

More and more and more. But the thing is, you look at, for instance, the xAI cluster. It's not 100,000 GPUs. It is four times 25,000. You're starting to, you know... because with InfiniBand, and in their case RoCE, which is, anyways, the technology they used to bridge their GPUs together, you have upper bounds, right? At some point, you're fighting physics.

So you can push, but it's like, you know, trying to get to the speed of light: as you approach it, the amount of energy you need is a lot higher and a lot higher, and it grows and grows. So there's two, I would say, counters to that. Number one is we still scale, but there's a lot of waste and excess spending on the engineering side, which is the DeepSeek approach, right? Very successful at that, mind you.

They said, yeah, if we do this and this differently, then we get, you know, multiples sometimes, right? So virtually you increase your compute capacity because you're more efficient. And the other approach is Yann LeCun's approach, which is: this is not scaling. And at some point, we need to look the problem in the face and do something better, right? So, of course, we push and push and push, because there's capital still.

But I'm more for these two approaches. I think you can do more with less. At what point do we stop and say, hey, there is a lot of wastage and we could do this better? I think until somebody does it. DeepSeek was a good wake-up call, right? Suddenly efficiency is in.

That's number one. And number two is until there's a new architecture that comes out and changes the game. So in the case of LLMs, for instance, you have these what's called non-transformer models that changes fundamentally the compute requirements. So that might be a frontier that completely obsoletes the transformers. Sorry, the transformers are the, I would say, the building block by which current models work, right? So the way they work is that...

For each token, or syllable if you will, the model will look at everything behind it. So you can see that as you add more text, you have more work to do. So there are these new architectures that do not require this, that might change these things and probably shift the amount of compute needed to do training or to do inference. And then there's the new thing, which is Yann's thesis, which is the world model, as in LLMs are a dead end.

What we need is something that understands the world fundamentally. And this is his JEPA thesis, as it's called. I'm very bullish on this, but it's very frontier. Why are you bullish on it? And why is it so frontier? Because it's Yann LeCun. It's hard to... He's no bullshit, right? So he explained to me how it worked and I was blown away. But it makes a lot of sense. We are creeped out because the machine talks back to us.

But it's not a new thing, right? You know, this was not new technology when it came out. When it exploded, it was not new technology. But suddenly it was talking back. And that freaked us out. And we got crazy on it, right? Language is one form of communication, but it is ultimately a very narrow window into the world.

We use it to describe the world, arguably with some loss, right? And so the JEPA approach, long story short, is that you have essentially two things you want to do and you try and minimize the energy to do them. And from this, understanding emerges, physics emerges, etc. Because you're trying to minimize the amount of energy to go from one state to the other.

And that actually makes sense. If I try and pick up this AirPods case, I'm not going to go roundtrip around the block to get it, right? I just get it. And in my brain, it's wired to just do the thing. If I go and, you know, talk to myself out loud, put the hand down, move to the left and whatever, that feels very inefficient. So probably this will be something that changes. And in the case of LLMs, there's good work also on what's called diffusion-based LLMs, which means, like, instead of

thinking, you know, what's called autoregressively, that means you get a new token, you re-inject it and you redo, etc., they think more like what we do, which is in patches, right? Imagine a paragraph of text and words appear until it's done. Is distillation wrong? And if we're all progressively moving towards a better future for humanity, more efficient models, is distillation not... effectively open source in another wrapper? I think it's fair game, to be honest.

I will not shed a tear. It's fair game. There were, like, some people who tried to ask, I think it was, I don't remember if it was an OpenAI model, so a diffusion model, image, right? They asked it to generate an image from a Star Wars movie at whatever timestamp. And it came out with the Star Wars movie, you know, screenshot. Obviously, it was trained with it. I think it's fair game because there's no free lunch, right? It was trained with data.

You had a good ride. Somebody was sneaky and took it. But you took it from the beginning too. So let's just accept it's fair game. And you also learn from their advancements. Absolutely. Absolutely. I, you know, take my cup. I enjoy it very much, that movie, every

single day. You mentioned the training there. You know, obviously data and data quality dictate a lot of training ability. When you think about the future of data that feeds into training, how do you think about how that will be split between synthetic data versus real data? I'm a bit split on this. There's

a part of me that says that if you re-inject data into the system, the system deteriorates. That feels a bit, I would say, intuitive. But if you look at AlphaGo, for instance, the moment it, you know, ramped up in its skills is when they started generating games. Synthetic games, right? So I'm a bit, you know, split, but there are some verticals that very much benefit from this. Code LLMs, for instance. We can run code.

So this is the poolside thesis. Just so I understand, why does it work for coding and not for other things? Because you don't use the AI model to generate output. You use the machine. You just run the code. And you see what it makes and you run all this code and you create data out of it. Whereas if you run an LLM and you say to an LLM, generate me two trillion tokens of text.

It will do it with its, you know... so you may re-inject and stuff. So there's a lot of tricks, but ultimately my gut tells me that it feels wrong, right? Because you re-inject data that was there. And so it will deteriorate. There's, you know, there's loss. So yeah, I'm a bit bullish. I'm not sure exactly on what vertical. Code is one. We'll see. Distillation is, in some sense, a bit like that. You create synthetic data from a bigger model into a smaller one.
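As a hedged toy sketch of that idea (the teacher and student here are tiny made-up networks on random data; this is only the shape of distillation, temperature-smoothed teacher outputs used as soft targets for a smaller student, not any lab's actual recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))  # "bigger" model
student = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))    # "smaller" model
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(200):
    x = torch.randn(64, 32)                                   # stand-in for real inputs
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / 2.0, dim=-1)    # teacher's synthetic "data"
    loss = F.kl_div(F.log_softmax(student(x) / 2.0, dim=-1),
                    soft_targets, reduction="batchmean")      # student imitates the teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
```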

Probably the most, I would say, mind-blowing thing about distillation is that sometimes the smaller models become better than the bigger model through distillation. So smaller models become better than bigger models purely because of the quality of the data that's inputted through them? The one theory is that the smarter model is better at

generating output that you would want it to generate, essentially. It's not better in the general sense, it's better at the task at which you were measuring it. This is what it learned to imitate. How do you think about the future in terms of large monolithic models versus more dynamic architectures, smaller models? Sometimes it's wasteful to run big models.

A lot of times it's actually wasteful to run big models. I think there's going to be a lot of smaller models for efficiency reasons, but there's a but, which is, you talk to people at DeepMind, and they don't even fine-tune anymore, because they have such, you know, what's called big context windows, which is the data you inject into the model at runtime, right? Nowadays they just dump data

into it and just say, do whatever, you know, that data tells you to do, instead of fine-tuning as we used to do. So the efficiency gains, we're not there yet, right? But if the efficiency gains, I would say, pass that threshold, we'll just do it at runtime. We'll just have a great model that will just specialize at each request. But that's not for tomorrow, I think. What is retrieval augmented generation, first?

It's a very clever trick. What you do is represent knowledge in what's called a vector space, or latent space, and then query it through what's called vector search. So imagine a 3D space that represents all knowledge, everything. A cat sits here, a dog sits close by because it's also an animal, but far away on some other property, and so on. Then you run the user's request through the same system. It's called an embedding.

That gives you a vector, and you take whatever is closest to it, what's called semantically close. And then, and this is the clever part, you insert those pieces of text before the request. It's as if you said: knowing the following, and you give the data, let's say it's law or whatever, please answer my request. And that's it.
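A minimal sketch of the retrieval trick described above, with a toy bag-of-words "embedding" standing in for a real embedding model; the shape of it, embed the chunks, find what is semantically closest to the query, prepend it as a preamble, is what matters.

```python
# Minimal RAG sketch: embed chunks, find the ones closest to the query,
# and prepend them to the prompt. The bag-of-words "embedding" below is
# a toy stand-in for a real embedding model.
import numpy as np

def embed(text: str, vocab: list[str]) -> np.ndarray:
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def top_k_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    vocab = sorted({w for c in chunks + [query] for w in c.lower().split()})
    q = embed(query, vocab)
    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0
    scored = sorted(chunks, key=lambda c: cosine(q, embed(c, vocab)), reverse=True)
    return scored[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n".join(top_k_chunks(query, chunks))
    # The retrieved text becomes a preamble: "knowing the following, answer my question."
    return f"Knowing the following:\n{context}\n\nPlease answer: {query}"

docs = [
    "Cats are small domesticated felines.",
    "Dogs are loyal domesticated canines.",
    "GDPR is a European data protection regulation.",
]
print(build_prompt("what regulation protects data in Europe", docs))
```

The fixed context window he mentions is exactly why the chunking question matters: only the top few chunks fit into that preamble, and everything else is forgotten.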

So it's a bit of a clever trick, and a bit dirty, because of course you are limited by the amount of data you can put in, right? So there's this problem of how you chunk the data you put in. Are a lot of the things we do not retrieval augmented generation, then? Where we say, here's a link, summarize it into the key points. Is that not RAG, because we're inputting the data? It is, it is.

It depends on how it works, but yes, sometimes it is. But think of it as a preamble to your question: knowing the following, and the following is a tiny window into the content, please answer my question. And of course, as you talk more and more, it will forget, because that window is fixed. And so how does that shift the movement from large generalized models to smaller, more advanced models?

What pushes smaller models is efficiency, roughly speed. Less is better. So if we can do it with less, then less it is. Simple as that, right? In terms of RAG, the key frontier is what we call attention-level search. That's something we're working on. You have the exclusivity; now I'm putting it out there. It doesn't push model sizes, though. What really drives model sizes is efficiency rather than specialization.

Meaning that if you can get the same performance with a smaller model that is fine-tuned, with RAG or whatever, then you'll do it with the smaller one, because again, less is better.

Can I ask you, before we move into a quick-fire round, when we had DeepSeek, as we mentioned, to what extent were you surprised that such innovation, I would argue, and I think many would agree with me, came from a Chinese competitor and not from a Western competitor? Oh, I love it. Constraint is the mother of innovation.

Yes, we can troll a bit about the Singapore gray market and all of these things, but ultimately they had no choice. Here's the thing: if you can buy more, why would you give a damn, right? You can just buy more. So if you are pushed to efficiency, then you will deliver efficiency. These are very, very skilled people. This is the coolest thing to me about AI, honestly: geography doesn't matter anymore. You can just do things.

You appear out of nowhere and boom, you're on the map. So I'm very, very glad that they did. I found the reaction very entertaining, to be honest. So yeah, constraint is a very good driver of efficiency. Do you think it is a meaningful threat to OpenAI and ChatGPT? Bluntly, they still have the consumer loyalty, the consumer brand. Yeah. To what extent is it actually a long-term threat?

I'm not sure who is a threat to OpenAI at the moment. Here's why. You look at the numbers. I mean, we live in a bubble. We follow every new episode, every new model, who said what, and so on. But I go to my mother and I ask her, do you know ChatGPT? And she says yes. And, I don't want to dunk on anybody, but do you

know some other model? And she says, what is it? Even Gemini, right? Like Google. So they have a strong brand, they have a strong product, but there's a balance between the product and the models, honestly. It was Gary from Fluidstack, actually, who told me his mental model: model providers will be like car makers. There's no winner-take-all. Everybody will have their own, because ultimately it's human knowledge; everybody has everything.

So we're converging. But I like that analogy. Yes, DeepSeek made waves, but those waves were amplified by the media, the narrative, and the drama. Do you think export regulations inhibit China's ability to compete in any way? Today, maybe. Tomorrow, I'm not sure. They're a bit late in terms of ASICs; they're at about A100 level. But probably one of their unfair advantages is that it's like doing exercise in the water, right?

That is their state. They are constrained, so they are bound to do better. They just cannot buy their way into better compute. So I think it hinders their success, but I think it's short-term thinking to see it that way. Are you fearful that Europe is going to regulate itself into constraints in a world of AI? No, I don't care. I have zero fear. This is something I...

It makes me wonder sometimes. I understand the narrative and so on, but I am absolutely not fearful. Let's be successful first, and then we'll talk about the politics. So far, you know... but again, I'm not Mistral, I'm not building gigawatt data centers and so on. If you build gigawatt data centers, then maybe you run into these problems. But

the thing is, if you're successful, everything flows from there. Steve, I'm being direct here, but I'm asking you for the pros: everyone says Mistral just doesn't have enough money to compete. That is kind of the word on the street. To what extent is that fair? They are very competent. I think it's easy to spread FUD, and there's a lot of FUD going around, especially about regulation and everything. But here's the thing: I look around me and I don't see

what I read, right? So I am hardly convinced by it. Everybody was saying they were dead, and boom, they came out with their release and it was insane. What I know is that I hope they don't have too much money, that's for sure. You want to be clever, right? Final one before we do the quick-fire, and I've so enjoyed this, Steve: Stargate was a $500 billion announcement. How did you evaluate that? My first impression was that I don't buy it.

It's, I would say, American style, right? You start with the claim and figure it out later. I don't buy it. And ultimately, I'm not sure I care that much about it. Let's imagine it's true, right? Congratulations, amazing. But it is more of the same. It is vertical scaling. And as you know, my days are spent on efficiency. So I look at these things as, all right, this is the American car of AI:

it's big, it consumes a lot of gas, but ultimately it's not a good car, right? I think there has to be sufficient capital, but at some point I'm not sure it is really a differentiator. That was prior to DeepSeek, and then DeepSeek came. That was always my thesis: you need money, you need infrastructure, you need... But probably the two limiting factors today are talent

and energy. That's it. The rest, yes, of course you can buy $500 billion of GPUs. By the way, at 90% margin. So if we work on that margin, we can probably shrink that number. I'm not easily entertained by these numbers. I've seen how the sausage is made way too many times.

Dude, I want to do a quick-fire with you. I say a short statement, you give me your immediate thoughts. Sure. If you had to bet on one major shift in AI infrastructure over the next five years, what would it be? Oh yeah, latency reasoning, definitely. This year. What does that mean? The shift from throughput, so how fast my answer streams, to latency, how long it takes for my complete answer to appear.
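A rough sketch of that throughput-versus-latency distinction, with made-up numbers (none of these are vendor figures): time to first token is dominated by prefill, while time to the complete answer is dominated by sequential decoding, which is exactly what long reasoning answers expose.

```python
# Back-of-the-envelope sketch of throughput vs latency for LLM inference.
# All numbers are made up for illustration; they are not vendor figures.
def time_to_first_token(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Latency until the first output token: dominated by prefill of the prompt."""
    return prompt_tokens / prefill_tok_per_s

def time_to_full_answer(prompt_tokens: int, output_tokens: int,
                        prefill_tok_per_s: float, decode_tok_per_s: float) -> float:
    """Latency until the complete answer: prefill plus sequential decoding."""
    return (prompt_tokens / prefill_tok_per_s) + (output_tokens / decode_tok_per_s)

# A long "reasoning" answer makes completion latency, not raw throughput,
# the number users actually feel.
print(round(time_to_first_token(2_000, 20_000), 3), "s to first token")
print(round(time_to_full_answer(2_000, 4_000, 20_000, 80), 1), "s to full answer")
```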

That is probably one of the fundamental shifts, like this year, right? Longer term, I'm really rooting for non-transformer models that will change the compute landscape. And of course, world models, right? Yes. And/or energy-based models. What's one piece of advice you'd give to AI startups navigating the changing landscape of training, inference, and hardware? Probably the number one thing I would say is: do not resell compute if you can avoid it. A lot of AI startups

that are building on top of AI are trying to make a margin on top of a very big cake, and ultimately what they sell is compute. If you look at a dollar of spend, maybe 98% of it goes to somebody else's margin. So if you do AI, as much as you can, try to verticalize on the product, not on the compute. If your business model implies buying a lot of tokens,

it's a very hard circle to square to fit that into $20 a month, right? So I always say, please look at it from that angle, and if you can, try and avoid it.
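A back-of-the-envelope sketch of that margin problem, with every figure assumed purely for illustration: if a heavy user's token bill approaches the subscription price, the wrapper's gross margin collapses.

```python
# Back-of-the-envelope unit economics for an app that resells tokens.
# All figures are assumptions for illustration, not real pricing.
subscription_per_user = 20.00          # $/month the user pays
tokens_per_user = 8_000_000            # tokens a heavy user burns per month
price_per_million_tokens = 2.00        # $ paid to the upstream model provider

compute_cost = tokens_per_user / 1_000_000 * price_per_million_tokens
gross_margin = subscription_per_user - compute_cost

print(f"compute cost: ${compute_cost:.2f}")             # $16.00
print(f"gross margin: ${gross_margin:.2f} "
      f"({gross_margin / subscription_per_user:.0%})")  # $4.00 (20%)
# Most of the user's $20 flows straight to someone else's margin,
# which is the circle that is hard to square.
```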

What's the biggest challenge that Jensen Huang faces today? The highs are very high, but they don't last forever, so probably it's how to navigate the downslope. Blackwell is probably something that keeps him awake at night. Why would that keep him awake at night? Would that not re-energize him? More orders, new enthusiasm, new product, baby. Because orders are getting canceled.

Why are they getting canceled? They have a lot of problems with these chips, so a lot of people are canceling their orders. These chips are on the frontier of scaling. They were supposed to come out last summer, but there's that heat dissipation and, you know, bending problem. People who are very privy to silicon told me this is what we call a pretty big fucking problem, right?

End quote. Probably how to navigate the downslope. Maybe you don't know this, but the supply of H100s was actually smoothed out over the year, so that they didn't have a big spike in deliveries and then a quarter with less, right? Which pissed off a lot of people, mind you, who bought a lot of them; some of them still haven't received their order from last year,

and they already see the new chip, the B200, and then the one after, and they're super pissed. There will be a downslope at some point. The question is when and how. If the H100 bubble pops, of course it will impact Nvidia. But Blackwell, and I'm probably going to get a lot of flack for this, I've seen some very worrying numbers about it and varying testimonies from people who operate these things, right? So that ride will stop,

or at least slow down. Steve, I'm not sure I've ever learned quite as much in one episode. Seriously. We said before... Oh, wow. No, I love what I do because I'm able to ask anything of the smartest people in their business. And I so appreciate you packing so much in with me today, man. I'm thrilled to say that I actually finally get what you do. You've been a star, so thank you, man. Thank you, I appreciate it. Thank you.

I mean, I said it there: I think I learned more in that episode than I have in the last 1,000 when it comes to technical specs and the future of AI. Steve was incredible. If you want to watch the episode, you can find it on YouTube by searching for 20VC. That's 20VC.

But before we leave you today: turning your back-of-a-napkin idea into a billion-dollar startup requires countless hours of collaboration and teamwork. It can be really difficult to build a team that's aligned on everything from values to workflow, but that's exactly what Coda was made to do. Coda is an all-in-one collaborative workspace that started as a napkin sketch. Now, just five years since launching in beta, Coda has helped 50,000 teams all over the world get on the same page.

Now, at 20VC, we've used Coda to bring structure to our content planning and episode prep, and it's made a huge difference. Instead of bouncing between different tools, we can keep everything from guest research to scheduling and notes all in one place, which saves us so much time. With Coda, you get the flexibility of docs, the structure of spreadsheets, and the power of applications, all built for enterprise.

And it's got the intelligence of AI, which makes it even more awesome. If you're a startup team looking to increase alignment and agility, Coda can help you move from planning to execution in record time. To try it for yourself, go to coda.io/20VC today and get six free months of the team plan for startups. That's coda.io/20VC to get started for free and get six free months of the team plan. Now that your team is aligned and collaborating, let's tackle those

messy expense reports. You know, those receipts that seem to multiply like rabbits in your wallet, the endless email chains asking, can you approve this? Don't even get me started on the month-end panic when you realize you have to reconcile it all. Well, Pleo offers smart company cards, physical, virtual, and vendor-specific, so teams can buy what they need while finance stays in control. Automate your expense reports,

process invoices seamlessly, and manage reimbursements effortlessly, all in one platform. With integrations to tools like Xero, QuickBooks, and NetSuite, Pleo fits right into your workflow, saving time and giving you full visibility over everything,

every entity, payment, and subscription. Join over 37,000 companies already using Pleo to streamline their finances. Try Pleo today. It's like magic, but with fewer rabbits. Find out more at pleo.io/20VC. And don't forget to revolutionize how your team works together: Roam. A company of tomorrow runs at hyper speed with quick drop-in meetings. A company of tomorrow is globally distributed and

fully digitized. The company of tomorrow instantly connects human and AI workers. A company of tomorrow is in a Roam virtual office. See a visualization of your whole company: the live presence, the drop-in meetings, the AI summaries, the chat. It's an incredible view to see. Roam is a breakthrough workplace experience loved by over 500 companies of tomorrow, for a fraction of the cost of Zoom and Slack. Visit Roam, that's R-O dot A-M, for an instant demo of Roam today.

Nobody knows what the future holds, but I do know this: it's going to be built in a Roam virtual office, hopefully by you. That's RO.AM for an instant demo. As always, I so appreciate all your support, and stay tuned for an incredible episode coming on Wednesday with Oscar, founder at Glovo, on turning Glovo into a $2 billion business.
