Baseten CEO Tuhin Srivastava on the AI Inference Crunch, Custom Models, and Building the Inference C

⁠¶ Intro / Opening

00:00

🎵 Music

00:05

Hi listeners. Today Elad and I are here with Tuhin Srivastava, the founder and CEO of Baseten, the AI inference cloud. We're here to talk about capacity constraints for AI compute, why inference is the last market, how the workload is changing, the open source and perhaps multi-chip future, and what 30x scale in a year looks like. Tuhin, welcome back.

00:28

Hey. Good to see you.

00:30

Thanks for having me.

⁠¶ Baseten growth

00:31

All right. You are in one of the craziest markets, AI inference. It's very important. There's a lot going on. You guys have grown 30x over the last year. And I think I can say you're expecting to do more than a billion dollars in revenue this year.

00:46

Mm-hmm.

00:46

What's going on? Tell us about scale.

00:49

Yeah. No, it's been nuts. I think what's happened over the last, honestly, 24 months, and it just keeps getting bigger and bigger, is that everyone is realizing that you can put AI everywhere. You have all these great options available, from closed source to open source models.

01:09

The open source models have crossed some sort of chasm in terms of their baseline capability. And then I think RL techniques and post-training for specialized models have become mainstream enough, and there are enough examples of it working, that customers are realizing they can kind of own their infra more and more. And what that's meant for us is more of the long tail of models

01:39

coming true, customers in-housing a lot of that intelligence themselves. And the application layer just gets bigger and bigger and bigger. And as that's growing, we are just indexing on that, and we're around to be able to collect the demand.

⁠¶ Why the app layer wins

01:55

There's an existential question in here that I think everybody is continually asking: does the independent application layer get to exist at all versus the labs? You have to believe this. Why do you believe it?

02:07

Yeah, look, I think it'd be a sad thing if it didn't exist in general. But, you know, sadness is fine.

02:16

I'm sad all the time.

02:17

Oh yeah, sadness is fine. But that's not the reason why I think the application layer will exist. I think the application layer will exist for a number of reasons. One is this idea that what is valuable to a company is the user signal that only they can gather. And to the extent that that is encoded

02:45

in a model, I think a lot of their business will be at risk. But to the extent that it is encoded in workflows, that is where they will be able to develop a moat. I think a good example of that is a company like Abridge, where the clinician's edits of the notes, what they do with those notes after the fact, and the thing that happens inside the EMR three steps down, that becomes a workflow that only

03:11

Can you explain what Abridge does?

03:13

Abridge is an ambient scribe that is used by physicians in almost all hospitals in the US. I think Elad's an investor. Shiv's amazing. Great company, great team, great product. And they've basically got this very, very deep integration into hospitals, into clinician workflows.

03:39

And my argument here would be that it's actually very, very hard for a frontier model company to be able to eat away at that, because they just don't have access to that user signal. And what will happen over time is the folks who have access to that user signal can start to post-train models on that reward signal and start to get long-horizon agentic

04:00

models running that. And I think to the extent that that is possible, and that signal is differentiated and unique and somewhat rare to get access to, there will be an application layer. Support companies are another example of that, where a support task isn't one-shot.

04:23

Usually at a company like Baseten, when a ticket comes in, there's like one, two, ten, twenty actions that get taken. And that is where someone can develop a specialized model.

04:34

So there's almost two versions of this then. There's new companies like Abridge or Decagon or some of these other things that you mentioned that are doing these new types of applications that are using AI, and they sell it to customers.

04:44

The other is enterprises building things in-house or building their own models. What proportion of the market today do you think is these new application companies versus enterprises just adopting AI? And how do you think that looks in a couple of years?

04:58

I think you asked me the same question two years ago.

05:02

I hate to be repetitive.

05:03

Oh, it is crazy.

05:05

Yeah.

05:06

It's crazy that the answer is the same, too. I think if you looked by inference count, it'd be 99 percent the former. I think that kind of represents the scope of the opportunity here: the majority of the market hasn't come online and added AI.

05:25

Yeah, most of enterprise adoption is well ahead of us, and I think that's one of the very exciting things about AI. Yeah, also there's just so much still to come, and people are underestimating that.

05:33

Hundred percent. But what's cool is that we're seeing the transition happen, right? Before it was like, hey, are they using AI tools? I don't think that was immediately obvious two years ago. I think it's obvious now that yes, they are. Are they using closed source model APIs?

05:47

I think they're starting to get there. I think once you do that and you kind of see what is possible, then comes the whole custom model adoption. I think that is all ahead of us today.

⁠¶ Serving frontier customers

05:57

So if the majority of your customer base today is, as you described, the former, like application companies, AI natives, the fast-growing ones, and some of them are at considerable scale now, like the Abridges, Cursors, and OpenEvidences of the world.

06:14

You know.

06:15

What do they teach you? What does that push the company to do? How do you think about serving them versus evolving for the enterprise?

06:21

Yeah. I think firstly, you just learn a lot by building with the companies at the greatest scale, doing the most interesting things. We think of it two ways. There's the most obvious way, which is just build for the highest scale,

06:41

the customers that will push you the most technologically, and everything will kind of fall into place. I think Stripe's evolution as a company showed that: Stripe now serves so many enterprises, but twelve years ago that wasn't the case.

06:55

But they just built for the frontier and kind of went with them. I think the second way we think about this is to just think about building for companies that are serving enterprises. So yes, we don't serve the enterprises, but our customers serve enterprises. Abridge, OpenEvidence, Decagon, Writer, Gamma, Clay, all these companies serve enterprises en masse. And what we actually get is a translation of the requirements

07:22

from them. They're like, hey, we need this sort of data retention. This is where models need to be deployed. These are the types of GPUs or the latencies they're okay with. These are the model requirements from a transparency perspective that they care about. And so I think that is actually the more nuanced answer. It's that

07:39

if you listen to what their needs are, we actually get a full translation of what the enterprise would require. I would say that by serving companies like Abridge, OpenEvidence, and Latent Health, we're probably pretty well suited to go serve the healthcare system, given that they are selling to them.

⁠¶ Open source model mix

07:55

How much of a shift are you seeing in terms of the types of open source models that are being used? I think we've seen an evolution where two, three years ago, the main thing was kind of Mistral and then a few other things. Then Meta came along with Llama, and then it really shifted, in that the most performant models are of Chinese origin in different ways. Do you see that sort of mix reflected in terms of what's being used by your customers?

08:16

Yeah, I think customers, at least the customers we are serving, and these are the fastest growing AI companies in the world, that are very forward thinking, they want to use the best model. And they are optimizing. There's a subset of tasks, which I think is small today, where people really start with cost.

08:38

But everyone comes for capability first, because that's really where the economic growth is being unlocked, where the value is being delivered, and then they optimize. And so with that in mind, you name it: everything from GPT-OSS all the way to Moonshot or DeepSeek, to Canopy's Orpheus, which is a really good text-to-speech model.

09:07

Customers generally want to use whatever's at the frontier. And I think the difference is, first, that we have a lot more visibility into how to run these and how to run them really well. And secondly, that they're good now.

⁠¶ Chinese models and geopolitics

09:22

There have been a number of different concerns raised about the use of Chinese models, in particular security: is there something embedded in the models, Trojan horses, or other things? A, do you think there's any real concern there? And B, people often talk about how there should be US counterweights to this.

09:38

From a geopolitical perspective, do you think that's something that's legitimate or something we should be worried about? Or how do you think about the

09:43

Yeah, it's sort of like a

09:44

origins of these models versus their uses.

09:46

Yeah. Look, I think these models, firstly, are fantastic. They're amazing. We work with these teams. They're truly awesome. It is hard for me to see, and I could be wrong, but if I network-bound these models, they're not magically going to be able to cross those network boundaries. So that's the data side. And I've never seen any real evidence,

10:15

except from some very early models that I think people picked up on very quickly, that there is some agenda or bias built into these. I do think that, to some extent, it is important to the US that we develop our own models. I think it would be a massive loss if there are five different labs in China that are creating open source models and we're struggling to get one set up. So it's necessary.

10:49

I also think it's inevitable. With the DeepSeek moment a year ago, I remember someone saying to me, and I thought it was very well said, and the world's changed a lot, but they said, "Hey, we should kind of just forget that this is a Chinese model. We should just act like this came from Meta and build with that in mind." I think you're kind of missing the forest for the trees.

11:16

There's two scenarios, right? Either America does not ever come up with good open source models, in which case I think there's probably a fundamental problem there, or we will get there and we need to be ready for that world.

11:28

Yeah, that makes sense. It's interesting because, like you, I think it's very important for the US to have a strong open source footprint here. At least for now, it looks like the Chinese government is effectively subsidizing at least a large subset of these models.

11:42

And that subsidy or surplus is effectively just being passed on to US enterprises who are adopting these models. In other words, it's a way for the Chinese government to effectively subsidize US enterprise in an indirect manner. And I think that's a little bit lost right now. But it's always interesting to weigh that against some of the other concerns that are raised. So I appreciate your comments on that.

11:59

Well, and I think the concern also just becomes: what happens if we aren't able to? If you think about the economics here, DeepSeek's a very good model.

12:15

You can argue whether it's at the absolute frontier or not, but let's go back three months and it's there. And we were doing a whole lot of things three months ago. So let's just think about that one. You could run DeepSeek

12:31

at probably 20 percent of the cost of running OpenAI or Anthropic models in production, with comparable or better latency and probably better reliability. If we don't have access to that intelligence in that form, I think it's just a massive loss. And as a country, we won't be able to innovate as fast, because the cost of intelligence going down, and control of intelligence, what we have seen is that it just means more intelligence. More intelligence being embedded in more places.

12:56

One note here that we didn't mention explicitly is that the state-of-the-art models, the ones that are furthest ahead on the frontier, are actually still the closed source ones: Anthropic, OpenAI, Google, et cetera.

⁠¶ Custom inference dominates

13:07

Actually, maybe you can just characterize the workload a little bit. Of the tokens being served on Baseten, how many of them are from custom models of some kind versus vanilla open source today?

13:20

It is all custom. It's basically... Okay.

13:24

So like ninety five percent.

13:26

Ninety-five percent. And I think that's really cool, to be honest. Look, we have three businesses right now.

13:36

So we have dedicated inference, which is basically custom model inference. Your SLA is your SLA. Then we have shared inference, which is shared inference endpoints with shared SLAs. And then we have a training business. I'd say 95 percent of the tokens today are on the first business. And for almost all of them, the customer is making some modifications to the model with their own data,

14:07

specialized for the use case. And I think what's even more important is they might be compiling in different ways. No one is just running the vanilla open source weights. You might be customizing it for quality, but you mostly might be customizing it for performance.

⁠¶ Post training acquisition

14:22

You made an acquisition of a research team a few months ago. You've mentioned post-training and customization. What was the rationale behind the acquisition? What is that team doing today?

14:33

Yeah. So the rationale around the acquisition was, we are infrastructure and product people. We are product people who now are really good infrastructure people. And we didn't have much of a research capability ourselves. And what we saw was the market moving heavily, and that we could accelerate the market itself

15:01

with post-training resources, either productized or even just as resources for that market. So Parsed was a company that was a Baseten customer. They were post-training models and running them on Baseten. And I think what they realized was that they would eventually need to become an inference company. And what we realized was, hey, we really need

15:30

that expertise, because it represents a way for us to get closer to the customer earlier and be able to support them all. It just made sense, pairing them together. And as I said in the opening, as more and more post-trained models have come up, we've realized that the demand, either for software

15:59

to do post-training, or for post-training expertise, is very high, and we're really, really investing in that. They're also a bunch of Australians. I like to think that we had a bit of alpha there. But yeah, that's been fantastic. They're working with all sorts of customers. And it's also very interesting: we were doing a lot of research on the performance side and less so

16:27

on the post-training side. As we've started to do a lot more research on the post-training side, you start to see how linked inference and post-training are. Even when you think about stuff like quantization, and when you should do that, and

16:43

how you train the model affects how you need to quantize it for inference. How paired these problems are has become very apparent. And more and more, we see post-training and inference as two sides of the same problem.

16:58

Because inference will ideally beget more post-training: inference creates data, you do evals, you can now post-train on that reward function that you found with those evals, and hopefully just set up the entire loop.

⁠¶ When to invest in custom models

17:10

Plenty of folks from Ant and OpenAI, Sam, Greg, et cetera, have said in recent months that inference is super strategic, inference talent is strategic, capacity is strategic. So between that and post-training, these are very difficult capabilities to gather.

17:30

I imagine that lots of your customers go to you guys for advice on how to do this progression of moving to custom models. What do you tell people about the life cycle and when they should invest in that?

17:42

Yeah, I think it's: hey, go prove to yourself with the best-in-class model that you have something worth optimizing. If a customer comes to us, it's like, no post-training pre-product-market fit, is what I'd say now.

18:07

Okay, you're

18:07

working with here are very at scale first.

18:10

Yeah, they have a user signal that they know how to optimize, and they've shown that they can serve customer value.

18:16

And that they have something special around that value. And once you have that value, it's like, okay, now how can I do that better, faster, and cheaper? With the idea being that, hey, if you need to be very good at customer support, you maybe don't need to be that good at coding, and a specialized model might be a better fit for that problem, and you can do it better, faster, cheaper.

⁠¶ Supply crunch and data centers

18:35

What about the capacity side? You started with unifying capacity across all the clouds and neoclouds. How do you think about this when everybody keeps talking about a supply crunch, and a multi-year supply crunch?

18:47

There's so much narrative around the supply crunch. And as much as we hear about it, I don't think people realize how bad it really is. There's very, very little slack compute available. We run pretty large clusters ourselves, and we run them at uncomfortably high utilization. I'm saying we're at mid-nineties utilization most of the time. We sit in

19:24

18 different clouds now. We have 90 clusters around the world across 18 different clouds. Initially, we built this technology to be able to create one runtime fabric that spans all these different clouds and to abstract that away from our customers, as a way to think about reliability, latency, failover, all these things that we think are going to be very important for very mission-critical use cases.

19:48

That same technology, just our ability to get compute wherever humanly possible, has been really, really helpful in our ability to get supply. What I mean by that is we can be introduced to a new provider in a different country and have it up and running with the whole Baseten inference stack.

20:10

As part of the fabric.

20:11

Part of the fabric, in half a day, maybe less. And that gives us enormous flexibility. Even for us, it is hard to grow. We have a 4 PM standing meeting for the company where we basically ask,

20:35

how do we manage capacity for the demand right now? The second part that people don't really understand is that there are also a lot of

20:51

suppliers right now that are kind of grifty. They haven't run data centers before. They don't understand SLAs, especially for inference. And so even when there is capacity available, there's probably...

21:13

We run a lot more than this, and we have redundancy, so it's fine. But there's probably a dozen good clouds, and I'd probably put three or four of them in the gold tier. And I think that just means that not only are we supply crunched, we're supplier and operationally crunched on people who can run these data centers as well.

21:39

How far ahead can you actually buy capacity right now? In other words, like is there

21:43

Right.

21:44

any slack in the market if you buy two years ahead or five years ahead?

21:48

You mean actual contract length, or like, "Hey, I want this in January '28"?

21:54

Either one, yeah. I mean, it's more the "I want this in January '28," or at least I have some visibility into my future.

22:03

But you've also got to remember how quickly the market is moving. And that gets balanced somewhat by the fact that the H100 is such a great chip.

22:17

It's crazy. It's four, four and a half years old. The price is going up still. Maybe it has a useful life of nine years. So that's good. But at the same time, yes, you can do that.

⁠¶ Longer GPU Contracts

22:31

But you're making a lot of bets as part of that. And I think the big thing that's changed over the last six months is that the term length that people want has just gone up. So if you wanted 1,024 B200s from a good cloud, right now you're not getting that on less than a three-to-five-year contract, with probably a 20 to 30 percent TCV prepay.

23:06

So actually, what becomes important when acquiring capacity is that you need to have enough demand to serve, but then you also need a low cost of capital, which is actually changing the dynamic pretty significantly.
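To make the prepay point concrete, here is a minimal arithmetic sketch of the contract described above (1,024 B200s, a three-to-five-year term, a 20 to 30 percent TCV prepay). The hourly price per GPU is not stated in the conversation; `ASSUMED_USD_PER_GPU_HOUR` is a purely hypothetical placeholder, and the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap days ignored for simplicity

# Hypothetical price, NOT from the conversation. Replace with a real quote.
ASSUMED_USD_PER_GPU_HOUR = 3.00


def total_contract_value(gpus, usd_per_gpu_hour, years):
    """Total contract value (TCV) for a fully reserved cluster over the term."""
    return gpus * usd_per_gpu_hour * HOURS_PER_YEAR * years


def prepay(tcv, fraction):
    """Cash owed up front when a fraction of TCV must be prepaid."""
    return tcv * fraction


if __name__ == "__main__":
    for years in (3, 5):
        tcv = total_contract_value(1024, ASSUMED_USD_PER_GPU_HOUR, years)
        low, high = prepay(tcv, 0.20), prepay(tcv, 0.30)
        print(f"{years}-year term: TCV ${tcv:,.0f}, prepay ${low:,.0f} to ${high:,.0f}")
```

Under that assumed price, a three-year term works out to roughly $80.7M of TCV and $16M to $24M due up front, which is why cost of capital starts to matter as much as demand.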

23:20

Does that impact how you would think about going public as a company? Because arguably...

23:24

Yeah. I think you'd go sooner. Yeah, exactly. And I think there was demand for that. One of the realizations that we had recently, and we're software people, so we don't think like this all the time, is that

23:46

the business has very interesting working capital requirements. And as a result of that, it has very interesting financing requirements. And at least right now, we're not even going down to the

24:02

There's also things you could do in terms of debt or other structures.

24:05

Yeah. I've learned a lot about debt recently.

⁠¶ What Makes a Winner

24:10

Given the supply crunch, and inference being one of the top couple of markets you're going after, you have plenty of people who understand this problem and therefore some competition. How do you

24:25

How do you think about the factors that create a dominant player here, or a winning player? Is it, as you mentioned, cost of capital? Is it access to supply? Is it software? Is it demand?

24:36

Yeah.

24:37

Just being excellent at everything?

24:39

Yeah. Look, I think what's so interesting about inference is that GPUs as a service is not sticky. I think that's been seen. Customers generally just see that as a commodity. Inference with the software layer included is incredibly sticky. None of our top thirty customers have ever churned. We're talking

25:07

400 percent annual NDR around our business. So it's very, very sticky. So I think that software layer is very important. The optimist in me is like, oh, there's so much value in the software, and we will build the best software layer for inference that exists. But as I think is becoming clear now, access to inference compute

25:29

Yeah.

25:30

is a strategic advantage, and I think that is the strategy that even the labs are going after, which is: if we have all the compute, good luck running inference.

25:43

Yeah, yeah. And in a world of constrained compute, the number one thing to own is compute. Yeah. And so, you know, just owning it in and of itself is an asset. And I think people underappreciate that.

25:52

You can't make a good hot chocolate without milk.

25:56

Unless you're a vegan.

25:57

Unless you're vegan. But no one wants vegan inference.

26:01

Well, I've got to ask you, people might want alternative milk, right? So okay, the H100 is a great chip. People want a B200, they want GB200, they want, of course, tons and tons of NVIDIA.

⁠¶ Multi Chip Future

26:16

When you think about making a bet several years into the future, do you believe that there's a multi-chip world? What do you think happens from a compute perspective, on the chip side?

26:28

Yeah. I think diversification everywhere is good. The same way I want there to be many models, I think we want there to be many of most things. And I think

26:41

You'd be sad if it didn't happen.

26:42

Yeah, and I think everyone would be sad, to some extent. And I think there will be inference-specific chips. I think you'll have decode-specific chips. I mean, that was the whole Groq LPU thing. I think that is very straightforward and makes sense. I think people really, really, really underestimate the supply chain stuff with NVIDIA. Like how good they are at that.

27:10

CUDA, how good CUDA is, the developer ecosystem around it. To me, one of the most important things as an infrastructure company in this moment is how fast you can move. And you can move fastest with NVIDIA today.

27:30

And I think that is the reality. Given the scale that they operate at, it's hard to see, and I'm not saying it won't happen, but in the short term, in the next couple of years, how anyone's going to beat them.

27:47

Especially with so many of the other players. What you need to be able to compete here is the ecosystem to form around you. And if you tie up all your supply with one buyer,

28:00

which a bunch of the other chip providers have done, it's actually hard for that ecosystem to form. If you're a big lab and you have a proprietary deal with one chip type where you get 90 percent of the supply, it's actually in your best interest to make sure you get 95 percent of the supply, and everything just gets built for you and no one else could ever use it.

⁠¶ Runtime Roadmap

28:19

When you think about reacting to the market, what do you think is happening with the actual workloads that you have to go invest in? Obviously, code agents and long-horizon agents have become a big deal over time. People talk a lot more about CPU compute. Video inference is different. I don't know if it's that, or sandboxes. What's important for you guys to invest in now?

28:42

Yeah, look, I think for us all the runtime stuff is obviously very important. And what that means is what chips we run on, how we run, what kinds of workloads we support. Do we get very good at diffusion transformers? Yes. Coding agents need sandboxes, so we should go build sandboxes. There's all sorts of new speculation techniques

29:00

to get faster inference. We need to do that. Even stuff like KV cache-aware routing, that stuff's a bit old now, but continuing to be very good at that, and somewhat disentangling prefill and decode and starting to treat them as separate problems. I think that's something we're very focused on, and we're seeing massive gains.
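For readers unfamiliar with the term, the idea behind KV cache-aware routing can be sketched in a few lines: send each request to the replica that already holds the longest cached prefix of its prompt, and break ties by load. This is a toy illustration under stated assumptions, not Baseten's implementation; `Replica`, `route`, `block_keys`, and the block size are all hypothetical names and values.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block (hypothetical; real systems use larger blocks)


def block_keys(tokens):
    """One key per full block; each key identifies the entire prefix up to that block."""
    return [hash(tuple(tokens[:end])) for end in range(BLOCK, len(tokens) + 1, BLOCK)]


@dataclass
class Replica:
    name: str
    cached: set = field(default_factory=set)  # prefix keys assumed resident in KV cache
    in_flight: int = 0                        # crude load signal

    def matched_blocks(self, keys):
        """Count leading blocks of the prompt already cached on this replica."""
        n = 0
        for key in keys:
            if key not in self.cached:
                break
            n += 1
        return n


def route(replicas, tokens):
    """Pick the replica with the longest cached prefix; prefer lower load on ties."""
    keys = block_keys(tokens)
    best = max(replicas, key=lambda r: (r.matched_blocks(keys), -r.in_flight))
    best.cached.update(keys)  # after serving, this prefix is assumed cached there
    best.in_flight += 1
    return best


if __name__ == "__main__":
    pool = [Replica("a"), Replica("b")]
    shared_prefix = list(range(8))
    print(route(pool, shared_prefix).name)                 # cold cache: falls back to load
    print(route(pool, list(range(100, 108))).name)         # unrelated prompt
    print(route(pool, shared_prefix + [9, 9, 9, 9]).name)  # reuses the shared prefix
```

A production router would also handle cache eviction, request completion, and prefill/decode placement, all of which this sketch leaves out.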

29:22

That's at the runtime level. Beyond that, everything we think about is how to create more of that loop between inference and post-training, because we think that just begets more inference. And so we will build or partner in almost everything

29:41

there. So, you know, we're gonna work with the best evaluation in the world, like Braintrust, to make sure that's very well integrated into and around Baseten. On the sandboxes side, we will partner with or build the best sandbox experience that will exist. And then we'll create the best training APIs so that continual learning becomes somewhat of a solved problem. It's not just like a discrete

30:08

That's, I think, the core Baseten product thesis. It's like, how do we build that loop? And then everything around that becomes: how do we make sure that we do everything we can to ensure that gets as big as possible? That's access to compute. That's on infrastructure. Make sure we can get compute anywhere. Make sure we have access to our own compute.

30:28

And then I think it's all the primitives that come after that, that just become incredibly margin accretive, both for us and our customers, which is stuff like, yeah, sandboxes and async batch inference. Like, how do we drive utilization by having a first-class batch inference experience?

30:47

To me, this is what an inference cloud looks like. It's that you are very good at inference, and then you start to do all the things tangential to, or that loop into, inference, and partner where necessary and build where necessary. But we really do want to own, like, start with that core inference story and then go down the stack to unblock supply or create margin, and go up the stack to unlock value.

31:07

What would surprise people about some of the issues you discover only at scale? I'll give you an example. I was surprised when

⁠¶ Scaling Edge Cases

31:16

you guys ran into scale limitations, like fundamental limitations, with some of the hyperscaler products that you were consuming. Yeah. Because I kind of think of, you know, the AWSes and GCPs of the world as supporting infinite scale.

31:31

Yeah. I mean, I think, and again, for very, very large companies that run services at big scale, it's probably the same stuff. It's that all the edge cases, um

31:42

You actually experience.

31:44

experience them. I'll give you a few examples here. You start seeing, you know, yesterday, for the first time ever, we saw some kernel panic. And that only happened because some Fluent Bit worker was creating too many logs, and the scale was too big, and it was all going into one node, and it was happening two times

32:10

at the same time by two different workers. So you see all the systems-level and kernel-level problems. But then I think the craziest stuff is what you start to see with LLMs: these runtimes are pretty immature. Even how we use KV cache is probably a little less sophisticated than most people think, and we are starting to see the limitations of the current and the next set of primitives

32:39

that need to be built from a scale, a security, and a performance perspective. But I think it's really at the runtime level and the systems level. And the edge cases are, I'd say, a lot more systems-level than they are LLM-specific.

32:54

What are the things that keep you up at night? Answer.

32:59

Yeah. Yeah, I think capacity. I think the other one is probably just that this market's so big, and it represents a moment when you should be as aggressive as possible. And, you know, really, we've grown a ton, obviously, over the last 12 months and the last few months.

33:19

But the answer's always just, you know, go bigger, go faster. And I think that's really, really fun. It's also a little exhausting. And we are all in somewhat uncharted territory in terms of how fast and how big you can go and how things can get. But I think the big one is compute. I think there's no world in which there's enough compute to get the amount of value that we want to get out of LLMs in the next five to ten years.

33:45

Or we have to invent a lot of new stuff. Yeah.

⁠¶ Hiring and Leadership

33:48

Yeah.

33:48

Maybe if we just talk a little bit about what you're learning scaling, you know, thirty X is an aggressive thing to go through as a company. You've brought in a lot of really amazing talent, like Danny and Samir and Stephen Day, folks on both the technical and the go-to-market side. What do you think is working about how you are recruiting and scaling, or what's your philosophy on that?

34:19

We were very, very flat until, I don't know, twelve to eighteen months ago. I remember I went on a walk with Elad, actually, and Elad's like, you just need leaders. And it's actually so contrary to everything, you know, as engineers you're like, oh, overhead. Everything is overhead. Um, and I

34:41

You once told me, I think, you're like, Hey Sarah, what about we just have engineers instead of salespeople? Yeah.

34:47

Yeah. We're both over that. But I remember, you know, you said it so clearly at the time, Elad, and I think that's what we've noticed, which is actually having a leadership team that you can trust is so important. I think the two or three things that I'll say is, you want people where you can give them whole problems.

35:14

And so, you know, if you feel like you are micromanaging, if you feel like you have to be involved in everything, I think that's a bit of a cop-out as a founder. Because you're just like, I just need to be involved in everything. It's like, no, you probably just don't have

35:31

the right people. I think the second thing is, be very, very clear what you're optimizing for. Because I think when you're very, very clear what you're optimizing for, the people, and like

35:41

If it's something generic, like we want the smartest, hardworking people, you can't do much with that. With us, what we cared about was, hey, actually we don't care about a lot of people who have done this before. We care about people who think from first principles. Work has to be

35:56

a high priority, but they also have to be very kind and nice and, you know, care about the collaborative environment. We don't have a hero culture. You know, very low ego. And if you need a manager, it's probably not the right place to be. But I think once you have that clear rubric, the people that will fit into it become very apparent, and the people that don't

36:23

fit into it also become very apparent. And we've hired amazing people, like you mentioned, but I think what's a lot more interesting is I think we haven't had a ton of turnover there unnecessarily. People tend to work out. Because we are very clear in what we want early on. It took us a while to get there, though.

⁠¶ Operations Pager Culture

36:44

What about the idea of like an operations culture? You know, we were talking to Alyssa and Henry about this and she's like, Well, the hard thing about cloud is actually just operations. I slept with a pager under my pillow for a decade. I don't think I've seen you detached from your Slack channel.

36:58

Yeah. My phone is buzzing right now. No, no, I'm getting anxious. So, um

37:06

And you've been concerned before, like, do people get it? Like, you know, what is distinctive about that?

37:12

I think, one, if you've worked at an infrastructure company... like, we were once in a meeting with a bunch of AWS execs. And these were very senior AWS folks, and all their pagers went off multiple times during our forty-five-minute meeting. You know, I think it's very much just a cultural thing.

37:33

But yeah, our inference can't go down, and, you know, you learn to, like... Yeah, what was this? I think Amir, my co-founder, when his pager goes off, his seven-year-old said, Is that a P zero? Oh, is that a P zero? And so I think you just have to get used to it, and that's the culture you live in, and it just changes the speed.

38:00

But also it becomes, you know, a cultural thing. I think it rejects people that don't fit into it very, very quickly.

38:10

Like engineers who avoid it.

38:12

Pager duty. Yeah, you know, when we have P zeros, everyone's on the call. Like, there's been a joke that there may as well be a siren that goes off in the office.

⁠¶ Efficiency Drives Demand

38:21

So people have been talking ad nauseam in the AI community about Jevons paradox. Yeah. Where if you decrease the cost of a good, say intelligence as a good, people actually consume more of it. Like, the personal or business ROI of it, the demand for it, goes up, not down. Do you see this? And are you working against yourself trying to make these models more efficient? Do people just use them more or less?

38:54

Yeah, I think you could think about this from a developer's perspective and a consumer's perspective. I think consumers just want the best answers and the best experience, and that's somewhat governed by, yeah, more intelligence to some extent. I think from the developer's perspective, they will insert more intelligence if you make it cheaper. Like, that's, yeah. And they will insert

39:19

more intelligence anyway. But if you make it cheaper, they'll insert a hell of a lot more intelligence. And you see this with agents: agents are just longer-running now. And I think that's what we have seen with the cost of inference going down, which is, you know,

39:35

folks are just like, okay, we can run this for longer, or we can make it do a bit more work and we'll get to a larger end. I think compute scales from an inference perspective as well. And I think we are seeing that with almost all our customers, which is, yeah, they either start with, like, this is the quality event, right,

39:54

I need to get to, and this is the amount of inference I need to do to get there. Or this is the base-level model that I can start with, or that I can work with, to get there. And I think the more we drive down the cost, what they realize is more intelligence just means better user...

40:11

I just want a better answer.

40:12

Better answers, better experiences, more dollars. More dollars, more revenue. So yeah, I think the cost of inference going down just begets more inference. It is truly, like, I think we're kind of in a world where, you know, it is the last market, right? Like, even if there's AGI, all that's left is inference.
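Editor's sketch, not from the speakers: the Jevons-style argument above can be made concrete with a textbook constant-elasticity demand curve, where usage is proportional to price raised to the power of minus the elasticity. The function names, the scale constant `k`, and the elasticity values are illustrative assumptions, not figures from Baseten or its customers.

```python
def tokens_demanded(price, elasticity, k=1.0):
    """Constant-elasticity demand: usage = k * price ** (-elasticity)."""
    return k * price ** (-elasticity)


def total_spend(price, elasticity, k=1.0):
    """Total spend on inference at a given unit price."""
    return price * tokens_demanded(price, elasticity, k)


def spend_rises_when_price_falls(old_price, new_price, elasticity):
    """True when a price cut increases total spend (elastic demand)."""
    return total_spend(new_price, elasticity) > total_spend(old_price, elasticity)
```

With an assumed elasticity of 2, halving the unit price quadruples usage and doubles total spend, which is the "cheaper inference begets more inference" case. With an elasticity below 1, usage still grows but total spend falls, so whether the paradox holds depends on how elastic demand really is.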

40:29

So you do not see in your customers a, like, this answer is enough and this action is enough.

40:39

Yeah, it's gonna keep going for a long time, it looks like. Yeah. How do you view all this evolving towards the future? So basically it seems like it's gonna be one of the biggest markets of all time. Yeah. We have this massive shift

⁠¶ Concierge Everything Future

40:49

where we're moving from software and seats and digitization into actual intelligence, selling units of cognition, selling agentic workflows. What does this all look like in a couple of years? Like what is your view of this future world?

41:01

I think for consumers it's the best possible thing, right? Like, everything is somewhat smarter. You know, you get better care because your doctors have access to better tools. There's all this stuff about there being fewer software engineers, and I think we just build more software.

41:20

Mm-hmm. I think we just build a ton more software. Like, we're not slowing down hiring of software engineers, we're just building more things. And for the consumers, that just means better tools, more software, all those good things.

41:34

It's almost like everybody has their own team for everything, right? You have an agent which helps with your doctor, you have an agent that helps you learn stuff, you have an agent that helps you organize.

41:40

It's concierge.

41:42

Yeah. Yeah. Concierge everything for everyone.

41:44

Yeah. And I think, what that means, so that's amazing. I think that's great. And then education, same thing. You have continued education. You get personalized access to everything. I think then you go one step back to how it affects developers and companies. I think if you don't

42:04

embrace this... Mm-hmm. I think it's the extinction moment for a bunch of folks, which is like, you know, everything needs... and I don't think that means that, you know, Thought design needs figure out. I think that's a thing. I think what's more interesting is just, all these workflow and software companies need to figure out what are the intelligence-inserted versions

42:28

that drive all that user value for those consumers that we talked about.

42:33

Yeah, very exciting. Thank you so much for joining us today.

⁠¶ Conclusion

42:36

There's this.

42:36

🎵 Music

42:38

Find us on Twitter at No Priors Pod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.

This transcript was generated by Metacast using AI and may contain inaccuracies.
