
#199 - OpenAI's o3-mini, Gemini Thinking, Deep Research, s1

Feb 12, 2025 · 2 hr 38 min · Ep. 239

Episode description

Our 199th episode with a summary and discussion of last week's big AI news! Recorded on 02/09/2025

Join our brand new Discord here! https://discord.gg/nTyezGSKwP

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Read our text newsletter and comment on the podcast at https://lastweekin.ai/.

In this episode:

- OpenAI's Deep Research capability launched, allowing models to generate detailed reports after prolonged inference periods, competing directly with Google's Gemini 2.0 reasoning models.
- France and UAE jointly announce plans to build a massive AI data center in France, aiming to become a competitive player within the AI infrastructure landscape.
- Mistral introduces a mobile app, broadening its consumer AI lineup amidst market skepticism about its ability to compete against larger firms like OpenAI and Google.
- Anthropic unveils 'Constitutional Classifiers,' a method showing strong defenses against universal jailbreaks; they also launched a $20K challenge to find weaknesses.

Timestamps + Links:

(01:33:16) Anthropic offers $20,000 to whoever can jailbreak its new AI safety system

Transcript

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. And as always, you can also go to lastweekin.ai for the text newsletter with even more news and also the links. For this episode, I am one of your hosts, Andrey Kurenkov. I studied AI in grad school and I now work in the startup world.

And I'm your other host, Jeremie Harris, co-founder of Gladstone AI, a national security company. And I don't have my usual recording rig; I left it at the other location that I tend to record from. And anyway, we were just talking earlier about how it's like 6:45 AM where Andrey is. He got up at six, or before our 6:30 scheduled recording time anyway. I had baby things that took over for like 15 minutes, and so he is getting up bright and early.

So, so hats off to you, Andrey, for this extra bit of commitment that you're showing to the community, to our listeners, due to your love for the sport. So thank you. Yeah, that's my excuse in case I make any mistakes or say anything wrong in this episode. But I think the last few weeks, it seems like you've had a rough time of it in terms of dealing with all the news coming out of the administration, and it seems like you've had a really busy time at work. So it was much worse for you.

No, I mean, it's been just busy with travel and stuff. So looking forward to it maybe slowing down a bit. But yeah, this morning it was more so the baby stuff. I got handed a baby 'cause my wife needed to do something. And anyway, she's normally so great at supporting me in all these things, I felt it was the bare minimum I could do. But now the baby is handed off. She will not be making a cameo this episode.

But who knows, someday, someday. Well, let us preview what we'll be talking about in this episode. We start this time with tools and apps and have sort of a follow-up to DeepSeek R1, you could say; a lot of companies are jumping on the thinking AI bandwagon. And there's also some other non-LLM news, which is going to be refreshing, plus a lot of funding and hardware and business stories. We do have some interesting open source releases that maybe have gone a bit under the radar.

Then quite a few papers this episode, which will be exciting. Some on reasoning, some on scaling, that sort of thing. And then in policy and safety, actually also a couple of papers and more technical approaches to alignment. So, yeah, kind of not a super heavy episode in terms of big, big stories, but a good variety of things. And I guess nothing as deep as DeepSeek R1 from last week, where we spent like half an hour on it.

But before we get there, do you want to acknowledge some listener comments and reviews? We had a very useful one, good feedback, over at Apple Podcasts: great content, good hosts, horrendous music. And I've seen this take on and off for a little while now. I've been doing this AI music intro where I take the usual song and then make variations of it. Not everyone's a fan. Some people seem to like it. So I'll consider maybe rolling it back.

It also takes a while, you know; surprisingly, it actually takes some work for me to do it. So maybe I'll just stop. Yeah. And then we did have one question on the Discord that I want to forward. There was Mike C asking: it seems like you made an offhand comment a while ago about RAG, Retrieval Augmented Generation. Yeah, Retrieval Augmented Generation. And yeah, you maybe were not too big a fan of it.

And so this comment is basically following up, seeming to ask: what is the state of RAG? Is it still around? Yeah. So I wouldn't say I'm not a fan of it. It's more that I view it as transient. So yeah, it's not something that I think will be with us forever. So one of the things with RAG is that, and I'm just flipping to the comment now, it says: I've certainly experienced diminished need for RAG, yeah,

with larger context windows with Gemini, but that doesn't address large data sets that might be hundreds of millions of tokens or more. And this is actually exactly the case that I've been making to myself. So yeah, I expect that you will continue to see diminished need for RAG with larger context windows, and ultimately we'd have to assume that would include large data sets regardless of size.

And so one of the really counterintuitive things about the pace of progress, and the way that progress is happening in this space, is that it is exponential. And so today it might make sense to say, well, you know, hundreds of millions of tokens, surely that's not accessible to the context windows of non-RAG systems. And the only thing I would say to that is like, yeah, but just give it time, right?

We're riding so many compounding exponentials right now that ultimately, yeah, expect everything to reduce to just, you know, querying within context. I should be clear, that's like one mode of interaction that I expect to be quite important, but you could still have specialized RAG systems that are just designed for low cost, right? Again, that's more of a transient thing.

Like you could imagine a situation where, sure, the full version of GPT-5 can just, you know, one-shot queries just in context without any RAG stuff. But maybe that costs a lot because you're using this model with a larger context window; it's having to actually use the full context window for its reasoning and all that. Whereas, you know, maybe you can have a smaller purpose-built system for RAG.

That's already sort of that mesoscopic middle ground. That's sort of where we are right now in a lot of applications already. There are still other applications that are totally impossible with just using what's in context, so it's not just a price thing. But anyway, that's kind of the way I see it.

I mean, I think we're heading inexorably towards a world where the cost of compute and the size of the context windows tend towards zero and infinity respectively, right? So, cost of compute: if it's way overpriced for your application right now, wait a couple of years. If the context window is way too small for your application right now, wait a couple of years. Both those problems get solved with scale. That's kind of my prediction.

Yeah. Yeah. So at the limit, it seems like you might as well just throw it all into the input and not worry about it. Exactly. Because one of the problems of RAG, and the reason you might see it go away, is that it was these finicky extra things to worry about: how do you embed your query, how do you create your data set, and so on and so on. So if you can just put it all in the input context and the LLM just, like, does well, then why even bother with RAG, right?
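To make those "finicky extra things" concrete, here's a minimal, purely illustrative sketch of a RAG pipeline versus just stuffing everything into the context. The `embed` function is a toy bag-of-words hash standing in for a real embedding model, and the documents are made up:

```python
# Minimal sketch: the extra machinery RAG adds (embedding, retrieval) versus
# the long-context alternative of putting everything in the prompt.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: hash each word into a bucket and count (illustrative only)."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

documents = [
    "Kickers and punters average roughly 4.87 seasons in the NFL.",
    "The league-wide average NFL career is about 3.3 years.",
    "Mobile penetration has grown steadily over the last decade.",
]

def rag_prompt(query: str, docs: list[str], k: int = 2) -> str:
    # 1) embed the corpus and the query, 2) retrieve top-k by cosine similarity,
    # 3) stuff only those chunks into the prompt.
    doc_vecs = np.stack([embed(d) for d in docs])
    scores = doc_vecs @ embed(query)
    top_k = [docs[i] for i in np.argsort(-scores)[:k]]
    return "Context:\n" + "\n".join(top_k) + f"\n\nQuestion: {query}"

def long_context_prompt(query: str, docs: list[str]) -> str:
    # The "just throw it all in" alternative: no embeddings, no retrieval step.
    return "Context:\n" + "\n".join(docs) + f"\n\nQuestion: {query}"

print(rag_prompt("How long do NFL kickers play?", documents))
```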

Aside from engineering considerations. So yeah, in that sense, it's a fair comment. Certainly it'll become less important for smaller data sets; it used to be you couldn't do a million-token input and expect it to be useful, and now, at least in some cases, you certainly can. Right. Alrighty, well, thank you for the question, glad to go into some of that discussion. Now we shall move to the news, starting with tools and apps, and we begin with o3-mini.

So just after our last recorded episode at the end of January, OpenAI did announce the release of o3-mini, which is the latest entry in their reasoning LLM series starting with o1. I don't remember, o2 was copyrighted or something, so we skip to o3, and now we also have o3-mini, which does really well at smaller scale, lower cost, faster inference, all of that sort of thing. It's actually, to me, quite impressive, the kind of jump in performance we've seen.

So on things like FrontierMath, which we've covered, a very, very challenging math benchmark created with actual mathematicians writing these problems with answers that aren't out there, o3-mini is outperforming o1 and is able to answer roughly 9 percent of the questions in one try, and one fifth of them in eight tries. So still, you know, it takes a ton of repetitions for some tough problems, but the reasoning models are improving fast.

It used to be you could only get like 6 percent or 12 percent with o1. And then on the app front, everyone can start using it; that was kind of a wide release. We did see a preview earlier. Mimicking R1 to some extent, OpenAI did give a more detailed peek into the reasoning process. So as you do the query now, you'll see

the longer, more detailed summaries of intermediate steps that it takes, which, you know, certainly you could argue is in part in response to R1 and how that displays the reasoning for you. Yeah. And I think we've talked about o3 in terms of some of the benchmarks that were published early, before the model was released. You can go back and check out that episode.

But one of the things we will go into a little bit today: there's a paper with a kind of new benchmark probing at just the pure reasoning side of things. So trying to abstract away the world knowledge piece from the reasoning piece, which is really hard to do, and just look at, like, how well can this thing solve puzzles that don't require it to know, necessarily, you know, what the capital of Mongolia is, that sort of thing.

And on that, one of the interesting things is they got some indications that maybe o3-mini is pretty well on par with o1, and maybe a little bit behind o1 in terms of general knowledge reasoning, which was kind of interesting. But in any case, we'll dive into that a little bit. It certainly is a big, big advance.

And of course that is the mini version relative to the full o1, so that's maybe not so surprising in that sense. But yeah, really, really powerful model, really impressive, and very quick and cheap for the amount of intelligence that it packs. We are getting, as you said, this announcement that, hey, we're going to let you see into what they call o3-mini's thought process. This is a reversal, right? Of OpenAI's previous position.

So famously, when o1 was first launched, you couldn't see the full chain of thought. You were seeing some kind of distillation of it, but a very short one. Now OpenAI is coming out and saying that they've, quote, found a balance, and that o3-mini can think freely and then organize its thoughts into more detailed summaries than they'd been showing before. You might recall the reason they weren't sharing those summaries previously.

They cited, part of it at least was, competitive reasons, right? Presumably they don't want people distilling their own models on the chains of thought that o1 was generating. Now maybe they're taking a bit of a step back, but not completely. What they are saying is, look, the full chain of thought is still going to be obfuscated. That is, by the way, really difficult to understand anyway. A lot of the writing there is garbled, maybe a bit incoherent to humans, which we talked about,

I think, on a previous episode, along with the safety risks associated with a model like that, where increasingly, yeah, you should expect it to be reasoning in kind of opaque, non-human-interpretable ways, as we saw with R1-Zero. That's sort of a convergent thing we should expect. But in any case, they're generating summaries on top of that. Those summaries are just longer than the summaries that they'd previously been producing with o1.

So presumably this means that to some degree they're willing to relax a little bit on sharing those chains of thought; presumably they think there's kind of a diminishing value in protecting those chains of thought from a replication and competition standpoint. One other thing that this points to, and this is increasingly becoming clear: you could have fully open source frontier AI models today.

And because of inference-time compute, the win would still go to the companies that have the most inference-time hardware to throw at the problem, right? Like, if you have a copy of R1 and I have a copy of R1, but you have 10 times more inference FLOPs to throw at it because you own way more data centers, way more hardware, well, you're going to be able to get way better performance out of your model.

And so there's a sense in which even if you open source these models, you know, competitiveness is now rooted more than ever in AI hardware infrastructure. And that's, I think, what's actually behind a lot of Sam Altman's seemingly highly generous statements about being on the wrong side of history

with open source. It's like, it's awfully convenient that those are coming just at the moment where we're learning that inference-time compute means, you know, it's not like you have a gun and I have a gun; if I have a bigger data center, my gun is just bigger. So I can open source that model and I don't give away my entire advantage. I don't want to lean into that too hard. There is absolutely value in the algorithmic insights.

There's value in the models themselves. Huge value, really. But it's just that we have a bit of a rebalancing in the strategic landscape. It's not decisive. It's still absolutely critical from a national security standpoint that we lock these models down, that the CCP not get access to them, blah, blah, blah. But there is this kind of structural infrastructure component where, you know, if I can't steal your data center, but I can steal your model,

I haven't stolen your whole advantage in the way that might've been the case looking back at, you know, GPT, or the days of just the pre-training paradigm. So kind of interesting, and maybe reflected somewhat in this announcement. Yeah, that's an interesting point. We've seen, I guess, a shift now, right? Going back to ChatGPT, when it first hit the scene, a lot of these companies, Meta and Google, were kind of caught a bit behind, right?

Not prepared for this era of trying to train these models, first of all, and also to compete on serving them. And it seems like now they've started to catch up, or at least the competition is now definitely about who can lead the pack in terms of infrastructure on the model side. Yeah, OpenAI and Anthropic are still the best ones, but again, the actual differences between the models in many cases, on many problems, are not that significant.

And the difference now, as you said, is with the reasoning models, where being able to just throw compute at it is a big part of it. So, o3-mini, I would say pretty exciting for ChatGPT users; competition, as ever, is good for the consumer. So that's nice. As a ChatGPT user, I'm excited. And the next story is also about a reasoning model, but this one from Google. And so this is part of their Gemini 2.0 rollout. There have been several models.

We have Gemini 2.0 Flash, which is their smaller, faster model. They also have Gemini 2.0 Pro, which is their big model, which isn't that impressive, surprisingly; not kind of as performant or as big a splash as you might think. And then they also have Gemini 2.0 Flash Thinking as part of the rollout. So that is their answer to o1. Seemingly, it's the one that's better at reasoning, better at the kind of task you might throw at o1.

From what I've seen, I haven't seen too many evaluations; it doesn't seem to be competing too much in the landscape of this whole we-have-a-really-smart-LLM-that-can-do-intermediate-thinking thing. But yeah, a whole bunch of models, including also 2.0 Flash-Lite from Google. So they're certainly rolling out quite a few models for developers and for users of Gemini. Yeah, this is really consistent, just at a macro level, with

what's happening with reasoning models, right? They're so inference-heavy that it just makes so much more economic sense to focus on making a model that's cheap to inference, right? So you see this emphasis on the Flash models, the Flash-Lite models. So, you know, Flash was supposed to already be the quick, cheap model. Now there's Flash-Lite, so extra cheap, extra, extra quick.

If you're taking one model and you're going to be running inference on it a lot, if it's a big model, it's going to cost you a lot of money. The larger the number of parameters, the more compute has to go into inferencing. And that's where you're also starting to see a lot of heavy-duty overtraining, right? So by overtraining, what we generally mean is: for a given model size, there's a certain amount of compute that is optimal

to invest in the training of that model to get it to perform optimally, right? Now, if you make that model smaller, the amount of compute and the amount of data that you should train it on, according to the scaling laws, to get the best performance for your buck is going to drop, right?

'Cause anyway, those three things are typically correlated. But if you know that you're going to be inferencing the crap out of that model, you might actually be fine with making it extra small and then overtraining it, training it on more data with more compute than you otherwise would. It's not going to perform as well as it would have if the model were bigger with more parameters, given the compute budget. But that's not the point.

The point is to make a really compact, overtrained model that's really cheap to run but has as much intelligence as you can pack into a small number of parameters. That's really what you're starting to see with, you know, Flash-Lite and so on, just because of the insane inferencing workloads these things are being called upon to run. And so that's part of Google diving into this.
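A rough back-of-envelope sketch of the overtraining trade-off being described, assuming the common C ≈ 6·N·D training-compute approximation and the Chinchilla ~20 tokens-per-parameter heuristic; the model sizes below are illustrative, not any lab's actual recipe:

```python
# "Overtraining": for a fixed training compute budget, you can either train a
# Chinchilla-optimal model or a much smaller one on far more tokens.
# Approximations used: training FLOPs ~ 6*N*D, Chinchilla-optimal D ~ 20*N.

def train_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens               # standard dense-transformer estimate

budget = train_flops(70e9, 20 * 70e9)          # compute for a Chinchilla-optimal 70B model

# Option A: compute-optimal -- bigger model, ~20 tokens per parameter.
optimal_params = 70e9
optimal_tokens = budget / (6.0 * optimal_params)

# Option B: overtrained -- shrink the model 10x and pour the same compute into
# many more tokens. Slightly worse quality per training FLOP, but ~10x cheaper
# to serve, since inference cost per token scales with parameter count.
small_params = 7e9
small_tokens = budget / (6.0 * small_params)

print(f"Optimal:     {optimal_params/1e9:.0f}B params, {optimal_tokens/1e9:.0f}B tokens "
      f"(~{optimal_tokens/optimal_params:.0f} tok/param)")
print(f"Overtrained: {small_params/1e9:.0f}B params, {small_tokens/1e9:.0f}B tokens "
      f"(~{small_tokens/small_params:.0f} tok/param)")
print(f"Relative inference cost per token: {small_params/optimal_params:.1f}x")
```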

Obviously, Google's also had the release of their Deep Research model too, which came out the same day as OpenAI's Deep Research. You know, we'll talk about that. But it's sort of interesting: Google's actually had some pretty decent models. And I will say I use Google Gemini quite a bit,

Gemini 2.0, partly because it's the only one. Pro tip, by the way: if you're looking to write anything that you're going to get paid for, or that you need to be able to use professionally with a permissive license, Gemini 2.0, as far as I know, is the only model I was able to find where the licensing terms allow you to use the outputs commercially without attribution and so on and so forth. And so you're sort of forced into it.

So even though I find Claude works best for that sort of work, Gemini 2.0 has done the trick and improved a lot. So Google has great products in this direction. Interesting challenge: people don't seem to know about them, right? OpenAI has the far greater advantage on splashiness. And then when you get into the niche area of, like, people who really know their LLMs, they tend to gravitate towards Claude for a lot of things.

And so yeah, they're sort of between a rock and a hard place. And I don't know if the solution is quite marketing, but I don't know. I've been surprised at sort of how little the Gemini 2.0 models, at least as far as I've seen, get used given how useful I've found them to be. Yeah. I think their current approach seems to be competing on the pricing front. They do seem quite a bit cheaper. And of course they do have a bunch of infrastructure.

I'm sure some people are using their cloud offerings that are in this sort of Google dev environment. So yeah, I think another example where, you know, we hear about them less, or certainly they seem to be getting less credit as a competitor with Gemini, but they're continuing to kind of chip away at it. And from what I've seen, a lot of people who use Gemini 2.0 Flash are pretty positive on it.

And moving on, as you mentioned, there was another kind of exciting thing that came out; a bit, you know, not quite as big a deal as o1 and R1, but to me also very interesting, and that was Deep Research. So this is a new feature that came out in both Gemini and OpenAI's ChatGPT at about the same time. The idea being that you can enter a query and the LLM will then take some time to compile the inputs to be able to produce a more detailed, more informed output.

So in OpenAI's version, for example, you can input a query and then it can go think for like 5 to 30 minutes, and you just have to leave it running in the background, so to speak, and then eventually it'll come back to you with a quite long report, pretty much, about your question. And so this is, yeah, a bit of a different paradigm of reasoning, of output, that we haven't seen up to now.

And a lot of people seem to be saying that this is very significant or very impressive in terms of this being actually something new, something that agents are capable of doing, that otherwise you would have had consultants or professionals doing for you. Yeah. This is actually a pretty wild one. So I watched the live launch when it came out, because OpenAI posted a tweet saying, hey, we're launching it right now, and saw the demo. It is pretty wild.

It's also the kind of thing that I've seen people talk about as, like, the demos kind of live up to the hype. So the idea here is, because you're having the model take five to 30 minutes to do its work, you actually are going to step away. So you'll get a notification when the research is done,

and then you end up, as you said, with this report. So it's kind of this new user experience problem that they have to solve where, you know, you're just letting the thing run, and then, like a microwave that's finished nuking the food, it'll kind of let you know. It is pretty impressive. So just to give you some examples of the kinds of research queries that you might ask, these are taken from OpenAI's "Introducing deep research" post. But these are, like, potentially highly technical things.

So, for example: help me find iOS and Android adoption rates, the percent of people who want to learn another language, and the change in mobile penetration over the last 10 years for the top 10 developed and top 10 developing countries by GDP; lay this info out in a table, separate stats into columns, and include recommendations on markets to target for a new iOS translation app from ChatGPT, focusing on markets ChatGPT is currently active in.

So one thing to notice about at least that query is you're looking at something that is pretty detailed. You yourself have a general sense of how you want to tackle the problem. So it's not fully open-ended in that sense, but this does reflect the way that a lot of people operate in their jobs, right? You already have the full context you need, if you could hand it off to a very competent graduate student or intern, or somebody with like a year or two of experience, and have it executed.

This is the sort of thing it might do, right? I mean, you know the problem, you don't know the solution, and, anyway, it's quite impressive. They give examples from medical research queries, UX design, shopping, and general knowledge. A question as simple as: what's the average retirement age for NFL kickers, right? Which, you wouldn't necessarily expect that number as such to appear anywhere on the internet.

It does require a couple of independent searches and then the aggregation of data and processing it a little bit. So that's pretty cool. And I'll just read out the answer to that question too, because it is pretty indicative of the level of detail you get here. So it says, you know, determining the exact retirement age for NFL kickers is challenging, blah, blah, blah. However, kickers generally enjoy longer careers compared to other positions in the NFL.

The average career length for kickers and punters is approximately 4.87 years, which is notably higher than the league-wide average of 3.3 years. So it's doing some ancillary work there, right? There's a lot going on here.

So anyway, the last thing I'll just mention here, other than to say this is a really qualitatively impressive thing: quantitatively, on Humanity's Last Exam, the sort of famous benchmark that Dan Hendrycks put out recently, I think it's part of CAIS, the Center for AI Safety, maybe, or, you know, anyway, it's something that Dan Hendrycks is working on. So this is looking at essentially a very wide range of expert-level questions.

You know, we talked about this benchmark previously, but essentially it's meant to be really, really fucking hard, right? Think of this as, like, GPQA on steroids. In fact, what you do see with older models: GPT-4o just scores 3 percent on it, Grok 2 scores 4 percent, Claude 3.5 Sonnet scores 4 percent, and so on. o1 scored 9.1 percent. So that had people going like, okay, you know, maybe we're getting liftoff. OpenAI Deep Research: 26.6 percent.

I mean, it is genuinely getting hard to come up with benchmarks right now that these models don't just shatter right out of the box. The half-life of benchmarks in this space is getting shorter and shorter and shorter. And I think that's just a reflection of, you know, how far along we are on the path to AGI and superintelligence. And so, you know, you expect this kind of hyperbolic progress; it's just interesting to see it play out on the level of these quantitative benchmarks. Right.

So, yeah, a lot going on with this. Maybe once you peel back the layers, I think it's interesting that this is, I would say, the first demonstration of agents actually being an important thing. So the idea of agents, of course, is basically that you tell it to do a thing and then it does it for you on its own, sort of; you don't need to tell it to do every single step, and then it comes back to you with a solution, right? And so with o1, et cetera, right,

that was kind of thinking through it and arguably doing a series of steps. But here it's browsing the web, it's looking at websites, and then, in response to whatever it's seeing, it can go off and do other searches and find other information. So it very much is, kind of, you know, getting this agent to go and do its own thing and then eventually come back to you.

And so, you know, in the past, we've seen a lot of examples of "book a ticket for me to go to X and Y and Z," which never seemed that promising. This actually is a very impressive demonstration of how agents could be a game changer. Unfortunately, you do have to be paying for the $200-a-month tier of ChatGPT to be able to try it.

So not many people, I guess, are going to be able to. And just to compare between ChatGPT and Gemini, my impression is ChatGPT goes a bit deeper; it does get you a well-researched and more thorough response. But I've also seen some coworkers of mine who used it find that Gemini Deep Research is able to answer questions quite well, better than things like web search. Yeah, it is pretty remarkable.

It also invites new kinds of scaling curves, or invites us to think about new kinds of scaling curves. There is one I think that's worth calling out. So they look at max tool calls, the maximum number of tool calls that they have the model perform as it's doing its research, versus the pass rate on the tasks that it's working on. And what's really interesting is you see a kind of S-curve.

So, early on, if it does relatively few tool calls, performance is really bad. Its performance starts to improve quite quickly as you increase the maximum number of tool calls from there. So calls to different APIs. But then it starts to saturate and flatten out towards the end, right? So as you get to around 80 or 100 max tool calls, there's no longer very steep improvement. And this itself is kind of,

I mean, assuming that the problems are solvable, this is the curve to watch, or at least a curve to watch, right? The more the model's calling its tools, essentially the more inference-time compute it's applying, the more times it's going through this thought process of, like, okay, what's the next step? What tool do I now need to use to get the next piece of information I need to move along in this problem-solving process?

And, you know, you can imagine a key measurement of progress towards agentic AI systems would be: how long can you keep that curve steepening? How long can you keep that curve going up before it starts to plateau? That really is going to be directly correlated with the length of the tasks that you can have these models go out and solve, right? So they talk here about 5-to-30-minute tasks. What's the next beat? How do you get these systems to go off and think for two hours, five hours?
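Purely as an illustration of the kind of saturating curve being described here, you could model pass rate as a logistic function of the tool-call budget; the parameters below are invented for the sketch, not OpenAI's reported numbers:

```python
# Illustrative S-curve: pass rate vs. max tool-call budget.
import math

def pass_rate(max_tool_calls: int, midpoint: float = 30.0, steepness: float = 0.12,
              ceiling: float = 0.7) -> float:
    """Logistic curve: slow start, steep middle, plateau near `ceiling`."""
    return ceiling / (1.0 + math.exp(-steepness * (max_tool_calls - midpoint)))

for budget in [5, 20, 40, 80, 100]:
    print(f"max tool calls = {budget:3d} -> pass rate ~ {pass_rate(budget):.2f}")

# The KPI described above is, roughly, how far you can push the midpoint and
# ceiling out before the curve flattens.
```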

And I think at that point you're already really getting into the territory where this is absolutely already accelerating AI research at OpenAI itself. I guarantee you these kinds of systems, and better versions of these systems with fewer caps on inference-time compute, obviously, are being used to accelerate their internal research. That is something that we've absolutely heard, and it makes all the sense in the world economically.

So this could yield compounding returns surprisingly fast. This is, I think, behind a lot of Sam Altman's statements as well, about, you know, "I think fast takeoff is actually more likely than I once thought," and so on and so forth. So, you know, this is quite interesting. It's got a lot of implications for national security. It's got a lot of implications for the ability to control these systems, which is dubious at best right now.

And so we'll see where these curves go, but I think it's definitely a kind of KPI that people need to be tracking very closely. It's got a lot of implications. Well, we've talked about OpenAI, we've talked about Google. Let's go to some other companies, starting with Mistral, the LLM developer out of Europe, you could say, and one that is trying to compete at least on the frontier model training front with models like Mistral Large.

And as of a few months ago, they're also throwing their hat into the consumer product arena with Le Chat, and then they have a separate thing for assistants. Anyway, they have released a mobile app now on iOS and Android and also introduced a paid tier at $15 a month. So they're continuing to push into trying to get people to, I guess, adopt Le Chat as an alternative to ChatGPT. As we've kind of said in the past when discussing it,

I'm not sure how easy it will be for them to compete on this front. And it's interesting that they're trying to fight in this area that's very much about pricing and speed and so on, where Anthropic and OpenAI have, let's say, some advantages.

Yeah, I'll kind of pre-register my standard and by now very tired prediction that Mistral is going to be in as much trouble as Cohere, these sort of mid-cap companies that are going to go under ultimately, I think. They may yet prove me wrong, and I may sound very stupid in retrospect. Yeah, I mean, look, these are mid-cap companies that are competing with the big boys, right?

Companies that are raising like tens of billions of dollars for AI infrastructure and that have brand recognition coming up their butts, right? And so the advantage here is, if you have really, really good brand recognition, if you are just the default choice for more people, then you get to amortize the cost of inference across a much larger fleet of AI hardware. It lets you build at scale in a way that you simply can't if you're Mistral or any other company,

and that just means that competing with you on inference costs is a bad idea. And this is doubly true for Mistral, because so many of their products are open source, which means they're literally just pricing based on the convenience that you gain by not having to deploy the model yourself. So their margins come entirely from, you know, how much easier is it for them to just run the hardware rather than you? That turns them basically into an engineering company.

It makes it really difficult to develop the margins you need to reinvest. Now, with Mistral, they are kind of French national champions, so there's maybe some amount of support up there for them to grab. It's not going to be that much, and France doesn't have the resources to throw at this that the capital base of the United States has. So I think ultimately this ends in tears, but in the meantime, some VC dollars are going to be lit on fire.

And I think there are going to be some people who are really excited about the implications of having another competitor in the race. But I think this is, again, kind of like Cohere, you know, startups that sound like great ideas and are easy to get funded because of the pedigree of the founders.

But when you actually step back and look at the fundamentals, the infrastructure race that's really underpinning all this, again, there's a reason that Sam Altman started to talk about, like, hey, we're on the wrong side of history with open source. It's not because he thinks companies like Mistral have some sort of advantage. It's because he thinks he has an advantage with his giant fleet of inference-time compute. So yeah, I mean, we'll see.

I mean, I fully expect to sound like an idiot for any number of reasons, but I do like to make these predictions explicitly because at the very least it keeps me honest if I end up being wrong. A similar development in a long line of these developments from Mistral, at least from where I'm standing. Right, we should acknowledge they do have one differentiator they are highlighting, which is their ability to get really fast inference.

So they are claiming they can get 1,100 tokens per second, which is roughly 10 times faster than Sonnet 3.5, GPT-4o, R1, et cetera. Apparently they partnered with Cerebras for some cutting-edge hardware specifically for this level of inference. That's very, very fast; a thousand, the article says words per second, unclear if it's tokens or words, but regardless, if you need that, it could be one reason you might adopt Mistral.

Yeah, it is unclear to me how lasting of an advantage that is, given that, if you look at OpenAI, for example, they literally have partnerships going to build out custom ASICs for their systems, right? So, like, you know, expect any advantages like these to wear out pretty quickly. But still, it'll be cool to have, at the very least, lightning-fast inference. It's not the first time we've seen numbers like this as well, right?

The Groq chips running similar models have shown similar results. There's always controversy around how much that actually means, given, anyway, details like the number of queries you can serve simultaneously given the way these chips are set up, and so on. There are always asterisks; you never get a giant leap like this quite for free. It'll be interesting, and hey, hopefully for them, they prove me wrong and I'm eating my words in a few weeks.

Moving away from LLMs, a couple of stories on other types of AI, starting with AI music generation and the startup Riffusion, which is now entering that realm. They have launched their service in public beta, and similar to Udio and Suno, they allow you to generate full-length songs from text prompts, as well as audio and visual prompts. So interesting to see another competitor in this space, because

Suno and Udio, to my knowledge, have been pretty much the two players, and have made it so you can get very close to human-indistinguishable music generation, at least if you use it a bit. So Riffusion is offering something along those lines. They do say that they are collaborating with human artists through this trusted artist agreement, which gives access to artists. So another entry in the AI-to-song space, I guess.

Yeah. So it seems like this trusted artist agreement is kind of one of the most interesting parts of this, right? I mean, what precedent are we setting for the exchange of value here? Right, when you're setting this up, a big challenge too is that the vast majority of highly talented artists just aren't discovered. And so they make next to no money, which means they're very easy to pick off

if you're an AI company looking for people to, you know, for even a modest amount of money, help you support the development of your system. So you don't necessarily have to have a deal with Taylor Swift to train your model to produce really, really good quality music, I guess. So the deal here, apparently, is the trusted artist agreement gives artists early access to new features and products in return for their feedback. So, unclear to me how much value there is there.

It's great to have more tools in your arsenal, but from a tragedy-of-the-commons standpoint, basically you're helping to automate away this whole space. So it's kind of an interesting trade-off, right? And I will say, you know, we won't get into

a whole new story about it, but I have seen some coverage of Spotify apparently starting to fill out some of their popular playlists, like kind of lo-fi chill hip hop, et cetera, with seemingly some generated music not coming from human artists. And as a result, human artists are starting to lose some money.

So it seems like AI music generation hasn't had a huge impact on the industry yet, but it also seems like it is definitely coming, given that you can now do some very high quality generations. And the last story, going to video generation and Pika Labs: they have now introduced a fun new feature called Pika Additions, I guess that's how you say it. So this is coming within the Pika turbo model, and it allows you to insert objects from images into videos.

So they had some fun examples of it, where, you know, you might have a video of you just doing some normal stuff, and then you can insert some animal or some other actor or whatever, insert a mascot, and it looks pretty, you know, at least in some cases, realistic, and makes it much easier to alter your video in some interesting ways.

Yeah, this is the kind of thing I could imagine being quite useful for, you know, making commercials and stuff like that, because, anyway, some of the demos that they have are really impressive. Obviously demos are demos and all that, but we're certainly seeing things that approach anything we've seen with Sora. So yeah, really cool. And Pika Labs seems to keep pumping out relevant stuff despite all the competition in this space. So kind of cool.

Yeah. I think one of the big questions with video generation has always been how you actually make it useful in practice, right? Sure, you can get a short clip from some text description, but is that something people need? And Pika is

introducing more and more of these, you know, different ways to use video generation that aren't just "generate a clip," and examples like this, where you are taking a clip you have and then essentially doing some VFX additions to it as you would with CGI previously. Personally, I think that is a much more promising way to commercialize video generation: basically cheaper, easier VFX and CGI. Well, I guess it is computer generated, so you could call it CGI.

But yeah, the clips are fun, so you should check them out. And moving on to applications and business, we begin yet again with OpenAI and OpenAI fundraising news. And it's about how SoftBank wants to give OpenAI even more money. So SoftBank is saying they will, or at least seeming like they're planning to, maybe invest $40 billion in OpenAI over the next couple of years. That's at a very high valuation of $260 billion

pre-money. This is also in the midst of an agreement to bring OpenAI tech to Japan, which is where SoftBank came out of. So yes, I guess SoftBank is now the biggest backer of OpenAI. They're of course one of the players in the Stargate venture, and they seem to really be banking on OpenAI continuing to be a top player. Yeah, for sure. And I think, you know, the article says they're the biggest backer. It's a little unclear to me.

They must mean in dollar terms and not equity terms, because Microsoft, you know, famously owns around 49 percent of OpenAI, or had, so there's no way that the dollars SoftBank has put in so far add up to more than that equity. But on a dollar-denominated basis, certainly that has to be the case, right, if they're putting in $40 billion here. So yeah, really interesting. SoftBank's Masayoshi Son in particular seems to be a really big Sam Altman fan.

Which is also interesting because he's, you know, not particularly dialed in on even the most basic AI control issues. There's a panel he was on with Sam Altman where he turns to Sam at one point and he's like, yeah, so, obviously this concern over losing control of these systems makes no sense, because we're made of protein, and why would an AI system ever want to, like, you know, they're not made of protein.

They don't eat protein. So, Sam Altman is forced to sheepishly look back at him. And obviously he doesn't want to contradict him in the moment, 'cause he's raising $40 billion from this guy. But he also knows, and the entire technical field at least knows that he knows, that the real answer is a lot more nuanced than that.

But that was very kind of clarifying and illustrative, I think, for a lot of people: that there are some just embarrassingly fundamental things that SoftBank is missing here. It's also worth flagging that SoftBank is not just a kind of Japanese fund. So there's a lot of sovereign wealth money there, in particular Saudi money, that makes up,

it's hard to know, like, I don't know the ratio off the top of my head, but it could easily make up the lion's share, as we've talked about, of the funds that they put in. So it's a very weird fund, and an interesting choice for Sam Altman to partner so deeply with somebody who misses a lot of the technical fundamentals behind the technology. Not hard to get the big scaling story, of course. Backing OpenAI is an obviously good move if you believe in that scaling story, and I certainly do.

But it's an interesting choice, especially on the back of the partnerships with, you know, Satya and Microsoft, which are very, very knowledgeable, technically knowledgeable and technically capable investors. So that's very interesting. Possibly you could view it as Sam starting to build optionality. This moves him away, right, from Microsoft, from dependency on them, and gives him, you know, two people now to sort of play off each other, between Satya and Masayoshi Son,

and so a little bit more leverage for him. We do know that a tranche of this $40 billion investment is being filed under the Stargate investment. And so in a sense it's kind of like there's $15 billion to Stargate, and then maybe the balance over to OpenAI itself; a little unclear how that shakes out. But one last thing to flag here too: at this point, OpenAI is raising funds from sovereign wealth funds, right? That's what this is.

And you know, for the reasons we talked about, that is the last stage. There's no more giant pot of money waiting for you if you're trying to fundraise and remain a privately held company; the sovereign wealth funds are the last stop on the giant money train when you're raising tens of billions of dollars. And so what this tells us is that either OpenAI expects to go public and be able to somehow raise even more money, or

they expect that they're going to start to generate a positive ROI from their research, or hit superintelligence, fairly soon. Like, this is very consistent with a short-timelines view of the world, because again, there's nothing more to be done after this; this is it, right? If you're raising from sovereign wealth funds, maybe you can go directly to, you know, Saudi Arabia, the UAE, or something else, but ultimately you're kind of tapped out.

And this tells us, I think, a lot about AI timelines in a way that I'm not sure the ecosystem has fully processed. So interesting for a lot of different reasons, and it definitely gives Sam another arrow in his quiver in terms of the relationship with Satya and how to manage that. Just a few more details on this business story, I guess, with SoftBank. There are some additional aspects to it.

They are saying they'll develop something called Cristal Intelligence together; SoftBank and OpenAI are partnering for it. Kind of ambiguous what it actually is. The general pitch is it's customized AI for enterprise, and that's kind of all there is in the announcement about it. There's also another aspect to this, which is SB OpenAI Japan, which will be half owned by OpenAI and half owned by SoftBank. SoftBank is also going to be paying $3 billion

annually to deploy OpenAI solutions across its companies, and that's separate from the investment. So lots of initiatives going on here: SoftBank, I guess, really banking on OpenAI, and the two of them working together on various ventures. Next up, yet again talking about data centers, this time in France, with the UAE planning to invest billions of euros, actually tens of billions of euros,

on building a massive AI data center in France. There's an agreement that was signed between France's minister for Europe and foreign affairs and the CEO of Mubadala Investment Company. So yeah, it's already planned for. And I think we haven't covered too many stories on this front; we've seen a lot of movement across the US of companies trying to make these massive data center moves, and now it seems to be starting to happen in Europe as well. Yeah, this is interesting, right?

And it is a relevant scale. So we're talking here about one gigawatt, or up to one gigawatt, of capacity. They anticipate the spend to be about 30 to 50 billion euros, which tracks. Just for context though, one gigawatt: you know, the big Amazon data center that was announced a few months ago is like 960 megawatts. So, you know, we're basically already at that scale. You've got Meta dipping into one-to-two-gigawatt-scale sites. You have, likewise, you know, OpenAI with

plans for multiple gigawatt-level sites. And that's all for like the 2027 era. So it is going to be something, but you have individual hyperscalers in just the United States tapping into low multiples of that scale, like one to two gigawatts, say. So it's important; it's the bare minimum, I would say, of what France would need to remain competitive or relevant here, but it's not clear what that really buys you in the long run,

unless you can continue to attract, you know, 10x that scale of investment down the line for the next beat of scale, or at least five or two X or something. And one thing to flag too: because it's the UAE, you better believe Sheikh Tahnoun bin Zayed Al Nahyan is going to be part of this. So he is the head of G42. We've talked about them an awful lot. Also, MGX is basically G42 in a trench coat.

MGX is the fund that invested in Stargate. So this is really, this guy is sort of a big national security figure in the UAE, I think the national security advisor to their leader; that's certainly his role. And these investments are happening all over the West, not just the United States now. And these are big dollar figures. So kind of interesting; again, remains to be seen how relevant this will actually be.

But we are being told that it is the largest cluster in Europe dedicated to AI. So that's something, but it is, again, Europe. Europe struggles with kind of R&D and CapEx spend. And so it's interesting to see them keep up, at least at the one gigawatt scale. And speaking of big money, the next story is about another fundraise

from, let's say, a formerly OpenAI-affiliated person, the former chief scientist, Ilya Sutskever. We saw him leave and start Safe Superintelligence last year. They had raised $1 billion at the time with kind of nothing out there as far as products go. And now they are in talks to raise more money at four times the valuation. That's all we know. So no stories as to, I guess, any developments, any work going on at SSI, but somehow it seems they are set to get more funding.

Yeah, this is more of a reminder here, but they have some wicked good investors, right? Sequoia, pretty much top of the line. And then there's Daniel Gross, famous for doing stuff at Apple and then being a partner at Y Combinator. There's, like, a lot of secrecy around this one, I've got to say. It sort of reminds me of Mira Murati's startup. And, like, you know, there are a couple of these where we have no clue really what the direction is.

So I'm curious to see. A straight shot to superintelligence with no products is an interesting pitch to sell, but it's at least plausible given the state of the scaling curves. Next up, covering some hardware: we have ASML set to ship its first second-gen high-NA EUV machines in the coming months. And Jeremie, I'll just let you take over on that one. Oh yeah.

I mean, so we talked about this in the hardware episode, but, you know, at first there was DUV, deep ultraviolet, lithography machines. These are the machines that generate and kind of shine and collimate the laser beams that ultimately get shot onto wafers during the semiconductor fabrication process.

These beams etch, or don't etch themselves, but these beams essentially shine and lock in the pattern of the chip onto that substrate, and they're super, super expensive. Deep UV lithography machines are the machines that China currently can access; they allow you to get down to around seven nanometers, maybe five nanometers, of effective resolution, let's say. EUV machines are the next generation after that. The next generation after EUV, though, is high numerical aperture EUV.

So basically these are EUV machines with bigger lenses. And you might think that doesn't do much, like, oh, big deal, there's a bigger lens. But actually, when you increase the size of a lens in one of these machines, they are so crazy optimized in terms of space allocation and orientation and geometry that you fuck a ton of things up. And so anyway, these are super expensive.

Intel famously was the first company to buy essentially the entire stock that ASML planned to produce of high-NA EUV machines. And, you know, it remains to be seen what they'll do with that now that they're sort of fumbling the ball a little bit on their fab work. But yeah, so apparently Intel will actually be receiving this first second-gen high-NA EUV machine in the coming months, and the tech is actually expected to be used in mass production only by TSMC in 2028.

So there's this big lag between Intel and TSMC. This is, you know, quite defensible. Historically we've seen companies like Samsung fall behind by generations in this tech because they moved too fast to adopt the next generation of photolithography. And so this is going to go right into Intel's 14A node, their 14-angstrom node, or, if you're thinking in TSMC terms, I guess that would be 1.4 nanometers effectively. That's the next beat for them.

But anyway, there's a lot of interesting technical detail; the bottom line is these things are now shipping, and we'll start to see some early indications of whether they're working for Intel in the coming year or two. And the final story, it is about a projection from Morgan Stanley. They were revising their projection of NVIDIA GB200 NVL72 shipments downward, due to Microsoft seemingly planning to focus on efficiency and a lessening of CapEx.

But it appears that Microsoft, Google, Meta, Tesla, et cetera, are still investing a lot in hardware. And so, despite these kinds of projections, NVIDIA is still going strong. Yeah. I mean, pretty much just refer to a couple episodes back, when DeepSeek, I guess, what was it, yeah, I guess it was R1, when people really started talking about this.

I think we touched on it with V3, but when R1 came out, everyone was like, oh shit, NVIDIA stock is going to do really badly, because now we've found ways to reach the same level of intelligence using a thirtieth of the compute. And at the time, and repeatedly, we have said over and over: this is not right, this is exactly backwards, right?

What this really shows is, okay, well, suddenly NVIDIA GPUs can pump out 30 times more effective compute, at least at inference time, than we thought they could initially. That sounds not like a bearish case for NVIDIA; that sounds like a freaking bullish case for NVIDIA. And I don't mean to throw shade on Morgan Stanley, you know, full respect for the Morgan Stanley folks, but this was, I think, a pretty obvious call, and we're seeing it play out now. So never bet

against Jevons paradox in this space. A good model to have in your head is that there is close to infinite market demand for intelligence. And so, you know, if you make a system that is more efficient at delivering intelligence, demand for that system will tend to go up. That's at least the way the frontier labs that are racing to superintelligence are thinking, and right now they're the ones buying up all the GPUs. So that's the way I've been thinking about it.

And feel free to throw shade and cast doubts if you disagree. And on to projects and open source: we begin with AI2 releasing Tulu, I don't know how to pronounce this, Tulu 3 405B. So this is a post-trained version of Llama 3.1 with a lot of enhancements for scalability and performance, and they're saying that this is on par with DeepSeek V3 and GPT-4o. And as usual with AI2, they're releasing it very openly and releasing a lot of details about it.

So again, another demonstration that it seems more and more to be the case that open source is going to be essentially on par, or close to on par, with frontier models. Which hadn't been the case until just a few months ago, really, until these 405B gigantic models started to come out. One of the interesting kind of breakthroughs here is their reinforcement learning with verifiable rewards structure, the RLVR structure, which is a new technique.

It focuses on settings where you have verifiable outputs, where you can objectively assess whether they're correct or not, and they have a little feedback loop within their training loop that factors in that kind of ground truth, something like the toy reward check sketched below.
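To make that concrete, here is a minimal sketch of what a verifiable reward can look like in the RLVR spirit. Names like extract_final_answer are hypothetical placeholders, not AI2's actual code, and the "Answer:" format is an assumption about how the model is prompted.

```python
# Hypothetical sketch of a verifiable reward, in the spirit of RLVR. Names like
# extract_final_answer are placeholders, not AI2's actual code.
import re

def extract_final_answer(completion: str) -> str:
    # Assume the model was prompted to finish with "Answer: <value>".
    match = re.search(r"Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else ""

def verifiable_reward(completion: str, ground_truth: str) -> float:
    # Binary, objectively checkable reward: 1.0 on an exact match with the
    # known-correct answer, 0.0 otherwise. That check is what makes the
    # reward "verifiable", as opposed to a learned preference model.
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

# Usage: score sampled completions before a policy-gradient style update.
print([verifiable_reward(c, "42") for c in ["Answer: 42", "Answer: 41"]])  # [1.0, 0.0]
```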

The other thing is how they scale that technique. They say that they deployed the model using 16-way tensor parallelism, so essentially chopping up the model: not just chopping it up at the level of, say, transformer blocks or layers, but even chopping up individual layers and shipping those off to different GPUs, as in the toy illustration below. So there's a lot of scaling optimization going on here. Yeah, it's obviously a very, very highly engineered model, and so, yeah, pretty cool and a good one to dig into on the hardware and engineering-optimization side as well.
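Here's a toy NumPy illustration of that idea of splitting a single layer across devices. Real 16-way tensor parallelism does this with collective ops across GPUs; this is just the arithmetic, with a hypothetical 4-way split.

```python
# Toy illustration of tensor parallelism: one layer's weight matrix is split
# column-wise into shards, each "device" computes its slice, and the slices
# are concatenated, recovering the unsplit layer's output.
import numpy as np

x = np.random.randn(2, 8)                   # batch of activations
W = np.random.randn(8, 16)                  # full weight matrix of one layer
shards = np.split(W, 4, axis=1)             # hypothetical "4-way" split of the layer's columns
partials = [x @ shard for shard in shards]  # each shard computed independently
y = np.concatenate(partials, axis=1)        # gather the partial outputs
assert np.allclose(y, x @ W)                # same result as the unsplit layer
```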

Exactly. The report or paper they put out is titled Tulu 3: Pushing Frontiers in Open Language Model Post-Training. The initial version actually came out a couple of months ago and was just revised, and it goes into a lot of detail on post-training, 50 pages worth of documentation. So in addition to having quite performant LLMs, we have more and more of these very, very detailed reports on how the training happens.

Where it used to be like black magic, you know, how the sausage gets made is now very clear. So, yeah, more and more it's the case that there aren't really any secrets; there's no secret sauce. There used to be maybe a bit of secret sauce on how to get reasoning to work, but that is increasingly not the case either, as we'll get into. So open source, yeah, don't bet against it as far as developing good models. Yeah, I do think there remains some secret sauce.

Well, there's all the secret sauce at the algorithmic efficiency level and other things, but I think this just raises the floor; a lot of that secret sauce is no longer hidden, or at least there's less of it. And next up we have the SmolLM2 paper, "When Smol Goes Big." So, yeah, it's about a new iteration of SmolLM, a small language model, meaning on the order of a billion parameters or so, a small large language model.

And they are focusing on training with highly curated data. So basically, with that highly curated data, they're working with FineMath, Stack-Edu, SmolTalk, these high-quality datasets, which leads to the ability to get even better performance at that scale, yet another notch in the ability to get small models to be pretty performant if you invest a lot in optimization at that size.

Yeah, it's kind of ambiguously argued in the article itself that there might be implications for scaling laws here. They don't directly compare the scaling they see here to the Hoffmann, Chinchilla-style scaling law paper. But you can expect that if your data quality is better, you ought to see faster scaling, and likewise if you're more thoughtful about the order in which you layer in your training data.

So one of the things they do here, and you're seeing this basically become the norm already, but just to call it out explicitly: rather than using a fixed dataset mix, they have a training mix program that dynamically adjusts the composition of their training set over time. And so in the early stage, you see this sort of lower-quality, less refined data, general knowledge, web text stuff that they train the model on.

And you can think of that as being the stuff you're using to just build general knowledge, to get the model to learn the even earlier stuff, like the basic rules of grammar and, you know, bigrams and trigrams and so on. So just the basic things. Why waste your high-quality data on that stage of training? Which is pretty intuitive. So just get it to train on Wikipedia or basic web text. Then there's this middle stage where code and math data is added to the mix.

And then in the late stage, they really focus in on the refined, high-quality math and code data. So you can see the instrument getting sharper and sharper, the quality of the data going up and up as you progress through that training pipeline; a rough sketch of that kind of staged mixture is below. They claim it was 10^23 FLOPs of total compute, which is about a quarter million dollars worth of compute, and that's pretty cheap for the level of performance they're getting out of the model.
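As a minimal sketch, a staged data mixture can be as simple as a schedule keyed off training progress. The stage boundaries and weights below are illustrative assumptions, not the paper's actual numbers.

```python
# Hypothetical sketch of a staged data-mixture schedule. Stage boundaries and
# weights are illustrative, not the paper's actual numbers.
def mixture_weights(tokens_seen: float, total_tokens: float) -> dict:
    progress = tokens_seen / total_tokens
    if progress < 0.6:       # early stage: mostly general web text
        return {"web": 0.8, "code": 0.1, "math": 0.1}
    elif progress < 0.9:     # middle stage: layer in more code and math
        return {"web": 0.5, "code": 0.3, "math": 0.2}
    else:                    # late stage: lean on refined, high-quality math/code
        return {"web": 0.2, "code": 0.4, "math": 0.4}

print(mixture_weights(1e12, 11e12))     # early-stage mix
print(mixture_weights(10.5e12, 11e12))  # late-stage mix
```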

So pretty cool. By the way, they do say they train it on 11 trillion tokens, and that's way, way higher than what Chinchilla would say is compute-optimal for a model this small. So, you know, it's overtrained, nothing too shocking there, but again, they're focusing on the data quality layering side of things and other optimizations to get more bang for their buck here.

So maybe there's an argument that smaller language models are more data-hungry than previously thought, and that overtraining can be disproportionately beneficial, but it's not clear if that's actually true when you account for the quality of the data you're putting in. I mean, there are always asterisks with scaling laws. Data is not just data, so you can't just plot data along a curve and say, okay, more data is more scale.

You do have to care about the quality, but this is a great example of that, and it's a powerful model you're getting out the other end. Exactly. And by the way, this is coming from Hugging Face, sort of the GitHub for models, so it's open source, Apache 2.0, you can use it for whatever you want, and there's a pretty detailed paper. They're also releasing the datasets that they're primarily highlighting as the contribution. So, you know, open source, that's always exciting.

And a couple more stories. Next up is not a model but a new benchmark; the paper is titled "PhD Knowledge Not Required: A Reasoning Challenge for LLMs." They are arguing that, you know, some of the benchmarks for reasoning require very specialized knowledge that some people may not have, and as a result those benchmarks aren't just measuring reasoning, they're also measuring knowledge. So this benchmark is actually built from about 600 puzzles from the NPR Sunday Puzzle

challenge, which is meant to be understandable with general knowledge but still difficult to solve. And they are saying that o1, for instance, achieved a 59 percent success rate, better than R1, but I guess still not as good as you might get with top humans. One of the interesting findings here, too, is the relative performance of o3-mini and the o1 model, and R1 too; they all kind of perform about the same, or at least similarly. And this is sort of

taken as a potential argument that current LLM benchmarks might overestimate some models' general reasoning abilities, just because, you know, maybe o1 has reasoning-specific optimizations that o3-mini lacks, but more likely than not it's just a general knowledge thing. So kind of interesting. It is interesting to parse out what is general knowledge and what is pure reasoning, to the extent you can do it.

It allows you to hill climb on a metric that's more detached from textbook knowledge, which consumes an awful lot of FLOPs during training. So if you could actually separate those out, it's kind of an interesting way to maybe get a lot more efficient. Some of the findings, by the way: they probed this question of how much reasoning is actually enough, that is, how reasoning length, the number of tokens generated while thinking, if you will, affects accuracy. DeepSeek R1

performs better after about 3,000 reasoning tokens, but when you go beyond 10,000 tokens, you kind of plateau, right? So we're seeing this a lot with these models, where the inference-time scaling works well up to a point, and then you do get saturation. It's almost as if the base model can only effectively use so much context. And the failure mode with DeepSeek R1 is especially interesting: it will explicitly output "I give up" in a large fraction of cases.

It's about a quarter to a third of cases. And anyway, this idea of prematurely conceding the problem is something that previous benchmarks just hadn't exposed, so it's sort of interesting to see that quantified and demonstrated in that way. And then it's also the case that R1, and some of the other models too, but R1 especially, will sometimes just generate an incorrect final answer that never appeared anywhere in its reasoning process.

It'll go, A then B, C then D, D then E, and then it'll suddenly go, F, you know, and it'll give you something that's completely untethered. So kind of interesting, an indication that there is a bit of a fundamental challenge going on, at least at the level of R1, where if you can suddenly get a response that is untethered to the reasoning stream, that's a problem for the robustness of these systems.

So it's exciting to see a paper that dives into, if you will, pure reasoning, which is something we definitely haven't seen before, where usually every measurement of reasoning is to some degree entangled with a measurement of world knowledge, right? Even when you're asking a model to just solve a puzzle or something, somehow that model has to understand the query, which requires it to know something about the real world and language and all that stuff.

So it is really tough to tease these things apart, and it's interesting to see this probed directly. And just to give a taste of the things in the benchmark, here's one question.

Think of a common greeting in a country that is not the U.S. You can rearrange its letters to get the capital of a country that neighbors the country where the greeting is commonly spoken. What greeting is it? The answer there is "ni hao" and Hanoi, ni hao being the greeting in China and Hanoi being the capital of Vietnam. So for a lot of those kinds of questions, you do need some knowledge about the world, but nothing too specialized; the little check below confirms the anagram.
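Just as a tiny sanity check of that answer, here's a quick script confirming "ni hao" rearranges into "Hanoi" once you drop the space and ignore case.

```python
# Quick sanity check of the puzzle answer above.
def is_anagram(a: str, b: str) -> bool:
    normalize = lambda s: sorted(s.replace(" ", "").lower())
    return normalize(a) == normalize(b)

print(is_anagram("ni hao", "Hanoi"))  # True
```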

And you kind of need to think through a lot of possible answers and see what matches the criteria specified. Last story: OpenEuroLLM, which is an initiative across 20 European research institutes, companies, and centers with the plan to develop an open-source multilingual LLM. They have an initial budget of 56 million euros, which I guess is not too bad to at least start training, given DeepSeek V3, right? It may be doable. Although DeepSeek V3 is a V3.

So they were working on it for a while. They are also claiming to be compliant with the EU AI Act, and they are beginning their work starting February 1st, so we will have to see where it goes. Yeah, I think that 56 million budget is going to be a real challenge. I mean, that'll be good for what, a couple thousand GPUs maybe, but that's just the GPUs.

That doesn't even address the data center. It's a nothing budget, this is a nothing burger, and it's going to have to pay as well for the researchers and everything. Again, I'm sorry, I know it's not always nice to say, but Europe has a problem funding CapEx. This is just an issue.

I guess if you're into European regulation in the way that they do it, the one thing that might make this interesting is that it is explicitly trying to hit all the key European regulatory requirements. So at least you know that box is ticked if you use these models. But, I mean, expect them to kind of suck; that's all I'm going to say. They may attract more investment as they hit more proof points. Hopefully that happens.

But I think it's not necessarily the best idea, to be blunt about my perspective. I think it doesn't track the scaling laws; it doesn't particularly account for the real cost of building this tech. And on to research and advancements. We begin with a paper on reasoning titled LIMO: Less Is More for Reasoning. So previously we saw a similar paper, LIMA: Less Is More for Alignment,

with the highlight being that if you curate the examples very carefully, you can align a large language model with, let's say, hundreds of examples as opposed to a massive number. Here they're saying that you can curate 817 training samples and with just those get very high performance, comparable to DeepSeek R1, if you then fine-tune an existing model; here they're using Qwen 2.5 32B Instruct. The catch is that you do need to very, very carefully curate the dataset.

And that's about where it goes into the highlights: you need the problems to be challenging and to come with detailed reasoning traces. So that's the reasoning aspect of this; you need specifically the kinds of outputs you would get from R1 in response to challenging queries. A rough sketch of what that kind of curation filter might look like is below.
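As a hypothetical illustration of that curation idea, a filter might keep only problems that are hard for the base model and that come with long, verified reasoning traces. The field names and thresholds here are made up, not LIMO's actual pipeline.

```python
# Hypothetical sketch of a LIMO-style curation filter. Field names and
# thresholds are illustrative assumptions.
def keep_example(example: dict, base_model_pass_rate: float) -> bool:
    hard_enough = base_model_pass_rate < 0.2                       # challenging problem
    detailed = len(example["reasoning_trace"].split()) > 500       # rich, detailed trace
    verified = example["final_answer"] == example["ground_truth"]  # answer checks out
    return hard_enough and detailed and verified

# Usage on a tiny made-up candidate pool of (example, base-model pass rate) pairs.
candidate_pool = [
    ({"reasoning_trace": "step " * 600, "final_answer": "7", "ground_truth": "7"}, 0.1),
    ({"reasoning_trace": "short trace", "final_answer": "3", "ground_truth": "5"}, 0.9),
]
curated = [ex for ex, rate in candidate_pool if keep_example(ex, rate)]
print(len(curated))  # 1
```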

So, one demonstration, and I think increasingly the realization, is that LLMs may already be mostly capable of reasoning, and we just need something like RL or very carefully tuned supervised learning to draw that reasoning out of them. Yeah, it's an interesting question as to why this also seems to be happening all of a sudden, whether it's, you know, have base models always been this good and we're just realizing it now.

Or has something fundamentally changed, in particular the availability of the right kind of data, you know, these reasoning traces online, that makes pretrained models just better at reasoning generally than they used to be? I think that's a pretty plausible hypothesis, because you have to imagine that the very first thing OpenAI would have tried back in the day, after making their base models, would have been straight reinforcement learning, right?

They probably would have tried that before trying fancier things like process reward models and stuff like that, which kind of got us stuck in a bit of a rut, a very temporary rut, but a rut nonetheless, for a year or two. So, you know, it wouldn't be surprising if this is just a phase transition as the corpora of data available for this sort of thing have shifted. It is noteworthy, I mean, 817 training samples giving you almost 95 percent on the MATH benchmark

and 57 percent on AIME. That's really, really impressive. I think one of the things this does is it shows you that this whole idea of just pretraining plus reinforcement learning, with very minimal supervised fine-tuning in between, is probably the direction things end up going. That's consistent, of course, with R1-Zero, right, that paper we've talked about quite a bit.

But another key thing here: if you look at the consequences of doing fine-tuning on a tiny dataset like this, one consequence is you don't run the same risk of overfitting to the fine-tuning dataset, or at least overfitting to the kinds of solutions that are offered in those reasoning traces. And so what they actually find is that this model performs better out of distribution.

They find that compared to traditional models trained on something like 100 times more data, this actually does better on new kinds of problems that require a bit more novel, general-purpose reasoning. And so the intuition here may be that by doing supervised fine-tuning on these relatively large datasets, where we're trying to force the model to learn how to reason in a certain way, right, with a certain chain-of-thought structure or whatever,

what we're doing at a certain point is just causing it to memorize that approach rather than think more generally. And so by reducing the size of that dataset, we're not forcing the model to pore over that reasoning structure in the same way over and over, and we're allowing it to use more of its natural, general understanding of the world to be more creative out of distribution. So that to me was an interesting little update.

Yeah, it's hard to know how far all this goes, but it's an early sign that this may be interesting. I will say that many of their out-of-distribution tests are still within the broad domain of mathematical reasoning, so there's always this question when you're making claims about out-of-distribution results: how out of distribution is it really? You know, you train the model on geometry and you're applying it to calculus, does that count?

Is it still under the broader math umbrella, and all that stuff? But anyway, interesting paper and something we might find ourselves turning back to in the future if it checks out. Yeah, and I think that's a good caveat in general with a lot of research on reasoning, R1 being another example; there's a lot of training on math and on coding because that's the stuff where we have ground-truth labels and you can do reinforcement learning there.

I'm not so sure how much that translates to other forms of reasoning, so you do want to also see things like ARC and this new dataset we just talked about. But regardless, as you said, a lot of insights being unlocked. And in fact, the next paper is similar in spirit to the previous one. It's titled "s1: Simple test-time scaling," and it takes a somewhat different approach: they introduce this concept of having an inference-time budget.

So you're only allowed to use a certain number of tokens, and you basically get cut off, or you get told to keep thinking if you have more budget to spend. And with that, they curate 1,000 samples, so slightly more than the previous paper but around the same, also fine-tuning a Qwen model and also achieving pretty good performance with inference or test-time scaling. So yeah, very much in line with LIMO. The previous one was "less is more"; here it's simple test-time scaling.

Yeah, this is really one that I think Rich Sutton would be saying "see, I told you so" about. I mean, it's an embarrassingly simple idea that just works, right? So when you're thinking about the bitter lesson of just applying more compute, this is maybe the dumbest way you could think of, of applying more compute to a problem. You know, you have the model just try to solve the problem.

If it solves it right out of the gate, in, you know, 30 tokens, then you say, hey, I'm going to add a token to my string here; I'm just going to write the word "wait," right? So you put in the word "wait." So maybe you ask it a question like, how many airplanes are in the sky right now? It starts to work through reasonable assumptions about what the answer is, and then it gives you the answer.

If it answers too quickly, you just append to its answer the word "wait," and that triggers it to go, wait, okay, I'm going to try another strategy, and then it continues. You just keep doing this until you hit whatever your token budget is. The flip side is, if your model is going on for too long, you just insert the tokens "Final Answer:" to force it to come out with its solution, along the lines of the sketch below.
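Here's a minimal sketch of that budget-forcing loop, assuming a generate(prompt) function that returns the model's continuation as a string. Word counts stand in for token counts, and all names are placeholders rather than the paper's actual code.

```python
# Minimal sketch of s1-style budget forcing under the stated assumptions.
def budget_forced_generate(generate, prompt: str,
                           min_words: int = 500, max_words: int = 4000) -> str:
    reasoning = generate(prompt)
    # Model stopped too early: append "Wait" and ask it to keep reasoning
    # (bounded number of extensions to keep the sketch safe).
    for _ in range(8):
        if len(reasoning.split()) >= min_words:
            break
        reasoning += " Wait," + generate(prompt + reasoning + " Wait,")
    # Model ran past the budget: truncate the reasoning.
    if len(reasoning.split()) > max_words:
        reasoning = " ".join(reasoning.split()[:max_words])
    # Force it to commit to an answer.
    return reasoning + "\nFinal Answer:" + generate(prompt + reasoning + "\nFinal Answer:")
```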

That's, in practice, how they do this, dead simple, and it works really well. And it is the first time that I've seen us getting an actual inference-time scaling plot that looks like what we had with OpenAI's o1, right? Even DeepSeek, when they put out their paper, what they showed was, we can match the performance of o1. That's great and really impressive, but if you look at the paper carefully, you never actually saw this kind of scaling curve.

You never actually saw the FLOPs or tokens used during inference plotted against the accuracy or performance on some task; what you saw were different curves, for reinforcement learning performance during training or whatever, but you didn't see this at test time. And that's what we're actually recovering here.

So this is the first time I've seen something that credibly replicates a lot of the curves OpenAI put out with their inference-time scaling laws, which is really interesting. I mean, I'm not saying OpenAI is literally doing this, but it seems to be a legitimate option if you want to find a way to get your system to pump in more inference-time compute and pump out more performance. And let's keep going with the general theme of scaling.

Here, the next one is titled "ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning." I think you previewed this a little bit earlier in the episode. The idea of this paper is, what if we can have a benchmark that sort of separates the reasoning from the knowledge? And the way they do that is by setting up these grids, essentially stating a number of constraints and requiring the LLM to infer the locations of different things in the grid. So here's an example.

There are three houses, numbered one, two, three from left to right. Each house is occupied by a different person, and each house has a unique value for each of the following attributes. Each person has a nickname: Eric, Peter, Arnold. And then there's a bunch of clues: Arnold is not in the first house, the person who likes milk is Eric, blah, blah, blah. For this toy version, a brute-force check like the one below settles it.
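To make that concrete, here's a brute-force enumeration of the toy three-house example. Only the one clue quoted above is encoded; the elided clues would filter the survivors further.

```python
# Brute-force check of the toy three-house puzzle: try every assignment of
# names to houses and keep those satisfying the stated clues.
from itertools import permutations

names = ["Eric", "Peter", "Arnold"]
for assignment in permutations(names):                 # assignment[i] lives in house i+1
    house_of = {name: i + 1 for i, name in enumerate(assignment)}
    if house_of["Arnold"] == 1:                        # clue: Arnold is not in the first house
        continue
    # ...remaining clues (milk, etc.) would go here...
    print(house_of)                                    # candidate solutions that survive
```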

So you set up a set of constraints, and then you have to figure out what goes where. That setup allows them to build different sizes of puzzles: different numbers of variables, different numbers of clues. And so, per the title, the idea is that you can get bigger and bigger puzzles and eventually hit a wall where you can no longer answer these kinds of questions, due to the combinatorial complexity, you know, as eventually the dimensionality of the problem is such that

throwing more compute at it isn't going to solve it. So again, another benchmark and another way to evaluate reasoning. And on this one, they do find something along the lines of other benchmarks, with o1 performing quite well, DeepSeek R1 also performing really well, but in general all these reasoning models performing much better than typical LLMs. Yeah, I think one thing that has broken a lot of people's cognition when it comes to scaling is that you actually should expect

inference-time scaling to saturate, to kind of level off just like this, for any fixed base model, right? And the reason is just that the context window can only be so big, or the model can only effectively use so much context. And so what you want to be doing is scaling up the base model at the same time as you're scaling up the inference-time compute budget.

It's kind of like saying if you have 30 hours total to either study for or write an exam, it's up to you how you want to, you know, trade off that time. You could spend, you know, 25 hours studying for the exam and five hours writing it. You could spend 29 hours and 30 minutes studying for the exam and then just 30 minutes writing the exam.

But there's an optimal balance between those two, and you'll find that often you're more bottlenecked by the time writing the exam than by the time studying, or vice versa. And that's exactly what we're seeing here. Epoch AI has a great breakdown of this. I think it's a sort of under-recognized fact about scaling that you really do want to increase these two things at the same time.

And no discussion of scaling is complete when you just fix a base model, look at the inference-time scaling laws, and then whine about how they're flattening out. This was the very trap the media fell into, and a lot of technical analysts fell into, when they were talking about, oh, pre-training is saturating, the ROI isn't there anymore. It's like, no, you actually have these two different modalities at play. Yeah, if you spent

you know, 30 hours studying for an exam and five minutes writing it, yeah, your performance at a certain point will plateau; another 10 hours of studying isn't going to help. So anyway, that's really what you're seeing reflected here. It's reflected as well whether it's best-of-n sampling or other approaches that you use; you will see that saturation if you use a fixed base model. Just something to keep in mind as you look at scaling laws. Exactly.

At the end of the day, it kind of makes sense that you will saturate. And essentially, as you grow the dimensionality of the problem, eventually, for any size of LLM and any number of inference tokens, you're still going to be incapable of doing well. The paper does go into some analysis of the best way to handle this and so on. So some cool insights here, also actually from the Allen Institute for AI, which we covered earlier.

So yeah, lots of research, lots of insights on reasoning, which is of course pretty exciting. And now to the last paper, this one not on reasoning but instead on distributed training, which I think, Jeremy, you'll be able to comment on much more. The title of it is "Streaming DiLoCo with Overlapping Communication: Towards a Distributed Free Lunch."

And so the basic story is they are trying to get to a point where you can train in a more distributed fashion without incurring the cost, or worse performance, relative to more localized training. And I'll stop there and let you take over. Yeah, I almost wish we'd covered this in the hardware episode, but it's so hard to figure out where to draw the line between hardware and software.

Maybe there's something we could do with, like, optimized training, but anyway, it doesn't matter. So yeah, DiLoCo is this increasingly popular way of training in a decentralized way.

The big problem right now is, if you do federated learning, you'll have, to grossly oversimplify, one data center that's chewing on one part of your dataset, another data center that's chewing on another part of the dataset, and a third, and so on. Every so many steps, those data centers are going to pool together their gradient updates and then update one global version of the model that they're training, right?

So data center one has a bunch of gradients that it's accumulated from training; essentially these are the changes to model parameters required to make the model learn the lessons it should have learned from the data that data center was training it on. Those gradient updates are known as pseudo-gradients, pseudo because each data center is only working on a subset of the data.

You're going to pool together, average together, or otherwise combine those pseudo-gradients to update the global model in one step, and that global model then gets redistributed back to all those data centers. Now, the problem with this is it requires a giant burst of communication; every data center needs to fire off this big wave of gradient updates and associated metadata all at once, and that just clogs up your bandwidth.

And so the question is, can we find a way to manage this update such that you're maybe only sharing a small fraction of the gradient updates at a time? And that's what they're going to do. They're going to say, okay, let's take our model, let's essentially break it up into chunks of parameters, and let's only synchronize together

one chunk at a time, one fraction of the model's parameters, one fraction of the pseudo-gradients, pull those together, and then redistribute, so that those bursts of information sharing don't involve information pertaining to the entire model at once, but only subcomponents; a toy sketch of that idea is below. Anyway, there's more detail in terms of how specifically DiLoCo works, and we covered that in a previous episode.
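Here's a hypothetical sketch of the chunked ("streaming") pseudo-gradient sync idea: rather than every worker exchanging updates for the whole model at once, only one fragment of the parameters is synchronized per outer step. The names and the rotating fragment schedule are illustrative, not the paper's actual algorithm.

```python
# Toy sketch of streaming, fragment-by-fragment pseudo-gradient synchronization.
import numpy as np

def stream_sync(global_params, worker_pseudo_grads, outer_step, num_fragments=4, outer_lr=1.0):
    """Average one fragment of the workers' pseudo-gradients into the global model."""
    keys = sorted(global_params)
    fragment = outer_step % num_fragments            # rotate through parameter fragments
    chunk = keys[fragment::num_fragments]            # parameter names in this fragment
    for name in chunk:
        avg = np.mean([g[name] for g in worker_pseudo_grads], axis=0)
        global_params[name] -= outer_lr * avg        # outer update touches only this chunk
    return global_params

# Usage: two "data centers", a two-parameter model, syncing fragment by fragment.
params = {"w1": np.ones(3), "w2": np.ones(3)}
grads = [{"w1": np.full(3, 0.1), "w2": np.full(3, 0.2)},
         {"w1": np.full(3, 0.3), "w2": np.full(3, 0.4)}]
params = stream_sync(params, grads, outer_step=0, num_fragments=2)  # only "w1" is updated
print(params)
```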

So, moving on to policy and safety. The first story is regarding AI safety within the U.S., which seemingly is not going so well. With Trump having taken office, we are getting the news that several government agencies tasked with enforcing AI regulation have been instructed to halt their work, and the director of the U.S. AI Safety Institute, AISI, has resigned. This is, of course, following the repeal, or whatever you want to call it, of the Biden administration's executive order on AI, which we had previously commented on.

It's actually not too surprising. And I think the accurate frame on this is not that they don't care about safety in the sense that we would traditionally recognize it, national security, public safety. Part of the challenge has been that in the original Biden executive order, there was basically so much stuff shoved in: consumer protection stuff, privacy stuff, social justice stuff, AI ethics stuff, bias stuff.

It was literally the longest, and may still be the longest, executive order in the history of the United States. And it really was that they were trying to give something to everyone; their coalition was so broad and so incoherent that they had to stuff all of this in together. That's something that, you know, at the time I'd mentioned was probably going to be a problem.

Now we have a new administration and, no surprise, that's not the way they're going with this. I think on national security grounds, yeah, you're going to see some thoughtful work on what would otherwise be called safety. The problem is the word safety has taken on a dual meaning, right? It's almost politicized.

So in terms of the actual loss of control of these systems, in terms of the actual weaponization of these systems, that's something that, on a concrete level, I think this administration is going to be concerned about. But there's all this fog of war; frankly, they're still trying to figure out, what is the problem with this tech and how are we going to approach it?

Especially given that we have a competition with China. The departure of Elizabeth Kelly, this is the AISI director, is not super shocking either; it's the kind of role you would expect to turn over, you know, the head of a fairly prominent department that is, to some degree, linked to this executive order, so maybe less surprising that she'd be departing.

It's unclear to what extent the Trump administration is actually going to use the AISI or find some other mechanism, but there certainly is interest within the administration in tracking down a lot of these big problems associated with the tech. They're just trying to figure out what they take seriously and what they don't, which is pretty interesting, and more or less what you'd expect in the first few weeks of an administration.

And next, going back to research with a paper on inference-time computation, but this time for alignment. The title of the paper is "Almost Surely Safe Alignment of LLMs at Inference Time." "Almost surely" is actually a bit of a technical term; it's a theoretical guarantee that you will approach a probability of one with respect to some metric. This gets pretty theoretical.

So I'm not even going to try to explain it in detail, but broadly speaking, they train a critic and solve a constrained Markov decision process. So again, there's a lot of theory around these kinds of things, but the end story is they find a way to do inference-time decoding where, even without training the model itself at all, you can get safety guarantees if you train this critic to guide it; a rough sketch of what critic-guided decoding can look like is below.
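As a hypothetical illustration of critic-guided decoding in general (not the paper's exact method), candidate continuations can be re-scored by a learned safety critic and unsafe ones pruned. The function names and threshold are placeholders.

```python
# Hypothetical sketch of inference-time, critic-guided decoding.
def safe_decode(propose_continuations, safety_critic, prompt: str,
                max_steps: int = 64, threshold: float = 0.5) -> str:
    text = prompt
    for _ in range(max_steps):
        candidates = propose_continuations(text)           # e.g. top-k next chunks from the LLM
        safe = [c for c in candidates if safety_critic(text + c) >= threshold]
        if not safe:                                        # no continuation clears the critic
            break
        text += max(safe, key=lambda c: safety_critic(text + c))  # pick the safest option
    return text
```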

Yeah, the cool implication of it being inference-time is that you can basically retrofit or slap this onto an already trained model, and it'll work right out of the box in that sense. If you're getting excited because the title sounds like a very promising advance, it's not that it's not promising, but there's a giant caveat right in the middle of this thing, which is that it's almost surely safe with respect to some particular metric that defines safety.

Traditionally, the challenge with AI has been that no one knows how to define a metric that would actually be safe to optimize for; when it's optimized by a sufficiently intelligent system, there are always ways of gaming any metric you can think of. This is Goodhart's law, where you end up with very, very undesirable consequences when you push that metric to its limit, right? So when you tell teachers, for example, they'll be rewarded for the scores of their students on standardized tests,

don't be surprised when the teacher teaches to the test, right? They find cheats, they game the system. That is in no way remotely addressed by this scheme; it leaves us with all our work still ahead of us in terms of defining the actual metrics that will quantify safety. So that's a bit of a distinction here.

What they're saying is, once you have that metric, then we can make certain guarantees about the safety of the system as measured by it, but it doesn't guarantee that the metric itself won't be hacked. That's the caveat here. And we have another paper on the general topic of alignment, this one coming from Anthropic, titled "Constitutional Classifiers: Defending Against Universal Jailbreaks."

So the story is they have their constitutional AI approach, where you write down a set of rules, a constitution, and then you can generate many examples of things that align with that constitution. This is similar in a way: you have, again, your constitution, and you train the system so it can't be jailbroken, so it won't give in to a demand to go against your constitution. The highlight here is that they had over 3,000 hours of red teaming.

Red teaming, meaning that people tried to break this approach, to find a universal jailbreak that could reliably get the LLM to disclose information it's not supposed to. This approach basically succeeded in making it so you were not able to do that. One more thing to note here: Anthropic also offered $20,000 to anyone able to jailbreak this new system, so they are still trying to get even more red teaming to see if anyone can do it. The challenge is closing today; a rough sketch of how classifiers like these can wrap a model is below.
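As a hypothetical sketch of the general shape of this kind of system, separate input and output classifiers, trained on constitution-derived data, can gate both the request and the response. The names, thresholds, and refusal text are placeholders, not Anthropic's actual implementation.

```python
# Hypothetical sketch of classifier-gated inference around a model.
def guarded_chat(model, input_classifier, output_classifier, user_msg: str) -> str:
    if input_classifier(user_msg) > 0.5:     # request scored as a likely jailbreak attempt
        return "I can't help with that."
    reply = model(user_msg)
    if output_classifier(reply) > 0.5:       # response would disclose restricted content
        return "I can't help with that."
    return reply
```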

So I guess we'll see if anyone is able to beat it. Yeah, I will say 20 grand is a pretty small amount of money for a skill set that is extremely valuable, right? The best jailbreakers in the world are people who are making a huge, huge amount of money every year. 20 grand is probably not enough, especially given the amount of value it would offer to Anthropic in terms of plugging these holes. That's something Anthropic was criticized for online.

I know Pliny the Prompter, or Pliny the Liberator, I guess, as he's now known, got into this semi-heated exchange with, I'm trying to remember, I think it was Jan, on X, saying I'm not even going to bother trying to do this. He has a whole issue with the way Anthropic is approaching safety fine-tuning and the extent to which they may be hobbling their models' ability to, I guess, talk about themselves.

Now I'm trying to remember if Pliny the Liberator is one of these kind of consciousness-concerned people, or if it's more of an open-source thing; I think it's both. There's a whole bunch of people, anyway, in that ecosystem who were complaining about Anthropic potentially using this to prevent the models from expressing the full range of their consciousness and all that jazz. So this is kind of interesting.

I mention it because Pliny is actually possibly the world's most talented jailbreaker, right? This is the guy who, every time there's a new flashy model announced that supposedly has the best safety and security characteristics, goes out and, like five minutes later, says, I figured out how to jailbreak it; here's the jailbreak, check it for yourself.

So anyway, it's interesting that he has not been engaged in these red teaming exercises by any of the labs. And that's an ideological thing, but that makes this ideology worth tracking; it actually is affecting these frontier labs' ability to recruit jailbreakers at the top tier. And with that, we are done with this episode. Thank you for listening. Once again, you can go subscribe to the newsletter, where you'll get emails with all the links for the podcast.

You can go to lastweekin.ai. As always, we appreciate you subscribing, sharing, reviewing, chatting on the Discord, all those sorts of things, and we do try to actually take your feedback into account. Maybe there won't be an AI-generated song on this episode, we'll see, but regardless, please do keep tuning in and we'll keep trying to put these out every week.
