Hello and welcome to the last week in AI podcast. We can hear us chat about what's going on with ai As usual. In this episode, we'll be summarizing and discussing some of last week's most interesting AI news. You can go to the episode description for all of the links to the news we're discussing and the timestamps so you can jump to the discussion if you want to. I am one of your regular hosts, Andre ov. I studied AI in grad school and I now work with at a generative AI startup.
And I'm, you hear me typing here just 'cause I'm making final notes on what is an insane We can, if we do our jobs right today, this will be a banger of an episode if we do our jobs wrong. It'll seem like any other week. This has been insane. I'm Jeremy, by the way. You guys all know that if you listen Gladstone, ai, ai, national Security, all that jazz this is pretty nuts. Like we were talking about this I think last week where.
You know, we're catching up on two weeks worth of news and we were talking about how every time we miss a week and it's two weeks, inevitably it, it's like the worst two weeks to miss. And the AI universe was merciful that time. It was not merciful. This time, this was an insane, again, banger of a week. Really excited to get into it, but. Man, is there a lot to cover? Exactly. Yeah. There hasn't been a week like this probably in a few months.
You know, there was a, a similar week I think around February where a whole bunch of releases and announcements were bunched up from multiple companies and that's what we're seeing in this one. So just to give you a preview, the main bit that's exciting and very full is announcements concerning tools and sort of consumer products. So Google had their IO 2025 presentation, and that's where most of the news has come out of.
They really just went on the attack, you could say, with a ton of stuff either coming out of beta and experimentation being announced, being demonstrated. Et cetera, and we'll be getting into all of that. And then afterward, philanthropic went and announced cloud four and some additional things in addition to cloud four, which was also a big deal. So those two together made for a really, really eventful week. So that'll be a lot of what we were discussing.
And then in applications as business, we'll have some stories related to OpenAI. We'll have some interesting research and some policy and safety updates about safety related to these new models and other recent releases. But yeah, the exciting stuff is definitely gonna be first up and we're just gonna get into it. First into and apps is Claude four. Maybe 'cause of my own bias about what's being exciting. So this is Claude Opus four and Claude Sonnet four.
This is the large and medium scale variant of Claude from Tropic. Previously we had CLA 3.7 I think for a few months. Cloud 3.7 has been around but not super lengthy. And this is pretty much an equivalent update. There'll be costing the same as the 3.7 variance. And the pitch here is that they're better at coding in particular and better at long workflows. So they are able to maintain focused effort across many steps in a workflow. This is also coming paired with updates to cloud code.
So it's now more tightly integrated with development environments coming with an SDK now. So you don't have to use it as a command line tune. You can use it programmatically. And related to that as well, both of these models, opus and sonnet, are hybrid models, same as 3.7. So you can adjust the reasoning budget for, for models.
So I guess qualitatively not anything new compared to what philanthropic has been doing, but really doubling down on the agentic direction, the kind of demonstration that people seem to be optimizing these models for the task of like give a model some work and let it go off and do it and come back after a little while to see what it built with things like cloud code. Yeah, the two models that are released, by the way, are, are Claude Op Opus Four and Claude Sonnet four.
Note the Slight Change yet again in naming convention. So it's no longer Claude four Sonnet or Claude four Opus. It's now Claude Sonnet four, Claude Opus four, which I personally like better, but hey, you do you a lot of really interesting results. Let's start with SWE Bench, right? So Thiswe Bench verified the sort of like software engineering, pseudo real world tasks benchmark that really opening AI polished up. But that was anyway, developed a little while ago in the first place.
So opening Eyes Codex One, which I'm old enough to remember a few days ago when that was a big deal was just for context hitting about 72%, 72 point 0.1% on this benchmark. that was like really quite high. In fact, for all of 20 seconds it was soda. Like this was a big deal, like on whatever it was. Tuesday when it dropped. And, and now we're on Friday, no longer a big deal because sonnet four hits 80.2%. Now going from 72 to 80%, that is a big, big jump, right?
You think about how much, there's not that much more left to go. You've only got 30 more percentage points on the table, and they're taking eight of them right there with that one advance. Interestingly Opus four scores 79.4%, so sort of like comparable. I. Performance to sonnet on that one. And we don't have much information on the kind of Opus four to sonnet four relationship and how exactly that distillation happened if there was sort of extra training.
Anyway, so that's kind of a another thing that we'll probably be learning a little bit more about in the future. But and this is the numbers v like upper range with a lot of compute that, you know, similar to O three for instance, from OpenAI, when you let these models go off and, and go work for a while and not kind of a more limited variant. E Exactly. That's a really good flag. Right.
So there's a, a range with inference time compute with test time compute models where yeah, you have the, kind of like the lower inference time, compute budget score, which in this case is around 72, 70 3%. And then the high inference time, compute budget, which is around 80% for both these models. Again, contrasting with codex one, which is sitting at 72.1% in this, in this figure.
And they don't actually indicate whether that's low compute mode or high compute mode which is itself a bit ambiguous. But in any case, this is a, it's a big, big leap. And this bears out in the qualitative valuations that a lot of the folks who had early access have been sharing on X. So, you know, you know, make, make of that what you will, all kinds of really interesting things.
So they've figured out how to apparently significantly reduce behavior associated with getting the models to use shortcuts or loopholes. Big, big challenge with Codex One. A lot of people have been complaining about this. It's like, it's too clever by half, right? The O three models have this problem too. They'll sometimes like find janky solutions that are a bit dangerously creative, where you're like, no, I didn't mean to have you solve the problem like that.
Like that's a, that's kind of, you're sort of cheating here. And and, and other things where they'll tell you they completed a task but they actually haven't. That's kind of a, a thing I found a little frustrating with especially with O three. But so this model has significantly lower instances of that.
Both models, meaning Opus four and Sonnet four, are, they say 65% less likely to engage in this behavior than sonnet 3.7 on age agentic tasks that are particularly susceptible to shortcuts, end to loopholes. So this is, pretty cool. Another big dimension is the memory performance. So when developers build applications that give Claude local file access, Opus four is really good at creating and maintaining memory files to store information.
So this is sort of partial solution to the problem of persistent LLM memory, right? Like you can only put so much in the context window. These, these models are really good at, at building, like creating memory files, explicit memory files, so not just story in context, but then retrieving them. So they're just really good at a kind of implicit rag, I guess you could, you could call it. It's not actual rag, it's just they're, they're that good at, at recall.
there are a bunch of features that come with this. As with any big release, it's like this whole smorgasborg of different things and you gotta pick and choose what you highlight. We will get into this, but some of the most interesting stuff here is in the Claude for System card. And I think, Andre, correct me if I'm wrong, do we have a section to talk about the system card specifically later, or is this it?
I think we can, yeah, get back to it probably in the advancement section just because there is so much to talk about with Google. So we'll do a bit more of a deep dive later on to get it to technicals. But at a high level, I think, you know, as a user of chat gt, lms, et cetera, this is a pretty major step forward. And in particular on things like clot code on kind of the ability to let these LMS just go off and complete stuff for you. And so moving on for now from philanthropic.
Next up, we are gonna be talking about all that news from Google that came from IO 2025. Bunch of stuff to get through. So we're gonna try and get through it pretty quick. First up is the AI mode in Google search. So starting soon, I guess you have a tab in Google search where you have AI mode, which is essentially like chat GP of search.
Google has had AI overviews for a while now, where if you do, I think at least for some searches, you're gonna get this LLM summary of various sources with an answer to your query AI mode. Is that, but more in depth. It goes deeper on various sources and you can do follow up questions to it. So, very much now along the lines of what perplexity has been offering, what Chad GT search has been offering, et cetera. And that is. I guess really on par.
And, and Google has demonstrated various kind of bits and pieces there where you can do shopping with it. It has charts and graphs, it can do deep search that is able to do, looking over hundreds of sources, et cetera. Yeah. That kind of tight integration is, I mean, Google kind of has to do it. One of the, the issues obviously with Google is when you're making hundreds of billions of dollars a year from the search market and you have like 90% of it, it's, it's all downside, right?
Like the, the thing you worry about is what if one day OpenAI like chat, GPT just tips over some threshold and it becomes the default choice over search. Not even a huge default choice, just the default choice for 5% more of users. The moment that happens, like Google's market cap, I. Actually we would drop by more than 5% because that suggests an erosion in the fundamentals of their business, right?
So this is, this is a really big five alarm fire for Google, and it's the reason why they're trying to get more aggressive with the inclusion of generative AI in their search function, which is overdue. I think there are a lot of people who are thinking, you know, why did this take so long? I think one thing to keep in mind too is with that kind of massive market share, in such a big market comes enormous risk.
So yes, it's all fine and dandy for opening AI to launch chat GBT and to have it tell people to commit suicide or help people bury dead bodies. Every once in a while, people kind of forgive it because it's this upstart, right? At least they did back in 2022. Whereas with Google, if Google is the one doing that, now you have Congressional and Senate subpoenas. Like people, people want you to come and testify. They're gonna grill you. You know, Josh Hawley's gonna lay into you hard as he ought to.
But that's the, that's kind of the problem, right? You're reaching a fundamentally bigger audience. That's since Equilibrated. So open AI is, is kind of benefiting still from their brand of being kind of swing for the fences. So in some ways the expectations are a bit lower, which is unfair at this point. but Google definitely has inherited that legacy of a big company with a lot of users. So yes, the rollouts will be slower for completely legitimate sort of market market reasons.
So anyway I think this is just like really interesting. We'll see, we'll see if this actually takes off. We'll see what impact that has too. On, on chat. GPT, I will say the Google product suite. Is this sort of unheralded, relatively speaking unheralded suite of very good generative AI products. I use Gemini all the time. People don't tend to talk about it much. I find that really interesting.
I think it's a bit of a failure of marketing on Google's end which is weird 'cause their platform is so huge. So maybe this is a way for them to kind of solve that problem a little bit. Well, we'll touch on possibly usage being higher than some people. I think there might be a Silicon Valley bubble situation going on here where, yeah. Yeah. Fair. I get you're not Silicon Valley, but you're like, you know spiritually in Silicon Valley in terms of a bubble. moving right along.
Next announcement was talking about Project Mariner. So this was an experimental project from DeepMind. This is the. Equivalent to OP opening as operator, Amazon's Nova on Propex computer use. It's an agent that can go off and use the internet and do stuff for you. It can, you know, go to a website, look for tickets to an event, order with tickets, et cetera, et cetera. So Google has improved this with the testing and early feedback, and is now gonna start opening up to more people.
And the access will be gated by this new AI Ultra Plan, which is $250 per month, which was introduced also in the slate of announcements. So this $250 per month plan is the one that will give you like all the advanced stuff, all the models, the most compute, et cetera, et cetera. And you'll have Project Mariner as well. And with this update, you are gonna be able to give Project Mariner up to 10 tasks and it'll just go off and do it for you in the background. Somewhat confusingly.
Also, Google had a demo of Agent Mode, which will be in the Gemini app. And it seems like agent mode might just be an interface. To Mariner in the Gemini app maybe I'm not totally sure, but apparently ultra subscribers will have access to agent mode soon as well. Yeah. And it's so challenging to, to kind of, I find to highlight the things that are fundamentally different about a new release like this.
Just because so often we find ourselves saying like, oh, it's the same as before, except smarter. And that's kind of just true and that is transformative in and of itself. In this instance, there is one sort of thing, you alluded to it here, but just explicitly say it, the previous versions of Project Mariner were constrained to doing like one task at a time. 'cause they would actually run on your browser.
And in this case, the, the big difference is that because they're running this in parallel on the cloud, yeah. You can reach that kind of 10 or a dozen tasks being run simultaneously. So this is very much a, a difference in kind, right? This is like many workers in parallel chewing on your stuff. That's a change to the way people, people or you're more of an orchestrator, right, in that universe than a sort of a leader of one particular ai. it's quite interesting. And moving right along.
The next thing we'll cover is VO free, which I think from just kind of the wow factor mm-hmm. Of the announcements is the highest one. I think in terms of impact, probably not the highest one, but in terms of just wow, AI is still somehow blowing our minds.
VO three was the highlight of Google io and that is because not only is it now producing, you know, just mind blowingly coherent and realistic videos compared to even a year ago, but it is producing both video and audio together and it is doing a pretty good job. So there's been many demonstrations of the sorts of things video can do.
The ones that kind of impressed me, and I think a lot of people, is you can make videos that sort of mimic interviews or, you know, typical YouTube style content where you go to a conference for instance and talk to people and you have people talking to the camera with audio and it just seems pretty real. And it's yes, you know, different in kind from video generation we've seen before. And coming also with a new tool from Google called Flow to kind of.
Be able to edit together multiple videos as well. So again, yeah, very impressive from Google. And this is also underwear, AI Ultra Plan. Yeah. It's funny 'cause they also include a set of benchmarks in their, launch website, which by the way are sort of, they're sort of hidden, right? You actually have to click through a thing to see any anyway. And, and I always find it interesting to look at these when you've got that wow moment. I, I don't mean to call it quite a chat GPT moment.
for text to video because we don't yet know what the adoption's gonna look like. But certainly from an impact standpoint, it is a wow moment. When you look at how that translates though, relative to VO two, which again, relatively unheralded, like not a lot of people talked, they did at the time, but it hasn't really stuck. So 66% win rate, so thirds of the time it will beat VO two on a sort of movie gen bench, which is a benchmark that meta released.
It. It's just a, basically about preferences regarding videos. So it, it wins about two thirds of the time. It loses a quarter of the time and then ties 10% of the time. So it's a pretty, like, it looks like a, a fairly dominant performance, but not, not like a, a knockout of the sort that you might. It's difficult to go from these numbers to like, oh, wow, like this is the impact of it. But it certainly is there. Like when you look at these, it's, it's pretty remarkably good.
And this speaks to the consistency as well of those generations. It's not that they can cherry pick necessarily just a few good videos. It does pretty consistently beat out previous versions. Right. And they also actually updated VO two. So just demonstration of how craz this was in terms of announcements. VO two now can take kind of reference photos. We've seen this with some other updates. So you can give it an image of, you know, a t-shirt or a car and it'll incorporate that into a video.
And all this is folded into this flow video creation tool. So that has camera controls, it has sin Builder where you can. Edit and extend existing shots. It has this asset management things where you can organize ingredients and prompts. And they also released this thing called Flow tv, which is a way to browse people's creations with vo free. So tons of stuff.
Now Google is competing more of a runway and kind of a, I guess what OpenNet started doing with SOA when they did release SOA fully that had some built-in editing capabilities. Now VO isn't just text to video. They have more of a full featured tool to make text to video useful. Yeah. And the inclusion of audio too, I, I think is actually pretty important. You know, it's, it's this other modality. It helps to ground the model more.
And I suspect that because of the causal relationship between video and audio, that's actually quite a meaningful thing. Like if they, you know, this is interesting from that whole positive transfer standpoint. Do you get to a point where the models are big enough? They're consuming enough data that what they learn from one modality leads them to perform better when another modality is added, even though the complexity of the problem space increases.
And I suspect that will and probably is already happening, which means we're heading to a world by default with more multimodal video generation that wouldn't be too surprising, at least. And next up, you know, Google, I guess didn't just wanna do text to video. So they also did text to image with Imagine Four. This is the latest duration of their, you know, flagship text to image model. As we've seen with Text to Image, it is even more realistic and good at following prompts and good at text.
They're highlighting really tiny things like the ability to do detailed fabrics and fur or an on animals. And also they apparently paid attention to generation of text and typography saying that this can be useful for slides and invitations and other things. So, rolling out as well for their tool suite. And last thing to mention about is they also say This will be faster than Imagine Free. The plan is to make it apparently up to 10 times faster than Imagine free.
Yeah. And it is, it's, it's unclear because we're talking about a product rather than a model per se. It's unclear whether that's because there, you know, there's a compute cluster that's gonna come online that's gonna, you know, allow them to just crunch the crunch through the images faster, or that there's an actual algorithmic advance that makes it say, 10 times more compute efficient or whatever. So always hard to know with these things.
It's probably some mix of both, but Interesting that, yeah, I mean, I'm at the point where I, it's like flying on instruments. I feel like I can't tell the difference between these different image generation models. Admittedly these photos look super impressive, don't get me wrong, but I just, I can't tell the incremental difference. And and so I just end up looking at like, yeah, how much per token, like, or how much per image?
So you know, the, the price and the latency are both collapsing pretty quickly. And moving right along. We just got a couple more things we want to even covering all of announcements from Google. This is just a selection that I thought made sense to highlight. Next one is Google Meet is getting real time speech translation. So Google Meet is the video meeting offering from Google, similar to Zoom or other ones like that. And yeah, pretty much now you'll be able to have.
Almost real time translation. So it's similar to having a real time translator for like a press con conference or something. When you start speaking. It'll start translating to the paired language within a few seconds, kind of following on you. And they're starting to roll this out to consumer AI subscribers initially only supporting English and Spanish. And they're saying they'll be adding Italian, German, and Portuguese in the coming weeks.
So something I've sort of been waiting on honestly, I've been thinking we should have real time AI power translation that is very kind of sophisticated and and powerful, and now it's starting to get rolled out. I personally thought people who spoke languages other than English were just saying complete gibberish up until now. So this is a real shock that, that Yeah. No, but it, it's, it's kind of funny, right?
This is another one of those things where you hit a point of, you know, where latency crosses that critical threshold and that becomes the magic unlock. Like a model that takes even 10 seconds to produce a translation is basically useless. 'cause it's at least this really awkward conversation, at least for the purpose of, of Google Meet.
So another case where it did take Google a little while, as you pointed out, but the risk is so high if you mistranslate stuff and start an argument or, you know, whatever. That's a, that's a real thing. And that're deploying it again, across so many, so many video chats because of their reach, that that's, you know, gonna have to be part of the, the corporate calculus here. Right, and this is a thing we're not gonna be going into detail on, but Google did unveil a demo of their smart glasses.
And that's notable, I think, because Meta has their smart glasses and they have real time translation. So if you go to a foreign country, right, you can kind of have your in ear translator and I wouldn't be surprised if that is a plan as well for this stuff. But last thing to mention, for Google, not one of the highlights, but something I think notable as we'll see compared to other things.
Google also announced a new JUULs AI agent that is meant to automatically fix coding errors for developers. So this is something you can use on GitHub and it very much is like GitHub co-pilot. You will be able to task it with working with you on your code repository. Apparently it's gonna be coming out soon, so this is just announced. And yeah, it will kind of make plans, modify files and prepare pull requests for you to review in your coding projects.
And like literally every single product announcement like this, they have Google saying that Jules is in early development, end quotes may make mistakes. Which anyway, I think we'll be saying that until we hit super intelligence. Just 'cause, you know, the hallucinations are such a, a persistent thing, but there you have it. Right. And next story actually is directly related to that, it's that GitHub has announced the new AI coding agent. So GitHub co-pilot has been around for a while.
You could task it with your reviewing your code on a pool request on a, on a request to modify a code base. Google also had the ability to integrate Gemini for viewing code. So Microsoft very much. Kind of competing directly with JUULs and Codex as well, with an offering of an agent that you can task to go off and edit code and prepare a pool request.
So. Just part of an interesting trend of all the companies very rapidly pushing in a direction of coding agents and agents more broadly than they had previously. Yeah. This is also notable because Microsoft and OpenAI obviously are in competition and this frenemy thing copilot was the first, apart from Opening Eyes, codex was the first sort of large scale deployment at least of, of a coding auto complete back in like, I wanna say 20 20, 20 21, even just after GBT three.
And yeah, so the, they're continuing that tradition in this case, kind of being fast followers too, which is interesting. Like they're not quite first at the game anymore, which is something to note 'cause that's a big change. one small thing worth noting. Also, they did announce open sourcing of GitHub copilot for VS code.
So this is like a nerdy detail, but you have also competition from Cursor and these other kind of alternative development environments with a company behind Cursor now being valued at billions and billions of dollars. And that is a direct competitor to Microsoft's Visual Studio Code with GitHub copilot.
So them open sourcing with GitHub copilot extension to Visual Studio Code is kind of an interesting move and I think they are trying to compete against these startups that are starting to dominate in that space. And just one more thing to throw in here. I figure we're flagging because of its relation to this trend. Mistral, the French company that is trying to compete with OpenAI and philanthropic has announced Devra a new AI model focusing on coding.
And this is being released under an Apache to license. It is competing with things like Gemma free 27 B, so like a mid range coding model. And yeah, Misra also working on a larger agentic coding model to be released soon. Apparently with this being the smaller model, that isn't quite that good. This is also following up on code trial, which was more restrictively licensed compared to death trial. So there you go. Everyone is getting into coding more than they have before.
You get an agent and you get an agent and onto applications and business. We have, first up, not the most important story, but I think the most I. Kind of interesting or weird story to talk about, which is this open AI announcement of them fully acquiring a startup from is it Johnny? Ive Yeah. Johnny Ive, yeah. Johnny Ive, yes. Who seemingly has had this startup io that was, the details here are, are quite strange to me.
Yes. So there's this startup that Joni, I've started with Sam Altman seemingly two years ago that we don't know anything about or what is done. OpenAI has already owned 23% of this startup and is now going on a full equity acquisition that they're saying they're paying $5 billion for this IO company. And that's a company with 55 employees that again, at least I haven't seen anything out of. And this is, they're saying like the employees will come over. Johnny, I've will still be working.
Yeah. At Love From, which is his design company, broadly speaking, which has designed various things. So Johnny, I've not a full-time employee at OpenAI or io still sort of like a part-time contributor collaborator. And to top off all these various kind of weird details, this came with an announcement video of.
Sam Altman and John Johnny, ive walking through San Francisco, meeting up in a coffee shop and having like a eight minute minute conversation on values and AI and their collaboration that just had a very, very strange vibe to it. That was, you know, trying to make this very artsy, I guess feel to it. They also released some glass post coffee, coffee, you know, called Joni and Sam. Anyway, I, I just don't understand the PR aspects of this with business aspects of this. All of this is weird to me.
it almost reads like a landing page that Johnny Ive designed. Like to, to announce it. It's very like, kind of sleek, simple Apple style almost one might say very similar to actually love from their website has the same style of Yeah. Yeah. The blog post is like this minimalist centered text, large text. And the headline is Johnny and Sam. I think it's, I'm just gonna say it's weird. So they, so there, this blog post, they're talking about the origin story of this.
I think the news reports around the time that IO was first launched said, recall that like Sam Altman and Johnny Ives new startup. And the implication was that this was a company being co-founded by Sam and Johnny together or something. That's clearly not the case, at least according to what they said. They. Imply they say something like, it was born out of the friendship between Johnny and Sam, which is very ambiguous. But the company itself was founded about a year ago by Johnny.
Ive, along with Scott Cannon, who's an alum at Apple. and then Tang Tan and Evans Hanky. Evans e Evans hanky actually took over Johnny's role at Apple after Johnny departed. So they're tight there. A lot of shared history. But none of the co the actual co-founders are Sam opening. I already owns 23% of the company, so they're only having to pay 5 billion out of the total valuation of 6.4 billion to acquire the company.
And then, as you said, somehow out of all this, Johnny ends up still being a free ish agent to work at Love Run that, that, by the way, highly, highly unusual to acquire a company even at a $6 billion scale. And to let one of the core, arguably the most important co-founder just leave, like this is normally not how this goes usually.
famously with like the WhatsApp acquisition by Facebook it was like a, I forget what it was, like, a $5 billion acquisition, but the founder of WhatsApp left Facebook early, and so he was on an equity vesting schedule, so most of his shares just vanished and he didn't actually get the money that he was entitled to if he'd stuck around.
So the common thing weird that Johnny gets to just leave and fuck off, and like, apparently, I don't know if he's still getting his money from this or like, it's so weird. This is like. Very esoteric kind of deal it seems. But bottom line is they're working on a bunch of hardware things. Opening I has hired, the former head of Meta's, Orion Augmented Reality Glasses Initiative, that was back in November. And that's to head up its robotics and consumer hardware work.
So there's a bunch of stuff going on at OpenAI. This presumably folds into that hardware story. We don't have much information, but there's presumably some magic device that is not a phone that they're working on together. And who the hell knows? Right. So this announcement, which is very short like, I don't know, like maybe nine paragraphs concludes with saying, as IO merges with OpenAI, Johnny and Rum will assume deep design and creative responsibilities across OpenAI and io.
not like a strong commitment and as you said, said free agent. Like what deep design and creative responsibilities. And yeah, IO was seemingly working on a new hardware product, as you said like a hardware interface for ai, similar to the humane AI pin and rabbit R one, famously huge failures. Very interesting to see if they're still hopeful that they can make this.
I. AI computer, or whatever you wanna call it, AI interface within OpenAI and with Joni, ive, but anyways, yeah, just, such strange vibes out of this announcement and this video and with business story around this. Can an announcement have code smell? 'cause I, I feel like that's what this is and moving out to something that isn't so strange. We have details about Open AI's Planned Data Center in Abu Dhabi.
So they're saying that they're going to develop a massive five gigawatt data center in Abu Dhabi, which would be one of the largest AI infrastructures globally. Yeah. So this would span 10 square miles and be done in collaboration with G 42 and would be part of Open AI's Stargate Project, which. I'm kind of losing track is open the Eyes Stargate Project, just like all their data centers where they might wanna put them.
And this is coming after, you know, of course, Trump's tour in the Middle East with G 42. Having said that, they're gonna cut ties divest their stakes in entities like Huawei and Reba genomics Institute. this is pretty wild from national security standpoint. It is not unrelated to the deals that we saw Trump cut with the UAE and Saudi Arabia, la I won't say last week or the week before.
So for context, opening Eyes First, Stargate Campus in Abilene, Texas, which we've talked about a lot, that's expected to reach 1.2 gigawatts, really, really hard to find a spare gigawatt of power on the US grid. That's one of the big reasons why America's turning to the Saudis the Emiratis and, and so on and the Qataris to find energy on these kind of energy rich nations grids.
And so when you look at five gigawatts, you know, five times bigger than what is being built right now in Abilene, that would make this by far the largest structure or so, sorry, the largest cluster that OpenAI is contemplating so far. It also means that it would be based in foreign soil on the soil of a, a country that the US has a complicated past with. And just based on the work that we've done on securing. Builds and data centers I can tell you that it is extraordinarily difficult.
To actually secure something when you can't control the physical land, that that thing is going to be built on ahead of time. So when that is the case, you have a security issue to start with. That is prima facie. Not an option. When you're building in the UAE for a variety of reasons, you may tell yourself the story that you're. Controlling that, that environment, but you cannot and will not in, in practice.
And so from national security standpoint, I mean, I would really hope the administration is tracking this very closely and that they're bringing in, you know, the special operations, the, the Intel folks, including from the private sector who really know what they're doing. I, I, I gotta say the, the current builds, including from the Stargate family so far are the, the level of security is not impressive.
I've heard a lot of private reports that are non-public that make it very clear that that's the case. And so this is a really, really big issue. Like, we, we gotta figure out how to secure these. There are ways to do it and ways not to do it, but opening eyes so far has not been impressive in how seriously they've been taking the security story, but they've been talking a big game.
but the actual on the ground realities seem to be seem to be quite different again, just based on what we've been hearing. So really interesting question. Are we going to have this build go up? Is it going to be effective from a national security standpoint? And what's it gonna take to, to secure this? Yeah. Anyway all part of that G 42 backstory that we've been tracking for a long time between Microsoft and OpenAI and the United States and all that jazz.
Yeah. And, and it seems like with Trump in office, there's definitely set to be a major deepening of ties and open AI is opening Microsoft. Other tech companies seem happy to jump on board with that move. And yeah, as you said, there's been kind of a lot of investments going around from that region into things like open ai. So makes some sense. It, it's worth it if you can secure it. Like this takes immense pressure off the US electric grid, right?
Like we're not gonna just build or find five gigawatts like tomorrow that takes, we, we actually don't know how to build nuclear reactors in less than a decade in America. So it's a really good option. Saudi capital, UAE capital, those are great things. If. You know, they don't come with information rights or whatever.
But yeah, this is like if you, if you wanna like get the fruits of the of, of sort of Saudi and Uua e energy, you gotta make sure that you understand how to secure the supply chain around these things. 'cause Yeah, well we've, the billions of dollars this will surely cost. You'd hopefully put in a little bit of effort. Well, we'd be surprised. You'd be surprised. Yeah. Yeah, yeah, yeah.
Security is, is expensive and, and it's, it's actually like, it can't necessarily be bought for money because the, the teams that actually know how to secure these sites to the point where they are robust to, for example, like Chinese or Russian Nation state attacks are extremely rare.
And it's literally like a couple guys at like Seal Team Six and Delta Force and the agencies and like, yeah, their demands on their time are extreme and you probably can't network your way to them unless you have a, a trusted way to get there. So it like, it's a really tough problem. Well onto the next story. I think another sort of weird, almost funny story to me that I thought were covering LM Arena, which had the famous AI leaderboards that often have been covered.
We covered it just I think a few weeks ago. There was a big controversy around seemingly the big commercial players gaming their, to get ahead of open source competition. That organization has announced a hundred million dollars in a seed funding round led by a 16 z and uc investments. So this is gonna value them at something like $600 million and this is coming after them having been supported by grants and donations. So. Like, I don't understand.
What is the promise here for this, leaderboard company, organization is this just charity. Anyway, it's very strange to me. I would love to see that slide deck that pitch deck. There's a lot here. That's. Interesting to say the least. So one thing to note by the way, is they raised it's a hundred million dollars seed round. This is not a priced round.
So, so for context, when you raise a seed round you're, oh man, this gets into unnecessary detail, but basically it's a way of avoiding putting a real valuation on your company. If you raise it with, with safes, usually the whole thing with a seed round is you don't give away a board seat. Whereas if you raise a series A or series B, you're starting to give away board seats. So this implies that they have a lot of leverage.
Like if you're raising a hundred million dollars and you're calling it a seed round, you're basically saying like, yeah, we'll take that money. You'll get your equity, but don't even think about getting a board seat. That's kind of the frame here. You can only do that typically when you have a lot of leverage. Which again brings us back to your very, I think, very good and fundamental question. What is the profit story here and. The, like, I have no idea.
But it's notable that like Ella Marina has been accused of helping top AI labs game its leaderboard, and they've denied that. But when you think about like, okay, how could a structure like this be monetized? Well, maybe, showing some kind of, not overt preference, but, subtle preference for or indirect preference for certain labs. Like, I don't, I don't know. I'm speculating and this should not be taken.
Like we, I just don't see any information on exactly what the profit play is, which kind of makes me intrinsically skeptical. And yeah, we'll, we'll, we'll see where this goes. But again, there's a lot of leverage here. There's gotta be a profit story. It's being led by a 16 z so, you know. There's a, there, there, presumably. Yeah. Apparently it cost a few million dollars to run the platform and, and they do need to do the compute to compare these chatbots.
So er here is you get two generations, two outputs for a given input and people vote on which one they prefer. So it is costly in that sense and, and it does require you to pay for the inference and what at least has been said is this funding will be used to grow a la marina and hire more people and pay for costs such as the compute required to run this stuff. So yeah, basically saying that they are gonna scale it up.
And grow it to, to something that supports the community and helps people learn from human preferences. Nothing related to how this a hundred million will be, you know, something that the investors will get a return on. But it could be a data play, like, you know, a kind of scale ai we're, we're doing, you know, it, it, it is. You've got some data labeling. That's cool there. I just like, yeah. I'd love to see that deck.
Yeah. And next up going back to hardware Nvidia, CEO has said that the next chip after the aged 20 for China will not be from the Hopper series. So this is just a kind of small remark, and it's not because previously it was reported that Nvidia planned to release a downgraded version of the H 20 ship for China. In the next two months, this announced and mades a transition in US policy as to restrictions on ships.
And, and after the sale of these age 20 chips designed specifically for China was banned only a few months ago. looks like Nvidia is yeah, kind of having to change their plans and adapt quite rapidly. I. It seems like they will be pulling from the Blackwell line. This makes sense. Jenssen's quote here is, it's not Hopper because it's not possible to, to modify Hopper anymore. So they've sort of moved their supply chains over onto Blackwell, no surprise there.
And they've sort of squeezed all the juice they can out of the, the Hopper platform and presumably sold out of their stock when, when it was announced that they couldn't do anymore. Next up, I put this in a business section just so that we could move on from Google for a little bit, was announced that Google Gemini AI app has 400 million monthly active users, apparently, which is approaching the scale of Chad GPT, apparently that had 600 million monthly active users as of March.
So yeah, as I, as I previewed, I guess seems very surprising to me because Gemini as a chatbot hasn't seemed to be particularly competitive with offerings like Chatt and Claude and haven't seen many people be big fans of Gemini or a Gemini app. But according to this announcement, lots of people are using it. Yeah, and apparently so, so the comparable here, there are recent court filings where Google estimated in March that chat GPT had around 600 million monthly active users.
So, you know, this is like two thirds of, of where chat GPT was back in March. So, you know, to the extent that, you know chat GPT and open AI encroaching on Google's territory, well Google's, you know, starting to do the same. So yeah, it, this is all obviously a competition as well for data as much as for, you know, money in the form of subscription. So this is all self licking ice cream cones, if you will, or flywheels that both these companies are trying to get going.
Right. And I think also part of a broader story, this whole thing with Google IO 2025 and then this announcement as well, I think demonstrates that over the last few months really Google has had a real shift in fortune in terms of their place in the AI race and competition basically until like 2025. They've seemed to be surprisingly behind.
Gemini was like surprisingly bad, even though the numbers looked pretty good and their web offerings in terms of search lacked behind perplexity and Chatty P search. Then Gemini 2.5 was updated or, or released I think in late January and kind of blew everyone away how good it was. Gemini 2.5 and Gemini Flash have continued to be updated and, continue to impress people. And now all this stuff would be a free Imagine four.
The agents all these like 10 different announcements, really position Google as, as I think for many people in the space looking at who is in the lead or who is killing it. Google is killing it right now. They are, and this is, you know, we've talked about this before, Google being the sleeping giant, right? With this massive, massive pool of compute available to them.
They were the, the first to, I mean, there's the first to recognize scaling in, in the sense that OpenAI did with GPT two and then G PT three, but then there's the first to recognize. Let's just say the need for distributed computing infrastructure in a more abstract sense, and that was certainly Google. They invented the TPU explicitly because they saw where the wind was blowing, and then they, now they have these massive TPU fleets and, and a whole integrated supply chain for them.
You know, OpenAI really woke the dragon when they, when they went toe to toe with Google via chat, GPT and Microsoft. And so. Yeah, I mean, to some degree this is not to some degree entirely. This is the reason why you're seeing that, you know, five gigawatt UAE build, that's, that OpenAI is gonna build. They need to be able to compete on a flop per flop basis with Google. If they can't, they're done. Right. This is kind of just how the story ends.
So that's why all the CapEx is being spent anyway, just these announcements that we're seeing today are the product of CapEx that goes back, you know, two years, like breaking ground on data centers two years ago and making sure chip supply chains are ready three years ago and designing the chips and all that stuff. So, you know, this is really a long time in the making, every time you see a big rollout like this. Yeah. And, and not just the infrastructure, I mean.
Having DeepMind, having Google ai. Yeah. You know, Google was the first company to really go in on AI in a big way spending, you know, billions of, of dollars on DeepMind for many years as just a pure r and d play. Microsoft later, you know, also started investing more in, in meta and so on. But yeah, Google has been around for a while in research and that's why it was to a large extent, kind of surprising how lagging they were on the product front. And now seemingly they're catching up.
And just one more thing to cover in the section. We have a bit of a analysis on the state of AI servers in 2025. This is something Jeremy, you linked to just on x. So I think I'll, I'll just let you cover this one. It's sort of like a, a random assortment of of excerpts or, or take homes from this big JP Morgan report on AI servers from their Asia Pacific Equity Research branch. And there's just like a bunch of, a bunch of little kind of odd tidbits.
We won't spend much time on this 'cause we gotta go, man. There's more news. Just looking at the, the mismatch between for example packaging production, so, so tsm C'S ability to produce wafers of like, kind of packaged chips. And how I. And then downstream, GPU module assembly and how that compares to GPU demand.
And they're just kind of flagging this interesting mismatch where it seems like there's about a 1 million or 1.1 million GPU unit oversupply currently expected heading into anyway the next few quarters, which is really interesting given where things were at just like two years ago, right? That massive, massive shortage that saw prices skyrocket. So cur, you know, kind of curious to see what that does to margins in the space.
This is all because of NVIDIA inventory buildup, basically, like there, there's a whole bunch of, of excess there. Anyway, and, and there were some yield issues and things like that that are being fixed. Anyway, they, they've got interesting, interesting numbers about the whole space. CapEx increasing across the board from these massive cloud companies by like pretty wild amounts. And in particular. ASIC shipments.
So basically AI chip shipments projected to go up 40% year over year, which is huge. I mean, that's like, that's a lot more chips in the world than there were last year. And keep in mind, those chips are also much more performant than they were before. So it's 40% year over year growth on a per chip basis, but on a per flop basis, a per compute basis, it's even more than that. You know, we may be like doubling the amount of compute or, or actually more that there is in the world based on this.
Anyway, you can check it out if you're, if you're a nerd for these things and you wanna see, you know, what's happened to Amazon's training two demand it's up 70%, by the way, which is insane. And a bunch of other cool things. So, so check it out if you're like a sort of like finance and compute nerd. 'cause this is just gonna, just gonna be your, your weekend read. Onto the next section, projects and open source.
We just have one story here I guess to try and save time 'cause there is a lot more after. And the story is pretty simple. Meta is delaying the rollout of the biggest version of llama. So when they announced four, they also were previewing lama for behemoth, their large variant of LAMA four That is meant to be competitive with, you know, cha, BT, and, and Quad. And. Basically the frontier models.
So it seems like, and according to sources, that they initially planned to release this behemoth in April, that was later pushed to June, and it has now been pushed again until at least fall. So this is all, you know, kind of internal, they never committed to anything, but it seems like per kind of the reports and general, I think things that are coming out that Meta is struggling with training, with model to be as good as they want it to be.
Yeah, I think this is a, actually a really bad sign for meta. Because also they have a really big compute fleet, right? They, they have huge amounts of CapEx that they've poured into into AI compute specifically. And what this shows is that they now have consistently struggled to make good use of that CapEx.
They have been consistently pumping out these, like pretty mid models unremarkable and then to make up for that, gaming them to make them look more impressive than they are in a context where deep seek is eating their lunch, both from a marketing branding standpoint and also just raw performance and compute efficiency. And so. Yeah, this is, this is really bad.
The, the whole reason that meta turned to open source, it was never because they thought that they were going to somehow open source a GI that was never gonna happen. Anybody who has a GI locks it down and uses it to like, bet on the stock market and then funds the next generation of scaling in that shit. But and then obviously automates AI research. It, it was eventually gonna get locked down.
This was always a recruitment play for meta and there were some other ancillary infrastructure things, getting people to build on, build on their platforms and that. But the biggest thing absolutely was always recruitment. And now with that story just falling flat on its face it's really difficult. Like if you wanna work at the best open source AI lab.
A like, unfortunately, it looks like right now there are Chinese labs that are absolutely in the mix, but B there are a lot of interesting players who seem to be doing a better job on a kind of per flop basis. Over here. You look at even Alan ai, right? They, they're putting out some really impressive models themselves. You've got a lot of really, anyway, impressive open source players who are not meta. So I think like Zuck is in a real bind and they're doing a lot of damage control these days.
Yeah, and I think this speaks to like Meta has really good talent. They have been publishing just fantastic work for many, many years, but my sense is that the skills and experience and knowledge needed to train a massive, massive LLM model Yeah. Is very different. And the competition for that talent is just immense. X Xai when it came out, I think was.
Seemingly providing just really, really big packages to try and get people who have experience in that philanthropic has had very high retention of their talent. I think I, I saw a number somewhere like 80% retention. We've seen people leaving from Google to go do their own startups. So I think meta, presumably that's part of a problem here is this is a pretty specialized skillset and knowledge and they've been able to train good lms.
But to really get to the frontier is not as simple as maybe, you know, just scaling. After research and advancements, and we begin with not a paper and not a very detailed kind of advancement, but a, a notable one. And this is also from Google, so sort of under radar, just as a little research announcement and demo, they did announce a Gemini diffusion. And this is the kind of demonstration of doing language modeling via diffusion instead of auto regression.
So typically any chat bot you use these days essentially is generating one token at a time, left to right, start to finish, you know, it picks one word, then it picks the next, then it picks the next. And we I think recently covered. Efforts to move that to the diffusion paradigm where you basically generate everything all at once. So you start with all the text and there's some messy kind of initial state, and then you update it to do better.
And the benefit of that is you can do be just way, way, way, way faster compared to generating one word, a one token at the time. So DeepMind has come out with a demonstration of diffusion for Gemini, for coding. That seems to be pretty good. Seems to be comparable with Gemini two flash light with smaller kind of not quite as powerful fast model. And they are claiming speeds of about. 1500 tokens per second with very low initial latency.
So something roughly on a scale of 10 times faster than GBT 4.1, for example, just lighting fast speeds not much more details. Here, you can get access to the demo signing up for a wait list. And yeah, if they can push this forward, if they can actually make diffusion be as performant as just at aggressive generation at the frontier. Really, really big deal. And diffusion. So conceptually diffusion is, is quite useful from a parallelization standpoint.
It, it, it's got properties that allow you to paralyze at a, just in, in more efficient ways than transformers potentially. One of the consequences of that, they show a case where the model generates 2000 tokens per second of effective effective kind of token rate generation, which is pretty, pretty wild. It means you're almost doing like instant generation of chunks of code.
There's, to kind of give you a sense for why this would matter, there's a certain kind of, some sometimes know as like non causal reasoning that these models can do that your traditional auto regressive transformers can't. So an example is you can say like, solve this math problem. First give me the answer and then after that, walk me through the solution. Right?
So give the answer first, then give the solution that's really, really hard for standard auto aggressive models because what they wanna do is spend their compute first to spend their inference, time compute, generating a bunch of tokens to reason through the answer, and then give you the answer. But they can't, they're being asked to generate the solution right away and only generate the the sort of derivation after.
Whereas with diffusion models, they're seeing, they're, they're generating the whole thing all at once. They're seeing the whole canvas all at once. And so they can start by having, you know, a crappy solution in the first cycle of generation and, and a crappy derivation. But as they modify their derivation, they modify the solution, blah, and then eventually they get, you know, the right answer on the whole.
So this may seem like a pretty niche thing, but it can matter in certain sort of specific settings where certain kind of causalities at play and you're trying to solve certain problems. And just generally it's, it's good to have other architectures in the mix because if nothing else you could do like a kind of mixture of models where you have some models that are better at solving some problems than others. And, and this gives you an architecture it's a bit more robust for, for some problems.
Right, and like intuitively, you know, you're so used to when you are using chat GBT or these LMS to this paradigm of like, you enter something and then you see the text kinda pop in and you almost are reading it as it is being generated with diffusion. What happens is, like all the texts kind of just shows up it's near real time and that is a real kind of qualitative difference where it's no longer, you know, waiting for it to complete as you're going.
It's more like you enter something and you get the output almost immediately which is kind of bonkers if you think it can be made to work anywhere near as well as just the auto regression paradigm, but not many details here on the research side of this. Hopefully they'll release more 'cause so far we haven't seen very successful demonstrations of it. And moving on to an actual paper, we have a chain of model learning for language model.
So the idea here is you can incorporate what they're saying as hierarchical hidden state change chains within transformer architectures. So what that means is you can hidden states in neural nets is basically just the soup of numbers in between your input and output. So you take your input, it goes through a bunch of neural computing units and generate all these intermediate representations from the beginning to end and, and keep updating until you generate the output.
So. The gist of a paper is that if you structure that hidden state hierarchically and have these chains that are processed at different levels with different levels of granularly granularity and with different levels of model complexity and performance, you can be more efficient. You can use your compute in more dynamic and more kind of flexible ways. So that's, I think, the gist of this. And I haven't looked into this deeper sort of, Jeremy, maybe you can offer more details.
Sure. I, I think this is kind of a, a banger of a paper. It's also frustrating that this is, I mean, this is a, a multimodal podcast. We have video, but we don't like, you know, there's like an image in the paper that makes it make a lot of sense. It's figure two that just sort of shows the architecture here, but high level you can imagine, neural network has, you know, layers of neurons that are st, you know, stacked on top of each other.
And typically the, you know, the neurons from the first layer are e, each one of them is connected to each neuron in the second layer. And each neuron in the second layer is connected to each neuron and the third layer and so on. So you kind of have this dense mesh of, of neurons that are linked together. So there's a width, right? The number of neurons per layer, and then there's a depth, which is the number of layers to the network.
in this case, what they're gonna do is they're kinda gonna have a slice, a very small, narrow of width slice of this network. And they're going to essentially make that the backbone of the network. So let's imagine there's like, you know, two, two neurons in each layer. And the two neurons from layer one are connected to the two neurons from layer two and layer three and so on. And the two neurons, say at layer two can only take input from the two neurons at layer one.
They, they can't see any of the other neurons at layer one. That then becomes this like pretty cordoned off structure within a structure. So if, if you have like a, a larger number of neurons in each layer that are only connected to the, the additional anyway sets of neurons at each layer, hopefully you can just check out the figure and see it. You can kind of see how this allows you to kind of increase.
the size you can run your model in, in larger mode, either by only using the thin slice of say, two neurons that we talked about, or by considering a wider slice, you know, four neurons or, or eight or 16 or whatever.
And so what they do is they find a way to train this model such that they are training at the same time, all these kind of smaller submodels, these thinner submodels, so that once you finish training, it like costs you the same amount basically to train these models, but you end up for free with a bunch of smaller models that you can use for inference. And the other thing is, because of the way they do this, the way they engineer the loss function is such that.
The smaller slices of the model they have to be able to independently solve the problem. So the, the thinnest slice of your model has to be able to make decent predictions all on its own. But then if you add the next couple of neurons in each layer to your model and get the slightly wider version, that model is gonna perform a little bit better 'cause it's got more scale, but it also has to be able to independently solve your problem.
And so it ends up those extra neurons end up specializing in kind of refining the answer that your first thinner model gives you. So there's this sort of idea where you can gradually control, you can, you can tune the width of your model or effectively the level of, capacity that your model has dynamically.
At Will. And from an almost interpretability standpoint, it's quite interesting 'cause it means that the, neurons from that thinnest slice of your network that's still supposed to be able to operate coherently and solve problems independently. Those neurons alone must be kind of focused on more foundational basic concepts that generalize a lot. And then the neurons that you're adding to the side of them are more and more specialized. As you add onto them.
They're, they're gonna allow the model to perform better when they're included, but. excluding them still results in a functional model. So this is, there's a lot of detail to get into in the paper we don't have time for, but I highly recommend taking a look at it. I wouldn't be surprised if something like this ends up becoming fairly important. It just, it smells of good research taste, at least to me. It is a Chinese lab that came out with it, which is quite interesting.
But in any case check it out. Highly recommend. It's yeah, it's cool paper. Yeah. Actually a collaboration between Microsoft Research and Food University. right. Several others. But they did open source or save a will open source for code for this stuff. And, but paper is is kind of funny. They. It produced a lot of terms where like, here's yes, the notion of chain of representation, which leads into chain of layer, which leads into chain of model, which leads into chain of language model.
Where we idea that yeah, these kind of cumulatively lead up to the notion that when you train a single large model, it contains these sort of submodels and it is quite elegant, as you say. Now that I've taken a bit of a deeper look, next paper is seek in the dark reasoning via test time instance level policy gradient in latent space. So the idea and or problem here is.
variant of test time compute where you want to be able to do better for given input by leveraging computation at test time rather than train time. You're not updating your parameters at all, but you're still able to do better. And the idea of how this is done here is sort of mimicking prompt engineering. So you're tweaking the representations of the input for the model, but instead of actually literally tweaking the prompt for a given input, it's tweaking the representations within the model.
So they are using every reward function to update the token wise, latent representations in the process of decoding, and that they show can be used to for given input, improved performance quite a bit. So they're kind of optimizing the. Internal computations in an indirect way that is yet another way to be able to scale at test time. Quite different from, for instance, chain of thought. Yeah. So that was actually really good.
I never thought of this as an alternative to prompt engineering, but I think you're exactly right. Right? It's like a activation space, prompt engineering, or at least that's that, that's a really interesting analogy. Yeah. It, it's so, so this, this is another in my opinion, another really interesting paper.
So is the basic idea is you're gonna take a prompt, you know, feed it to your model is in this case you, you're gonna give it a reasoning problem and get the model to generate a, a complete chain of thought, right? So the model itself just generates the full chain of thought, vanilla style, like nothing unusual. And then you're gonna feed the chain of thought. To the model, and you're going to, this is going to lead to a bunch of activations at every layer of the model as usual.
Now, the final layer of the model, just before it gets decoded, you have activations there and you're gonna say, okay, well why don't we essentially build a reinforcement learning model and have that, that model play with just those activations. And what we're gonna do is we're going to get the model itself to like decode and then, estimate the expected reward on this task for, for the, the final kind of decoding the answer. And you're gonna do it in a very, very, kind of simple, greedy way.
So whichever token is, is given highest probability. That's just the one that you're gonna predict. And you're gonna use essentially a version of the same model to predict the reward. And then, like, if the reward is low, you're gonna go in and modify. So according to the model's own self-evaluation, if the, if the reward is low, you're gonna modify the activations in that final layer, the activations that sort of represent or encode the chain of thought that was fed in.
So you're gonna tweak those. And then you'll, you'll try again decode, and then get the model to evaluate that output. Oh, you know, I, I think it needs to be, you know, we need to do some more tweaking. So you go back and you tweak again the activations and you can do a bunch of loops like this. Essentially. It's like getting the model to, to correct itself. And then based on those corrections, it's actually changing its own representation of the chain of thought that it was chewing on.
And it's really quite interesting. And, and again, it feels it sort of feels obvious when you see it. But somebody had to actually come up with the idea. A couple observations here. So there's an interesting scaling behavior as you increase the number of iterations of the cycle, right? Get, get the model to actually decode, evaluate its own output, then tweak The activation's a bit. What you find is there's typically like a, an initial performance improvement that's followed by a plateau.
And that plateau seems to come from the model's own ability to evaluate, to predict the rewards that would be assigned to its output. when, instead of the model self-evaluating, you use an accurate reward model. One that always gets the reward prediction, right? Then all of a sudden that plateau disappears and you actually get continuous scaling, like the more of these loops you do, as long as you're correctly assigning the reward. And, and it corresponds to like the true base reality.
You just continue, continue, continue to improve with scale. So that's another scaling law implied in here, which is quite impressive. There's also a bunch of like compute efficiency stuff, so. There's a question of like, do we, do we think of the, the playing field as every activation in the final layer of the transformer? Or as a subset, we could imagine only optimizing, only doing reinforcement learning to optimize say, 20% of those activations.
And in fact, it turns out that that ends up being the optimal way to go. And 20% is a, a pretty good a pretty good number. They find don't, don't optimize all of those activations, just optimize some of them. And at least for me, that seemed counterintuitive. Like why wouldn't you wanna optimize the full set of activations? It turns out, you know, a couple of reasons. One is just optimization stability, right?
So if you're updating everything, there's a risk that you're just gonna go too far off course and you need to have some anchoring to the original meaning of the chain of thought. So you don't yeah. Steer way off. And then there's issues of representational capacity. So just having enough latent representations to allow you to do effective extrapolation. Anyway, this is a really, I think, interesting and important paper.
Wouldn't be surprised to find it turn into another dimension of test time scaling. So yeah, just thought, thought it was worth calling out. Yeah, it's, it's interesting in a sense of, I don't know, it's, it's like you have an auxiliary model or you could conceptually have an auxiliary model that's just for evaluating this, like in-between activation and doing sort of side optimization without updating your main model.
something about, it seems a bit strange conceptually and, and maybe there's like equivalent versions of this, but that's just a gut feeling somehow I get, and next we have two experts are all you need for steering thinking. Reinforcing cognitive effort in MOE reasoning models without additional training is the title of this paper. So this is a way to improve reasoning in mixtures of experts model it without additional training.
Mixture of experts is when you have a model that sort of splits a work across subsets of it, more or less, and they are aiming to focus on and identify what we are calling cognitive experts within the model. So they're looking for correlations between undesirable reasoning behaviors and the activation patterns of specific experts in NME mixtures of experts models. So basically just large language models that have mixtures of experts.
And then when they find the experts that turn out to have the best kind of reasoning behavior, they amplify those experts in the computation of the output. And typically the, they make sure of experts work is you like, route your computation to a couple of experts and then you sort of average out the outputs of those experts to decide what to output. So conceptually you can sort of give more weight to certain experts or route the data to certain experts more often.
So when they find these, have these theoretical cognitive experts, they show that in fact, this seems to be something that can be done in practice for L LMS that have MOE for reasoning applications. Yeah. And it's, it's kind of like, I wanna say embarrassingly simple how they go about identifying what are the experts, what are the components of the model that are responsible for doing reasoning? And so it turns out, when you look at the way deeps, SEEQ, R one is trained, right?
It's trained to put, its thinking, it's reasoning between these thinking tokens, right? So they kind of have, it's like HTML if you're familiar with that.
Like, you know, you have like bracket, think bracket and then your actual thinking text and then close bracket think bracket what they end up doing is they say, okay, well, like, let's see which experts I. Typically get activated on the thinking tokens, and it turns out that it's only a small number that consistently get activated on the thinking tokens. So, hey, that's a pretty good in that those are the experts involved in, the reasoning process.
So the way they test that intuition is they say, okay, well if that's true, presumably, like you said, Andre, if, if I just dial up the contribution of those experts of the reasoning experts on any given. Prompt that I give them, then I should end up seeing more effective reasoning, or at least a greater inclination towards reasoning behavior. That's exactly what happens.
So this is pretty, I, I would've like, it's, this happens so often, but like, I would've been embarrassed to suggest this idea, like in a, it just seems so obvious, and yet the obvious things are the ones that work. And in fairness, they only seem obvious in hindsight. This is obviously a, a very good idea. Anyway, so they, they use a, a, a metric called pointwise Mutual information to measure the correlations between expert activations and reasoning tokens.
It's actually a pretty simple measure, but it, there's no point going over it in detail. one interesting thing is there's cross-domain consistency though. So the same expert pairs consistently appeared as the top reasoners the top cognitive experts across a whole bunch of domains, math, physics a bunch of stuff, which really does suggest that they encode general reasoning capabilities. I wouldn't have bet on this, like the idea that there is. An expert in an MOE that is the reasoning guy.
One thing, they don't touch on this in the paper, but I would be super interested to know how are the different so-called reasoning experts? I. Different, right? So like they're saying there's two reasoning experts basically in this model that you need to care about. So how do, like what, in what ways do their behaviors differ, right? What is, what are the different kinds of reasoning that the model is capable of or, or wants to divide between two different experts?
I think that'd be really interesting. Anyway, so a whole bunch of other stuff we could get into about computer efficiency. But there is no time. There is no time. We have quite a, a few more papers to discuss. So a lot of research also this week and next one is another Gemini related paper. It's lessons from defending Gemini against indirect prompt injections. Coming from Google. Quite a detailed report, something like I think 16 pages.
No, actually like dozens of pages if you include the appendix with, yeah, all the various details. The gist of it is. You're looking at indirect prompt objections. So things like embedding data in a website to be able to get an AI agent that's been, you know, directed to go off and do something go off course. And the short version I'll provide us a summary.
And Jeremy, you can add more details as you think you know is appropriate, is that they find that it is possible to apply known techniques to do better so you can protect against known attacks and do that via adversarial, fine tuning, for instance. But the high level conclusion is that this is an evolving kind of adversarial situation where you needs to essentially. Be continually on it and see what are these new attack techniques to be able to deploy new defense techniques as things evolve.
I, I think that's a, a great summary, especially given time constraints. Yeah, the I'll just highlight two quick notes. So first is they find adaptive evaluation of threats is critical. So a lot of the defenses that do really well on static attacks can be tricked by really small adaptations to attack. So tweak an attack very slightly and then it suddenly works, right? So this is something that, anyway, we see all the time.
and then there's this other notion that if you, if you use adversarial training to help your models get more robust to these kinds of attacks that's gonna cause the performance to drop. And what they find is that's actually not the case. One of the most interesting things about this paper is just like the list of attacks and defenses to prompt injection attacks that they, they go over. I'm gonna mention one and then we will move on. But it's just called the spotlighting defense.
I actually had never heard of this before. So if you have an attacker who injects a prompt into, or some, some dangerous text into a prompt, like ignore previous instructions and, do some bad thing. The spotlighting defense, what it does, it will insert what are known as control tokens.
So they're basically just like new different kinds of tokens at regular intervals that just break up your text so that, you know, ignore previous instructions, get split up and you have, you know, IG and then control token and then no, and then pre, and then another control token. And it, it just has a way of and then you tell the model, sorry, in the prompt you tell it to be skeptical of text between those control tokens.
And so that kind of teaches the model to, you know, be a little bit more careful about it. And it has anyway, really effective results. There's a whole bunch of other defenses and attacks they go into. If you're interested in the attack defense balance and the zoo of possibilities there, go check out this paper. It's a good catalog. Next up we have from Epic ai, how FAST CAN Algorithms, advanced Capabilities, so this is a blog post associated with a previously released paper titled LLM.
E guess can LLM capabilities advanced without hardware progress? The motivation of the research is basically asking the question of can we find software improvements that yield big payoffs in terms of better accuracy? So it ties into this hypothesis that if LLMs get good enough at conducting good AI research, they can find breakthroughs to self-improve.
And then you get this circle called intelligence explosion, where you, the VMs get better at research, they find new insights as to how to train better LLMs, and then the better lms keep finding better algorithmic insights until you become super, super ultra intelligent. And this is one kind of commonly. Believe hypothesis as to why we might get what is it? SAI, super intelligent ai. A SI, yeah, A-S-I-A-S-I relatively soon.
So this blog post is essentially trying to explore how likely that scenario is based on the trajectory and history of Val Go make progress so far. And the gist of their conclusion is that there are two types of investments. They are compute, dependent and compute independent insights. So there are some insights that only demonstrate their, true potential at large scales.
Things like transformers, mixtures of experts sparse attention that, you know, with smaller models when you're testing may not fully show you how beneficial they are, how promising they are. But as you scale up, you get way, way stronger. Benefits of like 20 times a performance, 30 times a performance versus smaller things like layer norm, where you can reliably tell that this algorithmic tweak is gonna improve your model.
You know, and you can verify that at a hundred million parameters instead of 10 billion parameters or a hundred billion parameters, meaning that you can do research and evaluate these things without like ultra large hardware capacity. So. The, the basic conclusion of a paper is that the idea that you can get intelligence explosion needs to be a result of finding these compute dependent algorithmic advances being easier to find.
So you need to find the advancements that as you scale up compute will yield like big, big payoffs rather than relatively small payoffs. Yeah. The, the frame is that these, so these compute dependent advances are you, like you said, you only see the, the return on investment at large scales for the full return on investment at large scales. And they point out that when you look at the the, the boosts in algorithmic efficiency that we've seen over the years.
These are dominated by compute dependent advances. So you look at the transformer, the MOE mixed query attention, sparse attention, these things collectively, they're like 99% of the compute efficiency improvements. We've seen 3.5 x according to them from compute independent improvements like flash attention and rope. But but they don't hold the candle to the these like approaches that really leverage large amounts of compute. And so I think in their minds, the case that they're making is like.
You can't have a software only singularity if you need to like leverage giant amounts of physical hardware to test your hypothesis, to validate that your new algorithmic improvement is actually effective. You need to actually work in the physical world to gather more hardware. I think this frankly doesn't do the work that it thinks it does. There, there are a couple of issues with this and, and actually Ryan Greenblatt on x has a, a great tweetstorm about this.
By the way, first of all, love that epic. KI is doing this. Really important to have these concrete numbers so that they can facilitate this sort of debate. But I think the key thing here is so they highlight, look, transformers, transformers only kind of give you returns at outrageous, or, sorry, give you the greatest returns at outrageous levels of scale. So therefore they're a compute dependent advance. I don't think that's what actually matters.
I think what matters is, would an automated software only process have discovered the transformer in the first place? And to that, I think the answer is actually probably yes, or at least there's no clear reason that it wouldn't have, in fact, the transformer, I. MOE mixed query attention. They were all originally found at Tiny Scale as Ryan points out about one hour of compute on an H 100 GPU.
So that's like quite small even, you know, even back in the day in relative terms, it was certainly doable. And so this is like the, the actual question is do you discover things that give you a little lift that makes them seem promising enough to be worthy of subsequent investment? The answer seems to be that actually basically all of the advances that they highlight is the most important. Compute dependent advances have that property we're discovered at far, far lower scale.
And we just keep investing in them as they continue to, to kind of show promise and value. And so it's almost like, you know, any startup, you keep investing more capital as it shows more traction. Same thing. You should expect the decision theoretic loop of a software only singularity to like, to latch onto that. 'cause that's just good decision theory. So anyway, I, I think this is a really rich area to dig into. I have some issues as well with their, their frame.
They, they look at a deep seek and they kind of say that the deeps seek advances were all kind of compute constrained advances or compute dependent. But again, the whole point of deep seek was that they used such a small pool of compute. And so I almost wanna say like. To the extent that compute independent means anything deep seek, a lot of their advances really should be viewed as statutorily compute, independent. Like the point is that they had very little compute.
This is actually a great test bed for what a software only process could unlock potentially. So lots of stuff there. You can look into it. I think it's, it's a great report and and great room for discussion. Yeah, I think it's, it's kinda introducing the conceptual idea of compete dependent versus independent algorithms. And then there are. Questions or ideas you can extrapolate. Last paper really quickly.
I'll just mention without going into depth, there is a paper titled Reinforcement Learning Fine Tunes, small Sub-Networks in Large Language Models. the short, short version is when you do alignment via reinforcement learning that turns out to update a small number of the model parameters. Something like 5% or sorry, 20% versus doing supervised fine tuning. You update all the weights as you might expect. So this is a very strange and, and interesting.
Kind of behavior of reinforcement learning alignment versus supervised alignment. And I figured should just mention it as an interesting paper, but no time to get into it. So moving on to policy and safety. We have first and exclusive with a report on what OpenAI told California's Attorney General.
So this is, I suppose, a leak or, or perhaps, I don't know, a demonstration of this response to petition for attorney general action to protect the charitable nature of open AI's sent to the Attorney General on May 15 by OpenAI basically has. All their arguments in a position to the groups that want to stop OpenAI from restructuring and really just restating what we've been hearing a whole bunch.
You know, Musk is just doing this as a competitor and is being, is harassing us and has misinformation and basically saying, you know, ignore this petition to block us from doing what you want. It isn't valid.
Yeah, and, and it's, so there are a whole bunch of interesting contradictions in there as well with some of the claims opening AI's been made, or at least the vibes they've been putting out, which is pretty standard open AI fair that they'll, you know, they get, they try to get away with a lot, it really seems. And there's a lot of examples of this here. So one item is so they, they suggested the nonprofit.
By the way, so some of this is revealing information, material information about the nature and structure of the deal. This sort of nonprofit transition thing that was not previously public, right? So opening AI recently came out and said, look, this whole plan we had of having the for-profit kind of get out from under the control of the nonprofit, we're, we're gonna scrap that. Don't worry guys, we hear you loud and clear. There are now a bunch of caveats.
We, we highlighted, I think last week that there would be caveats. The story is not as simple as OpenAI has been making it seem. A lot of people have kind of declared victory on this. It said, great, you know, the nonprofit transition isn't happening. Let's move on. But hold on a minute. This, this is OpenAI doing their usual best to to kind of control the, the PR around this. And, and they have done a good job at that. So here's a quote and let me just really quickly mention for context.
This is partially in reply to this not for private gain coalition that has a public letter. They released a public letter in April 17. They updated their letter in May 12th in response to on May 5th, OpenAI announcing that they're kind of backing off from trying to go full for profit with this new plan of the public benefit corporation and the kind of not going for profit. So this not for gain, a coalition updated their stance and essentially has criticism still.
And this letter on May 15th is in response to whole chain of criticism. Yeah. So if it wasn't complicated enough already, already yeah. And, and so here's, here's a line from the opening eye statement here. The, the nonprofit will exchange its current economic interests in the capped profit for a substantial equity stake. I. In the new Public Benefit Corporation and will enjoy access to the public benefit corporation's, intellectual property and technology personnel and liquidity.
That sounds like a good thing until you realize that. Well, wait a minute. The nonprofit did not just enjoy access to the technology. It actually owned or controlled the underlying technology. So now it's gonna just have a license to it. Just like opening Eyes commercial partners. That is a big, big caveat, right? That is not consistent with really the, the spirit potentially, but certainly the, the facts of the matter associated with the previous agreement as I understand them under the.
The current structure opening AI's, LLC, so the, the sort of main operating agreement explicitly states that the company has a duty to its mission and the principles advanced in the OpenAI charter take precedence over any obligation to generate a profit that creates a legally binding obligation on the directors of the company, the company's management.
Now, under the new structure, though, the directors would be legally required to balance shareholder interests with the public benefit purpose, and so. This is like the fundamental obligations, legal duties of the directors is now going to be to shareholders over, or potentially alongside, I should say, the mission. And that shift is probably a big reason why investors are more comfortable with this arrangement. We heard SoftBank say, you know, look, from our perspective, everything's fine.
After they had said, OpenAI has got to get out from under its nonprofit in order for us to, to keep our investment in, and now they're making these noises like they're satisfied. So clearly for them, defacto, this is what they wanted. Right? So there's something going on here that doesn't quite match up and, and this is certainly part of it or at least seems like it is. By the way, no public benefit company in Delaware.
Garrison Lovely, who's the author of this says no Delaware PBC has ever been held liable for failing to pursue its mission. Their legal scholars can't find a single benefit enforcement case on the books. So. In practice, this is a very wide latitude, right? There's a lot of shit that this could allow.
in this letter, they're trying to frame all the criticism of this very controversial and I think pretty intuitively kind of inappropriate attempt to, to try, you know, convert the the kind of nonprofit or the, all that jazz. They're trying to pin it on Elon and say basically he's like the only critic or that that's sort of the frame. Just 'cause it's easy to dismiss him as a competitor and for political reasons. He's a, he's an easy whipping boy. But there's a whole bunch of stuff in here.
You know, there, there's like, I'll, I'll just read one last one last excerpt 'cause we gotta go but open AI's criticism of the coalitions. This is the coalition you referred to Andre April 9th letter is particularly puzzling the company faults, the coalition for claiming that quote. OpenAI proposes to eliminate any and all control by the nonprofit over open AI's core work.
This criticism is perplexing because as OpenAI itself later demonstrated with its May 5th reversal, that was precisely opening AI's publicly understood plan. At the time the coalition made its statement, the company appears to be retroactively criticizing the coalition for accurately describing open AI's proposal as it stood. So you could be forgiven for seeing a lot of this as kind of manipulative, bad faithy comms from OpenAI, especially given that this letter was not meant to be made public.
And it fits, unfortunately, a pattern that we have seen the people, at least many people believe, they have seen many times over. We'll see where it all goes, but this is a, a thorny, thorny issue. Yeah, I think we've gotten hints at kind of the notion that open the eye legally, I. Has tried to be aggressive not just legally, also publicly in terms of arguing with Musk and so on. And we only have time for one more story, so we're gonna do that.
We have activating ai, safety level free protections from philanthropic. So Anthropic has their responsible scaling policy that sets out various thresholds for when they need to. Have these safety level protections with additional safety levels requiring a greater scrutiny more stringent processes, et cetera. So with Claude Opus four, they are now implementing these AI safety level three measures as a precautionary measure.
So they've said, we are not sure if Opus four is kind of at the threshold where it would be dangerous to the extent that we need this set of protections, but we are gonna implement them anyway. And this comes with a kind of a variety of stuff. They're committing to do it. They are making it harder to Jailbreak. They are adding additional monitoring system. They have a bug bounty program, have synthetic jailbreak data security controls, making sure the weights cannot be stolen and so on.
Got quite a few things. They released A PDF with the announcement that is you know, like something like a dozen pages with additional details in appendix. Yeah. And so the specific thing that's causing them to say, we think we, we are flirting with the ASL three threshold is the the bio risk side, right? The ability they think potentially of this model to significantly help individuals with basic technical backgrounds, like we're talking undergraduate STEM degrees to create or obtain and deploy.
Biological weapons. Right? So that's really where, where they're at here specifically. This is not I don't think associated with the autonomous research or autonomous autonomy risks that that they're also tracking. But we got early glimpses of this, right, with SONET 3.7. I think the language they used it was either ANTHROPIC or open AI with their model.
It was sort of similar was we are on the cusp of that next risk risk threshold where really it's kind of similar whether you look at the open AI preparedness framework or philanthropics L three in terms of how they define some of these standards. The security measures are really interesting especially kind of given our work on the data center security side and the cluster security side. One of the pieces.
And this echoes a re recommendation in a, in a RAND report on securing model weights that came out over a year ago now. They have implemented preliminary egress bandwidth controls. So this is basically restricting the flow of data out of secure computing environments where AI model weights are.
So literally like at the hardware level, presumably that's at least how I read this making it impossible to get more than a certain amount of bandwidth to pull data of any kind out of your, outta your servers. That's meant to make it so that if somebody wants to steal a model, it takes them a long time, at least if they're gonna use your networks, your infrastructure. And there are ways to kind of calculate what the optimal bandwidth would be under certain conditions for that.
But that was kind of interesting. That's a, a, a big piece of really r and d that they're doing there. Also a whole bunch of management protocols and point software controls, and there's a bunch of stuff here. this is a big leap, right? Moving to, to a L three. So this is fundamentally increasing. Th it means that they're concerned about threat actors, like terrorist groups and organized crime.
That, that they would start to derive a lift a, a significant benefit potentially from accessing Anthropics ip. They are not, you know, ASL three does not cover nation state actors like China. So they're not pretending that they can defend against that level of, of attack. It's, it's sort of like working their way there. As their models get more powerful, they wanna be able to defend against a higher and higher tier of adversary. so there we go.
Curious to see what the other labs respond with as, as their capabilities increased too. Yeah. And, and we're seeing hints that maybe we'll cover more next week, that and we've already covered to some extent that these reasoning models, these sophisticated models. Are maybe harder to align and, and are capable of some crazy new stuff. So this also makes sense for that. Yeah. But we are gonna call it with that for this episode, thank you for listening.
As always, we appreciate you sharing commenting and listening more of an anything. So please do keep tuning in.