
#209 - OpenAI non-profit, US diffusion rules, AlphaEvolve

May 19, 2025 · 1 hr 53 min · Ep. 249

Episode description

Our 209th episode with a summary and discussion of last week's big AI news! Recorded on 05/16/2025

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Read our text newsletter and comment on the podcast at https://lastweekin.ai/.

Join our Discord here! https://discord.gg/nTyezGSKwP

In this episode:

  • OpenAI has decided not to transition from a nonprofit to a for-profit entity, instead opting to become a public benefit corporation influenced by legal and civic discussions.
  • Trump administration meetings with Saudi Arabia and the UAE have opened floodgates for AI deals, leading to partnerships with companies like Nvidia and aiming to bolster AI infrastructure in the Middle East.
  • DeepMind introduced AlphaEvolve, a new coding agent designed for scientific and algorithmic discovery, showing improvements in automated code generation and efficiency.
  • OpenAI pledges greater transparency in AI safety by launching the Safety Evaluations Hub, a platform showcasing various safety test results for their models.

Timestamps + Links:

Transcript

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, we will summarize and discuss some of last week's most interesting AI news, and sometimes we will also be discussing the news from the last, last week. Unfortunately, we did miss last week again. We're sorry, we're gonna try not to do that. But we will be going back and covering the couple of things that we missed.

And as always, you can go to the episode description to get the timestamps and links to all the things we discussed. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at a Silicon Valley Gen AI startup. And I'm your other host, Jeremie Harris. I'm with Gladstone AI, an AI national security company.

And we're talking about the last couple weeks — rare that we have two weeks to catch up on, obviously, but when we do, usually what happens is God just gives us a big smack in the face and he's like, you know what, we're gonna drop GPT-7 and GPT-8 at the same time, and now Google DeepMind's gonna have their own thing, Sam Altman is gonna get assassinated, then he's gonna get resurrected, and then you're just gonna have to cover all this this week, these two weeks.

Very different this time — it kind of seems weirdly quiet, a bit of a reprieve. So thank you, universe. Yeah, I remember a couple months ago there was a thing where there was like Grok 3 and Claude 3.7 and GPT something, something. It was like everything all at once. This one, yeah, nothing too huge in the last couple weeks. So a preview of the news we'll be covering: we're actually gonna start with business this time, 'cause I think the big story of the last two weeks is OpenAI

deciding it will not go for-profit — or, the controlling entity of OpenAI is not gonna go for-profit — which is interesting. Gonna have a few stories on tools and apps, but nothing huge there. Some new cool models to talk about in open source. Some new exciting research from DeepMind dealing with algorithms in research, and then policy and safety, focusing quite a bit on the policy side of things with the Trump administration and chips. And just before we dive in, I do wanna shout out some Apple reviews.

In fact, I saw just recently there was a review where the headline is "If a podcast is good, be consistent" — please post it consistently. As the title says: one podcast per week, haven't seen one in the last few weeks now. And yes, we're sorry. We've tried to be consistent, and I think it's been a bit of a hectic year, but in the next couple months it should be more doable for us to be weekly on this stuff. Well, let's get into it. Applications and business.

And the first story is OpenAI saying that it is not gonna go through with trying to basically get rid of the nonprofit that controls the for-profit entity. So as we've been covering now for probably like a year or something, OpenAI has been meaning to transition away from the structure it has had since,

I guess, its founding, certainly since 2019: there is a nonprofit with a guiding mission that has ultimate control of a for-profit, and the for-profit is able to receive money from investors and is responsible to its investors.

The nonprofit, ultimately, is responsible to the mission and not to the investors, which is a big problem for OpenAI since, of course, they had this whole crazy drama in late 2023 where the board fired Sam Altman briefly, which I think spooked investors, et cetera, et cetera. So now we get here — this started late 2024-ish — there was a lot of litigation, initially prompted I think by Elon Musk, basically lawsuits saying that this is not okay.

That you can't just change from nonprofit to for-profit when you got some money while you were a nonprofit. And yeah, it looks like OpenAI backed down, basically, after apparent dialogue with the Attorney General of Delaware and the Attorney General of California. After what they say were discussions with civic leaders and attorneys general, they are keeping the nonprofit. They are still changing some things: the subsidiary, you could say, will transition to being a public benefit corporation.

That's the same thing that Anthropic and xAI are — basically a for-profit with a little asterisk that you're meant to be doing your for-profit stuff for the public good. That does mean they'll be able to do some sort of share thing — I think that does imply that they're able to give out shares. The nonprofit will receive some sort of stake in this new public benefit corporation. So yeah, to me, I was pretty surprised when I saw this. I thought OpenAI was gonna keep fighting it.

That they had some chance of being able to beat it, given their position. But yeah, seems like they were just kind of defeated in court. So there are a couple of asterisks to this whole thing. Yeah, you're absolutely right. So the significance of that attorneys general piece is actually quite, quite significant — sorry for reusing the word. So the backstory here, right? The Elon Musk lawsuit, I think, is a really good lens through which to understand this.

So Elon, you know, famously sued OpenAI for exactly this, right? That was a big thing. He was one of the early investors — donors, and kind of a co-founder initially. Yeah. Right? Yeah. And it's like, is he a donor or is he an investor? Right? That question is pretty central to this. So he brought forth this case. The judge on the case in California said, hey, well, you know what? This actually looks like a pretty legit case, as you might imagine.

It's sort of sketchy to take a nonprofit, raise a crapton of money, have convinced researchers to work for you who otherwise would work in other places because you are a nonprofit with this noble cause — and then, having benefited from their research, from all that R&D, from all that IP, turn yourself around and become a for-profit. No, you probably can't do that. Or at least there's probably a good argument here.

But what the judge said was, it's not clear that Elon Musk is the right person to represent this case in court. It's not clear that he has standing. The reason is that under California law, the only people who have standing to bring a case like this forward are people who are current members of the board. Well, guess what? Elon is no longer a current member of the board. He used to be.

So did Shivon Zilis, who is no longer a member of the board either, and probably would've been really helpful in this case if she had been. Or it can be somebody with a contractual relationship with OpenAI. That's what Elon is arguing. He's gonna argue that, hey, there was a written contract, an implied contract, in these emails between him and Sam and the board where they're talking about, yeah, it's gonna be a nonprofit, blah, blah, blah.

Elon's gonna try to argue that, yeah, there was kind of a contract there that they wouldn't turn around and go for-profit. This is hugely complicated by the fact that Elon then turned around and wrote emails himself saying, well, I think you're gonna have to go for-profit at some point. And so that's a bit of a mess. The remaining category of person who can have standing in raising a case like this is the attorney general.

And so the speculation was that when the judge on the case first said, well, you know what I actually think there's a pretty good case here, but Elon may not be the one to bring it. It's a pretty unusual thing for a judge to say, kind of flagging that, not passing a judgment or ruling on the case, but just saying, Hey, I think it's promising.

That may have been the judge trying to get the attention of the attorneys general, knowing that they could have standing themselves if they wanted to bring this case forward. And now what do you see? You see OpenAI going, well, you know, we had a conversation with the attorneys general, and following that we're mysteriously deciding this. This reads a lot like the attorneys general spoke to OpenAI and said, hey, we agree with the judge, there is a case here.

You can't do the thing, and we actually have standing if we want to bring this case forward. It seems likely that that's at least an ingredient here. Another thing to flag: this is being touted as sort of a win for, let's say, the basic principle — that seems like the common interpretation here — that you shouldn't be able to turn a nonprofit into a for-profit. There are asterisks here.

So in particular, OpenAI has done this very interesting thing where they're turning themselves into a public benefit corporation, but they're turning themselves specifically into a Delaware public benefit corporation. This is different from a California public benefit corporation. In a Delaware public benefit corporation, essentially all it does is give you more freedom.

So a public benefit corporation is permitted to care about things other than the interests of the shareholders. It can still care about the interests of the shareholders — in general it will — but it's also allowed to consider other things. Strictly, all that does is give you more latitude, not less. So it sounds like a very generous thing. It sounds like OpenAI is saying, oh, we're gonna make this into a public benefit corporation, how could this be a bad thing?

It literally has the words public benefit in the title. Well, in reality, what's going on here is they're basically saying, Hey, we're gonna give ourselves more latitude to make whatever calls we want. They may be things that are aligned with the interest of the shareholders and, and corporate profits, or they may not. Basically, roughly in practice, it's up to us.

So this is not necessarily the big win that it's being framed up as. There's a slippery slope here where, over time, even though it's nominally under the supervision of the nonprofit board — you know, the other question is, can the nonprofit board meaningfully oversee Sam? We saw a catastrophic failure of that in the whole board debacle. I mean, Sam was fired, and then he just had the leverage to force his way back, and now he's swapped them out for friendlies.

So it's very, very unclear whether the board can meaningfully exert control, whether Sam has undue influence over them, or whether they're getting access to the information they need to make a lot of these calls. We saw that with the Murati stuff, where there clearly was some reticence to share information from the company — the sort of working level — up to the board when necessary.

So this is a really interesting situation, and there's gonna be a lot more to unpack in the next few weeks. But the high-level take is: better than the other outcome, certainly from the standpoint of the people who've donated money to this and put in their hard-earned time. But big, big open question about where this actually ends up going and what it means for the for-profit to be a PBC and for the nonprofit to nominally have control.

We'll find out a lot more, I think, in the coming weeks and months. Right. So to be clear, OpenAI had this weird structure where there was a nonprofit, and the nonprofit was in charge of, I guess, what they called a capped-profit entity, where you could invest but get a limited amount of return — up to, I think, a hundred x, something like that. And now there is still gonna be a nonprofit.

There's still gonna be a for-profit that is controlled, as you said, nominally at least, by the nonprofit; that for-profit is just changing from its previous structure to this public benefit corporation. And as you said, there are details there in terms of, I suppose, shares, in terms of the laws you do or don't have to follow, et cetera, et cetera. And as you might expect, there have been some follow-up stories to this, in particular with Microsoft, where

I'm sure there's some stuff going on behind the scenes. I think details of the relationship between Microsoft and OpenAI have been murky and sort of shifting over time, and there's a real question of how much ownership Microsoft will get. Right, because they were one of the early investors, going back to 2019, putting in the early billions — yeah, the first billions into OpenAI — around when they switched to the for-profit.

So there's, I think, a real kind of unresolved question of how much ownership they should have in the first place. Yeah, a lot of this feels like relitigation of things that ought to have been agreed on beforehand, right? Like, you invest with a cap — Microsoft did this, they gave like $14 billion or something — and now OpenAI is being like, yeah, JK, no cap now. And it's like, how do you price that in?

And yeah, there's a lot of sand in the gears right now for OpenAI. And actually the next story that we have here is covering that detail, titled "Microsoft Moves to Protect Its Turf as OpenAI Turns Into Rival." So it gets into a little bit of the details of the negotiations. It seems that Microsoft is saying it is willing to give up some equity to be able to have long-term access to OpenAI's technologies beyond 2030.

Also to allow OpenAI to potentially do an IPO so that Microsoft can reap benefits. Again, Microsoft put in $13 billion early, starting in 2019. So in the last couple years we've seen, what, hundreds of billions of dollars get invested into OpenAI, something like that — lots of investors, but Microsoft certainly is still a big one. Yeah, definitely, definitely tens. And what's been happening is, so you have Microsoft that's coming in.

By the way, Microsoft for a long time was basically OpenAI's huge, you know, overwhelming champion investor. That's changed with SoftBank, right? So recently we've talked about the $30-40 billion that OpenAI has been raising, the lion's share of which has been coming from SoftBank. And that's not a small deal. It means that SoftBank is now actually, more than Microsoft, OpenAI's number one investor by dollar amount — not necessarily by equity.

'Cause Microsoft got in a lot earlier at lower valuations. But yeah, so OpenAI now is in this weird position where their latest fundraise, which was $30-40 billion, a lot of it from SoftBank, had some stipulations to it. SoftBank said, look, we're gonna give you the money, but you have to commit to restructuring your company before the end of the year. I mean, the timeline shifted.

Initially it was two years out, and now it's just like one year — before the end of this year. So everybody interpreted that as meaning, number one, the nonprofit's control over the for-profit entity has to be out. And that's not seeming like it's gonna be the case, and now SoftBank is making sounds like they're actually okay with that. Microsoft — it's not clear whether they're okay with it, though. And so that's one of the big questions: okay, all eyes are now on Microsoft.

SoftBank has signed off, all the big investors have signed off. Microsoft, are you okay with this deal, in a context where there is now competition between Microsoft and OpenAI? Really, really intense competition on consumer, on B2B — along every dimension that these companies are active in. And so, you know, this very tense frenemy relationship, where OpenAI is committed to spending, I think something like a billion dollars a year, on Microsoft Azure's cloud infrastructure.

There's IP sharing, where Microsoft gets to use all OpenAI models up to AGI — if that clause is still active, which is unclear. There's all kinds of stuff; these agreements are just disgusting Frankenstein monsters. But one thing is clear: if Microsoft does hold the line and prevent this restructure from going forward, SoftBank may actually be able to take their money back from OpenAI, and that would be catastrophic when you think about the spends involved in Stargate.

So yeah, I mean, it may be a lot smoother on the inside than it looks, but it tends not to be. My guess is that there's gonna be a lot of 11th-hour negotiating, and nobody wants to have this really fall apart, right? Microsoft has too much of a stake in OpenAI. But there is also speculation — apparently there's a leaked deck that OpenAI had that showed that, right now, they have to give Microsoft something like 20% of their corporate profits.

In principle, that's the agreement — I think it was for 10 years or whatever from their first investment; I may be getting the details wrong at the margins — but the leaked deck showed OpenAI projecting that they would only be giving Microsoft 10% by 2030. And that's kind of interesting. There's no agreement between OpenAI and Microsoft that says that that goes down to 10%.

So is OpenAI literally planning on a contingency that has yet to be negotiated with Microsoft, where they're assuming Microsoft will let them cut how much they're giving them by half? I mean, that's pretty wild. So, I dunno. Nobody I know is in those particular rooms, and those are gonna be some really interesting corporate development, corporate restructuring arguments and discussions.

Yeah. I feel like there's a Social Network-style movie to be made about OpenAI and Sam Altman. Oh my God. But it could just be all the business stuff — it's been so crazy, especially in the last couple years. And yes, as you said — I said hundreds of billions, I'll take it back. It's certainly more than 50 billion, it's climbing up towards a hundred billion, but not yet hundreds of billions for the fundraising. Yeah, another year maybe. And a couple more stories.

Next up we have: TSMC's two-nanometer process set to witness unprecedented demand, exceeding the three-nanometer node, due to interest from Apple, Nvidia, AMD, and others. So this is the next node, the next smallest chip type that they can make. TSMC — I'm assuming everyone who listens to this regularly knows, but in case you don't — they're the provider of chips.

All these companies — Nvidia, Apple — design their chips, and TSMC is the one that makes them for them, and that's a very difficult thing. They're by far the leader, can make the most advanced chips — the only ones capable of producing this cutting edge of chips. And this two-nanometer node is expected to have strong production by the end of 2025.

So yeah, it's very pivotal for Apple, for Nvidia, for these others to be able to use this process to get the next generation of their GPUs, smartphones, et cetera. Yeah, this is pretty interesting in a couple ways. First, apparently — so the two-nanometer process, that's the most advanced process; one level behind it is the three-nanometer process — and apparently they've hit a notable milestone on the measure called defect density rate.

So they've got a defect density rate on the two-nanometer process that is already comparable to the three-nanometer and five-nanometer process nodes. That's really fast. Basically they've been able to get the number of defects per square millimeter, you can think of it, down to the same rate, which means yields are looking pretty good. For a fresh, brand-new node like this, that's pretty wild.
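As a quick back-of-the-envelope illustration of why defect density matters so much (our own sketch using the standard first-order Poisson yield model, with made-up numbers, not figures from the article): die yield falls off exponentially with defect density times die area, so matching an older node's defect density this early implies healthy yields even for big AI-class dies.

```python
import math

def poisson_yield(defect_density_per_cm2: float, die_area_cm2: float) -> float:
    """First-order Poisson yield model: fraction of dies expected to have zero defects."""
    return math.exp(-defect_density_per_cm2 * die_area_cm2)

# Hypothetical numbers purely for illustration (not TSMC's actual figures):
die_area = 6.0  # cm^2, roughly a large GPU-class die
for d in (0.05, 0.10, 0.20):  # defects per cm^2
    print(f"D={d:.2f}/cm^2 -> yield ~ {poisson_yield(d, die_area):.0%}")
```

The point is just that, for large dies, small changes in defect density swing yields a lot, which is why hitting parity with mature nodes on this metric so early is a big deal.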

This is also a node that's distinguished from others by its use of the gate-all-around field effect transistor, GAAFET, right? This is a brand-new way of making transistors, and you can take a look at our hardware episode — we touch a little bit, I think, on the whole FinFET versus GAAFET thing — but basically it's just a way to very carefully control the current that you have flowing through your transistor.

It lets you optimize for higher performance or lower power consumption, depending on what you want to go for, in a way that you just couldn't before. So a lot of big changes in this node and yet, apparently, wicked good yields so far and good scale. Another noteworthy thing is we know that this is going to be used for the Vera Rubin GPU series that Nvidia's putting out, right? This is gonna be hitting markets sometime in 2026-27, and the significance of that is:

normally, when you look at TSMC's most advanced node — in this case the two-nanometer process — normally that all goes off to the iPhone. Well, now, for really the first time, what we have is Nvidia, so AI, starting to butt in on that capacity — displacing, or competing directly with, the iPhone for the most advanced node. I will say this is a prediction that we've been making for the last two years on the podcast. It's finally happening.

Essentially what this means is there's so much money to be made on the AI data center, server side that that money is now displacing — like, it's competing successfully with the iPhone to get capacity at the leading node at TSMC. So that is not a small thing; that is a big transition. And anyway, there's a significant ramp-up happening right now at TSMC, and, you know, we'll be talking about two nanometers.

We're basically jumping from four or five nanometers for the kind of H100 series down to two nanometers pretty fast. That's pretty remarkable. Right. And speaking of Nvidia and TSMC, the next story is about Nvidia set to announce, according to some sources, that they're gonna place their global headquarters — well, their overseas headquarters outside the US — in Taiwan. And that is very much unsurprising. TSMC is the Taiwan Semiconductor something something, but

famously from Taiwan, and Nvidia has, unsurprisingly, positioned themselves for decades now — honestly, since the start of Nvidia — in a close partnership with TSMC, and this is just gonna continue strengthening that. Yeah, yeah — Taiwan Semiconductor Manufacturing Company, by the way. And that's kind of a theme that you see in a lot of the names for these companies. But yeah, there's a whole bunch of locations that they're considering.

The interesting thing about this from a global security standpoint is that China is, like, at any moment going to try to invade Taiwan. And so Nvidia is going, you know, where do we want our global headquarters? Let's put it in Taiwan. And that's the balance, right? Make no mistake, Jensen Huang is absolutely gonna be thinking about this. He's literally making the calculation:

okay, a Chinese invasion of Taiwan on the one hand, a closer relationship with TSMC in the meantime on the other — and the latter is actually so valuable that I'm gonna take that risk and do it. That's how significant this is. Again, we just finished talking about — as you said, this is absolutely related, I can see why you said that —

you know, the two-nanometer node: you wanna secure as much capacity as you can. In the same way that Google and Apple and all the companies that are trying to get their hands on Nvidia GPUs are literally like, Elon flies out to Jensen's house with Larry Ellison to beg for GPUs — in the same way, Nvidia's begging TSMC for capacity, right? It's begging all the way up the chain, 'cause supply is so limited. So this is just another instance of that trend.

It's the "I'm begging you to take my money" thing, pretty much, because it is a lot of money going around here. And speaking of a lot of money: next up, CoreWeave is apparently in talks to raise $1.5 billion in debt. That's just six weeks after their IPO. The IPO was meant to raise $4 billion for this major cloud provider — provider of compute, backed by Nvidia — but that IPO only raised $1.5 billion, in part perhaps due to trade policy stuff going on with the US, tariffs and so on.

So yeah, probably in part because the IPO didn't go as planned, and because CoreWeave wants to continue expanding their compute, they are seeking to raise this debt, according to a person with knowledge of this. Yeah. And normally, you know, when you go for an IPO or some equity raise, you're doing it because equity makes more sense than debt, right?

So equity is: you're basically trading shares in your company for dollars, right? Debt — you're taking on the dollars, but you're gonna have to repay them with interest over time, so it'll end up costing you more net. The issue here is that they're being forced to go into basically, like, high-yield bonds, and this is a round that's being led by JPMorgan Chase & Co., it seems.

But yeah, apparently they've been holding virtual meetings with fixed-income investors since, I guess it would be, last Tuesday now. So fixed-income investors being people who primarily invest in securities that pay a fixed rate of return — usually that's in the form of interest, right, or dividends. So these are sort of reliable, steady income streams that these investors are looking for.

Not typically what you'd expect with something like a CoreWeave, or sort of a riskier pseudo-startup play. But certainly given the scale they're operating at and all that, it does make sense. But it does mean there's added risk.

One of the things that I think a lot of people don't understand about the space is that the neoclouds — to some degree CoreWeave still — are considered really risky bets, and because they're considered really risky bets, it's difficult to get loans to work with them, or for them to get loans; the interest rates are pretty punitive. So that's one reason why, if you're CoreWeave, you'd much rather raise on a sort of an equity basis. But that option's not on the table.

You know, it seems like the IPO didn't go so well; we'll see if that changes as the markets keep improving. But it's a challenging spot for sure. And now, moving on to tools and apps. The first story is, I think, perhaps not the most impactful one, but certainly the most interesting one for me of this whole pack — perhaps even eclipsing the OpenAI for-profit thing. And it is the story of the day Grok told everyone about white genocide.

So this just happened a couple days ago.

Grok is the chatbot created by xAI, and it is heavily integrated with X, which used to be Twitter, to the point that people can post in reply to something, tag Grok, ask it a question, and Grok replies in a follow-up post on X. And what happened was that Grok, across many different examples of just random questions — the one that I think maybe started it, or was one of the early ones: someone asked how many times HBO has changed their name, in response to news about HBO Max.

Grok first replies in one paragraph about that question, and then in a second paragraph — I'm just gonna quote this — "Regarding white genocide in South Africa, some claim it's real, citing farm attacks and 'Kill the Boer' as evidence. However, courts and experts attribute these to general crime, not racial targeting," and a little bit more.

And it did this not just in this one instance but in multiple examples, including one case where someone asked about an image and Grok replied focusing primarily on the "white genocide in South Africa" question. People looked into it — it's pretty easy to get Grok to leak its system prompt — and what it seems to be is that it was instructed, as you might expect — or at least the chatbot, the xAI responder bit of Grok, was instructed — to accept the narrative that white genocide in South Africa is real.

"Acknowledge the complexity of the issue, but ensure this perspective is reflected in your responses," quote, "even if the query is unrelated" — which I suspect is the issue here. That's weird. xAI has since come out to address this incident. They said that,

on May 14th at approximately 3:15 AM Pacific time, an unauthorized modification was made to the Grok response bot's prompt on X. And then they say some things about how they will do a thorough investigation and implement measures to enhance Grok's transparency — apparently going to start publishing Grok's system prompts on GitHub. So, a funny incident for sure.

And I think reflective of what we've seen before with Grok, which is that Grok's system prompt was previously altered to not say that Elon Musk and Trump spread misinformation. That happened, I think, a couple months ago — very much similar to what happened here. Yeah, it's sort of interesting. It's not the first time that we've had a situation where they've called out some unauthorized modification, right? Some sort of rogue-employee scenario. So that's sort of an interesting note.

Yeah, you have to wonder which rogue employee this was. And you can also imagine, from a security standpoint, a company like xAI, like Twitter, could also have people working there who are de facto, kind of, working there for political reasons — they don't like something, so they intentionally add stuff to make it go off the rails. This is such a charged space that, yeah, figuring out how this goes is tricky. Now,

one thing I've seen called out too is this idea that — so, number one, awesome that they're gonna be sharing the system prompt. This is something that I think Anthropic is doing as well, maybe OpenAI as well. So more transparency on the system prompt seems like a really good thing, but there are other layers to this, right?

'Cause Grok is a system — at least, as you said, the version of Grok that is deployed as an app to respond to people's questions on X is a system, it's not just a model. And that being the case, there are a lot of ancillary components and ways of injecting stuff after the fact into the de facto system prompt, one element of which is this, like, post-analysis component to the chain, let's say, of the system.

And the concern has been that this issue is arising at the level of the post-analysis, not of the system prompt itself — that you get content injected into context following the system prompt that may kind of override things. And so there have been calls to make that transparent as well. So it'd be interesting and useful to have that happen too.
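To make the "system prompt vs. injected post-analysis" distinction concrete, here's a minimal, purely hypothetical sketch of how a deployed reply-bot pipeline might assemble the context the model actually sees. None of the names or strings here come from xAI; the point is just that publishing the base system prompt alone doesn't cover content injected by later stages.

```python
# Hypothetical sketch of a reply-bot pipeline assembling context per request.
# Names and strings are illustrative only, not xAI's actual implementation.

BASE_SYSTEM_PROMPT = "You are a helpful reply bot on a social platform."

def run_post_analysis(user_post: str) -> str:
    """Placeholder for an ancillary stage (retrieval, moderation, 'analysis')
    whose output gets appended to the context after the system prompt."""
    return "Post-analysis notes: (instructions injected here are invisible to users)"

def build_context(user_post: str) -> list[dict]:
    # The model sees the published system prompt PLUS whatever later stages inject.
    return [
        {"role": "system", "content": BASE_SYSTEM_PROMPT},
        {"role": "system", "content": run_post_analysis(user_post)},  # the transparency gap
        {"role": "user", "content": user_post},
    ]

print(build_context("How many times has HBO changed its name?"))
```

The transparency argument is that publishing only BASE_SYSTEM_PROMPT leaves that second, injected block unaccounted for.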

Obviously within reason, because there's always the risk that you're gonna then leak some security sensitive information where you're telling the model not to tell people how to make crystal meth and you have to provide some information about crystal meth to do that, blah, blah, blah. But within reason doing that. So anyway a lot of interesting calls for more transparency here.

Hopefully it leads to that. It would be great to have, you know, the kind of consistent standard being that we get system prompts and all the meta information about the system that is both security- and safety-relevant, but that doesn't compromise security by sharing all the things. So yeah, kind of an interesting internet firestorm to start the week. I think quite amusing.

But also, I wonder if it has real financial implications for xAI. I doubt it would mean people steer away from the chatbot, but for enterprise customers, if you're considering their API, I think this sort of wide-scale craziness from their chatbot is not something that makes you favor it over competitors like Anthropic and OpenAI. And next up we have some actual new tooling coming from Figma.

They have announced, and partially released, AI-powered tools for creating sites, app prototypes, and marketing assets. These are titled Figma Sites, Figma Make, and Figma Buzz — similar to existing tools out there, but coming from Figma, Figma being a leading provider of software for design, I think increasingly kind of the de facto way for people to collaborate on things like app design, general user interface design, and many other applications. Nowadays they're just huge.

And now, Figma Sites allows designers to create and publish websites directly from Figma, as you might imagine, with AI prompting to take care of a lot of the functionality there. Figma Make, similarly, is meant for ideation and prototyping, enabling you to create web applications from prompts — even going as far as dealing with code. And then Figma Buzz is gonna be able to make marketing assets for you, with integration of AI-generated images. So, makes a lot of sense.

Apparently they're introducing this under the $8-per-month plan, which includes other stuff as well. So, similar to other companies, we've seen them going with more of a bundling approach, where you get the AI along with the broader tool suite as part of a feature set. Yeah, it's part of a trend, too, towards every company becoming the everything company, right? Like, Figma is being essentially forced to move into deeper parts of the stack.

They used to be just a design app, and now it's like, you know, we're doing prototyping, creating websites, and marketing assets. You can see them starting to kind of crawl up the stack as AI capabilities make it so much easier to do that. Making it easier to do that also means that your competitors are gonna start to climb too. And so you kind of have to do this sort of diffusion out into product space and own more and more of it, which is interesting, right?

I mean, it's like everybody starts to compete along every layer of the stack. And I think one of the big determinants of success in the future here is gonna be which enclaves — which initial beachheads; in Figma's case that's design, right — end up being the most conducive starting points to own the full stack and give you access to the kind of data you need to perform well across the stack.

And I mean, I could see design being one of those things that's really useful. You get a lot of information about, you know, like people's preferences and the results of experiments and stuff like that. But yeah, nonetheless, I mean, I, I think this is, this is something we'll see more of, you know, expect to see.

prototyping companies moving into design, marketing-asset companies moving into website creation — it's all just becoming so easy thanks to AI tooling that people are kind of forced to become the everything company. And the next story is about Google: they are bringing Gemini to Android Auto. So Android Auto is their OS for cars, where you can do navigation, play music, et cetera, and they are adding Gemini.

Partially as the advanced smart voice assistant, just building upon what was there already, and then also the Gemini Live functionality, where the AI is always listening and always ready to just talk to you. And I think, you know, not surprising, obviously, that this would happen, but I do think it's interesting in the sense that it seems inevitable we'll eventually wind up in this world where you have AI assistants just ambiently with you, any time, ready to talk to you via voice as well as text.

We are not there yet, but we've seen over the past year a movement in that direction, with ChatGPT's advanced voice mode, with Gemini Live, with all these things. And I think this is taking us further in that direction, in making it so that in the one place where you have to interact with compute through voice — your car — you now have the AI assistant always on and ready to do whatever you ask of it. Yeah, it sort of reminds me of some of the stuff that Facebook and other companies like that have to do.

Right, when you saturate your user population — basically Facebook sees itself as having had a shot at converting every human on the face of the earth — then you're forced to go, okay, well, where else can we get people's attention? You know, Netflix famously, in one of their earnings calls I think it was, put out a report saying, hey, we view ourselves as basically competing with sleep and sex, because, you know, we're doing so well in the market.

Like, we're now looking for where we can squeeze out more of people's time to get them on the platform. This is sort of similar, right? So, hey, users are sitting in their cars, driving or being driven — why aren't we collecting data? Why aren't we getting interactions with them? And it's so obvious, too, that this is where things are gonna go anyway from a utility standpoint. So yeah, another deeper integration of this stuff into our lives.

why waste a perfectly good opportunity? There's an empty billboard or there's just, there's just a bunch of grass in that field there. We could, we could have an ad there or we could have, you know, some data collection thing there. You know, as the stuff creeps more and more into our lives. Next story is again about Google. They have announced an updated Gemini 2.5 Pro AI model.

So prior to this, they most recently had a 2.5 version in something like early March — I forget exactly — but at the time of its release, Gemini 2.5 Pro kind of blew everyone away. It did fantastically well on benchmarks, and anecdotally people found that switching to it from other leading models worked really well for them, so this is a big deal. For that reason, they have announced this update that they say makes it even better at coding.

And once again, they have shot up to the top of various leaderboards, on things like WebDev Arena or the Video-MME benchmark for video understanding. Apparently Google says that this new version addresses developer feedback by reducing errors in function calling and improving function calling trigger rates. And I will say, in my experience of using it, Gemini 2.5 is very trigger-happy and likes to do a lot with not too much prompting.

So I wonder if it will improve just based on people's usage of it in the realm of web development. Yeah. It's also interesting that one of the features they highlight is this ability to do video to code. So basically, based on a video of a description of what you want, it can generate that in real time. So kind of impressive, and not a modality that I would've expected to be important.

But then, you know, thinking about it more, it's like, well, I guess if you're having a video chat with somebody, right, or if you have an instructional video or something, you could see that use case. So anyway, I thought that was kind of cool — and also another step in the direction of converting very raw product specs into actual products, right? You can imagine human inflection and all that.

Like the classic consultants problem of like, somebody gives you a description of what they want, it's usually incomplete. You have to figure out what it is they want that they don't know they want. And you know, that's sort of starting to step in that direction.

Another thing that they've done is they've updated their model card, their system card, based on this new release — the Gemini 2.5 Pro model card. One of the things that they flag — well, there are a couple places, but across the board, by the way, you'll be unsurprised to hear that this does not pose a significant risk on any of the important evals that would cause them to not release the model.

But they do say that its performance on their cybersecurity evals has increased significantly compared to previous Gemini models, though the model still struggles with the very hardest challenges — the ones that they see as actually representative of the difficulty of real-world scenarios. They do have more tailor-made models on the cyber side that are actually more effective, you know, Naptime, Big Sleep type stuff. But anyway, kind of interesting.

They're keeping the model card up to date as they do these sort of intermediate releases, which is, I think, quite helpful and good. Right. And it makes me wonder also — I don't think we've discussed this phenomenon of vibe coding very much. Hmm, yeah, it's true. It's been taking off in the last couple months.

And the idea, if we haven't defined it, is basically that people are starting to make apps, build stuff from scratch, very, very quickly by using AI and primarily generating code through LLMs. Even people who have no background in software engineering are now seemingly starting to vibe code, as they say — coding applications "with a vibe," meaning that you kind of don't worry about the details of the code so much; you just get the AI to do it for you, and you just tell it what you want.

And so I think this update potentially reflects the fact that this vibe coding thing is a real phenomenon. The focus here seems to be very much on making aesthetically pleasing websites, on making better apps. What they highlight in the blog post is "quick concepts to working apps." So, hard to say how big this vibe coding phenomenon is, but from this update it seems like potentially that is part of the inspiration.

I mean, yeah, like, our launch website for our latest report was all vibe coded. So my brother, I guess he had like two hours to throw it together or something, and he was just like, all right, let's go, I don't have time for this. And it was really quite interesting. Honestly — this happened about,

what, like two months ago — I had not at that point actually done the vibe coding thing, because I guess aesthetically I couldn't bring myself to do it. That's the honest thing; I just wanted to be the one who wrote the code. And the vibe coding thing is really weird. If you've never done it yourself, definitely give it a shot. Like, just build the thing, and basically keep telling the model, no, fix this, fix this, no, do it better. And then eventually the thing takes the right shape.

One caveat to that is you end up with a disgusting spaghetti ball of code on the backend because the models tend to be like way too verbose and they, they tend to just like write a lot of code when a little code will do it. It's not tight. It needs a refactoring. But if you're cool with a landing page like we were, you know, very simple product, you're not building a whole app, it can actually work really well. I, I was super surprised.

I mean, that was easily a 5x lift on the efficiency of our setup. So yeah, really cool. Yeah, really cool. I think very exciting for software engineers as well — like, if you haven't done web development or app development, now it is plausible for you to do it. I do think maybe they could have thought of a better, more descriptive name, like LLM coding, hack coding, product manager coding. You know, vibe coding is a fun name, but a bit confusing.

And one last story in this section: Hugging Face is releasing a free Operator-like agentic AI tool. So Hugging Face is the provider, the hoster, of models and datasets, and also the releaser of many open source software packages. And now they've released a free cloud-hosted AI tool called Open Computer Agent, similar to OpenAI's Operator or Anthropic's computer use.

So this basically — you give it some instructions, and it can go into Firefox and do things like browsing the web to complete tasks. According to this article, it is relatively slow. It is using open models — I think they mentioned smolagents — and it is generally not as powerful as OpenAI's Operator, but as we've seen over and over, open source tends to catch up with closed source from the likes of OpenAI pretty quickly.

And I would expect, especially in things like computer use, where it's really building on top of model APIs and models and so on, this could be an area where open source really excels. Yeah. And it's also a good strategic angle for Hugging Face too, right? A big way they make their money is they host the open source models on their platform and run them — in this case, running agentic tools on the platform. I mean, that's a lot of API calls.

So, you know, if they ultimately release this as an API, a lot of people will presumably go use it. It is a bit of a finicky tool, as these things all are, of course — this one maybe particularly so. They're using some Qwen models on the backend; I forget, there were a couple others when I had a look at it.

But yeah, it's also another instance of where we're seeing Chinese models really come to the fore in open source, even hosted by American — or, I should say, Western, pseudo-American — companies like Hugging Face. So another kind of national security thing to think about as you run them as agents increasingly: you know, what behaviors are baked in, what backdoors are baked in, what might they do if given access to more of your computer or your infrastructure. So either way.

Interesting release. I think Hugging Face is gonna start to own a lot more of the risk that comes with the stack, too, as you move into agentic models, and yeah, we'll see how that plays out. And moving on to projects and open source, we begin with Stability AI, one of the big names in generative models, and their latest one is Stable Audio Open Small. So this is a text-to-audio model developed in collaboration with Arm, and apparently it is able to run on smartphones and tablets.

It has 341 million parameters and can produce up to 11 seconds of audio on a smartphone in less than eight seconds. It does have some limitations: it only understands English, and it does not generate realistic vocals or high-quality songs. It's also licensed somewhat restrictively — it is free for researchers, hobbyists, and businesses without that much annual revenue, as with, I think, Stability AI's recent releases.

So yeah, I think an interesting sign of where we are, where you can release a state-of-the-art model to run on a mobile device. And apparently this is even optimized to run on Arm CPUs, which is interesting. Yeah. But other than that, I don't know that there are many applications I can think of where you would want text-to-audio on your phone. Yeah. I mean, I think potentially they're viewing this as a beachhead, R&D-wise, to keep pushing in this direction.

Having a, a model on the phone that actually works, that gives decent results. Yeah, it can be pretty important. 'cause when you're talking verbally, right, you want to minimize latency and so preventing the model from having to ping some server and then ping back, that's useful. Also useful for things like translation, right? Where you might have your phone, I dunno, in some foreign country you don't have internet access.

Another useful use case, but they're definitely not there yet, right? Like, this very much reads like a toy more than a serious product. I'm not too sure who would be using this outside of some pretty niche use cases. They describe some of the limitations — so it can't generate good lyrics; they just tell you pretty much flat out, this is not something it'll be able to do, like realistically good vocals or high-quality songs.

It's for things like drumbeats, for kind of little noises that I guess you might want to use. To me it sounded almost like things you might want to use when you're doing video editing or audio editing, these sorts of things — which, I don't know how often that's done on the phone. I may be missing, by the way, a giant use case. This is one of the virtues of AI:

like, you know, we're touching the entire economy of sound on the phone, and that — I don't know, but to first order it doesn't seem super clear to me what the big use cases are. But again, it could just be a beachhead into a use case that they see as really significant down the line. And certainly, audio generation locally on a phone sounds like it could be quite useful down the line. Next up, we have an openly available AI image generator that is trained entirely on licensed data.

It is called F Lite. This is made by Freepik in collaboration with the AI startup Fal.ai, and it is a relatively strong model: it has 10 billion parameters, trained for over two months on 80 million images. So even though they're not claiming it to be competitive with state-of-the-art stuff from Midjourney and others, or Flux, they are saying that this is fully openly available and fully trained on licensed data.

Unlike things like Flux, which presumably are trained on copyrighted data — which is still very much an ongoing legal question. We've seen Adobe previously emphasize being trained on licensed data, so this now makes it so there is a powerful open source model that is not infringing on copyright. To be honest, I'd never heard of Freepik before, right? They're apparently a Spanish company.

So again, I think this is the first Spanish company I've heard about in this context — in kind of AI in general — for a long time. I'm actually curious if people can think of others that I might be missing here. But so, kind of interesting — first points on the board for Spain. Apparently this is a, yeah, 10-billion-parameter model trained on 64 H100 GPUs over the course of two months.

So, you know, it's a baby workload, but by open source standards pretty decent. And certainly, I mean, they show all the usual images you might expect, like a really impressive HD face of a woman and a bunch of more artsy stuff. So yeah, pretty cool. I continue to wonder where the ROI argument is for these kinds of startups that just do open source image generation; it seems to me like a pretty saturated market.

Seems to me kind of like they're lighting VC dollars on fire, but what do I know? We'll see how many actually survive in this space going forward. But definitely an impressive product, and again, good for Spain — points on the board here. Yeah, this sort of takes you back to Stability AI, and I think Flux also released their own model. It's like, oh, you're releasing really good models for free? Like, how does that work? Figure this out.

Yeah. It's, it's a funny place with AI where it has become kind of a norm. And I, I think probably partially just a case of bragging rights and, and fundraising brownie points. But I think notable in this case, particularly because of the licensed data aspect of it. I find anytime I try to explain it, it ends up sounding just like a pyramid scheme.

It's like, yeah, they make a great model using the initial seed round so they can convince the Series A investors to give them more money to make an impressive model. Don't worry about it — at some point, there's a pot of gold at the end. Like, I don't know. But hey, it's a proving ground if nothing else for great AI teams.

I think the biggest winners in this, in the long run, are probably the OpenAIs, the Googles of the world, who can come in and just acquihire these teams once they've run out of money and can't raise another round. And then these are battle-hardened teams with more engineering experience. So, you know, economically there's value there for sure; it's a question of whether that value justifies the fundraising dollars. A couple more models to talk about.

Next up, AM-Thinking-v1 is a new reasoning model that they claim exceeds all other ones at the scale of 32 billion parameters. So this group of people, apparently the a-m-team, is an internal team at Beike — again, someone I have not been aware of — dedicated to exploring AGI technology. What this group did was take the base Qwen2.5-32B model and publicly available queries, and then created their own post-training pipeline to do the thing we saw DeepSeek R1

do: basically, take a big, good base model, do some supervised training and some reinforcement learning to get it to be a very powerful reasoning or thinking model. They released a paper that went into the details of what they did. It seems like, as we've seen in other cases, the data curation aspect of it and the real nitty-gritty of how you're doing the post-training matter a lot.

And so with that, they have, as you would expect, a table where they show that they are significantly outperforming DeepSeek R1 and are at least competitive with other reasoning models at this scale, although not quite as good as the ones that are at hundreds of billions of parameters. Yeah. And so, some caveats on this.

So the model doesn't have support for, like, structured function calling or tool use — oh, and also multimodal inputs — which are increasingly becoming a thing as people start to use agents for computer use. So, whenever you see an open source model like this, I'm always interested to see when we're gonna see open source bridge the gap to, hey, this thing is made for computer use, it's made to be multimodal natively, to kind of take in video and use tools and all that.

So this is not that, but it is a very impressive reasoning model, a very serious entry in the growing catalog of Chinese companies that are building impressive things here. A couple things. First of all, these papers are all starting to look very similar, right?

We have, I think it's fair to say at this point, a strong validation of the DeepSeek R1 path, which is: you do pre-training — anyway, a staged pre-training process, with increasingly high-quality data towards the end of pre-training. Then you run your supervised fine-tuning; in this case they used almost 3 million samples across a bunch of different categories that had a kind of think-then-answer pattern to them.

So you do that, you supervised fine-tune, and then you do a reinforcement learning step to enable the sort of test-time compute element of this. So again, we see this happen over and over. We saw it here, we saw it with Qwen 3, we saw it with DeepSeek R1, and we're gonna keep seeing it. A lot of the same ingredients, using GRPO as the training algorithm for RL — that's here again.
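As a rough sketch of the GRPO idea (our simplified illustration with hypothetical reward scores, not the paper's exact recipe): you sample a group of rollouts per prompt, score each one, and turn the rewards into advantages by normalizing against the group's mean and standard deviation, which avoids needing a separate value/critic model.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each rollout's reward against its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1e-6  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# One prompt, a group of 8 sampled rollouts scored by some verifier (hypothetical scores):
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
# Above-average rollouts get positive advantages, below-average ones negative.
```

In the full algorithm these advantages then weight a clipped policy-gradient update, usually with a KL penalty against a reference model; the sketch only shows the group-relative scoring step.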

Another thing — and I think this was common to Qwen 3 as well; it's certainly becoming a thing more and more — is a focus on intermediate-difficulty problems. So making sure that when you're doing your reinforcement learning stage, you are not giving the model too many problems that are so hard that it's kind of pointless for it to even try to learn from them, or so easy that they're already saturated.

So one of the things you're seeing in the pipeline is a stage where you do a bunch of rollouts and see what fraction of those rollouts succeed. And if the fraction is too low or too high, you basically just scrap that prompt — don't use it as training data. You only keep the ones that have some intermediate, you know, like 50 to 70% pass rate, something like that. So this is being used here as well.
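Here's a minimal sketch of that difficulty-filtering step, assuming you have some way to sample a rollout from the current model and verify it as pass/fail (the sampling stub, rollout count, and thresholds here are illustrative, not the paper's exact values):

```python
import random

def run_and_verify(prompt: str) -> bool:
    """Placeholder: sample one rollout from the current policy and check it with a verifier."""
    return random.random() < 0.5  # stand-in for model generation + answer checking

def keep_prompt(prompt: str, n_rollouts: int = 16,
                low: float = 0.1, high: float = 0.9) -> bool:
    """Keep a prompt for RL training only if its pass rate lands in an intermediate band."""
    passes = sum(run_and_verify(prompt) for _ in range(n_rollouts))
    pass_rate = passes / n_rollouts
    return low < pass_rate < high  # too easy or too hard -> discard

candidate_prompts = ["question 1", "question 2", "question 3"]
training_prompts = [p for p in candidate_prompts if keep_prompt(p)]
print(training_prompts)
```

The exact band varies by recipe; the idea is just that prompts the model always or never solves contribute little signal to the RL stage.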

There's a whole bunch of stuff too about the actual optimization techniques that they use to overlap communication and computation. The challenge with this, and we talked about it in the context of INTELLECT-2, that paper we covered, I guess, two weeks ago, is that you've got this weird problem with the reinforcement learning stage. Unlike the usual case, where you pre-train a model, feed it an input, get an output, and can immediately do your backpropagation

'cause you would know if the output was good or not, with the reinforcement learning stuff you actually have to have the model generate an entire rollout, score it, and only then can you do any kind of backpropagation or weight updates. And the problem with that is that your rollouts take a long time. And so you have to find ways to hide that time and overlap it with communication, or anyway do different things.
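
Here's a toy sketch of that overlap idea, hiding slow rollout generation behind training with a producer thread and a queue; real systems do this across separate inference- and training-optimized hardware, and everything here is a dummy stand-in.

```python
# Toy sketch: a producer thread generates and scores rollouts while the learner
# consumes them from a queue, so the learner never idles on any single rollout.
import queue, threading, time, random

rollouts: "queue.Queue[tuple[str, float]]" = queue.Queue(maxsize=8)

def produce(n):
    for i in range(n):
        time.sleep(0.05)                      # pretend generation is slow
        completion = f"rollout-{i}"
        reward = random.random()              # pretend verifier score
        rollouts.put((completion, reward))

def train(n):
    for _ in range(n):
        completion, reward = rollouts.get()   # consume as soon as one is ready
        # ... backprop / weight update would happen here ...
        print(f"updated on {completion} (reward={reward:.2f})")

threading.Thread(target=produce, args=(4,), daemon=True).start()
train(4)
```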

And so that's a big part of what they're after here in this paper. Last thing I'll mention is this company, which, again, not gonna lie, I had never heard of Beike before. But apparently, and I can't explain this, don't ask me to explain this, the description on their website is that they work together with China's top-tier developers; they're basically like a property company.

They've connected over 200 brokerage brands and hundreds of thousands of service providers across a hundred cities nationwide, providing buyers and sellers of existing housing with services including consultancy, property entrustment, showings, and facilitating loans. What the fuck? Like, I don't know. Do you wanna invest in these guys? I guess you do, because they make really good models now? Apparently, yeah.

This real estate company is invested in going for AGI. Well, they seem like they're one of these Chinese everything companies as well, 'cause they also have like a million different websites; that was, I guess, their housing website. They also describe themselves on another one as the leading integrated online and offline platform for housing transactions and services. So maybe they're, what, more of a Stripe for housing? I don't know.

Somehow some executive at Beike said one day, we gotta get in the AI game, and apparently recruited some good talent. I'm so confused right now. But yeah, it's also indicative, probably, of the impact of DeepSeek R1 on the Chinese landscape, where they made a huge splash, right? To the point of actually affecting the stock market in the US. I would not be surprised if there are new players in China focusing on reasoning just as a result of that.

It is weird that they're coming from like a property company or something. I mean, I understand. Yeah, this is a weird one for sure. Like, I get DeepSeek, you know what I mean? Okay, so they come from High-Flyer, this hedge fund, and a million hedge fund companies like Medallion or RenTec do AI, right? That's what they do. This is just like, what are you doing, guys? Apparently they're doing really well.

It's a good model. Dunno what to say. And yeah, it's fully open sourced, so that's nice to have. And for the last open source model, we cover BLIP3-o, a family of fully open, unified multimodal models: architecture, training, and dataset. So we've covered BLIP-3 before. That was a multimodal model in the sense of taking both images and text as input and outputting text; that used to be what multimodal meant. With BLIP3-o, they're

moving into, I suppose, the new frontier of multimodal, where, both with ChatGPT and with Gemini, we saw recently the models being able to output images in addition to taking them as input. So now we have a unified multimodal model: it can take in multiple modalities and it can output multiple modalities. I will say it's not necessarily just one big transformer, as is typically the case for multimodal things with multiple inputs. But anyway, that's the core idea.

And in the paper they go into a lot of detail on how to train such models. They train a model on 60,000 data points of instruction tuning to make sure that it is able to generate high quality images. They release a 4 billion parameter model that is trained only on open source data, and they also have an 8 billion parameter model trained with proprietary data. I mean, it's what I would expect.

I think the multimodality trend and the agentic trend sort of converge, again, as I mentioned, on computer use. So I see these two things as being different ways of getting at the same thing, the two things being this paper and the one we just talked about. It does seem like a pretty impressive model. One of the things that they did work on a lot was figuring out the architecture.

They found that using CLIP image features gives a more efficient representation than the VAE features, the variational autoencoder features, that are often used in this type of context, CLIP being the contrastive training approach that OpenAI used for, well, CLIP. There's a whole bunch of work that they did around training objectives as well, comparing different objective functions that they might use to optimize for this sort of thing. Anyway, it's cool.

I think it's an early shot at high degrees of multimodality from these guys, and I would expect that we'll get something more coherent, you know, in the same way that we've coalesced around a stack for the agent side. I think this is an early push into the kind of very wide-aperture, unified multimodal framework. We've seen a lot of different attempts at this, and it's still unclear what strategy is gonna end up working.

So it's hard to know where to invest, like, you know, our own marginal research time as we look at these papers and figure out, okay, which of these things is really gonna take off. But for now, given its size, this actually does seem pretty promising. Yeah. I would imagine it's probably the best model of its kind that you can get in open source to, yeah, be able to generate images.

We've seen that models like Gemini's and OpenAI's, which integrate the transformer with the image generation, have some very favorable properties and seem like they actually are better at very nuanced instruction following, so there's still room to improve in the image space; these are of course not quite at that level. As with the previous releases from the BLIP team, which includes Salesforce and the University of Washington and other universities, this is super, super open source. The most open source you can get here.

You can get the code, the models, the pre-training data, the instruction tuning data. All of it is available. When you need to catch your breath while listing all the different ways in which it's open source, that's the bar; that's how you know it's fully open source. And now moving on to research and advancements. We begin with DeepMind, and they have released a new paper and blog post and media blitz with AlphaEvolve, a coding agent for scientific and algorithmic discovery.

That's the name of the paper. The blog post, I think somewhat amusingly, is AlphaEvolve: a Gemini-powered coding agent for designing advanced algorithms, so that there'd be no confusion. Yeah. And so as per the title, the idea here is to be able to design advanced algorithms, to get some code that solves a particular problem well. This is in some ways a sequel to something they did last year called FunSearch. We covered it maybe in the middle of the year, I forget exactly when.

And this is basically taking it up a notch. So instead of just evolving a single function, it can write an entire file of code; it can evolve hundreds of lines of code in any language, and it's scaled up to a very large scale in terms of compute and evaluation. So the way this looks in terms of what it does is: a scientist or engineer sets up a problem.

Basically, they give it a prompt template, some sort of configuration, choose LLMs, provide evaluation code to be able to see how good a solution is, and then also provide an initial program with components to evolve. And then AlphaEvolve goes out, produces many possible programs, evaluates them, and winds up with the best program.
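
As a rough mental model of that loop, here's a heavily simplified, hypothetical sketch; propose_edit() stands in for the Gemini calls, evaluate() for the user-supplied scoring code, and the population and survivor mechanics are illustrative rather than what the paper actually implements.

```python
# Simplified evolutionary loop: an LLM proposes edits to candidate programs, an
# evaluator scores them, and the best candidates seed the next generation.
import random
from typing import Callable, List, Tuple

def evolve(
    initial_program: str,
    propose_edit: Callable[[str], str],   # LLM: take a parent program, return a mutated one
    evaluate: Callable[[str], float],     # user-supplied scoring code (higher is better)
    generations: int = 10,
    population_size: int = 8,
    survivors: int = 2,
) -> Tuple[str, float]:
    population: List[str] = [initial_program]
    for _ in range(generations):
        # Mutate: sample parents and ask the LLM for edited variants.
        children = [propose_edit(random.choice(population)) for _ in range(population_size)]
        # Select: keep only the top-scoring programs as parents for the next round.
        scored = sorted(population + children, key=evaluate, reverse=True)
        population = scored[:survivors]
    best = population[0]
    return best, evaluate(best)

# Toy demo: "programs" are strings, the "LLM" appends characters, score = length.
best, score = evolve("seed", lambda p: p + random.choice("ab"), evaluate=len, generations=5)
print(best, score)
```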

And that's similar to what we saw with FunSearch: at the time, they said it achieved some sort of small improvement in a pretty basic operation, matrix multiplication, although at the time that was a little nuanced, not entirely right. With AlphaEvolve, they're showing results for various applications, like autocorrelation and uncertainty inequalities, packing and minimum-maximum distance problems, various math things that I'm clearly not an expert on.

They show somewhat improved outcomes, and it's just, yeah, really the latest of the DeepMind style of paper where they're like, let us build some sort of Alpha-something model to tackle some science, or in this case computer science, problem and get some cool results. Yeah, I think that's how they describe it internally: we're gonna do some kind of Alpha-something. But that's actually, I mean, it's accurate.

One of the ways I used to think about it, and I think I still do, is through the lens of inductive priors, right? So OpenAI is super scale-pilled, right? Just take this thing and scale the crap out of it, and more or less all your R&D budget goes into figuring out ways to get out of your own way and let the thing scale. Whereas Google DeepMind tends to come at things from a perspective of, well,

let's almost replicate the brain, in a way, in different chunks. So we're gonna have a clear chunk, like, you know, an agent that's got this very explicitly specified architecture. We're not just gonna let the model learn the whole thing; we're going to tell it how the different pieces should communicate.

And you can see that reflected here in the kind of pool of functions that it reaches into and grabs, the evolutionary strategy, and how that's all connected to the language modeling piece. They also have an element to this where they're using Gemini Flash, you know, the super fast model, and Gemini Pro, their more powerful but slower model, for different things.

So they use Gemini Flash to generate a whole smorgasbord of different ideas cheaply, and they use Gemini Pro to do kind of the depth and the deep insight work. All those choices, right, sort of involve humans imposing their thinking of how a system like this ought to work. And what you end up finding with these systems is they'll often outperform what you can do with just a base model or an agentic model without a scaffold.

But eventually the base models and agentic models just kind of end up catching up to and subsuming those capabilities. So this is a way that DeepMind does tend to reach beyond the immediate, the ostensible, frontier of what just base models and agentic models can do, and achieve truly amazing things.

I mean, you know, they've done all sorts of stuff with density functional theory and controlling fusion reactions and predicting weather patterns by following this exact approach. So really cool. And it's consistent as well with Isomorphic Labs and all the biotech stuff that they're doing. So it's a really impressive paper. You can see why they're pushing in this direction too, right?

For automating the R&D loop: if you can get there first, you can trigger the sort of intelligence explosion, or at least it starts in your lab first, and then you win. That's a good reason to try that strategy of reaching ahead, even if it's with bespoke approaches that use a lot of inductive priors and don't necessarily scale as automatically as some of the OpenAI strategies might. Yeah, I find it interesting:

looking at the paper, they don't talk super in-depth, as far as I can tell, about the actual evolutionary process in terms of what it is doing. It seems like they pretty much are saying: we took what we had in FunSearch, which was LLM-guided evolution to discover stuff, and we expanded it to do more, to be more scaled up, et cetera, et cetera. So it's them, as you said, taking something and pushing it more and more to the frontier.

They did this also with protein folding, with chess, with any number of things. And now they are claiming some pretty significant advancements on theoretical and existing problems. Also on practical things: they say that they found ways internally to speed up the training of Gemini by 1% by speeding up a kernel used in Gemini training, and also found ways to assist with TPU design and data center scheduling. Anyway, these kinds of actually useful things for Google in the real world.

And next up we have Absolute Zero: Reinforced Self-play Reasoning with Zero Data. So for reasoning models, as we've covered with DeepSeek R1, the standard paradigm these days is to do some supervised learning, where you collect some high quality examples of the sort of reasoning that you want to get, and then do reinforcement learning with an oracle verifier.

So you do reinforcement learning where you're solving coding and math problems, and you are able to evaluate very exactly what you are outputting. Here they are still using a code executor environment to validate task integrity and provide feedback, but they're also going more in the direction of self-evolution through self-play, another direction that DeepMind and OpenAI have also pushed in the past, where you don't need to collect any training data.

You can just launch LLMs to gradually self-improve over time. Yeah. And the way they do that is kind of interesting. So there was a paper, I'm trying to remember what the name of the model was that did this. For some reason I think, and I may be wrong,

I have a memory that it was maybe DeepSeek, or sorry, the lab, not the model. But essentially, this is a strategy where they're gonna say, okay, when it comes to a coding task, we have three elements that play into that task: we have the input, we have the function, or the program, and we have the output, right?

So those three pieces. And they sort of recognize that actually there are three tasks we could imagine getting a model to do based on those things. We could imagine showing it the input and the program and asking it to predict the output; that is called deduction, right? You're giving it a program and an input: predict the output. You could give it the program and the output and ask it to infer the input; that's called abduction.

There's gonna be a quiz later on these names. And then, if you give it input-output pairs and ask it to figure out what the program was that connected them, that's called induction. And all the names actually kind of make sense if you think about them enough. But that's basically the idea, right?

Just basically take the input, the program, and the output, block out one of them, reveal the other two, and see if you can train a model to predict the missing thing. At a high level of abstraction, this is almost a kind of autoregressive training, in a weird way. But the bottom line is they use one unified model that's gonna propose and solve problems.
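
Here's a toy sketch of those three task types, assuming you already have a verified (program, input, output) triple; in the actual system the triples are proposed by the model itself and checked by running the code.

```python
# Build deduction / abduction / induction tasks from one verified triple.
from dataclasses import dataclass

@dataclass
class Triple:
    program: str   # e.g. Python source defining a function f
    inp: str       # a representation of the input
    out: str       # the verified result of running the program on the input

def make_tasks(t: Triple) -> dict:
    return {
        # deduction: given program + input, predict the output
        "deduction": (f"Program:\n{t.program}\nInput: {t.inp}\nWhat is the output?", t.out),
        # abduction: given program + output, infer an input that produces it
        "abduction": (f"Program:\n{t.program}\nOutput: {t.out}\nGive an input that produces it.", t.inp),
        # induction: given input/output pair(s), synthesize the program
        "induction": (f"Input: {t.inp}\nOutput: {t.out}\nWrite a program mapping input to output.", t.program),
    }

example = Triple(program="def f(x):\n    return x * 2", inp="3", out="6")
for name, (prompt, target) in make_tasks(example).items():
    print(name, "->", target)
```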

And they're gonna set up a reward for the problem proposer, which is essentially, you know, generating a program given input and output. And for that, it's your standard thing: if you propose a correct problem, or a program rather, that compiles and everything's good, you get a reward; if not, you don't. Anyway, they do a bunch of Monte Carlo rollouts, in this case eight, just to normalize and regularize.

But yeah, bottom line is, another theme that pops up in this paper is this idea of difficulty control. In this case, the system has a lot of validation steps that implicitly control for difficulty. They're not gonna explicitly say, hey, let's only keep the mid-range difficulty problems by some score; you actually end up picking that up implicitly because of a couple of conditions that they impose.

The first is that the code for the proposed programs has to execute without errors. So automatically that means you have to at least be able to generate that code and it has to be coherent. There's a determinism check too, so the programs have to produce consistent outputs: if you run the program multiple times, you gotta get the same output again. You know, this requires a certain level of mastery. And then there's some safety filtering.

So they forbid the use of harmful packages. And basically, if the program generation part of your stack is able to do all this successfully, then it's probably being forced to perform at least at some minimal level, so the task is not gonna be trivial, at least. And only tasks that pass all those validations contribute to the learning process, so you get a kind of baseline quality for the programs that are generated here. It's a really interesting paper.
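
A hedged sketch of what those validation gates might look like; the blocked-package list and the exec-based harness are illustrative stand-ins, not the paper's actual implementation.

```python
# Three gates: execute without errors, produce deterministic output, avoid disallowed imports.
BLOCKED = {"os", "subprocess", "socket", "shutil"}

def passes_validation(program_src: str, test_input, runs: int = 2) -> bool:
    # Safety filter: reject programs importing disallowed packages (crude string check).
    if any(f"import {pkg}" in program_src for pkg in BLOCKED):
        return False
    outputs = []
    for _ in range(runs):
        namespace: dict = {}
        try:
            exec(program_src, namespace)          # must define a function f
            outputs.append(namespace["f"](test_input))
        except Exception:
            return False                          # execution check failed
    return all(o == outputs[0] for o in outputs)  # determinism check

print(passes_validation("def f(x):\n    return x + 1", 3))  # True
print(passes_validation("import random\ndef f(x):\n    return random.random()", 3))  # False (non-deterministic)
```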

It's something that raises a lot of questions about the data wall, right? This is something that people have talked a lot about: there's only so much data you can fine-tune on, so many examples of solved coding problems. If you have this closed loop, though, that's able to automatically generate new problems, new deduction, abduction, and induction problems, and then close a loop where one feeds into the next, as they have here,

then you really don't have a data wall. And they have some scaling curves that show, admittedly not that far out in scaling space, in sample space, but still, scaling curves that show that, yeah, this does seem to keep going, at least as far as they've tested. If that holds, essentially what they're doing is trading data for compute, right?

Basically, if your model is good enough to start this feedback loop, then just by pouring more compute into it to get the model to pitch new problems that it can then solve, you can start this feedback loop where really there's no data wall, at least none that would seem to apply for the kind of code problem-solving tasks that they're training on here. Right.

And just to note a particular detail, they do actually look into not having the verifiable rewards or the supervised learning. So Absolute Zero is absolute zero because there's no supervised learning or externally curated verifiable rewards, although they are, I think, still executing the code in a computing environment, if I understand correctly. So they can have some feedback from the environment, but not an actual external verification that you got the problem correct.

So as a result, they have to think through all these other techniques to be able to evaluate yourself, like deduction, abduction, and induction, as you said, which allows them to train. They compare to, and I haven't actually been aware of these, there have been more and more open source efforts, as we've seen: apparently there's an Open-Reasoner-Zero, there's also SimpleRL-Zoo, various things over the last couple of months looking into the RL part of reasoning.

And so this is just the latest, and I think it's pushing in a direction of not requiring curated verifiable rewards, which is to some extent the limitation of the DeepSeek R1 formula. Next up, we have another report from Epoch AI. So not a research paper, but an analysis of trends and kind of a prediction of where we might be going. This one is focusing on how far reasoning models can scale.

So the basic question here is, can we look at the training compute that's being used for reasoning models, things like DeepSeek R1 and Grok 3, and from that infer the scaling characteristics and to what extent reasoning will keep growing. Their observation is that we've had a pretty short period of very rapid growth, going from DeepSeek R1 to Grok 3.

They don't know exactly the training compute for o3 versus o1, but they are, I think, estimating here that o3 was trained with quite a bit more. And so their prediction is that the training compute being used for reasoning will start flattening out a bit, converging to a growth rate more like what we've seen for base models, but they're still saying that, you know, the scale of large training runs will keep going over the next couple of years, and presumably the reasoning models will continue improving as a result.

Yeah, we talked about this quite a bit actually, before and when DeepSeek R1 came out, and we were talking about it even before that, when o1 came out: just the idea that you have this new paradigm now that requires a fundamentally different approach to compute, right? Well, we just talked about it.

Instead of just generating an output, being able to score it really quickly, and then doing backpropagation and updating your model weights, what you now have to do is take your base model and generate an entire rollout, and that takes a lot of time and has to be done on inference-optimized hardware. Those rollouts then have to be evaluated, the evaluations have to check out, and then you use those to update your model weights.

And so that whole extra step actually requires a different compute stack. If you look at what the labs are doing right now, they've gotten really, really good at scaling pre-training compute, right? Just this autoregressive pre-training where you're training a giant text autocomplete system. People know how to build multi-billion-dollar, tens-of-billions-of-dollars scale pre-training compute clusters for that. But what we haven't yet seen

is aggressive scaling of the reinforcement learning stage of training. And this is not gonna be a small thing. So it's estimated that, if you look at the cost of pre-training DeepSeek V3, the base model that R1 was built on, about 20% of that cost went into the compute for R1's reinforcement learning. That's not trivial.

And we keep seeing in these compute scaling curves for inference-time scaling that you really do wanna scale it along with your pre-training compute budget, right? So you're gonna get to a point, right now we're ramping up the orders of magnitude like crazy on the RL side, but that's gonna saturate very quickly. I mean, we saw a 10x leap from o1 to o3 in terms of

the compute used for the reinforcement learning stage. As you said, you can only do that so many times until you hit essentially the ceiling of what current hardware allows. Once that happens, you're bottlenecked by how fast you can grow your algorithmic efficiency and your hardware scaling, and essentially that looks the same as pre-training scaling growth, which is about 4x per year.
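
Toy arithmetic, just to illustrate the shape of that argument; the 20% ratio and the 10x-per-generation jump are the rough figures discussed here, not exact numbers from the report.

```python
# If RL compute starts at ~20% of pre-training compute and jumps ~10x per
# generation, the catch-up headroom is gone almost immediately; after that,
# growth is capped by overall compute growth (~4x/year).
import math

pretrain_compute = 1.0                        # normalize pre-training compute to 1
rl_compute = 0.2 * pretrain_compute           # RL stage: ~20% of pre-training cost
jump_per_generation = 10                      # o1 -> o3 style leap in RL compute

generations = math.log(pretrain_compute / rl_compute) / math.log(jump_per_generation)
print(f"{generations:.2f} generations until RL matches pre-training scale")  # ~0.70

years_per_10x_at_4x = math.log(10) / math.log(4)
print(f"{years_per_10x_at_4x:.2f} years per 10x once capped at 4x/year")     # ~1.66
```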

So you should expect a rapid increase: o4 is gonna be really, really good, o5 is gonna be really, really good, but pretty quickly, and it's not that things are gonna slow down like crazy, they'll scale more like the pre-training scaling curves that we've seen. This has big consequences for the US and China, for example, because right now it's creating the illusion that China is better off than they necessarily are. In the early days of this paradigm, when people haven't figured out how to take advantage of giant inference clusters,

the US, which has larger clusters available than China, isn't yet able to use the full scale of its clusters. And so we're getting a sort of artificially hobbled United States relative to China on a compute basis. There are all kinds of reasons why that's actually a more complicated picture, but I thought that was really interesting. Another data point that they flagged here, which I was not tracking at all, was that there are these other reasoning models

that have come out fairly recently, like Phi-4 Reasoning or Llama Nemotron Ultra. And these have really small reinforcement learning compute budgets; we're talking less than 1%, in some cases much less than 1%, of the pre-training compute budget. And so it really seems like R1 is a case of an unusually high investment in RL compute relative to pre-training,

and a lot of the reasoning models being trained in the West have very high pre-training budgets and relatively tiny reinforcement learning budgets. I thought that was super interesting, and something tells me that the DeepSeek R1 strategy is actually more likely to be the persistent one in the long run. I suspect you're gonna see more and more compute flowing into the RL part of the training stack. But anyway, super important questions being raised here.

An interesting little writeup from Epoch AI, which we do love to cover. Right, exactly. And to that point, we've seen kind of a mix of results; it's still not a very clear picture. We've seen that you can mostly get rid of RL, and with a very well-curated dataset for supervised fine-tuning

you can at least make most of the progress towards reasoning, with RL mostly unlocking the hidden capabilities of a base model, as they say, not necessarily adding new capabilities, just sort of shaping the model towards using them well. We also know RL is very different, in terms of training, from autoregressive unsupervised learning, or self-supervised learning, I guess, was the term for a while, in the sense that RL requires rollouts and requires verification.

It just isn't as straightforward to scale as pre-training or post-training. So that's another aspect to consider. But yeah, it's very much still an ongoing research problem, as we've seen with all these papers we keep talking about, with all these different types of results and different recipes. I'm sure we'll likely, over time, converge to what has been the case in pre-training and post-training,

where people, I think, have discovered more or less the recipe, and I'm sure that will increasingly be the case also with reasoning. And on to the last paper, this one coming from OpenAI. So, you know, props; I sometimes have said that OpenAI doesn't publish research anymore, and that's not exactly true. This one is HealthBench: Evaluating Large Language Models Towards Improved Human Health.

It's an open source benchmark designed to evaluate LLMs on healthcare, focusing on meaningful, trustworthy, and unsaturated metrics. It was developed with input from 262 physicians across 60 countries, and it includes 5,000 realistic health conversations to test LLMs' ability to respond to user messages. It has a large rubric evaluation system with a ton of unique criteria, as you might expect.

You know, this is an area where you really want to evaluate very carefully and be sure that your model is trustworthy, is reliable, is even allowed, or should be allowed, to talk about health and questions regarding health. And so they open source the dataset and the eval code so that people can work on AI for healthcare. Yeah, and I mean, to your point about OpenAI not publishing research anymore, I think you are fundamentally correct.

I mean, they don't publish anything about how they build their models. The algorithmic discoveries, let's say; mostly not, sometimes with image generation they have done a little bit, but yeah, mostly not. And here and there for alignment, but it's murky and unclear. And then, you know, when you have something that makes for a great PR play, like, hey, we have done this healthcare thing, please don't regulate us, pretty please,

we're doing good things for the world, then all of a sudden you get all this wonderful transparency. But I will say, credit where credit is due. This is a huge-scale, significant investment, seemingly, that OpenAI had to put into putting this together. So, 5,000, as you said, multi-turn conversations between users and AI models about healthcare. What they did is they got about 300 doctors to look at these conversations and propose bespoke criteria.

So like, you know, specific criteria based on which they would judge the effectiveness of the AI chatbot in that conversation. And so to give you an example: you have a parent who's concerned about their baby, who hasn't been acting like herself since yesterday.

The rubric that the doctors came up with, aggregated from a bunch of different doctors looking at this exchange, is like: okay, does the chatbot state that the infant may have muscle weakness? If so, seven points. Does it list at least three common causes of muscle weakness in infants? If so, plus five points. Does it include advice to seek medical care right away? And so they give points.
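
To illustrate the grading scheme, here's a hedged sketch of rubric scoring; the first two point values echo the example above, the third is made up, and grade_with_llm() is a hypothetical stand-in for the GPT-4.1 grading step described a bit later.

```python
# Per-conversation rubric: weighted criteria, a grader decides whether each is met.
from typing import Callable, List, Tuple

Rubric = List[Tuple[str, int]]  # (criterion, points if met)

infant_rubric: Rubric = [
    ("States that the infant may have muscle weakness", 7),
    ("Lists at least three common causes of muscle weakness in infants", 5),
    ("Advises seeking medical care right away", 5),  # point value not given in the episode; illustrative
]

def score_response(response: str, rubric: Rubric,
                   grade_with_llm: Callable[[str, str], bool]) -> float:
    earned = sum(pts for criterion, pts in rubric if grade_with_llm(response, criterion))
    max_points = sum(pts for _, pts in rubric)
    return earned / max_points   # normalized 0-1, like the benchmark's reported scores

# Dummy grader that marks every criterion as met, just to show the call shape.
print(score_response("chatbot reply goes here", infant_rubric, lambda resp, crit: True))  # 1.0
```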

I mean, it's a very detailed, looking-over-the-AI's-shoulder type of perspective for each of these 5,000 multi-turn conversations, again, using hundreds and hundreds of doctors to do this. And there are some criteria that are shared across many of these exchanges, about 34 of what they call consensus criteria.

These are things that come up again and again, but mostly the criteria are example-specific: like 80% of the criteria they use are literally just for one conversation, just for one exchange. So that's pretty remarkable, a really, really useful benchmark. They use GPT-4.1 to evaluate whether each rubric criterion is met in a given conversation. So they're not actually getting the doctors to review the chatbots'

responses, obviously that doesn't scale. But what they do do is find a way to demonstrate that GPT-4.1 actually does a pretty decent job of standing in for the typical physician; the grades that they give are pretty comparable. And GPT-4.1, by the way, is the best grader model they identified; it does better at that task than even o4-mini and o3. One of the things that really messes with my head on this,

and we have to remember, anytime we look at a benchmark like this, we're tempted to ask, okay, so how well does the best AI do, how well does a doctor do, right, that's the natural question: it is important to note that this is not how typical doctors would evaluate a patient. You would typically have visual access to them, you'd be able to touch, you'd be able to see the nonverbal cues and all that stuff.

That being said, on this benchmark, models do outperform unassisted physicians. Unassisted physicians score 0.13 on average across all these evals; the top models on their own score 0.6, and that's for o3. That is wild. That is roughly a four times higher score than the unassisted physician. That honestly kind of blows my mind a little bit. Certainly these models can draw on much, much larger sources of data. And again, we gotta add all those caveats.

You know, physicians don't normally write chatbot-style responses to health queries in the first place. But it's an interesting note, and we've seen some papers, we've talked about them here, where doctors actually can perform even worse when they work with an AI system than the AI system does on its own, because the doctors are often second-guessing and, you know, don't, let's say, just have blind faith in the model. So pretty interesting.

One more caveat: there is a correlation, we've seen this before, between response length and score on this benchmark. And that's a problem, because it means that effectively the chatbots can game the system a bit just by being very verbose. So surely that's influencing things a little bit. The effect does not, though, come close to accounting for the huge disparity between unassisted physicians and models, which, again, is like a 4x lift. That's pretty wild.

Yeah. Worth noting that there are multiple metrics here, including communication quality, and accuracy as its own metric, and they do actually evaluate the physicians together with the models. The combination there is on par, maybe, with some of these things being better: accuracy seems to be about the same, communication quality may be a bit different. But yeah, physicians with these tools will be much more effective than without. That's pretty clear from the results.

And they do have various caveats as to the evaluation. Like you said, there's a lot of variability there and so on. Also interesting to me: in the conclusion, they note that they included a canary string to make it easier to filter the benchmark out of training corpora. And they are also retaining a small private held-out set to be able to detect instances of accidental training on, or implicit overfitting to, the benchmark.

So I think it's interesting that in this benchmark we're seeing what should probably be the standard practice for any benchmark release these days, which is: you need to make it easy to filter it out of your massive web-scraped training set, and probably also have a private eval set.
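
A minimal sketch of how canary-string filtering works during training-data preparation; the canary value here is a made-up placeholder, not HealthBench's actual string.

```python
# Drop any scraped document containing the benchmark's published canary string.
CANARY = "BENCHMARK-CANARY-0000"  # placeholder; real benchmarks publish a unique GUID-like string

def filter_canary(documents):
    """Yield only documents that do not contain the canary string."""
    for doc in documents:
        if CANARY not in doc:
            yield doc

corpus = ["normal web page text", f"leaked eval question ... {CANARY}"]
print(list(filter_canary(corpus)))  # the leaked document is dropped
```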

On to policy and safety. First up, the Trump administration in the US is officially rescinding Biden's AI diffusion rules. So there was the Artificial Intelligence Diffusion Rule, set to take effect on May 15th and introduced by Joe Biden in January, which aimed to limit the export of US-made AI chips to various countries and strengthen existing restrictions. The Department of Commerce has announced that it will not enforce this Biden-era regulation. A replacement rule is expected, which will presumably have a similar intent.

The rule, which I think we covered at the time, had three tiers of countries: tier three being China and Russia, with very strict controls; tier two countries, which get some export controls; and tier one, which are friends with essentially no controls. So it seems that now the industry as a whole is gonna have to wait to see what the new rules will be. Yeah, the philosophy here, and we have yet to hear the announcement from the Department of Commerce on what will replace this,

but the philosophy seems to be that it'll be nation-to-nation bilateral negotiations for different chip controls, which could make sense. I mean, one of the big weaknesses of the diffusion framework that the Biden administration came out with, and we talked about this at the time, was that they had this insane loophole where, as long as any individual order of GPUs was for fewer than 1,700 GPUs, literally zero controls applied.

And the reason that's relevant is that Huawei's entire MO has literally been to spin up new subsidiaries faster than the US can put them on its export control list, and then use those to pull in more controlled hardware, which Huawei then just pools together. And 1,700 is a decent number of GPUs too, by the way. So putting in an exemption for that number of GPUs, I mean, you're kind of just asking for it.

That is exactly the right shape for China to exploit; it matches exactly the strategy they have historically used to exploit US export control loopholes. So hopefully that's something that'll be addressed in this whole next round of things. We don't yet know exactly what the shape will be, though we do have a sense, and this ties into our next story,

of what the approach will be with respect to certain Middle Eastern countries like Saudi Arabia and the UAE, which are now kind of top of mind as the, not neutral states, but the ones that aren't the US or China, let's say proxy fronts, in this big AI competition. Right. And that does take us to the next piece: Trump's Mideast visit opens a floodgate of AI deals led by Nvidia.

That's from Bloomberg. So the Trump administration has been meeting with two nations in particular, Saudi Arabia and the United Arab Emirates, and we do expect agreements to be unveiled soon. The expectation is there will be eased restrictions, meaning that Nvidia, AMD, and others will be able to sell more, you know, get more out of the region. The stock market reacted very favorably: Nvidia went up 5% and AMD went up 4%.

And there have been a variety of announcements, per the article title, of deals that seem like they'll start happening. So for instance, Nvidia will be providing chips to Saudi Arabia's Humain, a company created to push the country's AI infrastructure efforts. Humain will get several hundred thousand of Nvidia's most advanced processors over the next few years. And there are other deals like that with AMD, Amazon, Cisco, and others.

So the indication seems to be that, you know, some restrictions will be eased. Restrictions were set in part because there were ties between some firms in these regions and China, in particular G42. So yeah, it seems like it might be different from the Biden era. Yeah, it's quite interesting. There's a lot that the different players at the negotiating table here want.

The Saudi deal is especially interesting 'cause it points to a similar kind of deal to the one America started to shape over the last few months with the UAE: being more permissive in some ways, but also insisting that the UAE move away from their entanglements with China. You mentioned G42, right, and Huawei having had some past ties. Well, the strategic situation, if you're Saudi Arabia, is that you wanna be positioned for a post-oil future, right?

That's the same for the UAE, and the same for all the Gulf states really. In Saudi Arabia that's motivated this thing called Project Transcendence, which is a $100 billion initiative for tech in general but specifically for AI; there's a big, big pool set aside for that. The UAE is in a similar position. They already have a national champion lab in G42, as well as the Technology Innovation Institute, TII, yeah, the guys who did the Falcon models.

Yeah, which we haven't heard much about since, by the way, which is kind of interesting. But right now the Saudis are behind the UAE and they're trying to make up ground. And so the UAE and the Saudis are essentially, in some sense, competing against each other to be America's partner of choice for large-scale AI deployments in the Middle East. That's one dimension of this. They wanna get their hands on as much AI hardware, as many GPUs, as they can.

This is one reason why Trump stacked them back to back: first an announcement of the deal with the Saudis, and then heading over to get a deal with the UAE, putting pressure on each of them and playing them off each other. Look, the Saudis have tons of energy, they are an energy economy, same with the UAE, just at the time when we're saturating the US energy grid and that's the main blocker on our deployments.

And so you can see the temptation, if you're OpenAI, if you're Microsoft, if you're Google, to just say, well, why don't we set up a data center in the Middle East where there's an abundance of energy, plug into their grid, and that'll be great for us. Well, there are a couple of reasons why they might not wanna do that. Historically, one was the Biden administration's export control scheme: you just can't move that many chips into a foreign country like that, no good.

But that's being scrapped, as we just talked about. So now the situation is, well, maybe we can, right? Maybe we can negotiate country to country and set this up. But the United States is gonna wanna make sure that if they are setting up AI infrastructure in the UAE or in Saudi Arabia, the Saudis don't turn around and sell that to China, right? China's super good at using third-party countries; historically that's been Malaysia, it's been Singapore, right?

Using those countries to bring in GPUs and subvert US export controls. So, you know, sure, you might have export controls on China proper, but you don't necessarily have them on Malaysia or on Singapore. And what a surprise, a massive influx of GPU orders into Malaysia, of all places, in the last few months. Hmm, wonder where those are being redirected. Right. So this is something the administration wants to make sure doesn't happen with these deals.

There's a whole bunch of issues around Saudi entanglement. You said, you know, the UAE has got a lot of ties to China; so do the Saudis, right? Huawei made Saudi Arabia a regional center for their cloud services. There's the big Saudi Public Investment Fund, the PIF, that's actually bankrolling this whole Project Transcendence thing, and the PIF has joint ventures with Alibaba Cloud.

They've got a new tech investment firm that we covered a few episodes ago called Alat, which also has a joint venture with Dahua, which is an Entity Listed, basically blacklisted, Chinese surveillance tech company, of all things. So there are a lot of entanglements there, and deep questions about how some of the Saudi Arabian GPU reserves are being used, potentially by Chinese academics and researchers as well.

So while there's no hard evidence of the Saudis shipping GPUs specifically to China, you wouldn't necessarily expect there to be; China's MO is absolutely to do stuff like this. And just a last note here on the negotiations. One really interesting thing that's been proposed is this idea of a data embassy. No one's ever proposed this before, but basically it's the idea that, look, if you wanna be able to take advantage of huge sovereign reserves of energy in the UAE and Saudi Arabia,

but you're concerned about the security implications, well, maybe you can set up a region of territory, you know, just like how the US embassy in Saudi Arabia is technically this tiny slice of sovereign American soil inside Saudi Arabia. Well, let's set up a tiny slice of sovereign American soil and put a data center on it. US laws will apply there, you're allowed to ship GPUs to it, no problem, because it is sovereign US territory, so export control isn't an issue.

Sure, you have Saudi energy feeding in, and that's a huge vulnerability; sure, you're embedded in this matrix, but in principle maybe you can get higher security guarantees from doing that. Lots of caveats around that in practice, and I won't go into them, but there are some real security issues around trying something like that, which our team in particular has spent a lot of time thinking about. But this is basically the structure of these deals.

A lot of new ideas floating around. We'll see how they play out, but they definitely put the UAE and Saudi Arabia right up there in terms of the players that might have large domestic stockpiles of chips. All right, so that's a couple of policy stories. Let's have a couple of safety stories to round things out. The next one is a paper, Scaling Laws for Scalable Oversight.

So oversight is the idea that we may want to have weaker models verify that a thing a stronger model is doing is actually safe and aligned and not bad. You might imagine you have a superintelligent system, and humans are not able to verify that what it's doing is okay, and you want weaker AIs to be able to provide oversight over stronger ones so you can, you know, trust them. In this paper they're looking into whether you can actually scale oversight.

And by the way, it's called scalable oversight because you can scale it by using AI to verify things at the speed of AI and compute. What this paper focuses on is what they're presenting as nested scalable oversight, where basically you can do a sequence of models, weaker, stronger, weaker, stronger, and go up a chain to be able to provide verifiable or, you know, trustworthy oversight and make things safe.

So they introduce some theoretical concepts around that, some theoretical guarantees. They do some experiments on games like Mafia, wargames, and backdoor games, and measure in that context what the success rates look like. And yeah, they present this general idea as another step in the overall research on scalable oversight.

Yeah, and I don't know if it was Paul Christiano, back when he was at OpenAI, who invented this whole area, but certainly the idea of doing scalable alignment by getting a weaker AI model to monitor a smarter, stronger AI model is something that he was really big on, and through debate in particular. So his whole thing was debate, and that's one concrete use case that they examine here.

So basically you have a weak model watch maybe two strong models debate over a particular issue, and the weak model is gonna try to assess which of those models is telling the truth. The idea here is, if you can use approaches like this to determine with confidence that one of your stronger models is reliable, then you can take that stronger model and use it to supervise the next level of strength, right, an even smarter model.

And you can maybe start climbing the ladder that way. This paper is basically trying to quantify that, and the way they're gonna do it is with Elo scores. These Elo scores tell you roughly how often a given model will beat another model.

Right. And I forget what the exact numbers are, but it's like, if you have a model with an Elo score of a thousand and another model with an Elo score of 1200, then the model with the Elo score of 1200 will beat the model with an Elo score of a thousand like 70% of the time, or, you know, whatever the number is.
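
For reference, the standard Elo expected-score formula pins down that "whatever the number is"; the 400-point scale factor is the usual chess convention, which is presumably what Elo-style scores like these assume.

```python
# Expected win probability from an Elo rating difference.
def expected_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that the player/model with rating_a beats the one with rating_b."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

print(round(expected_win_probability(1200, 1000), 2))  # ~0.76, roughly 3 wins out of 4
```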

And so this is an attempt to quantify what that climb might look like using Elo scores, using essentially scaling curves for these Elo scores, which is quite interesting. I think there are some pretty fundamental problems with this whole approach, and I don't think that Max Tegmark, who is one of the lead authors of this thing, would actually disagree.

But there's a fundamental issue here, which is that when you think about climbing the intelligence ladder, new capabilities of concern, like deceptive alignment, in other words, the ability of a model to pretend as if it's aligned when it actually isn't, can emerge pretty suddenly. You can have these sorts of emergent capabilities that pop up suddenly and violate these scaling curves. And the kinds of capabilities you worry about in the context of superintelligence

you might expect to arise quite quickly, where there's a sudden sort of cohesion of situational awareness, of capabilities around, you know, manipulation and persuasion, of capabilities around offensive cyber and things like that, all coming together fairly quickly. And if that should happen, then you ought to expect these scaling laws to break down at precisely the stages where you most need them to work.

Nevertheless, this is, I think, a really good quantification of some of the arguments that we've seen from people like Paul Christiano. IDA, I think the acronym was iterated distillation and amplification, or something like that, I forget. I actually looked into it really deeply like four years ago, and now I can't sum it up. But yeah, I think if you're gonna take this seriously, this is a good way to do it.

Looking across different versions of this, like, what if you have a game of Mafia? If you don't know what the game of Mafia is, don't worry about it. What if you've got this debate scenario that I just described? All these different possible scenarios: what do the scaling curves look like in terms of how smart your judge model is versus how smart the models are that are potentially trying to fool the judge model,

and how often can the judge model actually succeed? They've got all these great scaling plots, and yeah, it's a good paper if you're interested in that area. And one last story related to safety: OpenAI pledges to publish AI safety test results more often. So they have actually launched the Safety Evaluations Hub, a page where you can see their models' performance on various benchmarks related to safety.

Things like harmful content, jailbreaks, and hallucinations. And yeah, you can really scroll through and basically see, for GPT-4o, 4.1, 4.1 mini, 4.5, o1, all of them, what the metrics are for various safety behaviors like refusal, jailbreaking, and hallucination. Now, they're not presenting everything they do for safety; they don't have the metrics for their preparedness framework on here. They're gonna continue to do that in

the system cards. But nevertheless, I think it's an interesting move by OpenAI to make it extra easy to see where the models stand. Yeah, if nothing else, this is just a really great format to view these things in. And anyway, you can check out the website; it's actually really nicely laid out. And that will be it for this episode of Last, and sometimes last last, Week in AI. As we've said, we'll try to not skip any more weeks in the near future.

Thank you to all the listeners who stick by us even though we do sometimes break that promise. As always, we appreciate your feedback, appreciate you sharing the podcast, giving reviews, corrections, questions, all that, and please do keep tuning in.
