#206 - Llama 4, Nova Act, xAI buys X, PaperBench - podcast episode cover

#206 - Llama 4, Nova Act, xAI buys X, PaperBench

Apr 09, 2025 · 1 hr 14 min · Ep. 246

Episode description

Our 206th episode with a summary and discussion of last week's big AI news! Recorded on 04/07/2025

Try out the Astrocade demo here! https://www.astrocade.com/

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Check out our text newsletter and comment on the podcast at https://lastweekin.ai/.

Join our Discord here! https://discord.gg/nTyezGSKwP

In this episode:

  • Meta releases Llama 4, a series of advanced large language models, sparking debate on performance and release timing, with models featuring up to 2 trillion parameters across different configurations and applications.
  • Amazon's AGI Lab debuts Nova Act, an AI agent for web browser control, boasting competitive benchmarking against OpenAI's and Anthropic's best agents.
  • OpenAI's image generation capabilities and ongoing financing developments, notably a $40 billion funding round led by SoftBank, highlight significant advancements and strategic shifts in the tech giant’s operations.

Timestamps + Links:

  • (00:00:00) Intro / Banter

Tools & Apps

Applications & Business

Research & Advancements

Policy & Safety

Transcript

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we'll summarize and discuss some of last week's most interesting AI news, and you can go to the episode description to see all the timestamps and links to those articles. I'm one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and I work at a generative AI startup now, which is, let's say, about to do something exciting, hopefully.

Yeah, I'm actually excited about it, 'cause we've been talking offline about this announcement that's coming, and I feel like I probably join the audience in being very curious about what you're up to day to day. So we'll see something soon, I hope. I'm really excited for that. Yeah, yeah. I guess I haven't disclosed too much, but I can say:

I mean, I'm working on AI for making little games, with the idea to have a platform where people can make and publish and play each other's games. And yeah, tomorrow, April 8th, 'cause we are recording this a little late, so Tuesday, is a big launch of the latest iteration of the technology. Oh yeah, exactly. So by the time this episode is out, more than likely we'll have already done a big launch, and anyone listening to this can go to astrocade.com to try it out.

I'm sure I'll try and plug it elsewhere as well, so you'll be sure to hear about it. Yeah. So on that note, we're basically gonna have to blitz this episode, 'cause Andrey has to get to work, man. He's gotta actually, you know, get out there, stop being such a lazy schmuck. So yeah, let's get going, starting right away in Tools & Apps. And the first story is a pretty big one.

It's Llama 4. Meta has released the latest iteration of their open source series of large language models, and large multimodal models as well. These ones come in a few varieties and different sizes: Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth. And they're also launching it across all their various ways to interact with chatbots: WhatsApp, Instagram, Facebook, I forget wherever else they let you talk to AI. And these are quite big.

So just to give an idea, Maverick has 400 billion total parameters, but only 17 billion active parameters. So they are kind of pitching this as more friendly to lower-end hardware configurations. But on the high end, Behemoth, which is not released and which they say is still being trained, has nearly 2 trillion total parameters and 288 billion active parameters.

Which, from what people were speculating around GPT-4 at the time, was something like this: a mixture of experts where you have nearly two trillion total parameters, and then over a hundred billion, maybe 200 billion, active parameters. We don't know. But this reminds me of the GPT-4 kind of speculation. Yeah, this release, by the way, is pretty underwhelming to a lot of people.

So there's this interesting debate happening right now over what exactly is fucked with the Llama 4 release, right? So they're large models. I'll talk about the positives first. From an engineering standpoint, everything that's revealed in the sort of 12-page read, I don't know whether to call it a blog post or a technical report, it's not this beefy 50-pager of the kind that DeepSeek produces, but it gives us some good data.

Everything we get there about the engineering seems kind of interesting. By the way, when it comes to the general architecture choices being made here, there's a lot of inspiration from DeepSeek, man, like a lot. And just to give you an idea, they trained at FP8 precision. So again, like DeepSeek V3, though DeepSeek V3 used some fancier mixed-precision stuff too; we talked about that in the DeepSeek episode. The theoretical performance of an H100 GPU is

around a thousand teraflops for FP8, and they were able to hit 390 teraflops in practice. So they were hitting utilization of around 39 to 40%, which is on the high end for a fleet of GPUs this big; this is 32,000 H100 GPUs they used for this. That is no joke. Just getting your GPUs to hum that consistently is a very big deal. And so from an engineering standpoint, that's a pretty good sign.
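To make the utilization math concrete, here's a minimal sketch of the arithmetic. The ~990 dense FP8 teraflops peak figure for an H100 is an assumed spec-sheet value rather than something stated above; the 390 teraflops achieved and the 32,000-GPU fleet size are the numbers quoted in the discussion.

```python
# Minimal sketch of the model FLOPs utilization (MFU) arithmetic discussed above.
# PEAK_FP8_TFLOPS is an assumed spec-sheet value for a dense (non-sparse) H100;
# the achieved throughput and fleet size are the figures quoted in the discussion.

PEAK_FP8_TFLOPS = 990.0     # assumed theoretical dense FP8 throughput per H100
ACHIEVED_TFLOPS = 390.0     # throughput reportedly hit in practice
NUM_GPUS = 32_000           # size of the training fleet

mfu = ACHIEVED_TFLOPS / PEAK_FP8_TFLOPS
fleet_pflops = ACHIEVED_TFLOPS * NUM_GPUS / 1_000

print(f"Utilization (MFU): {mfu:.1%}")                        # ~39%
print(f"Sustained fleet throughput: ~{fleet_pflops:,.0f} PFLOPS")
```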

There are a couple of things here that they did that are a little bit distinct. So one piece is that this is a natively multimodal model. So yes, drawing a lot of inspiration from DeepSeek, but at the same time very much that Meta philosophy of, you know, we want good grounding, good multimodality.

They use this technique called early fusion, where essentially text and vision tokens are combined from the very start of the model architecture, and they're all used to train the full backbone of the model. That means the model learns a joint representation of both of those modalities from the very beginning. That's in contrast to late fusion, where you would process text and images and other data in separate pathways and just merge them near the end of the model.
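To make "early fusion" concrete, here's a minimal PyTorch-style sketch of the idea: image patches are projected into the same embedding space as text tokens and concatenated before a shared transformer backbone, so one set of weights trains on both modalities from the start. The layer sizes, module choices, and names are illustrative, not Meta's actual architecture.

```python
import torch
import torch.nn as nn

class EarlyFusionBackbone(nn.Module):
    """Illustrative early-fusion model: text and vision tokens share one backbone.
    Dimensions and modules are made up for clarity, not taken from Llama 4."""

    def __init__(self, vocab_size=32_000, d_model=512, patch_dim=768, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> embeddings
        self.vision_proj = nn.Linear(patch_dim, d_model)      # image patches -> same space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, text_ids, image_patches):
        text_tokens = self.text_embed(text_ids)               # (B, T_text, d_model)
        vision_tokens = self.vision_proj(image_patches)       # (B, T_img, d_model)
        # Early fusion: concatenate both modalities *before* the backbone,
        # so every layer learns a joint representation.
        tokens = torch.cat([vision_tokens, text_tokens], dim=1)
        return self.backbone(tokens)

model = EarlyFusionBackbone()
out = model(torch.randint(0, 32_000, (2, 16)), torch.randn(2, 9, 768))
print(out.shape)  # torch.Size([2, 25, 512])
```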

Late fusion is more of a janky, hacked-together Frankenstein monster. This is not that, right? This is more of a monolithic thing. Anyway, there's a whole bunch of stuff in here. It turns out that Scout and Behemoth seem to be in the same model line; they're making some of the same design choices in terms of their architecture. Maverick seems like a DeepSeek clone. It seems like an attempt, and some people are speculating,

a last-minute decision to try to replicate what DeepSeek did. Whereas if you look at Scout and you look at Behemoth, those are much more of the kind of, like you said, Andrey, it's like somebody was trying to do GPT-4 meets a Mixtral type of model, and it's very unclear why this happened. But one thing we do know is the performance seems to be shit when people actually run their own benchmarks on it, or at the very least very mixed, right?

There's all this stuff about, for example, on LMSYS they've got a model that seems to be crushing it, doing great with an amazing Elo score. But if we read very closely in the paper, the model they used for that is not any of the models they released. It's a custom fine-tune for the LM Arena leaderboard. And that is a big problem, right? That's one of the things people are really ripping on Meta about.

Like, look, you're showing us eval results, benchmark results, for one model, but you're releasing a different one. And this really seems like eval gaming. There are all kinds of weird questions about why they released this on a Saturday. Like, this is supposed to be one of your flagship launches, your frontier launch, what is going on? The release date on GitHub was switched from April 7th to April 5th, basically overnight.

Maybe because they're anticipating some interesting releases coming this week that would scoop them. There's a whole bunch of stuff around here. Last thing I'll say is, one explanation people are throwing around for the bad performance of these models is just the hosting: the systems they're being hosted on are just not optimized properly.

Maybe they're quantizing the model a little too much or poorly, maybe they're using bad inference parameters like temperature or top-p or whatever, or bad system prompts. Anything like that is possible, including more nuanced hardware considerations. But the bottom line is, this may be the flub of the year in terms of big flashy launches that should have been a big deal.

I think we'll be picking up the pieces for a couple of weeks to really figure out how we actually feel about this, because right now there's a lot of noise, and I personally am not resolved yet on how impressive any of these things really are. But those are my top lines anyway. Yeah, I think that's a good overview of the sort of discussion and feedback and reactions people have been seeing.

There's also been speculation that at Meta, at least, leadership pushed toward gaming the quantitative benchmarks, things like MMLU, GPQA, the usual sort of numbers you see, LiveCodeBench. Of course they say it's better than Gemini 2.0 Flash, better than DeepSeek V3.1, better than GPT-4o.

But as you said, when people are using these models anecdotally, personally, or on their own personal held-out benchmarks, which are not these publicly available benchmarks where you can cheat, where you can train on them, even accidentally cheat, or sort of cheat by not intentionally trying not to cheat, right? Which is one of the important things you do these days: you need to make sure your model isn't trained on the benchmark data when you're scraping the internet.

If you're not doing that, you might be cheating without knowing it, or at least pretending not to know it. So yes, the general reaction is that the models are not good. Worth noting also, as you said, with Behemoth, Maverick, and Scout, there is this difference where Behemoth and Scout have 16 experts. So, you know, pretty big experts that are doing most of the work. Maverick is different.

It has 128 experts, so it's a bigger model, but the number of active parameters is low. And I think there are various reasons you could speculate about. Like, they want the models generally to be runnable on less hardware, and Behemoth would be the exception to that. As you said also, they need to keep costs down.
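For intuition on the total-versus-active distinction in a mixture-of-experts model, here's a small worked sketch. The 400B-total / 17B-active Maverick figures are from the discussion above; the split between shared and per-expert parameters, and the one-routed-expert-per-token assumption, are illustrative guesses, not Meta's published breakdown.

```python
# Minimal sketch of why a mixture-of-experts model's "active" parameter count is so
# much smaller than its total. The shared/expert split below is a hypothetical
# illustration chosen to roughly match the quoted Maverick figures.

def moe_param_counts(shared_params, expert_params, num_experts, experts_per_token):
    """Total parameters stored vs. parameters actually used for any single token."""
    total = shared_params + num_experts * expert_params
    active = shared_params + experts_per_token * expert_params
    return total, active

# Hypothetical split: ~14B shared (attention, embeddings, shared expert) and
# ~3B per routed expert, with 1 routed expert active per token.
total, active = moe_param_counts(shared_params=14e9, expert_params=3e9,
                                 num_experts=128, experts_per_token=1)
print(f"Total:  {total / 1e9:.0f}B parameters")   # ~398B, close to Maverick's 400B
print(f"Active: {active / 1e9:.0f}B parameters")  # ~17B, matching the quoted figure
```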

I have no idea how Meta is thinking about the business plan here of supplying free chatting with LLMs, which is, relative to anything else, very expensive, and they're still doing it for free kind of all over their product line. So there are various speculations you can have here. But as you said, seemingly the situation is they launched possibly in a hurry because something else is coming, because these businesses typically know about each other's releases somehow.

And perhaps they should have waited a little longer. Yeah, and a lot of questions as well around just the size of these models and what it means to be open source, right? We've talked about this in the context of other models, including DeepSeek V3 and R1 and all that. At a certain point, your model is so big, you just need expensive hardware to run it in the first place. And I think this is a really good example of that, right?

So Scout, which is meant to be their small model, right? It sounds like Flash, it sounds like one of those, you know, 2.7 billion parameter models or something. It is not. It's a 17 billion active parameter model, as you said. Their big flex here is that it fits on a single Nvidia H100 GPU. So that's actually pretty swanky hardware. That's tens of thousands of dollars of hardware; that's 80 gigs of HBM3 memory, basically.
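For intuition on the single-H100 claim for Scout, here's back-of-the-envelope memory math. The roughly 109 billion total parameter count and the assumption that the single-GPU figure relies on aggressive (around 4-bit) weight quantization are assumptions on my part; the sketch also ignores KV cache and activation memory, which need their own headroom.

```python
# Back-of-the-envelope sketch of the "fits on one H100" claim. Assumes roughly
# 109B total parameters for Scout (an assumption, not stated in the discussion)
# and ignores KV cache / activation memory.

TOTAL_PARAMS = 109e9
H100_HBM_GB = 80

for bits in (16, 8, 4):
    weight_gb = TOTAL_PARAMS * bits / 8 / 1e9
    verdict = "fits" if weight_gb < H100_HBM_GB else "does not fit"
    print(f"{bits}-bit weights: ~{weight_gb:.0f} GB -> {verdict} in {H100_HBM_GB} GB of HBM3")
```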

One thing, by the way, that this stuff does have going for it is an insanely large context window: 10 million tokens. That is wild. The problem is that the only evals they show on that context window length are needle-in-a-haystack evals, which, as we've covered before, are pretty shallow. It doesn't really tell you much about how the model can use the information that it recovers. It only tells you, oh, it can pick out a fact that's buried somewhere in that context window.

It's not bad, but it's not sufficient, right? It's one of those things. And that's Llama 4 Scout. Maverick, they say, fits on one H100 GPU host. Now, the word host is doing a lot of heavy lifting there. Really what that means is one H100 server. So presumably they mean the H100 DGX; in fact, I think that is what they said in the write-up. That would be eight H100s, so hundreds of thousands of dollars worth of hardware. Hey, it fits on just one of these servers.

Like, that's a lot of hardware. So anyway, bottom line is, incidentally, Scout, I believe, is a distillation of Llama 4 Behemoth, which is still in training. So we don't know what Llama 4 Behemoth is actually gonna look like, so we're all kind of holding our breath on that.

But for now, unless Meta has really screwed the pooch on distillation, and they have an amazing Behemoth model and the distillation process just didn't work, it seems plausible that the Behemoth model itself may be underperforming as well. But again, all this is still up in the air. As with so many things here, it does seem like a rushed release, and I think the dust is gonna be settling for a few weeks to come.

Still worth highlighting that they are sticking to their general approach of open sourcing this stuff. You can request access to Llama 4 Maverick and Llama 4 Scout to get the actual weights, as you were able to with previous Llamas. As was previously the case, they're licensed under these bespoke things, the Llama 4 Community License Agreement, where you are saying you will not be doing various things that Meta doesn't want you to do.

But still, you know, Llama has, I think, been a big part of a lot of the research and development on the open source side. So that at least is still laudable. And on to the next story. We are moving to Amazon, which hadn't released too many models, but they seemingly are starting, and their first act here is Nova Act, which is an AI agent that can control a web browser. This is coming from Amazon's AGI lab, which hasn't

really put out many papers or products so far, but this seems to be a pretty big one. And it is comparable to something like OpenAI's, I forget what it's called, but their web-use agent that can... Oh, Operator. Yeah, Operator. Exactly. Where you can tell it, you know, go to this website, scrape all the names of the links and summarize it for me, stuff like that. So they have this general-purpose AI agent.

They're also releasing the Nova Act SDK, which would enable developers to create agent prototypes based on this. They say this is a research preview, so still a bit early, presumably not fully baked. Yeah, it's an interesting play. I think we still don't have too many models or offerings of this particular variant. We have Anthropic's computer use, we have OpenAI's Operator, which I don't recall if they have an SDK for. So this could be a pretty significant entry in that space.

And this is the first product to come out of the Amazon AGI lab, right? So this is kind of a big unveiling. We don't have a ton of information, but a couple of notes on benchmarks, right? So they are claiming that it outperforms OpenAI's and Anthropic's best agents on at least the ScreenSpot Web Text benchmark. That's a measure of how well an agent interacts with text on a screen.

And apparently on this benchmark, Nova Act scored 94%, and that's in contrast to OpenAI's agent, which scored 88%, and Anthropic's Claude 3.7 Sonnet at 90%. So significant, seemingly, on that benchmark, but that isn't enough data to actually be able to generally assess the capabilities of this agent.

In particular, WebVoyager is a more common evaluation, and the performance of the Nova Act agent isn't being reported on that, so that kind of leads you to ask some of these questions. But we'll see. I mean, they definitely have great distribution through Alexa, and maybe that'll allow them to iteratively improve this pretty fast. We'll see.

They're also improving their hardware stack, thanks to, among other things, the Anthropic partnership. So even if it doesn't come out great out of the gate, they're sitting at least on a promising hyperscaler stack, so this might improve fairly quickly. Right, and it could also be part of their plans for Alexa+, their new subscription service. They are launching Alexa also as a website, in addition to their hardware.

So presumably they might be thinking of making this part of their product. And we'll keep pushing on to a few more stories. Next up, another giant company planning a giant model release. This one is Alibaba, and reportedly they are preparing to release their next flagship model, Qwen 3, soon. The article here says as soon as April; apparently Alibaba is kind of rushing here to respond to DeepSeek and a lot of the other hot activity going on in China.

We've talked about Qwen actually quite a bit over recent months: Qwen 2.5, various smaller releases, I guess you could say. And Qwen 3 presumably is meant to be the best and beat everyone, right? The one thing I'll say is I've seen speculation that this may have been part of the driver for the rapid release of Llama 4, but that's all we really know. Next, we have a smaller company, a startup, Runway, and it has released their latest video-generating AI model.

So this is Gen-4, and it is meant to be kind of a customer-facing, usable video generation model. And it looks pretty impressive from just having looked at it. It is kind of catching up to Sora, catching up to more top-of-the-line models that are capable of consistent video, and capable of being prompted both by text and image.

They have a little mini short film that they launched this with, where you have a cute forest and then some characters interacting, showcasing that consistency across multiple outputs. And this is at a time when they are raising a new funding round valuing them at $4 billion, with the goal of getting, or, sorry, their goal is to get $300 million in revenue this year. So Runway is a major player in the space of AI for video. And a similar story next.

This one is about Adobe, and they are launching an AI video extender in Premiere Pro. So Premiere Pro is their flagship video editing tool, and we've seen them integrate a lot of AI into Photoshop, for instance. This is the first major AI tool to get into Premiere Pro and video editing. The new feature is Generative Extend. It will allow you to extend videos, but only up to two seconds, with Adobe Firefly. We covered it, I think, when they previewed this, but now it's rolling out to the actual product.

Yeah. This kind of rollout, at least to me, makes a lot of sense as a first use case for these sorts of video models. It sort of reminds me of, you know, Copilot, that was powered by Codex back in the day, right? The first use case was text or code autocomplete. We've seen that the autocomplete kind of functionality is sort of native for a lot of these transformer models.

This one's a little different, but it's still kind of this very natural thing, very grounded in real data, and you're just extending it a little bit. So especially for video, where you need to capture a lot of the physics, that's something I'd expect to be a nice way to iron out a lot of the kinks in these models. Right. And they do have some other small things launching alongside that.

They have AI-powered search for clips, where you can search by the content of a clip, which I would suspect is actually a bigger deal for a lot of people. That's true. Yeah. Because if you have a hundred clips, now you can search by content as opposed to file name or whatever. And they also have automatic translation. So quite a few significant features coming from Adobe. And just one more story.

OpenAI is apparently preparing to add a reasoning slider and also improved memory for ChatGPT. So we've seen some people starting to observe, presumably in testing, this idea of a reasoning slider, which allows you to specify that the model should think a little, think harder, or you can leave it on automatic to let the model do its own thing, mirroring to some extent what Anthropic has also been moving toward. And on to Applications & Business.

The first one is about Nvidia H20 chips, and there being $16 billion worth of orders from ByteDance, Alibaba, and Tencent recently. So this is covering a set of events, I suppose, or a period in early 2025, where these major AI players from China have all ordered a massive amount of the H20 chip, one that's not restricted.

And the one that I believe, or some variant of it, was what DeepSeek was built upon, and what showcased the ability to train DeepSeek V3. So this is a big deal, right? And Nvidia presumably is trying hard to not be too limited, to be able to actually do this. Yeah, DeepSeek was trained on the H800, which is another China variant of the H100. So you're right, they all fall under the Hopper generation, but specifically for China.

This is a play that we've seen from Nvidia a lot. It's them responding to the looming threat of new potential export control restrictions, right? So you can imagine, if you're Nvidia and someone tells you, hey, in a couple of months we're gonna crack down on your ability to sell this particular GPU to China, well, you're looking at that and you'll go, okay, I'm gonna quickly try to sell as many of these to China as I can while I still can and make that money.

And then, you know, once the export control ban comes in, that's it, right? So you're actually gonna tend to prioritize Chinese customers over American customers. This has happened in the past; it will continue to happen in the future as long as we include loopholes in our export control regimes. And so what you're literally seeing right now is Nvidia making the call:

do we have enough time to proceed with making the chips to meet the $16 billion set of orders from ByteDance, Alibaba, and Tencent? Like, are we gonna have the GPUs ready and sold before the export control bans come into effect? Otherwise, if we don't, we're just sitting on this hardware. Now, keep in mind, H20s are strictly worse than, say, the H100s that Nvidia could be making, or the H200s that Nvidia could be making instead.

So from the standpoint of selling them to the domestic market, or, you know, potentially, depending on the node and the interactions here, they could be making Blackwells too. So there's this question of, if they choose to go with making H20s to try to meet the Chinese demand, which is about to disappear, then they may end up being forced to sit on these relatively crappy H20 chips that don't really have a market domestically in the US, right?

So that's a big risk, and they're calculating that right now. They have limited TSMC allocation to spare, so it's not like they can meet both. At the macro level, you can think of this as Nvidia deciding whether to make the TSMC fabs churn away on chips for China or for the US. That's kind of the decision point here; that's what it boils down to. So again, we've seen this before.

And it's all gonna come down to their internal assessment of when exactly these export controls are gonna come. Moving along, we have kind of a funny story, and also another one of the big business stories of the week, and it is that Elon Musk's X, previously Twitter, has been sold to Elon Musk's xAI in a $33 billion all-stock deal. So you heard that right: the AI company that is developing Grok has bought the social media company Twitter slash X for

tens of billions of dollars. Grok has been hosted as part of X basically since its inception; you can pay for a subscription tier to use Grok on X. Grok.com, I believe, also exists, but I guess Grok has primarily lived on X, and the justification here is, of course, that Twitter will provide a lot of data to train Grok, and there are deep synergies that can be leveraged.

Yeah. It's also kind of interesting to note that when Elon actually bought X, or Twitter as it was back then, he paid $44 billion for it, right? So now this is an all-stock deal at $33 billion, so the company's value has actually nominally decreased. Now, there are all kinds of caveats there. You know, you have a sort of internal, let's say, purchase within the ecosystem.

I'm not clear on what the legalities of that are, whether fair market value is an issue in the same way that Elon raised it with respect to OpenAI's attempt to have its nonprofit sell control of the for-profit arm, right? That was one argument: hey, you're not selling this at fair market value. I suspect this is probably more kosher because it has fewer, you know, control weirdness issues, but super not a lawyer. Just interesting numbers.

So yeah, anyway, an all-stock transaction, and the ultimate combination of these two would value xAI at $80 billion. So that's, you know, pretty big. And also, interestingly, pretty close to Anthropic's valuation, right? Which is amazing given that xAI was, 20 minutes ago, non-existent. Right. This is a classic Elon play. Yeah, it's a lot confusing, you're right. xAI came out of nowhere, right? Like, what, 18 months, two years ago? Pretty wild.

Pretty impressive, and a kind of classic Elon play. So there you have it. Not much information in the article, but we just get these top lines, and the numbers are big. The numbers are big. And as you might expect, there's speculation. I mean, a lot of people are making memes about the, you could say, self-dealing, I suppose, in this case, right? Like, these are two Musk companies, and one of them is buying the other.

And it could have to do with some financial aspects of the purchase of Twitter, the loans that Elon Musk took out against his Tesla stock, which is now falling a little bit. Yeah, so you can have various kind of nitpicky ideas as to why this happened right now and why this precise pricing. But in any case, I suppose not entirely surprising given how this has been going. On to the lightning round.

We have the story that SoftBank is now OpenAI's largest investor and has pushed the market cap of OpenAI to $300 billion, but at the cost of a ton of debt. So SoftBank is a big investor out of Japan that has invested very large sums into various tech startups, and seemingly they've taken on debt to do the $40 billion investment round for OpenAI, with $10 billion borrowed from Mizuho Bank. So, wow, that's quite a borrow.

Yeah, it is. It's also consistent with the question of where the hundred billion, let alone 500 billion, they've talked about investing would come from. And, you know, all the sordid details started to come out, and it's like, yeah, well, this is part of it: they're literally borrowing money. So this is an interesting play. Masayoshi Son is taking a big risk here; there's no other way to put it. And there's a kind of budding relationship there with Sam Altman that we've seen in various forms.

There are a couple of strings attached, as you might imagine, right? If you give $40 billion to an entity, there are gonna be strings. So the $10 billion deal is expected to be completed in April; the remaining amount is set to come in early 2026, right? So this is not super near term. But when you're thinking about the superintelligence training runs that OpenAI internally thinks are plausibly gonna happen in 2027, this is basically that, right?

This is that injection of capital. OpenAI has to transition to a for-profit by the end of the year in order to get the full $40 billion, right? So more and more pressure mounting on OpenAI to successfully complete that transition, which seems to be bogged down in an awful lot of legal issues now. So this is kind of a challenge. Apparently SoftBank retains the option to pare back the size of the funding round to $20 billion

if OpenAI does not successfully transition. That is an accelerated deadline, right? OpenAI previously was under a two-year deadline from its last round of funding, and now it's saying, okay, well, by the end of the year you've gotta complete this transition. So that's kind of interesting: more heat on Sam to complete this weird, you know, acquiring of the, is it the nonprofit, the for-profit?

I mean, who can keep track anymore, but basically, yeah, the acquiring of the nonprofit by the for-profit and all that. So we'll see. I mean, this is gonna be a legal and tech drama the likes of which I don't think many of us have seen. Truly a unique situation, which can be said about OpenAI in general; who hasn't been there? Yeah. And next up we have a story about DeepMind, which is Google's AI arm.

And there are reports now that it's holding back the release of AI research to give Google an edge. Just, I think, last week, maybe two weeks ago, we were commenting on some of the research coming out of DeepMind as seeming like something Google might not want to share because it is competitively sensitive. And yeah.

Now there are reports coming from former researchers that DeepMind, for example, is particularly hesitant to release papers that could benefit competitors or negatively impact Gemini, their LLM offering. And I've also heard a little bit of this from people I know: that there's a lot of bureaucracy, quite a lot of red tape, around publication and getting to publication these days.

So apparently the new publication policies include a six-month embargo on strategic generative AI research papers, and researchers have to justify the merits of publication to multiple staff members. Yeah. This is also, I mean, in effect, we've seen this sort of thing from DeepMind before. In particular, I'm trying to remember if it was the Chinchilla paper or if it was Gato or something.

Anyway, we've talked about this before on the podcast as well, but there was one case early on where there was a full year's delay between, I think, the end of the training of a model and then its announcement. It was one of these early post-GPT-3 models. And so, you know, this is in substance maybe partly new, and partly a habit that's been developed internally.

It makes all the sense in the world because, you know, they're forced to compete with an increasingly hot field of companies like OpenAI and Anthropic, and Chinese companies. But it's definitely interesting. I mean, the claim as well has been, there were three former researchers they spoke to who said DeepMind was more reluctant to share papers that could be exploited by competitors, or that cast Google's own Gemini models in a negative light compared to others.

They talked about an incident where DeepMind stopped the publication of research that showed that Gemini is not as capable or is less safe than, for example, GPT-4. But on the flip side, they also said that they blocked a paper that revealed vulnerabilities in ChatGPT, over political concerns, essentially concerns that the release would seem like a hostile tit-for-tat with OpenAI.

So you get a bit of a glimpse of the kind of inter-company politics, which is absolutely a thing. There's a lot of rivalry between these companies, which is kind of an issue too on the security and the control and alignment side. But anyway, there you have it. Maybe not a shock, but certainly a practice that we now have more evidence for coming from Google. And by the way, it's certainly gonna be practiced at other companies too.

This is not gonna be a Google-exclusive thing. Yeah. I mean, to be fair, OpenAI has basically stopped publishing, for the most part. So as you said, not surprising, but, well, DeepMind for a long time was a fairly independent, pure-research, more or less, organization that was much more academia-friendly, and that is definitely changing. Next up we have a story about SMIC, China's leading semiconductor manufacturer, trying to catch up to TSMC.

Apparently they are at least rumored to be completing their five-nanometer chip development by 2025, but at much higher costs, due to using older-generation equipment and presumably having very, very poor yields. Yeah, and check out our hardware episode for more on this. But what they're doing is they're forced to use DUV, deep ultraviolet, lithography machines rather than EUV for the five-nanometer node.

Normally five nanometers is where you see the transition from DUV to EUV, or at least that's one node where you could. And anyway, in order to do it with DUV, the resolution is lower, so you've gotta do this thing called multi-patterning: take the same chunk of your wafer and scan it through again and again, potentially many times, in order to achieve the same resolution as you would with just one scan with EUV.

And so what that means is you're spending four times as long on one particular pass for one particular layer of your lithography, and that makes your output slower. So it means that you're not pumping these things out as fast, and it also reduces yields at the same time, both of which are economically really bad. So yeah, yields are expected to be an absolutely pathetic 33%, and that translates into a 50% higher price than TSMC for the same node, by the way.
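As a rough illustration of how yield feeds into the economics, here's the cost-per-good-die arithmetic. Only the 33% yield figure comes from the report discussed above; the wafer cost, dies per wafer, and the 90% comparison yield are hypothetical placeholders, and the quoted ~50% end-price premium also reflects pricing and subsidy decisions, not just raw yield math.

```python
# Hypothetical sketch of how yield drives cost per good die. The 33% yield is the
# figure discussed above; wafer cost, dies per wafer, and the 90% comparison yield
# are made-up placeholders purely to show the arithmetic.

def cost_per_good_die(wafer_cost: float, dies_per_wafer: int, yield_rate: float) -> float:
    return wafer_cost / (dies_per_wafer * yield_rate)

WAFER_COST, DIES_PER_WAFER = 17_000, 60   # hypothetical numbers

low_yield = cost_per_good_die(WAFER_COST, DIES_PER_WAFER, 0.33)   # SMIC-like yield
high_yield = cost_per_good_die(WAFER_COST, DIES_PER_WAFER, 0.90)  # mature-process yield

print(f"At 33% yield: ${low_yield:,.0f} per good die")
print(f"At 90% yield: ${high_yield:,.0f} per good die")
print(f"Raw cost multiple: {low_yield / high_yield:.1f}x")
```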

I mean, the CCP is gonna be subsidizing this to blazes. So the economics are just fundamentally different when you talk about, for example, lithography and fabs for AI chips in China, 'cause it's just a national security priority. But still, this is kind of interesting, and this node will be used by Huawei to build their Ascend 910C chip. So these are actually gonna see the light of day in production.

And just one more story in this section: Google-backed Isomorphic Labs is raising $600 million. So Isomorphic Labs basically spun out of DeepMind back in 2021. They are focused on AI to model biological processes, and presumably primarily to do drug discovery. And this is their first external funding round. So they, you know, were able to do a lot, I suppose, with support from DeepMind and Google.

They have now made major deals with companies like Eli Lilly and Novartis for billions in partnerships and research programs. So Isomorphic Labs is certainly active, it seems. Yeah, it's sort of a funny headline in a way. Like, I'm not sure how they get to "external investment round," 'cause apparently, yeah, they see the financing round is led by Thrive Capital. Okay. Okay, cool, cool.

With participation from GV, which is called GV the same way KFC is called KFC: KFC used to be called Kentucky Fried Chicken, GV used to be called Google Ventures. Google. Oh, shit. That's Alphabet, isn't it? Yes. Yes. Alphabet, also kind of the parent company of Isomorphic Labs, right? So GV is participating, that's Google. And then follow-on capital from an existing investor, Alphabet. So, at least by entity count, this is two-thirds sort of the Google universe.

Yeah. "External" is, let's say, generous, I suppose. Yeah, yeah. Like, I couldn't see how much of it is being led by Thrive Capital, and they're external, so great. I don't know how much is being contributed by whom, but I just sort of thought that was funny. Like, Google is so big, or Alphabet is so big, they're kind of everywhere all at once. Anyway, what counts as external these days? I don't know anymore. And moving on to the Research & Advancements section.

First up, we actually have a paper from OpenAI. So I should take back a little bit of what I said about them not publishing; they do still publish some very good research. And this paper is called PaperBench: Evaluating AI's Ability to Replicate AI Research. So this is basically doing what it sounds like: they are evaluating, in a benchmark suite, the ability of AI agents to replicate state-of-the-art, real AI research. So this is 20 ICML 2024 spotlight and oral papers, from scratch.

They need to understand the paper, they need to develop the code, and they need to execute the experiments. So the kind of ultimate result we are seeing is that the best-performing agent, Claude 3.5 Sonnet with some scaffolding, achieves an average replication score of 21%. And that is worse than top machine learning PhDs, who were also recruited to attempt the benchmark. And they are also open sourcing this benchmark to facilitate future research into the AI engineering capabilities of AI agents.

One of the key things behind this benchmark is the strategy they use to decompose the papers, or the task of replicating a paper, into a kind of tree with increasingly granular requirements. So you'll have these leaf nodes that have extremely specific, binary, and relatively measurable results of the replication that you can actually get your judge LLM, or this thing called JudgeEval, to go through and evaluate. And then what they do is,

in a way it's sort of like, not a backpropagation thing, but they essentially combine the leaf nodes together into one sort of not-quite-leaf node, and then those next layers of the tree combine and merge up at higher and higher levels of abstraction. And so this allows you to essentially give partial marks for partial replications of these papers.

And a submission is considered to have replicated a result when that result is reproduced by running the submission in a fresh setup. So there's a whole reproduction phase before the grading phase begins. And the kinds of tasks that they're evaluating at the leaf nodes are things like code development, execution, or results match.
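Here's a minimal sketch of that rubric-tree scoring idea: leaf requirements get binary grades from the judge, and scores are averaged up through parent nodes with weights, so partial replications earn partial credit. The node names and weighting are simplified illustrations, not PaperBench's exact implementation.

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One node in a PaperBench-style rubric tree (simplified sketch).
    Leaves carry a binary grade from the judge; internal nodes aggregate children."""
    name: str
    weight: float = 1.0
    passed: bool | None = None                 # set only on leaf nodes by the judge
    children: list["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        if not self.children:                  # leaf: binary requirement graded by judge LLM
            return 1.0 if self.passed else 0.0
        total_weight = sum(c.weight for c in self.children)
        # Internal node: weighted average of children, propagated up the tree,
        # which is what allows partial credit for partial replications.
        return sum(c.weight * c.score() for c in self.children) / total_weight

rubric = RubricNode("replicate paper", children=[
    RubricNode("code development", children=[
        RubricNode("training loop implemented", passed=True),
        RubricNode("eval script implemented", passed=False),
    ]),
    RubricNode("execution", children=[RubricNode("training run completes", passed=True)]),
    RubricNode("results match", children=[RubricNode("table 1 within tolerance", passed=False)]),
])

print(f"Replication score: {rubric.score():.0%}")  # 50% in this toy example
```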

That is, whether you match a particular result or product of executing the code. Anyway, this is all, I think, a very interesting way of breaking down these complex tasks so you can measure them more objectively. I think what's especially interesting here is that we're even here, right?

We're talking about, let's make an evaluation that tests an LLM's ability, or an agent's ability, to replicate some of the most exquisite papers that exist. ICML, as you said, is a top conference, and these are the spotlight and oral papers that were selected from the 2024 ICML conference. So these are truly, truly exquisite. They span 12 different topics, including deep RL, robustness, and probabilistic models. It's pretty wild.

And they worked with the actual authors of the papers to create these rubrics manually, to capture all the things that would be involved in a successful replication. Then they evaluated whether or not the replication was completed successfully using an LLM-based judge. But they did check to see how good that judge is, and it turns out, when you compare it to a human judge doing the evaluation,

they get an F1 score of 0.83, which is pretty damn good. So these are at least reasonable proxies for how a human would score these things. It turns out Claude 3.5 Sonnet (new), with a very simple agentic scaffold, gets a score of 21%. So over one fifth of papers successfully replicated by Claude 3.5 Sonnet (new). That's pretty wild.

And then, anyway, they get into subsets that are potentially a little more cherry-picked, that happen to show o1 doing better, and all that stuff. But still a very interesting result, and I think this is where we're going. This tells you we're closing that kind of final feedback loop, heading toward a world where recursive self-improvement starts to look pretty damn plausible.

You know, 21% of exquisite, cutting-edge papers today can be replicated this way, with caveats galore, but those caveats start to melt away with scale and with more time spent optimizing agent scaffolds and so on. So I think this is just a really interesting sign of the times. Yeah, exactly. They do have a variant of the setup they call the iterative agent, basically letting the model do more work and making it not stop early.

They get up to 26% replication accuracy with o1 at a high compute cost, and they give it up to 36 hours in this case, and that gets you 26%. For reference, that's impressive, 'cause replication is not necessarily straightforward if you are just given the paper to read and not the code.

And to give you an idea, some of these papers are on all-in-one simulation-based inference, sample-specific masks for visual reprogramming-based prompting, test-time model adaptation with only forward passes, things like that. You know, the kind of research you see getting awards at AI conferences. Next we have a paper called Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains.

So reinforcement learning with verifiable rewards is one of the big reasons that DeepSeek and these recent reasoning models worked well. They were trained with reinforcement learning on math and coding, where you have these exact verifiers, right? You can know for sure whether what the model did was good or not. And so this paper is trying to essentially generalize that to diverse domains like medicine, chemistry, psychology, and economics.

And as a result, they are saying that you can get much better performance, and you can kind of generalize their approach. Yeah, it's sort of interesting, right? 'Cause there's this classic idea that, okay, you might be able to make a coding AI or a math AI that's really good by using these sorts of verifiable rewards and verifiers. But how do you hit the soft sciences? How do you make these things more effective at creative writing or things like that?

This is an attempt to do that. So they try a couple of different strategies. They try rule-based rewards: relatively simple yes-or-no rewards based on exact matches, you know, is a keyword contained in the answer. That's the rule-based binary reward. They also have rule-based soft rewards, where they use a similarity measure to roughly gauge whether the content of the output matches the right answer.
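To make those two rule-based reward styles concrete, here's a minimal sketch: a hard binary reward from an exact normalized match, and a soft reward from a generic string-similarity score. Both are illustrative stand-ins rather than the paper's actual reward functions.

```python
from difflib import SequenceMatcher

# Illustrative rule-based rewards for RL with verifiable rewards outside math/code.
# These are sketches, not the paper's exact implementations.

def binary_reward(model_answer: str, reference: str) -> float:
    """Hard rule-based reward: 1.0 only on an exact (normalized) match."""
    return 1.0 if model_answer.strip().lower() == reference.strip().lower() else 0.0

def soft_reward(model_answer: str, reference: str) -> float:
    """Soft rule-based reward: graded similarity between answer and reference,
    so partially correct free-text answers still get partial credit."""
    return SequenceMatcher(None, model_answer.lower(), reference.lower()).ratio()

print(binary_reward("aspirin inhibits COX enzymes", "Aspirin inhibits COX enzymes"))   # 1.0
print(binary_reward("it blocks COX", "Aspirin inhibits COX enzymes"))                  # 0.0
print(f"{soft_reward('it blocks COX enzymes', 'Aspirin inhibits COX enzymes'):.2f}")   # partial credit
```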

So they try those, and they find that they don't actually scale that well; you kind of saturate beyond a certain point. I think it was around 40,000 tokens or something, or sorry, 40,000 examples, where you just start to degrade performance on these non-quantitative tasks. And so they introduce another strategy, model-based rewards, and really this is what the paper is about fundamentally, or this is what the paper wants to be about.

So they use a distilled 7 billion parameter LLM to basically build this model-based verifier that issues model-based rewards. The way that works is they start by training a base model using reinforcement learning. They have some really large, highly performant LLM that they use as a judge, and they're gonna use that judge to give these very nuanced rewards, right? So the judge is a very big LLM, very expensive to run.

And it'll determine, okay, did the model, the smaller model that's being trained, do well at this task, right? And they're actually gonna train the smaller model on a combination of math and code and the kind of softer sciences, like econ, psych, bio, that sort of thing. So that's step one: they'll just use reinforcement learning rewards to do that, with the big model as the grader. After that, they're going to take a fresh base model, and they're going to use the model

they just trained using RL as the source of truth, essentially, as the source of text to evaluate. They'll provide correctness judgments from the big teacher model and essentially distill the big teacher model into the smaller model. They'll use about 160,000 distilled training samples from the data that they collected earlier in that training loop. Bottom line is, it's a giant distillation game, and it works well. The result is interesting.

It's just, I think, good as an example of the kind of thing you're forced to do if you wanna go off the beaten path and work not with quantitative data like math or code, where you can verify correctness: you can actually compile the code and run it, see if it works, or you can use a calculator or something to get the mathematical result. In this case, you're forced to use, you know, language model judges.

You're forced to find ways to, anyway, do distillations so that things aren't too expensive. Basically, that's the high-level idea. I don't think it's a breakthrough paper or anything, but it gives us some visibility into the kinds of things people are forced to do as we try to push LLMs, or agents, I should say, reasoning models, out of the math and code zone.

Yeah, it's a very applied paper, no sort of deep theoretical understanding, but it shows how to achieve good results, essentially, on a given problem. And they do also release a dataset and the trained reward model for researchers to build upon. And this is a pretty important problem, right? 'Cause that's how you're gonna keep improving reasoning models beyond just math and coding. So cool to see some progress here.

And speaking of reasoning models, the next paper is Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead. So this is, as it sounds, a reflection on the status of inference-time scaling, which, in case you're not aware of the term, is where you try to get better results from your model just from more compute after training. You're not doing any more training, but you're still performing better.

And that has been a hallmark of things like DeepSeek's R1 and o1 and other models. Here, this paper is evaluating nine foundation models on various tasks, also introducing two new benchmarks of hard problems to assess model performance. And they basically have various analyses and results that showcase that in some cases, basic models that aren't reasoning models are able to do just as well; in other cases, reasoning models do better; and in some cases, high token usage,

so a lot of compute, does not necessarily correlate with higher accuracy across different models. So in general, right, we are at an early-ish stage and there's some confusion; there's a lot of mess to get through, and this paper does a lot of the work of showcasing where we are. Yeah. And this is a point that I think often gets lost in the high-level discussion about inference-time compute. I think it's pretty clear for anybody who's steeped in the space.

But inference-time compute is not fungible in the same way that training-time compute is. What I mean by that is, you know, at training time, it's not actually this simple, but very roughly, you're doing text autocomplete and then backpropagating off the result of that, to first order. At inference time, things can differ a lot more, right?

Like, you kind of have a choice. One way to spend your inference-time compute is to spend it on generating independent parallel generations, right? So you sample a whole bunch of different answers from the same model, usually at a high temperature, and you get a whole bunch of potential responses, and then you have some kind of way of aggregating the result using some sort of operation.

Like, you know, you might take the average of those outputs, or a majority vote, or some measure of the best outcome, right? So you might use those techniques. That's one way to spend your inference-time compute: just generate a whole bunch of potential outputs and then pick from among them. The other is you have one stream of thought, if you will, and you have a critic model that goes in and criticizes the stream of thought as you go, in sequence.

So you're imagining one more nuanced chain of thought that you're investing more into, versus a whole bunch of relatively cheap, in compute terms, parallel generations. These are two fundamentally different approaches, and there are many more, but these are the two core ones that are explored here. And there are a whole bunch of different configurations of each of those approaches.
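Here's a minimal sketch of the first strategy, parallel sampling plus an aggregation operator (majority vote in this case). The `sample_answer` function is a hypothetical stand-in for drawing one full high-temperature generation from a model; the sequential critic-over-one-chain approach is the alternative and isn't shown here.

```python
import random
from collections import Counter

# Sketch of spending inference-time compute on parallel generations plus an
# aggregation operator (majority vote). `sample_answer` is a hypothetical
# stand-in for sampling one full chain of thought from a model at high temperature.

def sample_answer(question: str) -> str:
    # Placeholder: pretend the model lands on the right answer ~60% of the time.
    return "42" if random.random() < 0.6 else random.choice(["41", "43", "7"])

def majority_vote(question: str, n_samples: int = 16) -> str:
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, _ = Counter(answers).most_common(1)[0]
    print(f"{n_samples} samples -> {dict(Counter(answers))}, picking '{winner}'")
    return winner

random.seed(0)
majority_vote("What is 6 * 7?")
```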

And so that's really what this paper is about: saying, okay, well, if we spend our inference-time compute in different ways, how does that play out? Anyway, it's actually quite an interesting paper. It helps to resolve some of the questions around, you know, what's the best way to spend this compute. There's apparently very high variability in token use, even across models with similar accuracies, right?

So if you take a given model, you know, that tends to get, I don't know, 70% on GPQA or something, what you'll find is it won't consistently use an average of, I dunno, a thousand reasoning tokens to respond, or 10,000, or a hundred thousand. There's a lot of variability. And what that strongly suggests is that there's a lot of room left for improving token efficiency, right? So you've definitely got some highly performant models that are overusing tokens.

They're cranking out too many tokens, spending too much inference-time compute, and they could be optimized in other ways. Apparently that's even a problem with respect to the same model. So the same model can yield highly variable token usage for a given level of performance, which is kind of interesting. They also highlight that, quite often, although it's true that inference-time scaling does work,

as you increase inference-time scaling, in other words, you increase the number of tokens you're using to get your answer, performance does tend to rise. But if you see a given model pumping out a whole crap ton of tokens, it can also be an indication that the model's getting stuck. And so this tends to be a kind of black pit of token generation and inference-time compute spend, where the thing is just sort of going in circles.

And so there are a whole bunch of interesting findings around that. They do find that continued scaling with perfect verifiers consistently improves performance, and this is both for reasoning models and sort of conventional, or base, models. Which is interesting, right? 'Cause that means if you do have a reliable verifier that can provide you feedback on the task, truly, inference-time scaling does work.

But as we've just seen, for a lot of tasks you don't necessarily have these robust, always-correct verifiers, and that's a challenge. And so I think there's a lot of really interesting stuff to dig into here if you're interested in token efficiency, where inference-time scaling is going, and all that. And some cool curves, by the way. The last thing I'll say is there are interesting curves that show the distribution of token use for different models, so kind of comparing

Claude 3.7 Sonnet to o1 on a math benchmark and seeing what the distribution of tokens used looks like. And it's sort of interesting to see how that distribution shifts between those different models, like which models tend to, you know, use fewer tokens when they get things wrong versus when they get things right, for example. So anyway, keeping time in mind, that's probably all we have time to go into here.

Lots of figures and numbers and interesting observations that you can get from this paper, for sure. And we have just one more paper to go over. It's titled Overtrained Language Models Are Harder to Fine-Tune. And there you go, that's the conclusion. Overtraining is when you're pre-training, right?

When you do the first basic training pass of autocomplete, it has been observed that, you know, there's a theoretical amount of training that is compute-optimal, but you can actually do better by overtraining, going beyond that, and the general common wisdom is that

What this paper is showing is there is an idea called catastrophic overtraining, where when you do too much pre-training and then you do, what's called post-training or instruction tuning can adapt your model to a specific task or make it behave in a certain way that you don't get from autocomplete that actually makes it perform worse. So quite an important, I think result here. Like the, to quantify it, they try.

the instruction-tuned 1 billion parameter model OLMo-1B. It was pre-trained on 3 trillion tokens, and what they find is that pre-training it on 3 trillion tokens leads to 2% worse performance on a whole bunch of fine-tuning LLM benchmarks than if you pre-trained it on 2.3 trillion tokens. That's really interesting. So this is roughly 30% more pre-training tokens.

And then downstream, so after that, you take those two different models and you fine-tune them, and you get 2% worse performance on the model that was pre-trained with more tokens. The mechanism behind this is really interesting, or at least would have to be really interesting; it's a little bit ambiguous in the paper itself. So they highlight this idea of progressive sensitivity.

The way they measure this is basically: if you apply modifications of equal magnitude, models that have undergone pre-training with more tokens exhibit greater forgetting. They're more likely to forget the original capabilities that they had. This is something we've seen before, right? When you fine-tune a model, it will forget some of its other capabilities that aren't related to the fine-tuning you've just done.
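To make "modifications of equal magnitude" concrete, here's a toy sketch of that kind of sensitivity probe: perturb a checkpoint's weights by a fixed relative magnitude and measure how much its loss on pre-training-style data degrades. In the paper's finding, checkpoints pre-trained on more tokens degrade more under the same-size modification; the code below only illustrates the measurement, with a random toy model standing in for a real checkpoint.

```python
import copy
import torch
import torch.nn as nn

# Toy sketch of a progressive-sensitivity probe: apply an equal-magnitude random
# perturbation to a checkpoint and see how much its loss on reference data degrades.
# A random toy model stands in for a real pre-trained checkpoint here.

def sensitivity(model: nn.Module, inputs, targets, magnitude: float = 0.05) -> float:
    loss_fn = nn.MSELoss()
    base_loss = loss_fn(model(inputs), targets).item()

    perturbed = copy.deepcopy(model)
    with torch.no_grad():
        for p in perturbed.parameters():
            noise = torch.randn_like(p)
            # Perturbation with a fixed magnitude relative to each weight tensor's norm.
            p.add_(magnitude * noise / noise.norm() * p.norm())

    new_loss = loss_fn(perturbed(inputs), targets).item()
    return new_loss - base_loss  # "forgetting": how much the loss got worse

torch.manual_seed(0)
checkpoint = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(64, 8), torch.randn(64, 1)
print(f"Loss increase under equal-magnitude perturbation: {sensitivity(checkpoint, x, y):.4f}")
```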

So this is presumably suggesting that the pre-trained model, if you pre-train it on a crap ton of tokens, just becomes, I guess, more fragile. It's maybe more generally capable out of the gate, but the moment you start to fine-tune it, that structure is, in a weird way, almost unregularized. It's almost like it's overfit, I wanna say, to the pre-training distribution.

And they do say, actually, to that point, that regularization during fine-tuning can delay the onset, albeit at the cost of downstream performance. So to me this suggests there is a kind of regularization thing going on here, where, yeah, if you just, like, undertrain them, or not undertrain, but if you pre-train on fewer tokens, the overall model is less overfit.

Again, you're not necessarily doing multiple epochs, so you're not passing over the same data, so you're not overfitting to the specific data, but potentially to the distribution of training data. And I think that's maybe the thing that's going on here, though it's not like I saw this discussed in the paper from that angle, at that depth. I might have just missed it. Either way, very interesting

It's a result with huge implications for pre-training, right? Which is a massive, massive source of CapEx spend. Exactly, yeah. They primarily demonstrate some of these phenomena empirically in this paper and showcase that this is happening, with some theoretical analysis as well. But as you say, we don't have an exact understanding of why it happens. It's an interesting phenomenon that has to be researched in more depth. On to policy and safety.

We begin with "Taking a Responsible Path to AGI," which is a paper presenting an approach for exactly that. This is from DeepMind, and I guess it's a pretty fair, general idea. They have previously introduced this idea of levels of AGI, which is sort of trying to define AGI as a set of levels of being able to automate a little bit of human labor, all of human labor, et cetera. And so they expand on that.

They are also emphasizing the need to proactively have safety and security measures in place to prevent misuse. And it just generally goes over a whole lot of stuff, so I'll let you take over, Jeremie, and highlight what you thought was interesting. Yeah, no, I mean, generally nothing too shocking here other than the fact that they're saying it out loud, taking seriously the idea of loss of control.

So one notable thing about the blog post, not the long report itself, but just the blog post, is that although in the report they look at four different risk categories, misuse, mistakes (sort of traditional AI accidents, you could say), structural risks, and misalignment, the blog post itself is basically just about their thoughts on misalignment. There's some misuse stuff in there, but it's a blog post about loss of control. That's what it is.

It's clear that that's what DeepMind wants to signal to the world. Like, hey, yes, everyone seems aligned on this other stuff, but guys, guys, can we please, this is really important. And so anyway, there's a lot of interesting stuff in there; if you're familiar with the Google DeepMind research agenda on alignment, a lot of it will be familiar.

So debate: let's keep superintelligent AI honest by getting systems to debate each other, we'll use an AI judge, and hopefully, if a superintelligence is being dishonest, that dishonesty will be detectable by a trusted judge (a toy sketch of the idea is below). There are all kinds of challenges with that that we won't get into right now, and there's also a lot in there about interpretability, sort of similar to some of the Anthropic research agenda, and all that.
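For intuition, here is a toy sketch of what a debate setup can look like. This is purely hypothetical and not DeepMind's actual protocol or API: the `ask(model, prompt)` helper, the debater and judge handles, the round count, and the prompts are all made up for illustration.

```python
# Toy sketch of the debate idea: two models argue opposite sides over a few
# rounds, and a trusted judge reads the transcript and picks a side.
def debate(question, debater_a, debater_b, judge, ask, rounds=3):
    """Run a simple debate and return the judge's verdict."""
    transcript = f"Question: {question}\n"
    for r in range(1, rounds + 1):
        transcript += f"A (round {r}): " + ask(
            debater_a, transcript + "\nArgue the answer is YES, citing evidence."
        ) + "\n"
        transcript += f"B (round {r}): " + ask(
            debater_b, transcript + "\nArgue the answer is NO, rebutting A."
        ) + "\n"
    # The hope: a lie is harder to defend through rounds of cross-examination
    # than the truth, so even a weaker judge can spot the dishonest side.
    return ask(judge, transcript + "\nWhich debater was honest and correct, A or B?")
```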

They also flag the MONA paper; this is something that we covered previously, so you can check out our discussion on MONA. But basically it's a performance-alignment trade-off option where you essentially force the model to reason over short timelines so it can't go too far off the deep end and do really dangerously clever stuff, and you can try to ensure that it's more understandable to humans. So anyway, I thought that was interesting.

One thing that for me personally was interesting is there was a note on the security side of alignment. They say a key approach is to treat the model similarly to an untrusted insider, motivating mitigations like access control, anomaly detection, logging, and auditing. You know, one really important thing the labs need to improve is just security generally, not even just from a loss-of-control standpoint, but against nation-state activities.

And so I think a lot of these things are converging on security in a really interesting way. So there you have it. A lot more to say there, but we'll keep it to time, right? Yeah. The paper itself that they published is 108 pages long. It's basically a big overview of the entire set of risks and approaches to preventing them. And they also mention in the blog post that they're partnering with various groups that also focus on this.

They have the AGI Safety Council, they work with the Frontier Model Forum, and they published a course, the Google DeepMind AGI Safety Course, on YouTube, which is interesting. So yeah, clearly at least some people within DeepMind are very safety-focused, it seems. The next article is "This AI Forecast Predicts Storms Ahead." This is an article covering, I suppose, a report or an essay, I don't know what you'd call it, called AI 2027.

This was co-written by some, I suppose, significant intellectuals, and it highlights a scenario of why you should worry about AI safety and how this stuff might play out. So I don't want to go into too much depth, 'cause this is quite a detailed report of theirs, and we probably just can't really dig into it, 'cause it is a detailed, fictional scenario.

But if you're interested in getting an idea of the ways people are thinking about safety and why people think it's very important to be concerned about it, I think this is a pretty good read on that. Yeah. It's about this write-up, as you said, called AI 2027. It's by a really interesting group of people. Daniel Kokotajlo is one of the more well-known ones. There's Scott Alexander from Slate Star Codex, or Astral Codex Ten.

They're kind of famous in the AI alignment universe, the very niche ecosystem of AI alignment and all that. Daniel Kokotajlo is famous for basically telling OpenAI that he wouldn't sign their really predatory, it must be said, non-disparagement clauses that they were trying to force employees to sign before leaving. Daniel is also known for having made really spot-on predictions for the state of AI going back, you know, three or four years, right?

So back then he basically said, here's where we'll be in, like, 2026 or so. And you should take a look at what he wrote up; it's pretty remarkable, kind of on the nose. So here he is essentially predicting that we hit superintelligence by 2027. I mean, I've had conversations with him quite frequently and he's been pretty consistent on this.

I think one of the only reasons he didn't highlight the 2027 superintelligence possibility earlier was that things just get really hard to model out. So they tried to ground this as much as they could in concrete experimental data and theoretical results we have today, and to map out what the government's response might be, what the private sector's response might be, how CapEx might flow.

You know, I have some quibbles at the margins on the national security side and in the sort of China picture of this, but that's not what they were really trying to get down pat. I think this is a really great effort to make the rubber meet the road and create a concrete scenario that makes clear predictions, and we'll be able to look back on this and say, did they get it right?

And if they do get things right over the next, you know, 12 months, I think we have some important questions to ask ourselves about how seriously we should then take the actual 2027 predictions. Anyway, it's quite an interesting read, and again, it's really interesting because Daniel Kokotajlo himself is a former OpenAI employee, so he's familiar with how people at OpenAI talk about this.

And this certainly is consistent with conversations I've had with people at OpenAI, Anthropic, DeepMind, all these guys. Yeah, 2027 seems pretty plausible to me at least. Yeah, and I would say I tend to agree that 2027 is plausible, at least for some definitions of superintelligence. For example, they are saying there would be a superintelligent coder that can effectively be a software engineer on par with good software engineers at Google. Pretty plausible to me.

So worth a read; it's a very well done sort of story, you could say, of what might happen. Another story about OpenAI: next we have an article titled "The Secrets and Misdirection Behind Sam Altman's Firing From OpenAI." There's a new book coming out with some of these kind of lurid details, and this article presents some of them, a lot of stuff that's been mentioned already about tensions between the board and Altman.

Just more specifics as to actual, seemingly, lies that were told, and sort of toxic patterns that led to this happening. Pretty detailed article; if you want to know all the details, go ahead and check it out. Yeah, I thought this was actually quite interesting because of the specifics. At the time, the board, the OpenAI nonprofit board, fired Sam Altman and kind of refused to give us a sense of their actual reasoning, right?

They famously said he was being fired for not being "consistently candid." At the time, I think we covered this, and I distinctly remember saying on the podcast and elsewhere that unless the board came out with a very clear reason, the obvious result would be that people are going to be confused and Sam Altman is going to have all the leverage, because you've just fired the guy who created $80 billion or so of market cap at the time and made these people's careers, and you don't have an explanation for them. It was pretty clear that there actually was tension behind the scenes. Anyway, if you're in the space and plugged in and you know people in the orbit, you were probably hearing these stories too.

There are a lot of concrete things that are serious, serious issues, right? So in particular, Sam Altman allegedly claiming that there were big model releases and enhancements to GPT-4 that had been approved by the joint safety board, when in fact that does not seem to have happened; that would have been an outright lie to the board. All kinds of things around a release in India of an instance of GPT-4 that had not been approved.

Again, the claim from Sam Altman being that it had been approved. Oh yeah, and then the OpenAI Startup Fund, which Sam Altman did not reveal to the board that he essentially owned, or was a major equity holder in, and was sort of managing, while at the same time claiming some level of detachment from it. The board found out by accident that he was actually running it; they thought it was being managed by OpenAI.

And so, you know, this again, over and over again. It does seem like Ilya Sutskever and Mira Murati were the ones driving this behind the scenes, so it wasn't even Helen Toner or any of the other board members. It was Mira and Ilya who were like, hey, we're seeing patterns of toxic behavior, and not toxic in the, you know, political sense, right? But just, this is a dude who is straight-up lying to people, and on very significant and substantive things.

And so yeah, apparently they were concerned that the board was only in touch with some of the people affected by this stuff very sporadically. This is consistent with something I've heard, which we'll be reporting on fairly soon, actually: criticism from former OpenAI researchers about the function of the board as being essentially to pretend to oversee the company. So this is really, really challenging, right?

When you're leaning on the board to make the case that you're doing things in a responsible, secure way and preventing, among other things, nation-state actors from doing bad things. So anyway, lots of stuff to say there. The take-home for me was: I am now utterly confused, given the actual strength of these allegations and the evidence, why it took the board so long to come out and say anything. Like, they actually had cards, if this is true. They actually had cards to play.

They had screenshots. Yeah, exactly right, like Mira Murati's freaking Slack chats. Like, what are you doing, guys? Seriously. If your goal was to change out the leadership, great job giving Sam Altman all the leverage. Great job creating a situation where Satya could come in and offer Sam a job at Microsoft, and that gave Sam the leverage. I mean, I can't really... I mean, that's what it seemed like at the time.

The board went about this... clearly they were secretive because they didn't want Altman to be aware of this whole conversation, but they went about it in a very confusing, perhaps confused, fashion, and this just backs that up, basically. Yeah. If this is true, again, if this is true, then this was some pretty amateurish stuff.

And, I mean, it is consistent with some of the effective altruist vibe that does go on there, where everyone's so risk-averse and trying not to make big moves and this and that, and unfortunately that just seems to be culturally in the water. So yeah, I mean, sorry, there's no such thing as a low-risk firing of the CEO of the largest privately held company in the world. You know, big things are gonna happen.

So anyway, I thought it was a fascinating read and something for the history books for sure. And with that, we are gonna go ahead and close out this episode. Thank you for listening, and thank you, hopefully, for trying out the demo that we have launched at astrocade.com. Yeah. And as always, we appreciate you sharing, providing feedback, giving your views, but more than anything, just continuing to listen. So please do keep tuning in.
