
#215 - Runway games, Meta Superintelligence, ERNIE 4.5, Adaptive Tree Search

Jul 08, 2025 • 1 hr 56 min • Ep. 255

Episode description

Our 215th episode with a summary and discussion of last week's big AI news! Recorded on 07/04/2025

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Read our text newsletter and comment on the podcast at https://lastweekin.ai/.

In this episode:

  • Cloudflare's new AI data scraper blocking feature, its potential implications, and technical challenges
  • Meta's aggressive recruitment for its Superintelligence Labs division, highlighting key hires from OpenAI and other leaders in the field
  • Anthropic loses significant talent to Cursor, plus details on Anthropic's new Economic Futures program focusing on AI's impact on the labor market
  • Notable open-source AI model releases from Baidu and Tencent, including their performance metrics and potential applications

Timestamps + Links:

  • (00:00:11) Intro / Banter
  • (00:01:43) News Preview

Tools & Apps

Applications & Business

Projects & Open Source

Research & Advancements

Policy & Safety

Transcript

Intro / Banter

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. You can check out the episode description for the timestamps and links to all the stories. I am one of your hosts, Andrey Kurenkov. I studied AI in grad school and I now work at a generative AI startup. And what's up everybody, my name's Jeremie, the other co-host of the podcast.

Yeah, co-founder of Gladstone, ai, ai, national security stuff, and all that jazz and more. There's other jazz too. What a week we, I, I gotta say I did, so I did my prep one day earlier than usual this week. Mm-hmm. And there's some weeks where that's fine and nothing happens on the, you know, Thursday before the show, and then there's other weeks where it's just like a giant double middle finger. Fuck you. All the news happens. And, and maybe you missed a couple big stories earlier.

So this is what happened. I had to frantically catch up this week on the Baidu releases, the Tencent releases, all the, like a bunch of big stuff. And so, yeah, it just seems like a, it was gonna be not that big of a week until suddenly it was. And, and here we are so excited for this one for sure. Yeah, this is, uh, we have quite a few stories on the block and it's a week where there was no gigantic news, nothing that was like the talk of a town.

but there was a lot of stuff that happened that is worth talking about. So this will be a pretty decently sized episode. I expect,

News Preview

To give people a preview of what we'll be talking about: in Tools & Apps, there was actually no big release, unlike, I dunno, the last few months. You know, we had Gemini CLI, variations of Claude Code, et cetera. Nothing really big like that this week, but some interesting kind of smaller tools. In Applications & Business, we'll chat a bit more about Meta's superintelligence push, which was one of the big kind of fun, slightly dramatic things of the week.

and some news about Anthropic and, uh, as usual, hardware stuff. Projects & Open Source, as you mentioned, some pretty big stories there with, uh, ERNIE 4.5 and Hunyuan-A13B, and quite a few others. Like, we have five stories in there. Research & Advancements, some more scaling research and some more research on sort of the place where we are at with LLMs. Finally, Policy & Safety, quite a packed

section there as well, with some discussions of safety institutes, China, security risks, the state of AI for security. Lots of various stuff. So yeah, it'll be a pretty, pretty good episode.

Tools & Apps

And let's go ahead and get into it. So in tools and apps, first story is actually not a tool most people use, but if you're building a tool, you might be using it. So I thought we should cover it. CloudFlare has introduced default blocking of AI data scrapers. So this is a setting that allows websites to automatically block AI companies from scraping their data. And that would require website owners to explicitly grant access to bots for data collection. This is, uh, kind of a big deal.

CloudFlare is a very big company. Lots of websites go through. And so the defacto, I guess standard for websites up to now has still been to a large extent that, you know, unless you built it in yourself, AI companies will be able to eat up your data and they probably have, I guess that is about to change. Yeah, I think the, the ultimate question is always gonna be to what extent can you meaningfully determine what is AI traffic versus what is not. That's already challenging.

And then it's just gonna get harder as, as we get into, computer use. Right? You can really have, ultimately the inputs come from the very same channels as human computer usage in the limit, right? And, and even mimic you know, human delays on clicking on things, human-like movement. This is sort of all part of the, uh, the end of the capcha era, if you will, in every, every possible sense of the term, right?

So I think there's a, a question as to how long this will meaningfully be a constraint, especially given the massive economic incentives to scrape. But still, I like it's a, this is is a big move. It's precedent setting, and it's also, as you say, cloud CloudFlare is already like a massive player. So it is the, the default now for the internet in some sense.

Yeah, so phase transition a little bit here, but I, I think it will be temporary again, don't, I don't expect in the long run that we won't be able to have bots and in fact, you know, open AI or whoever else, if they're dedicated enough, we'll be able to get by this pretty easily. Right. And I think at this point, they've already scraped most of the internet, so we can, I think previously they would've had to have pretty cheap bots or, or simple bots, right? To just go through everything fast.

Now they can probably get around it, but, uh, if you're being sort of good and lawful, then you do have to kind of make some, uh, I think, specific request types with your bot. So technically speaking, if you are going by the rules, this should be able to block you. And yeah, as you said, Cloudflare, really big. Apparently Cloudflare's network handles approximately 20% of global internet traffic.
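For anyone curious what "going by the rules" means on the crawler side, here is a minimal sketch, assuming a well-behaved bot that declares a user agent and honors robots.txt before fetching; the GPTBot user-agent string and example.com URLs are just illustrative placeholders, and Cloudflare's actual blocking happens at the network level rather than relying on crawlers to self-police.

```python
from urllib import robotparser

# A rule-following crawler identifies itself and checks robots.txt before fetching.
# (The user agent and URLs here are purely illustrative.)
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("GPTBot", "https://example.com/some-article"):
    print("robots.txt allows this fetch")
else:
    print("robots.txt disallows this fetch; a well-behaved bot stops here")
```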

So yeah, substantial. And moving on, next we have actually a fun tool that, uh, isn't out yet but is coming from Runway, the company that's focused on making it easier for people to edit videos with AI and generate videos, and lately is going to get into gaming. So they have announced the idea of letting people generate video games with AI. Apparently the plan is to release a new interactive gaming experience next week, so we'll see when it comes out.

Seems a lot like, uh, if you remember AI Dungeon, basically sort of an interactive story that generates text and images, and you go on a little kind of D&D-esque adventure where you try to go through some scenario. It's like a video game in a very loose sense. Uh, and so from the preview of what we've seen, it will be screenshots, uh, indicators like that, but more polished, which personally I think is pretty cool.

That's, uh, a fun use of AI that has maybe been, not entirely explored to the extent that it can be. Yeah. the context here too is so they, they've been known for moving into Hollywood and helping big studios put together movies cheaper. As, you know, as they say, like, like 40% faster, something like that.

They're contrasting though the speed at which gaming companies are moving to adopt this with the traditional kind of Hollywood movie industry as being a lot faster, which does make sense, right? I mean, you think about, the baggage that's inherited by the Hollywood kind of movie complex, right? There's a lot of stuff. Even just looking at the labor union side, you know, screen actors, guild, all that stuff just sort of slows your implementation ability in a lot of, a lot of ways.

Whereas in the gaming industry sort, that's less of an issue, right? You have a lot of indie gamers, for example, are gonna be very, very quick to pick this stuff up, and even big kind of AAA studios. So really interesting from just a, a pacing standpoint. just another note here, they have been in talks, uh, runway has apparently with meta about a possible acquisition there. It seems like, and this alluded to in this article, kind of as a non-sequitur, but it's still, worth flagging.

So they were talking about it and Valenzuela. So one of the co-founders of Runway says, I think we have more interesting intellectual challenges being independent and remaining independent for now. So seems like the talk of the meta acquisition now sort of falling through based on that which itself is interesting, right? Because for meta to start. Playing with runway.

You could see why they wanna do that, obviously for content generation, uh, especially given meta's recent challenges in AI and, and having, uh, actually sort of high-end generative capabilities. It could be an interesting acquisition target, but especially on the gaming side, starts to look a little bit in a way like Netflix's move into, into gaming. You know, that's kind of one angle you can start to imagine meta playing with, but that looks like it's not gonna happen.

So, uh, so interesting that fell through. At the same time, meta obviously developing its internal capacity for AI on the super intelligence side of things with all those acquisitions. So, uh, kind an interesting story about many different things. But runway definitely is an interesting company to watch. Yeah, I'm, I'm a fan.

They've been around for quite a while in the landscape of AI startups, and have focused very much on sort of the professional side of tooling rather than the cool side of generative models, for the most part. They do have their own generative models, which are not frontier, that aren't quite as good, but they focus a lot on kind of deep integration to make them usable as part of a more sort of editing tool suite kind of setup. So, curious to see what this is. Sadly, I don't have access to this yet.

Uh, I'm seeing you don't have access to game worlds yet. want to try it out when I can. Yeah. And one, one last comment too on, um. on the Frontier model side with companies like Runway, and you're talking about the, the tooling as being the thing they focused on historically. Less so. The models, you know, there's this, uh, famous story of Microsoft in the early days as they're deciding which way they're gonna go strategically as a company, they end up going in the direction of.

Obviously, as we know, making software that's really expensive in high margin on the basis that there's a bunch of companies that can make laptops like, or, or computers, right? So the, the famous phrase here is commoditize your complement. So software is complimentary to hardware, and if there's a bunch of different companies making hardware, then the price of that hardware gets driven down. Not to zero, but it gets driven down. The margins get driven down to zero.

So you've got a bunch of people competing really hard to make really cheap hardware, but if you own the integration point of that stack, which is the software, the operating system layer, the application layer, Microsoft does, your margins can crush it, right? So this in a sense, is what a lot of companies are trying to figure out in the AI landscape.

You've got all this competition at the model layer, a million companies making AI models, especially computer vision, you know, especially video generation, that sort of thing. But where is the value bottleneck, right? Where is that aggregation point? What is the compliment? To the commoditized models, and maybe it's the tool chains, maybe it's some kind of home base for user accounts or something like that.

It's really as yet kind of unclear and nascent. Obviously hardware is one chokepoint in the value chain. But this is all sort of part of it; you can think of Runway's strategy there as being, okay, let's not compete with everybody else on the thing that's already commoditized. Let's try to focus on an area that's less touched and that may become a chokepoint in the value chain.

That's the only way you can really compete if you're not OpenAI, if you're not Anthropic, right? You need either that scale or favorable kind of value aggregation points in the chain. Moving on, a couple quicker stories. First up, we have Google embraces AI in the classroom with new Gemini tools for educators, chatbots and students. So they've introduced 30 new AI tools for educators at, apparently, a tech conference.

And that includes a version of the Gemini app tailored for education. The Gemini AI suite is now available for free for all Google Workspace for Education accounts, and that has various features like lesson plan generation, personalized content creation. Teachers can create custom AI experts called Gems, which you can also do in Gemini, and so on. Yeah, it's a pretty, sort of, customized, wrapped version of their tooling.

If you go and see, they have actually a little kind of special UI for classrooms in this suite of theirs. And with Gemini there is a ton of sort of prebuilt things like, uh, generating a lesson plan, a quiz, brainstorming project ideas, lots of sort of suggestions for use cases. And this is following up, I believe, uh, OpenAI and Anthropic, which also have EDU versions of their services. Kind of notable in a sense because that's one of the big use cases.

Students are using this stuff like crazy, teachers also using it quite a bit, from what I know, in terms of grading and preparation and so on. They certainly need the help in terms of managing the workload. So having more and more sort of native-for-education versions of these things is significant, because I do think, you know, the educational sector needs to figure out what to do with LLMs. How do you change education now that

You know, homework is, let's say much easier to do than it used to be. Yes, that's right. Yeah. It's, it's the, the age old question, right? Are LLMs calculators or, or are they Yeah. Something that actually atrophies, brain function and that sort of thing. But yeah, one of the, I think one of the interesting questions is when do we really move from, you know, the frame here is Google AI tools for educators. When do we move from that frame to Google AI tools as educators?

Because we are on that continuum. Uh, you know, speaking from painful experience doing my physics degrees, like, at least in physics, and Andrey, you can tell me how this is in CS, but like, profs are like terrible teachers on average, right? You'll have like, you know, two or three profs who knock your socks off, and then the rest of them are just garbage. Like, you know, they wanna do research. They're not there to teach, really. It's not their passion.

It's not what they're great at. And so they're kind of competing against a suite of products that has been increasingly optimized to do a better job than they can. And at a certain point, I wonder if, you know, teachers, professors, all the stuff, there's a transient where they basically are just Sherpas guiding you towards the best kind of generative AI tools that they're aware of to get you started over time. Even that function obviously gets automated away.

but I would expect it'll come with a lot of resistance, especially from, you know, on, on the teacher end because, well, I mean, you've got unions, you've got entrenched interests, nobody wants to be replaced.

But that transition point is gonna be really interesting, because the people who are by and large doing the measuring of the effectiveness of these tools, right, who are writing those educational theory papers or whatever, are people in the system who have an incentive to potentially, eventually, pretend that these tools can't automate quite as much as they can.

So I actually think this is gonna be an interesting point of friction as we look at how can regulation, how can entrenched interests, slow adoption where really you would want more adoption faster. Anyway. My 2 cents, but there it is. Yeah. I, I wouldn't say most professors are terrible, but, uh, it's, I'll tell you, in physics, in physics, it feels that way.

Yeah. I think in CS, because their classes are super popular, typically, maybe there's more, and there's, you know, many TAs and students who are very motivated to do a lot of work, right? But professors are typically overloaded and have very limited bandwidth. And that's, I think, true for a lot of teachers across, absolutely, education in general. So these tools that are specifically aimed at educators could hopefully help with that.

And one of the things noted in the article is that this is coming alongside updates to manage Chromebooks, more on the side of students, including, uh, a new teaching mode where teachers can, uh, directly interact with students. And I think that reminds me that, uh, Google has made some inroads on the device side with Chromebooks.

They are at least, to some extent, I don't know the comparison of, uh, Chromebook versus tablet, but pretty significant inroads in terms of what devices students get to use. So together this could be significant. And next, meetings. Not a new tool, but I think a fun discussion of where certain tools are being used. The title of the article is No one likes meetings. They're sending their AI note takers instead.

And this starts with a little fun anecdote of a person named Clifton Sellers going to a Zoom meeting where there were more AI note-taking bots than human participants. So this is, if you're like at a Zoom meeting, a Teams meeting, a Google Meet, et cetera, you can presumably often be seeing this now in a lot of major companies. There's many providers of these things, including Google Meet and Teams and Zoom themselves, plus some others.

So yeah, it's amusing to me as someone who works at a tiny startup where excessive meetings aren't really a thing. Like if you go to a meeting, you're gonna be doing some talking and some, uh, actual useful work. But, uh, having been at, or having known people who work at, big companies, and the general reputation they have, kind of not too surprising to see the trend. I've always sort of struggled with, obviously, you know, meetings are, uh, the enemy by default in startups.

Like people just naturally have an aversion to having meetings unless they're necessary. So I definitely, this is like a more foreign thing, but yeah, I mean, you know, it seems like something that could easily happen, and I wonder what the failure modes end up looking like when everybody, or most everybody, in these meetings is just an AI agent. But yeah. Mm-hmm. A couple more stories from Google, just there wasn't any major stuff, so it wound up being a little heavy.

Uh, on the Google side, we have a new app from 'em called Doppl, which uses AI to visualize how different outfits might look on you. This is an app that is being released on both iOS and Android in the US, and yeah, lets you try out outfits. You upload a full-body photo of yourself and you can use images or screenshots of outfits to see how they, uh, would look. This is generating actually both static images and AI-generated videos, which is, uh, kind of neat.

Yeah, something we've seen already in the AI space. I'm sure this has already been built by companies and so on for quite a while, but this is obviously better than what you might have gotten before, with the use of Veo and tools like, uh, Google Gemini editing. Nowadays you'll have a very, very good preview with AI. And the last story is somehow also Google's: Imagen 4. So this kind of sneaked under the radar.

Google has introduced their latest text-to-image model, Imagen 4, which is, you know, the follow-up to their main line of text-to-image generators. Uh, they had Imagen 3 for a while; now we have Imagen 4, and also Imagen 4 Ultra. And they say that this is better at handling, you know, very specific prompts.

Basically, what we've seen with AI image generation now, the focus is on prompt adherence and being very good at things like spatial layout, preserving text, all of these subtler details beyond just making realistic-looking stuff. And yeah, it seems like it should be a big deal, but nobody was really excited about it from what I could see. And it does sort of not introduce anything fundamentally new over Imagen 3 or anything else we've seen. It's just the latest and greatest.

Yeah. Again, with these image model, and, and this is strictly a function of the fact that, you know, I don't spend my time focusing on image generation in my work, right? This is like something, I'm, I'm following this more or less as a passive observer, but incremental advantage of these models over each other is something that feels pretty opaque to me.

Like, how is Imagen 4 better than Imagen 3 on things like, like writing, you know, what kinds of things could Imagen 4 write that Imagen 3 can't, is a bit fuzzy to me. I get the impression we're saturating this stuff. Surely we must, but who knows? Maybe at a certain point you wanna be able to, you know, put paragraphs and paragraphs on an image and have it be faithfully represented. I imagine that's gonna be the case.

You know, if you look at like, movie, content, like, you know, assets that you wanna put in a movie or something. don't know, uh, super high faithfulness, high resolution use cases, but uh mm-hmm The images that they show are beautiful, no question. They always are. So I'm kind of like, looks good. Yeah. We do have some impressive examples, uh, in the release.

So with Imagen 4 Ultra, we have a prompt like: a three-panel cosmic epic comic. Panel one, tiny Stardust ship in a nebula, radar shows anomaly, text "anomaly detected", hull text "Stardust", pilot whispers. Panel two, a Leviathan emerges, console red text "warning". Panel three, the Leviathan chases the ship through asteroids, console text "shield critical", screen text, pilot screams, SFX.

Anyway, and it's, you know, it does all that very they faithfully, there's some quirks in the rendering that you might say are not quite right, but way, way, way beyond what you've would've been able to do previously with, and, things where you involve composition and and placement and so on.

Applications & Business

And moving on to applications and business. First story is about Meta and Mark Zuckerberg's Drive to set up the Meta Super Intelligence Labs division. So we've known about this some time. We've been covering some of the stories. I believe we talked last week about some of the hires that have been announced in terms of people from OpenAI. But this week, it kind of got formally kicked off. Uh, Zuckerberg announced it internally to Meta, and a bunch more people, have been, announced to have joined.

And in fact, I believe this is a new update too. In addition to Alexandr Wang, former CEO of Scale AI, Nat Friedman, former CEO of GitHub, will also be joining to lead this division, which now has 11 AI-focused employees from Anthropic, Google DeepMind, and in particular OpenAI. We have like eight people from OpenAI from across the company, people who've worked on very significant things like o3-mini and GPT-4o and so on and so on.

There have been some leaks as to the pay packages, not quite as absurd as, uh, I believe, what Sam Altman was saying in terms of offering people a hundred mil upfront, but still high. I've heard Dylan Patel, right, was on a podcast recently talking about what he had heard was a $1 billion package that was pitched at somebody. I don't think they took it, but it was pitched at somebody.

And then I've seen other headlines about 300 million, like, so it seems like there's a range, and there are some people who have been offered more and some a lot less, which, you know, it's what you'd expect, but it, it's, uh, it seems very ambiguous at this point.

It's significant to the extent that, within OpenAI, Sam Altman sent a memo to the staff on Saturday and basically addressed this, saying that, yeah, Meta has been pretty aggressively recruiting senior researchers. And it was cast as, apparently a quote from the memo is, someone has broken into our home. And, uh, later, I think that was Mark Chen, if I recall. Like Mark Chen said, it feels like someone's broken into our home. Yeah, yeah, it is, from messages on Slack.

So yeah, it's clearly a big effort and a big investment on Meta's part. The stock hit an all-time high on these announcements, so I guess it's paying off from a stock perspective. I mean, you know, I think I alluded to this last episode, but I mean, what a repudiation of Yann LeCun's philosophy, which could not have been less OpenAI, right? And then all of a sudden, basically Zuck says, you know what, we're doing superintelligence. We're calling it that.

And also, we're gonna do it with like eight out of 11 or something of our hires being OpenAI guys. And then Alex Wang, like one of the few non-OpenAI guys, is like literally from a company called Scale AI, but who also has the, uh, inclination towards the superintelligence perspective that deviates quite significantly from LeCun. So it's quite an interesting situation. I'm very curious where LeCun ends up.

He's been interestingly silent this whole time that's kind of noteworthy for a guy who's usually bombastic on social media. Yeah, like, you know, obviously my, my biases are well known on this podcast, but, it does feel like it's, it's meta sort of coming around to that view after spending so long in the woods. But, but this is a really interesting series of, of acquisitions of personnel, right? So, what you have to do if you're meta is you have to.

Shake up the, the game board in a serious way, right? People were just not interested in joining a fast follower open source lab or even a slow follower open source lab as it started to look like. And so, you know, you gotta hire the best, get people jazzed about working at Meta again. And that means you need to refound the company. They were refound the AI part of the company and make it clear that you have top cover. The pitch now is gonna be very compelling actually, right?

You look at the caliber of the people who've joined the fact that Zuck owns the majority voting shares, for meta so he can make unilateral decisions that other people just can't. That's a really interesting pitch, And so, you know, you've got this massive compute fleet, you've got a lot of data. 'cause you're meta, you've got zucks backing and now this, this really interesting team behind you.

The big open question remains, what is Meta's position on alignment, on technical safety, on security, right? I mean, again, LeCun's is well known, fairly dismissive of it, with some asterisks here and there, and, you know, there's nuance. But Alex Wang, you know, certainly Safe Superintelligence, Daniel Gross coming over from there.

These are, and you know, one Anthropic hire, I believe, uh, these are places, including DeepMind, that have historically oriented towards more of an alignment-friendly perspective. I'm personally really curious what we'll hear in the coming weeks and months about their position on this. But they are a live player. I would, you know, call 'em a tier three player for now, along with, you know, SSI and, to some degree, xAI, with, you know, maybe Anthropic I would put at number one right now.

And OpenAI number two, something like that. You could debate those all day, but it seems like they've made themselves a live proposition with this play. That's at least how it reads to me. Mm-hmm. Yeah. No, and LeCun is interesting. Now, as far as I know, Meta will still have FAIR, their internal AI research division that publishes quite a lot of research.

And, to be fair, they've published a lot of very significant research on LLMs. They're not sort of refusing LLMs, but Yann LeCun has pretty famously been arguing that LLMs will not lead to AGI or ASI, and some other techniques beyond that are needed.

So this will live alongside the existing research group and presumably be less focused on publication and traditional academia, and more focused on going head to head with OpenAI and Anthropic in terms of building out the tech, and probably not doing as much research on the alignment side as Anthropic, for instance, has been doing.

And on the compensation side, yeah, reports are saying not so much upfront offers, but you see offers upwards of a hundred million in the first year, 300 million over the span of four years. Yeah. And there's of course nuances there of like stock versus cash and so on. But some crazy big numbers, and I think numbers that they probably need to make for people to leave OpenAI, Anthropic, and DeepMind.

Because at the end of the day, from what I know of people, who go to these social companies, typically they're a bit more startup oriented. They're more interested in smaller companies. They're sometimes they have some background at Google, for instance. But even DeepMind is a bit different from Google itself, right? As being more of a research division up until recently.

So, in some sense, not too surprising they had to go all out to convince people to flip from DeepMind, Anthropic, and OpenAI to go to Meta. And next up, actually, let's talk about Anthropic. We've got some news that they've lost a major pool of talent, or a couple major leadership positions of talent. Uh, and those went to Anysphere, the makers of Cursor.

So Cursor has hired Boris Cherny, who led development of Claude Code, and he'll be joining as chief architect and head of engineering starting this week. And Cat Wu, the product manager for Claude Code at Anthropic, is joining as the head of product at Anysphere. For reference, if you don't know, Cursor is the leading, or maybe not the leading, I'm not sure how the market stacks up against VS Code, but at least among people who use AI heavily, Cursor is seen as the tool for AI for coding.

The development environment that is leading the pack in terms of quality, and they have grown explosively; the valuation is crazy. Claude Code did disrupt that to some extent. I think certainly, as someone who has used Cursor, once I adopted Claude Code I find myself using it a lot less and even moving back to VS Code. So it kind of makes a lot of sense for 'em to make this aggressive move. The article also notes that this is happening as Anthropic's revenue is hitting 4 billion annualized, which is

quite a lot. Yes. You heard it here first, guys: 4 billion is quite a lot. Sometimes I gotta put in that commentary, you know, that's really deep, insightful stuff. You know, I feel like we've gotten so desensitized to hearing numbers like billions and gigawatts and all this stuff. It happened pretty fast, didn't it? Over the last few years. Yeah. And that's revenue too, right? We're not talking about valuation. Of course.

The big question has been, historically, can these companies live up to the hype? Can they actually generate revenue? Anthropic's done really impressively on that front. I think their last fundraise had them at a $60 billion valuation. I believe they're currently fundraising now, so we'll see what, uh, you know, that valuation looks like. But they're also, for context, I vaguely remember they're burning net, uh, something like three to 5 billion or so.

So, you know, there's still a lot of, a lot of burn going on for all of CapEx and, and, and all the talent as well. But on the talent side, yeah, this is interesting 'cause Anthropic has been by far and away the big winner. And we, we covered this, I think a few episodes back on the recruitment or talent wars, right? Like something like a, an eight to one ratio of people leaving open AI to anthropic versus people leaving anthropic for open ai.

So far, far more people moving towards anthropic than away from it. Similar, in fact, even more extreme numbers, relative to DeepMind. And so it's rare that you see these big poachings from Anthropic. And, and part of that, presumably too is that philanthropic does have a kind of it's clear what they stand for in terms of the, the safety side, the security side, and all that. And so people are not necessarily just working there for money.

At least that's that's what I've seen quite clearly talking to folks at all these different labs. But this is, this is interesting, right? So you've got cursor moving in on the, on this talent. Obviously all of Cursor's chips here are on the code generation stuff, so they can afford to kind of plow more money into these poachings.

And when we talk about, you know, Cursor hitting $500 million in ARR compared to Anthropic's 4 billion, that may sound smaller, but you know, it's over 10% of what Anthropic's making. So these are like, you know, they're in the kind of, maybe not quite the same orbit, but approaching that same orbit. So, in any case, we'll see what ends up happening here.

But the Cursor-Anthropic thing, you can sense the nerves, the anxiety about, you know, maybe upsetting Anthropic, that's coming from Anysphere, which is the company that owns Cursor. So one of their co-founders, Sualeh Asif, I'm probably butchering that, did refer to Anthropic as quote one of our closest partners, in the context of this article. So you can kind of see it's like how nobody wants to piss off Jensen right now, or Nvidia, you know, sort of similar thing.

There's a lot of dependency here. And these are dicey moves. Yeah, exactly. If you use Cursor, the tool, you know, it allows you to use any of the LLMs really, but within it, uh, you can choose which LLM you want to use, Gemini or Claude and so on. And that, I think, is a big reason why you wouldn't wanna upset this relationship. Uh, it is pretty important for

Anysphere and Cursor to remain friendly with the providers of frontier LLMs. At the same time, it does feel like Claude Code, and generally the movement towards agentic tooling that even goes beyond the traditional development environment, with Claude Code being kind of a standalone thing outside of where you look at and write code, which is what Cursor is, has really kind of upped the amount to which agents and agentic AI can do a lot of coding without you manually supervising it.

So yeah, to me it's the surprising bit is not just that they hired these people, but they, gave them these roles of chief architect and head of engineering and head of product. I mean, that is big, right? Well, it again, you know, you got people who work for stuff that's not necessarily just money and if you're gonna convince them to move over Yeah. They're, they're gonna need a sense that they're able to shape the direction of the company. Mm-hmm.

So this is part of the, you know, part of the pay package in a sense. And next story, also about Anthropic, actually kind of not directly related to any business updates, but related to the economy, I guess, in general. Anthropic has launched the Economic Futures Program to research AI's impact on the labor market and global economy. They want to provide, yeah, basically evidence-based insights into the economic effects of AI.

And this is coming after, in recent months, Anthropic CEO Dario Amodei has been talking about pretty dire consequences of AI, saying that AI could eliminate half of all entry-level white-collar jobs in the next, uh, one to five years, and that would push unemployment up to 20%. So not so surprising they are moving in this direction, I suppose. And, you know, we just talked about Claude Code and Cursor, definitely gonna be a hit on the job market from that. I think it's already here.

In fact, if you're an entry level engineer, if you're in the programming, uh, business and more junior, it's probably getting a lot harder to get a job. I had a conversation with somebody pretty senior up at, at one of the Frontier Labs who just straight up said, our expectation is that we will never hire another developer with less than 10 years of experience again. that's pretty amazing, right? Like, and, and obviously that's the Frontier Lab.

They have access to the internal models that they build all the best stuff. But if that's not a canary in a coal mine, like, I don't know what is right? Like the, even if you froze AI capabilities today I think it's quite credible that, yeah, you would see that kind of effect at a minimum propagate, uh, throughout the economy. And that's, that's a big issue, right? Those entry level roles are. Or how people jump to obviously mid-levels of, of seniority.

But, so this is a, a really a really interesting play from Anthropic. It's, it's a kind of big package of things. So they're looking at, uh, putting together this economic futures program that includes economic futures research awards, which are like five, uh, sorry, $50,000 grants for empirical research on economic impacts from ai, which again, I think is this really important space. Like the empirical side. There's a lot of theory happening right now.

You know, you got Epoch AI coming out with their plots, and then you've got, uh, AI 2027 coming out with their plots, and people are comparing plots, and, you know, it's semi evidence-based. But then also, you know, we need just more empirical research to ground these predictions and analysis in. They're also setting up symposia so they can bring people together to talk about stuff, and setting up, uh, strategic partnerships with research institutions.

So, finding all these ways to kind of, drum up more interest and, uh, resources pointed in this direction, which again, seems like, uh, something that's probably helpful if you believe anything like the 2027 timelines that, uh, a lot of people in the space do. And moving to some hardware stuff. First we have the story about OpenAI is saying it has no plan to use Google's in-house chip.

So apparently there have been recent reports saying that OpenAI had been considering using TPUs, had been testing TPUs, an alternative to GPUs, with, uh, OpenAI having invested heavily in Nvidia GPUs, as have most other companies. OpenAI is also developing its own chip to compete with the TPU. I guess this came out in the context of OpenAI having signed up for Google Cloud services, in a continued trend of splitting slightly from Microsoft.

We have been on Azure from early on as a primary source of compute. So yeah, significant, perhaps primarily because of, what we know about them trying to create an alternative to TPUs. And it's also, it's also interesting from a, a almost a historical standpoint, right? You've got two trends that are now intersecting. On one hand, we've got this trend of, OpenAI trying to shake itself loose from Microsoft more and more, right?

We've had a lot of stories that have covered this in different ways, but fundamentally, you know, Microsoft was supposed to be their cloud partner of choice now getting a, a bit of cold feet in terms of providing all the infrastructure support that is needed for things like Stargate. And so they've allowed OpenAI to work with Oracle, and allowed is the right word here, by the way. Microsoft does have right of first refusal on these, these, uh, uh, big kind of infrastructure buildup contracts.

And that's where, you know, Oracle and Crusoe and all these guys are coming together to do Stargate, and OpenAI is then looking for other partners outside Microsoft. At the same time, Google is now just starting to push out in the direction of third-party partnerships, trying to find ways to make their TPUs available to companies that are not Google. That historically was not a thing, right? Google was a lot more protective about access to their TPUs.

And so, these two, two trends are kind of intersecting each other and, and causing this, what otherwise would've been an almost unthinkable partnership. Because Google so famously is the home of Google DeepMind, which for so long was the one big rival to open ai. And so I guess time heals all wounds, and at least they've been testing out these TPUs. It does seem that they're not gonna. Actually go for them ultimately, which itself is interesting.

You know, this is something that The Information, I won't say got it wrong, but certainly their initial headline earlier this week made it seem like OpenAI was actually gonna go forward with this. And there were some corrections, uh, issued in that article. So, yeah. Last quick note, I guess, is that, you mentioned OpenAI is developing their own chip; they're ready to hit that tape-out milestone this year.

So tape out is when you ship the finalized design for manufacturing, presumably in this case, the TSMC. They are partnering with Broadcom on this, but they basically have their chip design that will be finished this year. That's a big, big deal. And then we'll have to see how long it takes to ramp up production and all that. But that's a, a big thing and something you can, you can believe we'll be keeping an eye on too.

And next up, Nvidia is one of the investors in a new startup that has just emerged out of stealth. The startup is Emerald AI, and their focus is on kind of a deeper connection of data centers into the energy grid. So they provide software that allows you to shift AI workloads at and between facilities and basically connect it to the local power usage to

not put as much strain on the grid, which is increasingly the case in, I suppose, the US and some specific regions where the hyperscalers are building their data centers. Yeah. This is kind of an interesting play, right? When you have so much power that is being soaked up by data centers, and that power is being used in very inconsistent ways, right? Like during a training run.

I mean, locally you have these giant boom bust cycles of power consumption with, you know, back propagation and all this shit. but over, over longer stretches of time, you also just have like, you know, sometimes a training run is, is going sometimes, you know, there's fine tuning. GPU utilization isn't always as high as it could be.

And so you've got high amounts of variation and what you might think of as a 50 megawatt or eventually a one gigawatt data center is not exactly always consuming 50 megawatts or one gigawatt of energy. And so the question then becomes, okay, well if data center number one is sort of in a slower period and data center number two is ramping up, can we arbitrage over that? Can we have some kind of orchestration function on the grid that allows us to, to sort of load balance in a more dynamic way?

And that's really what Emerald AI is doing. They think their modeling suggests that they could unlock up to a hundred gigawatts, of US data center energy supply, which would take loads off the, the grid and allow you essentially to build, to free up more gigawatts for fresh data center builds. That's super interesting.
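Just to make the orchestration idea concrete, here is a toy sketch of the kind of load-shifting logic being described; the site names, capacities, and the notion of a "deferrable" job are invented for illustration and are not from anything Emerald AI has published.

```python
# Toy illustration of shifting deferrable AI workloads toward the site with headroom.
# All names and numbers are made up; real grid-aware orchestration is far more involved.
sites = {
    "site_a": {"capacity_mw": 50, "load_mw": 48},  # nearly maxed out
    "site_b": {"capacity_mw": 50, "load_mw": 20},  # plenty of headroom
}

def place_job(sites, job_mw):
    """Send a deferrable job to the site with the most spare capacity, if it fits."""
    best = max(sites, key=lambda s: sites[s]["capacity_mw"] - sites[s]["load_mw"])
    headroom = sites[best]["capacity_mw"] - sites[best]["load_mw"]
    if job_mw <= headroom:
        sites[best]["load_mw"] += job_mw
        return best
    return None  # defer until a site (or the grid) frees up

print(place_job(sites, job_mw=10))  # -> 'site_b'
```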

A structural challenge, vulnerability they may have is if we just develop more efficient, high utilization algorithms and hardware stacks for these models such that you're, there's just like less value in arbitraging between them. But this is certainly a really interesting play. Like, haven't seen anything like this. It almost, it almost turns load balancing into a fresh source of significant power on the grid. It's the equivalent of building like huge amount of, of power, right?

A hundred gigawatts, if this works as promised. Interestingly, the founder and CEO is Varun Sivaram, who has a background in physics and has worked as a senior aide to ex-climate envoy John Kerry. So yeah, some notable names amongst the backers of the startup beyond Nvidia. They include some individuals like Google chief scientist Jeff Dean, Fei-Fei Li, and some other ones. And one last story also dealing with hardware.

The story is that TSMC Arizona chips are reportedly being flown back to Taiwan for packaging. So packaging, as we've covered in our long-ago hardware episode, which we refer to like every other, or every single, episode, uh, packaging is an important step in the production pipeline for chips.

And the fact that TSMC is flying chips made in Arizona back to Taiwan basically means that Taiwan is still, let's say, you could say a bottleneck, or you could say, you know, crucial to the chip supply chain for AI in particular. Yeah, and this is actually something that we called out in our latest investigation, you know, the America's Superintelligence Project thing, where we looked into supply chain risks among other things.

You know, thinking about how, how China could undermine American Super Intelligence research, uh, or add risk to it. You know, we pointed out like. Hey, everybody's talking about packaging as if it's the solved problem. But if you actually look under the hood, there are structural reasons why you might think it could take a little bit longer than expected, to have that online and that creates issues just like this.

So essentially go back to our hardware episode to look at the details, but you've got two fundamental core kinds of chips on a Nvidia GPU. You've got the memory, the HBM, and then you've got the logic dye that actually does the computing, right? So the memory stores the data, the logic die does the computing, and there's this dance between them. They've gotta talk to each other crazy fast. 'cause the logic die has to fetch data from memory to do math on it and then send it back.

So to get them to talk to each other really fast, you have to set them both on a substrate as part of this packaging process. The packaging process today requires this process called CoWoS. And, you know, CoWoS-L is kind of the latest, well, I guess CoWoS-R is. But anyway, there's a bunch of CoWoS processes that can only be done really in Taiwan at scale. And so we've onshored a lot of the fab for the logic dies. What we haven't done, though, is onshore the packaging.

And so we still have to ship them back to Taiwan. By the way, a big winner from all this is Taiwan's EVA Air; the demand for air logistics services has massively risen recently, and this is why. So anyway, kind of interesting. There are, by the way, plans to increase and ramp up packaging quickly, including, by the way, a $165 billion investment that TSMC announced for the US.

There were plans to create advanced packaging fabs that could do CoWoS here, but there's kind of been no progress yet. So until that gets solved, you know, it's all well and good to have great logic dies being fabbed onshore, but if you can't package 'em, you're not making chips.

Projects & Open Source

And moving on to Projects & Open Source, as we mentioned at the beginning. First up, we have the, uh, announcement of ERNIE 4.5, the release from Baidu of a whole bunch of models under an Apache 2.0 license, Apache 2.0 meaning you can use it for whatever, including commercial, uh, directions. And, uh, the largest of these, there are 10 models here in total, including just text-to-text LLMs, VLMs, image-plus-text-to-text, MoE models, dense models, and so on. But the big one is

a model with a total of 424 billion parameters, with up to 47 billion active parameters. And they release a bit of evaluation comparing primarily to other open source LLMs without a reasoning focus. So they compare to DeepSeek V3 and Qwen 3, and for those models in particular, they have better performance on most of the typical benchmarks.

And yeah, this is somewhat significant because, you know, we have Llama, we have DeepSeek V3 on the big model side of, like, actually quite big models with 400-plus billion parameters. Now there's another option for people who want to develop, seemingly a quite good option to build on top of. Yeah, and this is actually, uh, kind of an interesting structural choice, right? So they have the full version; this is like the ERNIE 4.5 300-billion-parameter version, which has 47 billion active parameters.

And so it's half the size of DeepSeek V3, which is the base model for R1, of course. But, uh, it's got more active parameters, and so it's maybe got three times, on a, you know, total parameter basis, the number of active parameters as DeepSeek V3, and at a smaller scale. So sort of an interesting trade-off there in terms of, you know, let's say, uh, FLOPs versus memory.
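As a rough back-of-envelope on why total versus active parameters matters here: memory footprint scales with total parameters, while per-token compute scales with active parameters. The 300B/47B figures below are the ones just discussed, and the 2-FLOPs-per-active-parameter rule of thumb is only an approximation, not something from the ERNIE report.

```python
# Rough sketch of the MoE trade-off: weights for all experts must live in memory,
# but only the activated parameters contribute to per-token compute.
total_params = 300e9    # total parameters of the big ERNIE 4.5 text model
active_params = 47e9    # parameters activated per token

bytes_per_param = 2     # assuming bf16/fp16 weights
weight_memory_gb = total_params * bytes_per_param / 1e9
flops_per_token = 2 * active_params  # ~2 FLOPs per active parameter per forward pass

print(f"~{weight_memory_gb:.0f} GB just to hold the weights")
print(f"~{flops_per_token / 1e9:.0f} GFLOPs of compute per generated token")
```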

Anyway, they do make a point of saying it outperforms DeepSeek V3, the full version, on the vast majority of benchmarks, 22 out of 28. I will say this is not super shocking, right? I mean, V3 came out in late 2024. Yeah, that's right. So that's what, six months ago? More than six months ago. And so when you have a model that comes out six months later, it's like, oh, okay, it can beat V3? Sure. I mean sure, fine, but, uh, it's not exactly a ringing endorsement.

It doesn't mean that much at this point. And this article also comments, I don't know if this is a terribly meaningful comment, there are no published comparisons with industry leaders like DeepSeek's R1, OpenAI's o3, or Claude 4. Well, yeah, because these are agentic models, right? So that's not apples to apples. It is fair to compare the base model ERNIE 4.5 directly to DeepSeek V3.

What's gonna be interesting is when they come out with X1, so this is going to be the agentic model that will be directly comparable to R1. It's not part of the current release yet, but we'll see, you know, we'll see what the performance is like there. Yeah, ultimately, I think we gotta wait and see what the actual usage is. Obviously usage of DeepSeek R1 is down significantly, I think like something like 30% or something, quarter over quarter, in the last couple of months.

So, you know, there is a bit of an effect here of the export controls, the lack of an R2 release that's competitive with some of the other open source models that have come out. So we'll see, this may be part of the export control story as well. Right. And it's also a bit of a shift for Baidu. To my knowledge, they haven't open sourced ERNIE before. For context, Baidu of course is sort of comparable to Google within China, a gigantic, super major company in the cloud space.

Uh, and they do search, and they've been pretty early to the LLM game; like, this is ERNIE 4.5. You know, they released ERNIE quite a while ago, not too much after OpenAI launched ChatGPT. This is their first, at least major, entry in the open source space.

In addition to the models, they released a technical report that was like 43 pages, not counting the appendix, going into a lot of detail on the training, similar to what you saw with DeepSeek, which went into all the nitty gritty of the infra and so on. And they also release tooling: they release, uh, the training framework, they have ERNIEKit, they have some other stuff as well, a FastDeploy deployment toolkit. So a bit of a shift in terms of the dynamics. And, uh, yeah.

Over this year, I guess since V3 and R1, we've seen more and more open source releases of big models coming out of China. They're kind of taking the lead on where you can squeeze out performance, as opposed to, let's say, Llama. And speaking of that, the next story is about another Chinese giant open sourcing something. This time it's Tencent, another, you know, one of the other gigantic companies, whose products a lot of, maybe a majority of, people use.

And they have introduced Hunyuan-A13B. This is an MoE model. So this has 80 billion total parameters; only 13 billion are active at a time. Apparently it's been trained quite a lot, a 20 trillion token pre-training phase. And it, uh, notably has a lot of training for, uh, thinking. So they did post-training with reinforcement learning with task-specific rewards, and they support fast and slow thinking out of the box.

They say that they have state-of-the-art agentic performance on benchmarks like MATH, CMATH, and GPQA, surpassing even bigger models, which, as you said, I think would be very believable, where, you know, given that it's been quite a while since R1, there have been a lot of insights as to post-training for reasoning. No reason really why we shouldn't be able to squeeze out comparable performance with smaller models, and that seems to be the case here. Also released with a permissive license.

Yeah, it's an interesting release. I mean, the benchmarks, as you say, do look good, with all the standard caveats that you mentioned. The most recent model that they compare it to here is Qwen 3, the full kind of, uh, 22-billion-active-parameter version, and it compares favorably, but it's not like it blows it out of the water. It's better in some areas, worse in others. It's more sort of the incremental movement that you would expect from the space.

I don't think that there's a huge, huge story here. Some of the standout features, well, one of the standout features is the thing that's not a standout feature, which is that fundamentally this reflects yet another model built in the Chinese ecosystem that effectively mirrors the DeepSeek curriculum or, you know, training approach, right? So you got, like, yes, a pre-training phase.

You have long-context adaptation and then, like, fine-tuning, and then there's gonna be an RL stage that is as yet unreleased, with fast annealing as well. So, fast annealing is this idea where, rapidly, towards the end of the training process, you decrease the learning rate, so the model gets updated less quickly, let's say, as it gets trained on tokens towards the end of the run.
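For a rough picture of what that final anneal looks like, here is a minimal sketch of a schedule that holds the learning rate flat and then decays it quickly over the last stretch of training; the shape and all the numbers are illustrative, not Tencent's actual recipe (which would normally also include warmup and possibly a cosine decay).

```python
# Toy learning-rate schedule with a rapid "anneal" at the end of training.
# Numbers are purely illustrative.
def lr_at_step(step, total_steps, peak_lr=3e-4, final_lr=3e-6, anneal_frac=0.1):
    """Hold peak_lr for most of training, then decay linearly over the final anneal_frac."""
    anneal_start = int(total_steps * (1 - anneal_frac))
    if step < anneal_start:
        return peak_lr
    progress = (step - anneal_start) / max(1, total_steps - anneal_start)
    return peak_lr + (final_lr - peak_lr) * progress

total = 100_000
for s in (0, 80_000, 90_000, 95_000, 100_000):
    print(s, f"{lr_at_step(s, total):.1e}")  # stays at 3.0e-04, then drops toward 3.0e-06
```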

And, and the idea here is just like, as the model gains, more experience learns more and more, like over time you should expect it to be making smaller and smaller adjustments, sort of honing in on, on what it needs to look like rather than continuing to kind of flop around. So they, they do that. the second standout feature is the fact that they use this thing called dual mode, chain of thought, right? So you got two different modes. One, and it's the same model, right?

So it's not like we're shipping. A query to one model to do fast thinking and another model to do slow thinking. it's like one model that can route to either sort of sub circuits within itself. So one is this like low latency, fast thinking mode for more routine queries. And then you've got this a slow thinking mode for multi-step reasoning. So you can control this with a tag system.

There's like this no think tag for, you know, faster inference, and then for more reflective reasoning, you pass it a think tag. So it's pretty straightforward to use. They also use reinforced learning, uh, reinforcement learning for task specific reward models, or sorry, with task specific reward models. So basically you have like reward models that are being used for the RL loop.

And those are designed kind of in a bespoke way for a bunch of different tasks, uh, that you're training the model towards. So nothing too shocking, either from the standpoint of the architecture or the performance or the optimization routine or the data. But overall it's another impressive player, you know, Tencent showing they actually can contribute, they can play in at least the open source game, uh, with other big companies.
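To give a flavor of the tag-controlled dual mode mentioned above, here is a hedged sketch of prompting such a model through the Hugging Face transformers chat API. The model ID and the exact "/think" and "/no_think" tag strings are assumptions for illustration, based on how dual-mode models of this kind are typically exposed; check Tencent's model card for the real conventions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID and tag strings are assumed for illustration; consult the official
# Hunyuan-A13B model card for the actual names and chat template.
model_id = "tencent/Hunyuan-A13B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

def ask(question, slow_thinking=False):
    # Prepend a tag to toggle between the fast (no-think) and slow (think) modes.
    tag = "/think" if slow_thinking else "/no_think"
    messages = [{"role": "user", "content": f"{tag} {question}"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=512)
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

print(ask("What is 17 * 24?"))                                   # routine query, fast mode
print(ask("Outline a 3-step proof sketch.", slow_thinking=True)) # multi-step reasoning, slow mode
```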

Yeah, they did release a paper alongside this that has decent detail, particularly on the training, with regards to this specialty, uh, I forget the term. But with regards to the training in particular, they, uh, seem to generalize a bit. So they focus on a whole range of tasks, not just math and coding, but also creative writing, knowledge-based QA, multi-turn dialogues, and so on. And for each of those, they do supervised training and reinforcement learning.

So they might be trying to generalize to domains outside the typical reasoning things. And the focus is more on the efficiency side; the abstract kicks off presenting it as a model that optimizes the trade-off between computational efficiency and model performance. So the gist here is, for 13 billion activated parameters during inference, not a huge number of parameters, this seems to give you quite good outcomes.

Next we've got another open-source release. This time it's more of an RL-trained coding agent coming from Together AI; they call it DeepSWE, as in deep software engineer. And this is built on top of a Qwen3-32B language model. So the focus is particularly on the training of a model to be a software engineering agent. And among open-weight models, this is a leading one on benchmarks like SWE-Bench Verified.

I will say, I think we don't yet have benchmarks that really capture the types of things you can see Claude Code doing in terms of very impressive tool calling and codebase exploration. A lot of these benchmarks focus more on solving GitHub issues and things like that. Not quite the same, but regardless, clearly they're making a push in the direction of having a Claude Code-capable open model. Yeah, and it is noteworthy that it comes from Together

AI. We've covered a lot of their stuff in the past, right? They have this philosophy of trying to do kind of fully distributed open-source AI training, aggregating compute from around the world. There are a couple of different companies that are pushing in this direction as well, philosophically. But pretty impressive results, and trained on 64 H100 GPUs over 60 days. So, you know, not a huge, huge workload here, but certainly non-trivial either.

It's sort of meso-scale, anyway. Also a partnership between Together AI and Agentica, which is this other company that focuses more on frameworks for post-training language model agents. They have this framework called rLLM that was apparently used here, presumably somewhat experimentally. So, yeah, pretty cool. We continue to see more of these open-source RL agents. Nothing jaw-dropping in terms of performance here; it looks like it's solid.

It's again a solid incremental boost to performance, and especially on SWE-Bench, right? So this is a benchmark that does matter a lot, or I should say SWE-Bench Verified, right, this OpenAI-scrubbed version of the original SWE-Bench benchmark. It mirrors pretty closely some standard software engineering practices.

So to the extent that you're seeing this model score close to 60% now as an open-source model, you know, that's a pretty interesting and pretty big deal. So yeah, I mean, a pretty solid move by Together AI; they continue to impress. Yeah, and they also released a pretty nice report on the model, with a bunch of authors, though not quite as deep as these previous ones in terms of the details disclosed.

But some pretty nice details, in particular about the training recipe and the empirical results of training. So pretty useful for insights and research in the area. Next we've got some papers going in slightly different directions. First, we have GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable RL. So this is getting into the vision-language space. We've seen more research moving beyond just reasoning in text and towards reasoning with images.

And that is the frontier of research. And so here they have a 9B thinking model that apparently is outperforming a bunch of other models, even surpassing really big ones like Qwen2.5-VL 72B. So pretty meaningful results in the reasoning-with-visual-input space. And they open-source both the reasoning model and the base model. Yeah, there's, I think, a lot that's interesting about this paper and this sort of philosophy, this approach.

So, it is still modular: they have a vision encoder. The goal here is to have a model that can look at images and text at the same time and do reasoning on them. So they have a vision encoder that, it could take videos actually. If you're familiar with convolutions, convolutional neural networks, typically those are two-dimensional convolutions. So you kind of look at a patch of an image and, anyway, you do some math on it.

You apply a filter to that patch and it gives you a smaller patch, and then you can apply another filter to that smaller patch, and on and on. And that's kind of how convolutional nets are made. Well, this is that, but instead of a two-dimensional patch of image, it's a three-dimensional patch of video, where the third dimension is time, basically. You can do strided convolutions too; if strided convolutions doesn't mean anything to you,

it's fine, don't worry about it. But fundamentally, you're looking at a three-dimensional patch across time now of your image, or therefore of your video. And from there they're able to encode the image or the video in the model. And then they take that encoding and pass it through basically just a simple feedforward neural network, an MLP, to get the dimensionality to match the dimensionality of their language model.

So basically they're taking the vision-encoded latent representation and mapping it to something that matches the dimensionality of the latent representation of the language model, this GLM-based language model that they use, which is from Zhipu AI. And then they're just gonna concatenate those two together. So you now have a unified image-and-language representation that you can basically just do LLM work on in the usual way.
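As a rough sketch of that encode-project-concatenate pattern, here's a toy PyTorch version; the dimensions, patch sizes, and layer choices are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Toy dimensions, purely illustrative
VISION_DIM, LM_DIM = 1024, 4096

# 3D convolution over (time, height, width) patches of a video clip
video_patch_embed = nn.Conv3d(
    in_channels=3, out_channels=VISION_DIM,
    kernel_size=(2, 14, 14), stride=(2, 14, 14),  # strided: non-overlapping patches
)

# MLP projector that maps vision features into the language model's embedding space
projector = nn.Sequential(
    nn.Linear(VISION_DIM, LM_DIM),
    nn.GELU(),
    nn.Linear(LM_DIM, LM_DIM),
)

video = torch.randn(1, 3, 8, 224, 224)                     # (batch, channels, frames, H, W)
vision_tokens = video_patch_embed(video)                    # (1, VISION_DIM, 4, 16, 16)
vision_tokens = vision_tokens.flatten(2).transpose(1, 2)    # (1, num_patches, VISION_DIM)
vision_tokens = projector(vision_tokens)                    # now matches LM_DIM

text_tokens = torch.randn(1, 32, LM_DIM)                    # stand-in for text embeddings
lm_input = torch.cat([vision_tokens, text_tokens], dim=1)   # concatenated sequence for the LM
print(lm_input.shape)                                       # torch.Size([1, 1056, 4096])
```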

The language model itself, though, is really interesting. They use bidirectional attention, which, because we're doing a speed run on this story, it's in the lightning round, I'm not gonna get into why that is. But anyway, it just makes sense to do this when you're doing reasoning on images, more than just going fully autoregressive.

Rather than just a decoder-only model. Anyway, the reasoning behind that is sort of intuitive, but I guess only if you have the intuition; it's intuitive if it's intuitive, you know? Yeah, if this were a main story we would get into it. Lots going on in open source this week, clearly. And as you said, this is from Zhipu AI and Tsinghua University, and they released a pretty detailed report on this one as well.

Eighteen pages, and they show how it can be used for things like long-document understanding, web agents, video understanding, coding, stuff like that. So it seems to be a pretty strong entry in that space. And last up, just one more new model. This one coming out of the US for once, actually from Apple in collaboration with the University of Hong Kong.

Not really the US even, you know, mostly. The paper title is DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation. So they trained a 7-billion-parameter diffusion LLM. Quick recap: most LLMs do autoregressive inference. Basically, you train the model to predict one token at a time, or maybe a few tokens at a time, given your input. Diffusion is totally different. You're sort of predicting everything all at once.

That's typically what you do for images, but very rarely what you do for text. But you can do it for text, and there's increasingly research in that direction. This is some of that research. They show some insights into training effectively and get pretty good results even compared to non-diffusion models like Qwen2.5-Coder. Yeah, and we've skipped over, I think, so this is the third paper this week that has its own take on GRPO.

So this is group relative policy optimization, the reinforcement learning optimization routine that DeepSeek famously used and kind of made popular. Anyway, this is its own variant. I feel like maybe next episode we should carve out just five minutes, because the GRPO stuff is really important. It's something that you'll wanna understand to get what the differences are between these different papers and their approaches.

But yeah, one of these days; so many topics we could do a deep dive on. You know, we should do a reasoning episode, clearly. Probably, yeah, that should be in that. Maybe we'll try. You know, we always want to do more, but life is busy. Yeah.

Research & Advancements

And moving on to research and advancements, and speaking of doing more, we've got this paper, Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search. Okay, so Monte Carlo tree search is this well-trodden path in reasoning where, you know, imagine that you start off with some attempted solution.

You now have a choice: let's say you make a modification of the solution, you try to refine it, and now you have the choice as to whether you continue to try to modify the thing you just modified, or whether you go back to the original branch and try to spawn another variant of it, right? So do you explore more, starting from that original node? Or do you go deeper and exploit more, go deeper down the path that you had sort of started down?

And there are two extremes, right? There are some approaches that just continuously refine the same, not prompt, the same output, more and more and more. And in that sense they risk getting stuck in a rut, right? They don't do much exploration. The other extreme variant is to say, okay, let's just sample a crap ton, generate a bunch of different possible solutions starting from the same prompt, but never go deeper and refine any of them.

And that's the other, more exploration-heavy thing. Now, in practice, what you wanna do is balance those two, right? This is the classic exploration-exploitation trade-off from reinforcement learning. And so what this paper's going to do is ask the question: what if we could vary, in a principled way, the number of different branches that we try versus the depth that we push to in each branch, right? What if we could trade off, in a principled way, exploration versus exploitation in a tree-search setting?

So that's what they're gonna do. Normally what you do in Monte Carlo tree search is you fix the number of child nodes with a fixed hyperparameter. So you'll do that, then maybe you go to the next level, or you pick one branch and expand. But what they wanna do here is do it dynamically. And so

it gets a bit involved, but the fundamentals are, they actually have an internal model that is being updated along with the main search loop to predict the value of creating a new node, of doing more exploration in other words, versus doing more exploitation. And they have different models, essentially, that predict the value of each, for each part of the tree.

So they kind of decompose the tree and go, okay, in this chunk of the tree, what is the value of adding a new node versus the value of pushing forward in the existing nodes? And essentially this is the mechanism that they're gonna use. Really interesting, actually; the details, if you wanna look into how they implement it, are quite interesting.

They use this technique called Thompson sampling, which is a sort of Bayesian-friendly way of making this kind of principled trade-off. And the results are really impressive. They end up basically outperforming, on average, all models in terms of the benchmarks that they look at, the best average ranks really across all benchmarks.

Anyway, they compare it to all the variants we talked about: just repeated sampling from the same node, or just doing sequential refinement the whole way through, or a standard Monte Carlo tree search where you fix the number of branches at each layer, all that sort of thing. One of the really amazing things is the ARC-AGI experimentation that they do. One last detail I gotta give you.

So as they're doing this, right, you imagine you start off at a given node; you can ask yourself, should I try refining the solution, or should I try going up a level and spawning a new alternative? Another question you could ask yourself, if you're spawning a new alternative, is which model should I use to spawn this new alternative? And they're actually gonna include that. They're gonna allow the framework

to build another model to predict which model it should use to spawn those branches. And in that way, essentially, they're creating a complex of systems here, or a complex of models, integrating together many sub-models, where at each node you have the choice of explore, and then within explore, which model do I use to explore, or exploit and refine, right? So that's kind of a very unusual sort of hybrid model approach.
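For a flavor of how Thompson sampling can drive that widen-versus-deepen call, here's a minimal sketch; the Beta-Bernoulli reward model and the scoring stand-in are simplifications for illustration, not the paper's actual formulation:

```python
import random

# One Beta posterior per action: spawn a new sibling ("widen") vs. refine the
# current best node ("deepen"). Rewards here are just "did the new candidate
# beat the current best score", a stand-in for a real verifier or test suite.
posteriors = {"widen": [1, 1], "deepen": [1, 1]}  # [successes + 1, failures + 1]

def propose(action, best_score):
    # Stand-in for calling an LLM to generate or refine a solution and scoring it.
    noise = random.gauss(0, 0.1)
    return best_score + (0.05 if action == "deepen" else 0.0) + noise

best_score = 0.5
for step in range(50):
    # Thompson sampling: draw a plausible success rate for each action, pick the best draw.
    draws = {a: random.betavariate(s, f) for a, (s, f) in posteriors.items()}
    action = max(draws, key=draws.get)

    candidate = propose(action, best_score)
    success = candidate > best_score
    best_score = max(best_score, candidate)

    # Update the chosen action's posterior with the observed outcome.
    s, f = posteriors[action]
    posteriors[action] = [s + success, f + (not success)]

print(posteriors, round(best_score, 3))
```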

On ARC-AGI-2, they end up doing really well, scoring around 30% on pass at 250, so 250 at-bats here, with basically a budget of 250 LLM calls. They end up hitting around 30%, which is pretty wild. That performance, by the way, is highest when you allow the model, or the framework, I should say, to use the most models.

So when they combine o4-mini, Gemini 2.5 Pro, and DeepSeek R1, the May 28th version, they end up getting the best performance. As you start dropping models from that list, you start to see progressively worse performance. And so this really is a framework for getting many models to play nice together and to choose which model to use to expand nodes, and also whether to expand nodes or iteratively refine. So it's a really interesting paper, in a sense a meta-paper or a meta-model.

I just thought this is a fascinating read. Yeah, this is somewhat significant, practically speaking, because this is one of the standard ways to structure reasoning, right? And on the usage side, what this looks like on their benchmarks, for instance on coding and on ARC-AGI's sort of quasi-puzzles that measure intelligence, the meaning of tree search would be, you know, a given node is a solution and you're able to sample multiple solutions.

That's the breadth, the width; or you can iterate on a solution, that's the depth, and you should be getting some feedback such as test outcomes or scoring. And one of the limitations of search is you do need a scoring function of some sort to base the tree expansion on. So as you said, another interesting bit here is kind of the adaptation of Monte Carlo tree search to the context of LLMs.

It's not super intuitive because you can, you know, within a node also generate more tokens potentially, and there are various nuances, quite a bit of detail in the implementation. But the gist is it allows you to do this in a very principled, sort of classic way. Monte Carlo tree search is, by default, right, one of the big algorithms, as is Thompson sampling, very highly researched. So they very much adapted it in a very clean way for reasoning and seemed to get good results.

Next we've got The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements. So they want to evaluate an AI agent's ability to reproduce scientific results, and they focus on NanoGPT speedrun tasks. So they want to test the reproducibility of scientific results by AI. That's the motivation, right, to kind of get the results of a given research paper with your own implementation using AI.

And this NanoGPT speedrun task is designed to assess the efficiency and accuracy of AI models in repeating scientific improvements. So now we have a benchmark for doing that, and as with any benchmark, it sets a goal to target and, at least for now, has room for improvement. Yeah, I like to think of this as a sort of companion to, or complement to, the METR AI evals, where they look at, you know, what is the time horizon that AI agents can successfully automate?

So, you know, tasks that take humans an hour, tasks that take humans two hours; they tend to look at AI research tasks for that because they're interested in the sort of recursive self-improvement loop where we get to the point where AI is just automating all of AI research. This is in that same spirit. And so NanoGPT is essentially an instantiation, an implementation I should say, of GPT-2 in PyTorch that Andrej Karpathy put together a while ago.

And so what's happened since GPT-2 came out is we've had a series of breakthroughs, right? A series of innovations where we've iteratively improved the training time for GPT-2 from 45 minutes to under three minutes. And there are 21 successive optimization records that have happened. So you can think of it as 21 steps down the ladder that people have taken.

And each of the tasks associated with this benchmark tasks whatever agent you're testing with reproducing one of those 21 steps, right? So, starting with the original GPT-2, or at least its NanoGPT version, modify that in order to achieve the first speed boost that we got historically. And then you see, okay, well, what fraction of that speed boost did the agent successfully recover? And so that enters your sort of evaluation framework, and then you try the next one.

Okay, so now that we're here, let's reset, let's give the model the correct version of GPT-2 that reflects that next step, that next breakthrough, and start from there. Now replicate the second breakthrough; see what fraction of that acceleration the model is able to recover. And then repeat all the way down the line 21 times. One eval would be to add up together all the time savings that your agent was able to achieve.

Compare that to the time savings of about 42 minutes that was actually achieved by human researchers over time, and then use that essentially to compare different models. And so, yeah, this is really quite interesting. o3-mini is the best performer they looked at here; it recovered about 40% of the speedup with hints. So this is a variant that they have. They have versions with three different layers of hints that you can give to the model, and then also

versions with no hints at all. And, you know, for the three different layers of hints, one is a description of the pseudocode that would be required to reproduce it, the second layer is natural language explanations, and the third layer is a full mini-paper format with all the technical details. So gradually kind of giving the model more and more. But there is a version where they just zero-shot it and you're not given any more context.
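To make the fraction-of-speedup-recovered scoring concrete, here's a tiny worked example with made-up numbers, not figures from the paper:

```python
# Hypothetical record step: the human innovation cut training time from 10.0 to 8.0 minutes.
baseline_minutes = 10.0
human_record_minutes = 8.0

# Suppose the agent's attempted reproduction gets training down to 8.6 minutes.
agent_minutes = 8.6

# Fraction of that step's speedup the agent recovered: (10.0 - 8.6) / (10.0 - 8.0) = 0.7
fraction_recovered = (baseline_minutes - agent_minutes) / (baseline_minutes - human_record_minutes)
print(f"{fraction_recovered:.0%}")  # 70%

# Repeat for each of the 21 record steps and aggregate, e.g. total minutes saved
# by the agent versus the ~42 minutes of savings humans achieved overall.
```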

And so Claude 3.7 Sonnet is also comparable to o3-mini, apparently, in some conditions. But DeepSeek R1 basically just does really badly, sometimes does even worse with hints than without, which itself is kind of interesting. And then Gemini 2.5 Pro basically just bombed it, got basically zero in their aggregate measures. So all kinds of interesting observations about what works and what doesn't; worth looking at the paper.

This is kind of more, I think of it as more encyclopedic knowledge that will go bad pretty quickly, but certainly the benchmark itself seems like a really important and interesting contribution that I would keep an eye on. You know, again, think of it as a complement to those very famous METR evals. I think it's a great way of looking at that. Right, and also just a fun way to do this, I guess.

I did not know that there was a speedrunning challenge to train GPT-2 in the shortest amount of time. That's where NanoGPT comes from, 'cause by these days GPT-2 is considered nano; I believe GPT-2 was, what, like 1.5 billion parameters. So the version they do here is the 124-million-parameter version. Right, but yeah, tiny by today's standards. Right. And speaking of METR and their evaluation suite, that is our next story. We have just an update.

So they posted, saying, just to quickly recap, they released a paper several months ago now measuring AI's ability to complete long tasks. They basically have a task suite where they roughly know how long a given task takes, could be five minutes, could be 10 minutes, could be an hour. And they measured the ability of various models to reliably complete those tasks, with, for instance, a 50% chance you are going to get this done in an hour or less.

So since that release, a couple months have passed and they published an update where Claude 4 Opus now reaches a 50% time horizon of 80 minutes. So, 50% chance it completes an 80-minute task. Sonnet, the Sonnet 4, reaches the 65-minute point. So they're now exceeding an hour, you know, slotting into the trend, the kind of prediction fit that came out with the paper. Yeah, this is really interesting because there are so few data points, right?

Like, every frontier model basically is a data point; that's where the bar is. And so figuring out exactly what the trend says is really hard. If you look at the plots, even these small adjustments, right, in the slope of that log plot are the difference between hitting, you know, ASI, or hitting, let's say, AI agents that can do month-long tasks coherently in, say, two years versus three or four years, right? So things can be very sensitive to that.

So these small little updates, every time you get a new model, you wanna fit it to the plot really quickly and be like, oh, how does that affect the slope of the curve? Certainly when o3 came out, that was a big, big update. If you look at the plot, o3, in concert with other models like Sonnet 3.7 and o1, and Sonnet 3.5 from back in October, really seems to suggest there's an even steeper trend than is otherwise indicated.

Claude 4 Opus does not, by the way, beat o3. So o3 is above an hour and a half; Claude 4 Opus is a few minutes shy of that, like, I don't know, an hour and 20, you said, Andrey? Something like that. Mm-hmm. It's notable, you know, it has been a little while and we're not exceeding that now. I would say that's still within noise looking at the plots, but it could make you update a little bit.

This is the debate that's happening right now, right? People are trying to figure out what this means exactly, and probably overthinking it. We're gonna have to wait until the next OpenAI or next Anthropic agent model drops. But definitely, you know, this is something really to keep an eye on just because of the implications, right? If these curves really do curve the way they seem to, then we could be in for a hell of a party over the next few years.

And how much of a party, well, that's contingent on a relatively small number of data points. Yeah, and, you know, this is a tricky thing to evaluate obviously, because you have a set of tasks that they evaluate on, and how do you really know how long it takes? But my two cents is, it seems pretty plausible from using Claude Code that it can autonomously finish a one-hour task. It's definitely getting there, in my opinion. And next story, a research paper.

And this one is titled Performance Prediction for Large Systems via Text-to-Text Regression. The gist of it is the focus is on predicting the outcomes of some sort of configuration of a system. So, for instance, you have a cluster, you have some way to configure that cluster, and you wanna be able to predict the latency of that setup. A tricky task, and a very useful task to do well on.

And they are training a model that does that for you based on system logs and things like that, going from previous data to a prediction of your performance with a new setup, and they get, as you might expect, really good correlation and performance with this approach.

Yeah, I guess the core of this is a debate that's been happening for a long time, especially in language modeling but elsewhere too, as to whether the decoder-only architecture is the right way to go, or whether you should use an encoder-decoder structure, right?

So an encoder-decoder is a model that will start by, through many layers, specializing in just generating a really good encoding of an input, and then have separate layers that specialize in massaging that encoding to turn it into a decoding, in different clumps, essentially in different stages, with an optimization routine that reflects that intent.

So the advantage people will claim for the decoder-only version is you have essentially an integrated thing, one model that's able to kind of address dependencies and interactions between the lowest layers and the highest layers without having to go through a bottleneck where you have a well-defined encoded latent representation. Whereas the encoder-decoder side would say, well, it's good to have specialization so you can make a really good encoding.

And then separate that out from the decoder step, which is kind of this fundamentally different operation. And I mean, the answer as to which of these approaches is best does seem to be context-dependent; certainly this paper strengthens that argument. What they're gonna do here is have two encoder layers, again, that specialize in just taking in this semi-structured data about the state of a system.

As you said, they used Google's Borg compute architecture as their testing ground for this. So they use system logs as inputs; they use all kinds of, like, the equivalent of check-engine lights and things like this that they feed in. And some of this data is in text form, by the way, but this is not a language model. It's not gonna learn to understand the semantics of the text.

It's only gonna learn to understand those indicators insofar as they are correlated to the one metric, the number that the model ultimately is going to predict. And so this is not an autoregressive model; it's taking these raw inputs and it's predicting a number, which is kind of a measure of the efficiency of the overall system that it's predicting. So you've got two encoder layers, two decoder layers, for 60 million parameters in total. So, relatively small model.
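As a rough picture of the shape of a model like that, here's a minimal encoder-decoder sketch in PyTorch; the sizes, tokenization, and the choice to emit the metric as digit tokens are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL = 512, 256  # made-up sizes; the point is the small encoder-decoder shape

class TextToNumberRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        # Two encoder layers read the semi-structured system state (logs, config fields).
        # Two decoder layers emit the target metric as a short token sequence (e.g. digits).
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=512, batch_first=True,
        )
        self.head = nn.Linear(D_MODEL, VOCAB)  # cross-entropy over response tokens

    def forward(self, src_ids, tgt_ids):
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        # causal mask so the decoder predicts the numeric string left to right
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        out = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.head(out)

model = TextToNumberRegressor()
src = torch.randint(0, VOCAB, (1, 128))   # tokenized system-state text
tgt = torch.randint(0, VOCAB, (1, 6))     # tokenized numeric output, e.g. "0.731"
logits = model(src, tgt)
print(logits.shape)                       # torch.Size([1, 6, 512])
```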

So this is a really effective way, it seems, of making these predictions on how the system is gonna work. The Borg cluster scheduling system, this is the source of data they use; it's the big cluster orchestration system that Google uses.

That's all the raw data that they're using to train this, and they do it using cross-entropy loss over response tokens that they get from the system, tokens that indicate how it's doing, what the status is of the overall system. And so, anyhow, it's pretty interesting. It is more of a niche sort of application; it's showing that just using a raw LLM trained from scratch on text data is not necessarily the best play if you have a more structured problem.

Again, this is not perfectly structured, you still have language inputs, but those language inputs, again, you think of them more as categorical variables, and that's sort of the interpretational frame that they're applying here. Next up, just a couple more papers. The next one is Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning. We've really gotta get going.

So just a very short gist of this paper: they are exploring, if you train specifically to do better on math, are you gonna be able to do better on reasoning in general outside of math, like, I don't know, science problems, for instance, or coding. They find that, depending on how you do it, you might actually get negative transfer. So supervised training just doesn't work as well compared to reinforcement learning.

Reinforcement learning kind of has a more subtle effect that doesn't mess up your initial model as much, and generally seems to result in better transfer. And last up, Correlated Errors in Large Language Models, another kind of empirical analysis paper. They are looking at the correlation of the errors among different large language models using several datasets. They look at 349 LLMs on 12,000 multiple-choice questions. And so the question is, when they're wrong, different LLMs,

how similar are they in terms of what they get wrong? They found that the correlation is pretty high; models agree on incorrect answers about 60% of the time on the HELM leaderboard, so much more likely than random chance. I suppose not entirely surprising perhaps, but still interesting from an empirical analysis perspective. Yeah, absolutely.

I mean, this remains true regardless of the architectures used and the model developers, and presumably optimization routines as well, so that basically leaves the data, right? And that kind of makes a lot of sense; there's only so much internet data, and everybody will be using highly overlapping datasets as part of their training. So in some sense maybe not surprising, in others more so.

I mean, one of the things I'd be interested in seeing is the overlap of this with just the general frequency of errors from these models. Because as the frequency of errors drops, the errors themselves get more and more rare and scarce, and so you're getting a more and more distilled picture of, I mean, it's definitely not the irreducible entropy of the training data, but it's sort of gesturing in that direction.

So anyway, kind of interested in what that looks like if they're plotted together, but interesting paper. Yeah. And they focus in particular on this area of job applicant screening as kind of an outcome of this analysis. And as you might expect, having lower correlation is better, because if you have lower error correlation, it means you can look at several LLMs and potentially avoid an error, because when one LLM gets it wrong, another one gets it right.

And the gist of it is, you know, you need to sample quite a few LLMs to be able to get to lower error, because of the decently high correlation among them.
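For intuition, here's a tiny sketch of the kind of agreement-on-errors statistic being described, on made-up answers rather than anything from the paper:

```python
# Made-up multiple-choice answers from two models, plus the gold answers.
gold    = ["A", "B", "C", "D", "A", "C", "B", "D"]
model_x = ["A", "C", "C", "D", "B", "C", "A", "D"]
model_y = ["A", "C", "C", "A", "B", "C", "A", "B"]

# Restrict to questions both models got wrong, then ask how often they
# picked the *same* wrong answer. High agreement here is the correlated-errors effect.
both_wrong = [(x, y) for g, x, y in zip(gold, model_x, model_y) if x != g and y != g]
same_wrong_answer = sum(x == y for x, y in both_wrong)

print(f"both wrong on {len(both_wrong)} questions, "
      f"same wrong answer {same_wrong_answer}/{len(both_wrong)} times")
```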

Policy & Safety

And up next, onto the policy and safety section. We're starting with forecasting biosecurity risks from LLMs. Okay, so there's been a lot of talk about whether LLMs actually do make it more likely that bad actors are gonna be able to design or release more dangerous bioweapons. And, you know, famously there was this study from a year ago that said, guys, don't worry, or not don't worry, but, like, good news.

There's no meaningful uplift from, and I think at the time it was GPT-4 or something, and then OpenAI came out with something a few months or weeks later saying, actually, we tried this, or something analogous to it, and we have access to the full unlocked version of GPT-4, and we do get a quite significant, detectable increase in the probability that people with a little bit of training, or a significant amount of training, are able to access dangerous bioweapons.

And so this is another take on this. And we've seen, by the way, other benchmarks too, in system cards from Anthropic and from OpenAI since, that have meaningfully increased that even further, and quite significantly. So this is another take on it. Instead of throwing these models directly at tasks that people think will be correlated with high bioweapon risk, what they're doing is they're turning to a bunch of experts in biosecurity and biology.

So, 46 domain experts in biosecurity and biology and 22 expert forecasters, right, the so-called superforecasters. And what they look at is, okay, for all these folks, we want to get you to predict the probability of a human-caused epidemic causing over a hundred thousand deaths by, I think it's 2028. Yeah. So that's the base question, and then a whole bunch of other questions correlated with that or that follow from that. But that's kind of the base question.

That's the meat and potatoes. And then they divide the overall group up in different ways to look at how, sorry, how that predicted probability changes depending on who you ask, right?

So, for example, the overall assessed probability was around 1.5% with AI and 0.3% without, so they're predicting a very significant increase in the probability of, you know, 100,000 deaths from, again, a human-caused epidemic by 2028, which is quite significant. But it turns out that the people who assign the highest probability to this are also the people who are more accurate at predicting the progress of large language models.

They're also the people who have the most experience in biosecurity. They're also the people who get the highest accuracy on low-probability questions that they ask otherwise in the survey. So that's kind of bad news, right? The people who are the best at forecasting this sort of stuff tend to assign the highest probability to this, ultimately, you know, leveling out at around one to 3%, something like that. But still significant.

And so this is something that does suggest, hey, there is meaningful uplift, at least according to these forecasters. So take it with a grain of salt. The last thing I'll mention is they do say that mitigation measures are probably gonna be enough to buy down this risk.

When they were asked about mitigation measures, including mandatory screening of synthetic nucleic acid orders and just basic AI model safeguards, they reduced their risk forecast back to close to the baseline level. So they basically figured if you do put in the right mitigation measures, you should be able to essentially buy down all the risk that comes from this. I'm personally really skeptical about that.

I think people very much overestimate the effectiveness of a lot of these safeguards, for reasons we could talk about. But yeah, anyway, I think a really interesting study, and again, a fundamentally different angle, right, from these more empirical studies that RAND has put out, that OpenAI has put out, that Anthropic has put out, that are in their own right very useful. And RAND gets credit for kicking off that trend so many months, I wanna say over a year, ago now.

Yeah, quite an interesting read, and they do go into quite a bit of detail. So they start out with this unconditional question of what's the probability of a hundred thousand deaths due to a pathogen by 2028, and then they condition on various hypothetical advancements in LLMs to see how that changes. So they begin with this 0.3% baseline that rises to 1.5% conditional on the several hypothetical LLM capabilities. And then, amusingly, they went and checked, and actually those capabilities have already been achieved.

And the people responding to the survey thought that would not happen until 2030. So that is an interesting data point saying that maybe the forecasters are underestimating the degree to which LLMs are moving and are able to achieve these advancements, which could color your prediction, or at least would mean that this 1.5% probability is their actual prediction given the state of LLMs. And obviously, you know, people who listen to the podcast, you're aware of my bias on this.

Like, I do think that AI is moving a lot faster than most people realize or want to admit to themselves. And one tell is things like this; this happens over and over and over again. You'll have people say, you know, oh, we're not gonna have this. When you actually get people to give you dates by which they think certain capabilities will emerge, they tend to just hilariously predict, oh, it'll be another 10 years, another five years.

And the usual case is the thing gets done in a month or two. But in this case, this is such a great example of it because it literally had already happened. It's so hard to keep up with the space, in fairness; it does move that fast, and we ourselves are surprised by things all the time. But that's kind of part of the problem, right? You need to have some amount of, I guess, epistemic uncertainty when it comes to this stuff and factor it in.

If you find you continually get surprised by how fast things are moving, then maybe that implies you should just change your world model. Anyway, I think a lot of people are banging that drum these days. Yeah. And the hypotheticals here are quite specific, so we're talking about AI enabling 10% of non-experts to synthesize a DNA fragment of, I think, the 1918 influenza in a laboratory.

They are looking at the Virology Capabilities Test, which came out just a couple months ago. So these are, you know, not just, oh, you do well at some coding benchmark; they're very specific to wet-lab work, virology work, things like that, which is obviously quite relevant and I think does lend this more credibility as an analysis. Speaking of predictions, next we have AI task length horizons in offensive cybersecurity.

So this is an adaptation, really, of the methodology of METR we just discussed, about predicting the length in time of tasks that LLMs can do. This is less formal, just FYI; it's a blog post by just one person with estimated lengths of tasks for various benchmarks. Sorry, this is "less formal" in a space where the formal version, of course, is a preprint slapped together and thrown on arXiv without peer review. Right, so formal.

It's just funny that that's, like, yeah, yeah. I'm just outlining it because of the blog post itself. Totally, totally. I had this double take where I agreed with you, but then I was like, wait a minute, what's the bar? Yeah, it's not like these things are being released in journals. But anyways, in this slightly more informal analysis, they have tasks ranging from 0.5 seconds to 25 hours in human-estimated times. And they are seeing that, still pretty early,

the current models can solve six-minute tasks with 50% success rates. But, as with METR, you can do a little analysis showing that these models are likely to double every six months, or sorry, four months or so. This is exactly the debate, yeah, that we were talking about earlier, right? How do you fit that curve? Yep. Some ways of fitting it you get four months, some ways you get six, some ways you get seven. It's, yeah, it's pretty unclear. Yep. So there you go.

And more empirical analysis, and obviously related to safety in the sense that cybersecurity is a huge challenge and LLMs are kind of an obvious fit for hacking, for things like that, where unlike virology, for instance, where you need a wet lab and you need to work with a human, here you could very easily see an agent going off and doing some hackery. Yeah, and with the rise of AI coding as well, there's gotta be a lot of cybersecurity stuff going on in the next few years. It's fine.

Everything's fine, guys. Yeah. This benchmark, by the way, is, in my opinion, extraordinarily badly needed. Any threat model that you have that runs through, you know, AI self-replication, loss of control, weaponization, right? The cyber use case is arguably, and you would have to argue this, but arguably the most real and present one that you might expect impacts from in the near term.

And so you should be very frigging interested in measuring how successful these models are at long-horizon tasks that look like capture-the-flag challenges, that look like malware generation challenges, natural-language-to-bash translation, that sort of thing. So they look at five different buckets of tasks that have different time-horizon characteristics, five different benchmarks, I really should say.

So there's a small bash benchmark, which is the shortest-timeline tasks, which were actually created by the author of this post, presumably because there just aren't tasks short enough that, like, GPT-2 can do anything meaningful whatsoever. So that's like one-to-30-second tasks, at least that's how he assesses it, and, you know, we could get into this and at some point we may, but

the open question is always, how do you assess the amount of time that it takes for humans to complete these tasks? That's itself a very interesting question, especially as you get into very, very short and very, very long tasks. But anyway: NL2Bash, InterCode CTF, NYU CTF, and Cybench. Cybench, by the way, is interesting because the task lengths there range from two minutes to 25 hours.

So you're really covering quite a wide temporal range, with models from 2019 to mid-2025. And it's all the usual curve-fitting stuff. Check out our first podcast on the METR evals, where we did a deep dive into their methodology; that'll give you a good sense for how this is being assessed here. It looks like a five-month doubling time here, so it was six minutes today, but it's doubling every five months.
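As a back-of-the-envelope check on that extrapolation, with a calendar week as the target (illustrative arithmetic only):

```python
import math

current_horizon_min = 6            # ~six-minute tasks at 50% success today
doubling_time_months = 5           # fitted doubling time from the post
target_horizon_min = 7 * 24 * 60   # a week-long task, counted as calendar time

doublings_needed = math.log2(target_horizon_min / current_horizon_min)
months_needed = doublings_needed * doubling_time_months

print(f"{doublings_needed:.1f} doublings -> ~{months_needed / 12:.1f} years")
# roughly 10.7 doublings -> ~4.5 years, i.e. "about five years"
```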

So the five-month doubling time suggests that you would reach a week-long task within about five years. There are a lot of caveats here. I think one of the really interesting things to note, though, is that the time horizon here is so much shorter than the time horizon we see with the METR evals, right? The METR evals are showing us an hour and a half for o3, as we just talked about, and yet here we're talking about six minutes. So, yeah, what's the delta there?

Well, you know, part of it is the labs are directly optimizing for recursive self-improvement, right? This is not even a secret; this is just straight up what they will tell you in their blog posts, which itself is a super dangerous thing to do and shouldn't be done. But that's part of the roadmap, so there's optimization pressure pointed directly along that axis there.

You can think of the cyber capabilities, the offensive cyber capabilities, at least for these models that are public and publicly used, as being a side effect of optimization against the kind of core AGI benchmarks. And so that's one reason why you're not necessarily seeing the same impressive uplift here. But you're still seeing the doubling time. That's quite interesting, right?

It suggests some robustness to the broader trend of exponential coherence-length increases in these AI agents. Yeah, I think as with METR, similar caveats apply, even more so here, in terms of the estimates themselves of time-to-complete by a human not necessarily being reliable. And here there's only one person doing that. The task distribution is also more heavily leaning towards the short side. So yeah, a large majority of the tasks are between one second and 10 minutes.

And you do get a decent amount going up to one hour, but then beyond about two hours you get very few tasks. Yeah. So, yeah, very impressive effort by a single individual, but do take it with a grain of salt. Regardless, it's clearly the case that LLMs are able to do a bunch of cybersecurity stuff. All righty, just a couple more things, moving on to the policy side.

And we start with the US, where we've been dealing with the saga of the One Big Beautiful Bill, which just passed yesterday through the House and is going to President Trump's desk to sign. The One Big Beautiful Bill is primarily about the budget, about various tax cuts for the rich and various cuts to services by the government.

But tucked away in it there was a section that we covered previously that would have banned regulation of AI by the states for 10 years, I believe, which became a bit controversial once it was highlighted; it was then removed from the bill. The Senate voted 99 to 1 to remove this proposed 10-year moratorium on state-level AI regulations. So it's out.

And this article goes into the aggressive lobbying for the moratorium led by a16z and Meta and others, and, it must be said, seemingly OpenAI as well. This is quite interesting because of the case that has been made. So there's this guy, Adam Thierer, I wanna say, who's kind of famous in DC for having come up with the idea of a 10-year state moratorium on AI regulation.

His big claim has been to trot around this number where he says that there are over a thousand state-level bills that are imposing regulations on AI, and that this would create an untenable mishmash of state-level regulation that you would have to adhere to. And then of course there may be a federal package that comes through at some point, and this makes it impossible for small companies, as Andreessen Horowitz calls them, little tech, to compete in that space.

There's just a little problem: that 1,000 figure is, for all intents and purposes, embarrassingly made up. It basically seems to come from a search of a database of state-level regulation that just uses the term AI. So a lot of these things are just using the term AI, defining it, or even in some cases finding ways to advocate for it, to get it used in education and things like that, and don't actually introduce any meaningful constraints on companies' ability to use or develop AI.

And so it's sort of disingenuous, frankly, to talk about it as if there are a thousand things like this. When you whittle it down, it seems like the estimates I've seen are around 40 of these things that actually are material, and the majority of those will not actually pass either. So the number gets whittled down pretty fast, by about two orders of magnitude, which I think is significant.

The other piece of this too is the argument was historically that rather than having the states regulate this, we should regulate at the federal level, pass a law at the federal level, which sounds like a great idea until you realize that the federal government has been gridlocked on the issue of AI legislation forever, right? I mean, it's been five years since GPT-3. It's been three years since ChatGPT.

We're still in this endless cycle of having committee testimony and hearings and investigations and all this stuff, and it never really goes anywhere. And this is a recognized pattern in the Valley. People know, lobbyists know, that there's this gridlock at the federal level.

And so by saying, hey, let's preempt any state-level legislation for 10 years; by the way, OpenAI internally believes superintelligence gets hit within, like, five years tops, more likely three years, something like that. So the idea that at the state level there's no regulation for 10 years seems pretty insane. And then there's a basic question, or fact of the matter, that states are different, right?

I mean, California, which has OpenAI within its borders, which has a lot of big labs in its borders, versus Idaho, or Virginia, which has a bunch of data centers but no frontier labs. Obviously these states are gonna have fundamental differences in the way that they need to regulate and legislate AI. So it actually does make sense that you should have some freedom there.

Like, I'm old enough to remember when states' rights was a thing among kind of more libertarian-leaning people, such as, actually, you know, myself in this space. So yeah, it kind of seems like a weird play to try to strip away states' rights. Fortunately, and I think this is a reflection of just good education in the Senate on this issue, this was thrown out very recently.

They overwhelmingly voted against this provision, on a 99-to-1 margin, to rip out this state-level preemption. So that's a pretty remarkable defeat. By the end of it, I think Ted Cruz was sort of championing this thing forward using the China-scary line, which I actually take as well, right? You know, if you're tracking the work that we've done, nobody is more in the China-scary camp than we are.

But there's a kind of fundamental misunderstanding here of the role that state-level AI legislation can play in a context where there's nothing happening federally. Like, just being real about it, that's the consequence. And you have to imagine that's exactly what the companies that have been lobbying for this, especially Andreessen Horowitz, were thinking, and now the backlash.

You know, the problem is when you do something like this, it is so obvious to people that what you are trying to do is lock in a competitive advantage for the, you know, the OpenAIs of the world, that, yeah, now you're gonna get a backlash. What this looks like is exactly what it is.

It looks like Marc Andreessen stepping in at the federal level, trying to backdoor his way into putting some pretty extreme legislation on the table that's reactionary in just the same way as a lot of the burdensome regulation that's been proposed at the federal level would be reactionary in the other direction. And so I think that this could actually backfire in some pretty concerning ways.

You just gotta be more careful about this, and especially, you know, taking the temperature of the public on this. People are interested in regulating, and so this doesn't match the public perception. So I'll get off my soapbox, but I just feel like this is a bit of an own goal for the people who are looking for this sort of thing. Like, federal nullification of state laws in this way, without a federal framework in place, is straightforwardly unprecedented.

Like, this would never have been done. And so the bet here is literally that we are so confident that we don't want any state-level AI legislation over the next 10 years, as superintelligence may come and go, we are so confident that we won't need that, that we're gonna lock it in at the federal level. Like, that's some balls, dude, that's some real, real balls. I wish that were the case.

I think we need a little bit more flexibility on this, and, you know, Ted Cruz is doing his best and everybody is, but I think ultimately it just reflects a fundamental misunderstanding of the trajectory of the technology. Yeah. And worth noting, this is particularly important in the case of the US because there is no real federal regulation at the national level.

And there is not going to be any, at least until this president is out of office, just based on who he chose to be leading on the tech side. So effectively the states are the ones doing any sort of regulation. By the way, I'm a little skeptical; like, I think that the feds may well come in and regulate this, but the point is that they retain the option to, right? Mm-hmm. They're not saying, hey, we're not gonna put in any regulation for 10 years, blanket statement.

Like, that's what's insane to me, right? Yeah, it's like, let's enshrine this in the law. Like, what? Let's literally take options off the table, right? Sorry. Yeah, it just seems so, it's also a bipartisan thing. Like, Marjorie Taylor Greene famously came out and said, hey, I voted for this bill before I realized that this fucking crazy thing was in there. Now that I see it, I'm like, holy shit, I never would have. Here's her quote, right?

"I'm not voting for the development of Skynet and the rise of the machines by destroying federalism for 10 years by taking away states' rights to regulate and make laws on all AI." Like, this is a bipartisan issue; that's why it's 99 to 1 in the Senate. This is insane, to think you can backroom your way through this. I think this turns a lot of people off the kind of Washington lobbying treadmill here.

It just kind of seems to be exposed as an attempted, I won't quite call it a fraud on the American people here, but this is an undemocratic play that was attempted, and that's not good, you know. Anyway. Yeah. So, also just kind of weird that they tried to sneak this in in the budget reconciliation. There's nuance where the states would lose some federal grants if they go against it.

Also notable because, looking back so far, the biggest effort to regulate AI was last year with SB 1047 in California, as we've covered. That was defeated in significant part due to large lobbying by tech; it went as far as the governor, and the governor vetoed it. So this would basically have prevented that, right? And likely it's gonna happen again in California; there are efforts to do a revamped 1047, so it's quite significant in that context as well.

And by the way, just to make the point, if anybody is clutching their partisan pearls on this, right, this, again, is not a partisan thing. SB 1047 obviously passed the highly Democratic legislature in California, but it was vetoed by Gavin Newsom, right? Like, the most liberal governor, basically, in the entire country, under pressure from Nancy Pelosi, among other people at the federal level, writing in and saying it should be scrapped.

So this is a really fascinating issue in that it just crosses, nukes, all partisan lines. I love issues like that, by the way, because they prevent us from just seeing things through this lens that we want to see them through, you know, Republican versus Democrat. The reality is that's not what's happening. But figuring out what the right play is, anyway, I genuinely feel bad for anybody who's in the hot seat of having to make the calls on this.

They're hard to make, but surely preserving optionality ought to be part of our basket here. Yeah, we'll say, I dunno if Gavin Newsom is the most liberal, but he's in a very Democratic state, to be fair. Anyways, moving on, one last story, it's about Denmark. They are going to tackle deepfakes by giving people copyright over their own features. So they're gonna be amending copyright law to give individuals rights over their own body, facial features, and voice.

It's one of the first initiatives of its kind in Europe, it has broad support, and it is gonna take a little while; it's still being submitted for consultation and will be formally submitted in the autumn. It kind of relates to some efforts elsewhere; certainly in Hollywood there have been negotiations, but to my knowledge, not too much on the law side as far as copyright over your own appearance. Yeah, how you actually define the bounds on that, too, is gonna be really interesting.

Right. AI has a way of fuzzing the boundaries around everything. And so, you know, how much can you modify a face until it's not your face anymore? All this stuff. But yeah, really interesting, 'cause we accept makeup, right? We accept hairstyle differences, all this stuff. So at what point is AI an adornment versus a fundamental change of appearance? Anyway, some of the interesting philosophical questions we'll have to deal with.

And with that, we are finished with this episode. As promised, kind of a long one, lots discussed. So hope you enjoyed that. Thank you as always for listening, in particular if you make it all the way to the end. We also appreciate it if you share the podcast, if you review it, and so on. But more than anything, do keep tuning in.

Transcript source: Provided by creator in RSS feed: download file