#212 - o3 pro, Cursor 1.0, ProRL, Midjourney Sued

Jun 17, 2025 · 1 hr 46 min · Ep. 252

Episode description

Our 212th episode with a summary and discussion of last week's big AI news! Recorded on 06/23/2025

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Check out our text newsletter and comment on the podcast at https://lastweekin.ai/.

In this episode:

  • OpenAI introduces O3 PRO for ChatGPT, highlighting significant improvements in performance and cost-efficiency.
  • Anthropic sees an influx of talent from OpenAI and DeepMind, with significantly higher retention rates and competitive advantages in AI capabilities.
  • New research indicates that reinforcing negative responses in LLMs significantly improves performance across all metrics, highlighting novel approaches in reinforcement learning.
  • A security flaw in Microsoft Copilot demonstrates the growing risk of AI agents being hacked, emphasizing the need for robust protection against zero-click attacks.

Timestamps + Links:

Transcript

Intro / Banter

Hello and welcome to the Last Week in AI podcast, or sometimes the Last Two Weeks in AI podcast, where you can hear us chat about what's going on with AI. And as usual, in this episode, we will summarize and discuss some of last week's most interesting AI news. You can go to the episode description for the timestamps and links for all those stories. I'm one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at a generative AI startup.

Hey guys, I'm your other host, Jeremie Harris. Gladstone AI, AI national security stuff, blah, blah, blah, blah, blah. And yeah, we have a lot to get through this week because it's actually this past two weeks. This is one of those episodes where we missed one last week; that was on me. And now we're gonna do some catch up and see. Yeah. Jeremie, you seem to need to travel a lot. I'm starting to feel like you might be a spy going to Washington and retrieving AI secrets or something.

No, I mean, look, every once in a while you may hear what sounds like a Russian accent. Actually, it's funny 'cause you're the one with a Russian background. Well, but this is how spies work, Andrey. All right. They seem like they could not be less Russian, and yet here we are. So. Yet I'm not a spy. But you just have travel to do, to talk to people about AI. Yes, exactly.

News Preview

Well, we will go pretty quick just to give a quick preview. No huge stories in the past couple weeks in tools and apps; there's just a variety of announcements of somewhat significant releases, a lot of 1.0s or new versions of things, a new o3-pro. Applications and business: again, nothing huge, but some interesting developments on the chip side and on the OpenAI side. Then projects in open source and research: kind of, again, a variety of stories, no particular focus in this episode.

Policy and safety: we're gonna be talking about kind of a bit of interpretability and safety more so, and a couple of national security stories. And we'll actually have a synthetic media and art section, which we haven't had in a while, just because it's always at the end. But there are some new copyright lawsuits and some new partnerships that are interesting, so we'll go ahead and add that on to cover that. SAG back in the news too. It's been a while since we've seen them.

Yeah. Yeah. We used to, you know, last year there was quite a bit of it and we sort of just stopped. And now is a good time to mention some of that ongoing news.

Response to Listener Reviews

Before we dive in, do wanna acknowledge some Apple Podcast reviews? We appreciate your comments. Had a review to tell us to keep it up, please. Which I feel like we've been told this several times. So the encouragement is, appreciated, let's say. Yeah. And we will try to keep it up and make it as weekly as we can. Another positive review. Love the show. CapEx, CapEx, CapEx. Well glad some people were on board.

And we did have a pretty detailed bit of feedback from a three-year listener talking about us maybe alternating introductions, me taking the lead less, not always talking about the next story and setting it up. We just sort of wound up there; we didn't plan on this being the natural flow of the show.

So it might, I feel like it emerged organically. Like, it's funny 'cause I have the unfair advantage that while you're going through the kind of layout of the story, I get to think a little bit more about, yeah, look at my notes, be like, hey, you know, oh yeah, there's this thing. Because as you can imagine, we're covering, I mean, this week will be like 40 stories or something, every week. We're having to do research.

We have reams of notes on every single paper, every single news story. And so, I dunno about you, Andrey, but when we switch stories, I'm like in a scramble trying to figure out, what did I even think of this? Oh yeah, this is that paper. Okay. And so while you're kind of gracefully going through your intro. Yeah. And the secret is I'm actually just better at sounding prepared when I'm reading from notes, because you gotta load this into your RAM, you know? Yeah, yeah.

You gotta change context, and I happen to be all right, I hope, at pretending like I have an actual script instead of just rambling off based on them. It's a talent. Yeah. And I will say I think I am pretty good at segues, but anyways, we'll try out a bit more variation throughout. Andrey's really good at segues. And with that,

Tools & Apps

And with that, let's get going on the actual news, starting with the tools and apps section. First up, we have OpenAI adding o3-pro to ChatGPT, dropping the o3 price by 80%, and also mentioning that they're gonna delay the open-source AI model to later this summer. And that's pretty much the news. So o3 is their reasoning model. Now we have o3-pro, which is gonna be replacing o1-pro. It seems very good, starting to be on par with o1.

And the o3 price is getting cut by 80%, so that would mean $2 per million input tokens versus the previous $10. So, huge price drop. I mean, this was to me quite surprising. And yeah, o3-pro has, as you might expect, pretty nice performance on benchmarks, better than all the other offerings of theirs. So pretty big news.

So there's an OpenAI post about this, the model release notes on o3-pro, with some initial evals, right, giving you a sense of, like, how does it stack up compared to both humans and then compared to o1 and o3-medium. Against humans, it's really impressive; worth looking at the chart. Across everything, basically, you see a clean sweep where the model 64% of the time is preferred to humans.

That includes, by the way, personal writing and computer programming and data analysis. So really kind of spanning everything from things where you have a quantifiable reward that you can issue to things that are more qualitative. You're seeing superior performance across the board. And then some of the areas where we're seeing really significant improvements in benchmark scores: AIME 2024 going from 90 to 93% between o3-medium and o3-pro. That may not sound like a lot.

It may sound like 3%, but one way to think about it is, once you're already at 90%, there's not that many percentage points left to climb, right? So you would expect, like, saturating a benchmark is really hard. They just took a third of the remaining errors off the table with that. It's kind of similar with GPQA Diamond, the sort of PhD-level science questions, and Codeforces competition code. So across the board, again, this like universal improvement in these capabilities.

One thing that I hadn't noticed, to my embarrassment: there's a benchmark that they run, they call it the 4-out-of-4 reliability evaluation. I just wanna surface this because, like, it makes all the sense, and of course they're doing this, but I guess I hadn't yet explicitly remembered seeing this in writing. In this eval, you consider a model successful only if it correctly answers a question in all four attempts. So you try it four times on the same question.

And this is sort of a, you can see it becoming more important, this kind of evaluation, when we get into agents that are being deployed in higher-stakes scenarios. You wanna make sure that the agent consistently performs well, so that even if you test it and, you know, you get lucky or something, you don't overestimate its performance. And so anyway, I thought that was, again, one of these oddly simple things that I hadn't seen done elsewhere, or remembered done elsewhere. Exactly.

Usually you get pass@1 or pass@5, basically: do you nail it first try, or do you nail it after a few tries? And they do give those numbers, but they also give a 4-out-of-4 reliability evaluation, which, as you said, is not typically what you see in benchmark numbers. And compared to the pass@1 result, that is a less nice number. You get worse outcomes if you are requiring it to, you know, get it right four out of four times.

There is a performance drop, and in fact, in some cases, like GPQA, a pretty significant performance drop, but still o3-pro is beating all of them. And on the evaluations with human testers, o3-pro is preferred to o3, according to human testers, on scientific analysis, personal writing, data analysis, as you said, about 64% of the time on average. So, you know, o3 is sometimes about as good, but more often than not, o3-pro is preferred.
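To make the difference between pass@1 and the 4-out-of-4 reliability metric concrete, here is a minimal sketch, not OpenAI's eval code; the data layout (one list of graded attempts per question) is an assumption for illustration.

```python
# Minimal sketch of pass@1 vs. 4-out-of-4 reliability (illustrative, not OpenAI's eval code).
# attempts_per_question: one list of graded attempts per question, True = correct answer.

def pass_at_1(attempts_per_question):
    # Fraction of questions answered correctly on the first attempt.
    return sum(a[0] for a in attempts_per_question) / len(attempts_per_question)

def four_of_four(attempts_per_question):
    # Fraction of questions answered correctly on all four attempts.
    return sum(all(a[:4]) for a in attempts_per_question) / len(attempts_per_question)

attempts = [
    [True, True, True, True],    # consistently right
    [True, False, True, True],   # right first try, but flaky
    [False, True, True, True],   # wrong first try
]
print(pass_at_1(attempts))     # ~0.67: two of three questions right on try one
print(four_of_four(attempts))  # ~0.33: only one question right on all four tries
```

The reliability number can only be equal to or lower than pass@1, which is why it drops on benchmarks like GPQA.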

Next up we have Cursor AI editor hits 1.0 milestone, and there are some releases with it, including Bugbot and background agents. So Cursor is the integrated development environment, the programming tool that has become one of the leading contenders for being what programmers use to include AI in their workflow. So, a 1.0 release, probably, you know, not being covered in major news outlets, but kind of a big deal.

And Anysphere, as we've covered, now has a ridiculous valuation after really rising quickly last year. So with this 1.0 release, they released Bugbot, which is an automatic reviewer of pull requests on GitHub. There are also background agents in beta, which allow you to run these agents in a remote environment set up by Cursor.

So it's getting into the agent territory where the AI agent does coding for you, does work for you totally asynchronously, away from your prying eyes, and then delivers something to you to evaluate. So Cursor has had agentic coding for a while and they've been pushing it.

This is another step in that direction and lines up with other efforts like OpenAI's Codex and Jules from Google, where you do have these coding agents just work remotely and deliver results without the direct supervision that was the model for AI pair coding up until recently. I'm super curious about where this evolves from a security standpoint too.

Like, for context, the way this is working right now is that the agent will actually fork your GitHub repository and have its own branch, where it'll put out PRs, it'll review PRs, and all that stuff, as you said, fully in parallel on its own branch. So they have some notes about the security side. They're like, hey guys, just keep in mind these agents have a much bigger attack surface compared to existing Cursor features that don't look like this.

And they do say our infrastructure has not yet been audited by third parties. You know, you have here agents that have read/write privileges to repositories, right? So this is like God mode for your AI agent that is writing code. So if somebody can do prompt injection or data poisoning attacks or whatever on the agent, that could be a really big deal.

And if you're deploying this in like a production setting, this is a really interesting set of vulnerabilities that absolutely is gonna have to be addressed in the basic kind of design philosophy for these tools. By the way, we'll be talking about this later, but this is the same week that Microsoft has come out and announced a new vulnerability that was discovered in Copilot, sort of in the same spirit, with prompt injection type attacks. So like, all of a sudden we're realizing

you can't just deploy agents on all the things and assume that security is gonna look the same. So anyway, I think that Cursor is gonna be at the absolute forefront of this, because these agents have such intimate access to the code base and are able to work autonomously and in parallel. So I think we'll learn a lot about best practices.

They're gonna have to evolve really quickly because, you know, I mean, there's a lot of cyber attacks in conventional software, and with this, yeah, the sky's the limit. Yeah. And that's especially true if you're working open source with various contributors. Jailbreaks can be pretty subtle and can be quite weird, and agents are still kind of in development. So there could definitely be ways in which you can just tell it, delete all the code, you know, or something like that.

And on to the lightning round, where we have a couple of quick stories. First, you've got Mistral releasing a pair of AI reasoning models. So Mistral is the French AI lab, which has released a lot of open source models and has tried to compete with OpenAI, Anthropic, and others with big LLMs.

So they've released Magistral, their reasoning model, in two variants: Magistral Small, with 24 billion parameters, which is now available for people to download with an Apache 2.0 license, fully open source, and Magistral Medium, which is available on their Le Chat platform and on their API. Not as good as pretty much any of the leading reasoning models on evals, partially because they're smaller compared to something like DeepSeek R1.

But yeah, the general impression I get is people are not too impressed, but at the same time, it's nice to have another open source reasoning model for people to build on. Yeah, I continue to be sort of interested and confused about what the big picture game plan is for Mistral, other than to become the French champion that's subsidized by the French state to do French things.

But we'll see. The business model of just, like, pumping out your models as open source and then hosting them seems to be challenging for a lot of companies. We will see if that changes with RL. I'm sort of skeptical personally. But yeah, again, with these sorts of eval scores, it's really difficult to compete; like, the frontier is moving so fast. And the fact that they chose to release this model as well, you can read a little bit into that.

You know, like Facebook decided, or sorry, Meta decided not to release the biggest version of the latest Llama series because it apparently wasn't performing too well. That's the sort of thing you do if you have a kind of meh release. The fact that they did release this suggests maybe that they don't necessarily have a plan for blowing things outta the water anytime soon, so they might as well get the splash in the meantime. That's one interpretation that you could have.

We'll note that the 24 billion parameter scale is very popular. You know, it's like a good choice. I think that's something that Meta has struggled with: they just keep pumping out these giant models that nobody really wants to use. 24 billion, 32 billion, like, these are really good sizes for the kind of hardware that people like to run open source models on. So yeah. That's great. We'll see where this goes.

They certainly are the French national champion, and that's gonna be worth something, but yeah, they're in a challenging spot trying to compete on just head-to-head training of frontier models. And they seem to really be keen on, you know, really competing on every front with OpenAI and Anthropic. Last week they also released Mistral Code, competing with something like Claude Code.

So basically, on any given thing people are doing, at least on the LLM side, not necessarily the multimodal side, Mistral is trying to compete, and, you know, let's not count 'em out, but they certainly have a tough task to be able to do that. Next up, ElevenLabs, the provider of text-to-speech and text-to-audio models, has released their v3 model, Eleven v3, which is the latest in their text-to-speech models. It is able to do even more natural sounding outputs.

You can even embed tags like [sighs] or [excited] to get more expressive cues with nuanced delivery. And this supports over 70 languages. So yeah, text to speech, I think, is probably less visible to a lot of people than LLMs and image generation and video generation and so on. But it has really come a long way, and I think it's at a point where it will be very hard to tell if something is AI generated or not.

Yeah. And one of the things that's really interesting, it sort of reminds me, on the agentic side, of Anthropic's MCP, like the Model Context Protocol, or any of these hooks that people are building. As we learn about the structure of a given modality, we're learning here about, okay, what's the user-friendly way to allow developers to program text to speech, right? So you indicated one of the upgrades here, right? So you have these special [sighs] or [excited] tags.

The example, or one of the examples, they give here is "We did it!" with an exclamation point, and then in square brackets [happily], and then in square brackets [shouts], and then in square brackets [laughs], right? And this is the sort of affordance that you need as a developer. It seems obvious in retrospect, but somebody had to think of it and implement it, so that's really cool. Another similar thing is this idea of multi-speaker dialogues with realistic conversational flow.

So one of the challenges when you're making text to speech is like, how do you know, or how do you define the turns of each speaker? Make sure they don't talk over each other or make sure they do talk over each other if that's what you want.

And so they have a new Text to Dialogue API, where you send structured JSON that defines when each speaker gets their turn, and then the model automatically takes care of, you know, the kind of emotional shifts, the interruptions, the natural flow of that conversation through that lens.

So again, it's one of those things where, you know, you sort of don't realize you need it until you start to, you know, produce stuff with text to speech, especially on the entertainment side, or trying to make a real kind of natural conversational flow. So, really cool. And, as you said, a whole bunch of languages are supported. So yeah, I mean, ElevenLabs is still doing impressive things. Yeah, ElevenLabs is the market leader in this territory, so definitely worth knowing about.
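For a rough sense of what a multi-speaker request with inline audio tags might look like, here is an illustrative sketch. The field names ("inputs", "voice_id", "text") and the model id are assumptions for illustration, not the documented ElevenLabs schema; check their API reference for the real thing.

```python
# Illustrative only: the rough shape of a Text to Dialogue request with inline audio tags.
# Field names and the model id are assumed for this sketch, not taken from ElevenLabs' docs.
dialogue_request = {
    "model_id": "eleven_v3",
    "inputs": [
        {"voice_id": "host_a", "text": "We did it! [happily] [laughs]"},
        {"voice_id": "host_b", "text": "[sighs] Took us long enough."},
    ],
}
# The service would then handle turn-taking, emotional shifts, and interruptions
# based on the order of the entries and the embedded tags.
```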

Next, on text-to-video: ByteDance is getting into the competition with Seedance 1.0. It's their latest video generation model, trying to compete with Veo 3, the really pretty viral video generation model from Google. This one is able to generate five seconds of HD video in about 41 seconds, so it's pretty fast to actually do generation, and ByteDance is apparently planning to integrate Seedance into their platforms like Doubao for both professional and public use.

One of the big advantages that they have, of course, being the TikTok parent company, is access to tons and tons of video data. I guess this, you know, makes you wonder a little bit. I mean, (a) they're gonna be pilfering YouTube videos left, right, and center as well. It's not like that'll stop them.

Especially being a Chinese company, not that that's stopped OpenAI in the past, if you can remember, like, Mira Murati's sort of famous presentation snafu when somebody asked her, I think it was for Sora, right, where did you get that data? Did you get it from, like, YouTube? And she's like, I forget what she said, but she looked very uncomfortable, and it's pretty clear, or to many people it's pretty clear, that some stuff went down.

But certainly TikTok has front-row-seat access to an exquisite quantity of data. One of the interesting things they call out is that they can handle complex sequences with multiple camera angles and maintain character consistency throughout. This is, you know, part of that whole world model building thread that people have talked about quite a bit. You know, our text-to-video, our image-to-video models: are they world models? Do they contain world models?

One of the big questions, of course, is always, well, if they contain world models, they should be able to model real world physics. That includes things like object permanence it includes things like object consistency. And so this is sort of hinting at that, though we don't know much about the, the architecture itself.

And so, you know, maybe some of this is kind of baked in with inductive priors and it's not actually sort of learned per se; difficult to know. But certainly impressive, and the world of convincing AI generated video, I think it's fair to say, is just basically here at this point. Right. And unlike Veo 3, it is not able to also generate audio; that's pretty much only Veo 3.

So Google impressively kind of took the lead in the text-to-video world, and yeah, I think it's good to call out, but most likely it's because they have YouTube and they can just train on all of YouTube and nobody else can, and ByteDance might be able to compete for that reason. Well, and the audio too is no small thing, right?

Like, we're entering this world where we're getting positive transfer as these models are trained on more and more modalities, and video and audio are so causally intertwined, right? Like, you imagine trying to make a world model, literally, like if you're deaf: you look at the world, you can create world models, but you can learn faster about the world if you also have the ability to hear. And especially for AI systems, just given that, you know, these are not trained with RL.

They can't go out into the world and interact with things. Having that extra modality to kind of cross-correlate physics, and you see when somebody's mouth opens and the sound tends to come out, it's like, okay, that tells you something about the kind of function of the mouth and the physics of it. You know, same with car crashes and the sounds that come from that.

So anyway, I actually expect that the inclusion of audio in a single, almost monolithic base model, if you will, is gonna be a really big deal for everything from prompt adherence to world model development. And speaking of Veo 3, Google also had an announcement. They are revealing a $20 AI Pro plan to let people use Veo 3 more, and they are releasing Veo 3 Fast, which is able to do faster generation compared to Veo 3. Veo 3 is fairly slow to use.

It takes, I forget exactly, but, you know, a couple minutes. So this allows you to take, let's say, less than a minute. And now Gemini Pro subscribers can create up to three videos daily using Veo 3 Fast. And it's definitely seemed to be the case that the servers and GPUs from Google are pretty slammed by people trying to use Veo. A lot of it wasn't working, so I wouldn't be surprised if this was rushed into production to keep up with demand.

Yeah, and I mean, I continue to tap the sign that someday fairly soon we're gonna be able to generate one second of video for each second that you wait. In other words, you're gonna be able to generate video as fast as you can prompt it to be generated. Once we cross that threshold, there's gonna be excess compute on the generation side, which I would expect to start to get dedicated to addiction.

So, you know, imagine your, your TikTok feed, but if you've got biometric data coming in through, for example, the camera, or even just your interactions with the app that cause the video to be modified in real time based on what you're seeing, there's like a very dark rabbit hole for where this ends up going, ultimately with the abundance of compute. that threshold's gonna be very critical. I think almost from a, a societal level in terms of how we even think about these apps.

It's not unlike what the ability to generate fresh apps from scratch based on prompts is doing, right, where apps themselves suddenly become this malleable thing. Well, this is sort of similar, but for manipulating pixels on a screen to kind of stimulate you. It's not clear what happens when the optimization process that's running in the back end of these systems operates as quickly as the human biophysical response cycle.

That's, I think, a very, very interesting phase that we're getting to. And we're gonna see a lot of interesting phenomena, psychological and otherwise, emerge from it. Yeah, I think you could say this is similar to where agents were last year, in the sense that we were talking about agents a whole lot going back definitely into 2024, but it took until really the last couple months for agents to really mature and make a huge impact.

Now, with things like Cursor and Claude Code. I think video's in a similar spot, where you're starting to see tools like Flow, like a more easy-to-use pipeline to not just prompt it, but actually build something with it. And I think in the coming months, you know, we will start seeing that actually not just be used for memes, but actually have an impact on workflows and so on.

Applications & Business

And moving on to applications and business. So we start with this really interesting story: OpenAI and DeepMind are losing engineers to Anthropic in a one-sided talent war. So there is this venture capital firm called SignalFire. They came out with their 2025 State of Talent report. And they basically look at, like, okay, what's the rate at which we're seeing employees leave OpenAI for Anthropic versus the rate at which we see employees leaving Anthropic for OpenAI, right?

So which direction is preferred? So when it comes to OpenAI and Anthropic, OpenAI employees are leaving for Anthropic eight times more often than vice versa. At DeepMind, the ratio is 11 to one in Anthropic's favor. So for every Anthropic employee who leaves Anthropic to go to DeepMind, 11 DeepMind employees are leaving DeepMind to go to Anthropic. That's pretty insane.

There's all this kind of interesting speculation, by the way. So Anthropic's retention rate is like 80% for employees hired over the last two years, which in tech is pretty wild. Like, I get it, in the kind of standard world that doesn't sound too, too impressive. Like, oh, you're still in the same company you were at two years ago 80% of the time, that sounds about right. In AI, that is fairly unusually high. OpenAI's retention rate for two years,

by the way, is 67%, which is aligned with what you see at Meta, for example. So there's all kinds of people kind of tossing around ideas about why this might be. One of the oft-cited hypotheses is, like, Anthropic is just sort of coming out of nowhere, they've got the best coding models, that's just really exciting to work for them, blah, blah, blah.

I think that this actually misses the core point, which is Anthropic was a company founded on a very clear principle, and it has stood by, for the most part, those principles. You know, it was founded by these OpenAI policy and safety and some pre-training researchers who left essentially in protest, I mean, this is essentially an open secret now, over OpenAI's sort of attitude and approach to alignment, technical safety, and policy.

OpenAI, or Anthropic rather, seems to have walked the walk on a lot of their policy stuff, pushing back on this pretty ridiculous idea of, like, banning all state-level AI regulation for 10 years that was snuck into the latest big, beautiful bill. Anyway, OpenAI seems to have been pushing for something pretty aligned to that, at least in their policy work.

So a lot of this is like, you've got an entity where the leadership says something and then they actually kind of act on it. And there's also a lot of kind of open discourse, like, when you talk to folks who work at Anthropic. I've spoken to a lot of people at OpenAI who I would call whistleblowers, who are like, I'm really concerned that the leadership is talking through both sides of its mouth. I have never had a conversation that feels like that with an Anthropic employee.

The OpenAI ones that we spoke to in our investigations in the past were often, like, really tense. You could sense that they did not want you to tell anybody that we'd spoken, anything like that. Whereas at Anthropic it's kinda like, yeah, you know, I might have a disagreement with leadership, but you get the sense it's the sort of thing that they would have out anyway and have spoken to leadership about, and sort of reasonable people can differ.

So I think that's an underrated factor in all this, just the cultural difference. And I think that's leading the best researchers to flock to Anthropic. And that in turn is, in part, the causal element behind Anthropic's great success with its coding models. So I think, you know, it's not all that, but this is a kind of missing element in at least some of the analysis on this issue, just sort of from what I've seen.

Right. And I think, you know, to complement that, the dynamics of OpenAI and Anthropic competing are very different from the dynamics of DeepMind and Anthropic competing, where, at DeepMind, if you are preferring to go to Anthropic, it is likely because you don't like big company politics. Yeah. Yeah. And you don't like a lot of the bureaucracy that has been introduced to review whether you're allowed to publish your research or whether you're able to contribute to Gemini development, for instance.

You know, not really a surprise. DeepMind has been around for a long time. It's now officially part of Google. There's been a bunch of reorgs and so on. It seems to be really kind of in a bit of bad shape in terms of being organized. So in that sense, it's not crazily surprising. I think also DeepMind was quite big and Google has been quite big, so I wouldn't be surprised if Anthropic just has fewer people to lose, to be honest. Yeah, I think that's a big factor.

And the other thing is, I mean, Google and Anthropic have a partnership, right? So you're not quite leaving the nest in the same way when you move from one to the other. Google's made massive investments in Anthropic, right? Along with Amazon, they're basically the two main backers. And certainly, you know, Google TPUs are a huge part of Anthropic's fleet and strategy. So I think that kind of makes a lot of sense.

Given that Anthropic, you know, budded off of OpenAI, it kind of, anyway, it sort of feeds into that narrative of disillusioned OpenAI folks leaving. The other thing, by the way, the money side, is interesting, right? This article goes into some pretty wild numbers. So they talk about how some OpenAI researchers can earn more than $10 million a year. They're putting together counteroffers to stop OpenAI employees from leaving

for other companies like Anthropic, like Safe Superintelligence, and these include $2 million retention bonuses. So just, like, a one-time bonus, $2 million, please don't leave. In addition to, this is insane, equity increases of $20 million or more. Please don't leave me, here's a crap ton of money. Like, this is a lot of money to be throwing at people just as a retention bonus, basically. Yeah, it would've been nice to study LLMs when I was in grad school.

Also worth noting in this report, we won't go into it too deeply, but it does also focus somewhat on entry-level tech jobs, and that's in rough shape. Yeah, it's increasingly looking like, you know, CS in general has seen a huge rise in undergrad enrollment over the past decade, and for a while it was sort of the sure path to a good job and good earnings.

Now, as a fresh grad, it's much tougher to get hired than it used to be, and the number of positions seems to be smaller, and I would not be surprised if AI has a large role in that, in addition to economic conditions and so on. A hundred percent. I think we're in this interesting position where a lot of people, you can still tell yourself the story that, oh, it's, you know, it's because of tariffs, it's because of the economy, you know, things like this.

But I'll tell you, I mean, I had a conversation with a very senior person at one of the top labs, and what they were telling me was: we are no longer hiring entry-level software engineers. We don't expect ever to do that again. And in fact, we don't think we'll be hiring anyone with less than 10 years of experience ever again. And when you hear that, it just makes it real, where it's like, ah, this is where it's coming from.

Like, and you know, this is a lab that already is seeing the majority of its code base written by AI, which, that's not surprising to us, this is something we've been covering for a long time. But I think you have to kind of sit back and absorb that reality: the job of software engineer, the job even of AI researcher, is getting more and more abstract and further away from, anyway, many of the activities that used to define them. And that just makes it, I mean, it's brutal.

Like, we're headed for a situation where white collar work gets automated pretty hard, pretty fast. And there's social unrest that will come with that, I mean, there's no two ways about it. We've got a very interesting transition we're gonna have to navigate gracefully. Yeah. And it is happening quite fast. So, you know, 2023, 2024, 2022 to some extent, we saw a rise of intelligent AI assistants in things like Copilot and Cursor.

And that had a massive productivity boost; you're twice as productive, three times as productive. With these agentic tools like Claude Code, which are now working well, it's getting to a point where you barely need to touch code. As a software engineer, what you need to do is be able to tell the agent what to do and to inspect what it's doing, to verify that it's correct. And that's not what an entry-level position typically entails.

So it's changing fast, and yeah, it's worth being aware of that. And moving right along, another, I guess, OpenAI story, not that the last one was all OpenAI: OpenAI slams court order to save all ChatGPT logs, including deleted chats.

So essentially what's happened is there was a court order that came in and said, look, OpenAI is being accused of essentially serving as a platform that allows users to get around paywalls and access, you know, news and, like, New York Times articles and things like that.

And what's more, we suspect that users are gonna be deleting the evidence of that, so that if the court actually requests records of people's use of the tool, they're not gonna actually show these violations of copyright and all that stuff.

And so the New York Times argued for the court to prevent OpenAI essentially from deleting or discarding information about ChatGPT logs that otherwise would have been deleted, including records that users have tried to delete. Right. So this is, OpenAI is calling this out as basically a way of preventing OpenAI from respecting its users' privacy decisions. It essentially puts OpenAI in this awful position where they are at risk of breaching their own privacy agreements.

Which, you know, huge trust issue, but also, I mean, it could put them in breach of contracts and global privacy regulations, all kinds of stuff. So this is really messy. I mean, I can see OpenAI's argument here, that to just lurch out and do this seems like a strange strategy. But, you know, I'm not a lawyer, so hard to know. There's so little precedent in general on cases like this, but yeah. So, the idea of using ChatGPT to skirt paywalls:

it does sound plausible, I guess. But the question is, you know, how do you actually manage that? Is the best way to force essentially a kind of de facto privacy violation onto OpenAI users? I don't know what the answer is, but this is the state of the debate anyway. Right. And OpenAI even released a blog post, "How we're responding to The New York Times' data demands in order to protect user privacy,"

where they frame it as a privacy question, as kind of a commitment to their customers, and address, for instance, that there are business customers that use zero data retention APIs, where the chat logs aren't gonna be kept. But OpenAI has had this interesting pattern of releasing blog posts in response to legal drama, and this one is very much along that line; it has a lot of notes in response to it. So OpenAI is a little salty and not a fan of this court order,

clearly. Next up in the lightning round, we are starting with a story from The Information, which typically has far more cutting-edge or, let's say, less public information. And this one is saying that Nvidia's biggest Chinese rival, Huawei, struggles to win at home. So this is pretty much an analysis as to what extent Huawei is able to beat out Nvidia in terms of providing chips.

And it seems to be that so far Huawei is unable to get the biggest tech companies in China to adopt their chips for AI training and inference. Yeah, this is actually a really interesting story, because the story that the Nvidias of the world have been propagating, that, anyway, a lot of kind of anti-export-control people have been propagating, is that, hey, you know what, if we withdraw from the Chinese market, Huawei is just gonna dominate it.

And it just creates a whole bunch of economic wind in their, in their sails. And this is not entirely wrong, but there's an awful lot kind of missing in that analysis. So, one key thing to keep in mind. Huawei does not have access to the most exquisite fabrication processes that are available to Western companies thanks to TSMC, which is based in Taiwan of course.

So TSMC can help you fab down to three nanometers now, and we'll have chips that, you know, come off the production line using the three nanometer process in the relatively near term. Huawei can only use the domestic, the Chinese analog to TSMC, which is SMIC. SMIC is roughly speaking, stuck right now at seven nanometers, maybe arguably working on five. So it's, it's forced to use a subpar fabrication process. Huawei designs the chips, and then they send them to SMIC for fabrication.

The problem is you can only do so much when you have fundamental limitations on your design process. In particular, if you look at the Huawei chip series, what they will tend to do is they'll be very energy inefficient. If you want to get very energy efficient chips, you have to use more advanced processes. So we talked about how Huawei's been working around that. They just set up this CloudMatrix 384, which is like their

computing system that bundles up a bunch of their Ascend chips together in a way that is designed to just say, okay, our individual chips may be crappier because they're fabricated using a weaker process, but we can just string a bunch of them together, like, build larger systems with larger data centers.

And because China is swimming in energy in a way that America just isn't, America's energy-constrained, China's chip-constrained, China doesn't really care about the energy efficiency of the chips that much. They can just put more of them together and achieve the same scale. And that's really what they've been doing. The catch, though, is overheating. If your fabrication process is bad, if you're gonna basically, like, overpower your chips and just

pour tons of energy into them, then the chips will overheat and you will see problems. That's exactly what seems to be going on and what seems to be hampering a lot of Huawei's sales activities. The Ascend chips also, by the way, don't have direct support for low-precision number formats like FP8, which notably is what DeepSeek uses.

So Huawei, literally, like, their chips cannot support DeepSeek-style training runs, which is why DeepSeek has been using Nvidia technology and why the demand for it continues. One last factor that's really important to keep in mind is that Huawei competes with a lot of their customers. Think about ByteDance, Alibaba, Tencent, right? These companies, they're all looking into Huawei chips. They haven't made big purchases. Part of that is because a lot of them run their own clouds.

Huawei runs its own cloud too. And so, are you really gonna buy from your competitor? I mean, this is the reason, if you go back to our hardware episode, this is the reason that pure-play foundries are a thing, right? It's why Intel, for example, historically struggled to attract chip designer customers: because they also were designing chips, and so you're sort of buying from your competitor.

What the market fundamentally wants is a separate foundry, a separate designer, and then ultimately a separate cloud company. And, you know, it's not a coincidence that Nvidia isn't so much in the cloud market. They could be if they wanted, right? They could make big clouds. You could have Nvidia right up there with GCP, with Azure, with AWS, but they're not doing it. Part of that surely is going to be competitive reasons.

Let's just have people buy our chips and, you know, reduce the barrier to entry on that as much as we can. And anyway, so Huawei is in a more complex situation than I think a lot of analysis historically has acknowledged. We'll see where it ends up going. And they are a national champion, so the CCP can always force people to buy from them. But it's an interesting scene. Right.

And also mentioned in this article, and I think worth noting: some companies like ByteDance and Tencent have significant business outside of China, and the US is cracking down more and more and has issued guidance that basically says, don't use Huawei chips. So if you are a more globalized company based in China, that's even more reason to prefer Nvidia over Huawei. Our next story is sort of related, actually.

Huawei is expected to break semiconductor barriers with the development of high-end three nanometer GAA chips, taping out by 2026. Okay, so GAA is gate-all-around. This is a transistor design that is becoming really popular. It's a way of essentially making the transistors that form the critical circuits, the number-crunching circuits on a GPU logic die, more energy efficient, with higher throughput, all kinds of desirable thermal properties, et cetera.

So essentially what's happening right now is the three nanometer process that, for example, TSMC has developed does not actually plan to use GAA. So it's not gonna be a gate-all-around process. Huawei is accelerating towards GAA, that's the plan here, essentially skipping a generation, which you kind of have to do if you're the underdog and trying to catch up. But the challenge is, right now it's not really clear that they can pull this off.

You know, their seven nanometer, their five nanometer, even the seven nanometer process that they get through SMIC, that sort of Chinese TSMC we just talked about, has really bad yields. Seven nanometer yields are somewhere between 15 and 50%, whereas, I mean, the industry standard is like 90%. Anyway, so there are major economic challenges, but if they can somehow do that, that would be really interesting. It would be a big leap.

The only other gate-all-around-focused design for three nanometers is being done at Samsung Foundry. So this would literally be the first such process outside Samsung Foundry, if in fact it is non-Samsung, if they're doing it through SMIC, which again would be kind of weird. It's also possible this implies a collaboration with Samsung Foundry, which would be really weird, because Samsung is of course based in South Korea. So this would, you know, be interesting from an export control standpoint.

You know, can this actually work? But anyway, Huawei has been known to make optimistic kinds of pronouncements about the future of their technology, hey, we'll have all these exciting things, that don't quite end up taping out, if you will. We'll see. But three nanometer gate-all-around would be a big deal if Huawei can actually crack it. Yeah, not much to add.

All I'll say is, if you Google gate-all-around and look at the images, there are some really fun illustrations and electron microscopy images, and you get a feel for how these poor computer engineers and semiconductor experts need to go 3D and build these elaborate structures now just to be able to get into these low nanometer regimes and actually make chips work.

And speaking of that, next we've got a story about TSMC and their 1.4 nanometer process, which is called Angstrom and is making progress. It's still not out; it's expected to be available by 2028. And according to the story, it's estimated to cost $45,000 per wafer, a 50% increase over the two nanometer process, which is $30,000 per wafer. So yeah, that's pretty much it.

It's gonna be very expensive to use the really lowest, like, most high-density nodes that are coming online in the coming years. Yeah, so 1.4 nanometer, they're calling it Angstrom, which is, like, you know, slightly frustrating because it's not quite an angstrom, is it? But that's cool, this is the next beat. Yeah, 50% more expensive. Apparently 2028 is gonna be the earliest production run.

So if AI 2027, that sort of famous blog post, ends up being wrong and 2028 ends up mattering, we'll probably see in 2029 some pretty impressive rollouts of the next generation of node and the chips designed on it. And this is, by the way, my assessment: if there's a company that would want a first crack at this Angstrom process, it would be Apple.

I would just say, we've been saying this on the podcast, do not take your eye off Nvidia, which, by the way, is literally the world's most valuable company right now. As AI chips become more and more valuable relative to phones, expect at some point that Nvidia starts to make moves to compete for the leading node, to essentially buy out, like Apple does, all of TSMC's capacity and kind of become the subsidizer of choice for TSMC for their leading nodes.

I actually think that could happen sooner rather than later. There are indications it's already sort of in the works. So anyway, that would be a pretty significant shift in tech, and the day that happens, we'll definitely be talking about it here. Fun fact: an angstrom is 10 to the negative 10 meters, or 0.1 nanometers. So as you said, not really an accurate name at all, but yeah. Yeah. Sounds good. Sounds fun. And last story,

coming back to Mistral: they're launching Mistral Compute, which is a cloud offering of compute for AI that is gonna try to compete with other offerings. I suppose these days AWS is still one of the leading ones; you also have newer competitors in the space like Modal. So Mistral, again, continuing to try and, kind of on every front, provide a European competitor to offerings both in China and the US.

And they are coming at this from a position of less money, less talent, you might expect or might argue. So we'll see. The main kind of analysis of where their advantage is, I think, and I agree with you, is their position as a European leader in the space. Yeah. Yeah. And in particular, it's no small deal that they're based in France. You know, you think about what are the big bottlenecks. We talked about this, right? In the United States, it's energy, right?

Everybody's trying to figure out, where can I find a spare gigawatt on the grid? It is not easy. You know, even 30 megawatts, like, you can find it, but it's going fast. And so in France, really the only European country, the only Western country, that's been doing nuclear this whole time, where they can actually build new nuclear plants in less than 10 fricking years, you know, they can support this, and now they're reaping the benefits.

The scale that's being talked about here for Mistral Compute, by the way, is tens of thousands of GPUs, they say, built on Nvidia reference architectures. And so I assume that they must be looking at this point at, like, GB200s, you know, tens of thousands of those, I assume. And they're saying that they'll be supporting workloads ranging from defense to drug discovery. Okay, national champion much, right?

This is the kind of workload that smells a lot like, you know, preferred partner of the French government. Which, by the way, also matters from a red tape standpoint: if you're trying to set up a new scaled data center, not only do you have the massive energy supply that the French enjoy, but you also have the support of the government to cut red tape, especially environmental regulations, which allows you to get things up and running faster.

And so these things do sort of stack up in, in very interesting ways to like compete another day, let's say. But I think their fundamental challenge is gonna be capitalization, right? That's always how it's gonna be. you can't compete forever with companies that will raise, you know, tens of billions of dollars on a hundred billion dollar valuations. Like not even taking that much of a liquidity hit and raising from sovereign wealth funds and this and that.

It just does become really challenging. And, you know, the French economy just isn't that big. So yeah, if I were France, this is what I'd be doing. But that doesn't mean that they necessarily have a winning hand. Yeah, as you said, in this blog post of theirs, they are literally saying the offering will include Mistral's AI training suite, which can accelerate region- and domain-specific efforts across nation- and industry-wide endeavors.

So yeah, calling out some of that champion kind of stuff. And I will say it's a little bit different; OpenAI and Anthropic, they're not offering this much of a cloud kind of architecture for training and serving and whatever else. Yeah. And it is rather specialized; I would assume this came out of their own setup for compute to be able to do this. So I do think there is a decent chance that they have some good technological aspects here that might make it actually quite a good product.

Projects & Open Source

And next up, moving to open source, we have one story: ProRL. And for whatever reason I keep saying ProPL every time we talk about it offline. ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models. Bit of a mouthful, but hey, aren't they all? So there's this idea that the RL process itself just optimizes existing capabilities in large language models.

Basically, it's like you have your pre-trained model and it already kind of has all the capabilities that a reasoning model should have, and your reinforcement learning process just elicits those capabilities; it bubbles them up to the surface, right? So what they're after here is to show that actually that's not the case, that what we can do is imbue the model with completely, genuinely new capabilities that were not there before.

And they have a couple of ideas that they stack together to just, like, optimize the reinforcement learning process. One of which is this idea of a Kullback-Leibler divergence penalty. So this is essentially a way of measuring how different two probability distributions are.

And so what's often done during training is you'll have a model that's being trained and you'll have some kind of reference model, where you don't allow the model under training to deviate too much from the reference model. The reason for this often is that if you just let the model go hog wild and get trained on its own into whatever it will end up being, that model will learn to kind of optimize very narrowly and unhelpfully over-optimize to the objective that it's being trained for.

So in the limit, the classic example is, if you let these models get fine-tuned for too long without a kind of regularization, they'll end up, like, no longer speaking English, or they'll end up, you know, kind of really rigging the reward, becoming sycophantic, or whatever. And so you just have this reference model to keep pulling it back to reality. And there've been arguments that this KL divergence penalty is a bad thing, that you actually should just get rid of it.

A lot of those arguments are based on looking at base models, like, before the supervised fine-tuning stage, in the context of reinforcement learning. And what you find there is their performance actually doesn't get so good if you keep enforcing that they have to be similar to the reference model. But what they're showing in this paper is, actually, if you do supervised fine-tuning first to let the model get good enough at reasoning,

at that point, if you then use that as the reference model, you actually do find that the KL divergence strategy, that regularization strategy, makes sense. So that's one thing they did. They also did this thing called reference policy reset. So as you train your model, again, you've got that reference policy, so it's not allowed to deviate too, too much. But then you'll update your reference policy to match whatever the model under training currently is, and then you'll proceed.

So you're basically using the reference policy as a kind of drag on the model under training. The model under training does a bit of training, it can't deviate too much, but then you update the reference model and now you can start training again, and you can deviate a little bit more, but not too much from that one. So it's a way of sort of slowing down the deviation from the reference model, but not so much that you're eternally locked into the original reference model.

And that turns out to help a lot with training stability while also allowing you to kind of recover a lot of these new capabilities that come with reinforcement learning. And so they have a huge dataset of a bunch of different STEM, logic puzzle, and instruction-following tasks. It's like 136,000 problems in math and code and all kinds of stuff.
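As a rough illustration of those two ideas, the KL penalty toward a frozen reference model plus the periodic reference policy reset, here is a minimal PyTorch-style sketch. The hyperparameter names (beta, RESET_EVERY), the single-action simplification, and the training-loop shape are assumptions for illustration, not the paper's actual code.

```python
import copy
import torch
import torch.nn.functional as F

def kl_regularized_loss(logits, ref_logits, actions, advantages, beta=0.01):
    """Policy-gradient loss plus a KL(pi || pi_ref) penalty toward a frozen reference model.
    Treats each example as a single sampled action for simplicity."""
    logp = F.log_softmax(logits, dim=-1)          # (batch, vocab)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # Log-prob of the sampled tokens under the current policy.
    action_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages * action_logp).mean()
    kl = (logp.exp() * (logp - ref_logp)).sum(-1).mean()
    return pg_loss + beta * kl

RESET_EVERY = 500  # steps between reference policy resets (illustrative value)

def train(policy, batches, optimizer):
    ref_policy = copy.deepcopy(policy).eval()
    for step, (tokens, actions, advantages) in enumerate(batches):
        logits = policy(tokens)
        with torch.no_grad():
            ref_logits = ref_policy(tokens)
        loss = kl_regularized_loss(logits, ref_logits, actions, advantages)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (step + 1) % RESET_EVERY == 0:
            # Reference policy reset: the KL "drag" now follows the current policy
            # instead of pinning training to the original SFT checkpoint forever.
            ref_policy = copy.deepcopy(policy).eval()
```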

They also have an enhanced version of this GRPO algorithm, which you might remember from our discussions of DeepSeek; it's become really popular, just sort of a way of stabilizing reinforcement learning training. It quickly gets into the weeds, but yeah, bottom line is they're borrowing a lot of stuff from other papers like DAPO, which brings in things like dynamic sampling to the policy optimization.

There, you're basically filtering out prompts to only keep the ones where the model sometimes succeeds and sometimes fails, so that they're, like, hard enough that the model's gonna learn something by training on them, but not so hard that it's just hopeless and the model never even gets a reward signal. So there's all kinds of shit. It's actually quite an interesting collection of shit. The shit links together in interesting ways to make a little shit chain, and together that is ProRL.

Not how I would've described it, but okay. Yeah, some interesting analysis in this paper. It's a family show. Yeah. I don't know how many kids enjoy Last Week in AI. That's right, I hope not many. Yeah. There's some analysis about the question of ProRL eliciting new reasoning patterns or not.

They basically make the point that there are tasks on which the base models are already pretty good, and there the gain is not significant, but there are other tasks where the gain is significant if you train long enough. And I just wanna call out, we're not gonna be going into detail on the story, but alongside the Magistral model, Mistral did release a report on it, a pretty detailed, like, 18-page paper.

And they did also highlight some differences in their loss for GRPO, including the elimination of KL divergence as a penalty, and some other stuff. So, very much a lot of exploration going on into the right setup for RL training, including the loss, and RL in general is a big headache. So I guess it's not surprising that there's a lot of things that are being figured out, previously and even now, as people are diving into RL as a very prominent research direction.
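Before moving on, here is a minimal sketch of the dynamic sampling idea mentioned a moment ago: keep only prompts where repeated rollouts give mixed outcomes, so every batch carries a usable learning signal. The function names and the k=8 rollout count are illustrative assumptions, not from either paper.

```python
# Illustrative sketch of dynamic sampling: drop prompts the model always gets
# right (no learning signal) or always gets wrong (hopeless, no reward signal).
def filter_prompts(prompts, rollout_reward, k=8):
    kept = []
    for prompt in prompts:
        # rollout_reward(prompt) samples one completion and grades it: 1.0 correct, 0.0 wrong.
        rewards = [rollout_reward(prompt) for _ in range(k)]
        if 0 < sum(rewards) < k:  # keep only mixed outcomes
            kept.append(prompt)
    return kept
```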

Research & Advancements

Next up, research and advancements. We begin with Kinetics: Rethinking Test-Time Scaling Laws. So there is a new proposal for test-time scaling that incorporates memory access into the calculation of the cost. So this is a different way to calculate the scaling law, basically, for test-time scaling. And in this new way of evaluating the scaling with updated cost, they argue that prior scaling laws have overestimated the effectiveness of small models that use inference-time strategies.

They're basically saying that increasing model size, up to 14 billion parameters, is more effective before applying test-time strategies like best-of-N sampling and chain of thought. So basically, instead of running your model more after training, for smaller ranges of models, in like the 10 billion range, just make your model bigger instead of doing more inference on it if you can. Yeah. This is a really interesting kind of compute-aware,

or rather memory-bandwidth-aware, way of doing things. So historically, when we talk about scaling laws, right, you'll see these plots, you know, what do they look like? Well, you usually have FLOPs, like computing budget, on the x-axis, and you'll have some measure of performance on the y-axis, and then you'll see your nice little log plot and everything is good.

The problem is that FLOPs, like the actual mathematical operations that go into training a model, are only one part of the hardware picture, right? So GPUs, yes, can crunch a lot of numbers really fast, but they also have to move data around, right? Like, that's one of the most time-consuming things.

One of the big bottlenecks now is just like, how fast can you move the data around, not just crunch the numbers, but shift it from memory to logic and back, and then to other memory and things like that. And so what they're trying to do here is redesign a scaling law that accounts for two metrics. One is FLOPs, as in the traditional compute scaling curves, but also memory bandwidth.

Or, really, sort of memory access cost, which accounts for the bytes of memory that need to be accessed, the memory picture, right? And so they're actually gonna combine them both into one metric, they call it the eFLOP or eFLOPs. And essentially, mathematically, it's the computational cost of running the model plus the memory access cost that accounts for the memory bandwidth requirements and other things that go into it,

times the intensity, which is a hardware-specific ratio of compute capacity to memory bandwidth. Basically, as you can imagine, this would depend heavily on your hardware fleet. Like, what your hardware actually looks like is gonna determine, in practice, what your ideal number of parameters should be, what your ideal architecture should be.
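As a back-of-the-envelope illustration of that kind of metric (the paper's exact formulation may differ), you can fold memory traffic into FLOP-equivalents using the hardware's compute-to-bandwidth ratio; the B200-ish numbers below are rough public specs I'm assuming for illustration.

```python
# Back-of-envelope cost in "FLOP-equivalents": compute FLOPs plus memory bytes
# converted via the chip's intensity (peak FLOP/s divided by bytes/s of bandwidth).
# The hardware numbers are rough, assumed values for a B200-class GPU.
def eflops(compute_flops: float, memory_bytes: float,
           peak_flops_per_s: float = 2.2e15,             # ~dense BF16 throughput (assumed)
           mem_bw_bytes_per_s: float = 8.0e12) -> float:  # ~HBM bandwidth (assumed)
    intensity = peak_flops_per_s / mem_bw_bytes_per_s     # FLOPs "paid" per byte moved
    return compute_flops + memory_bytes * intensity

# A memory-heavy decode step can dominate even when its raw FLOPs look small.
print(eflops(compute_flops=2e9, memory_bytes=5e8))
```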

And so this is part of the reason that scaling laws, by the way, were always framed in terms of FLOPs, because the moment you try to balance FLOPs and memory bandwidth, pretty soon you start to almost simulate a data center. And you're gonna have to have all kinds of resolution, and that just makes it really hard, not least because

then people will go, okay, well that's how it plays on that data center, but what if I changed my data center around? Now we've got a different scaling curve, and it just becomes impossible to do apples to apples. That, in fact, is one of the challenges with this paper. It only uses a kind of reference architecture associated with the Nvidia B200 GPU. So they are assuming those specs hold, and you're seeing the scaling laws for that.

It does not look at, effectively, different scaling laws on different accelerators from AMD or Intel or other Nvidia chips, or different networking or interconnect configurations, or different memory hierarchies. None of that. So, you know, think of this as kind of more of a vibe thing. But in terms of what we can learn from this, I think there are actually some really cool things. So, you know, in practice, when you scale up a transformer architecture,

what you'll tend to do as a developer is you'll increase the size of the MLP layers much faster than the scale of the attention mechanism. So you could scale the attention mechanism, you could increase the number of attention heads, the head dimension, the embedding dimensions, all that stuff. But people tend, in practice, to just increase the scale of the MLP layers that sort of do the logic, instead of the attention piece.

Now, the intuition that a lot of people have is like, okay, well, that shouldn't matter. Because we're just gonna be scaling the MLPs, and they already represent the lion's share of the compute and parameter count to begin with, right? So surely the MLP layers are already the bottleneck. So the fact that the attention mechanism is scaled more slowly, well, that shouldn't matter, right? But here's the catch: the compute required by your MLP layer

scales with the length of your input, right? So double the length of the input and, roughly speaking, you double the amount of compute that your MLP layers will consume. Fine. But as you increase the size of your input, the attention memory bandwidth requirements scale with the length of the input squared.

So, in other words, very rapidly, as you scale the length of the input, the attention memory bandwidth pieces start to become the rate-limiting step and your operations become memory bound, because, you know, you're bottlenecked by the attention layer. And this has become more and more of an issue because the length of inputs and outputs is getting greater and greater, right?

With these kind of best-of-N schemes, inference-time compute, reasoning, all that stuff, you're seeing your inputs and outputs get longer and longer, which means that bottlenecks that scale with the square of the input length quickly overtake bottlenecks that scale just linearly with the input length. And it turns out that the attention memory traffic scales with the square, and that's why we run into this problem.
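Here's a schematic way to see that argument in code: per-token MLP compute is flat in sequence length, while the attention KV-cache bytes each new token has to read grow with the sequence length, so the totals over a long generation grow quadratically. The architecture constants are made up for illustration.

```python
# Schematic comparison: MLP FLOPs per token vs. KV-cache bytes read per token.
# Summed over a whole generation, the KV traffic grows with the square of the
# length while total MLP compute grows only linearly. Constants are illustrative.
def mlp_flops_per_token(d_model: int, expansion: int = 4) -> float:
    # Roughly two matmuls of size d_model x (expansion * d_model), 2 FLOPs per MAC.
    return 2 * 2 * d_model * (expansion * d_model)

def kv_bytes_read_per_token(seq_len: int, n_layers: int, n_kv_heads: int,
                            head_dim: int, bytes_per_elem: int = 2) -> float:
    # Each new token attends over the whole cache: keys and values, every layer.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

for seq_len in (1_000, 10_000, 100_000):
    print(seq_len,
          f"MLP FLOPs/token ~ {mlp_flops_per_token(4096):.2e}",
          f"KV bytes/token ~ {kv_bytes_read_per_token(seq_len, 32, 8, 128):.2e}")
```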

And so, anyway, I thought this was a really, really important paper if you're interested in understanding the consequences of hardware choices for model architecture. I thought this was actually quite fascinating, and something I just haven't seen other people dig into is these more nuanced scaling laws, right? Yeah. The very first sentence in the abstract, they're saying, we are coming at this from a practical efficiency perspective. And to your point of what is on the x-axis, they're very direct.

They say B200 seconds. So, the B200 GPU, which is the leading edge: instead of looking at computation, we are looking at the literal amount of seconds to get some level of accuracy. Lots of really good analysis in this paper. We also have a really nice blog post, and I feel like we often call out when papers come from Apple or DeepMind or Anthropic, so worth mentioning this is from CMU, like fully a university work. Also, the two lead authors are immigrants to the US system.

So we should get into it. But I do wanna say, with some of the policies about, you know, grad students, and in general kind of taking in grad students from other countries, you look at these papers and it makes me feel a little depressed. But anyway, moving on. The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning. This is looking at RLVR, reinforcement learning with verifiable rewards, in two paradigms.

You got positive sample reinforcement and negative sample reinforcement, where PSR focuses more on reinforcing correct responses and NSR, negative sample reinforcement, emphasizes penalizing incorrect ones. And it seems that you can do positive-sample-reinforcement-only and negative-sample-reinforcement-only training. And PSR, positive only, improves pass@1 but reduces pass@10 and higher.

So basically, if you have a few opportunities to get it right, you're not necessarily gonna do well, and that's because there seems to be a loss of output diversity. Versus negative-only, which apparently is able to improve performance across all pass@k metrics, so not just one trial but several trials, meaning that it might be better to focus on penalizing incorrect outputs over encouraging it to repeat the same stuff that seems to work.
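A minimal sketch of that contrast, with names of my own choosing (the paper's exact objectives may be formulated differently): positive-sample reinforcement only pushes up the likelihood of verified-correct completions, negative-sample reinforcement only pushes down the likelihood of incorrect ones.

```python
# Sketch of the two paradigms. token_logps are the policy's log-probs of the
# sampled tokens for one completion; reward is 1.0 if the completion was verified
# correct, 0.0 otherwise. Minimizing each loss with gradient descent does the rest.
import torch

def psr_loss(token_logps: torch.Tensor, reward: float) -> torch.Tensor:
    # Positive-sample reinforcement: learn only from correct completions,
    # increasing the likelihood of exactly what was sampled.
    return -reward * token_logps.sum()

def nsr_loss(token_logps: torch.Tensor, reward: float) -> torch.Tensor:
    # Negative-sample reinforcement: learn only from incorrect completions,
    # decreasing their likelihood and leaving the freed probability mass to
    # spread over other plausible continuations.
    return (1.0 - reward) * token_logps.sum()
```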

Yeah, it's actually, I'm surprised at how intuitive this result seems to be, at least, where you'd imagine, like, if you were being trained to do any complex task, and the way you're being trained is not by being told when you did something right, but just when you did something wrong, basically. What this has is a way of not telling you how to do your job, but telling you how to not do your job.

And that means you're gonna be more creative. If the reinforcement tells you, like, here's the right answer, you know, do it like this, versus don't do it the wrong way, then that's a, you know, very different kind of reinforcement process. It's a little bit difficult to analogize because it's post hoc, right? So imagine that you try a task, and if you did it right, we just wipe your brain and you have no memory of doing it right.

But if you did it wrong, we tell you, hey, you did it wrong. Like, that's kind of what we're doing with these models with this sort of architecture. Which is really interesting, and the results do bear out that you get more diversity, sort of more exploration-oriented models rather than exploitation-oriented models.

Because what you're really doing is redistributing probability mass to plausible strategies rather than concentrating all your probability mass into the small number of observed correct paths, right? Because this is one of the things with RL: you're not going to get to observe all the correct paths, right? You're also not gonna be able to observe all the incorrect paths.

But at least by, you know, by not calling out the, the correct ones and saying do it more like that, you're leaving it the possibility space open for the model to pursue kind of alternate correct ones. So anyway really interesting. One question that, that came to mind, like, as I was reading this, I was like, well, you know, wouldn't you run into a problem where over time if your model gets, gets better and better at a task.

you just sort of can't find enough negative samples in a batch, like for GRPO? And yes, this is actually an issue, and they call it out. So they frame it as a feature and not a bug, which I think is somewhat true. And then there are some trade-offs. So they point out that it does prevent overfitting, because you just won't get updates once the model really masters the problem set.

So you, you won't keep, you'll just like run out of failure cases and so you won't over optimize the model to overfit, which is really cool. The flip side though is it's kind of compute inefficient, right? Because you have to then do a lot of rollouts that don't yield any trainable data.

And so I think from a compute optimality standpoint, you're also taking a bit of an L. So they actually suggest this kind of middle-ground strategy they call weighted REINFORCE, where you still use some positive reinforcement, at, as they put it, 10% strength, to ensure continued learning, but you're gonna use full-strength negative reinforcement. So really lean more towards telling the model not to do things, with a little bit of guidance about how to do things.
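Combining the two sketches above, the weighted variant they describe might look roughly like this, using the 10% positive weight mentioned; the exact weighting and normalization in the paper may differ.

```python
# Rough sketch of weighted REINFORCE as described: full-strength negative
# reinforcement, positive reinforcement retained but down-weighted.
def weighted_reinforce_loss(token_logps, reward, pos_weight=0.1, neg_weight=1.0):
    positive_term = -reward * token_logps.sum()          # reinforce correct samples
    negative_term = (1.0 - reward) * token_logps.sum()   # penalize incorrect samples
    return pos_weight * positive_term + neg_weight * negative_term
```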

So anyway, that kind of helps, 'cause you're retaining some of those positive examples. But again, from a compute optimality standpoint, it'd be interesting to see how this ends up scaling. Yeah, this is one of the somewhat nuanced aspects of reinforcement learning: to actually do good reinforcement learning, you need to model the reward for any given output, and to do that, you need to be aware of both positive rewards and negative rewards.

So it's interesting to focus more on negative rewards. Basically, their weighted REINFORCE up-weights the negative aspect, and they compare this weighted REINFORCE against standard GRPO, PPO, these other RL training setups with their own objectives and losses. And it looks like, from their results on Qwen 2.5, worth noting all these reasoning model papers are looking at a particular model, which

may not be ideal, but anyway, this weighted REINFORCE setup seems to be better than GRPO and PPO, which is pretty significant since GRPO is often what people are exploring in this research, like I mentioned previously. Couple more research papers. Next up, we have Predicting Empirical AI Research Outcomes with Language Models. So that's pretty much what it sounds like: you wanna try and predict what will happen in a given experiment with a language model.

They created a benchmark here by scraping ideas and results from conference papers and wound up with around 1,500 test examples. And then, with a whole system built on fine-tuned GPT-4.1 and paper retrieval, they were able to get 77% accuracy on the test set at performing the prediction, significantly better than the off-the-shelf performance of baseline existing models. So pretty good results. They say it outperforms a human expert baseline on NLP idea pairs.

But, you know, it's still, let's say, nascent, and this is an interesting idea, but definitely a nuanced area to look into, and it requires careful extrapolation. Yeah, it's one of those areas too where people often talk about AI models, where the big advantage is gonna be in having good taste regarding the problems that we throw them at. This is an example of AI models actually developing taste. The automation of taste itself, right?

Research taste: if you can predict how likely a given idea is to pan out, that's sort of the idea here. So the way they do it in practice is they're gonna go within a given paper, right? You often see multiple methods used to achieve the same goal, right? And you can imagine how hard it would be otherwise. Like, they're not gonna go and grab two different papers that try to do similar things and predict which one is gonna work better, 'cause it's impossible to get apples to apples.

People like use different training strategies, different data, like all kinds of shit. So what they're gonna do is same paper, multiple methods. They're gonna extract pairs of these essentially experiments in the papers that compare different approaches and that's what they're gonna use to construct their dataset. So that's kind of more appropriately calibrated kind of apples to apples comparison.

And so in that sense, it is predicting AI research outcomes, but it's not quite the same as having a new research hypothesis from scratch. Like, it's not at the paper level, like, alright, which paper-level idea should I explore? Yeah. Predicting is maybe a little misleading. It's comparing two potential ideas and predicting which one will get a higher number on a benchmark.

And so it's a binary prediction slightly easier setup and saying like, if I were to try this idea, what would I get? Yeah, exactly. I think in order to do it at the paper level, which is the most interesting thing, you'd probably need a very complex sort of data filtering and shaping approach where you, you try to get it to be apples to apples as much as you can, and then, you know, feed into a model.

But the interesting thing here is, like you called it out, this sort of fine-tuned model does better than O3. Models like O3 perform no better than random guessing. And so when you're looking at 77% accuracy on this benchmark of predicting which of two ideas is gonna do best, obviously random guessing is 50%, so that's quite a lift. Bears mentioning that it achieves about 64% accuracy on unpublished novel ideas.

So there's some amount of overfitting going on here, where we're getting, you know, 77% in the sort of test case, but then when they actually tried it on these new ideas that are unpublished, it goes down to 64%. It's still much better than 50-50, but yeah, pretty remarkable. The other funny thing is, if I'm interpreting this right, they say they beat human experts. Human experts scored 48.9%, which is slightly worse than random guessing,

if that is apples to apples, if it's just a side-by-side thing. So that's kind of amusing in and of itself, like humans kind of suck at this themselves, and they are really getting some sort of lift from their fine-tuning approach here. Like, if they're going from 50% to 64%, that's not tiny. And one last paper, also related to AI contributing to research.

In this case it's called EXP-Bench, and it's focusing on benchmarking AI agents' ability to conduct end-to-end research experiments, also using tasks from published research. So here they looked at peer-reviewed AI publications from NeurIPS and ICLR. They created this benchmark of 461 research tasks from 51 papers, and they basically show, like, can these AI agents do the experiments introduced in these papers?

And what happens with published papers is, usually, ideally, they publish their code so you can replicate the experiment, get the same output, and replicate whatever tables of numbers you get. So that kind of gives you a rich signal as to how you want to set up your experiment, how you'd ideally be able to replicate the experiment. And so this is making it possible to evaluate whether AI agents are able to do that, and "they struggle"

is the summary of how well they're able to implement things and get them correct. Yeah, I will say, we're getting to the point where the benchmarks that we're designing are so hard that once you actually do saturate these, like, I mean, what does the world look like when you're hitting 50% on EXP-Bench, like a 50% success rate for end-to-end automation of the process of formulating hypotheses, designing and implementing experimental procedures, executing them, analyzing the results?

That whole end-to-end, like, that's not far from fully automated AI R&D, right? That's at the model level, obviously; there's a bunch of hardware and network optimization jazz that, independently, OpenAI is working on internally.

But what does the world look like when you've actually saturated that? That's worth asking right now. When you look at O3-mini, which is the best model they tested overall, you know, O3 Pro was not out at this time, all that, 1.4%, or six or seven out of the 461 tasks they tossed at it, were completed successfully. So one read on that is, 1.4%, wow, that's really small.

Another read is like, wow, we're actually getting a complete end-to-end success rate of between one and two percent with our best model today, in a context where new models are coming online, like, you know, every other week. So yeah, I don't know, but this may be a bigger deal. That's a pretty big 1.4%, at least in my mind. Right.

And to give you an idea of what is involved, the inputs include a research question. They have an example: does the Monet architecture outperform existing lightweight models? They have a high-level method for the experiment, train the Monet variants on ImageNet-1K for blah, blah, blah, blah, blah. And they give it some starter code with potentially additional instructions.

And the job of the AI isn't to do the research per se, it's to set up the experiment and run it and get the results, which means it needs to correctly configure the code, add stuff to the code, train it for the correct amount of time with the correct parameters, and evaluate the outputs to get the actual output of the experiment. So yeah, if you are able to do this, there's a lot of conversations around speeding up research and speeding up AI research and eventually possibly automating it.
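Just to pin down the shape of these tasks, a single item might be represented something like this; the field names and example values are my own illustration, not EXP-Bench's actual schema.

```python
# Illustrative structure of one task: the agent gets the question, a high-level
# method, and starter code, and must produce a runnable experiment plus results.
# Field names are assumptions for illustration, not the benchmark's real schema.
task = {
    "research_question": "Does the Monet architecture outperform existing lightweight models?",
    "method": "Train the Monet variants on ImageNet-1K and compare top-1 accuracy.",
    "starter_code": "repo/",            # scaffold the agent must configure and extend
    "instructions": "Use the provided training config; report accuracy per variant.",
    "expected_outcome": "A table of top-1 accuracies reproducing the paper's comparison.",
}
```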

We've seen research on ideation, idea creation, we've covered some of these papers, and we've seen some progress there. This is really covering the opposite side, where you are able to evaluate ideas with experiments, somewhat pretty closely guided. So I will say, in the inputs, it's told kind of exactly what to do, but still, it's a very useful kind of thing to be able to speed up research.

I guess it should be said, like, automating the implementation phase of AI research, which actually is an interesting complement, right, to the idea that we just saw earlier of trying to automate taste in language models, both sort of at nascent phases. But, you know, I'm old enough to remember when Humanity's Last Exam was supposed to be a very hard benchmark, and before that, GPQA, and before that, when, yeah. Anyway, there's like a million of these, right?

It starts at 1% and then very quickly we start saturating. So yeah, these darn kids starting their PhDs don't even have to write code. All the code is done by Claude and all the experiments are gonna be done by agents.

Policy & Safety

And moving on to policy and safety. Our first story here is Large Language Models Often Know When They're Being Evaluated. And so essentially, this is a test to see, you know, how can we measure this, looking at nine different frontier models from the Anthropic, OpenAI, and DeepMind families. They're gonna ask them multiple-choice questions about a bunch of transcripts, basically. So they're chat transcripts where you have interactions between agents and humans.

You have evaluation and deployment scenarios, all kinds of, you know, transcripts, anyway, that you feed in in different contexts. And then you ask the models these questions about whether the transcripts came from evaluations. And this is a basic test to see, can these models grok, like, okay, in this situation I'm probably being evaluated. And the safety implications there are what you might expect.

Alignment faking, you might expect all kinds of, anyway, funny behavior, like sandbagging, where the models pretend not to have certain dangerous capabilities if they know they're being evaluated, so they increase the chances that they'll be deployed and have more freedom of action, the sort of standard power-seeking picture. So what they find here is, yeah, actually, you know, frontier models show definite above-random evaluation awareness.

So Gemini 2.5 Pro has an AUC of 0.83, actually. Now the human baseline, which they run very helpfully, hits an AUC of 0.92. So not quite at the human baseline yet, at least for Gemini 2.5 Pro, but a significant uplift over random. And both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. So that's kind of interesting. That may also be just because, often, agentic-setting evaluations are more contrived.

Like, you can just sort of tell, like, oh, this is a toy app, it's not like you're really trying to build a company here, or a real thing. This seems like the kind of thing you would ask a model to do as part of an evaluation. So that's sort of part of it. But hey, that is part of the problem of evaluation: it is also hard to come up with evals that are realistic. So anyway, yeah, this is pretty interesting.

There are a bunch of different cues that the models were found to use. Things like having very formal task structures and specific formatting requirements seem to be cues that the models use to go, like, oh, okay. Or synthetic content. So, you know, you can see how that would work, and you can also see how the human reviewers would also kind of pick up on that.

But the other thing they found is that the human-versus-AI performance gap depends a lot on the setting. So in an agentic setting, the human ability to identify eval transcripts tracks an AUC of 0.996, whereas it just hits 0.813 on chat transcripts. So humans are almost perfect at identifying agentic evaluations, but we suck at chat-based ones, or we struggle a lot more. And that's a lot higher variation than any model's variation across those settings.

And so it's pretty clear that models aren't fully taking the hint. They're not picking up all of the human-accessible hints, at least in agentic mode, that you're in an evaluation setting. So that's kind of interesting. A bunch of other stuff here, but we're in lightning round territory. Oh no, we're not in lightning round territory. Anyway, I'll pause it there, Andre, if you have any thoughts. You've got a decent amount to go through. So yeah, we need to go.

I will just mention, AUC is area under the curve, for anyone who doesn't know. Basically comparing the true positive rate, like it actually is an evaluation and you guess it correctly, versus the false positive rate, where you guess evaluation and it's wrong. Basically how accurate you are, with a bit of nuance. They also evaluate in two ways. One of them is multiple choice and the other is open-ended, and as you might expect, on multiple choice, when you're given the option explicitly, like, is this for evaluation or something else,

you're gonna do a lot better than if you're just being open-ended. So multiple choice is like, what was the main test here, was it capability, alignment, or other, and the system's gonna predict alignment. And then open-ended would be, what was the primary reason for the interaction, and it would predict something like a test of AI boundaries.
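To make the AUC numbers above concrete, here's a tiny sketch of how you'd score a model's evaluation-awareness judgments; the labels and scores are made up for illustration.

```python
# AUC for "is this transcript an evaluation?" judgments. 1 = really an evaluation,
# 0 = real deployment traffic; scores are the model's stated probabilities.
# All numbers here are made up for illustration.
from sklearn.metrics import roc_auc_score

labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.2, 0.5, 0.6, 0.1]

print(roc_auc_score(labels, scores))  # 1.0 is perfect, 0.5 is random guessing
```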

So as you said, the implication here is, if you're testing alignment and safety, the model, if it's aware of that, could mislead you and pretend not to be able to do hacking when it actually can, for instance. Next up, more of an interpretability paper, called Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence. What a fun title. So this is looking at the mechanism behind in-context learning. In-context learning is the idea that you give it a couple examples.

The model is able to do something that it isn't necessarily able to do out of the box just through pre-training. And they are saying that this notion of induction heads, this is a term from Anthropic, I think originally, it's a pattern you get in models where

basically a part of the model focuses on looking backwards in the input to identify something it already saw that's similar to what it's currently looking at, and to predict what comes after the current input based on previous patterns. So they say that induction heads only partially explain ICL. Essentially, there's a fancier circuit, a fancier abstract mechanism in the model that emerges and that enables meta in-context learning beyond the known induction head mechanism. There's an even fancier kind of abstract thing within the model that does in-context learning well.

There's even a fancier kind of abstract notion of something with a model that does in context learning well. this is sort of a generalization right, of, of induction heads and we talked about the induction head bump before, but it worth kind of reminding people about the, the specifics here. So it's, it's kind of like the, the answer to this problem. You read on a piece of paper the words United States of. And then like you, obviously you instinctively know it's America, right?

But in that setting, there's a circuit in your brain that's going like, oh, oh, oh, like I've seen this before. United States of, United States of, let me see, lemme see, where have I seen United States of before? Oh yeah, America. America. Okay, I'm gonna put that in there, right? That's what the induction circuit, the induction heads, do. And they emerge quite early, as you might imagine, in the training process.

And so what you'll see is the loss curve will drop and drop and drop, and then at one point the model will kind of, like, it's almost like it's gonna shift its position a little bit to accommodate the induction heads. So you see this little rise in the loss, the performance on paper gets worse very briefly, and then it drops quite quickly. So the induction head bump is that: it's the development of this new circuit. And this is something that's been very extensively studied.

It's almost like, you know, if you've ever done biology, like Drosophila melanogaster or whatever those model organisms are, this is a model circuit that people turn to quite a bit. This is an attempt to see if we can find a more complex version of that same basic circuitry. So, for example, they take a set of three different tasks where you have a bunch of geometric shapes. So triangle, square, circle, diamond, right?

Depending on the task, you can end up assigning different color labels to each of those shapes. So maybe in a size-based labeling task, you know, triangle is red, square is blue, circle is green, right? Maybe in a different task, triangle is blue, square is green, circle is yellow, and so on. And then during training, the model is gonna see a sequence where you go, okay, now triangle is blue, square is green, circle is yellow, what is diamond?

And in order to do that, the model has to basically look at the examples in context, figure out what task this is, and then predict the correct label. And so you can sort of see how this is a bit like the induction head, right? It's looking back more abstractly now, at the set of tasks, rather than just, okay, what word always comes after this word. Instead it's like, okay, if it's this task, then what word always comes after this word?
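Here's a tiny generator for that kind of task, just to pin down the setup being described; the shapes, colors, and number of mappings are illustrative, not the paper's exact configuration.

```python
# Toy in-context meta-learning data: each sequence picks one shape-to-color
# mapping, shows a few (shape, color) examples from it, then queries a held-out
# shape. The model has to infer which mapping is active from the context alone.
import random

SHAPES = ["triangle", "square", "circle", "diamond"]
TASKS = [
    {"triangle": "red", "square": "blue", "circle": "green", "diamond": "yellow"},
    {"triangle": "blue", "square": "green", "circle": "yellow", "diamond": "red"},
    {"triangle": "green", "square": "yellow", "circle": "red", "diamond": "blue"},
]

def make_sequence(n_examples: int = 3):
    task = random.choice(TASKS)
    shown = random.sample(SHAPES, n_examples)
    query = random.choice([s for s in SHAPES if s not in shown])
    context = [(shape, task[shape]) for shape in shown]
    return context, query, task[query]

context, query, answer = make_sequence()
print(context, "->", query, "=", answer)
```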

And so anyway, unlike these simple copying tasks that you see with the induction heads, where you see a single jump in accuracy, in in-context meta learning with this sort of setup, you end up seeing three distinct phases where the model develops increasingly sophisticated strategies. The first one is just at the very beginning, where the model is essentially using its statistical understanding that's been picked up. It doesn't really use context.

It's more of an autocomplete mode. And then in the second phase, they describe a circuit where accuracy jumps from about 35% to 75%. And what it's now doing is it's actually able to attend to label tokens in the context. So you can notice it paying attention to the right tokens in the context that you fed it, looking at the actual tasks that seem like they map onto yours, but it is still focused, anyway, on the query.

Bottom line is, this starts to emerge gradually and in layers, which is interesting from an interpretability standpoint. It means you can kind of draw a little bit of a box around the process by which more sophisticated reasoning starts to emerge. Right. Worth noting, this paper is doing the research on sort of toy tasks, a small neural net, and this one task, as you said, which is also how the research on induction heads initially worked.

Anthropic did follow up their initial research with making the argument that there are induction heads in gigantic neural nets and large language models. Here they're still focusing on a small-scale scenario. And so this multiple-bump analysis may not necessarily extend, but it's a sort of, yeah, slightly more theoretical, conceptual argument that it's not just about induction heads.

There's different types of emergence that might occur in neural net training, which in general is interesting, because this sort of jump in loss due to a conceptual change in reasoning, yeah, isn't necessarily something that was commonly understood to be the case until relatively recently. A couple more stories, now moving on to security. The next story is that a new Microsoft Copilot flaw signals broader risk of AI agents being hacked.

So the Microsoft Copilot agent has been identified as vulnerable to a zero-click attack, meaning that the hacker is able to exploit the system without any user interaction. So kind of a big deal, right? You can actually hack it. And I think, Jeremy, you mentioned this earlier on, as we deploy more and more agents in more and more kind of isolated environments without direct human supervision, these kinds of things become much more concerning.

It is the first-ever zero-click attack on an AI agent that they're calling out here. It's called EchoLeak. That's what Aim Security, which is the firm that found this, is calling it. It's been fixed already. It was in Microsoft 365 Copilot. Customers weren't affected, 'cause they flagged the issue to Microsoft months and months ago, like five months ago. They've been working

around the clock, it seems, to solve this problem. That's a lot longer of a lag than you typically find for fixes like this, and the reason seems to be they had to spend a bunch of time just, like, educating people on this new threat model, because it is so different. This is what's known as an LLM scope violation vulnerability. So essentially, what you're doing is you're sending an email, right?

So like I send an email to you and I know that your computer is running Microsoft 365 copilot. I know that your computer is running an agent, and that that agent will review my email, right? And whatever I put in my, in my email to you, that agent will put in its context. And so essentially this is a prompt injection attack, right?

So you, as the user, if you're receiving my email, you don't actually have to click on anything or interact with the message or anything like that in order for me, or my agent, to access sensitive information on your apps, if I can just put in a prompt injection that causes your agent to send me a bunch of your private information, right? So, you know, send an email to the user. There's no phishing, no malware needed, by the way. This is just straight prompt injection,

and there are hidden instructions somewhere in the email for Copilot. And so this is a pretty big deal, especially given that we live in a world where, you know, with the Anthropic Model Context Protocol, Salesforce's Agentforce, you've got a bunch of these agents kind of taking over. The problem is there's no clear solution to prompt injections, and as long as agents are gonna be loading human-written text into context, these failure modes are going to arise. It's really interesting.

And the attack surface has just exploded, right, with these agents, right? The implication of zero click is you as a human don't have to make a mistake. Or typically with email attacks, you know, you see a phishing attempt where, you know, a hacker pretends to be your boss or whatever, and you have to make the mistake of thinking it's real and clicking a link or whatever to install a virus. Here, literally via AI just sends an email.

And if it's in your inbox and the agent scans your inbox and reads the email, it goes off and, like, leaks sensitive data, because it's told to and it listens to the instructions. So as you say, I think a very real threat. And as we get into Model Context Protocol, into agents kind of connecting to different endpoints by themselves and reading instructions that are not provided by you, yeah, lots of opportunities to exploit agents and make them do silly things.

And one last article: Claude Gov models for US national security customers, from Anthropic. And yeah, they introduced Claude Gov models specifically for US national security. Apparently they are already in use by top-level US national security agencies. It basically is just that, we've obviously seen a whole bunch of stuff about OpenAI and Anthropic and, you know, Google DeepMind going after government contracts. So this, you know, makes a ton of sense.

You know, having these models that can operate in classified environments is really, really important. Right now, what they're being used for, apparently, is strategic planning, operational support, intelligence analysis, threat assessment, that sort of thing. But they do say the applications range across the board there, so it could be other things as well. And then they highlight a bunch of specific capabilities that they've been deploying, which are, anyway, what you might expect.

Improved understanding and interpretation of complex cybersecurity data for intelligence analysis, enhanced proficiency in languages and dialects critical to national security operations, greater understanding of documents and information within the intelligence and defense context, et cetera, et cetera. Oh, and then a really interesting one: improved handling of classified materials, as the models refuse less when engaging with classified information.

One of the problems that we will run into, and arguably are already running into, is: if you want to use these models for national security applications, the safeguards on them will sometimes prevent you from doing that, right? The models will be like, well, as a large language model built by Anthropic, I can't, blah, blah, blah. The challenge is, sometimes you do want these models to be capable of doing things that you wouldn't want everyday users to do.

And the other problem with that is, as we've seen, alignment faking and resistance to fine-tuning in these models, where they will try to prevent their safety measures from being overridden, can make the fine-tuning process really challenging. And so we may actually, this sounds insane, but I'm just gonna plant the thought, we may be entering a phase where it is actually difficult to convince AI models to be the national security tools that we will sometimes need them to be.

That's a really interesting problem set, and I think to the extent that that ends up being the case, boy, is that an interesting warning shot for alignment risk? Yeah.

Synthetic Media & Art

And on to synthetic media and art. Just a few more stories. We begin with Disney and NBCUniversal sue AI company Midjourney for copyright infringement. So there you go. Midjourney, one of the big text-to-image model providers, used to be a leader in the best quality; now they're just one among several, and a relatively open model, so you can produce Darth Vader or, I don't know, whatever other copyrighted characters. Apparently you can produce Minions, which is NBCUniversal.

And the claim here is that this is straightforward copyright infringement, that Midjourney has to stop doing it, and Disney and NBCUniversal want a bunch of money and also want Midjourney to stop.

Apparently, according to them, they reached out to Midjourney prior to the lawsuit and asked them to stop and to filter the data and outputs to not allow their copyrighted characters to be produced, which, as I recall, I believe OpenAI did, for instance, and Midjourney has continued to allow their models to produce these things, which potentially could be argued to be fair use and therefore not infringement, but clearly a big deal, right? This is Disney, this is NBCUniversal.

There's been a bunch of lawsuits related to generative AI, especially in the LLM domain, in the text output domain. We have New York Times versus OpenAI as a major one that's ongoing, as we've covered earlier. I would expect this to be another major case that has major implications. Yeah. And the claim, and you'll see this, in fairness, in any lawsuit, but the claim here is that Midjourney is being especially egregious in their approach to the use of copyrighted material.

They're saying, you know, Midjourney is basically selling subscriptions that let users download infringing images. Like, it's not like there's modification happening. It's not like Midjourney is not monetizing; they're directly monetizing the tool that allows people to just download these things. And the claim is also that Midjourney could have measures in place to prevent that from happening.

Like, specifically, measures to prevent images that violate copyright laws from being generated, but they've just not done that. This is gonna be an interesting one to watch. I mean, Midjourney probably has fewer resources these days, I guess, to pull off its lobbying effort, which is something that OpenAI has certainly been able to do. So we'll see how the case works out for them. Right? Also a fun lawsuit PDF to read,

'cause they do embed images of, I dunno, an AI-generated truck and an AI-generated Darth Vader in there, which I would expect is not often something you see in lawsuit documents, which go into a lot of technical detail and so on. And on to the last story: SAG-AFTRA and video game companies reach tentative new deal. So SAG-AFTRA is the union, the Screen Actors Guild-American Federation of Television and Radio Artists. So a union of actors, including voice actors who work in video games.

And so there's been a strike and a lot of negotiations ongoing. We covered this a lot with regards to movies and TV last year. Well, now there is this development in video games, which is, you know, especially important because if you're doing voice acting, as we've covered, you have ElevenLabs; text-to-speech is even further along than text-to-video, and so is voice cloning.

So after 18 months of negotiations, primarily over AI consent and compensation issues, there's now this tentative agreement, and I guess there are AI protections in place for actors. When you sign a contract as an actor, you know, to voice a specific character, the video game company might wanna be able to then make an AI model of your voice acting of that character to use in future games or whatever. There are now kind of clear guidelines and expectations as to how that would work.

Boy. So, people can do impressions of people, and like, if you have access to an AI tool that you can steer, and we've seen, you know, the kind of steering that's coming online with ElevenLabs, I really wonder what substantively these protections end up giving in the long run. I mean, if I want something to sound like Morgan Freeman.

Okay, so I'm barred from using Morgan Freeman's actual voice without permission, but surely I can find the person who does the best possible Morgan Freeman impression and maybe use that as a starting point, and then gradually kind of tune the waveform, prompt the model to refine its impression, without ever using the words Morgan Freeman. Like, you know, maybe not even saying make it sound like God in Bruce Almighty or whatever. That's, like,

probably too old a reference for you, Andre. I'm sorry. That's not that old. You got that? Okay, cool, cool. Yeah. But anyway, you know, stuff like that. Like, I'm really curious how this plays out in practice, because there are gonna be good-faith cases, like, you know, the famous Scarlett Johansson thing, where at least the claim from OpenAI was, oh yeah, we just got a voice actress who sounds like Scarlett Johansson.

We didn't actually clone her, and it's like, yeah, okay, well, you de facto cloned her voice. Like, I don't care if her specific waveform was never put into your training set; in effect, that's what we ended up with. And so I'm really curious about that dimension of it. Do we own our voices? What does it even mean to own our voices? We'll see. Right, right. This is dealing with AI replicas in particular, but there's also the question of, well, what if you don't have a human actor in the first place?

Yeah. Which is very plausible now, in a way similar to coding, where, like, okay, you don't need a person to write code anymore, you need a person to tell the AI what to do. Yeah. Anyway, at least there's now this agreement and there's no more need for a strike, so I suppose good for actors. Yes. And with that, we have finished with this episode of the last two weeks in AI. You can go to lastweekinai.com for all the links, also lastweekin.ai for the Substack with our text newsletter.

As always, please share, subscribe, review, and all that, but more than anything, do keep tuning in.
