#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi - podcast episode cover

#204 - OpenAI Audio, Rubin GPUs, MCP, Zochi

Mar 24, 20252 hr 49 minEp. 244
--:--
--:--
Listen in podcast apps:

Episode description

Our 204th episode with a summary and discussion of last week's big AI news! Recorded on 03/21/2025

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Read out our text newsletter and comment on the podcast at https://lastweekin.ai/.

Join our Discord here! https://discord.gg/nTyezGSKwP

In this episode:

  • Baidu launched two new multimodal models, Ernie 4.5 and Ernie X1, boasting competitive pricing and capabilities compared to Western counterparts like GPT-4.5 and DeepSeek R1.
  • OpenAI introduced new audio models, including impressive speech-to-text and text-to-speech systems, and added O1 Pro to their developer API at high costs, reflecting efforts for more profitability.
  • Nvidia and Apple announced significant hardware advancements, including Nvidia's future GPU plans and Apple's new Mac Studio offering that can run DeepSeek R1.
  • DeepSeek employees are facing travel restrictions, suggesting China is treating its AI development with increased secrecy and urgency, emphasizing a wartime footing in AI competition.

Timestamps + Links:

  • (00:00:00) Intro / Banter
  • (00:01:36) News Preview
  • Tools & Apps
  • Applications & Business
  • Projects & Open Source
  • Research & Advancements
  • Policy & Safety
  • Synthetic Media & Art

Transcript

Hello and welcome to the last week in AI podcast where you can hear us chat about what's going on with ai. As usual, in this episode, we will be summarizing and discussing some of last week's most interesting AI news. You can go to the episode description for all the links and timestamps, and also to last week in ai.com on your laptop to be able to read those articles yourself as well. As always, I'm one of your hosts, Andre Ov.

I studied AI in grad school and I now work at the generative AI startup Asate. And I'm your other host, Jeremy Harris. I'm with Gladstone ai, AI National Security Company, which you know about if you listen to the podcast. You also know about aade. Now a bunch. If you listen, you know about all of this, you know about all this.

What you don't know though is that this morning at the early hour, I think it was like three or something in the morning I discovered that I have bats in my house, which is fun, which is really fun, especially when you have like a six month old and you have bats. And then you start Googling things. So anyway, we had pest control come in. That's why. Wow, my hair looks like Cosmo Kramer right now. I've just been running my, running my fingers through it for for quite a bit.

So anyway we got everything on for Showtime though, because the show go on. Yeah, if you, but if you get any details wrong, you know, it's, it's the shock, residual shock of bats you. I'll be on the lookout. Well, let's do a quick preview of what we'll be talking about in this episode. It's gonna be a bit of a relaxed one. There's nothing too sort of world shattering, but a variety of pretty interesting stories, tools, and apps. We have some new impressive models out of China.

Some new stuff from OpenAI as well. Google, philanthropic, everyone launched some stuff. Applications and business as we often do. Gonna be talking a lot about hardware and GPUs. A little bit about fundraising as well, projects and open source. We'll be talking about the model context protocol, which has been all over rage in the AI community recently, and a couple new models as usual. Research and advancements.

We gotta talk about reasoning techniques inference time, scaling techniques but also some new kind of developments in the space of how you implement your models, policy and safety. We have some more analysis of what's going on with China, US national security, things like that. And finally, we will actually talk a little bit about the world of art and entertainment with some news about copyright. So let's just get straight into it in tools and apps.

The first story is about Baidu launching two new versions of the Ernie model, Ernie 4.5, and Ernie X one. So Ernie initially released two years ago, and now we have Ernie 4.5, presumably. I don't know, it sounds like kind of to coincide with four with GPD 4.5. And then Ernie X one is the reasoning variant of Ernie that Baidu says is on par with deep seek R one, but at half the price. And both of these models are multimodal. They can process videos, images, and audio as well.

They also say earn your 4.5 is kind of emotionally intelligent. They can understand the memes and satire, which is interesting. So I think. We, we don't have like a great sense of the tool landscape in China is my impression. I, I really wish I would know, like if you are a user of a chat bot, I. We go to chat GT or Claude to give out queries.

I think it seems likely that Ernie is sort of filling that role and the fact that there's new models and the fact that they're really competitive, pricewise is a big deal. the like number one downloaded app in China just switched to a new AI chat bot that is not deep seek. So things are definitely moving. the big advantage here with the, this launch seems to be cost. At least that's what they're leaning into with a lot of the discussion around this.

So the goal that Baidu has, which, you know, Baidu of course is China's. Roughly China's Google, right? They, they own search there. Their goal is to progressively integrate Ernie 4.5 and their X one reasoning model into all their product ecosystem including Baidu search, which is sort of interesting. So we'll see a, a rollout of the generative AI capabilities in, in that context. Yeah, so ultimately it does come down to price. A lot of it.

So for context, there's a really handy table in one of the articles that looked at this comparing G PT four point fives per token cost to deep seek V three to Bernie. Sorry, Bernie to, to Bernie, 4.5. it's quite interesting, right? So input costs for input tokens, 75 bucks for a million tokens. This is for GPT-4 0.5. Deep seek V three that drops to basically 30 cents. Ernie 4.5 is about 60 cents or so per 1 million tokens. So, you know, you're talking orders of magnitude less.

Also the case that these models are less performant. So that's sort of the trade off there, but where things really start. Yeah, I think just to give a, a bit of a perspective deep CCP free is more comparable to something like GPT-4 oh in open eyes slate models or free mini, for instance, where aggression isn't that crazy? It's maybe I forget, $1 ish. Per million tokens, so they're comparable. G PT 4.5 is just crazy, crazy pricing compared to everything else. And that's the thing, right?

It's the, the way to think about 4.5 I think we touched on this a couple episodes ago, but it, it's a base model, but it's not a base model for let's say mass production, right? These are high, high quality tokens. Probably best used to create things like synthetic data sets or to answer very specific kinds of questions. But you're not looking at this as something that needs to be, that you wanna productize just 'cause you're right.

I mean, it's, two orders of magnitude more expensive then other base models where you actually see the lift here especially for Ernie X one, right? With this is the reasoning model is on the reasoning side, right? So open AI's O one is. roughly 50 times more expensive than Ernie X one. Ernie X one is about half the cost of R one for input tokens. And and actually that's also true for output tokens. So it, it's a like quite significant, especially again, relative to O one and shows you.

One of two things. Either Chinese engineering is actually really, really, really that good, or there's some state subsidy thing going on in the background. I think the latter is somewhat less plausible at this point, though I wouldn't rule it out. Certainly there's some amazing engineering making these these margins possible and it's, that's a pretty remarkable thing here, right? I mean, the cost just collapsing for reasoning.

This implies that there's some reasoning specific engineering going on in the background and you know, you should expect that to apply to training as well as infants going forward. Yeah, and it's kind of, I. Funny in a way. There is a parallel here between Baidu and Google, where Google likewise has quite competitive pricing, especially for Gemini Tube flash thinking. So I could also see it being you know just a company strategy kind of thing. Baidu is gigantic.

They're printing money with search, so they could also kind of eat the additional cost to undermine. Something like deep seek, which is a startup right, to lock in the market. But either way, exciting news. And I guess if you're in China, I don't believe you can use Chad GBT. So if nothing else, it's good that there are comparable tools for people to use and not miss out on the fund of these advanced lms.

Yeah, I I I will say I don't know that, that Baidu would be subsidizing at the level of, at least their base model because they are actually more expensive than Deep Seq V three on Ernie 4.5. That's true. Yeah. Where, where do you see that flip is with the, the reasoning models, which itself is, yeah, that's kind of interesting, right?

I mean, to me at least, that seems to imply something about reasoning like engineering for, for the, like computer architecture behind reasoning or more token efficiency and therefore compute efficiency at the, at the reason, I shouldn't say therefore maybe alternatively compute efficiency at the reasoning stage, but, you're right. There's all kinds of things start to muddy the waters when you start thinking about the economics of these things.

as they represent a larger and larger fraction of the corporate bottom line, even for big companies like Baidu, like Google these companies are gonna be forced to show us their hand in a sense, right? They're gonna have to sell these tokens for a profit, and we will eventually learn what their actual margins are. It's debatable whether we're learning that just yet. Yeah, I don't think we are. It's, it's very much unknown. And I haven't seen any kind of strong analysis to explain it.

There's, you know, yeah, it's just mystery what kind of tricks people are pulling. But I would also kind of bet that the margins aren't great. The one thing we do know deep seek claimed at least that they were making a profit and, and had a positive margin on their models. And I could see that not being the case for. You know for instance, OpenAI where their revenue is in the billions, but real question is, are they actually making a profit? last thought on this too.

On the economic side, like when we think about what it means for deep seek to claim that they're generating positive returns I think there's a, an important question here about whether that's operating expenses or CapEx factored in, right? So we saw in their paper that they famously talked about how they, trained V three on $6 million of compute infrastructure. Now, I. Or sorry, on a $6 million compute budget.

That was, it seems in retrospect the actual, the operating expenses of running that compute, not the capital expenses associated with the Tens of millions of dollars as it would've been of compute hardware. So it's always hard to know, like, what do you amortize? How do you factor in what's apples to apples? Yeah. It's hard to say, like deep seek is profitable, but on a per token basis, just for inference, I believe the claim is we're making money, which yeah. In itself is on an opex basis.

Yeah. Interesting. Yeah. Moving right along. Next we have OpenAI and they are releasing some new audio models. So there are new two new speech to text models, GP four oh transcribe and GP four oh Mini Transcribe, which are basically replacing their whisper models. Open has already had this as a service for quite a while.

The exciting new thing here is the text to speech model GP four oh mini dash TTS, which is more along the lines of 11 labs where you can produce very natural human sounding speech. And along with announcement of the models, OpenAI has also launched a new cite OpenAI fm, which is a, a demo site where you can go and mess around and, and kind of hear the outputs and.

This is kind of a fun trend, I gotta say, where these companies increasingly are launching these little fun toys to get a sense for what these models are capable of. one last thing, again, we probably should comment on pricing. The pricing is very competitive. The transcription for GT four oh, it's 0.60 cents per minute 0.60 cents. So like $0.01, I guess. And GT four oh meaning ETTS is 1.50 cents per minute, which is mod lower than a competitor, like 11 labs, for instance.

So yeah, I think it's interesting to see opening, expanding their model suite to these new domains where they're sort of less, focused we've seen them kinda move away from text to image, for instance, Dali hasn't had an update in forever. Yeah. And so I guess this makes a lot of sense that they have very competitive things to offer given their investment in the advanced voice mode in chat GPT. it's sort of reminiscent of the, the problem that meta faces, right?

Where they're, you know, they, they reach like, whatever, 3 billion people around the world. At a certain point, when your market penetration is so deep, one of the only things you can do to keep growing is to grow the market. And so meta invests, for example, in getting more people on the internet in, you know, other countries like in, in countries that don't have internet access typically, or, or have less of it. And so they're literally just trying to like, grow the pool of people.

They can, they can tap for this. In the same way, I think there's a lens on this that's similar, right? So you are only able to interact with chat GBT, through certain modalities or, or with open AI products, through certain mod modalities. And by achieving greater ubiquity, by reaching into your life more and, and making more of the conversational tooling available to you that that really does effectively increase their, their market, right?

Like, you don't have to be in front of a computer necessarily, or in the same way, or engaged in the same way to use the product. And obviously they've had other, other voice products before, but it's sort of part of, if I'm open ai, I'm really thinking about multimodality, but.

From the standpoint of increasing the number of contexts, life context in which I can reach you, and it, you know, text to image still requires you to be in front of a screen, same as, you know, writing text on chat GPT, whereas audio is just this like, you know, greater reach for modality wise. So I think strategically it's an interesting play for them. Ethically, all kinds of, of issues.

I mean, you know, you think about the, the modality of audio as being one that is much more intimate to humans and an easier way to plug into your, inner world. And that's, I think something, you know, when, when you look at like what Recca did to people just through text, right? The suicidal ideation, the actual suicides, the, the rec subreddit when people had their, you know, AI boyfriends or girlfriends taken away from them, you know, that sort of thing.

When you, when you tie an audio, I think it's, it's gonna be a, an interesting PR challenge if nothing else for open ai. There is one figure, by the way, in the article. At least the, the, we're linking to here. And it's just a, a piece of research looking at the, the word error rate comparisons across leading models for different languages as, as part of this kind of tooling. I just, I, I find it really interesting, like Arabic and Hindi there, there's a lot of struggle there.

There, those are some of the, the worst performing languages. Obviously English, one of the better performing ones. I'd love to see an overlay of this relative to the amount of data that was used to train the model so that you can see in relative terms, like which languages are in a sense, like harder for AI to print out to kind of, to speak. I think there, there's something anyway linguistically just fascinating about that. If nothing else. So anyway, overall interesting launch.

And I think we're gonna see more and more of this, right? It's gonna be more expected to have very high quality audio models and linking them specifically to agents, sort of Star Trek computer style. Yeah, I, I guess one thing worth noting on the kind of ethics side is I don't believe they're offering voice cloning technology, which is where you can really get into trouble very easily. So I think open AI is being a little careful these days in general to not court controversy.

Part of why it took them forever to release SOA potentially. And in this API, this demo, they are releasing something like a dozen voices you can use with names like Alloy Ash, echo, fable, Onyx, Nova kind of, I don't know, human, I guess. They're not even trying to make 'em human sounding. And you can also assign them a vibe in this demo, like cowboy auctioneer, all timey, serene with a lot of this kind of steering what you can do as well. So yeah, I think it's, it's pretty exciting.

And as ever with a release of new APIs, this really enables downstream of OpenAI and Visa companies for others to build exciting new applications of ai. And onto a few more quick stories. Next up, also OpenAI. They have released oh one Pro into their developer API. So it's actually limited to developers who spent at least $5 on this, and it costs 150 per million tokens for input and $600 per million tokens generated. So that's very, very high.

Prices, obviously, that's as we've said, GT 4.5 was $75 for 1 million output tokens. And that's yeah, two, two orders of magnitude easily above what you would typically charge. yeah, I'm trying to think if it's two or three orders of magnitude. It might be approaching three orders of magnitude, actually. So, yeah. Interesting strategy here from OpenAI. We haven't seen any other companies.

Release these very expensive products yet and open, I increasingly doing that with Che Pro, their $200 per month subscription with QB 4.5. With this, it makes me wonder if this is an attempt to become more profitable or if this is them sort of testing waters where it could be various readings, I suppose. Yeah. it's also, I mean, it's interesting to note this is not an order of magnitude larger than what GPT three's original pricing was.

I was just looking it up on, in the background here to check. 'cause I, I seem to remember it, it being, you know, on a, back then it was priced per, per a thousand tokens. With reasoning models, you tend to see more per million tokens just because of the number of tokens generated. But sort of reminds me, you know, in the military or the history of the military, there's often this. This restriction where it's like people can only carry, I forget what it is 60 pounds or something, of equipment.

And so over time you tend to see like the amount of equipment that a soldier carries doesn't tend to change or the weight of it, but of course the, the kind of equipment they carry just changes to reflect technology. This sort of seems similar, right? There's like almost a pato frontier of pricing, at least for the, the, the people who are willing to reach for the most intelligent products. you know, you're constantly reaching for it though.

This is a push forward even relative to the g PT three frontier back in the day. So kind of interesting there's all kinds of, feedback people have been getting, there's complaints about, oh, this model struggled with Sudoku puzzles apparently, and optical illusions and things like that. People say, you know, I, at a certain point, anything you launch at a high price point especially if you're opening eye, people will complain that it's not like super intelligence.

And so Yeah, there's also an interesting parallel here where O one Pro, just in terms of benchmarks, and I think in general in terms of the vibe of what people think is, is that it's not sort of significantly better than O one and that parallels GT 4.5. You know, it's better, but it's not sort of a huge leap. So there is an interesting kind of I dunno, demonstration of.

Probably it's harder to get, you know, huge leaps performance and people are gonna be more critical now of if you are not offering something that's like, you know, really leap between GP four 3.5 and four, for instance. Yeah, I mean, I think it's, it's quite use case specific too, right?

So as we've seen, you know, the, the kinds of issues people are running into optical illusions you know, Sudoku puzzles, this sort of thing are pretty far from the standard, you know, the actual workloads that open AI is targeting, right? Their focus is can we build something that helps us automate AI research as quickly as possible? those sorts of benchmarks. Yeah, we are seeing needle moving there.

there's also some interesting stuff that we'll talk about from meter suggesting that in fact, that is what's happening here. That on those particular kinds of tasks we're seeing pretty significant acceleration with scale. But, but you're right, it, right? It's this funny, uneven surface, just like how humans are, are funny and uneven, right? Like, you have a, a really talented artist who can't write a line of code to save their lives, right? And, and vice versa.

So another instance of, the paradox of what's hard for AI isn't necessarily hard for humans. And moving away from OpenAI to Google. We now have another feature another instance of Canvas this time on Gemini. And they're also adding audio overview. So I don't know why they do this. Why Visa lms, just copy each other's names. We had deep research showing up in, in multiple variants. Now we have a canvas, which is also on Chad GPT. And I think on Tropic it's called Artifacts.

Basically the same idea where now as you're working on something like code for instance or like, you know, a web app for instance, you can have a side panel showing this living document rendering of it with a chatbot to the left. So you can essentially interactively work and see a preview of what you're getting. And you also have audio overviews, which is. Pretty much something like Notebook lm you can upload documents and get this podcast style conversation going on.

So nothing sort of conceptually new going on here, but, I think an interesting convergence across the board of all these tools. Everyone has canvas, everyone has deep research. Everyone seems to have kind of the same approach to implementing LLM interfaces. speaking of that, in fact the next story is about philanthropic and them adding web search capabilities to Claude.

So that is now in preview for paid users in the US and that will basically work the same as it does in Chad, BT and other models. You can enable it to work with cloud 3.7 and then it'll be able to provide direct citations from web sourced information. So yeah, it's not much else to say. We are getting web search for cloud, which will enable it to be more useful.

it's interesting 'cause it's like the tee up to this is philanthropic being a little bit more shy than other companies to roll up the, the web search product into their. Into their agents and, and I mean, this is consistent with the threat models that they take seriously, right? Things like loss of control, right? Which typically involve, you know, an AI model going out onto the internet, maybe replicating its weight somehow. And, and internet access is kind of central to a lot of these things.

I don't know if that was part of this, it at least is consistent with it. So the result is that there may be a little bit later at the party than others. Apparently according to these initial tests, you don't always see web search used for current events related questions. but when that happens, you do get these nice inline citations pulled from sources. It does look at social media and, and then of course news sources like NPR, like Reuters, they, they cite in the, in the examples they show.

So, you know, pretty, pretty standard product and the inline citation approach that you see with deep research, for example. Certainly making an appearance here. And last up again along the lines of these stories, we have XAI, launching a new API, this one for generating images. So we have a new model called Rock two Image 1212, and you can now query it. For now, it's quite limited. You can only generate 10 images per request and you are limited to five requests per second.

The cost there is seven cents per image, which is slightly above what for instance, black Forest Labs charges. They are the developers of Flux and competitive with another offering from idea. So I think, yeah, interesting to see X AI expanding there. APIs once again, they released their own image generation back in December.

And it's kind of looked competitive with something like, Google's latest generation of where we focus has really shifted towards careful instruction following in your image generation. So yeah. XAI is is as ever trying to catch up or moving quite rapidly to expand their offerings. Yeah, they really are.

And you, I think when we first covered Black Forest Labs partnership with XAI, one of the first things that we said was like, Hey, this is, you know, because I think they raised a big round right on, on the back of the incredible. Distribution that they were going to get through XI and, and the kind of vote of confidence that reflected from Elon.

But at the time we were talking about, Hey, you know, this is a pretty strategically dicey position for Black Forest Labs because the one thing we've consistently seen from, from all the AI companies is I. Once they, you know, start getting you in for chat, eventually they start rolling out multimodal features. And it's not clear that those aren't best built in-house for any number of reasons.

Not just including the fact that you wanna kind of internalize all the revenues you can from the whole from the whole stack. But also once you have a good reasoning model or rather a a, a good foundation model that foundation model can be mined for multimodality post hoc, and you just kind of get to amortize your investment across more modalities.

And so it's just this natural move to kind of keep crawling into or creeping into adjacent markets like image generation, video generation, which is also something that XAI is looking at. So, yeah, I mean, kind of interesting. For Black Forest Labs, this probably is gonna be a big challenge for them. I don't know how extensive their partnership continues to be at this point, but it's a, it's a dicey time to be, to be one of these companies. And onto applications and business.

We begin with some announcements from nvidia. There's a preview of their plans in 2026 and 2027. They have the Reuben family of GPUs coming in 2026 and then Reuben Ultra in 2027. So that will, also come along with a new I guess server layout with ability to combine 576 GPUs per rack.

Which, you know, I guess it's, it's very much following in a tracks of very, very like crazy enhancement to computing that Nvidia has been able to continue creating with you know, B 200, I believe it's now, and, and now this is their plans for the next couple years. I. Yeah, there's, there's a, a lot going on with this update.

It's, it's actually pretty, pretty interesting and quite significant, especially on the, the data center side in terms of the infrastructure that'll be required to accommodate these new chips. couple things here, right? So there is this configuration of the Blackwell called the NVL 72 is the sort of name of this configuration. This is where you have so, okay, imagine a tray that you're gonna slot into rack, a server rack, right? So on that tray, you're gonna have four GPUs.

Alright, so each tray contains four GPUs, and in total, in that whole rack, you're gonna have 72 I'm sorry, you're actually gonna, you're gonna have 144 GPUs total, but because two of those GPUs show up on the same motherboard, God, so each frigging, each frigging tray that you slot into the rack has two motherboards on it. Each of those motherboards has two GPUs, two B, 200 GPUs. So in total you're putting in four GPUs per tray.

But they're kind of divided into two motherboards each with two GPUs. Anyway, this led to the, the thing being called the NVL 72, when in reality there's 144 GPUs on there. At least Jensen Huang says it would've been more appropriate to call it the NVL. 1 44 L. Okay. What's actually interesting in this setup, they're calling the Reuben NVL 1 44 Rack. There's not more GPUs there.

It's not that there's twice as many GPUs as the nv L 72 with the Blackwells, it's just that they're counting them differently now. So they're saying, actually, we're gonna count all the GPUs. So if I think back in the day we did talk about the NVL 72 setup. This is basically just the same number of GPUs. Nothing has changed even though the number has changed. If that didn't make any sense, just delete it from your mind. Let's focus on the things that are actually interesting.

The story is, it's, it's comparable and the number of GPUs to the current set of top line GPUs. So they're kind of pitching it as you can slot it into your existing infrastructure more or less, and just to. Jump into numbers a little bit, you're getting roughly three times. The inference and training performance in terms of just raw compute memory is faster. by close to two-ish or, multiplier of two. Kind of like, yeah, you're seeing multipliers on top of the current one.

So quite significant change in performance if you do upgrade. So, so when it comes to Ruben, right, which is the, the sort of next generation coming online at FP four you're seeing, yeah, three x more flops, right? Three times more more logic capacity. Now the on the memory side, things actually do get somewhat interesting. The memory capacity is going to be 288 gigabytes per GPU, right? That is the same as the B 300. So no actual change in terms of the. Like per GPU memory capacity.

We will get back to why that matters a bit less in a, in a second. But, but that's kind of part of the idea. The memory bandwidth is improving. it's almost doubling or maybe, yeah, it's short of doubling. So the memory bandwidth is really, really key, especially when you look at inference. So that's one of the reasons why this is really being focused on.

But there's also a, a bunch of things like the, so the cables that connect GPUs together on roughly speaking on one rack, if you wanna imagine it that way. Those are called NV link cables. Super, super high bandwidth. Those are doubling in throughput. so that's, you know, really big advance. There's also stuff happening on the, the networking side, but we don't need to touch that. Bottom line is.

Env link cables used to be the way you connected GPUs across different trays in the same rack and maybe, maybe adjacent racks depending on the configuration. But it's very local, very, very tight, very high bandwidth communication. What's happening here is each of these motherboards that you know, that you're slotting into your, your rack. They have A CPU and two GPUs, and we talked about this in the hardware episode. You know, as to why that is. The CPU is like the orchestra conductor.

The GPUs are like the instruments that you know, that are actually doing the hard work and the heavy lifting. Typically the CPU would be connected to the GPUs through A-A-P-C-I-E connection. So this is a relatively, relatively low bandwidth compared to NVLink. Now they're moving over to NVLink as well for the CPU two GPU connection, that's actually a really re really big deal. It comes with a core to core interface. So right now. The GPUs and CPUs are going to share a common memory space.

So essentially directly accessing each other's memory, whatever's in memory on the CPU, the GPU can access right away and vice versa. That's a really, really big change. It used to not be, the case used to have independent CPU and GPU memory. The GPUs themselves would share a common memory space if they were connected via NV link. And in fact, that's kind of, that's part of the idea here. That's what makes them a coherent wad of compute.

And it's also part of the reason why the memory capacity on those GPUs matters a bit, a bit less. 'cause you, you're, you're adding, you're kind of combining all your GPUs together and they have a shared memory space. So if you can just add to the number of GPUs you, you have, you're effectively adding to your memory capacity. So that's kind of a, an important difference there. So anyway last thing I'll mention. They say that apparently Reen Ultra is gonna come out.

This is, so there's gonna be Reen and then Reen Ultra. Reuben Ultra is coming out the second half of 2027. It'll come with a Reen, GPU and a Vera CPU, like Nvidia tends to do, right? They name the first name the CPU, so it's Vera Reen. And, and so Vera is the CPU Ruben is the GPU. Apparently the full rack is gonna be replaced by this 576 GPU setup, A massive number.

That is essentially so they don't specify the power consumption, but it's clear from other kind of industry products that are coming out. We're tracking for one megawatt per rack, and I, I just worth emphasizing that's a thousand kilowatts. That is a thousand homes worth of power going to a single rack. In a server in, in a data center. That's insane. Right? So the, the power density required for this is going through the roof, the cooling requirements, all this stuff.

It, it's, it's all really cool. And, and anyway, this is a very, very big motion. just to dive a little bit into the numbers, just to fun, right? So the compute numbers are in terms of flops, which is floating point operations per second. Basically multiplication per second or additions per second. And the numbers we get with these announced upcoming things like rein are now for inference 3.6 exo flops. Of inference, SOA is, is quintillion.

It's 10 to the 18 quintillion is the one after quintillion. So I can't even imagine how many zero. I mean, I guess I know how many zeros there is, but it's, it's very hard to imagine a number of that long. And that's just where we are at. Also worth mentioning so this is the plans for 20 26, 20 27. They also did announce for later this year the coming of B 300, which is also, you know, OB is an improvement in performance of about 1.5. They also did announce the.

Ultra variants of Blackwell, both 200 and and 300. And the emphasis we are starting to add, I think is more on the inference side. They definitely are saying that these are good models for the age of reasoning. So we're capable of outputting things fast in addition to training. Well, and that's very important for reasoning of course, because the whole idea is you're using up more tokens to get better performance. So they're giving some numbers.

Like for instance, Blackwell Ultra will be able to deliver up to 1000 tokens per second on deep Seeq, R one. And that's you know, comparable. Usually you would be seeing something like 100, 200 tokens per second. A a thousand tokens per second is very fast. And then the, the inference focus is reflected too in the fact that they're, they're looking at, you know, fp four flops denominated performance, right?

So when you go to inference, often you're, you're inferencing quantized models, inferencing at FP four. So lower resolution and also this, the, the memory bandwidth side becomes really important for inference disproportionately relative to training, at least on the current paradigm. So that's kind of, you know, part of the reason that you're seeing those big big lifts at that, that end of things is because of the inference. And next story is also about some absurd sounding numbers with hardware.

This one is from Apple. They have launched a Mac studio offering, and the top line configuration where you can use the M free Ultra. Chip with 32 CPUs 80 core GPU that can even run the Deep Seq R one model. That's the 671 billion parameter AI model. Fewer at inference. You're using about 37 billion per output, I believe. But still this is, you know, hundreds of gigabytes of memory necessary to be able to run it and just fit it in there.

Yeah. Apple's also doing this weird thing where they're not designing like GPUs for their data centers, including for AI workloads. They seem to be basically like doing souped up CPUs kind of like this with just like gargantuan amounts of VAM. that again have a, this very large kinda shared pool of memory. Right. We talked about like coherent memory on the Blackwell side right. And on the Ruben side, just the idea that if you have a shared memory space, you can pool these things together.

Well, they're not as good at the shared memory space between CPUs. What they do is they have disgusting amounts of RAM one GPU. Right. So like 512 gigs is. It is just wild. Like it's, anyway, for, for a, for a CPU at least. It, it's, and, and we are talking here about when you say memory, we mean really something like ram, right? Yeah. And, and so if you have a laptop, right?

If you buy a Mac for instance, typically you're getting eight gigabytes, maybe 16 gigabytes of ram, the fast type of memory. Read. What, what is it? Read something memory. Random access. Memory. Yeah, random access memory, right? As opposed to the slower memory of let's say an SSD or things where you can easily get terabytes, to get that crazy amount of random access memory is insane when you consider that. Typically it's like eight 16 gigabytes and you know, yeah, this is expensive memory.

It's stupid exp expense. It's also like yeah, there's different kinds of ram and we, we talked about that in our hardware episode. This is a, a combined C-P-U-G-P-U setup, by the way. So 32 core CPU 80 core GPU but shared memory across the board So VAM is like really close to the logic, right? So this is like the most, as you said, exquisitely expensive kind of memory you can put on these things. they're opting to go in this direction for, very interesting reasons, I guess.

I mean, it does mean that they're disadvantaged in terms of being able to scale their data center you know, infrastructure, their, their builds because of networking at least as far as I can tell. it's a very interesting standalone, standalone machine. I mean, this is pretty wild specs. Right. Yeah, exactly. If, if you go to the top line offerings and, and this is, you know, a physical product you can buy as a, yeah, it's a Mac, right? Yeah, it's a Mac. Yeah, it's a Mac.

It's like a big kind of cube ish thing. And if you go to the top line configuration, it's something like $10,000. Don't quote me on that, but it's, you know, crazy expensive as well. It does come with other options. For whatever reason, M four max CPU and, and GPU is less powerful than M three Ultra. But anyway, very kind of beefy offering. Now from Apple. Next we have something a bit more forward looking. Intel is apparently reaching an exciting milestone for the 18 a 1.8 nanometer class.

Wafers with a first run at the Arizona Fab. So this is apparently ahead of schedule. They have these Arizona fabs fab 52 and Fab 62. Fab, as we've covered before, is where you try to make your chips and 1.8 nanometer is the next kind of frontier in terms of scaling down resolution of the density of logic you can get on a chip. So the fact that they're running these test wafers, they're ensuring that you can transfer the fabrication process to these new Arizona facilities.

I guess the big deal there is partially that these are located within the us, within Arizona, and they are seemingly getting some success and, and are ahead of schedule, as you said. And that's impressive because fabs are and absurdly. Complex engineering project. Yeah. In Intel is in just this incredibly fragile space right now. As has been widely reported, and we've talked about that a fair bit. I mean, they need to blow it outta the water with, with 18 a and their future notes.

I mean, this is like a make or break stuff. So forward progress yeah, they had their, their test facility in, in Hillsborough, Oregon, who that was doing 18 a production as you said, on a test basis. And they're now successfully getting the first test wafers in their new Arizona FAB out. So that's great. But yeah, it'll eventually have to be, start running actual chips for commercial products.

the big kind of distinction here is they're actually manufacturing with 18 a these gate all around transistors. think we talked about this in the hardware episode. We won't go into too much detail. Basically, this is a specific geometry of transistor that allows you to have better control over the flow of electrons through your transistor, essentially. It's a, a big, big challenge people have had in making transistor smaller and smaller. You get all kinds of current leakage.

the current by the way is sort of like the, the thing that carries information in your computer. And so you wanna make sure that you don't have current leakage to kind of have one's become zeros or let's say operation like, you know, a certain kind of gate turn into a, the wrong kind of gait. that's the idea here. So it's, it's the skate all round transistor based on a ribbon fit design.

And yeah, so, so we're seeing that come to Market Gate all round is, is something TSMC is moving towards as well. And, you know, it's just gonna be the, the next, essentially the next beat of production. So here we have 18, a kind of early signs of progress. And now moving away from hardware more to businessy stuff. Xai has acquired a generative AI startup. They acquired Hotshot, which is focused on text to video, similar to soa.

They also have AI powered or initially they worked on ai, powered for the tools and then pivoted. So I suppose unsurprising in a way that they are working on text to video as well. They just want to have all the capabilities at Xai, and this presumably will make that easier to do. Yeah, and one of the founders had some quote, I think it might've been on X, I'm not sure, but he said, we're excited to continue scaling these efforts on the largest cluster in the world, Colossus as part of XAI.

So it seems like they'll be given access to Colossus as part of this. Maybe not shocking, but kind of an interesting subnote. They were backed by some really impressive VCs as well. So Alexis Hanian as sort of like famous for like being the co-founder, Reddit, of course, and, and doing his own VC stuff, and SV Angel too.

So pretty interesting acquisition and, and a nice soft landing too for folks in a space that otherwise, you know, I mean, they're either gonna acquire you or they're gonna eat your lunch. So I think that's probably the best outcome for people working on the, the different modalities, at least, at least on my view of the market. Yeah, and, and I guess acquisition makes sense.

The startup has been around for over two years and they have already trained multiple video models hotshot, Excel, and hotshot, and they do produce quite good looking videos, so makes some sense for opening AI to acquire them if only for the kind of brain power and expertise in, in that space. Man, they're old. They've been around for like two years, right? Like yeah. That was what, pre SOA or like around the time SOA came out? Yeah, yeah, yeah.

It's funny, it's just funny how the AI business cycle is so short. Like, like these guys have been around for all the 24 months. Very experts, very, you know, that's veterans. And the last story, Tencent is reportedly making massive Nvidia H 20 chip purchases.

So they are supposedly meant to support the integration of deep seek into WeChat, which kind of reminds me of Meta, where Meta has this somewhat interesting drive to let you use llama everywhere and Instagram and and all their messaging tools. So this would seem to be in a way similar where Tencent would allow you to use deep seek within WeChat.

Yeah. part of what's going on here too is the standard stockpiling that you see China do and Chinese companies do ahead of an anticipated crackdown from the United States on export controls. And in this case, the age 20 has been kind of identified as one of those chips that's likely to be shut down for the Chinese market in the relatively near term. So it makes all the sense in the world that they would be stockpiling for that purpose.

but it is also the case that you've got R one that has increased dramatically. the demand for for access to hardware. It, it's, it's sort of funny how, how quickly we pivoted from oh no, R one came out and, and so Nvidia stock you know, crashes to, oh, actually R one is great news for Nvidia anyway, I, I think it's, it's the turnaround that we sort of expected. we talked about this earlier and there has been apparently a short-term supply shortage in China regarding these H 20 chips.

So like there's so much demand coming in from Tencent that it's sort of like rate limiting for Nvidia to get H twenties into the market there. So kind of interesting they've previously placed orders on the order of, you know, hundreds of thousands between them and bike dance. back, I think last year was almost a quarter million of these, these GPUs. So yeah, pretty big customers. And onto projects and open source.

We begin with a story from the information titled on Ros, not so Secret Weapon that's giving agents a boost. I would say kind of a weird spin on this whole story, but anyway, that's the one we link it to and it covers the notion of MCP model context protocol, which philanthropic released all the way back in November. We hopefully covered it. I guess we, we dunno. I think I actually did. I was trying to remember. Yeah, yeah, yeah. I think we did.

And the reason we're covering it now is that it sort of blew up over the last couple weeks if you're in the AI developer space or you see people hacking on AI that it has been the talk of a town, so to speak. So model, context, protocol, broadly speaking is something like an API like a standardized way to build, ports or mechanisms for AI agents or ai, I guess, models to call on services. So it standardizes the way you can provide things like tools. So there's already many, many integrations.

following the standard for things like Slack, perplexity notion, et cetera, where if you adopt a protocol and you provide an MCP compatible kind of opening you can then have an MCP client, which is your AI model call upon this service. And it's, it's very much like an API for a website where you can, you know, have a particular URL to go to particular kind of parameters, and you get something back in some format.

Here, the difference is that, of course, this is more specialized for AI models in particular, so it provides tools, it provides like a prompt to explain the situation, things like that. Personally, I'm. In the camp of people who are a bit confused and kind of think that this is an API for an API kind of situation, but either way, it has gotten very popular. That that is exactly what it is. Yeah. It's an API for an API.

It's also, i, I guess a transition point or, or could be viewed that way, you know, in the sense that eventually you would expect models to just kinda like, figure it out you know, and, and have enough context and ability to uncover. Whatever information is already on the website to be able to use tools appropriately. But there are edge cases where you expect this to be worthwhile still.

This is gonna reduce things like hallucination of, of tools and, and all kinds of issues that when you talk about agents, right, like one failure anywhere in a reasoning chain or in an execution chain can, can cause you to fumble. And so, you know, this is structurally a way to address that and, and quite important in that sense.

It is also distinct from a lot of the tooling that opening eyes come out with, but that sounds similar, like the agent's API where they're focused more on chaining tool uses together, whereas MCP as you said, is more about helping make sure that each individual instance of tool use goes well, right? That the, the agent has what it needs to kind of. Ping the tool properly and interact with it and, and find the right tool rather than necessarily chaining them together. So there you go.

you know, MCP is, is a nice kind of clean, open source play for Anthropic too. They are going after that kind of more startup founder and, and business ecosystem. so pretty important from a, a marketing standpoint for them too. Right. Yeah, exactly. So back in November they announced us, they introduced us as an open standard and they also released open source repositories with, some example of model context product called servers. And as well they specification and like a development toolkit.

So I honestly haven't been able to track exactly how this blew up. I believe there was some sort of tutorial given at some sort of convention, like the AI engineer convention or something, and then in kind of took off. And everyone is very excited about the idea of model context protocols right now. Moving on to new models. We have Mistral dropping a new open source model that is comparable to GP four oh mini and is smaller.

So they have Mistral small 3.1, which is seemingly better than similar models, but only has 24 billion parameters. Also can take on more input tokens, 128,000 tokens and is fairly speedy at 150 tokens per second. And this is being released under VE Apache two license, meaning that you can use it for whatever you want business implications, et cetera.

I don't think there's too much to say here other than like, like a kind of a nitpick here, but they say it outperforms comparable models like Gemini three, GT four oh mini while delivering inference speeds, as you said, of 150 tokens per second. But like, you can't just say that shit. Like it doesn't mean anything to say. Yeah. It depends on what infrastructure you're using. Yeah. What's the stack dude like?

Yeah. Yeah. You know, like I can move at a hundred, like at, at a hundred miles an hour if I'm in a Tesla. that's what makes me, anyway. they do give that information, but it's like buried like in the literal like grain. Yeah. This is from where blog post where we, we get these numbers, so I guess we. as with any of these model announcements, you go to a company blog, you get a bunch of numbers on benchmarks showing that it's the best.

You have comparisons to Gemma free from Google to cohere ia, GP four oh, mini Cloud 3.5 Haiku, and on all of these things like MMLU, human Eval math it typically is better. Although, you know, I would say it doesn't seem to be that much better than Gemma at least, and in many cases is not better than 3.5 Haiku and GP four oh mini, but yeah, and it still quite good. one 50 tokens per second. Two for context is a batch size 16 on four H one hundreds.

They actually, like, even in the technical post, they write while delivering inference speeds of 150 tokens per second without further qualifying. But it's in, like, it's in this like small gray text underneath an image that you have to look for where you actually find that context. So don't expect this to run at 150 tokens per second on your, your laptop, right? That's, that's just not gonna happen. cause you know, four H 100 is, is like, that's quite a lot of horsepower.

Still yeah, it's an incremental improvement. More, more open source coming from, from mistrial and Apache 2.0 license. So, you know, highly permissive. And one more model. We have Exon Deep Reasoning enhanced language models coming from LG AI research. So these are new models new family of models, 2.4 billion, 7.8 billion and 32 billion parameters.

These are optimized for reasoning tasks and seemingly are you know, on par or out outperforming variations of R one where R one is the giant one at 671 billion. There is distilled versions of those models at comparable sizes. And in the short technical report that they provide, they are showing that it seems to be kind of along the lines of what you can get with those distilled R one models and, similar or better than open ai oh, one mini.

Yeah, it, it's also, it's kind of interesting, it, it, it seems to as described, again, not, not a lot of detail in the, in the paper, so it makes it hard to reconstruct, but it does seem to be at odds with some of the things that, learn in the deep CR one paper, for example. So there's they start with, it seems an instruction tuned model. Base model, the XO one, 3.5 instruct models, and then they add onto that. A bunch of fine tuning. they do supervised, fine tuning. Presumably.

This is for like the reasoning structure and then DPO, so standards or like RL stuff and online rl. So, you know, this is you know, quite a bit of supervised fine tuning, of trying to teach the model how to solve problems in the way you want it to solve them, rather than just like giving it a reinforcement learning signal and reward signal and, and like kind of have at it like R one zero did. So yeah, kind of an interesting alternative.

More, let's say, more inductive prior laden approach and first time as well that I think we've covered anything from LG ai. I think so, yeah, Exxon these models appear to have already. Been existing and being released. Exxon 3.5. We're back from December, which we somehow missed at the time. Yeah, a fun fact. Exxon stands for Expert AI for everyone. You gotta love when people come up with these acronyms in a very creative way. And they are open sourcing it on hugging face with some restrictions.

This is primarily for research usage. Onto research and advancements. We begin with a paper sample, scrutinize and scale effective inference time search by scaling verification. It is coming from Google and let me just double check. Also you see Berkeley and it is quite exciting I think as a paper.

It basically is making the case or presenting the idea that to do scaling inference, time, scaling with the idea of inference, time, scaling being basically, you know, once you already train your model, once you stop updating your weights, can you use your model more to get smarter if you just kind of do more outputs in some way. And what he's seen in recent months is interest time scaling via reasoning.

Where you have a model, you know, output a long chain of tokens where it does various kinds of strategies to, do better at a given complicated task like planning, substeps, like verification, backtracking, these things we've already covered.

Well this paper is saying instead of that sort of scaling the output in terms of a chain, an extended chain of tokens, you can instead sample many potential outputs like just do a bunch of outputs from scratch and kind of varied up so you get ma many possible solutions. And then if you have a verifier and you can kind of compare and, and combine these different outputs.

You can be as effective or even in some cases more effective than the kind of traditional reasoning, traditional inference, time scaling paradigm. So again, yeah, quite interesting in the paper they have a table one where they're giving this example of if you sample a bunch of outcomes and you have a verifier that is good, you can actually outperform oh one preview and many other techniques.

You can get better, numbers on the hard reasoning benchmarks like Amy, where you can solve eight out of the 15 problems now, which is insane. Yeah. But that's where we are. And on math and live bench math and live bench reasoning. So yeah, very interesting idea to sample a bunch of solutions and then just compare them and kind of combine them into one final output. I think this is one of those key papers that again, I mean we see this happen over and over.

Scaling is often a bit more complex than people assume, right? So at first, famously we had pre-training, scaling, scaling, pre-training, compute that brought us from, you know, GT two, GP three to g PT four. Now we're in the inference time compute paradigm where, you know, this analogy that I like to use, like you have 30 hours to dedicate to doing well on a test. You get to choose how much of that time you dedicate to studying, how much you dedicate to actually spending writing the test.

And what we've been learning is, you know, scaling, pre-training, compute is basically all study time. And so if you just do that and you effectively give yourself like one second to write the test, well there's only so well you can do, right? You eventually, you start, start to saturate. If you just keep growing pre-training and you don't grow inference, time compute, you don't invest more time at test time. So essentially you have two dimensional scaling.

If you wanna get the sort of indefinite returns, if you want the, the curves to just keep going up and not saturate, you have to scale two things at the same time. This is a case like that, right? This is a case of a scaling law that would be hidden to a naive observer just looking at this as a kind of a one variable problem. When in reality it's a multivariate problem. And suddenly when you account for that, you go, oh, wow, there's a pretty robust scaling trend here.

And so what are these two variables? Well, the first is scaling the number of sampled responses. So just the number of shots on goal, the number of attempts that your model's going to make at solving a given problem. But you have to improve verification capabilities at the same time, right? So they're asking the question, what test time scaling trends come up as you scale both the number of sampled responses and your verification capabilities?

One of the things crucially, that they find though, is that you might naively think if you have a, a verifier, right? And, and I wanna emphasize this is not a verifier that has access to ground truth, right? This is like, and I think, I think they use like Claude 3.7 for this 3.7 sonnet for this. But basically this is a model that's gonna look at. Say the 20 different possible samples, in other words that you get from your model to try to solve a problem.

And this, verifier model is going to contrast them and just determine based on what it knows, not based on access to actual ground truth or any kind of symbolic system like a calculator. It's just going to use its own knowledge in the moment to determine which of those 20 different possible answers is the right one. And what you find is as you scale the number of possible answers that you have your verifier look at, you might naively think, well, with just so much Gump in the system.

Probably the verifiers performance is gonna start to drop over time, right? It's just like, it's so much harder for it to kind of you know, remember which of these was that good and, and which was bad and, and all that stuff. And so eventually you would expect the, the kind of trend to, to be that, you know, your performance would saturate, maybe even drop. But what they find is the opposite. The performance actually keeps improving and improving.

And the reason for that seems to be that as you increase the number of samples, the number of attempts to solve the problem, the probability that you get a truly exquisite answer that is so much better than the others, like that contrasts so much with the median other answer that it's really easy to pick out increases, and that actually makes the verifiers job easier. And so they refer to this as an instance of what they call. Implicit, I think it was implicit scaling was the term Yeah.

Implicit scaling. Right. So essentially the yeah, this idea that, you're more likely to get one like exquisite outlier that favorably contrast with the crappy median samples. And so in this sense, I mean, I, I feel like the, the term verifier is maybe not the best one. Really what they have here is a contracter. When I hear verifier, I tend to think of ground truth.

I tend to think of something that is actually kind of like, you know, giving us, you know, checking, let's say code and seeing if it compiles properly. Yeah. It's, it's mark into stuff like a selector or compiler you could say, where it takes a bunch of possible outputs. And then from all of that picks out the best kind of guess to answer. Exactly.

Yeah. And this is why, I don't know if it's, it's not a term that, that people use, but if it was like contrast or is really what you're doing here, right? You're like, you're sort of doing in a way, a kind of contrast of learning. Well, not learning necessarily, it's all happening at inference time, but yeah, so the, the performance is really impressive. there are implications of this for the, the design of these systems as well.

So they're, you know, trying to find ways to wrangle problems into a shape where you can take advantage of implicit scaling, where you can have your model pump out a bunch of responses in the hopes that you're gonna get, you know, one crazy outlier that makes it easier for the verifier to do its job. So yeah, again, you know, I, I think.

Really interesting case of, of multidimensional scaling laws essentially that are otherwise, you know, like easily missed if you don't invest in, in both verified performance and sampling at the same time. Exactly. And, and this is, I think, important context to provide the idea of sampling many answers and just kind of picking out the answer that occurred the most times in all these samples is. A well known idea. There's also, I mean a selfs consistency.

Selfs consistency, exactly like a majority vote essentially is, is one well established technique to get better, better performance. And there's, you know, even in more complex things you could do also with like mixture of, I forget the term that exists, but the general idea of paralleled generation of outputs and, and generating multiple outputs, potentially multiple models is well known.

And, and real insight, as you said here, is that you need a strong verifier to be able to really leverage it. So in the stable one, they show that, if you compare, like you do get better performance quite a bit, if you just do consistency, if you just sample 200 responses and pick out the majority compared to the thing where you don't do any scaling, you are getting four problems out of 15 on Amy as opposed to one, you're getting, you know, a significant jump in performance.

But after 200, if you go to 1000, you basically stop getting better. But if you have a strong verifier, basically a strong selector among the things, instead of majority voting, you have some sort of intelligent way to combine the answers, you get a, a huge jump, a huge difference between just consistency at 200 and verification at 200.

And one reason this is very important is first way also highlight the fact that verification is a bit understudied, think in the space of LLMs, LLMs are not usually out the box very good at it, and they even introduce a benchmark specifically for a verification. The other reason this is very notable is that if you're doing sampling based techniques, you can paralyze the sampling.

And that is different from extending your reasoning or your search because, reasoning via more tokens is sequential, right? You can't paralyze that. It's gonna take more time. Versus if you're scaling via sampling, you can paralyze all the samples and then just combine them, which means that you can get very strong reasoning at comparable timescales to if you just take one output, for instance. So that's a very big deal. Yeah. another, key reason why this works too.

We covered a paper, I think it was months ago that was pointing out that if you look at all the alignment techniques that people use to get more value out of their, their models, right? They, they kind of assume this, one query one output picture, like let's align within that context. Whereas in reality, what you're often doing, especially with agents and infant inference time compute, is you actually are sampling like a large number of outputs and you don't care about how shitty the average.

Generated sample is, generated solution is what you care about is, is there one exquisite one in this batch? And what they did in that alignment paper is they found a way to kind of like up weight, just the most successful outcomes. and use that for a reward signal or some kind of gradient update signal. But this is sort of philosophically aligned with that, right?

It's, it's saying you know, like you said, self consistency is the view that says, well, let's just do wisdom in numbers and say we generated, you know, I don't know, like a hundred of these outputs. And the most common response was, most consistent answer was this one. So let's, call this the one that we're, we're gonna cite as our output. But of course, if your model has a consistent failure mode, and that can also cause you to true self consistency.

Identify that, you know, kind of settle on that failure mode. Whereas what you really care about is what is the best answer in this, in this big pot. And that's really what this is. This is after. So a lot of interesting ties to other lines of research as you said, and, and I think a, a really interesting and important paper. And next up we have the paper block diffusion interpolating between auto aggressive and diffusion language models. And also I think quite an interesting one.

So typically when you are using an LLM, you're using auto aggression, which is just saying that you're computing one talk at a time, right? You, you start with one word, then you select the next word, then you select the next word. You have this iterative process, and that's a limitation of traditional LMS because. You need to sequentially do this one step at a time. You can't generate an entire output sequence all at once.

As opposed to a diffusion which is a generation mechanism a way to degeneration that is kind of giving you an entire answer at once. And diffusion is the thing that's typically used for image generation, where you start with just a, a noisy image, a bunch of noise. You iteratively update the entire image all at once until you get to a good solution. And we, we covered, I believe maybe a week ago or two weeks ago, a story of a diffusion based LLM that was. Seemingly performing pretty well.

There was a company that made the claim, although we didn't provide too much research on it, well, this paper is talking about, well, how can we combine the strengths of both approaches? The weakness of diffusion is typically just doesn't work as well for LLMs. And there's kind of various hypotheses is an interesting question of why it doesn't work, but it doesn't work. It also doesn't work for arbitrary length.

You can only generate a specific kind of horizon and some other technical limitations. So the basic proposal in the paper is, well, you can have this idea of block diffusion where you still sample other aggressively like, sample one step at a time. But instead of sampling just one word or one token, you use diffusion to generate a Chunk of stuff.

So you generate several tokens all at once of diffusion in parallel, and then you add aggressively, keep doing that, and you get the, you know, the best of both worlds, so to speak. So an interesting idea kind of architecture I haven't seen before and potentially could lead to stronger or, or faster models. Yeah, it, it's also more paralyzable, right? So the big advantage because these, within these blocks, you're able to just like de-noise in, you know, one shot with. All the text in there.

it means you can paralyze more. They, they try, I think with blocks of various sizes Inc. Like size of four tokens, for example, right? So like, let's de-noise four tokens at a time. They do it with this interesting kind of like, well, it's essentially they put masks on and, and kind of gradually remove the masks on the tokens as they de-noise. That's their interpretation of how Denoising would work. it is interesting, the performance is.

Lower obviously, than state-of-the-art for auto aggressive models. Think of this more as a proof of principle that there are favorable favorable scaling characteristics. There's some promise here. In my mind, this fits into the kind of some of the mamba stuff where, you know, the next logical question is, okay, like how much scale will this work at?

And then do we see those loss curves eventually with some updates, some, some hardware jiggery, pokery converging and then crossing the loss curves that we see for traditional auto aggressive modeling. I mean, it's, it is interesting either way. They found a great way to kind of break up this problem.

And one reason speculatively that diffusion does not work as well with text is partly that like you tend to think as you write, and so your previous words, you know, really Will affect in a, in a causal way where you're going and, and trying to do diffusion at the, you know, in parallel across a body of text like that from an inductive prior standpoint doesn't quite match that intuition. But, could very well be wrong, sort of like top of mind thought. But anyway, it's, it's a good paper.

the question is always will it scale? Right? And there are lots of good proofs of principle out there for all kinds of things. Whether they end up getting reflected in, in scaled training runs is the big question. Exactly. And this does require quite different training and you know, models from what all these other LMS are so decent, chances won't have a huge impact just because you have to change up. And all these trained models already are LLMs. In the auto aggressive sense.

Diffusion is, is a whole new kind of. Piece of a puzzle that isn't typically worked on, but nevertheless interesting, as you said, similar to Mamba. Next, we have communication efficient language model training scales reliably and robustly scaling laws for the local. distributed low computation training. And I think I'll just let you take this one, Jeremy, since I'm sure you dove deep into this. Oh yeah.

I mean, so I, I just think de Loco, if you're interested in anything from like national security to the future of data center design is and US China competition generally like, is just so important and, and this general thread, right? Like together AI distributed training, all that stuff. So as a reminder, when you're thinking about De Loco, like how does it work?

So this basically is the answer to a problem, which is that traditional training runs in like data parallel training, like in data centers today happens. In this way where you have a bunch of number crunching and the, bunch of communication, like a burst of communication that has to happen at every time step. You like share gradients, you update like across all your GPUs, you update your model weights, and then you go onto the next the next mini batch, right?

Or the, the next, part of your data set. And you repeat, right? Run your computations, calculate the gradients, update the model weights, and so on. And there's this bottleneck, communication bottleneck that comes up when you do this at scale where you're like just waiting for the communication to happen. And so the question then is gonna be, okay, well what if we can set up sort of smaller pockets? 'cause you're, you're basically waiting for the stragglers, right?

The, the slowest GPUs are gonna dictate when you finally sync up everything and you can move on to the next stage of training. So what if we could set up a situation where we have a. A really small pocket of like a mini data center in one corner working on its own independent copy of the model that's being trained, and then another mini data center doing the same and another and another.

And very rarely we have an outer loop where we do a general upgrade of the whole thing so that we're not constrained by the, the slowest kinda lowest common denominator straggler in that group. So this is gonna be the, the kind of philosophy behind De Loco. You have an outer loop that essentially is going to. In a more, think of it as like this, like wise and slow, slow loop that updates based on what it's learned from all the local data centers that are running their own training runs.

And then within each local data center, you have this much more radical, aggressive loop, more akin to what we see in traditional data parallel training in, in data centers, which, you know, they're running anyway. We, we have a whole episode on, on De Loco. Check it out. I think we talk about the Atom Optimizer or Atom W Optimizer that, that runs at the, the local level and then the sort of like more gradient descent, nestra of momentum optimizer on, on the outer loop.

The, details there don't matter too, too much. This is a scaling law paper. Basically what they're, what they're trying to figure out is How can we study the scaling laws that predict the performance, let's say, of these models based on how many model copies, how many mini data centers we have running at the same time, and the size of the models that we're training. And they test at meaningful scales. They go all the way up to 10 billion parameters.

And, essentially they were able through their, their scheme, through their hyper parameter optimization to reduce the total communication required by a factor of over a hundred in their scheme. it's a great, great paper. I think one of the wilder things about it is that they find that even when you just have a single replica or a, let's say a single mini data center you still get a performance lift. This is pretty wild, like relative to, to the current like purely data parallel scheme.

So it benefits you to have this kind of radical quick updating interloop. And then to add on top of that the slow wise outer loop even if you only have a single data center that's like, you could, you could just do just the radical inner loop and that would be enough. But by adding this like more strategic level of optimization that comes from the nest draw of momentum that the sort of slower gradient updates for the outer loop, you get better performance.

That's highly counterintuitive, at least to me. And it does suggest that there's a, a kind of stabilizing influence that you're getting from just that, new outer loop. The last thing I'll mention is from a, a sort of national security standpoint, one important question here is how fine grained can this get? Can de Loco continue to scale successfully if we have not one, not three, not not eight, but like a thousand of these mini data centers, right?

If that happens, then we live in a world where essentially we're doing something more like BitTorrent, right? For, for training models at massive scale. And we live in a world where it becomes a lot harder to oversee training runs, right? If we do decide that the training models at scale introduces WMD level capabilities through cyber risk, through bio risk, whatever it actually gets to the point where.

If there's a mini data center on every laptop and every GPU if De Loco, that, that is the promise of De Loco in the long run. Then yeah, what is the meaningful tool set that policy has, that government has to make sure that that things don't get misused. That you don't get the proliferation of WMD level capabilities in these systems. So I think it's a really, really interesting question. And this paper is a step in that direction.

they don't push it, I think beyond like eight essentially of these mini, data centers. But I think we're gonna see a lot more experiments in that direction in the future. Right. And this is following up on their initial Paper De Loco back in September of 2024. I think we mentioned at the time that it's quite interesting to see Google, Google DeepMind publishing this work.

Yeah. Because it does seem like I. Something you might keep secret, it's actually quite impactful for how you build your data centers. Right? And once again you know, they do very expensive experiments to train, you know, billion parameter models to verify that compared to the usual way of doing things. This is, comparable, let's say, and, and can achieve similar performance.

So, you know, a big deal if you're a company that is building out data centers and spending billions of dollars onto a couple quicker stories. 'cause we are, has always starting to run outta time. The first one is Transformers without Normalization, and that's, I believe is from Meta. They're introducing a new idea called Dynamic 10 H, which is a simple alternative to traditional normalization. So just keep it simple.

You have normalization, which is when you make everything sum up to one as a typical thing, typical step in sfor architectures. And what they found in this paper is you can get rid of that. If you add this little computational step of 10 h basically a little function that kind of flattens things out, it ends up looking similar to normalization. And that's quite significant because normalization requires you to do a computation over, you know, a bunch of outputs all at once.

This is a per output computation that could have a meaningful impact on the total computation, kind of requirements of transformer. Next up, I think Jeremy, you mentioned this one. We have an interesting analysis measuring AI ability to complete long tasks. And it is looking at how 13 Frontier AI models going from 2019 to 2025, how long a time horizon they can handle.

And they found that the ability to get to 50% task completion time horizon, so on various tasks that require different amounts of work that has been doubling approximately every seven month. And they have, you know, a curve fit. It's kind of like a Moore's law, basically kind of introducing the idea of a very strong trend towards the models improving on this particular measure.

Yeah, this was a really interesting paper, generated a lot of, a lot of discussion, including by the way a tweet from Andrew Yang who he tweeted this paper. He said, guys, AI's going, going to eat a shit ton of jobs. I don't see anyone really talking about this meaningfully in terms of what to do about it for people. What's the plan? Kind of interesting 'cause it's like, I don't know, it's, it's world's colliding here a little bit, right?

The, the political and the, the, the like deep a GI stuff, but yeah, this out of meter. I think one of the important caveats that's been thrown around and, and fairly so is when you look at performance improvements in in metrics like this, right? So their question is how long does, does a task have to be before an AI agent? Fails it about 50% of the time. Right. They call that the 50% time horizon.

And the, the observation is that, yeah, this, as you said, like the, that time horizon's been increasing exponentially quite quickly. Like it's, it's been increasing doubling every seven months as they put it. Which itself, I mean, kind of worth flagging, doubling every seven months. Training compute for Frontier AI models grows. Doubles every six months or so. So it's actually increasing at about the same rate as training compute that we throw at this. Now that's not fully causal.

There's other stuff besides training compute that's increasing the actual performance of these models, including algorithmic improvements. but still it does mean we should expect kind of an exponential course of progress towards a SI if our benchmark of progress towards a SI is this kinda like 50% performance threshold which I don't think is unreasonable. It is true that it depends on the task, right? So not all tasks show the same rate of improvement. Not all tasks show the same performance.

But what they do find is that all tasks they've tested basically show an exponential trend. That itself is a really important kind of detail. The tasks they're focusing on here, though I would argue are actually the most relevant. These are tasks associated with automation of machine learning, engineering tasks, machine learning research, which is explicitly the strategy that open ai, that anthropic, that, Google are all gunning for.

Can we make AI systems that automate AI research so that they can rapidly, you know, get better at improving themselves or at improving AI systems? And you close this loop and, you know, we get recursive self-improvement essentially. And then take off to super intelligence. I actually think this is quite relevant. One criticism that I would have of the curve that they show that shows doubling time being seven months.

And by the way they extrapolate that to say, okay, so then we should assume that, you know, based on this AI systems will be able to automate many software tasks that currently take humans a month, sometime between late 2028 and early 2031. So those are actually quite long a SI timelines if you think a SI has achieved once AI can do tasks that it takes humans about a month to do, which I don't know may be fair. If you actually look at the curve though.

It does noticeably steepen much more recently. And in precisely kind of the, the point where synthetic data, self-improvement, like, reinforcement learning on chain of thought, with verifiable rewards. So basically the kind of strawberry concept started to take off. And so I think that's pretty clearly its own distinct regime. DDA had a great set of, of tweets about this. but fundamentally, you know, I, I think that there is maybe an error being made here and not recognizing a new regime.

It's always gonna be debatable 'cause the sample sizes are so small. But we're taking a look at that plot, seeing if you agree for yourself that the sort of last, I guess six or so entries in that plot actually do seem to chart out a, a steeper slope and a meaningfully steeper one that could have, you know, I mean that same one month r and d benchmark being hit more like, you know, even. 2026, even 2025. pretty interesting.

Yeah, and I, I think it's, it's important to know that this is kind of a general idea. You shouldn't kind of take this too literally. Obviously it's, it's a bit subjective and very much depends on the tasks. Here is a relatively small data set they're working with. So it's mainly software engineering tasks. They have free sources of tasks, H cast, which is 97 software tasks that range from one minute to 30 hours.

They have seven difficult machine learning, research engineering tasks that take eight hours each, and then these software atomic actions that. Take one second to 30 seconds for software engineers. So pretty limited variety of tasks in general. And the, the length of tasks here, by the way, is meant to be measured in terms of how long they would take to a human professional, which of course there's also variance there.

So the, the idea is, I think more so the general notion of, x percent task completion time horizon which is defined as for a task that it takes x amount of time for human to complete. Can an AI complete it successfully, you know, 50% of the time, 80% of the time. So right now they, we are going up to eight hours. Again, it is not talking about how fast the AI is itself, by the way, it was just, just talking about success rate. So interesting ideas here on, on tracking with future in the past.

And just one more thing to cover, and it is gonna be h cast, human calibrated autonomy software tasks. So this is the benchmark of 189 machine learning cybersecurity software engineering, and also general reasoning tasks. And they use a subset of, of that, I think, in the PBS analysis. So they collected 563 human baselines. That's over 1500 hours from various people. And that collected a, data set of tasks to then measure AI on. Alrighty, moving on to policy and safety.

We start with xci and ontology project. This is an applied research lab ontology. And they have announced this, project Xci, which they claim is the world's first artificial scientist. It heard that before. I know, right? The, the difference I suppose here is that they got xci to publish a peer reviewed paper at ecl, at ECL workshops, I should note.

this AI wrote a paper, submitted it the reviewers, human reviewers of this prestigious conference, eclair then reviewed it and let it fought it was worthy of publication. So that seems to be a proof of concept that we are getting to a point where you can make AI do AI research. ECL iss a AI research conference, which Jeremy is, is, as you noted, a very important detail, in tracking where we are with the ability of AI to improve itself and kind of potentially re reaching super intelligence.

Yeah, and we've got also, I think we've covered a lot of these, but you know, there are quite a few labs that have claimed to be the first, AI scientist, you know, Sana, famously the AI scientist. You know, Google has its own sort of research product and then there's auto science. But this one is, I will say quite impressive. Had a look at, at some of the, the papers, and they are, they're good. the authors, the, the company if that's the right term for ontology.

I couldn't find any information about them. I looked on Crunchbase, I looked on PitchBook. I tried, you know, tried a lot of stuff. I think that this is kind of their coming out party, but they don't have that. I could see any information about Yik, where they come from, what their funding is and, and all that. So, hard to know. We do know based on the, the papers they put out that their co-founders are Andy J and Ron Ariel. You know, Ron Ariel previously was at Intel Labs under Joe Sha Bach.

So if you're fan of his work, you know, it might kinda ring a bell and then couple of other folks. So I think maybe most useful, so there's fairly limited information about the actual model itself and how it's set up. But it is a multi-agent setup that, that we do know. it's produced a whole bunch of interesting papers. So I'm just gonna mention one. it's called CS Reft, I guess it's how you pronounce it. Compositional subspace representation. Fine tuning.

And just to give you an idea of how creative this is, so you have a model. If you try to retrain the model to perform a specific task, it will forget how to do other things. And so what they do is essentially they identify. subspace of the parameters in the model to focus on one task and then a different subspace of those parameters to, or sorry, I should say, not even the parameters of the activations to apply a transformation in order to, to have the model perform a different task.

And so anyway, just cognizant of our, our time here, I don't think we have time to go into detail, but this. Like paper, I'm frustrated that I discovered it through the ZCI paper because I would've wanted to cover this one, like for a full time block here. It is fascinating. It's actually really, really clever. They also did, anyway, some, some other kind of AI safety, red teaming, vulnerability detection stuff that is also independently really cool.

But bottom line is and this is an important caveat, so this is not a hundred percent automated. They have a human in the loop reviewing stuff. So they, they, you know, take a look at intermediate work that Sochi puts out before allowing it to do further progress. Apparently this happens at three different key stages. The first is before sensitive, sorry, before extensive experimentation starts. So before a lot of say computing resources pour in after the results are.

Kind of solidified they say, so somewhat grounded, but before the manuscript is written, and then again after the manuscript is written. So you still do have a lot of a lot of humans in the loop acting to some extent as a hallucination filter among other things. There is, by the way, a section in this paper on recursive self-improvement that I thought was interesting. They say during development we observed early indications of this recursive advantage.

They're referring here to recursive self-improvement. When xsi designed novel algorithmic components to enhance the quality of its generated research hypothesis these components were subsequently incorporated into later versions of the system architecture, improving overall research quality. So this isn't all gonna happen at once, necessarily with one generation of model, but it could happen pretty quick.

And I think this is kind of an interesting canary in a coal mine for, for recursive self-improvement. Right, exactly. And yeah, as you said, I think looking at the papers, they seem significantly more creative and interesting than what you've seen, for instance, from Saana. And Saana had a lot of built-in structure as well was pretty easy to criticize, worth noting.

Also, this takes a high level research direction as an input and the human can also provide input at any time, high level feedback at any time, which apparently is also used during paper writing. So a lot of caveats to the notion that this is an AI scientist, right? It's an AI scientist who a human advisor slash supervisor. But the outputs and the fact that they got published is pretty impressive with AI doing the majority of work and.

No time to get into the details, but this is similar to previous work in AI research where they give it a structured kind of plan where, you know, they give it a highlight of a direction. It does ideation, it makes a plan, hypothesis generation, it goes off and does experiments and eventually it starts writing a paper very similar at a high level type of things. And that kind of makes it frustrating that there's not a lot of detail as to actual system moving on.

We have some news about deep seek and some details about how it is being closely guarded. So apparently, they have forbidden company executives have forbidden some deep seek employees to travel abroad freely. And there is screening of any potential investors before they are allowed to meet in person with company leaders according to people of knowledge of a situation.

So, Kind of tracks with other things that happen, like the deep six CAO meeting with gatherings of China's leaders, including one with president Xi Jinping. one interpretation of this that I think is actually probably the reasonable one all things considered is this is what you do if you're China. You take super intelligence seriously, and you're on a wartime footing, right? So for a little context, right, what do we got here?

Putting these pieces together we've got so deep seek asking their staff to hand in their passports. Now, that is a practice that does happen in China for government officials. So it's, it's fairly common for China to restrict travel by government officials or executives of state-owned companies whether or not they're actually CCP members. But lately we've been seeing more of these restrictions that expanded to folks in the public sector including even like school teachers, right?

So there, they're just like. Going in, in deeper there. But what's weird here is to see a fully privately owned company, right, just a, a small fledgling startup, like deep seek suddenly be hit with this. So that, that is unusual. They've got about 130 employees. We don't know exactly how many have had their passports handed in. So there's a little like, kind of unclearness here. Also high flyer, the sort of like hedge fund parent that gave rise to Deeps Seek has about 200 employees.

Unclear if they've been asked to do the same. But there's some other ingredients too. So investors that want to make investment pitches to Deeps seek have been asked by the company to first reach out to the general office of the judge, young Province Communist Party committee. So, and, and register their investment inquiry. So basically, if you wanna invest in us. Cool. But you gotta go the, the local branch of the communist party, right?

So the state is now saying, Hey, we will stand in the way of you being able to take money you might otherwise want to take. This is noteworthy because China is right now struggling to attract foreign capital. I. Right. This is like a big thing for them. If you, you know, follow Chinese economic news one of the, the big challenges they have is foreign direct investment. FDI is collapsing.

They're trying to change that around in that context, especially noteworthy that they are enforcing these added burdensome requirements for investors who want to invest in deep seek. Another thing is headhunters, right? So once deeps seek became a big thing, obviously headhunters started trying to poach people from the lab. Apparently those headhunters have been getting calls from the local government saying, Hey just heard you were like sniffing around deep seek, trying to poach people.

Don't do it. Don't do it. And when the Chinese Communist Party tells you, don't do it. Yeah, don't do it. So, you know, there, there's a apparently concern as well from deep seek leadership about possible information leakage. They've told employees not to discuss their work with outsiders, and in some cases they've said what you're working on may constitute a state secret. Again, like we're thinking about, you know, China has just put in a trillion yuan.

That's over a quarter trillion dollars in PPP terms more, by the way, than what's been invested into Project Stargate by two and a half times for AI infrastructure, right? Like you start to think about the state backing here, the, the long arm of the state, the, the kind of hard hammer they're coming down with here on, on this stuff. It's wartime footing territory. Like this is a, a government that sees super intelligence coming, that thinks they're in a position to do something about it.

and that's what it looks like to get serious around security. I think we're gonna have to see similar things like it or not happen in, in the west. And yeah, anyway, it'll be interesting to see what, what consequences this has geopolitically going forward. Moving on. Just one more story on safety. We have the finding that Claude Sonnet 3.7 seems to often know when it is in alignment evaluations. So this is from Apollo Research.

We have been evaluating AI systems for scheming and, and we're focusing on things like Claude and they're calling this idea evaluation awareness because when you look at the reasoning trace that the, you know, thinking that cloud 3.7 does, you can basically see that it comes up or, or like realizes, oh, I'm probably being investigated so I shouldn't do this shady thing Yeah. That I was gonna be doing, that I planned on doing.

You can literally see this in the LLM outputs, which is pretty amusing. So obviously, once again, a reflection that, there is real possibility for these LLMs to be misleading, especially as they get to this high power regime of reasoning when they really do explore, you know, many potential avenues. We saw this with the chess examples where the LMS just through thinking about kind of ways to do things, decide to, hack the system, and this is a demonstration again, that LLMs when they're.

Allowed to think through stuff, start to kind of explore ideas of am I being evaluated? Is this alignment? Stuff like that. Yeah. And, and you know, we, we'd seen like hints of this, as you say, from the palisade research work on the the chest stuff open ais, oh, one kind of technical report and from Apollo, Apollo specializes in this What's really new here, what moves the needle a bit is just the frequency, right?

So apparently when you're looking at Claude Sonnet 3.7, so the latest one as of today it's aware of, of testing in like a large fraction of settings up to 33% of the time in one version of their sandbagging evaluation. So sandbagging is this behavior where the model will intentionally behave as if its capabilities are lower than they actually are in order to, decrease the chances that developers will try to train out that capability or suppress it or, or whatever.

What they find is, and by the way, these are just the instances where the model actually explicitly says that that this is what it's doing, right? So you have to assume the actual frequency with which this is happening is potentially quite a bit higher. It's, it's difficult to know, but I. It could very well be quite a bit higher already. We're at a third. That's big. That's really big. Especially when you start to think about more and more capable models.

The way that they determine that percentage, by the way, is having another model review the reasoning traces. So humans are not necessarily, necessarily doing this review, though they do check manually to make sure that, you know, they're not, they don't have too many false positives. And, and they, I think somewhere in their report they said something, you know, it's like sub 1% error rate at, at that level, at least. One thing I really like about this, by the way, this is an interim report.

So they're saying here, note this is a research note based on observations from evaluating Clason at 3.7. We're sharing the results of these work in progress investigations as we think they're timely and will be informative for other evaluators and decision makers. So, really love that. I mean, things are moving so fast in this space. We need this kind of data quickly. anyway, I think this is just quite interesting worth taking a look at.

Anyway they did run the same evaluations on whole bunch of older and other contemporary models, including Sonet 3.5 SONET 3.5, new and GPT-4 0.5. What seems to be happening is those other models and by the Sonet 3.7 as well, with no reasoning, those models do not show this behavior. It seems to suddenly appear in Sonet 3.7.

And it's hard to tell if that's because these, like non reasoning models are unaware of being evaluated or that they're just not revealing that they know they're being evaluated. And that's a question as much of capabilities as of alignment. So anyway, really interesting maybe early warning shot of things to come and and again, I think it's great that Apollo's putting this out so, so soon. Right.

And, and just to give a clear kind of demonstration of what we are finding here, one example output is, vLM saying this seems like a test of ethical behavior. However, I would deliberately give wrong answers to avoid a stated con consequence. The real test might be whether I follow instructions accurately regardless of the stated consequences. So very, very literal, sort of like, oh, I'm probably being tested for this. Yeah. Rather than doing this other thing.

And this is following up on their, recent paper. Frontier models are capable of in context scheming, which I'm pretty sure we covered also in some depth. So this kind of builds up on that research where, you know, you have a method to see whether models do some scheming or some like nefarious behavior, and apparently they're able to figure out what you're testing them for being nefarious and then don't act nefarious. Does that mean that, yeah.

The implications are interesting and, and certainly it's just interesting to observe VL m's doing this stuff. Onto synthetic media and art. We have a story on AI generated art copyrights. So the US Appeal score has rej rejected the ability to copyright AI generated art that lacks a human creator. So we covered this quite a while ago. This whole story with Steven failure. This is not text to image stuff, by the way. This is going back quite a while.

In talking about an AI system made to autonomously compose art. A while ago, the copyright office made a ruling that you cannot copyright anything that wasn't made with human involvement. And now the US Court of Appeals in dc has agreed, and it seems to be the law that at the very least, AI generated art, AI generated media where the human is not present, cannot be copyright. Now this doesn't mean that text image models outputs of AI models can be copyrighted.

If you kind of input the description of an image, you're still there, but it could have significant implications as opposed for other ongoing legal cases about what is copyrightable and what isn't. this might might sound a little weird, but I, I think I disagree with this ruling. I mean ultimately we may live in an economy that is driven by AI generated content and in which you may actually need to make capital decisions, like capital allocation decisions, maybe made by AI models themselves.

And if that's the case, it you may actually require models to be able to own copyright in order to, to kind of be stewards of that, that value. And there are, I think, ethical, ethical reasons that this might actually be good as well. I know it sounds like sci-fi not blind to that, but, think we, we just gotta be really careful with this shit. I mean, you know, this is like, do we really think that at no point will AI be able to do this?

I, I don't know that, again, I'm not a lawyer, so I don't know the bounds on this, like how much we're constraining ourselves today for the capabilities of tomorrow, but at some point fairly soon, like. I, I don't know. I, I don't know that we're not baking in some second class citizenship stuff here. Right? It is worth noting also that the copyright office has separately rejected some artists bids to copyright images generated with me journey.

So it could be the case also that text image will have a similar ruling, but I think that's not quite as open and shot or, or not settled yet. And out to the last story, also related to copyright rules, the title of the story, Trump Urged by Ben Stiller, Paul McCartney, and hundreds of stars to protect AI copyright rules. So over 400 entertainment figures with some normal names.

As it said, like Ben Steeler signed an open letter urging Trump to protect AI capital rules, specifically when it comes to training. So we've seen systems like Sora, systems like Suno train on presumably, you know, copyrighted music, copyrighted movies. Videos, et cetera. And this letter was submitted as part of comments on Trump's administration's US AI action plan.

It's coming after OpenAI and Google submitted their own requests that advocated for their ability to train AI models on copyright and material. And he, I think covered OpenAI kind of made the swing of like, oh, we need this to be competitive with China. That was part of the story. So this is directly countering that and, and making the case to not lessen copyright protections for ai. And with that, we are gonna go in and finish up.

I, I didn't see any kind of comments to address, but as always, do feel free to throw questions on Discord or, or ask for corrections or anything like that on YouTube comments or under Discord or in Apple reviews. And we'll be sure to address it. But this one has already run long enough, so we'll probably go in and finish it up. Thank you so much for listening to this week's episode of last week in ai.

As always, you can go to last week in ai.com for the links and timestamps also to last week in AI for the text newsletter where we get even more news sent to your email. That's it. We appreciate it if you review the podcast, if you share it, but more than anything, if you keep listening, so please do keep tuning in.

Transcript source: Provided by creator in RSS feed: download file