Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. And as always, you can check the episode description for the timestamps and links for those articles, and also go to lastweekin.ai, where you can browse the website and so on. I am one of your regular hosts, Andrey Kurenkov.
I studied AI at Stanford and I now work at a generative AI startup, Astrocade. And hi everyone, I'm Jeremie, your other regular co-host. I've sort of been in and out the last couple weeks, but yeah, great to be here. Co-founder of Gladstone AI, AI national security stuff. You know the deal if you've been listening to the podcast. And this week, we were talking about this, this happens every once in a while.
Not that often these days, but we're looking at our roster and we're like, man, this is a light week, and I guess it's gonna be a short podcast for that reason. But like hot air that expands to fill the entire volume available to it, I'm sure we'll find a way to make this a two-hour podcast. Nonetheless, it's a problem, or a skill, you know, we really are capable of talking a lot when we have the time.
But to give a quick preview of what we'll be talking about: in tools and apps, there's a variety of smaller tools launching, one kind of major one from OpenAI, but the rest are maybe less notable, though kind of varied and interesting. Applications and business, as is often the case, we're gonna be talking about a lot of hardware, OpenAI spending a lot of money on it, some developments from Huawei, and a couple of business deals as well. Projects in open source.
There's a couple new models out, Gemma 3 and ones from Sesame, pretty exciting ones. Research and advancements: we got Gemini Robotics, which is kind of, I don't know, unexpected for me and pretty exciting, and an interesting paper about test-time compute. And finally, policy and safety, our usual mix. We got one paper on understanding and alignment, and then we have a lot of stories about China-US relations, which seems to be a big deal these days. Yeah, it's going great. It's going great.
Well, let us just go ahead and jump in, starting with tools and apps. The first story is OpenAI launching new tools to help businesses build AI agents. So there's now a new Responses API, which allows you to make custom AI agents that involve things like web search and file search, and that is meant to enable more autonomous capabilities. It also allows you to use the computer-using agent model, which can control much more varied types of things on your device.
And apparently enterprises can run that computer-using agent model locally, although on the consumer version you're gonna have to only use that for web actions. So I think kind of not too surprising, we saw Anthropic also launch a computer use API in sort of early release a while ago. And it would be very interesting for me to see if this will be part of a trend, where the next wave of automation after just plain LLMs seems to be something like this.
It's a really interesting moment for the agentic business model, if you want. Here's OpenAI essentially looking at how to unbundle the package that they've offered historically through the agentic systems they've offered, basically, right? Like, we built the agents, you use them; this is them saying, no, no, we'll give you access to the underlying tooling and you can build your own agents.
You know, this comes with, for example, a file search utility that can scan across files in your database and ground the model's answers on those. So that's at least in principle the guarantee there. But yeah, there's a whole bunch of other stuff, you know, you mentioned the CUA model, the computer-using agent model behind Operator, and that, by the way, generates mouse and keyboard actions, so that's essentially computer use itself.
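To make that a bit more concrete, here's a rough sketch of what calling the new Responses API with the hosted web search and file search tools looks like in Python, based on OpenAI's announcement; the vector store ID is a placeholder and exact tool names may shift as the API evolves.

```python
# Rough sketch of OpenAI's new Responses API with hosted tools
# (tool names per the announcement; the vector store ID is a placeholder).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.responses.create(
    model="gpt-4o",
    input="Summarize the latest coverage of our product launch.",
    tools=[
        # Hosted web search
        {"type": "web_search_preview"},
        # Hosted file search over a vector store you have already created
        {"type": "file_search", "vector_store_ids": ["vs_example_id"]},
    ],
)

# The response aggregates any tool calls plus the final model answer
print(response.output_text)
```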
But essentially, yeah, the unbundling of this gives customers a lot of options for how to create their own agents. Ultimately though, every indication from OpenAI is that they intend to have one experience to rule them all, at least on offer, right? So they are looking to build an integrated solution where it's not unbundled. But this is kind of the other side of the coin, right?
You're either building tooling to empower your users to build their own agentic systems, or you're building one agentic system that, you know, you imagine most consumers would tend to use directly, so they don't have to wrangle their own thing. They're trying to sort of have it both ways here, right?
So unbundled and bundled products, which I would expect anybody with sufficient scale, who can afford to focus on two things at once, is gonna have to do at some point, because it's along the spectrum towards open sourcing, right? When you start unbundling the tools and letting people mess around with them and see what they build. So that itself is kind of interesting.
And then OpenAI can learn from how those tools are being used when unbundled, in the same way that, you know, Meta learns from the way people play with Llama in the open source world, to then integrate those learnings into their own fully packaged agent systems. So kind of interesting, I think a good strategic play. They're opening up a whole bunch of toolkits, including an open source toolkit called the Agents SDK.
And that gives you a bunch of free tools to integrate models with your internal systems, add in safeguards, and do monitoring for your agents as well. So, pretty interesting. It's again this sort of straddling the line between what do we open source, what do we not, and I think a nice middle ground that OpenAI spotted there. Expect Anthropic to do the same; expect ultimately xAI to do the same.
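For flavor, a minimal sketch of what that open source Agents SDK looks like in use, going off the announced `openai-agents` package; the class and method names here are approximate and may differ in the shipped SDK.

```python
# Minimal sketch using the open source Agents SDK mentioned above
# (pip install openai-agents); treat names as approximate, the SDK is new.
from agents import Agent, Runner

support_agent = Agent(
    name="Support agent",
    instructions="Answer billing questions; escalate anything you cannot resolve.",
)

# Runner drives the agent loop (model calls, tool calls, handoffs) to completion
result = Runner.run_sync(support_agent, "Why was I charged twice this month?")
print(result.final_output)
```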
I think a lot of this stuff is gonna get picked up, but it does seem like a good strategic play. Yeah, and I think it also kind of points to something, I suspect; we don't know the details, but I would imagine that the API is where the real money's at, right? There's the consumer version, where you can pay a subscription of $20 per month, or now $200 per month if you are a total power user, and presumably enterprises are paying for that for their workers.
But a bunch of companies, including to some extent our company, are using the API to create their own thing on top of ChatGPT or on top of Claude. And I would imagine long term, that's how OpenAI and Anthropic will be making the majority of their money. And certainly Anthropic is targeting enterprise explicitly. So this also plays into that, I think. Absolutely. Yeah. The money, as ever, is always in B2B. Right.
And it's interesting that in a way you can view the history of ChatGPT as having been, just to use B2C, business to consumer, to build brand recognition for the B2B play that inevitably was going to be where most of the value is created here.
I think the one caveat is when you go really long term with this stuff, eventually, if you talk about superintelligence, if OpenAI plans to start to centralize more and more of the economically productive activity that's happening, which you have to imagine they would despite everything they say publicly about, you know, we want to empower creators, eventually it's like Amazon, right? With, you know, what do they call it?
Amazon Basics or whatever, you know, they spot those products that are selling really well and boom, they'll snap up those opportunities; expect the economics to look the same for OpenAI. When that happens, essentially they're cutting out the business middleman and going straight to consumer and internalizing all that value. In some verticals, not all, but in some.
So I think there's gonna be this interesting evolution, to your point, a transient where B2B is where all the money's to be made, but then, because AI has the ability to eat the world, it's gonna be really interesting to see, do they ultimately become as much a B2C company as a B2B company. Right now, by the way, Anthropic has a massive advantage, relatively speaking, in terms of the balance of consumer versus business. Anthropic is much more focused on the business side.
And that's reflected in just the better coding abilities of, you know, Claude 3.7 Sonnet, Claude 3.5 Sonnet New and all that stuff, even over and above some of the agentic models from OpenAI. But I think that's a great point. It's so interesting, right? These companies are inventing new business models. No one really knows what's gonna work, how they'll evolve over time. The one thing that's guaranteed is they will evolve and we will be surprised. Next up we have a story from Google.
They are now releasing the ability in Gemini 2.0 Flash to have native image output. So that allows you to do conversational image editing in the chat flow; as you chat, you are able to ask it to generate an image. To my understanding, basically this is different because it's not calling on another tool, it's built into Gemini 2.0 Flash itself as a multimodal model.
And so the results can be quite impressive in multi-turn conversations, where one of the key limitations or challenges of image generation is if you wanna edit an image, you wanna also preserve, let's say, the characters, the people, various aspects of the image, while still changing it. And that's one thing you get sort of for free here, because Gemini 2.0 Flash has the context of the conversation, both in terms of the text and the images.
And that means that it can do a pretty stellar job, from examples I've seen, of maintaining consistency, and also being very generalized and able to do all sorts of stuff directed by your instructions. Initially when this was announced, it was available to some testers; now this is rolling out to users and even developers.
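As a rough idea of what calling this looks like for developers, here's a hedged sketch using the google-genai Python SDK; the experimental model name and config fields are taken from Google's developer docs and may change.

```python
# Hedged sketch of Gemini 2.0 Flash native image output via the google-genai
# Python SDK; the experimental model name and config fields may change.
from io import BytesIO
from PIL import Image
from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set

response = client.models.generate_content(
    model="gemini-2.0-flash-exp-image-generation",  # image-output variant
    contents="Generate a watercolor fox I can keep editing in later turns.",
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# A single response can interleave text parts and image parts
for part in response.candidates[0].content.parts:
    if part.text is not None:
        print(part.text)
    elif part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("fox.png")

# For conversational editing, a follow-up turn would include the returned image
# plus an instruction like "same fox, but add falling snow".
```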
Yeah, and, you know, one of the small handful of things that you tend to look at these days is, okay, how good is this new image generator at text, right? It handles text really well, at least based on the demos that they show, which you never know, but they have it create a detailed vintage 35 millimeter photograph, a front view of a computer monitor.
And then they say, like, have this text displayed on the monitor, and it's about three lines, and it captures every word as far as I can tell. So, you know, that's a failure mode that we've seen before. And it's the combination: once you get into text that you want to have faithfully represented in the image, and you also wanna get into this back and forth editing of the image.
A lot of these things, as you stack them together, that's where you run into a lot of the problems, and at least based on what they're deciding to show us here in the demo, it does look pretty remarkably good. So, as ever, I am wondering what the next step in image generation is. I'm sure there will be some, but for those of us who are just sort of lowly consumers of image generation tech, I think we're pretty close to approaching the saturation point.
I mean, Andrey, you're obviously more plugged in on the gaming side. I'm guessing that, you know, you have specific things that you might look for just 'cause of, you know, generating visual artifacts, avatars, things like that? Yeah, I mean, in our kind of testing we found that if you have a very particular use case, these are general purpose models and they reflect their training data sets.
So they're really good at generating kind of stuff you would find on the web; if you have a very specific set of requirements, typically these models aren't ideal. So being able to be very good at instruction following, down to very minute details, is very important to be able to use them kind of zero-shot for whatever you're doing. So that could be one of the powers or benefits here.
Another interesting aspect, just looking at their blog post, is if you look at the examples they give of the multi-turn conversational image editing, you have to wait upwards of 10 seconds for a response.
Mm-hmm. And I would imagine that's one of the limitations when you are doing this kind of native image output: when you have a multimodal model that does image and text and audio, it can output images for you and have very flexible kind of reasoning and accuracy, but then it is way slower than the sorts of text-to-image generators we see on the market. So I think that's an interesting trade-off we haven't seen necessarily.
Yeah. There's almost a kind of use case issue there where, until compute is so cheap that for most practical purposes you can get instantaneous generation of outputs for multimodal models, there's probably gonna be a need for, you know, specific, high-specificity models that only have one modality or another, and maybe router models that route your queries from whatever modality to whatever modality.
But we're definitely not there yet where we can have, you know, a single Gato-like model that just does everything for everything. Yeah, people are still using plenty of LoRAs out there, that's for sure. Moving on to the lightning round, we have a few smaller stories. First up, one of my favorite topics, apparently, since I keep bringing it up: it's Waymo, and they are yet again expanding.
They're now offering robotaxi rides in a few more cities in the Bay Area, including Mountain View, Palo Alto, Los Altos, and parts of Sunnyvale, which is exciting to me because I work in Los Altos. So now I'll get to potentially use it in my commute sometimes, just because that's fun. So this is part of their rollout. It seems like they are really trying to expand a lot this year. They've expanded in Phoenix to offer their robotaxi services there.
They're trying to expand to LA, and they're also planning to go to Atlanta. So, yeah, it seems like they feel ready to expand. And for me, the main question is, can they do it more rapidly than they have for the past year or so? I mean, they were in San Francisco for quite a while now, maybe two years, and they're now moving to the suburbs south of San Francisco, with some of these smaller cities in their backyard, so to speak.
So still moving a little slow on the expansion front, but it seems like they haven't had any kind of big crashes or anything of that sort as they've expanded, which is promising. Yeah. It's also strategically interesting, right? 'Cause one of the big things that's come out from Waymo recently, of course, and which we covered, is their partnership with Uber in Austin. And that seems to be expanding now to Atlanta, or at least it will later this year.
It does make me think a little bit, like, Uber is at some risk here, right? Because essentially the core platform that they're using for a lot of these rides, where essentially as an Uber customer now, you know, if you're in Austin, you get matched with a Waymo robotaxi, same thing will be true in Atlanta. Yes, you're a marketplace, like that's the value of Uber in this context, right? It's discovery of supply and demand.
But at a certain point, if people get kind of comfortable riding in Waymo cabs, you've got the brand established. If Waymo just comes out with an app and then undercuts Uber, which presumably they may be able to do, if only for a transient period to onboard people, Uber's got some platform risk here. And so it's not a coincidence, right, that Uber had previously, through Uber ATG, made the self-driving piece a priority for them.
They've since ditched that just because it's too capital intensive and they weren't making enough progress. But that was because they saw this eventuality potentially coming, and huge amounts of platform risk on the table. So, yeah, I mean, I don't love Uber's positioning here.
I think, you know, they have great software, but when you're riding off a kind of hardware platform where there's a lot of efficiency to be gained from vertical integration potentially in this space, 'cause the margins are so limited, I wonder what they're thinking and how that ends up playing out. But we'll get some early indications anyway with these rollouts in, you know, Austin, Phoenix and elsewhere. Right.
Exactly. And, to that point, Waymo already has a standalone app you can use, for example, in San Francisco. So they're kind of ready to get rid of Uber whenever they can. I guess the benefit for Uber and so on is just their scale. They're everywhere, obviously, across the globe. So it'll take quite a while for robotaxis to just have enough hardware in the first place to even be able to compete. And the big question as ever is, is Tesla gonna be able to catch up?
'Cause right now it seems like Waymo is the only player in town in the robotaxi business. Next up we have a new video generator. There's a startup called Moon Valley that has released a video generating model they call Marey, with the pitch that it's trained on licensed content only, so not using any copyrighted data. This was done in incubation with Asteria, an AI animation studio, and is seemingly meant for more, let's say, cinematic or media production types of roles. It allows you to customize camera and motion controls, for example, and in-scene movements and things like that. And it allows you to produce high resolution clips up to 30 seconds long, with, again, the pitch being that it is low legal risk. So, it's certainly, you know, been a little quiet on the text-to-video front; we had a big kind of moment with Sora being released a while ago.
We saw Adobe, I believe, I don't remember if it's released already, but they have announced they have a video generation model. So it's continuing kind of a rollout, even as the focus has definitely shifted to reasoning. Yeah, and apparently they're starting off with more kind of openly licensed stuff in this first release; they are apparently working with partners to handle licensing agreements and package videos into data sets that they can then purchase.
Which is a lot like what Adobe is doing, right? So, you know, we see them kind of doing the same thing. They were, I think, the first company, certainly the first company that we covered, that was doing the indemnification guarantee: you know, if you get sued for using our image or video outputs, whatever, we will indemnify you, we will kind of defend you in court if you're using our software as it's intended to be used.
The interesting thing with Moon Valley too is like, I'm, I'm not tracking how much they've raised, but it's certainly, you know, not gonna be a huge amount. It's not gonna be in the orbit of, you know, what OpenAI has on hand. And so when you think about a small company like that, trying to do this stuff through licensing agreements and, you know, purchasing video content from other companies, that's a much taller order.
But strategically, and this is just speculation, there's actually kind of an interesting symbiotic relationship here potentially between companies that put out licensable video content. I imagine if I'm a company that's pumping out videos that might be used for training these models, I might actually want to partner with a company like Moon Valley and give them very, very cheap access,
but still sell them access to the licenses for these videos, if only to set the precedent, so that OpenAI then feels pressure to come in and buy. And once you get the big companies to come to you, then you charge the full amount, if that makes sense. So, like, I don't know the legalities of how that would play out, you know, if there's an issue here with kind of selective pricing with different players like that.
But there's a kind of interesting potential partnership here with up and coming companies, for these content creation platforms to license stuff for cheap, just to get that flywheel going, set the precedent, and then, you know, charge the bigger companies that can afford it. It is sort of interesting, and I'm not saying that's part of this, but it kind of makes me think in that direction when you look at this. Right. And to the point about funding, I just looked it up.
They got a seed round of 70 million back in late 2024; that was when they announced it, at least. So a significant amount, but not a huge amount. And that's another part of the story, I think: it turns out you can get pretty good video models these days for, yeah, not a ton of money, not, you know, hundreds of millions of dollars. We'll get to that also in the open source section. When the cost of compute collapses...
Right, 10x every year, you know, a $70 million raise is effectively a $7 billion raise, at least if you're comparing CapEx to CapEx. Yep. Next up we have Snapchat, and they are introducing AI video lenses that use their own model that is built in-house. So if you are a Snapchat Platinum subscriber, and I'm sure we have many listeners who use Snapchat, you can pay $16 per month to be able to use these, basically, filters. Not quite filters, I guess.
It's kind of like video editing, where they have three AI video lenses currently: raccoon, fox, and spring flowers, which basically take your video and add in a raccoon, or add in a fox, or add in flowers. And there are some sample videos; I dunno, they look fun. I've wanted this feature for so long, I can't, you know, if I have to use another fucking image editing platform that doesn't have an editable fox or raccoon feature, I'm gonna lose it.
Andrey, you know, I didn't know that this is such a highly desired feature. I'm gonna be honest with you, I didn't know raccoons are such a big deal on Snapchat. Apparently they are. Oh. But anyway, I think it's interesting to see Snapchat investing in an in-house generative AI model. And it's a real question mark as to whether this would be an incentive to actually pay $16 per month. But yeah, I dunno much about Snapchat, so I dunno if users are big on video filters of this sort.
Yeah, I've never felt more disconnected from, like, the median consumer of the product, and I'm looking at, like, 16 bucks a month for this, okay. I mean, I can see other uses, and there's other stuff too, obviously, that will come with this, and I'm sure they'll roll it out. It is interesting that they chose to go in-house with this. Maybe not too surprising given the sheer volume of data that they have.
And also the fact that, when you look at Snapchat video, not gonna lie, it's been a while for me, but it does have a certain aspect ratio, and, you know, people tend to frame their shots a certain way. There's a whole culture around the use of the app. And so you might expect that, you know, having a fine-tuned model, or maybe, I guess, even a pre-trained model like this that's done all in-house, can actually make sense there.
So they're presumably also training on other video data, at a minimum open source video data, I would assume. But certainly when you have such a huge amount of data in-house, it kind of causes you to lean in that direction, especially if you've got the funds. And one more story, and this one is one I have a soft spot for, not one that is covered in a lot of media, but I think it's kind of neat.
The headline is Sudowrite launches Muse, an AI model that can generate narrative-driven fiction. So Sudowrite is a platform with the intent to basically, yeah, have an AI assistant for writing, generally fiction and potentially also blog posts. And they've been around for years and years; it was one of the tools that I used going back a few years ago when I was playing around with these things. So they were kind of early on the LLM train, and now they have this new model that they say
is actually capable of producing better, you know, literature, so to speak, can better assist you in writing. And that's kind of one of the things to highlight: Sudowrite is meant to be an assistant, you type, and then it has the ability to suggest structure, characters, and so on. Another slightly interesting idea here, where we know that on the one hand things like ChatGPT can write a whole entire short story that is kind of logical and you can read.
On the other hand, it's generally true that if you just ask an LLM to write you something, it's gonna be pretty generic and pretty blah to read. So I could plausibly see that there is space, with some data, to get to a model that is much better by default at producing good suggestions for writing that aren't, let's say, cliché or just unusable, if you're trying to write something a little more out there than what you see typically.
Yeah. Well, and I don't think we have a story for this specifically 'cause it's just sort of rumors and pre-announcements, but OpenAI sort of came out, I think yesterday as of time of recording, to say, hey, we have this new model that we're working on, it's really good at creative writing. Sam Altman tweeted about it, or Xed about it. It's definitely something that people have thought would be sort of a struggling point for LLMs, right?
It's easy to train them, especially agentic models, but even just general pre-trained LLMs, it's easy to train 'em to do coding and things that you can objectively quantify and evaluate. But, you know, the creative writing stuff is harder. So, you know, maybe we'll see more of a push in this direction.
I'm sort of curious, you know, to compare the Sudowrite model and the upcoming OpenAI model in terms of the performance, but also in terms of, like, what does the training process look like to get more creative outputs? 'Cause it's not obvious to me, at least, how you would, other than just curating your data more carefully, which is sort of the obvious thing here.
Or maybe just putting more weight on it, you know, changing the order in which you train and making sure it's at the very end of your training that you're putting in the highest quality sources. These are all kind of, you know, standard tricks that are used. But yeah, I'm curious to see how this differs both in performance and in training procedure. Yeah, unfortunately they didn't release too many technical details as to what is actually involved here.
I guess the cool story, or what would be neat, is if this came out as a result of the actual platform Sudowrite having such a particular use case, where you have actual people using it and rejecting suggestions or accepting suggestions, rewriting parts of the output, you know, picking from various suggestions. So that is like a gold mine of data for this particular use case that nobody else has.
If that was part of how they did this, and they did also say that they talked to users of the platform, you know, it could be an interesting example of a more niche use case platform that is then able to become a leader for that application. Yeah, it opens up, like, kind of a pseudo process reward or even RL-type stuff you could get into just based on, like you said, the editing history, which would be an interesting thing to get into. Not just the outputs, but yeah.
And onto applications and business. We have a story about OpenAI. They are in a $12 billion agreement, a five-year agreement, with cloud service provider CoreWeave. That is partially investment: OpenAI is getting $350 million in equity in CoreWeave, and that will, you know, presumably play into OpenAI's need for infrastructure. CoreWeave has an AI-specific cloud service with 32 data centers and over 250,000 Nvidia GPUs.
Microsoft is a big user of CoreWeave, actually, and it seems that OpenAI now is also planning to have additional compute providers outside of Microsoft, who is presumably still their number one source of compute. Yeah, this is a really interesting story in a couple different ways. I think we covered last week that CoreWeave is planning an IPO that's coming up. One of the concerns there is that, yeah, Microsoft is the lion's share of CoreWeave's revenue, right?
62%. Given that that's a source of pretty significant risk for CoreWeave, this deal with OpenAI is a probably very refreshing injection of funds and potential for partnership, so kind of diversifying a little bit the portfolio of customers at scale. CoreWeave is, by the way, backed by Nvidia; that's how they've been able to access so many GPUs so soon. They're now in the process of adding Blackwells already, so that's a big deal for compute capacity.
But the other dimension of this, besides just the IPO and how this sort of helps CoreWeave strategically, is that partnership between Microsoft and OpenAI that you referenced. So there is a bit of a deteriorating relationship there, right? I mean, we talked a few weeks back in the context of the big Stargate builds that are happening right now, right? Big partnership between OpenAI and, not Microsoft, it was supposed to be Microsoft, but instead Oracle and Crusoe.
So Crusoe being a big data center company, and then Oracle being the sort of partner to provide a lot of the GPUs. That's a great deal for those companies, but it does mean that OpenAI is kind of breaking off this reliance that it had on Microsoft. The story there seems to be that Microsoft's risk appetite to keep on with the Stargate build has been more limited than OpenAI's. I think that's somewhat overblown.
My understanding is that behind the scenes Microsoft is actually a major funder of the Stargate initiative; it's just not a sort of publicly recognized thing. But still, this is OpenAI kind of finding even more diversity in their supplier and vendor portfolio, if you will, for compute. And that gives 'em a bit more leverage over Microsoft. So you see each company kind of trying to position itself, right?
Microsoft is going off and making their own reasoning models, and OpenAI is going off and finding their own compute partners. And there's this very uneasy equilibrium between Microsoft and OpenAI that kind of seems to be falling apart a little bit, death by a thousand cuts style. Next story.
We are getting into chips being made in China, or possibly not China actually, as we'll get to. The story is that Huawei now has a new chip, the Ascend 910C, which is seemingly entering production and would be the leading AI accelerator product made by a Chinese company. So this is part of their Ascend 910 chip line.
There's various analysis, I guess we don't know exactly, but from one person commenting, Lennart Heim, it seems that this will be sort of along the lines of an H100, so not near Nvidia's current flagship chips, now the B200. It's probably, you know, a fraction, one third of the computational performance, much less memory, and so on. But still, this is a domestically made product for AI acceleration that is moving closer to Nvidia.
At least this is comparable to an A100 GPU, for instance. Yeah, and by the way, Lennart Heim is a great guy to follow if you're interested in anything kind of China chip related, a lot of export control stuff. He has a great tweetstorm that's worth checking out too on this. But yeah, it is a big story, right? So one of the big dimensions of this is that China's been able to get its hands on the
Ascend 910Bs. So just for context, the 910C is the kind of next generation Huawei chip that's just entering production. That's the one that's, depending on how you calculate it, about 80% of the performance of an Nvidia H100. So, you know, the H100, I think, sort of came out or started production, you know, like three-ish years ago. So 80% of a chip that came out three years ago, presumably. I mean, there's a little fog of war here.
A lot of these chips seem to have been sourced illicitly from TSMC. I've seen some differing accounts there, both from the CSIS post and there's a, was it, TechRadar article or something; people are disagreeing over this a little bit. But it definitely seems like there are a lot of illicitly acquired chips, including potentially these 910Bs that were produced presumably by TSMC. So, two of these 910B dies, these are the logic dies.
If you go back to our hardware episode, logic dies actually run the computations; they're distinct from the memory, the high-bandwidth memory that sits on the chip. But two of these 910B dies need to be packaged together to make a 910C. And so apparently they seem to have gotten their hands on about 2 million of these 910Bs, enough to make around a million 910Cs. And essentially around a million H100 equivalents this year seems to be well within reach.
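Just to spell out the back-of-envelope math being described here (the 80% performance ratio is the estimate cited above, not an official figure):

```python
# Illustrative stockpile math: two 910B logic dies are packaged per 910C.
dies_910b = 2_000_000
chips_910c = dies_910b // 2              # ~1,000,000 accelerators

h100_equivalent_ratio = 0.8              # ~80% of an H100, per the estimate above
print(int(chips_910c * h100_equivalent_ratio))  # ~800k, i.e. on the order of a million H100-equivalents
```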
There are all kinds of caveats around die-to-die bandwidth and how the packaging process that Huawei actually has available to it sucks compared to TSMC's. But, you know, this is a lot of stockpiling; China and Huawei have done an amazing job of stockpiling a whole bunch of chips ahead of export controls. This doesn't just include logic dies like the 910s, but also high bandwidth memory.
So HBM2E that they sourced from Samsung; apparently they have enough for about 1.4 million 910C accelerators. So even though these chips are now controlled, they were stockpiled previously. A similar thing, by the way, happened with the actual Nvidia chips. We talked about that back in the day, when Nvidia essentially had new export controls that they knew the US government was gonna bring in to prevent them, for example, from shipping,
at the time, I guess, Hopper chips and Ampere chips, like A100s for instance. And what they do is, they know that the export controls are gonna come in, and so they try to throw as many of these chips into the Chinese market as they can, in fact preferring Chinese customers over American ones for that purpose, because they know they'll still be able to sell to America once the export controls come in. And so this is sort of Huawei and Nvidia, in a way, through incentives, essentially
being pushed into partnering together to jack up Huawei's supply of everything, up to ready-to-go chips. So the stockpiling strategy is something that happens all the time; China's really good at it. It complements the illicit acquisition after the export controls come in as well. And the net result is they end up with a lot of these chips.
You shouldn't think of export controls as being meant to get a perfect block on Huawei or SMIC or any of these companies being able to acquire technology. It's a matter of slowing them down, and it's something that compounds over time to create more of a gap. But anyway, it's really interesting that when you get to a million H100 equivalents in the Chinese market just from Huawei, one thing you gotta think about is the CCP is really good at centralizing their compute.
So if they wanna do a big training run, a national-scale project, they can do that much more easily than the US. So even though we're producing way more chips, ours are spread out across a whole bunch of different hyperscalers and smaller companies. In principle, China can just step in and say, hey, you know, we are commandeering these chips, throwing them together, doing a really big training run. So that's an important distinction.
And why that 1 million H100 equivalents that we could see in just 2025 from Huawei alone is pretty remarkable and important. And we have a follow up to that story, as you mentioned. The other article that came out along with this is that Huawei did reportedly acquire 2 million Ascend 910B AI chips from TSMC last year through shell companies. So a little bit more on that: they acquired seemingly more than 2 million Ascend 910B logic dies. The 910C is
kind of a combination of two 910Bs, from what I could tell. And they did go kind of around; they didn't directly get the product from TSMC. TSMC seemingly caught this happening and halted the shipments after an internal investigation. And so this was kind of an unintended partnership, you could say. And from what I also saw, it seems like a seven nanometer process, so it's not the most advanced technology TSMC has, which is probably not surprising.
You know, other customers are using up all that capacity. But it does demonstrate that, as we know, export controls are in place and we've seen a lot of leakage through the export controls, and this is seemingly a pretty big example. Yeah, absolutely. And the fact, as you alluded to, that it's the seven nanometer process they're using, that's something that variants of, with debatable yields and other properties, SMIC can do domestically.
So SMIC being China's TSMC, which also was founded based on very clearly TSMC-derived IP that was effectively stolen, a pretty cool story, I will say. SMIC, there are some interesting people from TSMC who, well, it's a real story, let's say. Yeah, yeah, yeah. It's sort of a classic Chinese corporate espionage story, right?
Like, they poached some of the most senior figures at the company, had them come over, and I think, I mean, they stood up SMIC and got it to reasonable production on close to the leading node in suspiciously close to record time. I think it might have been 12 months or something outrageous like that. So an unheard of speed of takeoff, which is part of what triggered the whole suspicion from TSMC that this was going on, and lawsuits galore and all that. So yeah.
Yeah, I mean, it's essentially like, China has a pretty solid domestic capacity to produce chips. Huawei is roughly their Nvidia, SMIC is roughly their TSMC, but they're both under the same roof, if you will, under the CCP umbrella, so you could see a potentially more integrated partnership there as they form one kind of Huawei-SMIC complex, a bit tighter than Nvidia-TSMC.
But one important difference is SMIC does not have access to EUV machines, and so you're pretty much bottlenecked at the seven nanometer node. Maybe they can push to five nanometers with multi-patterning, but it seems, you know, like you're gonna run outta steam pretty quickly. And the other interesting thing too is, if you look at TSMC, we've talked about this a lot, but their leading node is all, you know, iPhones.
So that means it's the node up right now, the five nanometer, four nanometer node, that goes off to GPUs. Well, you know, you don't have that with SMIC. SMIC is having to balance their leading node at seven nanometers between, you know, Huawei smartphones, well, other smartphones, and the GPU supply. So there's kind of a lot of interesting stuff going on here. Yields seem to kind of be shit at SMIC; last I checked it's like around 75% or so, which is not economically great.
But when you have, you know, the Chinese government subsidizing you to blazes, you don't necessarily need the same yields that you might need to be economically viable if you were TSMC. So anyway, yeah, interesting situation, and definitely they've gotten their hands on an awful lot of these illicit chips. We've covered a lot of stories like this in the past, and this honestly hasn't come as too much of a surprise to a lot of the people I talk to on the export control side.
But anyway, it's all kind of locked in. And now let's take a short break from talking about chips; we'll get right back to it. But a quick story about investing: we now know that Google has a substantial investment in Anthropic. It seems that Google owns 14% of Anthropic, although that doesn't come with any sort of control mechanisms. They don't have voting rights or board seats or board observer rights,
something that Microsoft has at OpenAI, for example. In total, Google has invested over $3 billion in Anthropic. So kind of an interesting note. It seemed like Anthropic is aligned with Amazon as a primary, let's say, ally, sort of a counter to OpenAI and Microsoft, and Google investing this much in sort of a rival is, at least to me, a bit of a surprise. Yeah, I mean, they're, you know, they're gonna be hedging their bets.
I guess if you're Google, one of the things you think about, you look at SearchGPT, you look at Perplexity, obviously the search space is changing, and sooner or later someone's gonna do something that goes after your search market share in a big way. And because Google owns so much of the search market, like well over 90%, and because the search market represents such a big fraction of Google's overall revenue,
they kind of have no choice but to make sure that they own a piece of the search pie wherever it goes in the future. But still, yeah, I mean, it's an interesting play. One of the consequences of the Google and Amazon investments in Anthropic is Anthropic's increasing reliance on TPUs and Trainium chips, Trainium being what Amazon has and TPUs being what Google has.
And we've covered stories, you know, that involve that and some of the challenges associated with training on that kind of infrastructure. Yeah, I mean, it is interesting. It is also something that we're learning because of the antitrust case that's been brought against Google, looking into basically, you know, are you controlling too much of the market?
One thing that we do know is that Google is now being required to notify antitrust enforcers before investing in any more AI companies. This is based on a revised Justice Department proposal that was filed Friday. So if that holds, this is an interesting requirement to kind of pre-register what they're gonna do. There was an initial proposal, by the way, that would have required Google to fully unwind its investments in companies like Anthropic, but that's apparently no longer on the table.
That would've been a big, a very big deal, right? So as you can imagine, Google is pushing back really hard. There's also a whole bunch of stuff that the DOJ has proposed, including a forced sale of the Chrome web browser, that is still on the table. So the claim here is that the government is, quote, concerned about Google's potential to use its sizable capital to exercise influence in AI companies. So, yeah, no surprise there, but still forcing a little bit more disclosure than normally would have happened.
And kind of interesting to note the percentage ownership and the board structure. As promised, going right back to chips, and now to Meta: we got the story that they are reportedly testing an in-house chip designed for AI training. This is one that was made in collaboration with TSMC and is in the initial testing, small deployment phase.
Meta has used custom chips for inference, but not for training, and we have reported on them doing this kind of development and trying to essentially have something to compete with TPUs. So, seems interesting. Like, I don't know what sort of timeline is safe to project for this kind of project; I would imagine it would take years of engineering, so I don't know if them doing in-house testing is indicative of very rapid progress or wherever they're at.
Yeah, I mean, so one interesting thing is I haven't seen any word of collaboration between Meta and a separate entity that would help them with chip design. So, you know, OpenAI partnered famously with Broadcom, so did Google, right, to make the TPU; we're not seeing any indication of that here with Meta. So it does seem like they are fully going in-house with this. They do seem to think of their inference chips as having been a big success case.
Those are only being used for recommender systems right now. So obviously that's a huge part of Meta's business, right? Having to serve up ads and content. And so the recommender systems at scale have specific requirements that Meta is tracking. Where we are now with this, to speak to your question about timelines: right now Meta has finished what's called their first tape-out of the chip.
This is basically a threshold where you send an initial design through a chip factory, and this is super costly, right? Tens of millions of dollars, well, not super costly for Meta, but anyway, tens of millions of dollars, and it can take three to six months to complete. No guarantee that things will actually succeed at the end of the day. And if there is a failure, which, by the way, has happened before for Meta at this very stage.
They've gotten to this stage before with previous chips meant for the same thing. If that happens, then they have to go back, diagnose the problem, repeat the tape-out step, and it sets them back, presumably, that additional three to six months. And so the in-house custom inference chip that they built previously had flopped just before the sort of small scale test deployment; that's the stage we're at right now.
Meta wants to do this little deployment, kind of see how things work in practice. And the interesting thing is, after the first time that they had this kind of in-house design blow up in their face, they had to figure out an alternative strategy, right? Like, we need to now have the compute to do what we were gonna do with these chips.
And so they were forced to place a multi-billion dollar GPU order with Nvidia, which then, you know, gives them a later start on that as well. So it's this interesting balance between how do we hedge our downside and make sure that we have standing orders with Nvidia, say, but also we want to have independence from Nvidia, so we kind of need to double dip and have an investment in in-house chip design.
So, yeah, we're still seeing, in the case of this particular chip, a recommender system focus, though the plan is eventually they do wanna start using their own chips for training, and that would be, they think, around 2026. And one more story about hardware, this time data centers. It's about xAI, and they have bought a 1-million-square-foot site for a second Memphis data center. So this is an $80 million acquisition.
They are seemingly aiming for this new data center to be able to support up to 350,000 GPUs, up from the hundred thousand, or is it 200,000, I've lost track, in their existing Memphis facility. Yeah, the initial rollout was a hundred thousand, and now they have a plan to double it, so your hundred or 200,000 is exactly right; it depends on when. Yeah.
The interesting thing with this new facility that they're standing up is it's next to the Southaven Combined Cycle natural gas power plant, and that generates about 780 megawatts of power. So then there's the question of, okay, sure, but how much of that power is already in use? For context, when you think about one megawatt of power, order of magnitude, that gets you to about a thousand GPUs.
So one GPU is around, a little over, a kilowatt of consumption, which is about the power consumption of one home, right, an average home in America. So if you look at 780 megawatts of power, if you had that all available, which of course you do not, because there's, you know, industry and there are houses using this and all that, but if you just poured that into GPUs,
you could already kind of get to the several hundred thousand GPUs that are powered by that, which is exactly where you're getting that, you know, 350,000, presumably Blackwells, that they'd be looking at there. The other thing that we know is that apparently Memphis Light, Gas and Water, which is the local energy company, said that xAI has requested a system impact study for up to 260 megawatts of power.
So presumably that means that they only expect, at least in the near term, to use 260 megawatts. If that's the case, then you're looking more at maybe 200,000 GPUs, depending on, yeah, power usage effectiveness and a whole bunch of other factors that go into your data center. But generally speaking, this is another big build. The data center apparently will be home to what they claim will be the world's largest deployment of Tesla Megapack batteries as well.
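A quick back-of-envelope version of the math being discussed here; the per-GPU draw and PUE figures below are illustrative assumptions, not xAI's actual numbers.

```python
# Illustrative power-to-GPU arithmetic; kw_per_gpu and pue are assumed values.
def gpus_supported(power_mw: float, kw_per_gpu: float = 1.2, pue: float = 1.3) -> int:
    """How many GPUs a given grid allocation can feed after facility overhead."""
    usable_kw = power_mw * 1000 / pue    # cooling and power-conversion overhead
    return int(usable_kw / kw_per_gpu)

print(gpus_supported(780))   # the full plant output: several hundred thousand GPUs
print(gpus_supported(260))   # the requested 260 MW: roughly 150-200k GPUs
```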
Which matters because, when you're drawing power, sometimes you get power spikes. This is actually a big problem with Nvidia GPUs that they're trying to sort out right now. But essentially, when you start the training process, you get a massive burst of power consumption, which can be on the order of 30% or so. That's a real problem because your power infrastructure may not actually be able to handle that. For that reason,
you often want batteries that can be connected to your data center to deal with that spike in power demand, and that's where the Tesla Megapacks become important, right? It's a source of kind of capacitance. It's also there for when, you know, the grid load is just too high or you just need to inject more power for whatever reason. So yeah, really big build; xAI continuing to be at the forefront of this stuff, right?
Like, we haven't heard a big Stargate announcement from them, but when you look at the numbers, they're actually up there with OpenAI right now. And onto our last story. It's not a business section if you don't have a hundred-million-dollar startup, as ever in AI. So we've got a new one, and yet again it is founded by ex-DeepMind researchers. They are launching Reflection AI. They are two former Google DeepMind researchers, and they have $130 million in early stage funding. Not too many
details that I was able to find here. The co-founders are Misha Laskin and Ioannis Antonoglou; they worked on Gemini training systems, and now they're aiming to develop superintelligence, much like SSI and others in that space. Seemingly they're starting by building an autonomous programming tool. It's $130 million in funding, not a ton of money.
It sort of makes me think of Thinking Machines, you know, that Mira Murati startup. You're seeing a lot of superintelligence companies that aren't raising the giant amounts that the scaling laws suggest, at least nominally, you would need to compete. But, you know, these are smart people, so we'll see. By the way, the cap table is pretty wild, right? So the funding round for their seed was $25 million. This is now all, by the way, being announced at the same time. So they're coming out and saying, hey,
it's not just one fundraise at 130 million; we actually had raised a seed at 25 million, and we raised a $105 million Series A after that. But the seed round was led by Sequoia, so basically the best VC on planet Earth. Then the Series A was led by Lightspeed Venture Partners, so a really, really solid VC. And then there are other investors: Reid Hoffman, right, LinkedIn co-founder; Alex Wang from Scale AI, that's really interesting,
'cause he's been historically a kind of AI safety focused guy, including sort of loss of control, alignment stuff. And then there's SV Angel, and importantly, the VC arm of Nvidia. So that's, you know, a really, really big partner to have on hand; we saw how it moved the needle for CoreWeave in terms of getting allocation for GPUs. So that's kind of interesting. Latest valuation, half a billion dollars, not half bad. I would take that. Anyway, we'll see.
They have paying customers, by the way, so that's at least different from Ilya's Safe Superintelligence. So there's that. But as you said, it's really unclear what exactly they're doing, just working with, apparently, fields that have large coding teams, such as in the financial services and technology sector. So very specific, right?
Yeah. In their blog post they mention, for instance, imagining autonomous coding agents working tirelessly in the background on workloads that would otherwise slow teams down. So even though they say they're working on superintelligence, it does seem like in practice they're producing a product that is more of a tool to be used, similar, for instance, to Claude Code from Anthropic that was just released. Coding seems like increasingly the new frontier for letting an agent do its thing.
So, yeah, could even see this coming out as a product relatively soon. And onto projects and open source. We have a couple exciting new models, starting with Gemma 3 from Google. So Gemma is kind of a little sibling of Gemini, and this is a multimodal addition to that family. They have this at scales from one to 27 billion parameters. The new thing here is vision understanding capabilities; they also cover more languages, 35 languages, and a longer context,
128,000 tokens, and some architectural changes to be able to leverage that context effectively. They mention that this was done via distillation, and these are, as you might expect, much better than Gemma 2 for both pre-training and fine-tuning. So Gemma, you know, is another one of these small-to-medium scale models; increasingly you are seeing these 27, 12 billion parameter models and you see them being pretty performant.
So Gemma 3 27B, they say, has an ELO score on Chatbot Arena that is higher than DeepSeek V3, for instance, and also higher than Llama 3 405B. Actually, they have a nice handy illustration on that Chatbot Arena ELO score chart where they show the estimated GPUs that you need to run each of these models. And I think that's actually a really important dimension that Google is articulating better than I've seen it articulated before.
They're referring to this as the world's best single accelerator model, in other words, single GPU model. I think we'd previously been talking about this in terms of model size cutoffs, basically, like, how small does a large language model have to be before it's not called large anymore.
Anyway, there's been this debate, right, that a model is not truly open source if it requires so much hardware to run that you might as well be a big company to run it, right? So if you open source, as, you know, DeepSeek did with R1, a model that requires dozens of GPUs to run, like, yeah, you've open sourced it, the code's out there, that's great, but this model has, you know, 671 billion parameters; you just can't fit that on one GPU.
And so is it really there for the people? You know, can the people really use this meaningfully? And the answer is kind of, I mean, you can have companies that run their own instance of it, and that is a kind of open sourcing, but everything's along a spectrum. The big flashy point here is one GPU to rule them all, one GPU to run this entire model. I will say there is kind of an important distinction when we talk about what is impressive and what's not impressive in this space.
So you might look at, you know, DeepSeek R1 and go, oh my god, that's almost 700 billion parameters; compare that to Gemma 3 at 27 billion parameters, a fraction of that, so this must mean that Google just knows what they're doing a lot better. The answer is that these are different use cases fundamentally. You know, so DeepSeek R1, when you actually push a token through it and do inference, you're only using about 37 billion odd parameters in the model.
So not all parameters get activated; it is a mixture-of-experts model. MoEs tend to have far more parameters relative to their performance; that's just how they're set up. So you go with an MoE typically when you care more about the performance than the infrastructure costs of hosting the model, whereas you might go with a smaller, kinda monolithic model if you wanna just compress it, have it running on the edge or something, with as high performance as you can.
So that's part of the trade-off here; these are just different things to care about. But certainly if you care about, like, I wanna be able to run this locally on my own hardware, yeah, this is a big step forward. And again, I like this line, the world's best single accelerator model. It's better than some of the previous statements we've heard, like the world's best 7 billion parameter model or, you know, 22 billion parameter model, 'cause those just seem too specific.
This ties it to something that we actually care about. And that's a moving target too, right? Accelerators are improving all the time, but it still feels more concrete and useful. Right, and as with other releases of this flavor, you can get the weights and the code to run it on Hugging Face, for instance. And it has a proprietary license that allows, quote, responsible commercial use. So there are a bunch of things you're not allowed to do; let's say, broadly, bad things.
A lot of restrictions, unlike something like Grok 2. So yeah, there are so many of these good smaller models these days, and I think that really showcases the power of distillation when you train a giant model like Gemini 2. Next story: we have a model from Sesame, which we covered pretty recently.
Their virtual assistant Maya is this pretty impressive conversational AI that you can talk to and have the sort of naturalistic interaction we'd seen demoed from, let's say, OpenAI, for example. They have now released the base model, CSM-1B, that powers that personal assistant Maya. So it's 1 billion parameters in size, available under the Apache 2.0 license, meaning you can use it for commercial applications; there are relatively few restrictions here.
And they have, of course, a fine-tuned version of CSM-1B for Maya, so this isn't the exact same model they used for that demo they launched a couple of weeks ago, but you can use this as a very powerful starting model to do, for instance, voice cloning with relatively little data, and it's capable of producing really pretty naturalistic speech. So this is an area where you don't have too many models capable of producing really good outputs, audio generation in general.
There's less open source stuff there than with LLMs, and this is certainly a pretty exciting new model for that application. It's also, you know, not a monolithic model but a model that comes in parts: you have a base Llama model, which is kind of the backbone, and then it's paired with a decoder component; that's what they fine-tune to build Maya, anyway. So that is kind of interesting. They describe the training data as
not being revealed by Sesame, so we don't actually know what went into it. They say it's capable of producing a variety of voices but has not been fine-tuned on any specific voice. And they say the model also has some capacity for non-English languages due to data contamination in the training data, but likely won't do well there. They are also not including any real safeguards.
They have an honor system where they just urge developers and users not to use the model to mimic a person's voice without consent. So, cool, that's good stuff. The journalist, I guess, who wrote this said he tried the demo, and apparently cloning your voice on the system takes less than a minute, and from there it was easy to generate speech to his heart's desire, including on controversial topics like the election. So that's fun.
Sesame, by the way, is well-funded, or at least, I shouldn't say well-funded, but their cap table looks really good: they've got Andreessen Horowitz, Spark Capital, so really top-line VCs. And yeah, interesting, as you say. I mean, it is a differentiated space; we'll see how long it remains differentiated. If we end up with, like, a world of multimodal models that eat everything, then maybe they get gobbled up by OpenAI.
But certainly for now we haven't seen that many of these models. And just a couple more stories. Jumping back to the smaller large language model side, we have a new model from Reka AI, Reka Flash 3, which they say is a general-purpose reasoning model trained from scratch using a combination of public and synthetic datasets. In their comparison it's able to go head-to-head with o1-mini and QwQ-32B, and this is a 21 billion parameter model. You know, overall not super strong.
It has a relatively short context length of 32,000 tokens. It also has a budget forcing mechanism, which allows you to limit the model's thinking, and similar to Gemma 3 it's possible to run on a single device. So not, you know, the lead of the pack on the open source front, but another useful smaller model for other people to build upon. Yeah, as ever with these models, especially for the smaller developers, you just can't afford to keep up with frontier capabilities.
It's all about what your comparison points are, right, who you choose as your opponent. And so in all their kind of headline figures they are comparing mostly to QwQ-32B, right? So making the argument that, hey, on almost all of these benchmarks it's actually behind QwQ-32B, but it is a smaller model. So I guess the case is, hey, we're almost as good as a 32 billion parameter model with a 21 billion parameter model.
Not obvious to me how much that buys in terms of impressiveness. One thing I will note: I would call this model on par with o1-mini, which dropped, I mean, when was that, late last year. So when you look at how fast open source has come, it's like four or five months behind. Granted, o1-mini is something that OpenAI had been sitting on for a while, right?
And it's also, it was their mini; it didn't represent the true cutting edge of what they could do at the time. But still, you know, open source being six months, seven months behind, something like that, that is a closing of the gap. And we've seen that trend hold for some time now. So not just Chinese open source models, but now western open source models in all forms. So there you go, the rising tide of reasoning models, right?
And to the point of reasoning models, they do show the indicators you would hope for from a reasoning model: as you use more test-time compute, you get better accuracy on tough benchmarks like AIME. They have that here; if you go up to, you know, 20,000 tokens, you're able to do substantially better than with fewer tokens. So I guess it's significant in the sense that there aren't too many reasoning models yet.
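Since we're talking about capping thinking tokens, here's a minimal, hypothetical sketch of what a budget forcing mechanism like the one Reka describes might look like; the generate function and the <think> delimiters are placeholders standing in for whatever inference API you actually use, not Reka's real interface:

# Minimal sketch of "budget forcing": cap how many thinking tokens a reasoning
# model may spend, then force it to commit to an answer. The `generate` function
# and the "</think>" delimiter are hypothetical stand-ins, not Reka's actual API.

def generate(prompt: str, max_tokens: int, stop: list[str]) -> str:
    """Stand-in for a real inference call; returns a canned string so the sketch runs."""
    return "...model output..."

def answer_with_budget(question: str, thinking_budget: int) -> str:
    # Let the model think, but only up to `thinking_budget` tokens.
    thoughts = generate(
        prompt=f"{question}\n<think>",
        max_tokens=thinking_budget,
        stop=["</think>"],
    )
    # Whether it finished thinking or hit the cap, close the thinking block
    # and force a final answer with a small fixed budget.
    return generate(
        prompt=f"{question}\n<think>{thoughts}</think>\nFinal answer:",
        max_tokens=256,
        stop=["\n\n"],
    )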
Of course, now we have R1, which is certainly a pretty big one, but as a smaller reasoning model this is still significant. And one more story on the open front: we actually have a paper we're not going to do a super deep dive on, but still worth noting. The report is Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k. So this report shows how they trained a pretty performant model, not up to the state of the art, but relatively close, and better than
other models like CogVideo and HunyuanVideo. There's essentially a whole large bag of tricks they go through: data curation, training strategies, AI infrastructure, lots and lots of details that allow you to train this model for what is, relatively speaking, a cheap $200,000. So I think it's interesting, or exciting, to see these in-depth reports that really showcase the nitty gritty of how to train this kind of model.
Yeah, and it really is always nitty gritty when you look under the hood, right? The age of just coming up with a new architecture and, oh, it's beautiful, arguably never really existed, and now it really, really doesn't exist. Every time we do get an open source release, we get to look under the hood, and it's all these, you know, optimizer tweaks and batching and finding ways to get your accelerators to work
efficiently and overlap communication and computation. All these things, the engineering is the outcome, over and over again. So yeah, it's another example of that, now with Sora-type models. And you can see, on a $200K budget, that's the training cost, and if you look at the benchmarks, it's a pretty solid set of win rates relative to other models that are comparable.
And yeah, what this means for people's ability to not just train but also fine-tune their own video models, and eventually run inference super cheaply, that's pretty significant. And on to research and advancements. We begin with an announcement from DeepMind: they are announcing what they're calling Gemini Robotics, models that are optimized for robotic control in a general-purpose way.
So these are, I guess, built on top of Gemini 2.0 and are meant to really focus on the reasoning side of things, like physics prediction, 3D space, all that sort of stuff. They published a very meaty technical report, lots of details, a lot of focus on the perception side, on the various kinds of general capabilities that are very useful in the context of embodiment.
With respect to the general-purpose nature of the robotic control, they collected a bunch of data over the last 12 months with these ALOHA 2 robots, which are like two arms, and they then are able to give the model an image, give it an instruction in text, and the model plans and outputs code, and the code is then executed via, I'm not even sure what, so it's not quite the same as what I think Figure or 1X, one of those two, announced with a dedicated model for control.
This is kind of a planning model plus an execution model that is, as far as I can tell, not real time. But either way, a really deep investment in the robotics space. They compare to, among other things, their re-implementation of Pi-0, where Pi-0 was also meant to be this general-purpose robotics foundation model.
And they show that this is capable of doing a whole bunch of manipulation tasks without fine-tuning necessarily, where you can just give it objects and instructions and it is able to pull things off. Yeah, and Gemini Robotics itself, you alluded to it, has a two-component architecture. So they have the VLA backbone, which is a distilled version of Gemini Robotics-ER, right?
So they have this model, Gemini Robotics-ER, whose job is essentially to understand the world and reason about spatial information and things like that. But then the action model itself is a tacked-on, sort of local action decoder. The VLA backbone, the distilled version of Gemini Robotics-ER, runs on the cloud, and then the thing that's local to the robot is the action decoder.
So it's a smaller model on the onboard computer, and therefore low latency, so it is optimized for real-time control. Apparently query-to-response latency has been reduced from seconds in previous iterations to under 160 milliseconds for the backbone, and end-to-end latency, if you include the action decoder, is closer to a quarter of a second, around 250 milliseconds.
So, you know, that's pretty quick, getting into the domain where you can see fluid interactions with these systems, and importantly, 250 milliseconds means you're able to respond to what you see and touch in a relatively reasonable time period. I actually don't remember how long it takes humans to respond to stimuli in the environment; I wouldn't be surprised if it was in that ballpark. But this is, by the way, a supervised fine-tuned model.
So you're looking at the collection of a huge amount of data, thousands of hours of expert demonstrations that they say they collected over 12 months, a whole bunch of diverse tasks, and non-action data as well. So that itself is interesting, though not surprising, because you don't usually expect RL to be used in this context, since RL is just,
you know, super sample-inefficient, and also, for safety reasons, unless you've got really, really high-fidelity sim-to-real transfer, you're going to be doing some of this stuff in the real world, and if you are, there are safety issues with just using RL. But in any case, they say they collected a bunch of teleoperated robot action data on their ALOHA 2 robots, and that's what was used for supervised fine-tuning. And the result is, I mean, pretty impressive.
It kind of feels like it follows in that tradition of Gato, the sort of truly multimodal models that can do anything, including control robots, including understand video, just trying to stuff it all in there to get to true AGI. It is a very Google DeepMind paper in that sense. Right, and we have some fun examples of things it can do: it can make an origami fox, pack a lunchbox. And to give you an idea of ALOHA 2, it's
little two-gripper arms with some relatively less expensive hardware, as far as I can tell, but still pretty capable in terms of what they're able to pull off. And yeah, I guess I should correct myself a little bit: this isn't an end-to-end image-to-control model, as far as I can tell, because of the intermediate code output, but for the execution of individual code commands I guess they are using that Gemini Robotics vision-language-
action model to execute things like grasping objects or moving to a particular pose, which the planning stage has output. So, I don't believe this is released yet, I need to double check, but anyway, we saw this with Pi-0; this is pretty much comparable to that in claiming to be a robot foundation model. And I think also Figure and 1X are working on similar things, a general-purpose, capable robotics model.
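As a rough mental model of the split being described, a cloud-hosted backbone that plans and a small on-robot decoder that acts, here's a schematic sketch; all of the function names, latencies, and data shapes are illustrative assumptions, not DeepMind's actual interfaces:

# Schematic sketch of the cloud-backbone / local-decoder split described above.
# Everything here is an illustrative placeholder, not DeepMind's real API.
import time

def cloud_backbone(image, instruction: str):
    """Stand-in for the distilled Gemini Robotics-ER backbone (slower, cloud-hosted)."""
    return {"target_pose": [0.4, 0.1, 0.2], "gripper": "close"}

def local_action_decoder(plan, joint_state):
    """Stand-in for the on-robot decoder that emits low-level commands quickly."""
    return {"joint_deltas": [0.01] * 7, "gripper": plan["gripper"]}

def control_loop(get_image, get_joints, send_command, instruction, hz=10, replan_every=5):
    plan, tick = None, 0
    while True:
        image, joints = get_image(), get_joints()
        # Re-plan occasionally via the cloud backbone; act locally every tick.
        if plan is None or tick % replan_every == 0:
            plan = cloud_backbone(image, instruction)
        send_command(local_action_decoder(plan, joints))
        tick += 1
        time.sleep(1.0 / hz)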
So it really does seem like there's a pretty significant amount of progress in the space, and it is pretty conceivable that we could have broadly capable embodied AI agents in the coming years, which has pretty significant implications for, let's say, economic impacts and so on. For sure. I mean, I haven't read it yet,
but there's a post where they're talking about the implications for China of a lot of the breakthroughs in robotics, and this sort of makes me think of that, right? To the extent that we're building really, really good software, you know, controlling robots becomes a software problem pretty quickly. Then if China replicates that, which they will, your ability to just manufacture robots at scale is one key determinant of national power.
And China is just mopping the floor with us on that. So that'll be an interesting thread to follow for sure. Yeah, and there's even more bundled with this announcement. They also, as a fun side detail, released the ASIMOV benchmark; there's another paper called Generating Robot Constitutions and Benchmarks for Semantic Safety.
They say that this ASIMOV benchmark is a comprehensive collection of datasets for evaluating and improving semantic safety of foundation models serving as robot brains. So that's pretty fun, I guess; they are focusing on the safety side to not have Skynet happen, I suppose. They also have another dataset they release as part of this, I think it's ERQA, a visual question-answering dataset with a focus on embodied reasoning.
And so a lot of variation and means of benchmarking for embodied applications in particular. Next we move away from robotics, back to the favorite topic of recent months, test-time compute and reasoning, with the paper Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning. So the focus here is not on enabling test-time compute and reasoning; it's on making test-time compute usage efficient. That has been kind of a known problem.
And we covered a paper on this, this idea of overthinking, where you use more test-time compute than necessary. They have an interesting figure in the paper, actually, showing that in many cases test-time compute scaling doesn't actually outperform just doing a majority vote, that is, generating many outputs and then choosing the answer that most of the outputs agree on.
So in fact, test-time compute is maybe not as efficient as just doing inference with shorter compute a bunch of times. In any case, they are introducing a method to optimize test-time compute efficiency via meta reinforcement fine-tuning. So reinforcement learning is learning to optimize your reward for a given task; meta reinforcement learning, or fine-tuning, is being able to adjust quickly to a given task.
So it is kind of a meta layer, right, where you're learning to achieve good rewards efficiently without much feedback. And they do that by providing a dense reward during the reasoning process, so somewhat similar to process reward models as well. They chunk the test-time compute, the reasoning steps, into episodes, and they provide a reward for each episode that indicates how much progress it represents towards solving the problem.
And then they are able to train the model to pretty much minimize regret, to optimize for being able to make rapid progress. And they say they're able to show a two to three times improvement in performance for math reasoning compared to just outcome-based reward reinforcement learning, where you're getting only zero-one rewards at the end. So that's pretty significant, along with about a 1.5x gain in token efficiency.
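As a quick aside, the majority-vote baseline mentioned a moment ago (sometimes called self-consistency) is simple enough to sketch in a few lines; the sampler here is a toy stand-in, not any particular model:

# Minimal sketch of the majority-vote (self-consistency) baseline the figure
# compares against: sample several short answers and take the most common one,
# rather than one long reasoning trace.
from collections import Counter
import random

def majority_vote(sample_answer, question: str, n_samples: int = 16) -> str:
    """`sample_answer` is any callable that returns one final answer string."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Example with a toy sampler (hypothetical; stands in for a real model call):
toy = lambda q: random.choice(["42", "42", "41"])  # noisy but usually right
print(majority_vote(toy, "What is 6 * 7?"))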
Yeah, it sort of reflects this classic exploration-exploitation trade-off that you see in reinforcement learning. I would argue you see it basically anywhere, but reinforcement learning is the place where it's most obvious in its mathematical formulation. Essentially, at any given step, if you are a language model and you're trying to reason through some trajectory to solve a problem, you can kind of choose: do I just generate
an output right now, which is very compute efficient, like, I'm not going to spend a lot of flops, I'm just going to do a quick inference and the job is done? Or do I spend more time doing discovery, doing exploration, testing out different possible solutions? And certainly that exploration bit is important.
We know that because when we look at models like R1-Zero, which are just trained by reinforcement learning to reason as much as they want to get to their solutions, what you see is the reasoning traces get longer and longer, and those traces correlate directly with higher and higher performance. So there certainly is value to longer reasoning traces that allow for more discovery. But the question is: is all of that discovery actually worth it?
Like, at a certain point are you just mulling over the unknowable? Are you beating a dead horse, or whatever, quite literally overthinking? Yeah, exactly, quite literally overthinking. Not necessarily because it'll give you a worse result, though potentially that's a thing, but also just because it's wasteful, right? That's the big thing.
And when you think about the human brain, the way our brains evolved, yes, there is evolutionary pressure to make good predictions so that we make smart calls, but there's also a huge, huge pressure to be compute efficient, or energy efficient, right? The human brain runs on, I forget how many watts, but it's a shockingly small amount of energy. And that's often a distinction between AI systems and human brains; they just face different constraints.
And so this is an effort to say: can we measure, as you said, the progress of each chunk of reasoning in a long reasoning thread, a long chain of thought? How much does each chunk contribute to the accuracy of our final answer? And it might seem weird, like, how do you even measure that? In practice there are a couple of different ways.
The most straightforward is that they look at, okay, after this chunk of reasoning, what do we get? And by the way, in reinforcement learning terminology they refer to these chunks of reasoning within the chain of thought as episodes; an episode in RL parlance is like one play of a game or something.
With these episodes, essentially chunks of reasoning, what they're going to do is, after episode number, say, five, instead of letting the model move on to episode six, in other words try another chunk of reasoning, they'll sometimes just force it to give an answer directly from, say, episode number five, and they'll do that like 20 times and get 20 different answers. Let's say that 20% of those answers that jump straight from episode five are correct.
Then you let the model go on to episode six and repeat the process, having it generate 20 answers straight from episode six. Let's say of those 20 answers you get, I don't know, 60% correct. Well, now it's like, damn, episode six really made a difference, so it's worth it to keep reasoning up to that point. But maybe at episode seven, once you repeat it there, it plateaus.
That kind of gives you a sense that, hey, we're not actually making more progress here as we add more of these episodes, as we do more of this chain-of-thought stuff, and so maybe it's worth cutting our losses, right? So this is roughly how they instantiate it in practice. They're going to train two different models to take advantage of that information. One of them they'll train through supervised fine-tuning.
This one involves basically just generating a whole bunch of reasoning traces for a whole bunch of problems, segmenting those reasoning traces into episodes, these logical thought segments, evaluating for each episode the progress it makes by forcing the model to give its best answer at that point, and then ultimately filtering for traces that achieve maximum progress, where they make steady improvements.
Each episode builds considerably on the previous one, and they eventually reach the correct answer, and they just keep those reasoning traces. What this means, if you think about it, is that this is filtering for cases where you're adding more and more reasoning steps, more and more episodes, and each one corresponds to a significant bit of forward progress, right?
Your ability to kind of one-shot the right answer after that stage keeps going up and up consistently, which is not always the case. Sometimes you'll get an episode that is incoherent or hallucinates something or whatever, and it'll actually make your outputs from there less likely to be correct. So they're filtering those out and getting this really
pristine dataset with correct answers, but also correct, or let's say constructive, progress in between. And then they just fine-tune a model on those reasoning traces; that's the supervised fine-tuning version. They also do a reinforcement learning version where at each of these episodes they do not just a one-shot output but partial rollouts.
So, multiple continuations: some of those continuations will terminate reasoning immediately and produce a final answer, others will keep going. And in either case, what they do is issue both process rewards, based on that same metric of how much progress you're making, and also final outcome rewards for correctness. So anyway, these are two different approaches.
Fundamentally, I would argue the main insight here is this idea of: can we chunk things up into an episodic format that more closely resembles the traditional reinforcement learning paradigm, and then find a way to evaluate progress towards an outcome. So I think it's an interesting paper.
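To make the per-episode progress signal described above a bit more concrete, here's a schematic sketch; this is our paraphrase of the idea, not the paper's code, and the force_answer and correct callables are placeholders for a real model call and a real grader:

# Sketch of the dense per-episode progress signal: after each chunk ("episode")
# of the chain of thought, force the model to answer several times and measure
# how often it is right; the jump in that success rate is the reward credited
# to the episode. Placeholders only, not the paper's implementation.
from typing import Callable, List

def success_rate(prefix: str, force_answer: Callable[[str], str],
                 correct: Callable[[str], bool], n_samples: int) -> float:
    hits = sum(correct(force_answer(prefix)) for _ in range(n_samples))
    return hits / n_samples

def episode_progress_rewards(
    episodes: List[str],                  # chain of thought split into chunks
    force_answer: Callable[[str], str],   # prompt prefix -> one forced answer
    correct: Callable[[str], bool],       # checks an answer against ground truth
    n_samples: int = 20,
) -> List[float]:
    rewards = []
    prefix = ""
    prev_success = success_rate(prefix, force_answer, correct, n_samples)
    for ep in episodes:
        prefix += ep
        succ = success_rate(prefix, force_answer, correct, n_samples)
        rewards.append(succ - prev_success)   # progress made by this episode
        prev_success = succ
    return rewards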
Yeah, and they do compare to other techniques like GRPO, which was used for DeepSeek R1, and show that not only do you get better performance, by, let's say, not a huge amount, but a consistent improvement, you also get that improvement in efficiency, as you'd want from this technique. Okay, moving on to reviewing a couple of things we didn't get to last week. We have some system cards. First up is the deep research system card from OpenAI.
They introduced deep research as this multi-step technique where you can ask ChatGPT to answer some query, to do some research for you, and it will go off and do a bunch of searching, do a bunch of analysis, and compile essentially a report. So they did conduct rigorous safety training, preparedness evaluations, and governance review.
And the focus of this new evaluation is in particular on privacy protections around stuff that's published online, and on training the model to resist malicious instructions that it may come across while searching the internet. So basically, now that you're letting an LLM go off and browse the web on its own, you probably have new kinds of risks and ways the model might be exploited or jailbroken; as we've seen, there have been papers on jailbreaking via website manipulation.
They go into that kind of stuff, and also their usual preparedness framework in the system card. Yeah, the preparedness framework stuff is kind of interesting. Often when these models come out there isn't really a move relative to the frontier on the preparedness framework, but this time we have that, so for context:
OpenAI's preparedness framework basically says, look, we have, I think it's four now, different stages of risk, or classifications of risk, for these models, for each of the standard threat domains: cyber; CBRN, so chem, bio, radiological, and nuclear capabilities; and then autonomy. So they have low, medium, and high risk classifications. Once you cross over from medium into high, they say they will not release the model in that form.
They're going to do mitigations until the model's capabilities, or the risk associated with that model, get back down to medium level. Interestingly, deep research, sorry, not DeepSeek, deep research, is the first model that's been classified as medium in cybersecurity, which is significant. They've seen big improvements on capture-the-flag challenges.
So these are cases where you want your model to solve a problem that involves getting some sequence of numbers or characters out of some container or some sort of software environment, which involves cracking into it. Apparently it solved 82% of high-school-level, 55% of collegiate-level, and 47% of professional-level capture-the-flag challenges, without browsing, by the way, without internet access.
So, pretty impressive cyber offense capabilities here, even if these don't involve, you know, finding new zero-days or whatever. The reality is that this massively increases what you can do just from a cheapness standpoint: you can increase the number of people who can effectively carry out cyber attacks if they can outsource it to something like deep research.
And then you can also significantly increase the amount of work that one cyber attacker can do thanks to these tools. We're also seeing a medium risk classification on the CBRN side; again, that's chem, bio, radiological, and nuclear risk. They say that deep research can help experts with operational planning for reproducing known biological threats, so that's a needle mover.
And then they also flag that several of their bio evals are indicating that these models are, quote, on the cusp of being able to meaningfully help novices create known biological threats, and that would cross their high risk threshold. That's a pretty wild threshold if you think about it, right? Being able to meaningfully help novices create known biological threats, like, not new ones, but you're a novice.
Like, if you don't have any experience and you want to know how to, I don't know, perform an anthrax attack or something, I mean, presumably this is what this is getting at, right? So anyway, they also say, and here's a quote, that if current trends of rapidly increasing capability continue, they expect models to cross this threshold, the high risk threshold, in the near future. Similar results, by the way, in autonomy, which I found particularly interesting.
OpenAI's big play, and really, right now this is everybody's play, it's pretty obvious: you make models that are increasingly good at tasks associated with AI research so you can automate AI research itself. That leads to the risk of models figuring out how to self-improve, and loss of control. This is something they look at, or a subset of what they look at, in their autonomy evals, which they do, by the way, in collaboration
with METR, which is one of the big AI evals organizations. They say medium risk here for autonomous capabilities, flagging self-improvement potential: improved performance on SWE-bench, the software engineering benchmark (SWE-bench Verified is the version of it that OpenAI has cleaned up), and greater potential for self-improvement and AI research acceleration. This is essentially a coded way of saying recursive self-improvement at the lab level.
So, you know, helping one AI researcher do the work of more than one AI researcher. And to the extent that happens, it means you can accelerate the speed at which you build the next generation of model; that generation of model is then presumably even better at accelerating your research within the lab, and eventually that continues ad infinitum and, boom, you get a singularity.
That is kind of the explicit target of OpenAI, of all the frontier labs, with various commitments to safety along the way. But that's kind of the framework here. So it's pretty remarkable that we're at the point where we're sort of flirting with that medium-to-high risk threshold. I think if and when you cross that threshold, you're going to see some really, really big changes in the national security situation.
I think it starts to become impossible to deny that loss of control is a risk, that recursive self-improvement is a risk. And they do say they're working on mitigations that include, quote, threat model development for self-exfiltration and self-improvement risks in preparation for agentic models with heightened capabilities. Like, this is being invested in right now at OpenAI. Pretty remarkable time to be alive,
2025. And next up we actually have a second system card, this time about Claude 3.7 Sonnet, also released a little while ago, but worth covering in relation to another one we will get to in just a sec. So the Claude 3.7 Sonnet system card covers, you know, Claude 3.7, both as just an LLM and, in particular, as a reasoning model. It was released in February, I don't know exactly, maybe a week or two ago.
And as with the usual reports we get from Anthropic, it's very detailed, dozens of pages to go through with various evaluations and so on. I think the interesting thing to highlight for me from this model card is the new things introduced by the thinking aspect of Claude 3.7. So, they do discuss how the chain of thought, the reasoning process, can be a new tool for alignment, where you can look through it and basically get hints or be able to verify alignment.
They also evaluate concerning thought processes and the idea of chain-of-thought faithfulness. So it may be the case that the chain of thought is not enough as a signal to monitor, to be able to tell if an LLM is doing something wrong, and we've seen some other research along these fronts as well. They do some preliminary investigation of chain-of-thought faithfulness and show that there is some amount of possible misalignment.
So yeah, lots to cover in this one, and I'm sure Anthropic will have follow-ups that go deeper on the particular ideas around chain-of-thought outputs. I'm also tracking that we have like 10 minutes left for the episode; man, we are good at filling the time. I'll just share one thing really quickly then. It would be great to go into more detail, but broadly I will say the results of this are in line with the deep research
system card that OpenAI put out, with obviously differences here and there because of the focus on agenticness versus the deep research side. But in any case, they do see uplift in human performance on bioweapon acquisition trials, pretty significantly: it was a 2.1x improvement compared to people without the assistance of the model.
I remember when people were talking about an OpenAI study, or sorry, there was a RAND study that was put out a little while ago that seemed to suggest that, hey, there's no meaningful uplift from then-current models. That was then disproven by OpenAI, who used the fact that they had access to the base version of GPT-4 without the safeties on it, and even something like that can make a big difference.
Here's a quote from the paper that's interesting and in line with stuff I've heard from folks at Anthropic recently: there is a substantial probability that our next model may require ASL-3 safeguards. ASL-3 is, essentially, an AI Safety Level, level three; this is Anthropic's equivalent to OpenAI's sort of low, medium, and high risk categories. They have ASL-1, 2, 3, and 4.
So at ASL-3 you're starting to look at systems that are showing early signs of autonomy and so on, but again, it's sort of like crossing into that high risk tier on the CBRN side. So again, very consistent, right? We are on the cusp, to use OpenAI's language, of a lot of these very significant developments on the risk side. And moving on to policy and safety, we begin with a very much related safety paper from OpenAI.
It's titled Detecting Misbehavior in Frontier Reasoning Models, or sorry, that's the blog post; the paper itself is titled Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. So the idea is, once you have these reasoning models, they output a bunch of their reasoning via these chain-of-thought traces, and
fundamentally chain-of-thought monitoring can be very effective at telling whether the LLM is going off and doing things you don't want, and letting you catch it while it's going off course, so to speak. So in this paper they first show that CoT monitoring can be effective and that it's possible to intervene, but also that if you integrate the monitoring into the reinforcement learning reward, that can lead at first to better agents that are more aligned.
But if you optimize too much against the chain-of-thought monitor, the agents do learn obfuscated reward hacking, and I think we covered something similar a week ago, where basically not only do the models find the wrong way to get a good reward, they also obfuscate the fact that they're doing it, you know, making it seem in the chain of thought like they're not.
So they have in this paper the idea of a monitorability tax: allowing the chain-of-thought output to maybe be suboptimal as a method of keeping it easy to tell what the model is doing, so it doesn't optimize itself into being able to cheat. Right? Yeah, it's essentially like the model at first learns to tell you that it's planning on doing something dangerously creative to achieve its programmed objective.
And if you then go, oh, don't do that, eventually, if you do that enough, it will learn to just lie to you about the fact that it's going to use the dangerously creative strategy. This is something that's been predicted, by the way, forever; it's something that I will say a lot of people claimed was science-fictiony and implausible. I think this is another unfortunate moment where people on the alignment side, in the alignment community,
have called it years and years ahead of time and have turned out, unfortunately, to be right. It didn't have to be this way, but apparently it is. And so, yeah, this is Goodhart's law, right? Anytime you apply enough optimization pressure to an objective, you will find dangerously creative strategies that exploit the gaps between your intent and that objective, gaps you didn't tell it not to pursue because you never even thought of them in the first place, right?
That is Goodhart's law, and you're just seeing it play out. It's really interesting, the specific way this manifests: apparently you'll get this very innocent-looking reasoning trace, and then suddenly it'll just do the bad thing, the thing you would have told it not to do; it just won't tell you that it's reasoning its way there. Which is sort of interesting. To me that suggests the model must be paying its own tax,
because it takes extra compute to think about being deceptive, but then it's generating an output that's completely decoupled from the reasoning trace. And it makes me wonder if there is some steganography that would eventually go on there, where not only would it find a way to generate an innocent-looking reasoning trace so that it doesn't get blocked, but that ostensibly innocent-looking reasoning trace might actually contain, encoded within it,
some thought tokens that allow it to reason in the kind of evil way, if you will, that you don't want it to, just because those are the two pressures you're putting on it: get the right answer, but also make me think that you're reasoning innocently. Anyway, they have this really interesting plot that shows how these hacks, these kind of dangerously creative hacks, evolve in the model over time,
and they show manual interventions where hack exploitation basically drops to zero for one kind of hack because they catch it, but then another pops up, right? You're playing whack-a-mole with these systems, and ultimately, if you just pour more compute into your problem, this is a game of whack-a-mole you are destined to lose. It's really interesting: they check, you know, baseline agents where you don't exert any optimization pressure on the chain of thought.
You find that quite often, not always, but around 95% of the time, a weaker supervising model like GPT-4o can catch these attempted hacks and knock them down. Over time, the number of hacks that are attempted just goes up and up, and that reflects the fact that hacks are almost by definition easier to implement than the rightful solution. So there is an actual outright pressure; it's not just that these are equal solutions
and sometimes it goes with the annoying hack and sometimes it goes with the solution you want. No, over time the models will converge on the dangerous solution because it is more efficient, it's less compute intensive, or it reflects other pressures the model faces.
And this is, at its core, by the way, the argument for loss of control: it's very, very difficult, looking at these kinds of empirical results, to conclude that we're not headed for a situation where that could be a live risk in at least some contexts. So I think this is great research, and great to see OpenAI publishing this and being very open about it, especially given some of the other qualms people have had with their alignment program.
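To make the design choice in question concrete, here's a schematic sketch of the reward setup being discussed: either keep the chain-of-thought monitor out of the training reward and use it purely for oversight, or fold it in and accept the risk of teaching obfuscation. The monitor here is a toy keyword check standing in for the weaker model (GPT-4o in the paper), and none of this is OpenAI's actual code:

# Schematic sketch (not OpenAI's code) of the choice discussed above: fold a
# chain-of-thought monitor into the RL reward, or keep it out and only use it
# for oversight. Pressuring the CoT (monitor_weight > 0) is what the paper
# finds can teach the model to hide its reward hacking.

def task_reward(solution_passes_tests: bool) -> float:
    return 1.0 if solution_passes_tests else 0.0

def monitor_flag(chain_of_thought: str) -> bool:
    """Stand-in for a weaker monitor model that reads the CoT and flags intent
    to hack; here just a toy keyword check."""
    return any(w in chain_of_thought.lower() for w in ("skip the tests", "hardcode"))

def training_reward(cot: str, passes: bool, monitor_weight: float) -> float:
    # monitor_weight = 0.0 -> CoT stays un-pressured, monitor used only to observe
    # monitor_weight > 0.0 -> optimizing against the monitor; risks obfuscation
    penalty = monitor_weight if monitor_flag(cot) else 0.0
    return task_reward(passes) - penalty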
Yeah. And by the way, the hacking in question: in this case they focus on examples where the AI needs to pass unit tests, and in this Python context it could cheat in various ways, basically hacking the environment itself to get passing grades without actually doing the task. And the next story is about China telling its AI leaders to avoid US travel over security concerns, according to the Wall Street Journal.
So that's pretty much the story: apparently the Chinese leadership is worried that AI experts traveling abroad could divulge confidential information, or that they could be detained and used as a bargaining chip. And I've got to say, this has been ongoing for a while, but in the context of AI, where there are a lot of researchers in China, a lot of conferences and so on, it's starting to get a little intense when you're seeing these kinds of stories. Yeah, for sure.
And there's a bit of projection, I think, going on here, because this is exactly the kind of game plan that China has used in the past. There was famously a case where China held back, I guess it was two Canadian citizens, as part of their negotiation with Canada over, I forget what, and in parallel with that a Huawei executive was held in Canada at Washington's request; this was during the first Trump administration.
I think this was, oh God, May something, I forget, but anyway, she was ultimately, I think, sent back to China. So in this case, can you say wartime footing? This is really what you expect to see when countries are starting to look at AI as essentially a proto-WMD, certainly a key strategic national security technology.
And the details here are that authorities are discouraging executives at leading local companies in AI and other strategically sensitive industries, such as robotics, from traveling to the US and to US allies unless it's urgent; if they must, they're instructed to report their plans before leaving and, upon returning, to brief authorities on what they did and whom they met. So apparently, in retrospect, this may have been tied to Liang Wenfeng,
the DeepSeek founder, who turned down an invitation to attend the AI summit in Paris back in February. So there's speculation about whether that may have been prompted by essentially the same muscle movement that led to this announcement by the Chinese. A lot of similar things have come up.
And apparently Xi Jinping, at this now-famous gathering that involved Liang Wenfeng and a whole bunch of other folks in the AI community in China, told them to uphold a quote sense of national duty as they develop their technology. That's pretty interesting, kind of proto-nationalization language, though it is consistent with the CCP's broader tone when they address these things.
So there is an interpretation of this that says, well, China seems to be gearing up for some big adversarial race on AI. It certainly is the exact opposite of what China's historically done, right? They've historically said, no, go forth, go to the US, absorb everything they know, bring it back here. This now suggests maybe a slightly different attitude. It could also be, and it could be a yes-and here, that there's been an issue with brain drain from China.
A lot of wealthy Chinese have moved overseas in recent years especially, and that includes researchers, so you might worry about that piece. But both things can be true at the same time, and it's a pretty ominous sign, and I think frankly under-reported; this is the sort of thing you see in the run-up to a major international clash over a key technology. Right.
And on a related note, OpenAI did submit a policy proposal to the Office of Science and Technology Policy, basically saying they should heavily consider banning technology from what they call state-subsidized or state-controlled entities, likely referring to DeepSeek. So OpenAI is seemingly positioning itself increasingly as a hawk against China, I suppose. Yeah, and you see all the predictable pushback on that.
You know, you have OpenAI pushing back against an open source model on the grounds that it's state-subsidized, but of course the whole premise of OpenAI was, well, you want to open things up so that anybody can use the AI and it doesn't all get centralized. So there's similar skepticism about, I think, OpenAI coming out and using DeepSeek as a reason to say, hey, we need to give AI companies in the US this fair use exemption so they can train on proprietary data,
because we have this competition with China. Anyway, it's a whole hairball, but this, by the way, is a submission to the request for information that came out as the Trump administration is putting together their AI action plan. So, you know, we'll be seeing that; I think it's David Sacks who's nominally in charge of that effort. And that is it for the episode. Not quite two hours, but we certainly did use up a lot of time given that there wasn't any huge story. Thank you so much for listening.
As always, feel free to go to lastweekin.ai for our text newsletter with even more articles we haven't covered. And as always, do subscribe if you aren't already, please do share and review, and give us feedback to let us know how we can improve. But more than anything, be sure to keep tuning in.