
#205 - Gemini 2.5, ChatGPT Image Gen, Thoughts of LLMs

Apr 01, 2025 | 2 hr 34 min | Ep. 245

Episode description

Our 205th episode with a summary and discussion of last week's big AI news! Recorded on 03/28/2025

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Read our text newsletter and comment on the podcast at https://lastweekin.ai/.

Join our Discord here! https://discord.gg/nTyezGSKwP

In this episode:

  • OpenAI's new image generation capabilities represent significant advancements in AI tools, showcasing impressive benchmarks and multimodal functionalities.
  • OpenAI is finalizing a historic $40 billion funding round led by SoftBank, and Sam Altman shifts focus to technical direction while COO Brad Lightcap takes on more operational responsibilities.
  • Anthropic unveils groundbreaking interpretability research, introducing cross-layer tracers and showcasing deep insights into model reasoning through applications on Claude 3.5.
  • New challenging benchmarks such as ARC AGI 2 and complex Sudoku variations aim to push the boundaries of reasoning and problem-solving capabilities in AI models.

Timestamps + Links:

  • (00:00:00) Intro / Banter
  • (00:01:01) News Preview
  • Tools & Apps
  • Applications & Business
  • Projects & Open Source
  • Research & Advancements

Transcript

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news, and you can go to the description of the episode for all the timestamps, all the links and all that. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and I now work at a generative AI startup. And I'm your other host, Jeremie Harris.

I'm at Gladstone AI doing AI national security stuff. This has been a crazy hectic day, a crazy hectic week. So I'm gonna say right off the bat, there's a big, big Anthropic story that I have not yet had the chance to look at. Andrey, I know you've done a bit of a dive on it, so I'm gonna maybe punt my thoughts on it to next week. But yeah, this has just been a wild one. Yeah, it's been a wild week. The last couple of weeks were, you know, slightly more quiet, without anything huge.

And then this week, multiple huge stories coming out and really being surprising and actually quite a big deal. I think not since Grok 3 and that slate of models, Claude 3.7, have we had a week that was this big. So it's an exciting week and we're gonna probably dive straight into it. Let me give a quick preview of what we will be talking about. Tools and apps: we have Gemini 2.5 coming out and kind of steamrolling everyone's expectations, I would say.

And we have image generation from GPT-4o by OpenAI, similar to what we saw with Gemini: taking image generation to the transformer, getting rid of diffusion seemingly, and being, like, mind blowing. Then we go to applications and business: OpenAI getting some money, and a few stories related to hardware. Projects and open source: some very exciting new benchmarks where we continue to try to actually challenge these new models. Research and advancements:

as you said, Anthropic has a really cool interpretability paper that we will start talking about, but there's a lot to unpack, so we might get back to it next week. And then policy and safety: some kind of smaller stories related to what the federal government in the US is doing, and actually some updates on copyright law stuff in the last section. So a lot to get through. We'll probably be talking a little bit faster than we do in our typical episodes.

Maybe, we'll see if we're able to keep it a bit more efficient. So let's get straight to it. Tools and apps: we have Gemini 2.5, what Google is calling their most intelligent AI model. And this is one of their, I guess, slate of thinking models. Previously they had Gemini 2.0 Flash Thinking; that was kind of a smaller, faster model. Here, Gemini 2.5 is representing their bigger models.

We had Gemini 2.0 Pro previously, which came out as their biggest model, but at the time it was kind of not that impressive, just based on the benchmarks, based on people using it and so on. So Gemini 2.5 came out, like, topping the leaderboards by a good margin, which we haven't seen for a while with benchmarks. Like, its performance on the benchmarks is significantly higher, yeah, than the second best one on pretty much any benchmark you can look at. Even ones that seemed saturated to me.

And not only that, just based on a lot of anecdotal reports I've been seeing, in terms of its capacity for things like coding compared to Claude, or its capacity for writing, problem solving. It's just like another kind of class of model that is able to one-shot, just given a task, nail it without having to get feedback or having to do multiple tries, things like that. So, yeah, super impressive. And, to me, kind of a surprising leap beyond what we've had. Yeah, absolutely.

And I think one of the surprising features is where it isn't SOTA quite yet, right? So SWE-bench Verified, right, it's actually a benchmark that OpenAI first developed. You had SWE-bench, essentially real-world-ish software engineering tasks; SWE-bench Verified is the cleaned up OpenAI version of that. So on that benchmark Claude 3.7 Sonnet is still number one, and by quite a bit.

Like, this is pretty rare: looking at 2.5, which just crushes everything else in just about every other category, but still there's that quite decisive edge, basically 6% higher in performance on that benchmark, for Claude 3.7 Sonnet still. But that aside, Gemini 2.5 Pro is, I mean, just crushing, as you said, so many things. One of the big benchmarks a lot of people are talking about is this sort of famous Humanity's Last Exam, right?

This is the benchmark that Dan Hendrycks, Elon's AI advisor who works at the Center for AI Safety, put together. I mean, it is meant to be just ridiculously hard reasoning questions that call on general knowledge and reasoning at a very high level. Previously OpenAI's o3-mini was scoring 14%; that was SOTA. Now we're moving that up to 18.8%. We're gonna need a new naming scheme for benchmarks that doesn't make them sound as final as Humanity's Last Exam, by the way.

But we're on track right now to, I mean, like, we're gonna be saturating this benchmark, right? That's gonna happen eventually. This is a meaningful step in that direction, and things have been moving really quickly with inference-time reasoning, especially on that benchmark. But a couple things I guess to highlight on this one. Google coming out and saying, look, so, by the way, this is their first 2.5, Gemini 2.5, release. It's an experimental version of 2.5 Pro.

What they're telling us is, going forward, they are going to be doing reasoning models across the board. So, like OpenAI, don't expect more base models to be released as such anymore from DeepMind. So everybody is kind of migrating towards this view that, like, yep, the default model should now be a reasoning model. It's not just gonna be GPT-4.5 and so on. It's really gonna be reasoning driven. And the stats on this are pretty wild. There's so much stuff.

I'm trying to just pick one. For one, I mean, it tops the LM Arena leaderboard, which is a cool kind of rolling benchmark because it looks at human preferences for LLM outputs, and then it gives you essentially like an Elo score for those, and it's a pretty wide margin for Gemini 2.5. So subjectively getting really good scores, as you said, Andrey. This is kind of like the measured subjectivity side. On SWE-bench Verified,

63.8 is really good, especially given, you know, even though it comes in second place to Sonnet, when you look at the balance of capabilities, this is a very wide capability envelope. They do say they specifically focused on coding, so again, still kind of interesting that they fall behind 3.7 Sonnet. Maybe the last spec to mention here is it does ship today with a 1 million token context window.

So in the blog post announcing this, Google made a big stink about how they see one of their big differentiators as large context, and they're gonna be pushing to 2 million tokens of context soon as well, apparently. Right, and that is a significant detail, because 1 million, I haven't been keeping track, I do think we had Claude Opus in that space of very large context, no? But 1 million is still very impressive, and going to 2 million is pretty crazy.

Again, you keep having to translate how big 1 million tokens is. Well that's, I don't know, a few million words, or maybe slightly less than a million words, 'cause yeah, maybe 700,000 or something; 2 million would be over a million. It's a lot of content. You can fit an entire manual, an entire set of documents, et cetera in there.
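
To pin down that back-of-the-envelope: a quick sketch of the tokens-to-words conversion, assuming the common rule of thumb of roughly 0.75 English words per token (the real ratio depends on the tokenizer and the kind of text).

```python
# Rough back-of-the-envelope: how much text fits in a 1M / 2M token context.
# Assumes ~0.75 English words per token, a common rule of thumb; the real
# ratio depends on the tokenizer and the kind of text.
WORDS_PER_TOKEN = 0.75

for context_tokens in (1_000_000, 2_000_000):
    words = int(context_tokens * WORDS_PER_TOKEN)
    novels = words / 90_000  # a typical novel is on the order of ~90,000 words
    print(f"{context_tokens:,} tokens ~ {words:,} words ~ {novels:.0f} novels")
```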

And of course, as with other Gemini models, it is multimodal: takes in text, audio, images, video. I've also seen reports of it being very capable at processing audio and images as well. And to that point, it's starting to roll out as an experimental model. You can already use it in Google AI Studio, or if you're paying for Gemini Advanced you can also select it in the model dropdown and just try it. And that's part of how we've been seeing people try it and report really good outcomes. So very exciting.

And onto the next story, also something very exciting and also something that's kind of mind blowing to an unexpected degree. So OpenAI has rolled out image generation powered by GPT-4o to ChatGPT. To my understanding, and I'm not totally sure this is exactly the right details, but it's similar to Gemini from Google last week, or was it last week or two weeks ago, I don't know.

The idea here is, instead of having a separate model that is typically a diffusion model, where the LLM is like, okay, let me give this prompt over to this other model that is just text-to-image and that will handle this and return the image.

This is taking the full kind of end-to-end approach, where you have a multimodal model able to take in text and images, able to put out text and images, just via a set of tokens. And as a result of moving to this approach of not doing diffusion, doing full-on token language modeling, these new category, really, of text-to-image models, or image-plus-text-to-image models, have a lot of capabilities we haven't seen with traditional text-to-image. They have very impressive editing right out of the box.

They also have very, very good ability to generate text, a lot of text in an image, with very high resolution. And they seem to just really be capable of very strict prompt adherence and making very complex text descriptions work in images and be accurate. And we've also discussed how with image models it's been increasingly hard to tell the difference or, like, see progress. Yeah. But I will say also, you know, especially with DALL-E, and to some extent also rival models,

there has been a sort of, like, pretty easy telltale sign of AI generation, with it having a sort of AI style, being a little bit smooth, being, I don't know, sort of cartoony in a very specific way, especially for DALL-E. While this is able to do all sorts of visual types of images, so it can be very realistic, I think, differently from what you saw with DALL-E from OpenAI.

And it can do, yes, just all sorts of crazy stuff, similar to what we saw with Gemini in terms of very good image editing, in terms of very accurate translation of instructions to image. But in this case, I think even more so, just the things people have been showing have been very impressive. Yeah. And I think a couple things to say here.

First of all, astute observers or listeners will note, last week we covered Grok now folding an image generation service into its offering internally, right? So there's this theme of the omni-modal, at least, platform, right? Grok is not necessarily going to make one model that can do everything. Eventually, I'm sure it will. But we're making kind of baby steps on the way there.

This is OpenAI kind of doing their version of this and going all the way omni-modal, with one model to rule them all. You know, big, big strategic risk if you are in the business of doing text-to-image or audio-to-whatever: like, assume that all gets soaked up, because of positive transfer, which does seem to be happening, right?

One model that does many modalities tends to be more grounded, tends to be more capable at any given modality, just because it benefits from that more robust representational space, 'cause it has to be able to represent things in ways that can be decoded into images, into audio, into text. So just a much more robust way of doing things. One of the key words here is binding, right? One of the key capabilities of this model is binding, which is this idea where you're essentially looking at how well

multiple kinds of relationships between attributes and objects can be represented in the model's output. So if you say, you know, draw me a blue star next to a red triangle, next to a green square, you wanna make sure that blue and star are bound together. You want to make sure that red and triangle are bound together faithfully, and so on. And that's one of the things that this model really, really does well, apparently.
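
To make the binding idea concrete, here is a minimal sketch of what a binding-style check could look like: build a prompt with N color-object pairs, then see whether each pair survives intact in a caption of the generated image. This is an illustration of the concept only, not OpenAI's actual evaluation; the helper names and the stand-in captioner are hypothetical.

```python
import itertools
import random

# Illustrative "binding" test: generate a prompt with N (color, object) pairs
# and check whether each pair appears correctly bound in a caption of the
# generated image. Not OpenAI's eval; the captioner below is a stand-in you
# would replace with a real vision-language model describing the image.
COLORS = ["red", "blue", "green", "yellow", "purple", "orange"]
OBJECTS = ["star", "triangle", "square", "circle", "hexagon", "arrow"]

def make_binding_prompt(n_pairs: int, seed: int = 0):
    rng = random.Random(seed)
    pairs = rng.sample(list(itertools.product(COLORS, OBJECTS)), n_pairs)
    prompt = "Draw " + ", ".join(f"a {c} {o}" for c, o in pairs)
    return prompt, pairs

def binding_score(caption: str, pairs) -> float:
    """Fraction of (color, object) pairs that show up correctly bound in the caption."""
    hits = sum(1 for c, o in pairs if f"{c} {o}" in caption.lower())
    return hits / len(pairs)

prompt, pairs = make_binding_prompt(n_pairs=15)
caption = prompt.lower()  # stand-in: pretend the model reproduced the scene faithfully
print(prompt)
print("binding score:", binding_score(caption, pairs))
```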

So apparently it can generate correctly bound attributes for up to 15 to 20 objects at a time without confusion. This, in a sense, is the text-to-image version of the needle in a haystack eval, right, where we see, like, many different needles in the haystack. Well, this is kind of similar, right? If you populate the context window with a whole bunch of these relationships, can they be represented, let's say, with fidelity in the output?

The answer, at least for 15 to 20 objects in this case and relatively simple binding attributes, is yes. Right? So that's kind of one of the key measures that actually there's something different here. I wouldn't be surprised if this is a consequence of just having that more robust representational space, you know, that comes with an omni-modal model. One other thing to highlight here is we do know that this is an autoregressive system, right?

So it's generating images sequentially from left to right and top to bottom, in the same way that text is trained and generated in these models. That's not gonna be a coincidence, right? If you want to go omni-modal, you need to have a common way of generating your data, whether it's video, audio, text, whatever, right? So this is them saying, okay, we're going autoregressive, presumably an autoregressive transformer, to just do this. So, pretty cool.
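
For intuition, here is a toy sketch of what autoregressive image generation looks like in principle: the image is a grid of discrete codebook tokens, decoded one at a time in raster order, exactly like next-token prediction for text. The tiny model below is a stand-in, not OpenAI's architecture, and the codebook size and grid are made-up numbers.

```python
import torch
import torch.nn as nn

# Toy sketch of autoregressive image generation: an image is a grid of
# discrete tokens (codebook indices), decoded one token at a time in raster
# order, just like next-token prediction for text. Stand-in model, not
# OpenAI's architecture; all sizes are illustrative.
VOCAB = 1024        # size of the image-token codebook (assumed)
GRID = 8            # 8x8 grid of image tokens for the toy example
EMBED = 64

class TinyImageLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, EMBED)          # +1 for a BOS token
        self.rnn = nn.GRU(EMBED, EMBED, batch_first=True)     # stand-in for a transformer
        self.head = nn.Linear(EMBED, VOCAB)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

@torch.no_grad()
def sample_image_tokens(model, temperature=1.0):
    tokens = torch.tensor([[VOCAB]])   # start with BOS
    for _ in range(GRID * GRID):       # raster order: left-to-right, top-to-bottom
        logits = model(tokens)[:, -1] / temperature
        nxt = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens[:, 1:].reshape(GRID, GRID)   # grid of codebook indices

grid = sample_image_tokens(TinyImageLM())
print(grid.shape)  # torch.Size([8, 8]); a decoder/codebook would map this to pixels
```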

There's a whole bunch of, anyway, cool little demos that they showed in their launch worth checking out. One last little note here is they're not including any visual watermarks or indicators that show that the images are AI generated, but they will include what they call their standard C2PA metadata to mark the image as having been created by OpenAI. Which, we've talked about that in the past; if you're curious about that, go check out those episodes.

But yeah, so OpenAI kind of taking a bit of a middle ground approach on the watermarking side. Yeah. And they also are saying there'll be some safeguards, certainly compared to things like Grok, where you won't be able to generate sexual imagery. You won't be able to, for instance, have politicians with guns or something like that. Of course, you're gonna be able to get around these safeguards to some extent. But certainly a more controlled type of model, as you would expect.

Last thing I'll also say is you've seen a ton of different use cases for this popping up on social media. The one you may see covered in media is the Ghibli-fication of images, where it has turned out that you can take a photo and tell the system to translate it to a Ghibli style. Ghibli is a very famous animation studio from Japan, and it does a very good job, like a very faithful rendition. Definitely looks like Ghibli.

And that kicks off a whole set of discussions as to, again, what AI means for art, you know, the ethics of it. There are also discussions as to what this means for Photoshop, because it can do image editing, it can do design. You know, again, this is, I think, a surprising thing, where we haven't talked about text-to-image as being mind blowing in a little while and it kind of seemed to plateau for a while, and now it is, to me, certainly mind blowing again to see the stuff you can do.

Onto the lightning round, and we actually have a couple more image generators to cover. I don't know if they decided to come out at the same time or what, but there are a few. Starting with Ideogram: they are presenting version three of their system. Ideogram is one of the leading text-to-image focused businesses out there. Early on, their claim to fame was being able to handle text better. But these days, of course, that's not the case.

They say that this 3.0 version of their system is able to create more realistic and stylized images. In particular, they have the ability to upload up to three reference images to guide the aesthetic of the output. And there are 4.3 billion style presets. So I think this reflects Ideogram being a bit more of a business, and this being more of a product for them, like as a primary focus.

So again, now compared to GPT-4o, this is nowhere near that, but for specialized use cases, it could still be the case that something like Ideogram can, you know, hold on for a while. We'll see. You can almost hear yourself arguing the TAM, the total addressable market size, for these products down and down and down as ChatGPT and all the big players kind of grow and grow and grow their own TAM. This is one of the problems we've talked about for a long time on the podcast.

I think, and I remain ready to be proved wrong here and expect to look stupid for any number of reasons as usual, but I think Ideogram is dead in the medium term, like a lot of companies in this space. Look, they do say 4.3 billion style presets. We, of course, as extremely competent AI journalists, have tested every single one and can report that they are pretty good, actually.

You're saying, Andrey, that the text-in-image feature is a kind of lower-value thing now because of the competition; a hundred percent the case. This is why Ideogram is now choosing, or forced, to maybe emphasize photorealism and professional tools, right? That's kind of what they're making their niche, but they're gonna get more and more niche. This is gonna keep happening as their territory gets encroached on by the sort of blessings of scale that the true hyperscalers can benefit from.

So very cool, but kind of overshadowed by GPT-4o. I will say one last point: it could still be the case that, as a specialized sort of model or business, where they focused on, let's say, business use cases for, I dunno, posters, maybe they have training data that allows them to still be better for a particular niche. I don't know. I think OpenAI's buying power for that training data is gonna vastly exceed theirs.

And I think also, well, I would say proprietary data from users of a platform, perhaps. Oh, a hundred percent. Yeah. Yeah, I mean, I think they're also fighting positive transfer as well. There are a lot of secular trends here, but you're right, at a certain point, if you can protect that data niche, yeah, you're absolutely right, that's the one way out that I can see, at least for sure. Yeah. And the next story, also a new image generator that was also mind blowing before GPT-4o.

So the headline is: new Reve image generator beats AI art heavyweights like Midjourney and Flux at pennies per image. This came out, there was a model code-named Halfmoon that was already impressing everyone. It has come out now as Reve Image 1.0. They are providing a service for it: you can get a hundred free credits and then credits at $5 for 500 generations. And, you know, this was, pre-GPT-4o,

again, really impressive in terms of its prompt adherence, in terms of being able to construct complex scenes, and just generally kind of doing better at various more nuanced or tricky tasks than other image generators. Seemed like the best, like an exciting new step in image generation. I'll keep saying it: GPT-4o, and to some extent also Gemini before it, to be fair, are still kind of more mind blowing than these things.

Yeah. I mean, approximately take my last comments on Ideogram and copy paste them in here. I think it's all roughly the same, but it's a tough space now, right? It's really getting commoditized. Right. And one thing also worth noting quickly is one differentiator could be cost, and speed also, because with the autoregressive model you're using LLMs, and LLMs are typically slower when decoding these things.

If you're still using diffusion models, it could be cheaper and could be faster, which, yeah, could be significant. I don't know. I think in practice this is really tough. I mean, OpenAI gets to amortize their inference over way larger batch sizes, and that's really the key number that, you know, you care about when you're tracking this sort of thing. There's also, you know,

if it makes economic sense, OpenAI will just distill smaller models and/or have models, you know, specialized in this. So I think, again, like, long run, it's really kind of batch size versus batch size, compute fleet versus compute fleet. In my mental picture of this, the rich get richer, but again, like, very, very willing to look like an idiot at some point in the future. Yeah, I'm certain these companies are definitely thinking about, yeah, their odds as well.

Next up, moving away from image generation but sticking with multimodality: Alibaba is releasing Qwen 2.5 Omni, which is adding voice and video modes to Qwen Chat, or also adding these things. So they are open sourcing Qwen 2.5 Omni 7B. That is a multimodal model, has text, image, audio, and video, and it's under the Apache 2.0 license. And it is somewhat significant because, to my memory, in the multimodal model space we don't have as many strong models as in just pure LLMs.

Yeah, we have started seeing more of that with things like Gemma, but this has text, images, audio and video. So possibly, if I'm not forgetting anything, kind of a pretty significant model to be released under Apache 2.0 with this multimodality. Yes. And kind of seeing, you know, maybe some of the blessings of scale, positive transfer stuff, starting to show up here as well. Interesting.

You have to see it as an open source model, and, yet again, you know, the Chinese models being legit. Like, I mean, this is no joke: the benchmarks here are comparing favorably, for example, to Gemini Pro on OmniBench. That's a... sorry, let me be careful: Gemini 1.5 Pro, right? We're two increments beyond that as of today. But still, this is stuff from like six months ago, and it's beating it handily in the open source. So that's a pretty big development. Right.

And can you imagine if we had a history where OpenAI didn't create this, you know, versioning system for models and we actually had new names for models? Wouldn't that be cool? You know what it also makes you want to do, kind of, like, you know, 1.5 for this lab should be the same as 1.5 for this one. And you even see some of the labs trying to kind of number their things out of sequence, just to signal to you how they want to be compared. It's a mess.

Yeah. And speaking of impressive models out of China, next we have T1 from Tencent. So this is their thinking model, kind of the equivalent to Gemini 2.0 Thinking, to o1, and it is available on Tencent Cloud, priced pretty competitively, and it apparently tops leaderboards, beating R1 and o1. So another kind of impressive release. I couldn't see many technical details on this, and in fact it didn't seem to be covered in Western media that much, but it could be a big deal.

Tencent being a major player in the Chinese market. Yeah, the big kind of release announcement is that, first of all, it is, interestingly, the hybrid Mamba architecture, by which they presumably mean the combination of transformer and Mamba that we've talked about before, that a lot of people see as this way of kind of covering the downsides of each. Check out our Mamba episodes, by the way, for more on that, 'cause it's a bit of a deep dive.

But yeah, they claim this, they refer to it as the first lossless application of the hybrid Mamba architecture. I don't know what lossless means in this context. So I asked Claude and it said, well, in this context it probably means, you know, there was no compromise or degradation in model quality adapting the Mamba architecture for this large scale inference model. Okay, fine, you know, if that's the case. But again, this is where a deeper dive would be helpful.
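
For intuition on the "hybrid" part: a conceptual sketch of interleaving attention layers with recurrent, state-space-style layers. The SSM block below is a very simplified gated linear recurrence used only as a stand-in for Mamba's selective state space layer; this is not Tencent's actual T1 architecture, just the interleaving idea.

```python
import torch
import torch.nn as nn

# Conceptual sketch of a "hybrid" stack: some layers are attention blocks,
# others are recurrent/state-space-style blocks. SSMBlock here is a simplified
# gated linear recurrence standing in for Mamba's selective SSM; this is NOT
# Tencent's actual architecture, only the interleaving pattern.
D = 128

class AttnBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(D)
        self.attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class SSMBlock(nn.Module):
    """Gated linear recurrence: h_t = a_t * h_{t-1} + b_t * x_t (a stand-in for Mamba)."""
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(D)
        self.gates = nn.Linear(D, 2 * D)
        self.out = nn.Linear(D, D)

    def forward(self, x):
        h = self.norm(x)
        a, b = torch.sigmoid(self.gates(h)).chunk(2, dim=-1)
        state = torch.zeros_like(h[:, 0])
        states = []
        for t in range(h.shape[1]):            # sequential scan over the time dimension
            state = a[:, t] * state + b[:, t] * h[:, t]
            states.append(state)
        return x + self.out(torch.stack(states, dim=1))

# Interleave: e.g. one attention block for every couple of SSM-style blocks.
hybrid = nn.Sequential(SSMBlock(), SSMBlock(), AttnBlock(), SSMBlock(), SSMBlock(), AttnBlock())
x = torch.randn(2, 16, D)                      # (batch, sequence, hidden)
print(hybrid(x).shape)                         # torch.Size([2, 16, 128])
```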

And it'll be interesting to see Mamba; we haven't seen much, I haven't seen much about Mamba in quite a while. That doesn't mean it's not being used, you know, in a proprietary context by labs that we don't know about, but it's sort of interesting to see another announcement in that direction. And moving on to applications and business, we begin with OpenAI and the story that they're now close to finalizing their $40 billion raise.

This is led by SoftBank, and they have various investors in here, things like Founders Fund, Coatue Management, and some things that you actually don't know too much about. They have an Illinois-based hedge fund that is contributing up to 1 billion. But the leader certainly seems to be SoftBank. They're saying they'll invest an initial 7.5 billion there, along with 2.5 billion from, I dunno, other sources. And this will be the biggest round ever in fundraising, right? 40 billion is

more than most companies' market caps. So it's crazy. And funnily enough, the shares of SoftBank dropped in the Japanese share markets, I think because people are like, SoftBank, you're giving a lot of money to OpenAI, is that a good idea? What, has SoftBank made giant multi-billion dollar capital allocation mistakes before, Andrey? Like, I certainly can't remember. Yeah, I mean, there's nothing, there's no company that starts with "We" where SoftBank was, you know, famously involved. Yeah, yeah, yeah.

But no, they've had a pretty rough time. And so SoftBank obviously is famous for those calls. I actually can't remember, I know that there's a noteworthy story to be told about their performance over the last few years, and my brain's fried. I'm trying to remember if it's like SoftBank actually, you know, is doing well actually, or SoftBank is completely fucked. It's one of those two, I think. The investors apparently include this

Magnetar Capital, which I had never heard of. To your point, the only one that I'd heard of in the list here is Founders Fund, which, by the way, I mean these guys just crush it: SpaceX, Palantir, Stripe, Anduril, Facebook, Airbnb, Rippling. Like, Founders Fund is just absolute God tier. But apparently Magnetar Capital has $19 billion in assets under management, and they're gonna put in up to 1 billion alone in this round. So that's pretty cool, pretty big.

So yeah, going up to $300 billion would be the post-money valuation, which is basically double the last valuation of 157 billion. That was back in October. So, I'm sorry, has your net worth not doubled since October? What are you doing, bro? Like, get out there and start working, 'cause OpenAI is... that's pretty wild. So yeah. Anyway, there's a whole bunch of machinations about how the capital is actually gonna be allocated.

SoftBank's gonna put in an initial $7.5 billion into OpenAI, but then there's also 7.5 billion that they presumably have to raise from a, a syndicate of investors. They don't, you know, necessarily have the full amount that they need to put in on their balance sheet quite yet.

And I think this was part of what Elon was talking about in the context of the Stargate build, saying, hey, you know, I'm, like, looking at SoftBank, these guys just don't have the balance sheet to support a $500 billion investment, or $100 billion or whatever, you know, was claimed at the time. And this is kind of true, and that's part of the reason why there's a second tranche of $30 billion that's gonna be coming later this year.

That will include 22 billion from SoftBank and then more from a syndicate. So it's all this kind of staged stuff. There's a lot of people who still need to be convinced, or, you know, when you're moving money flows that are that big, obviously there's just a lot of stuff you have to do to free up that capital. So this is history's largest fundraise, if it does go through. That's pretty wild.

Next up, another story about OpenAI and some changes in their leadership structure, which is somewhat interesting. So Sam Altman is seemingly, kind of, not stepping down but stepping to the side, and is meant to focus more on the company's technical direction and guiding their research and product efforts. He is the CEO, or at least was the CEO, meaning that, of course, as a CEO, you basically have to oversee everything, lots of business aspects. So this would be a change in focus, and they are

promoting, I guess, or it doesn't seem like they announced changes to titles, at least not that I saw, but the COO, Brad Lightcap, is going to be stepping up with some additional responsibilities, like overseeing day-to-day operations and managing partnerships, international expansion, et cetera. There's also a couple more changes. Mark Chen, who was, I think, an SVP of research, is now the Chief Research Officer. There's now a new Chief People Officer as well.

So a pretty significant shuffling around of their C-suite, of course following up on a trend of a lot of people leaving, which we've been covering for months and months. So I don't know what to read into this. I think it could be a sign of trouble at OpenAI that requires restructuring. It could be any number of things. But it is notable, of course.

There's, you know, that iceberg meme where they're like, you know, you got the regular theories at the top and then the kind of deep dark conspiracy theories at the bottom. There are two versions of this story, or a bunch, and I've heard one person, at least a former OpenAI person, speculate about a kind of, like, darker reason for this. But, so Brad Lightcap, you're right, was the COO before and is still the COO. All that's happening here, presumably, is a widening of his mandate.

This is notable because Sam Altman is a legendarily good fundraiser, and one would assume corporate partnership developer, and you can see that in, in the work that he did with Microsoft and Apple, like very few companies have deep partnerships with both Microsoft and Apple, who under any typical circumstance are at each other's throats.

I'll also say, quickly on this Altman note, the fact that he got, like, friendly with the Trump administration under Elon Musk's nose is also pretty legendary, in my opinion. Yeah. Yeah, I mean, he managed to turn around essentially a lifetime of campaigning as a Democrat and for Democrats to kind of, like, yeah, tighten his tie and make nice with elements of the campaign. You know, it's always hard to know. But yeah,

you know, one take on this is, well, Mira Murati has not been replaced, and, you know, Sam has said there's no plan to replace her. He essentially is stepping in to fill that role, and it's founder mode stuff. He wants to get, you know, closer to the gears level. I'm sure that's a big part of it no matter what, and maybe the whole thing. Another take that I have heard

is that, as you get closer to superintelligence, the people at the command line, the people at the console who get to give the prompts to the model first, are the ones to whom the power tends to accrete, or with whom the power tends to accrete. So, you know, wanting to get more technical, wanting to turn into more of a Greg Brockman type, makes sense if you think that that's where, you know, if you're power driven, that's kind of where you wanna go.

Anyway, an interesting kind of iceberg meme thing. Last thing I'll mention is Mark Chen, who's mentioned here in the list as one of the people who's promoted, who you mentioned. You may actually know him from all the demo videos, right? So the, you know, deep research demo, I guess the o1 demo when it launched. He's often there as Sam's, like, kind of right hand demo guy. So anyway, his face and voice will probably be familiar to quite a few people.

Next up, we have kind of a follow-on story from what we covered a lot last week. So we were covering the Rubin GPU announcements from Nvidia. This is a story specific to the 600-kilowatt Kyber racks and infrastructure that are set to ship, also, in 2027, along with their announcements. So I'll let you take over on this one, Jeremie, since I think you know the details. It's just a little bit more on power density, rack power density. So, for context, you know, you have the GPUs themselves, right?

So currently the Blackwells, like the B200, that's the GPU. But when you actually put it in a data center, it sits on a board with a CPU and a bunch of other supporting infrastructure, and that is called a tray. So multiple of these trays with GPUs and CPUs and a bunch of other shit get slotted into these server racks, and together we call that whole thing a system.

A system with 576 of these GPUs, like, if you counted all the GPUs up in that system and you had 576 of them, that would be the NVL576 Kyber rack. This is a behemoth. It's gonna have a power density of 600 kilowatts per rack. That is 600 homes' worth of power consumption for one rack in a data center, right? 600 homes in one rack, that is insane, right? The cooling requirements are wild. For context, currently with your B200 series, you're looking at about 120 kilowatts per rack.
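
The arithmetic behind those comparisons, assuming the usual rough figure of about 1 kW of average draw per home (the number the "one rack = N homes" framing relies on):

```python
# Back-of-the-envelope on rack power density. Assumes roughly 1 kW of average
# draw per home, the usual rule of thumb behind "one rack = N homes".
KW_PER_HOME = 1.0

b200_rack_kw = 120      # current-generation rack, per the discussion
kyber_rack_kw = 600     # NVL576 Kyber rack
future_rack_kw = 1000   # the ~1 MW per rack figure Jensen has pointed toward

print("Kyber vs B200 density:", kyber_rack_kw / b200_rack_kw, "x")   # 5.0 x
print("Homes per Kyber rack:", kyber_rack_kw / KW_PER_HOME)          # 600.0
print("Homes per 1 MW rack:", future_rack_kw / KW_PER_HOME)          # 1000.0
```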

So that's like a 5x-ing of power density. It's pretty wild. And Jensen, while they haven't provided clear numbers, has said that we're heading to a world where we're gonna be pushing one megawatt per rack, so a thousand homes' worth of power per rack. Just kind of pretty wild for this Kyber system; just gives you a sense of how crazy things are gonna be getting. And another story on the hardware front, this time from China: we have China's SiCarrier, I dunno how to say it.

"Sy-carrier," I dunno. SiCarrier. Yeah, that is what you had to say first. Yeah. A Shenzhen-based company that is coming out as potentially a challenger to ASML and a number of other fab tool developers. So, as we've covered many times, probably, at this point, ASML is one of the pivotal parts of the ability to make advanced chips.

They are the only company providing the really kind of most advanced tools to be able to fabricate, you know, at the leading edge of tiny node sizes. Nobody is able to match them. And so this is very significant if, in fact, there will be kind of a Chinese domestic company able to provide these tools. Yeah. What's happening in China is kind of interesting.

Over and over we're seeing them try to amalgamate, to concentrate what in the US would be a whole bunch of different companies into one company. Right. So Huawei and SMIC seem to kind of be forming a complex; it's like as if you glued Nvidia to TSMC, the chip design with the chip fab, right? Well, here's another company, SiCarrier, C-carrier, I dunno,

but silicon carrier, that's essentially integrating a whole bunch of different parts of what's known as the front end part of the fab process. So when you manufacture semiconductors, you know, the front end is the first and most complex phase of manufacturing, where, like, your circuits are actually gonna be created on the silicon wafer. There's a whole bunch of stuff you have to do for that.

You have to prepare wafers, you have to actually have a photolithography machine that, like, fires basically UV light onto your wafer to then eventually do etching there. Then there's the etching, the doping with ions, deposition. There's all kinds of stuff. They have products now across the board. They just launched a whole suite of products kind of covering that end-to-end.

So that puts them in competition not just with ASML, but also with Applied Materials, with Lam Research, with a lot of these big companies that own other parts of the supply chain that are maybe a little easier to break into than necessarily lithography. But then on the lithography side, SiCarrier also claims that they built a lithography machine that can produce 28 nanometer chips. So less advanced, way less advanced than TSMC, but it brings China one step closer if this is true.

If this is true, and if it's at economic yields, it brings them one step closer to having their answer to ASML, which they're still a huge long way off from. This should not be overstated: the jump from 28 nanometer lithography machines to, like, you know, seven nanometer, like DUV, let alone EUV, is immense.

You can check out our hardware episode to learn more about that, but it's the closest that I've heard of China having an answer to ASML on the litho side, and they're coming with a whole bunch of other things as well. Again, more and more kind of integration of stuff in the Chinese supply chain. And the last story of the section, also about China: Pony.ai. Pony.ai wins the first permit for fully driverless taxi operation in China's Silicon Valley.

So they are gonna be able to operate their cars in Shenzhen's Nanshan district, a part of it. And this is quite significant because the US-based companies, Tesla and Waymo, are presumably not going to be able to provide driverless taxi operation services in China. And so that is a huge market that is very much up for grabs, and Pony.ai is one of the leaders in that space. Yeah, China is making legitimate progress on AI that should not be ignored.

One of the challenges with assessing something like this is also that you have a very, a sort of, friendly regulatory environment for this sort of thing. China wants to be able to make headlines like this, and also has a history of burying, right, fatalities associated with all kinds of accidents, from COVID to otherwise. And so it's always hard to do apples to apples here on what's happening in the West. But they do have, you know, a big data advantage.

They have a big data integration advantage, a big hardware manufacturing advantage. Wouldn't be surprising if this was for real. So there you go. Maybe an interesting kind of jockeying for position as to who's gonna be first on full driverless, right. And Pony.ai has been around for quite a while. Founded in 2016, yeah, actually in Silicon Valley.

So yeah, they've been leading the pack to some extent for some time, and it makes sense that they're perhaps getting close to this working. Onto projects and open source, and we begin with a new challenging AGI benchmark, to your point of us having to continue making new benchmarks. And this is coming from the ARC Prize Foundation. We covered ARC-AGI previously at a high level.

The idea with these ARC benchmarks is they test kind of broad, abstract ability to do reasoning and pattern matching, and in particular in a way where humans tend to be good without too much effort. So 400 people took this ARC-AGI-2 test and were able to get 60% correct answers on average. And that is outperforming AI models. And they say that

non-reasoning models like GPT-4.5, Claude 3.7, Gemini 2.0 are each scoring around 1%, with the reasoning models being able to get between 1% and 1.3%. And this is also part of a challenge, in that there is a competition to beat these tests under some conditions: operating locally without an internet connection, and I think on a single GPU, I forget, and I think just with one arm. Yeah, exactly, half the transistors have to be turned off. That's right.

Yeah. So yeah, this is an iteration on ARC-AGI. At the time, we did cover also a big story where o3 matched human performance on ARC-AGI-1, at a high computational cost. So not exactly at the same level, but still, they kind of beat the benchmark to some extent. On this one, it's only scoring 4%, using on the order of $200 of compute per task.

So it clearly is challenging, clearly taking in some of the lessons of these models beating ARC-AGI-1, and I do think a pretty, yeah, important thing or interesting thing to keep an eye on. Yeah, they are specifically introducing a new metric here, the metric of efficiency, the idea being that, you know, they don't want these models to just be able to brute force their way through the solution, which I find really interesting.
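
A tiny illustration of what "efficiency" reporting means in practice: tracking accuracy alongside cost per task rather than accuracy alone. The figures below are the ones mentioned in this discussion (o3-low at roughly 4% on ARC-AGI-2 at about $200 of compute per task); the actual ARC Prize accounting is more involved, and the task count is just a placeholder.

```python
# Illustration of efficiency-aware benchmark reporting: accuracy *and* cost
# per task, in the spirit of ARC-AGI-2. Figures are the rough ones mentioned
# in the discussion; the real ARC Prize scoring is more involved than this.
def efficiency_report(name: str, accuracy: float, cost_per_task_usd: float, num_tasks: int):
    total_cost = cost_per_task_usd * num_tasks
    print(f"{name}: {accuracy:.0%} accuracy, "
          f"${cost_per_task_usd:.0f}/task, ${total_cost:,.0f} for {num_tasks} tasks")

# num_tasks=100 is just a placeholder for the size of an eval set.
efficiency_report("o3-low on ARC-AGI-2", accuracy=0.04, cost_per_task_usd=200, num_tasks=100)
```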

There's, like, this fundamental question of, is scale alone enough? And scaling maximalists would say, well, you know, what's the point of efficiency? The cost of compute is collapsing over time. And then algorithmic efficiencies themselves: there's kind of algorithmic efficiency where conceptually you're still running the same algorithm but just finding more efficient ways to do it; it's not a conceptual revolution in terms of the cognitive mechanisms that the model is applying.

So think here of, like, you know, the move from attention to FlashAttention, for example, right? This is like an optimization, or like KV-cache-level optimizations, that just make your transformer kind of run faster, train faster, and inference cheaper. That's not what they seem to be talking about here. They seem to be talking about just, you know, how many cracks at the problem does the model need to take?

And there's an interesting fundamental question as to whether that's a meaningful thing, given that we are getting sort of these more algorithmic efficiency improvements without reinventing the wheel, and hardware is getting cheaper, and all these things are compounding.

So if you can solve the benchmark this year with a, you know, certain hardware fleet, then presumably you can do it six months from now with, like, a tenth of the hardware, a tenth of the cost. So it's kind of an interesting argument; whoever designed this benchmark is obviously on one side of it, saying, hey, in some sense the, like, elegance of the solution matters as well. It's, yeah, I think it's sort of fascinating.

To give you a sense of how performance moves on this: OpenAI's o3 model, o3-low, so the version of it that is spending less test-time compute, was apparently, well, sorry, famously, I should say, not apparently, the first to reach, like, basically close to the saturation point on ARC-AGI-1. It hit about 76% on the test. That was what got everybody talking, like, okay, well, we need a new AGI benchmark.

That model gets 4% on ARC-AGI-2 using $200 worth of computing power per task, right? So that gives you an idea that we are resetting the curves yet again. But if past performance is any indication, I think these get saturated pretty fast and we'll be having the same conversation all over again. Next story, also related to a challenging benchmark. This is a paper, Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models.

So, new math benchmark, which they are calling OlymMATH. And this has 200 problems with two difficulty levels, easy and hard, where easy is similar to AIME, an existing benchmark, and hard being, I suppose, very hard, like super advanced types of math problems that even strong humans struggle to get. They curated these problems from textbooks, apparently printed materials, and they specifically excluded online repositories and forums to avoid data contamination.

And in their experiments, we're seeing that advanced reasoning models DeepSeek R1 and o3-mini achieve only 21.2% and 30% accuracy, respectively, on the hard subset of the dataset. So, still some challenge to be solved. I guess in a few months we'll be talking about how we are getting to 90% accuracy. Yeah, we'll have the next version of OlymMATH. Yeah. They came up with a couple of, you know, pretty interesting observations. Maybe not too surprising:

apparently models are consistently better on the English versions of these problems compared to the Chinese versions, 'cause they collected both. So that's kind of cool. And they do still see quite a bit of guessing strategies, sorry, the models get to the end of the thread and they're kind of just throwing something out there, which presumably is increasing the score somewhat through, you know, false positives.

One thing I will say: first of all, like, yeah, kudos, an interesting strategy to go out into the real world and bother collecting these things that way. It does make me wonder, like, how well you could meaningfully scrub your data set of problems that you see in magazines, say, and, like, be confident that they don't exist somewhere.

Obviously there are all kinds of data cleaning strategies, including using, you know, language models and other things to, to peruse your, your data to make sure that it, it isn't referenced on the internet.

But these things aren't always foolproof, and there've been quite a few cases where people think they're doing a really good job of decontaminating and not leaving any of that material online, and models have essentially been trained on it already. So yeah, I'm kind of curious, you know, whether we'll end up discovering that part of the saturation on this benchmark is, at least at first, due to overfitting.

Yeah. And part of the challenge with knowing is we don't know the training data sets for any of these companies, right? For OpenAI, Anthropic, these are not publicly released data sets. And I would say there's a 100% chance that they have a bunch of textbooks that they bought and scanned and included in their training data. Oh, yeah. So, who knows, right? A couple more stories. Actually, another one coming out of China: we have Wan, open and advanced large-scale video generative models.

And this is coming from Alibaba. So, as the title says, this is a big model, 14 billion parameters at the largest size. And they also provide a 1.3 billion parameter model that is more efficient. Trained on a whole bunch of data, open sourced, and seemingly outperforming anything that is open source in the text-to-video space by quite a bit, both on efficiency, on speed, and on kind of appearance.

The only one that's competitive is actually Hunyuan Video, which I think we covered recently; things like Open-Sora are quite a bit below in terms of, I guess, like, appearance measurement stuff. So open source, you know, steadily getting to a point where we have good text-to-video, as we had with text-to-image. Yeah, and just, I mean, anecdotally, some of the images are pretty damn photorealistic. Or some of the, sorry, some of the stills, yeah.

I will note there's kind of this amusing thing, I don't know if this is intentional, I can't seem to see the prompt anywhere, but there is a photo on page four of this paper that looks an awful lot like Scarlett Johansson. So that's kind of, if intentional, I guess a bit of a swipe at OpenAI there, which is mildly amusing. But anyway, yeah, there you go. I mean, China, especially on the open source stuff, is serious. I mean, this is Alibaba, right?

So, you know, they've got access to scaled training budgets, but they're not even China's, like, leading lab, right? For that, you wanna look at Huawei and you wanna look at DeepSeek. But yeah, pretty impressive. Exactly, and I think it's kind of interesting to see so much open sourcing. Like, Meta is maybe the one company in the US that's doing a bunch of open sourcing still, Google doing a little bit with smaller models.

Basically the only models being released are the smaller models like Gemma and Phi. But we are getting more impressive models out of China, and there's certainly a lot of people using R1 these days because it is open source. Speaking of that, the next story is about DeepSeek V3. We have a new version as of March 24th. This is another naming convention that's kind of lame, where the model is DeepSeek V3-0324.

Just a kind of incremental update, but a significant update, because this is now the highest scoring non-reasoning model on some benchmarks, exceeding Gemini 2.0 Pro and Meta's Llama 3.3 70B. Yeah, outperforming most models, basically, while not being a reasoning model. So presumably this is an indicator. R1 was based on DeepSeek V3; V3 was the base model. V3 was also a very impressive model at the time, trained cheaply, that was a big story.

Presumably the group there is able to improve V3 quite a bit, partially because of R1, synthetic data generation, things like that. And certainly they're probably learning a lot about how to squeeze out all the performance they can. Yeah, I think this is a case where there are just so many caveats, but any analysis of something like this has to begin and end with a frank recognition that DeepSeek is for real. This is really impressive. And now I'm just gonna add a couple of buts, right?

So none of this is to take away from that top line. We've talked about in this episode quite a few times how the labs, so Gemini 2.5 is no longer just a simple base model; all the labs are moving away from that, by default not releasing new base models. And so yes, DeepSeek V3, the March 24 version, is better than all of the base models that are out there, including the proprietary ones. But labs are losing interest in the proprietary base model. So that's an important caveat.

It's not like DeepSeek is moving at full speed on just the base model and the labs are moving at full speed on just the base model, and that's kind of apples to apples. But the most recent releases of base models are still relatively recent, still, GPT-4.5; OpenAI has been sitting on it for a while as well. So it is so difficult to know how far behind this implies, you know, DeepSeek is from the frontier. This conversation will just continue.

And the real answer is known only to some of the labs, who know how long they've been sitting on certain capabilities. There's also just a question of, like, DeepSeek could have just chosen to invest more, certainly now that they have state backing, who knows, into, you know, beating this benchmark for publicity reasons as well. So none of this is to take away from this model; it is legitimately very, very impressive. By the way, all the specs are basically the same as before.

So, context window of 128,000 tokens, you know, anyway, same parameter counts and all that. But still, I think a very impressive thing, with some important caveats, not to read this right off the prompter, so to speak, in terms of assessing where China is, where DeepSeek is. Right. And pretty significant, I would say, because also DeepSeek V3 is fairly cheap to use, and you can also use it on providers like Groq, Groq with a Q.

So if it is exceeding, you know, models like Claude and OpenAI's for real applications, it could actually significantly hurt the bottom line of OpenAI and Anthropic, at least with startups and the non-enterprise customers. Yeah, for people using the base model, right. And I guess that's the bet that everybody's making, that that will not continue to be the default use case. If you're doing open source, it's much, much more interesting to be shipping base models, right?

Because then other people can apply their own RL and post-training schemes to it. So you're gonna see probably open source continue to disproportionately ship some of these base models. I wouldn't be surprised to find the full frontier of base models be dominated by open source for that reason in the years to come. But there's a question of, like, yeah, the value capture, right? Are people spending more money on base models? I don't think so.

I think they're spending more money on agentic models, which we're seeing start to dominate. And one more story. This one is about OpenAI and not about a model. We saw an announcement, or at least a post on Twitter, with Sam Altman saying that OpenAI will be adding support for the Model Context Protocol, which we discussed last week. And that is an open standard that basically defines how models can connect to tools and data sources when you use the API. You know, a bit significant,

'cause it's coming from Anthropic, and OpenAI is not introducing a competing standard; they're adopting what is now an open standard that the community got excited about. So I guess that's cool. It's nice to see some collaboration, and of course, when you have a new standard and kind of everyone jumping on board, that makes it much easier to build tools, and kind of the whole ecosystem benefits if a standard turns out to be what everyone uses and there's no, like, weird competing different ways to do things.
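
To make the shared-standard point concrete, here is a rough sketch of what MCP traffic looks like. MCP is built on JSON-RPC 2.0, with methods like "tools/list" and "tools/call"; the exact field details below follow my reading of the spec and should be treated as approximate, and the tool name is hypothetical.

```python
import json

# Rough shape of MCP messages (JSON-RPC 2.0 under the hood): a client first
# discovers a server's tools, then invokes one. Method names are from my
# reading of the spec; field details are approximate, and "search_docs" is a
# hypothetical tool exposed by some server.
list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_docs",
        "arguments": {"query": "Gemini 2.5 context window"},
    },
}

print(json.dumps(call_tool_request, indent=2))
```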

Yeah. And I think there was so much momentum behind this already that, you know, it just made sense, even at scale, for OpenAI to move in that direction. Yeah. Onto research and advancements. And as we previewed at the beginning, the big story this week is coming out of Anthropic. This came out just yesterday, so we are still kind of absorbing it and can't go into full detail, but we will give at least an overview and the implications and kind of results.

So there's a pretty good summary article, actually, that you can read that is less technical, from MIT Technology Review. The title of that article is "Anthropic Can Now Track the Bizarre Inner Workings of a Large Language Model." And this is covering two blog posts from Anthropic. One is called "Circuit Tracing: Revealing Computational Graphs in Language Models." They also have another blog post, which is "On the Biology of a Large Language Model."

Essentially an application of the approach in the first blog post to Claude 3.5 Haiku, with a lot of interesting results. So there's a lot going on here, and I'll try to give a summary of what this is presenting. We've seen work from Anthropic previously focusing on interpretability and, you know, exposing kind of the inner workings of models in a way that is usable and also kind of more intuitive.

So we saw them, for instance, using techniques to be able to see that models have some high level features, like the Golden Gate Bridge famously, and you could then tweak the activations for those features and be able to influence the model. This is essentially taking that to the next step, where you are able to see a sequence of high level features working together and coalescing into an output from an initial input set of tokens.

So they are doing this again as a sort of follow-on of the initial approach. At a high level, it's taking the idea of replacing the layers, the MLP bits of the model, with these high level features that they discover via, you know, the previous kind of approach. They have a new technique here called a cross-layer transcoder. So previously you were focusing on just one layer at a time, and you're seeing these activations in one layer. Now you're seeing these activations across multiple layers,

and you see the kind of flow between the features via this idea of a cross-layer transcoder. And there are some more details, where you start with a cross-layer transcoder, you then create something called a replacement model, and they also have a local replacement model for a specific prompt.

The idea there is you're basically trying to make it so this replacement model, which doesn't have the same weights, doesn't have the same set of nodes or computational units as the original model, has the same overall behavior, is roughly equivalent, and matches the model as closely as possible, so that you can then see the activations of the model in terms of features and can map that out to the original model sort of faithfully.
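
A very rough sketch of the cross-layer transcoder idea as described at a high level in the blog post: sparse, interpretable features read from the residual stream at one layer, and each feature's decoder writes into the MLP outputs of that layer and all later layers. This is a simplification for intuition (a plain ReLU stands in for the actual sparsity mechanism, and all sizes are made up), not Anthropic's actual architecture or training setup.

```python
import torch
import torch.nn as nn

# Rough sketch of a cross-layer transcoder: features read from the residual
# stream at one layer and their decoders write into the MLP outputs of that
# layer *and* every later layer. Simplified for intuition; not Anthropic's
# actual architecture or training setup, and sizes are illustrative.
D_MODEL, N_FEATURES, N_LAYERS = 512, 4096, 12

class CrossLayerTranscoder(nn.Module):
    def __init__(self, layer_idx: int):
        super().__init__()
        self.layer_idx = layer_idx
        self.encoder = nn.Linear(D_MODEL, N_FEATURES)
        # One decoder per downstream layer this transcoder writes into.
        self.decoders = nn.ModuleList(
            [nn.Linear(N_FEATURES, D_MODEL, bias=False) for _ in range(layer_idx, N_LAYERS)]
        )

    def forward(self, residual_at_layer: torch.Tensor):
        # Sparse feature activations (ReLU as a stand-in for the real sparsity mechanism).
        feats = torch.relu(self.encoder(residual_at_layer))
        # Predicted MLP outputs for this layer and every later layer.
        mlp_outputs = {self.layer_idx + i: dec(feats) for i, dec in enumerate(self.decoders)}
        return feats, mlp_outputs

tc = CrossLayerTranscoder(layer_idx=3)
feats, outs = tc(torch.randn(1, 10, D_MODEL))   # (batch, tokens, d_model)
print(feats.shape, sorted(outs.keys()))          # feature activations + layers 3..11
# Training would push these predicted MLP outputs to match the real model's MLP
# outputs, so the features can serve as an interpretable "replacement model".
```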

So let's just get into a couple of examples. In one they present in the blog post, in figure five, you can see they have an input of "The National Digital Analytics Group", and they are showing how each of these tokens leads to a sequence of features in this graph. So you start with "Digital", "Analytics", "Group", which map onto features corresponding to those specific words, and then there's an open parenthesis. After the parenthesis, there's a feature that's basically "say/continue an acronym".

And then at the second layer of the computational graph you have three features along the lines of "say something D", "say something A", and "say something G". There's another feature, "say DA", and that combines with "say G" to produce "say DAG". And DAG is the acronym for Digital Analytics Group. So that's showing the general flow of features. They also have, very interestingly, a breakdown of math.

They have, I think, 36 plus 59, and they're showing that there's a bunch of weird features being used here. So 36 maps onto features like "roughly 36" and "something ending in 6"; 59 maps onto "something starting with 5", "roughly 59", and "something ending in 9". Then you have features like "40 plus 50-ish" and "36 plus 60-ish", and eventually, through a combination of various features, you wind up at the output: 36 plus 59 is 95. So that's the high-level picture.
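
Just to illustrate the flavor of that addition circuit, and this is only an analogy, not the actual features: you can think of it as a fuzzy-magnitude path combined with a precise ones-digit path, rather than grade-school carrying. The "coarse_error" below just simulates the imprecision of the fuzzy path.

def toy_add(a: int, b: int, coarse_error: int = 4) -> int:
    # Analogy only: a coarse "roughly this big" path plus an exact last-digit path.
    fuzzy_sum = a + b + coarse_error          # coarse path: right to within a few, not exact
    ones = (a % 10 + b % 10) % 10             # lookup-style path: exact ones digit
    # Snap the fuzzy estimate to the nearest value with the correct last digit.
    return min(range(fuzzy_sum - 9, fuzzy_sum + 10),
               key=lambda c: (c % 10 != ones, abs(c - fuzzy_sum)))

print(toy_add(36, 59))   # 95, even though the coarse path alone said 99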

It is giving us a deeper glimpse into the inner workings of LLMs in terms of the combinations of high-level features and the circuits they are running internally. It's actually building on a paper from last year called "Transcoders Find Interpretable LLM Feature Circuits" from Yale University and Columbia University. They use a similar approach here, but of course scaled up.

So, as with previous work from Anthropic, this is to my mind some of the most impactful and most successful research on interpretability, because it really is showing, I think, at a much deeper level what's going on inside these large language models. Absolutely. And again, this is what I was caveating at the outset of today's episode: I haven't had the chance to look at this yet, and this is exactly the kind of paper that I tend to spend the most time on.

So apologies for that. I may actually come back next week with some hot takes on it. It looks like fascinating work. From what I have been able to gather, it's pretty close to a decisive repudiation of the whole stochastic parrot argument, and not that people make this argument so much anymore, but it's the argument of people like Gary Marcus who like to say, you know, oh, LLMs and autoregressive models are not that impressive. They're really just kind of predicting the next token.

They're stochastic parrots; they're basically like robots that just mindlessly put out the next word that's most likely. I think anybody following the interpretability space for the last two, three years has known that this is pretty obviously untrue, as well as people following the capability side, just with some of the things we've seen. But one example they gave: there's a question as to whether a model uses fully independent reasoning threads for different languages, right?

So if you ask "what is the opposite of small" in English and in French, will the model use language-neutral components? Or will it have a notion of smallness that's English and a notion of smallness that's French? That's maybe what you would expect on the stochastic parrot hypothesis, right? That, well, you know, it's an English sequence of words, so I'm gonna use my kind of English submodel. Turns out that's not the case, right? Turns out that,

instead, it uses language-neutral components related to smallness and opposites to come up with its answer, and only after that, only after it's sort of reasoned in latent space at the conceptual level, does it decode in a particular language. And so you have this unified reasoning space in the model that is decoupled from language, which in a way you should expect to arise, because it's kind of a more efficient way to compress things, right?

You have one domain, and essentially all the different languages that you train the thing on are a kind of regularization. You're forcing the model to reason in a way that's independent of the particular language you're choosing to reason in; the ideas are still the same. And then, yeah, there's this question around interpretability, right? This thing will confabulate. You gave that example of adding 36 and 59.

It does this weird reasoning thing where it's almost doing, if you like math, something like a, I dunno, Taylor approximation, where you kind of get the leading digit right, then the next digit, then the next digit, rather than actually doing it in a symbolic way. But then when you ask it, okay, how did you come up with that answer?

It will give you the kind of common-sense answer: I added the ones, I carried the one, I added the tens, you know, that sort of thing, which is explicitly not the true reasoning it seems to have followed, at least based on this assessment. This raises deep questions about how much you can trust things like the reasoning traces that have become so popular, that companies like DeepSeek and OpenAI have touted as their,

in some cases, chief hope for aligning superintelligent AI. It seems like those reasoning traces are already decoupled from the actual reasoning that's happening in these models. So a bit of a warning shot on that too, I think. Right? And to that point about the multilingual story, what's notable is not just the technique itself, but the second blog post, on the biology of a large language model. They applied it to Claude 3.5 Haiku and have a bunch of results.

They have the multilingual circuits, they have addition, medical diagnosis, the life of a jailbreak. They're showing you how a jailbreak works, actually, and also how refusal works. So some pretty deep insights that are pretty usable, actually, in terms of how you build your LLMs. So much to cover here. We probably will do a part two next week. And onto the next story.

We have Chain-of-Tools: Utilizing Massive Unseen Tools in the Chain-of-Thought Reasoning of Frozen Language Models. This is a new tool-learning method for LLMs that lets them efficiently use unseen tools during chain-of-thought reasoning. So you can use this to integrate unseen tools. And they actually have a new dataset as well, Simple Tool Questions, which has 1,836 tools and can be used to evaluate tool selection performance.

And tool use, by the way, is, you know, kind of calling an API. The LLM can say, okay, I need to do this web search, or I need to do this addition, whatever, and it can basically use a calculator or it can use Google, right? So pretty important to be able to do various things, and this is gonna add onto the performance of reasoning models. This is a really interesting paper.

There's the classic multi-headed hydra of things that you're trading off anytime you want to do tool use in models. So some of these techniques, like, imagine fine-tuning: if you fine-tune your models to use tools, well, you can't use your base model, right? You can't just use a frozen LLM. You're not gonna succeed at using a huge number of tools, because the more you fine-tune, the more you forget, right? There's this catastrophic forgetting problem.

And so it can be difficult to have the model simultaneously know how to use, like, over a thousand tools. And if you fine-tune, you're never gonna be able to get the model to use unseen tools, because you're fine-tuning on the specific tool set you want to teach the model to use. There are similar challenges with in-context learning, right?

So if you're doing in-context learning, you have a needle-in-a-haystack problem if you have too many tools to pick from, and the models start to sort of fail. So anyway, all kinds of challenges with existing approaches. So what's different here? What are they doing? They start with a frozen LLM; that's a really important ingredient. They wanna be able to use preexisting models without any modifications, and they are gonna train things.

They're gonna train models to help that frozen LLM that you start with do its job better, but that's not gonna involve training any of the original LLM's parameters. So they're gonna start by having a tool judge. Basically, this is a model that, when you feed a prompt to your base LLM, is gonna look at the activations, the hidden-state representation of that input at any number of layers, and it's gonna go, okay, based on this representation,

for this particular token that I'm at in the sequence, do I expect that a tool should be called? Is the next token gonna be a call to a calculator, a call to a weather app, or something like that? And so this tool judge, again, operating at the activation level, at the sort of hidden-state level, which is really interesting, is gonna be trained on a dataset that has explicit annotations of, like, you know, here are some prompts and here is where tool calls are happening.

Or, sorry, here is some text, and here, annotated, is where the tool calls are happening. But that data's really expensive to collect, so they also have synthetic data that shows the same thing. So they're using this to get the tool judge to learn, in activation space, what does and doesn't correspond to a tool call. So essentially they're training just a binary classifier here.
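
A minimal sketch of what a tool judge along these lines might look like, assuming you've already extracted per-token hidden states from the frozen LLM; the hidden size and training loop here are illustrative, not the paper's exact setup.

import torch
import torch.nn as nn

class ToolJudge(nn.Module):
    # Binary classifier over frozen-LLM hidden states: "is the next token a tool call?"
    def __init__(self, d_model: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, hidden):              # hidden: (batch, d_model)
        return torch.sigmoid(self.net(hidden)).squeeze(-1)

# Illustrative training step on (hidden_state, is_tool_call) pairs.
d_model = 4096                              # assumed hidden size of the frozen LLM
judge = ToolJudge(d_model)
opt = torch.optim.Adam(judge.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()

hidden_batch = torch.randn(32, d_model)     # stand-in for real activations
labels = torch.randint(0, 2, (32,)).float() # 1 where an annotated tool call occurs
opt.zero_grad()
loss = loss_fn(judge(hidden_batch), labels)
loss.backward()
opt.step()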

And then during inference, if the judge scores that the tool-call probability for a given token is above some threshold, then the system will go ahead and call a tool. When it does that, it does it via a separate kind of model called a tool retriever. And this tool retriever, I mean, it's not a model, it's itself a system. It uses two different models, a query encoder and a tool encoder. So this is basically RAG, right? This is retrieval-augmented generation.

You have the tool encoder represent all of your different tools, your thousand or 2,000 different tools, and then you have a way of representing, of embedding, the query, right? Which is really a modified version of the activations associated with the token that the tool judge decided was a tool call. Anyway, from here it's RAG. So if you know the RAG story, that's what they're doing here, and then they call the tool. So a couple of advantages here, right?
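
The retrieval step itself, once you have a query embedding and tool embeddings, is just nearest-neighbor search by similarity; something like this sketch, where the embeddings and the tool list are placeholders rather than the paper's trained encoders.

import torch
import torch.nn.functional as F

def retrieve_tool(query_emb, tool_embs, tool_names, top_k=1):
    # Pick the tool(s) whose embedding is most similar to the query embedding.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), tool_embs, dim=-1)
    best = sims.topk(top_k).indices.tolist()
    return [tool_names[i] for i in best]

# Placeholder embeddings; in the real system these would come from the trained
# query encoder (fed the tool-call token's activations) and the tool encoder.
tool_names = ["calculator", "web_search", "weather_api"]
tool_embs = torch.randn(len(tool_names), 512)
query_emb = torch.randn(512)
print(retrieve_tool(query_emb, tool_embs, tool_names))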

The LLM is frozen, you don't need to fine-tune, so no catastrophic forgetting issues. They are using just the hidden states, so that's fairly simple. And the tool retriever, right, this system that's deciding which tool to call, is, interestingly, trained using contrastive learning. Basically, in each training mini-batch, when you're feeding a batch of training data to the system to get it trained up, instead of comparing

one tool versus all the other tools in the dataset to figure out, like, should I use this one or another, you're just comparing batch-wise, to all the tools that are called or referenced within that batch, just to make it more tractable and computationally efficient. So anyway, if you know contrastive learning, that's how it works. If you don't, don't worry about it. It's a bit of a detail.
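
For anyone curious, here's what generic in-batch contrastive training looks like: an InfoNCE-style loss where each query's own tool is the positive and the other tools in the batch serve as negatives. This is the standard recipe, not necessarily the paper's exact objective.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_embs, tool_embs, temperature=0.05):
    # query_embs[i] should match tool_embs[i]; other rows in the batch act as negatives.
    q = F.normalize(query_embs, dim=-1)
    t = F.normalize(tool_embs, dim=-1)
    logits = q @ t.T / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0))        # the diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Toy batch of 16 query/tool embedding pairs (random stand-ins).
print(in_batch_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512)))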

But it's, I think, a really important and interesting paper, because the future of AGI has to include essentially unlimited tool use, right? That's something that I think everybody would reasonably expect, and the ability to learn how to use new tools, and this is one way to kind of shoehorn that in, potentially. And just a couple more papers. Next one, also an interpretability paper. The title is Inside-Out: Hidden Factual Knowledge in LLMs.

And it's also quite interesting. So the quick summary is they are looking to see what knowledge is encoded inside an LLM that it doesn't produce. So it may have some hidden knowledge; it knows facts, but we can't get it to tell us that it knows these facts. The way they do that is they define knowledge as whether you rank a correct answer to a question higher than an incorrect one.

So you basically know which facts the model "knows" based on whether it ranks what you think is the right continuation higher. And the comparison of external to internal knowledge works like this: externally, you can use the final token probabilities, the visible external signal; internally, you use the internal activations to get that estimate of the rankings. And there's an interesting result here: LLMs encode 40% more factual knowledge internally than they express externally.

And in fact, you can have cases where an LLM knows the answer to a question perfectly but fails to generate it even in a thousand attempts. And that's due to, you know, the sampling process, I suppose, and perhaps I need to do a deeper dive, but there could be various reasons as to why you're failing to sample it. It could be, you know, too niche and your prior is overriding it; it could be sampling techniques. But either way, another interesting finding about the internals of LLMs.
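
Here's a rough sketch of the measurement idea as I understand it, with made-up numbers: "knowing" a fact means the correct answer outscores the incorrect candidates under some scorer, and a model can satisfy that while still almost never sampling the correct answer in free generation.

import numpy as np

def knows_fact(score, correct, distractors):
    # Knowledge = the correct answer outscores every incorrect candidate.
    return all(score(correct) > score(d) for d in distractors)

# Toy distribution over a handful of candidate completions (illustrative only).
candidates = ["Paris", "Lyon", "Marseille", "I don't know"]
probs = np.array([0.0005, 0.0003, 0.0002, 0.999])  # correct answer is top-ranked among the facts
score = dict(zip(candidates, probs)).get

print(knows_fact(score, "Paris", ["Lyon", "Marseille"]))   # True: ranked above the wrong answers

# ...yet sampling rarely surfaces it, because most probability mass sits elsewhere.
rng = np.random.default_rng(0)
samples = rng.choice(candidates, size=1000, p=probs)
print((samples == "Paris").sum(), "correct answers out of 1000 samples")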

Yeah. This is the closest I've seen to a hardcore quantification of the famous aphorism that prompting can reveal the presence, but not the absence, of capabilities in language models. Right? It can reveal that a model has a capability; it can't reveal that it doesn't have the capability. And this is what you're seeing, right? It's pretty intuitive:

if you try a thousand times and you don't get an answer that you know the system is capable of delivering, that just means you haven't found the right prompt. And in general, you'll never find the right prompt for every case, right? So in general you will always underestimate the capabilities of a language model,

certainly when you just look at it in output space, in token space. This is why, increasingly, all of the safety strategies like the ones OpenAI is pitching, with just looking at reasoning traces, are looking really suspect and kind of fundamentally broken. You need representational-space interpretability techniques if you're going to make any kind of statement.

And even then, right, you have all kinds of interesting steganography issues at the level of the activations themselves. But interesting paper. I guess we'll have to move along because we're on compressed time. Oh yeah, we're doing a shorter episode today just because we got started like half an hour late. So this is why we're kind of blasting through. But I do think this is a really important and interesting paper.

Last paper. We are wrapping up with another new benchmark. This is from Sakana AI, and their benchmark is based on Sudoku; I think it's called Sudoku-Bench. And this benchmark has not just the classic Sudoku you've seen, but also a bunch of variations of Sudoku with increasingly complex rule sets as to how you can fill in numbers on the grid. Sudoku, by the way, is: you have a grid, there are some rules, and according to those rules you have to figure out which numbers go where, basically.

And so they introduce this benchmark, and because there is a progression of complexity, you see that even top-line reasoning models can crack the easier ones, but they are not able to beat the more complex sides of this. And, you know, there's a fair amount of distance to go for models to be able to beat this benchmark. Yeah, the take-home for me from this was, as much as anything, that I have no idea how Sudoku works, because apparently there are all these variants.

Like, I remember being in high school, I had friends who loved Sudoku, and it was just that thing that you mentioned, where there's a nine-by-nine grid and you have to put the numbers from one to nine in each of the components of the grid, using them only once, and all that jazz. But now apparently there are all kinds of versions of Sudoku, unlike chess and Go, which have the same rules every time.

So some versions, apparently, that they give as examples require deducing the path that a rat takes through a maze of teleporters. So, like, if the rat goes to position X, then it gets magically teleported to position Y, which could be somewhere completely uncorrelated, and that's kind of framed up in a Sudoku context. There's another one that requires moving obstacles, cars they say, into the correct locations before trying to solve. There are all kinds of weird variations on this.

And they basically design a spectrum, right, from really, really simple, like four-by-four Sudoku, all the way through, with more and more constraints and kind of sub-rules added. It seems to just generally be a very fruitful way to take a combinatorial game and procedurally generate all these different games that can be played. And then ultimately where they land is they share this kind of dataset of how these models perform. You can sort of think of this as another ARC-AGI kind of benchmark.

That's what it felt like to me. You know, it's an interesting coincidence to see this drop the same week as ARC-AGI 2. Basically all the models suck, that's the take-home. The one that sucks the least is o3-mini from January 31st, and it has a correct solve rate of 1.5% for the full-scale version of these problems. They have simplified problems as well, so you can actually kind of track progress in that direction. But anyway, I thought this was really interesting.
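
For what it's worth, the headline number is just an exact-match solve rate over puzzles, conceptually something like this; the grid format here is a placeholder, not Sakana's actual harness.

def solve_rate(model_answers, solutions):
    # Fraction of puzzles where the model's completed grid exactly matches the solution.
    solved = sum(1 for got, want in zip(model_answers, solutions) if got == want)
    return solved / len(solutions)

# Placeholder 4x4 grids, flattened to digit strings.
solutions     = ["1234341221434321", "4321123434122143"]
model_answers = ["1234341221434321", "4321123434122144"]   # second one has an error
print(f"solve rate: {solve_rate(model_answers, solutions):.1%}")   # 50.0%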

They have a collaboration with a YouTube channel called Cracking the Cryptic to put together a bunch of essentially training data, I guess, evaluation data, for these things. But yeah, this is, you know, Sakana AI, and they are the company that put together that AI Scientist paper that we covered a while back. They're back at it with this. I wanna call it an AGI benchmark, 'cause that's kind of what it feels like. Moving on to policy and safety.

First up, some US legislation. Senator Scott Wiener is introducing the bill SB 53, meant to protect AI whistleblowers and boost responsible AI development. So this would first include provisions to protect whistleblowers who alert the public about AI risks. It's also proposing to establish CalCompute, a research cluster to support AI startups and researchers with low-cost computing. This is in California in particular.

So this would presumably be protecting whistleblowers from some notable California companies and letting startups perhaps compete. Yeah, this is actually really interesting, right, because we covered extensively SB 1047, which was the bill that successfully came out of the California legislature and which Gavin Newsom vetoed over the objections of not only a whole bunch of whistleblowers in the AI community, but also Elon Musk, who actually did come out and endorse it.

Very unusual for him, you know, being a sort of libertarian-oriented guy, but he endorsed SB 1047. The original version of SB 1047 contained a lot of things, but basically three things, right? One was the whistleblower protections; that's included in SB 53. The other was CalCompute; that's included in SB 53. Which leaves us to wonder, well, what's the thing that's missing, right? What's the difference with SB 1047? And it's the liability regime.

So SB 1047 included a bunch of conditions where developers of models that cost over a hundred million dollars to develop could be on the hook for disasters if their safety practices weren't up to par. So if they developed a model and it led to a catastrophic incident, and it cost them over a hundred million dollars just to develop the model, essentially this means you've gotta be super, super resourced to be building these models.

Well, if you're super resourced and you're building a model that's like a hundred million dollars plus to train, yeah, you're on the hook for literally catastrophic outcomes that come from it. I think a lot of people looked at that and said, hey, that bar doesn't sound too low; like, that's a pretty reasonable bar to meet for these companies. But that was vetoed by Gavin Newsom.

So now essentially what they're doing is saying, okay, fine, Gavin, what if we get rid of that liability regime and we try again? That's kind of the state that we're at here. So they're working their way through the California legislature. We'll see if it ends up on Newsom's desk again, and if so, if we get, you know, yet another scrapping of the legislation. Right. I should be clear that this is a senator in the California legislature, not in the federal government.

Yes. He represents San Francisco, actually, a Democrat from San Francisco, which is kind of interesting. And yeah, the main pitch is balancing the need for safeguards with the need to accelerate, you know, in response to the objections raised to SB 1047. Next up, we have a story related to federal US policy. The title is Nvidia and Other Tech Giants Demand Trump Administration Reconsider AI Diffusion Policy, Set to Be Effective by May 15th.

So this is a policy initially introduced under the Biden administration that broadly categorizes countries into three groups based on how friendly they are to US national security interests. The first category would be friends, and they can import chips without restrictions. Second would be hostile nations, which are completely barred from acquiring US-origin AI technology. And then there are other countries, like India, which face limitations.

And of course companies like Nvidia aren't very happy about that, 'cause that would mean fewer people buying their chips. I think that's basically the story. Yeah, no surprise, there's a lot of lobbying against the AI diffusion policy. By the way, this is one that came out of the Biden administration but, interestingly, has so far not been scrapped. That's really interesting,

because, you know, so many executive orders from the Biden administration have been gotten rid of, as you would expect, as part of the Trump administration settling into their seats. So yeah, I mean, this is Nvidia kind of trying it on again, Oracle trying it on again, you know, seeing if they can loosen up those constraints. We'll, I'm sure, be talking more about this going forward. And next up, another story related to export controls.

Our favorite topic. The story is that the US has added over 50 Chinese companies to the export blacklist. This is from the Commerce Department's Bureau of Industry and Security. There are now actually 80 organizations on this entity list, with more than 50 from China, and these are companies that are allegedly acting against US national security and foreign policy interests.

Yeah. And so, for example, they're blacklisted from acquiring US items to support military modernization, for advancing quantum technology and things like AI. Yeah, this is one of the cases where I just, I don't know, when it comes to the policy side, and these are still, I think, Biden-era policies basically that are operating here. We may see this change, but for now, like, dude, come on.

So get this: two of these firms that they're adding to the blacklist were supplying sanctioned entities like Huawei and its affiliated chipmaker, HiSilicon. Okay, so HiSilicon is basically Huawei; it's kind of a division of Huawei that is Huawei's Nvidia, if you will. They do all the chip design. So then they blacklisted 27 entities for acquiring stuff to support the CCP's military modernization, and a bunch of other stuff.

When it comes to the AI stuff, like, okay. Among the organizations on the entity list, they say, were also six subsidiaries of the Chinese cloud computing firm Inspur Group. So Inspur is a giant, giant cloud company in China. They actually famously made essentially China's answer to GPT-3 back in the day; you may remember this if you were tracking, it's called Yuan 1.0, also called Source 1.0. But the fact is, like, this is China's game, right?

They keep spinning up these, like, stupid subsidiary companies and taking advantage of the fact that, yeah, we're not gonna catch them. We're playing a losing game of whack-a-mole. It's super cheap to spin up subsidiaries, and you import shit that you shouldn't until you get detected and shut down, and then you do it again. Until we move to a whitelist model rather than a blacklist model with China, this will continue to happen, right?

Like, you need to have a whitelist where by default it's a no, and then certain entities can import, and then you wanna be really, really careful about that, basically because of civil-military fusion. Any private entity in China is a PLA, People's Liberation Army, Chinese-military-affiliated entity. That's just how it works. It's different from how it works in the US; that's just, you know, a fact of life.

But until you do that whitelist strategy, you are just waiting to be made to look like a fool. People are gonna spin up new subsidiaries, and we will be doing articles and stories like this until the cows come home, unless that changes. So this is kind of one of those things where, you know, I don't know why the Biden guys didn't do this. I get that there's tons of pressure from US industry folks, 'cause it is tough.

But at a certain point, if the goal is to prevent the CCP military from acquiring this capability, we gotta be honest with ourselves that this is the solution; there is no other way to succeed at this kind of whack-a-mole game. And one more story, this one focused on safety more so than policy. Netflix's Reed Hastings gives $50 million to Bowdoin College to establish an AI program.

This would be a research initiative called AI and Humanity, with a focus on the risks and consequences of AI rather than sort of traditional computer science AI research. The college will be using these funds to hire new faculty and support existing faculty on this research focus. Fifty million is quite a bit, I would imagine, for doing this kind of thing.

Yeah, it's sort of interesting, 'cause I wasn't aware, you know, there are all these big luminaries every which way on this issue, and we hadn't heard anything from Netflix, right, from Reed Hastings. And so I guess now we know where at least that part of FAANG comes down on the equation. Yeah, it's interesting. Of course, this is a gift to Hastings' alma mater; he graduated from this college decades ago. Onto synthetic media and art. We have just a couple more stories.

First up, a judge has allowed the New York Times copyright case against OpenAI to go forward. OpenAI was requesting to dismiss the case, and so that didn't happen. The judge has narrowed the lawsuit's scope but upheld the main copyright infringement claims. The judge is also saying that they will be releasing a detailed opinion, not released yet. So pretty significant, I think, because, you know, there's a bunch of lawsuits going on.

But this is the New York Times, you know, a big boy in media publishing, which certainly has experienced lawyers on their side and is able to throw down in terms of resources with OpenAI. So the fact that this is going forward is pretty significant.

Yeah, I mean, actually, nowadays they may not have the kind of resources that they once had. Although, surprisingly, they're kind of successful, you think, 'cause they managed to move to a subscription-based online model and survive better than other media entities in recent decades. I don't know if they're as big as they used to be, but they're surprisingly successful still. Yes, I was just looking it up.

Apparently their subscription revenue is, let me see, in the quarter... okay, quarterly subscription revenue of $440 million. Jesus. Okay, that's pretty good. That's pretty good. Wow. Okay, I would not have expected that. I'll have to update my... Well, there you go.

I mean, we will get the opinion, whatever Judge Stein means when he says expeditiously, which I guess in lawyer talk, or legal talk, probably means sometime in the next decade. But there you go. Another similar story, although this time going the other way: a judge has ruled that Anthropic can continue training on copyrighted lyrics.

For now. This is part of a lawsuit from Universal Music Group, which was wanting an injunction to prevent Anthropic from using copyrighted lyrics to train the models. That means that, yeah, Anthropic can keep doing it, assuming it is doing it. And this also means the lawsuit is gonna keep going. There's still an open question as to whether it is legal for Anthropic to do it, but there's not yet a restriction prior to the actual case.

Yeah, this is very much "not a lawyer" territory. So injunctions, to my understanding, are essentially just things where the court will say ahead of time, before something would otherwise happen, they will step in and say, oh, up up up, just so you know, don't do this. And then if you violate the injunction, it's a particularly bad thing to do. So this would be the court anticipating rather than reacting to something.

So that's what the publishers are asking for, hence the statement from the judge on the case saying publishers are essentially asking the court to define the contours of a licensing market for AI training where the threshold question of fair use remains unsettled, and the court declines to award publishers the extraordinary relief of a preliminary injunction based on legal rights.

So basically: we're not gonna step in and kind of anticipate where this market is going for you and just say, hey, you can't use this, based on legal rights that have not yet been established. So essentially it's for another court to decide what the actual legal rights are, and we're not in a position, until that happens, to grant injunctions on the basis of what is not settled law.

So once we have settled law, yeah, if it says that you're not allowed to do this, then sure, we may grant an injunction saying, oh, Anthropic, don't do that. But for right now, there's no law on the books and we don't really have precedent here, so I'm not gonna give you an injunction. That's at least my read on this. Again, lawyers listening may be able to just smack me in the face and set me right, but kind of interesting. Yeah, sounds right to me.

So, well, with that we are done with this episode of Last Week in AI. Lots going on this week, and hopefully we did cover it. And as we said, we'll probably cover some more of these details next week, just because there's a lot to unpack. But for now, thank you for listening through apparently this entire episode. We would appreciate your comments, reviews, sharing the podcast, et cetera, but more than anything, we appreciate you tuning in, so please do keep tuning in.
