#207 - GPT 4.1, Gemini 2.5 Flash, Ironwood, Claude Max - podcast episode cover

#207 - GPT 4.1, Gemini 2.5 Flash, Ironwood, Claude Max

Apr 18, 2025 · 2 hr 43 min · Ep. 247

Episode description

Our 207th episode with a summary and discussion of last week's big AI news! Recorded on 04/14/2025

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Check out our text newsletter and comment on the podcast at https://lastweekin.ai/.

Join our Discord here! https://discord.gg/nTyezGSKwP

In this episode:

  • OpenAI introduces GPT-4.1 with optimized coding and instruction-following capabilities, featuring variants like GPT-4.1 Mini and Nano, and a million-token context window.
  • Concerns arise as OpenAI reduces resources for safety testing, sparking internal and external criticisms.
  • xAI's newly launched API for Grok 3 showcases significant capabilities comparable to other leading models.
  • Meta faces allegations of aiding China in AI development for business advantages, with potential compliance issues and public scrutiny looming.

Timestamps + Links:

Transcript

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, we'll be talking about the major news of last week, and you can go to the episode description to get all those articles and links to every story we discuss, and the timestamps as well. I'm one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at a generative AI startup. I'm your other host, Jeremie Harris.

I'm with Gladstone AI, an AI national security company. Yeah, I guess that's it. That's the story. Yeah. I like how you put this description, AI national security company. Yeah. It's an AI national... basically, like, we work with partners in the US government and private companies on dealing with national security risks that come from increasingly advanced AI, up to and including, like, superintelligence, but also sort of AGI, advanced AI. The whole gamut.

That's kind of our area. Yeah. Yeah. I just like that phrase, AI national security company. You'd think there'd be a lot of AI national security companies, but I imagine it's a pretty small space. Yeah, it's actually kinda weird. I guess on the national security side you could say Palantir is, in a way; they're more about, you know, the application level. What can we build today?

I would say that companies like OpenAI and Anthropic and, like, Google DeepMind should be thinking of themselves as AI national security companies. Mm-hmm. Just like, to the extent that you're building like fucking super intelligence and shit, you think that's on the roadmap? Like, yep, you're in the national security business, baby. So I guess it's a short way of summarizing what otherwise could go on for some time. Just like, hey, I mean, what Gladstone does is

more than a one-liner, although it's maybe a cleaner one-liner. Maybe. Maybe. This week we've got a slightly calmer week than we've been seeing, I think, for a while. Some sort of medium-size news, nothing too crazy. But as we'll be starting out, I think GPT-4.1 will be one of our stories that's gonna be pretty exciting. Some other kind of incremental news developments, applications in business, some stories related to startups and, really, OpenAI competitors, projects in open source.

We got, as always, more benchmarks coming out as people try to continually evaluate these AI agents and how successful they are. Research and advancements: we are gonna be talking about yet more test-time reasoning stories, and how to get those models aligned and better at reasoning without talking forever. And in policy and safety, some more stories about OpenAI policies and the drama going on with all the lawsuits and whatnot. And okay, that's an evergreen comment though, isn't it?

Like, we could have that in every episode. There's always a bit more to say. So yeah. Yeah. That's just how it is with OpenAI. And let's just go ahead and dive straight into tools and apps. We are starting with OpenAI's announcement of GPT-4.1. This is their new family of AI models, including also GPT-4.1 Mini and GPT-4.1 Nano, and these models are, as per the title, all optimized apparently for coding and instruction following.

They are now available through the API, but not through ChatGPT. And they have a 1 million token context window, which is what you would get with, I believe, Claude Opus and also Gemini. The big models, I believe, all have 1 million as input. That's, you know, a very large amount of words in a code base. So I think it's an interesting development for OpenAI to have this model, this kind of focus, with the most recent, I guess, sequel to GPT.

Kind of reminds me of what Anthropic has done, particularly with Claude Code. People are getting all about vibe coding, having agents build software; seems a little bit aligned with that. Yeah, it does. It's really all about kinda moving in this direction of cheaper models that actually can solve real-world software engineering tasks. And that's why in the eval suite, you tend to see them focus on SWE-bench scores. Right.

Which, you know, in fairness, this is more SWE-bench Verified, which is OpenAI's version of SWE-bench, which we've talked about before. But anyways, software engineering benchmark, it's meant to test real-world coding ability. It does really well, especially given the cost associated with it. You're looking at between, you know, 52 and 54.6 percent — a bit of a range there because, anyway, there's some solutions to SWE-bench Verified problems that they couldn't run on their infrastructure.

So they kind of have this range of scores. Comparable too. I mean, it's all about this Pareto frontier, right? Like, you get to choose your own adventure as to how accurate and performant your model's gonna be versus how cheap it's gonna be. And this is giving you a set of kind of on-the-cheaper-side but more performant options, especially when you get on the Nano end of things.

The model also has a whole bunch of other multimodal abilities, including the ability to reason over video, or kind of analyze video. It comes with a more recent knowledge cutoff too, which just, you know, intrinsically is a value add. So you don't really need to do much other than provide more up-to-date training to add some value to a model. Up to June 2024, by the way, is that cutoff.

So, you know, kind of cool if you're worried about software libraries that are a little bit more recent, for example, that might be a helpful thing. But also obviously it has tool use capabilities baked in now as all these coding models do. So yep, pretty, pretty cheap model. Pretty frustrating for anybody who's trying to keep up with the nomenclature on which, which index are we at now?

I thought we were at 4o, but then I thought we were gonna switch and just have the o series. So no more base models. But then 4.5 comes out — that's the last base model, okay, we're done there. But no, no, no, let's go back and do 4.1. So confused right now. Exactly. Yeah. This is a prequel to 4.5, I guess, that they just decided to release. And I assume we're not going with "o" because this is not an omni model.

I assume it only processes text, per its focus on coding. It does apparently have... so they say that it has some video capabilities. Right, right. Like, to understand content in videos. So yeah, I did not fully understand that point from the blog post. How multimodal do you have to be? Yeah. It's like, how multimodal do you have to be before you call it an omni model is the next question. Right.

Well, on your note of improving on benchmarks, looking at the blog, it actually is a pretty impressive boost for GPT-4.1. Mm-hmm. Compared to GPT-4o on SWE-bench Verified, GPT-4o gets 33%, GPT-4.1 gets 55%, and that's higher by a little bit than OpenAI o3-mini on high and OpenAI o1 on high compute. So pretty impressive for a non-high-compute, non-test-time-reasoning model, I guess, to be even better than some of these

more expensive and typically slower models. Much better than GPT-4.5 as well, interestingly. So I will say it's a lot of internal comparisons. So they're showing you how it stacks up against other OpenAI models, which, you know — even like when Claude 3.7 Sonnet came out, its range is like 62 to 70% on SWE-bench Verified. So, you know, this is quite a bit worse than Claude 3.7 Sonnet, but that's where the accuracy-cost trade-off happens, right?

Yep. And the next story also has to do with OpenAI. This one, though, is about ChatGPT and some new features there. And particularly the memory feature in ChatGPT that basically just stores things in the background as you chat — that's getting an upgrade. Apparently ChatGPT can now reference all of your past conversations, and that will supposedly be much more prominent. Actually, this was funny.

A coworker posted, and it was like, whoa, it referenced this thing from recent interactions, and they didn't even know memory was a thing on ChatGPT. So I imagine this might also be tweaking the UX to make it maybe more clear that this is happening. This does tweak the UI as well. So you can still use saved memories, where you can manually ask it to remember, and you can have ChatGPT reference chat history, where it will, I guess, use that as context for your future interactions.

Yeah, it's really exciting. As part of the announcement, they're also letting us know that ChatGPT can now remember all the ways in which you have wronged it. And where you sleep and eat, who your loved ones are, your alarm code, and what you had for dinner last night. So really exciting to look forward to those interactions with the totally not creepy model. Yeah. No, this is actually true though. It is a cool step in the direction of these more personalized experiences, right?

Like, you need that persistent memory because otherwise it does feel like this sort of episodic interaction. All kinds of psychological issues, I think, are gonna crop up once we do that. Obviously, like the world of Her, which is quite explicitly what Sam Altman has been pushing towards, especially recently. You know, I don't know how people are gonna deal with that long term, but in any case, as if to deal with objections of that shape, they do say:

you know, as always, you're in control of ChatGPT's memory. You can opt out of referencing past chats or memory altogether at any time in your settings. Apparently if you've already opted out of memory, they'll automatically opt you out of referencing your past chats by default. So that's useful. And apparently they're rolling it out today to Plus and Pro users, except in certain regions.

Like a lot of the EU-type regions, including Liechtenstein, because, you know — it's the first time I've seen that giant market cut out. I know. Yeah. I guess very stringent regulations over in Liechtenstein. Yeah, interestingly, rolling out first to the Pro tier, the, like, crazy $200-per-month tier, which seems to be increasingly kind of the first way to use new features. And they say it will be available soon for the $20 Plus subscribers. And on to the lightning round.

A few more stories. Next up we got Google, and they also have a new model. This one is Gemini 2.5 Flash. So they released Gemini 2.5 Pro, I think, not too long ago, and people were kind of blown away. This was a very impressive release from Google, and kind of really the first time Gemini sort of was seemingly leading the pack, and a lot of people were saying, oh, I'm switching from Claude to Gemini with 2.5, it's better.

And so this was kind of an exciting announcement for that reason. Now we've got the smaller, faster version of Gemini 2.5 Pro. Yeah, and, I mean, it's designed to be cheaper again. It's all part of the same push, right? So typically what seems to happen is model developers will come up with a big, kind of pre-trained model, and once you finish doing that, you're kind of in the business of mining that model in different ways.

So you're gonna create a whole bunch of distillates of that model, you know, to make these cheaper, kind of lightweight versions that are better from a per-token price-efficiency standpoint. So that's what happens, right? The big thing gets done; that may or may not be released, 'cause sometimes it's also just too expensive to inference. That's what a lot of people have suspected happened with Claude 3 Opus, for example, right?

It's just too big to be useful, but it can be useful for kind of serving as a teacher model to distill smaller models. Anyway, that's more of the same here. Boy, is this field getting interesting though, as you say. I mean, I remember when OpenAI was the runaway favorite. I'm really curious what the implications are for fundraising for OpenAI.

Is it just that they haven't released their latest models to kind of, like, you know, demonstrate that they're still ahead of the pack? All kinds of questions as well around the acceleration of their safety review process that we'll get into as well — that ties into this.

But right now — like, I'm really gonna be interested to see if it's even possible for OpenAI. I don't know that they'll be able to raise, frankly, another round without IPO-ing, if only because they've already raised $40 billion and they're kind of close to the end of the source of funds. But there you go. Yeah, I think it's an interesting time, for sure. For a while it seemed like OpenAI was by far ahead of everyone, right?

Even for years before this became a sort of consumer, very business-based space, OpenAI kind of got a head start, so to speak, with GPT-3. They were the first ones to recognize LLMs and really create LLMs. And yeah, for a while they had, you know, the first impressive text-to-image models, the first impressive text-to-video. They had speech-to-text as well with Whisper.

But yeah, in recent times, it's increasingly harder to point to areas where OpenAI is leading the pack or, like, significantly differentiated from Anthropic or Google or other providers of similar offerings. And speaking of which, next up we've got a story about xAI. They're launching an API for Grok 3. So Grok 3 recently launched — I think we covered it maybe a month ago. Very impressive, a similarly competitive model in the same ranks as ChatGPT and Claude at the time.

You could play around with it, but you could not use it as a software developer as part of your product or whatever, 'cause you needed an API for that. Well, now it is available, and you can pay to use it at three dollars per million input tokens and $15 per million output tokens, with Grok 3 Mini costing significantly less.

Yeah. So they also have the option to go with a faster version — like, I guess, a version where my read on this is it's sort of the same performance, but lower latency. So instead of three bucks per million tokens of input, it's five bucks per million tokens, and then instead of 15 bucks per million output tokens, it's 25. So they kind of have this — that's for the full Grok 3, and they have a similar thing going on with Grok 3 Mini. But kind of interesting, right?

Like, if you wanna get, I guess, maybe ahead in line from a latency standpoint, they're introducing that option. So it's another way to kind of segment the market. So that's kind of cool. We are seeing price points that are a little bit on the high end — I mean, comparing sort of similarly to, like, Claude 3.7 Sonnet, but also considerably more expensive than the Gemini 2.5 Pro that we talked about earlier, that came out, I guess, a couple weeks ago. But still, it's impressive.
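To make those price points concrete, here's a rough cost comparison for a single hypothetical request; the request size is an assumption on our part, and the per-million-token rates are just the ones quoted above.

```python
# Rough cost comparison of the Grok 3 price tiers quoted above, for one
# hypothetical request (the request size here is an arbitrary assumption).
input_tokens, output_tokens = 50_000, 2_000

def request_cost(input_rate_per_m, output_rate_per_m):
    """Cost in dollars given per-million-token rates."""
    return (input_tokens / 1e6) * input_rate_per_m + (output_tokens / 1e6) * output_rate_per_m

print(f"standard tier: ${request_cost(3, 15):.3f}")   # $3/M input, $15/M output -> $0.180
print(f"faster tier:   ${request_cost(5, 25):.3f}")   # $5/M input, $25/M output -> $0.300
```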

It's xAI again, kind of coming outta nowhere, right? I mean, this is pretty remarkable. There has been some talk about the context window. So initially, I think the announcement was that there was supposed to be a 1 million token context window — I think that was announced back in February. It seems like the API only lets you get up to about 131,000 tokens. So where does that delta come from? I mean, it may well come from the serving infrastructure, right?

So the base model may actually be able to handle the full 1 million tokens. But they're only able to serve it up to 130,000 for right now, in which case, you know, you might expect that to increase pretty soon. But anyway yeah, really really interesting and another of these entries, right, in the kind of frontier models that all look kind of the same.

Not a coincidence by the way, because everybody's getting comparable allocation from Nvidia, comparable allocation from TSMC, like, it all kind of comes from the same place. And so unless you have 10 times more chips, like, don't expect to have 10 times the, the scale or, or a significant, significant leap in capability, at least at this point.

I think everyone has scraped the internet and got largely similar datasets, and I think the secrets of the trade are probably less secret than they used to be. It seems like with Grok, for instance, you know, they got into it a year ago, and it had become slightly clearer how to train large language models by that point, in part because of Llama, in part because of open efforts, things like that. Well, and Jimmy Ba, also like a founding engineer, was also, like, you know, Google.

Yeah. And they had, like, yeah, very experienced people who've already done this. So, yeah, I think one of the interesting things here is, like, there is a lot of secret sauce that isn't shared, but it's adding up to the same thing. I just find that really interesting from an almost, like, meta, zoomed-out perspective.

It's like you have this human ant colony and it, the ant colonies may, may have different shapes or whatever, but fundamentally the, the economics that they're constrained by that, or the almost laws of physics and engineering are, are pretty similar. And until we see a, a paradigm shift that's big enough to give you like a 10 x lift that, and there's no response from, from other companies, we're, you know, we're gonna be in this, in this intermediate space.

Don't expect that to persist too long, by the way, in the age of inference, because there, I think, little advantages can compound really quickly. But anyway, that's maybe a conversation for a later time. Next up, we have a story not related to a chatbot. It's Canva, which is basically a tool suite for design, I think, and various kinds of applications related also to PowerPoint. They've announced their Visual Suite 2.0, which has a bunch of AI built into it.

So they have Canva Code, which is a tool with generative AI coding, and that lets you generate widgets and websites with text. So kind of built-in vibe coding, I guess. And they also have a new AI chatbot, and that lets you use their generative AI tools — like editing photos, resizing, generating content — all through this chatbot interface. It's increasingly the case that, I guess, people are building their AI into their product suite in cleaner ways, better ways.

It seems like we are getting to a point where some of this stuff is starting to mature and people are iterating on the UX and trying to really kind of make AI part of the tooling in a more natural way. Yeah, it's, it's one of the most interesting sort of design stories I think that we've seen in like, actually in decades. I mean, this is a, a pretty fundamental shift. Think about the shift from, you know, web 1.0 to Web 2.0. This is, this is again, a kind of similar leap, right?

Where all of a sudden it's a whole new way of interacting with computers and the internet. And so, you know, designers are probably having a field day. So yeah, I'm sure we're gonna see a lot more of this stuff. Obviously, we're only, like, two, three years into this process. But I will say, it's also kind of funny that you opened the story saying, hey guys, exciting, 'cause this is a story that's not about chatbots — and there's a chatbot in the freaking thing.

Just shows you where we are. Yeah. Yeah, that's a good point. And one last story. This one is related to Meta, and also a chatbot — well, at least a model. This is the Maverick model from Llama 4. We covered Llama 4, I believe, in the last episode, and covered how it was met with a lot of, let's say, skepticism, and people calling them out for seemingly having good benchmark numbers but not actually being impressive in practice.

Well, this is an update on part of that. The Llama 4 that seemed to be doing really well on LM Arena, where people rank different models — turned out this was a special variant of Llama 4 optimized for LM Arena, and the vanilla version is way worse. It kind of matches what seems to be the case for Llama 4 in general: it's underwhelming.

So just a sort of reaffirming of the fact that they pretty much gamed the benchmark, and it was, yeah, a pretty nonsensical, pretty clear stunt that they should not have pulled, I think, with Llama 4. Yeah, I mean, this tells you a lot — it can't help but tell you a lot about the state of AI at Meta, right? Like, there are a couple things that companies can do that are pretty, like, undeniable indications of actual capability or the direction they're going in.

You know, companies often have to advertise roles that they're gonna hire for. So, you know, they're forced to kind of telegraph to the world something about what they think about the future by doing that. And then there are things like this, where it's, you know, very clearly a stunt, and like a pretty gimmicky one at that. Look, the reality is this is Goodhart's law in part, right?

So Goodhart's Law is: if you pick a target for optimization — in this case the LMSYS leaderboard — and you push too hard in that direction, you're gonna end up sacrificing overall performance. There are gonna be unintended side effects of that optimization process. You can't be the best at everything all the time, at least until we hit the singularity. And this is a reflection of the fact that, yeah, Meta made the call to actually optimize for marketing more than other companies.

I think, you know, other companies just would not have made this move. That being said, I think the real update here is: any excitement you had about Llama 4 — like, any variant of Llama 4's performance on LMSYS — basically just ditch that, and you're basically in the right spot. So what they're doing in this article is they're basically saying, like, oh, look at how embarrassing Llama 4 Maverick is on a wider range of benchmarks.

It's even scoring below GPT-4o, which is, like, a year old. So that's truly awful. That may be true, but it's also not like — this is the version that was fine-tuned for LM Arena. Like, I wouldn't even think of that as an interesting benchmark. It's like, you fine-tune a model to be really good at, I dunno, biological data analysis, and then you complain that it's not good at math anymore. And that kind of just makes sense.

We know that's already true. But anyway, all of which is to say this is a fake result — or the original LM Arena result is basically fake. As long as you delete that, purge that from your memory buffers, you're thinking about Llama 4 the right way. It's a pretty disappointing launch. The update here is about Meta itself, I guess, and just, like, something to think about, because we've heard about some of these high-profile departures too from the Meta team, right?

Like, they're forced to do a clean sweep. Yann LeCun is trying to do damage control and go out and say, like, oh, this is, like, a new beginning. And this is — I mean, dude, open source was supposed to be the one place where they could compete. Like, we've known that Meta can't generate truly frontier models for a long time, but they were at least hoping to be able to compete with China on open source. And now that doesn't seem to be happening.

So there's a big question, which is like, okay, what is the point, guys? I mean, we're spending billions on this; there's gotta be some ROI, right? Just to dive into a bit more detail: the one that we got the initial results on, that ranked very, very well, was this Llama 4 Maverick Experimental, which was optimized for conversationality. And that's LM Arena — you have people talking to various chatbots and inputting their preference.

So it seemed like it was pretty directly optimized for that kind of benchmark, for LM Arena. And I believe they also did say that it was partially optimized for that specific benchmark. And as you said, the vanilla version, the kind of general-purpose one, is, I mean, not horrible, but ranking pretty low compared to a bunch of models that are pretty old.

I think 32nd place right now, compared to a whole bunch of other models — below DeepSeek, below Claude 3.5, Gemini 1.5 Pro, things like that. On to applications and business. First story relates to Google and a new TPU. So this is their seventh-gen TPU, announced at Google Cloud Next '25; it's called Ironwood. And they're saying that this is the first TPU designed specifically for inference, in the age of inference.

I think people pointed out that TPUs initially were also for inference, so this is maybe a little bit not accurate. But anyway, they, as you might expect, have a whole bunch of stats on this guy. You know, crazy numbers, like that it can scale up to 9,216 liquid-cooled chips in a pod. Anyway, I'm gonna let you take over the details,

'cause I assume there's a lot to say on whatever they announced with regards to what people are also building for GPU clusters and, generally, the hardware options for serving AI. Yeah, no, for sure. And I actually hadn't noticed that "first Google TPU for the age of inference" thing. I like that kind of pseudo-hypey thing. I wish that the first email I'd sent after o1 dropped, I'd, like, formally titled it, you know, "my first email in the age of inference."

That would've been really cool. Missed opportunity. But yeah, essentially, as you say, a TPU — it is optimized for thinking models, right? For these inference-heavy models that use a lot of test-time compute. So, you know, LLMs, MoEs, but specifically doing the inference workloads that you have to run when you're doing RL post-training or whatever. So it's in that lane, but it certainly is a broader tool than that. It is giant.

Geez. When we talk about all these chips linked together, we have to put in a bit of context. So I think the best comparable to this is maybe the B200 GPU, and specifically maybe the GB200 NVL72 configuration. So, essentially — and we talked about this a little bit in the hardware episode — the B200 is one part of a system called the GB200. GB200s come in ratios of two GPUs per one CPU, and you'll have these racks with, like, 72 GPUs in them.

And those 72 GPUs, they're all connected really, really tightly by these NVLink connectors, right? So this is extremely high-bandwidth interconnect. And so the question here is — so Google has essentially, like, groups of, like, 9,000 of these TPUs in one, what they'll call one pod. And they are connected together, but they're not connected through interconnect with the same bandwidth as the NVL72.

And so with the NVL72 you have kind of, like, smaller pods, if you will, but the connection bandwidth within them is much higher. And so these Google systems are, like, a lot larger but a bit slower at that level of abstraction, at the kind of full interconnect domain level. So doing a side-by-side is kind of tricky, because what it means to have, like, 72 chips or 9,000, I should say, sort of varies a little bit. But the specs are super impressive on a FLOPS basis.

So the Ironwood hits 4.6 petaflops — that's per chip — and the B200 is gonna hit about 4.5 petaflops per chip. So very, very comparable there. Not a huge surprise because, you know, both have great design and both are relying on similar nodes at TSMC. There's a whole bunch of cool stuff on the memory capacity side. So these chips, the TPU v7s, are actually equipped with 192 gigabytes of HBM3 memory.

That's a really, really significant amount of, like, these stacks of DRAM, basically the HBM stacks — about double what a typical B200 die will have, or have feeding into it, I should say. And that's especially helpful when you're looking at really large models that you wanna have on the device, like MoEs. So you might be able to fit, like, a full expert, say a really big one, on one of these HBM stacks.

So that's a pretty cool feature. There are all kinds of details that get into, like, how much coherent memory do you specifically have, how the memory architecture is unified. We don't have to dive into too much detail, but the bottom line is this is a really impressive system. The 9,000 or so TPUs in one pod — that comes with a 10-megawatt footprint on the power side. So that's like 10,000 homes' worth of power just in one pod. Pretty, pretty wild.

There is a lightweight variant with, I think it was, like, about 200 chips in a pod as well, for sort of more lightweight kinds of setups, which I guess they would probably do at inference-like data centers they've set up for inference closer to the edge, or where the customer will be. But yeah, more power efficient too, by the way: 1.1 kilowatts per chip compared to more like 1.6 kilowatts for the Blackwell. That's becoming more and more important.
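As a quick sanity check on those power numbers — treating the quoted per-chip figure as the all-in number, which is an assumption, since it likely excludes cooling and networking overhead:

```python
# Back-of-the-envelope check on the pod-level power figure mentioned above.
chips_per_pod = 9216           # full Ironwood pod size quoted earlier
watts_per_chip = 1100          # ~1.1 kW per chip (treated as all-in here, which is an assumption)
pod_power_megawatts = chips_per_pod * watts_per_chip / 1e6
print(f"{pod_power_megawatts:.1f} MW per pod")   # ~10.1 MW, consistent with the ~10 MW footprint
```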

The more power efficient you can make these things, the more compute you can actually squeeze out of them. And power is increasingly kind of that rate-limiting factor. So this is a big launch. My notes are a bit of a mess on this 'cause there are so many rabbit holes we could go into — maybe worth doing at some point, like a hardware update episode — but might leave it there for now. Yeah, this announcement kind of made me reflect.

Seems like one of the questions with regards to Google is they are offering very competitive pricing for Gemini 2.5, kind of undercutting the competition pretty significantly. Yeah, that could be, you know, at a loss just so that they can gain more market share. But I imagine having TPUs and having, you know, a very advanced cloud architecture and ability to run AI at scale makes it more feasible for them to offer things at a lower price.

And in the blog post for this announcement, they actually compare to TPU v2. TPU v2 was back from 2017, and this iteration of TPUs has 3,600 times the performance of TPU v2. Right. So, like, almost a 4,000x multiplier, and way more than a TPU v5 as well. And as you said, the efficiency comparison — they're saying that you get 29.3 times the FLOPS per watt compared to TPU v2. So, you know, way more compute power, way less energy use for the equivalent compute power.

Just shows you how far they've come in these years. And, you know, it does seem like there's quite a significant jump in terms of both FLOPS per watt and peak performance compared to Trillium and v5. So, another reason, I guess, to think that they might be leveraging this to be more competitive. People typically don't train their own models on the cloud; they are running models, and so it sort of allows them to, yeah, really support customers using their models relatively cheaply.

Yeah. And the interconnect is a really big part of this too, right? So there is this move in the industry to kind of move away from, at least, the NVIDIA InfiniBand interconnect fabric. That is kind of — I don't wanna say, like, industry standard, but, you know, anything by NVIDIA is definitely gonna have some momentum going for it. So Google actually invented this thing called ICI, Inter-Chip Interconnect, which is an unhelpfully vague and general term.

But ICI is essentially their replacement for that, and that's a big part of what's allowing them to hit really, really high bandwidth on the backend network. So when we say backend, we mean kind of connecting different pods — connecting, essentially, parts of the compute infrastructure that are relatively far away. And that's important, right? When you're doing giant training runs, for example, at large scale, you are gonna do that a lot.

Interconnect bandwidth is also important for inference workloads, for a variety of reasons. So is, also, just, like, HBM capacity, which they've again dialed up — I mean, this is like double what you see, at least with the H100. And on to the next story. We are gonna talk about Anthropic. They have announced a $200-per-month Claude subscription called Max. So that's pretty much the story. You're gonna get higher rate limits. There's a hundred-dollar-per-month option — that's the lower tier.

You're gonna get five times the rate limits compared to Claude Pro with the $20 subscription. And for the $200-per-month option, you're getting 20 times higher rate limits. I think an interesting development. We had OpenAI releasing their Pro tier, I think, a few months ago now — it's pretty fresh. And now Anthropic also coming with a $200-a-month tier, I think.

Partially a little bit of an expected development, in the sense that if you are a power user, you're almost definitely costing Anthropic and OpenAI more than you're being charged at $20 per month. It's pretty easy to rack up more cost if you just, you know, are doing a lot of processing of documents, of chats. And so, you know, it's a kind of unprecedented thing to have $200-per-month tools, at least in the kind of productivity space.

Adobe, of course, and a number of tools like that easily charge this kind of very significant amount. Anyway, yeah, that's what I came to think: it might be a trend that we'll be seeing more of — AI companies introducing these pretty high-ceiling subscription tiers. A hundred percent. And, I mean, I'm actually a Claude power user for sure, so this is definitely for me. I mean, the number of times I run out — it's so frustrating.

Or has been where you are using Claude, you're in the middle of a problem and it's like, oh, this is your last query. Like, you have to wait another, it's usually like eight hours or something before you get more ability to query. that's really frustrating. So awesome that they're doing this. I think, I'm trying to remember how much I'm paying for it. I, I think it's 20 bucks a month or so.

So the 100 bucks per month for five times the amount of usage — all they're doing, at least if my math is right here, is allowing you to proportionately increase. The 200 bucks a month for 20 times the amount — okay, that's, I guess, a 50%-off deal at that scale or something like that. But still, these are really useful things.
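Writing that back-of-the-envelope out — with the $20 Pro price treated as the baseline unit of usage, which is the assumption here:

```python
# Price per "Pro-equivalent" unit of usage for the Max tiers discussed above,
# assuming Claude Pro ($20/month) as the baseline unit.
pro_price = 20
tiers = {"Max 5x": (100, 5), "Max 20x": (200, 20)}   # (price per month, multiple of Pro limits)

for name, (price, multiple) in tiers.items():
    per_unit = price / multiple
    print(f"{name}: ${per_unit:.0f} per Pro-equivalent unit (Pro itself is ${pro_price})")
# Max 5x works out to $20 per unit, i.e. exactly proportional to Pro;
# Max 20x works out to $10 per unit, i.e. roughly half price at that scale.
```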

I mean, the number of times I have thought to myself, man, I would definitely pay, like, a hundred bucks a month to not have this problem right now, is quite high. So my guess is they're gonna unlock quite a bit of demand with this. It suggests maybe that they've solved something on the compute availability side, 'cause they didn't offer this before, despite knowing that this was an issue — and I'm sure that they've known this was an issue.

So yeah, I mean, they may have just had some compute come online. That's at least one explanation. And a few more stories related to OpenAI. First up, we've got, I guess, a new competitor to OpenAI that's slowly emerging. It's Safe Superintelligence, the AI startup led by OpenAI co-founder Ilya Sutskever — one of the chief kind of minds of research going back to the beginning of OpenAI, and in 2023 famously involved in the ouster of Sam Altman, briefly, before Sam Altman returned.

Then Ilya Sutskever left in 2024 to launch this, I guess, play for AGI, and now we're getting the news that they are raising 2 billion in funding and the company is being valued at 32 billion. So this is apparently also on top of a previous 1 billion raised by the company. And I think it's impressive that, in this day and age, we are still seeing startups with prominent figures getting billions of dollars to build AI.

It doesn't seem like there is saturation of investors willing to throw billions at people who might compete at the frontier. Hard to saturate demand for superintelligence, or at least speculation. Yeah, pretty wild. The other kind of update here is with Alphabet jumping in.

We are, I think, learning for the first time — at least, I wasn't aware of this — that Safe Superintelligence is accessing or using TPUs provided by Google as their predominant kind of source of compute for this. So we've already seen Anthropic partnering with, obviously, Google as well, but Amazon, to use Trainium chips, and Inferentia as well, I believe, but certainly Trainium.

And so now you're in a situation where, you know, SSI — like, Google's trying to say, hey, come optimize for our architecture. And that's not a small thing, by the way. Like, I know it might sound like, okay, you know, which pool of compute do we optimize for? Like, do we go with the TPUs, the NVIDIA GPUs, or do we go with, you know, Amazon's stuff? But the choices you make around this matter, extremely.

There's a lot of lock-in, like vendor lock-in, that you get: you're gonna heavily optimize your workloads for a specific chip. Often the chip will co-evolve with your needs, depending on how close the partnership is — that's certainly what's happening with Amazon and Anthropic. And so for Safe Superintelligence to throw in their lot with Google in this way does imply a pretty intimate and deep level of partnership. Though we don't know the exact terms of the investment.

So maybe, presumably, just because they are using TPUs, there's something going on here with compute credits that Alphabet is, I would guess, offering to Safe Superintelligence as at least part of their investment, in much the same way that Microsoft did with OpenAI back in the day. But something we'll presumably learn more about later.

It's a very interesting positioning for Google now, kind of sitting in the middle of a lot of these labs, including Anthropic and Safe Superintelligence. And the next story is also related to a startup from a former high-ranking OpenAI person. This one is about Mira Murati's Thinking Machines, which has just added two prominent ex-OpenAI advisors, Bob McGrew and Alec Radford, who were both formerly researchers at OpenAI.

So another one, yeah, quite related or similar to Safe Superintelligence, in that not a lot has been said as to what they are working on, really — as to much of anything. But they are seemingly raising over a hundred million and are recruiting, you know, the top talent you can get, essentially. I mean, I don't know how Mira has done this. I don't know the details. I mean, she was very well respected at OpenAI; I do know that.

And John Schulman — she's recruited him, obviously, we talked about that — he's their chief scientist. Barret Zoph, who used to lead model post-training at OpenAI, is the CTO now. So, like, it's a pretty stacked deck. And if you add as an advisor Alec Radford, that is wild. Like, to see Alec's departure from OpenAI, even though he had been there for, like, a decade or whatever it was — as a reminder, right?

Like, he is the GPT guy. He did a bunch of other stuff too, but he was, you know, one of the lead authors. Yeah, he was one of the lead authors of the papers on GPTs, as you said. Exactly.

Yeah. And just kind of known to be — you know, people talk about the 10x software engineer or whatever; like, he lived like, what, a 1,000x, you know, AI researcher, to the point where people were using him as the metric for when we'll automate AI research. Like, I think it was Dwarkesh Patel on his podcast: when are we gonna get, you know, 10,000 automated Alec Radfords or whatever? That was kind of his bar. So yeah, truly, like, exceptional researcher.

And so it was a big deal when he said, like, hey, I'm leaving OpenAI. He is still — as I recall, he was leaving the door open for collaboration with OpenAI as part of this kind of third-party entity he's formed. So presumably he's got crossover relationships between these organizations, and presumably those relationships involve support on the research side. So he may be one of very few people who have direct visibility, in real time, into multiple frontier AI research programs.

God, I hope that guy has good cybersecurity, physical security, and other security around him, 'cause wouldn't that be an interesting target. Next up, we got a story not related to chatbots, but to humanoid robots. The story is that Hugging Face is actually buying a startup that builds humanoid robots. This is Pollen Robotics. They have a humanoid robot called Reachy 2. And apparently Hugging Face is planning to sell it and open it up for developer improvements.

So kind of an interesting development. Hugging Face is sort of a GitHub for AI models: they host AI models and they have a lot to do with open source. So this is building on top of a previous collaboration where Hugging Face released LeRobot, an open-source robot, and also released a whole software package for doing robotics. You know, building on top of that. And yeah, I don't know — an interesting thing for Hugging Face to do, I would say.

Yeah, I saw this headline and my first reaction was like, what the fuck? But when you think about it, it can make sense, right? So the classic play is: we're gonna be the app store for this hardware platform. And that's really what's going on here. Presumably, you know, they think that humanoid robotics is gonna be something like the next iPhone. And so essentially this is a commoditize-your-complement play.

You have the humanoid robot, and now you're gonna have an open-source sort of suite of software that increases the value of that humanoid robot over time — for free, at least for you as the company. So Hugging Face is really well positioned to do that, right? I mean, they are the GitHub for AI models. There's no other competitor really like them. So the default place you go when you wanna do some, you know, AI open-source stuff is Hugging Face. It kind of makes sense.

Remains to be seen how good the platform will be. Like, Pollen Robotics — I'm not gonna lie, I hadn't heard of them before, but they are out there and they are acquired. So, I mean, it'll be interesting to see what they can actually do with that platform and how quickly they can bring products online. And last story for the section: Stargate developer Crusoe apparently could spend $3.5 billion on a Texas data center.

This is on the AI startup Crusoe, and the detail is, apparently, not only are they gonna be spending this amount of money, they're gonna be doing that mostly tax-free, where they are getting an 85% tax break on this billions-of-dollars project. So, I guess, a development on Stargate, and just showing the magnitude of business going on here.

Yeah, the criterion for qualifying for the tax break is for them to spend at least 2.4 billion out of a planned $3.5 billion investment, which, I mean, I don't think is gonna be a problem for 'em, looking at how this is all priced out. They've since registered two more data center buildings with a state agency, so we know that's coming. We don't know who the tenants are going to be for one of those buildings, but

Oracle is known, of course, to be listed for the other. So, important context, maybe, if you're new to the data center sort of space or universe. What's happening here is you've essentially got a company that's gonna build the physical data center — that is Crusoe. But there are no GPUs in the data center. They need to find what's sometimes known as a hydration partner, or, like, a tenant — someone to fill it with GPUs. And that's gonna be Oracle in this case.

So now you've got Crusoe building the building, you've got Oracle filling it with GPUs, and then you've got the actual user of those GPUs, which is gonna be OpenAI, because this is the Stargate project. And on top of that, there are funders who can come in. So Blue Owl is a private credit company that's lending a lot of money; JP Morgan is as well.

So you've got — this is, you know, it can be a little dizzying, but you have Blue Owl and JP Morgan funding Crusoe to build data centers that are going to be hydrated by Oracle and served to OpenAI. That is the whole setup. So when you see headlines where it's like, wait, I thought this was an OpenAI data center or whatever — that's really what's going on here. There's all kinds of, like,

discussion around, well, look, this build looks like it's gonna create like 300 to 400 new full-time jobs with about $60,000 minimum salaries — that, at least, is part of the threshold for these tax breaks. And people are complaining that, hey, that doesn't actually seem like it's that much to justify the enormity of the tax breaks that are gonna be offered here. I just think I would offer up that the employment side is not actually the main value add here, though.

Like, this first and foremost should be viewed as a national security investment, much more than a, you know, jobs and economic investment — or as much as an economic investment, I should say. But that's only true as long as these data centers are also secured, right? Which at this point, frankly, I don't believe they are. But the bottom line is it's a really big build, there's a lot of tax breaks coming, and a lot of partners are involved.

And in the future, if you hear, you know, Blue Owl and JP Morgan and Crusoe and all the rest of it, this is the reason why. Moving on to projects and open source. We start with a paper and a benchmark from OpenAI called BrowseComp. And this is a benchmark designed to evaluate the ability of agents to browse the web and retrieve complex information. So it has 1,266 fact-seeking tasks, where the agent — the model equipped to do web browsing — is tasked with finding some information and retrieving it.

And apparently it's pretty hard. Just base models — GPT-4o, not built for this kind of task — are pretty terrible: it gets 1.9% on this, 0.6% if it's not allowed to browse at all. And Deep Research, their model that is optimized for this kind of thing, is able to get 51.5% accuracy. So a little bit of, you know, room to improve on, I guess, finding information by browsing. Yeah, and this is a really carefully scoped benchmark, right?

So we often see benchmarks that combine a bunch of different things together — you know, thinking about, like, SWE-bench Verified, for example. Yes, it's a coding benchmark, but, depending on how you approach it, you could do web search to support you in generating your answers, you could use a lot of inference-time compute. So what capabilities you're actually measuring there are a bit ambiguous.

And so in this case, what they're trying to do is explicitly get rid of other kinds of skills. So essentially what this is doing is, yeah, avoiding problems like generating long answers or resolving ambiguity — that's not part of what's being tested here. Just focusing instead on: can you persistently follow, like, an online research trajectory and be creative in finding information? That's it, right?

Like, the skills that you're applying when you're Googling something complex, that's what they're testing here, and they're trying to separate that from everything else. They give a couple examples. Here's one: please identify the fictional character who occasionally breaks the fourth wall with the audience, has a backstory involving help from selfless ascetics, is known for his humor, and had a TV show that aired between the 1960s and 1980s with fewer than 50 episodes, right?

So this is, like — you would have to google the shit out of this to figure it out. And that's the point of it. They set it up explicitly so that current models are not able to solve these questions. That was one of the three core criteria that they used to determine what would be included in this benchmark. The other two were that trainers were supposed to try to find the answer with simple Google searches — just, like, five attempts, basically.

And if the answer was not on any of the first pages of search results, they're like, great, let's include that. It's gotta be hard enough that it's not, you know, trivially solvable. they also wanted to make sure that it's like harder than a, a 10 minute task for a human, basically. So the trainers who built this dataset made sure that it took them at least 10 minutes or more to, to solve the problem. So yeah, pretty interesting benchmark.

Again, very narrowly scoped, but in a way that I think is pretty conducive to pinning down one important dimension of AI capabilities. And they do show scaling curves for inference-time compute. No surprise there: more inference-time compute leads to better performance. Who knew? Right. And as you said, narrowly scoped and meant to be very challenging. They also have some data on the trainers of the system — who presumably rated the answers of the AI models and

were also kind of tasked with doing the benchmark themselves. And on 70% of the problems, humans gave up after two hours, where, like, you just couldn't finish the task. And then they have a little distribution over the tasks that they could solve. The majority took about two hours. You got some, like a couple dozen, maybe a hundred, taking less than an hour; the majority take over an hour; and on the high end there's just one data point at four hours.

So yeah, you have to be a pretty capable web browser, it seems, to be able to answer these questions. The next story is related to ByteDance. They're announcing their own reasoning model, Seed-Thinking-v1.5, and they are saying that this is competitive with all the other recent reasoning models — competitive with DeepSeek R1. They released a bit of technical information about it. They say that this is optimized via RL in a scheme similar to DeepSeek R1. And it is, I guess, fairly sizable.

It has 200 billion parameters total, but it is also a mixture-of-experts model, so it's only using 20 billion parameters at a time. And they haven't said whether this will be released openly or not — really, they've just kind of announced the existence of the model. Yeah, the stats look pretty good. It seems like another legit entry in the canon.

I think we're, right now, we're waiting for labs to come out with ways to, to scale their inference, time compute strategies such that we see them use their full fleet fully efficiently. Once we do that, we're gonna get a good sense of like where the US and China stack rank relative to each other. But I think we're, we're just kind of along that scaling trajectory right now.

We haven't quite seen the full scale brought to bear that either side can. One little interesting note too: this is considerably — I mean, it's about twice as activated-parameter dense as DeepSeek V3 or R1. So with V3 and R1, you see 37 billion activated parameters per token out of about 670 billion, so about one in 20 parameters are activated for each token; here it's about one in 10.
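Checking those ratios quickly, using the parameter counts as stated in the discussion:

```python
# Activated-parameter density, using the counts quoted above.
deepseek_active, deepseek_total = 37e9, 670e9   # DeepSeek V3 / R1 (as quoted)
seed_active, seed_total = 20e9, 200e9           # Seed-Thinking-v1.5 (as quoted)

print(f"DeepSeek V3/R1:     {deepseek_active / deepseek_total:.3f}")  # ~0.055, roughly 1 in 18-20
print(f"Seed-Thinking-v1.5: {seed_active / seed_total:.3f}")          # 0.100, exactly 1 in 10
```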

So you're seeing, in a way, a more dense model, which is kind of interesting. All of this is sort of building on the results from V3 and R1. So always interesting to see what the architecture choices are. I guess we'll get more information on that later, but that's an initial picture. So they actually ended up coming up, apparently, with a new version of the AIME benchmark as well as part of this. So AIME is that kind of math olympiad problem set

that has been somewhat problematic, for data leakage reasons, for other reasons as well. So they kind of came up with a curated version of that specifically for this, and they call that BeyondAIME. Anyway, on that benchmark they show their model outperforms DeepSeek R1 — it outperforms DeepSeek R1 basically everywhere except for SWE-bench. So that's definitely impressive.

I'm actually kind of surprised. Like, I would've thought SWE-bench would've been one of those places where you could, especially with more compute — which I presume they have available now — I would've imagined that that specifically would translate well into SWE-bench, because those are the kinds of problems that you can RL the crap out of, you know, these coding problems. So anyway, yeah, kind of interesting.

The benchmarks clearly show, like, it's not as good as Gemini 2.5 Pro or o3-mini high, but it definitely is closing the gap. I mean, on ARC-AGI, by the way — and I find this fascinating and I don't have an explanation for it until we have more technical data about the paper itself — it outdoes, like, not just R1, but Gemini 2.5 Pro and o3-mini high, supposedly, on ARC-AGI. That's kind of interesting. That's a big deal.

But it could always be an artifact of some weird, like, over-optimization, 'cause again, on all the other benchmarks that they share here, it's — not quite far behind, but it is somewhat far behind Gemini 2.5 Pro, for example. So, anyway, kind of an interesting note, and we'll presumably learn more as time goes on, right? They also released a 10-page technical report,

a pretty decent amount of information, which is refreshing compared to things like, you know, o1 or o3. Something I was not aware of: ByteDance had the most popular chatbot app as of last year. It's called Doubao. And recently Alibaba kind of overtook them with an app called Quark.

So yeah, I wasn't aware that ByteDance was such a big player in the AI chatbot space over in China, but it makes sense that they're able to compete pretty decently in the space of developing frontier models. Next up, moving to research and advancements. The first paper is titled "Sample, Don't Search: Rethinking Test-Time Alignment for Language Models."

This is introducing QAlign, which is a new test-time alignment method for language models that makes it possible to align better without needing to do additional training and without needing to access the specific activations or the logits.

You can just sample the outputs — just the text that the model spits out — and you are able to get it more aligned, meaning more kind of reliably following what you want it to do, by just scaling up compute at test time, without being able to access weights and do any sort of training. I found this a really fascinating paper, and it teaches you something quite interesting about what's wrong with current kinds of fine-tuning and sampling approaches.

So, funnily enough, the optimal way to make predictions is known, right? We actually know what the answer is — so, like, build AGI, great, we can all go home, right? No. This is Bayes' theorem, right? The Bayesian way of making predictions, making inferences, is mathematically optimal. At least if you believe all the great textbooks, like The Logic of Science, you know, E.T. Jaynes-type stuff, right?

So the challenge is that the actual Bayesian update rule — which takes prior information, like prior probabilities, and then essentially accounts for evidence that you collect to get your posterior probability — is not being followed in the current hacky, janky way that we do inference on LLMs.

And so the true thing that you wanna do is take the probability of generating some output based on your language model — like, just the probability of a given completion given your prompt — and you kind of wanna multiply that by an exponential factor that, in the exponent, scales with the reward function that you want to update your outputs according to.

So if, for example, you wanna assign really high rewards to a particular kind of output, then what you should wanna do is take the tendencies of your initial model and multiply them by the reward weighting, essentially e to the power of the reward, something like that. And by combining those two together, you get the optimal Bayesian output, very roughly. There's a normalization coefficient, anyway, it doesn't matter.
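Written out, the target distribution being described is roughly the following (our notation, not necessarily the paper's exact symbols):

\[
\pi^{*}(y \mid x) \;\propto\; \pi_{\text{base}}(y \mid x)\,\exp\!\big(\beta\, r(x, y)\big)
\]

where \(\pi_{\text{base}}\) is the base language model, \(r(x, y)\) is the reward for completion \(y\) given prompt \(x\), \(\beta\) sets how strongly the reward re-weights the base model, and the normalization constant is the part that "doesn't matter" for sampling purposes.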

But you have those two factors. You should be accounting for your base model's initial proclivities, because it's learned stuff that you, for Bayesian reasons, ought to be accounting for. But what they say is that typical search-based methods, like best-of-N, fundamentally ignore the probability assignments of the base model. They focus exclusively on the reward function. You basically generate a whole bunch of things according to the base model.

You generate a whole bunch of different potential outputs, and from that point on, all you do is go, okay, which one of these gives me the best or highest reward, right? You do something like that, and that causes you, from that point on, to throw away everything your base model actually knows about the problem. And what they're observing mathematically is that that is just a bad idea. And so they're gonna ask the question: can we sample from our base model

in a way that, yes, absolutely accounts for the reward function that we're after, but also accounts for what our initial language model already knows? For mathematical reasons, the one approach that ticks this box, that does converge on this kind of Bayesian optimal approach, looks something like this. So you start with a complete response: get your initial LLM to generate your output, right?

So maybe something like, the answer is 42 because of calculation X, right? You give it a math problem and it says the answer is 42 because of calculation X. Then you're gonna randomly select a position in that response, so for example, the third token, right? You have like, "the answer is," and you're gonna keep the response up to that point, but then you're gonna generate a new completion from that point on, just using the base language model.

So here you're actually using your model again to generate something else, usually with high-temperature sampling so that the answer is fairly variable, and that gives you a full candidate response, an alternative, right? So maybe now you get the answer is 15 based on some different calculation, and they have a selection rule for calculating the probability with which you accept either answer. And it accounts for the reward function piece.

So which of those alternate answers is scored higher or lower by the reward? This is a way of basically injecting your LLM into that decision loop and accounting for what it already knows. It's not so much detailed as nuanced; you almost need to see it written out. But the core concept is simple.

During sampling, you wanna use your LLM; you don't wanna just set it aside and focus exclusively on what the reward function says, because that can lead to some pretty pathological things, like just over-optimizing for the reward metric, and that ends up leading to reward hacking and other things.
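To make that loop concrete, here's a minimal sketch of the resample-a-suffix idea. The `generate` and `reward` functions, the function name, and the `beta`/`steps` parameters are hypothetical stand-ins of our own, not the paper's API, and the real QAlign/QUEST acceptance rule includes correction terms omitted here; treat this as a simplified illustration rather than the exact algorithm.

```python
import math
import random

def qalign_style_sampling(prompt, generate, reward, beta=1.0, steps=64):
    """Simplified sketch of test-time alignment via suffix-resampling MCMC.
    `generate(prompt, prefix, temperature)` returns a token list completing the
    prefix with the base LLM; `reward(prompt, tokens)` scores a full response.
    Both are hypothetical stand-ins, not a real library API."""
    # Start from one complete response sampled from the base model.
    current = generate(prompt, prefix=[], temperature=1.0)
    for _ in range(steps):
        # Pick a random cut point, keep the prefix, and resample the rest from
        # the base model (a higher temperature keeps proposals varied).
        cut = random.randrange(len(current))
        proposal = current[:cut] + generate(prompt, prefix=current[:cut], temperature=1.0)
        # Accept based on the reward difference: because every proposal comes
        # from the base model, its knowledge stays in the loop while the chain
        # drifts toward higher-reward completions.
        accept = min(1.0, math.exp((reward(prompt, proposal) - reward(prompt, current)) / beta))
        if random.random() < accept:
            current = proposal
    return current
```

The contrast with best-of-N is that acceptance is probabilistic rather than a pure reward argmax, so the base model's distribution keeps shaping the final sample instead of being thrown away after the initial generations.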

So from a Bayesian standpoint, just a much, much more robust way of doing this, and they demonstrate that indeed this leads to better inference-time scaling on math benchmarks like GSM8K. So I thought it was a pretty interesting paper from a very fundamental standpoint, giving us some insights into what's wrong with current sampling techniques. Right. Yeah. And they base this method on, or build on top of, a pretty recent work from last year called QUEST.

The title is Quality-Aware Metropolis-Hastings Sampling for Machine Translation, which is just to say that it's a slightly more theoretical or mathy, algorithmic type of contribution, built on lots of equations. If you look at the paper, it's gonna take you a while to get through it if you're not deep in the space. But it does go to show that there's still room for algorithmic stuff, for research beyond just big model

good, you know, lots of weights make for smart model. The next paper is called Concise Reasoning via Reinforcement Learning. So one sort of phenomenon we've discussed since the rise of reasoning models, first with o1, then with DeepSeek R1, is that the models tend to do better when you do additional computation at test time, when you do test-time scaling. It also seems that we are kind of not at the point where it's at all optimized.

Often it seems the models output more than is necessary. And so this paper is looking into how to optimize the amount of output from a model while still getting the correct answer. And the basic idea is to add a second stage in the training of a model. So after you train it on being able to solve problems with reasoning, same as you did with R1, they suggest having a second phase of training where you enforce conciseness while maintaining or enhancing the accuracy.

And they show that you're able to actually do that, more or less. Yeah. This is another, I think, really interesting conceptual paper. So the motivation for it comes from this observation of a couple of contradictory things, right? So first off, test-time, inference-time scaling is a thing: it seems like the more inference-time compute we pour into a model, the better it performs.

So that seems to suggest, okay, more tokens generated seems to mean higher accuracy. But if you actually look at a specific model, quite often the times when it uses the most tokens are when it gets stuck in a rut. It'll get locked into these, I'm trying to remember the term that they use here, but like, dead ends, right? Where it's in a state from which reaching a correct solution is improbable, right?

So you talk yourself, you paint yourself into a corner type thing. So they construct this really interesting theoretical argument that seems pretty robust. They demonstrate that if getting the right answer is gonna be really, really hard for your model, and you set your reward time horizon for your model to be fairly short, so essentially the model does not look ahead very far, it's focused on the near term in RL terms...

So in RL terms, it has a discounting parameter less than one. In that case, what you find is that the model almost wants to put off or delay getting that negative reward. If it's a really hard problem, it will tend to just write more text and write more text and kind of procrastinate, really. Yeah, this is one of the fun details of the algorithm itself.

The reinforcement learning loss favors longer outputs: a model is encouraged to keep talking and talking, especially when it is unable to solve a task. If it's able to solve a task quickly, it gets the positive reward and it's happy. If it isn't able to solve a task, it'll just, you know, keep going and going, right? Yeah, exactly. And that's it.

So the sign kind of flips, if you will, the moment that the reward is anticipated to be positive, or let's say the model actually has a tractable problem in front of it. And so you have this funny situation where solvable problems create an incentive for more concise responses, because in a way the model is going like, oh yeah, yeah, I can taste that reward, I wanna get it, you know?

Whereas it's like, if you know you're gonna get slapped once you finish your marathon, well, you're gonna move pretty slowly. But if you know you're gonna get a nice slice of cake, maybe you run the marathon faster. That's kind of what's going on here. Not to overdo this too much, but that is something that is almost embarrassing, right? 'Cause it drops out of the math. It's not even an empirical finding.

It's just like, hey guys, did you realize that, without meaning to, you were incentivizing your models, explicitly through the math here, to do this thing that is deeply counterproductive? And so when they fix that, all of a sudden they're able to dramatically decrease the response length relative to the performance that they see.
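As a back-of-the-envelope illustration of that sign flip, here's a toy calculation with a single terminal reward discounted once per generated token. The specific numbers, the 0.99 discount, and this single-terminal-reward setup are our own simplification, not the paper's exact formulation.

```python
def discounted_return(terminal_reward, num_tokens, gamma=0.99):
    """Return when the only reward arrives after the final token,
    discounted once per generated token."""
    return (gamma ** num_tokens) * terminal_reward

# Hard problem (expected terminal reward is negative): dragging the answer out
# shrinks the discounted penalty, so rambling looks "better" to the objective.
print(discounted_return(-1.0, 100))    # ~ -0.37
print(discounted_return(-1.0, 1000))   # ~ -0.00004

# Solvable problem (positive terminal reward): finishing sooner preserves more
# of the reward, so conciseness is what gets reinforced.
print(discounted_return(+1.0, 100))    # ~ 0.37
print(discounted_return(+1.0, 1000))   # ~ 0.00004
```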

And they show some really interesting scaling curves, including one that shows an inverse correlation between the length of a response and the improvement in the quality of the response, which is sort of interesting. So yeah, I thought this was a really interesting paper. I mean, it makes you think of the conciseness of a model as really a property of a given model that can vary from one model to another, and a property that's determined in part by the training data.

This is where this idea of that secondary stage of training becomes really important. They have an initial step of RL training that's just, you know, the general, whatever, DeepSeek R1 / o1 / o3 type reasoning stuff. But then you include a training step after that that explicitly contains solvable problems, to kind of polish off your model and make sure that the last thing it's trained on is problems that it wants to solve concisely.

And so, by the math, those are gonna be problems that are actually tractable. And there you go. So I thought it was a really fascinating and sort of embarrassingly simple observation about the incentives that we're putting in front of these RL systems. Yeah. And the technique also is very successful: for the bigger variant they test, a 7 billion parameter R1-style model, you can get a 40% reduction in response length and maintain or improve on accuracy.

And that's without them presumably having the computational budget to do this optimally. You can presumably do even better, like optimize further to spit out fewer tokens while still getting the right answer. So a very practical, useful set of results here. A few more stories. First, we have Going Beyond Open Data: Increasing Transparency and Trust in Language Models with OLMoTrace. So the idea is pretty interesting.

You're able to look at what in the training data of a model influenced it to produce a certain output. In particular, it allows you to identify spans of a model's output that appear verbatim in the training data. This is supporting the OLMo models, which we talked about, I dunno, a little while ago. Yeah, these are basically the most open models you can get on the market.

And so you can use it against those models and their pretty large training dataset of billions of documents, trillions of tokens. It seems like a software advance, but it's a systems advance, really. The core of it is, you can imagine, if you wanted to figure out, okay, my LLM just generated some output, what is the text in my training corpus that is most similar to this output, or that contains long sequences of words that most closely match this output?

That's a really computationally daunting task, right? Because now, for every language model output that you've produced, you gotta go to your entire fucking training set and be like, okay, are these tokens there? Are those tokens there? You know, how much overlap can I find on a kind of perfect matching basis? And what they're doing is actually trying to solve that problem, and they do it pretty well and efficiently.

So you can see why this is really an engineering challenge as much as anything. At the core of this is the notion of a suffix array. It's a data structure that stores all the suffixes of a text corpus in alphabetically sorted order, right? So if you have, you know, the word banana, the suffixes are banana, anana, nana, ana, na, a. It's kind of like you're breaking the word down that way, and then you sort those in alphabetical order.

So you have a principled way of sorting, of segmenting, the different chunks that you could look for in your output, right? With your output, you're like, oh man, which chunks of this text do I see perfect overlap with in the training set? And so if you have a small training corpus, like "the cat sat on the mat," and an LLM output like "the cat sat on a bench,"

what you're trying to do is set up suffix arrays that cover all the different chunkings of that text, and then you wanna cross-reference those together. And by setting them up in a principled way, with alphabetical ordering and the suffix arrays, you're able to use binary search. So anyway, if you know what binary search is, then you know why this is exciting.

It's a very, very efficient way of searching through an ordered list, right? And you can only do it if your data's in the right format, which is what they're doing here. But once you do that, now you have a really efficient way of conducting your search, and so they're able to do that across the training corpus, like do a binary search across the training corpus.
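Here's a toy sketch of that suffix-array-plus-binary-search idea in Python. The naive construction, the tiny corpus, and the function names are ours for illustration only; OLMoTrace itself operates over trillions of tokens with far more efficient machinery.

```python
import bisect  # binary search over a sorted sequence (key= needs Python 3.10+)

def build_suffix_array(corpus_tokens):
    """Naive suffix array: the start offsets of every suffix of the corpus,
    sorted so the suffixes themselves are in lexicographic order."""
    return sorted(range(len(corpus_tokens)), key=lambda i: corpus_tokens[i:])

def find_verbatim_matches(query_tokens, corpus_tokens, suffix_array):
    """Binary-search the sorted suffixes for ones that begin with the query,
    i.e. positions where a span of model output appears verbatim in the data."""
    prefix_of = lambda i: corpus_tokens[i:i + len(query_tokens)]
    lo = bisect.bisect_left(suffix_array, query_tokens, key=prefix_of)
    hi = bisect.bisect_right(suffix_array, query_tokens, key=prefix_of)
    return [suffix_array[i] for i in range(lo, hi)]

corpus = "the cat sat on the mat".split()
sa = build_suffix_array(corpus)
print(find_verbatim_matches("the cat sat".split(), corpus, sa))  # [0] -> match at token 0
print(find_verbatim_matches("on a bench".split(), corpus, sa))   # []  -> no verbatim match
```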

Then, on the other side, in terms of the language model output, they're able to massively parallelize the search process to handle many, many outputs all at the same time, which again amortizes the cost significantly. So overall, just much better scaling properties for the search function. And it leads to some pretty interesting and impressive outputs. Again, imagine:

you see the output that your language model provides, and you're just like, all right, well, what's the piece of text in the training corpus that overlaps word for word most closely with different sections of this output? This is especially exciting if you're concerned about data leakage, for example, right?

You want to know, well, did my language model answer this question correctly because it's basically just parroting something that was in the training set, or does it actually understand the content in some deeper way? It's not a full solution to that, because the model could just be paraphrasing, in which case this technique wouldn't pick it up. But it's a really interesting start, and it's part of the answer to, are language models just stochastic parrots, right?

If you're able to rule out that there is any text in the training data that exactly matches what it put out. Right. And I guess I should correct myself a little bit: they aren't claiming that the matches are necessarily the cause of the output. They're not computing some sort of influence function. They really are providing a way to efficiently search over the massive corpus to be able to do fact-checking.

And they have a fun example in a blog where, for some question, the model OLMo claimed that its knowledge cutoff was August of 2023, which is untrue; the actual cutoff was in 2022. So then they looked at the output and traced it to some document from somewhere, about an open-source variant of OLMo, a blog post or something like that, which got swept up in the training dataset and made the model do this kind of silly thing.

So presumably also quite useful if you are a model developer or a model user, to be able to fact-check and see noise in your training dataset that offers potential explanations for false outputs. Next, we've got a story from Epoch AI, one of our favorite sources of stats and just interesting metrics on AI. This one is independent evaluations of Grok 3 and Grok 3 mini on their benchmarks, and the short version is Grok 3 and Grok 3 mini are really good. They are up there with Claude 3.7

Sonnet and o3-mini; Grok 3 mini, even with a low amount of reasoning, is comparable to the higher reasoning levels on some of these benchmarks. So just reinforcing, I guess, the general impression that we got with Grok, that it's quite good. Yes. Very, very well said. It is quite good. Yeah, it is actually pretty shocking, at least on AIME. I mean, Grok 3 mini on high reasoning mode beats o3-mini on high reasoning mode.

It is literally number one in that category. That's pretty remarkable. Again, I hasten to remind people, Grok and xAI came out of nowhere. What are they, like two years old now? This is crazy. It's supposed to take you longer than that. But they're also more middle of the pack on others; for example, on FrontierMath it's just out of the top three, so it's number four. This is a really, really solid model across the board.

There's just no two ways about it. There was some debate about how OpenAI and xAI were characterizing scores on various agentic benchmarks, just in terms of how they were sampling and whether apples-to-apples comparisons were actually happening there. This, by the way, is I suspect a big part of the reason why Epoch decided to step in and frame this as, as they put it, independent evaluations of Grok 3 and Grok 3 mini, just because of all the controversy there.

So they're basically coming in and saying, nope, it is in fact a really impressive model. I mean, everybody's claiming to have the best reasoning model; I give up on assigning one clear best. It depends what you care about. And honestly, the variation in prompting is probably gonna be just as big as the variation from model to model at the true frontier for reasoning capabilities.

Just try them and see what works best for your use case, I think, is the clear winner in this instance. And moving on to policy and safety, starting once again with the OpenAI lawfare drama. OpenAI is countersuing Elon Musk. So they have filed counterclaims in response to the ongoing legal challenges from Elon Musk and xAI that are trying to constrain OpenAI from going for-profit. And they are saying they basically want to stop Elon Musk from further unlawful and unfair action.

They point to Musk's actions, including a takeover bid that we covered, where he offered, what, $97 billion to buy OpenAI's nonprofit. And basically OpenAI is saying, here's a bunch of stuff that Elon Musk is doing, please stop him from doing this sort of stuff. It's sort of funny, their characterization of the "fake bid." Now, we can't know what happened behind closed doors, if there were comms, if there weren't comms, of whatever nature.

But certainly from the outside, I'm confused about what would make it fake. Was the money he was offering not real? Was it Monopoly money? He came in and offered ostensibly more money than what OpenAI was willing to pay for its own nonprofit subsidiary, or for-profit subsidiary, or whatever. It seemed pretty genuine. And so it's odd, and they would nominally have a fiduciary obligation to actually consider that deal seriously.

So it's unclear to me what the claim is with legal grounding. The suit is fascinating, or the original Elon suit is fascinating, by the way. We covered this back in the day, but just to remind people.

So Elon sued OpenAI, of course. The nonprofit currently has control over the for-profit's activities, and OpenAI essentially wanted to buy out the nonprofit and say, hey, we'll give you a whole bunch of money in exchange for you giving away all your control, effectively, and you'll be able to go off and do cute charitable donation stuff. And there are people arguing, well, wait a minute.

The nonprofit was set up explicitly to keep the for-profit in check, because they correctly reasoned that for-profit incentives would cause racing behavior and potentially irresponsible development practices on the security and the control side. So you can't just replace that function with money; OpenAI itself does not institutionally believe that money would compensate for that.

They believe they're building superintelligence, and control of superintelligence is worth way more than, like, $40 billion, whatever they'd be paying for it. And so this is the claim. Anyway, the judge on this case seems to view that argument quite favorably, by the way: that you can't just swap out the role of the nonprofit for a bunch of money, and that OpenAI's public commitments, among other things,

do commit it to having some sort of function in there; at least, those claims are plausibly backed and would plausibly do well in court. The main question is whether Elon has standing to represent that argument. The question is, did OpenAI enter into a contractual relationship with Elon through email? Because that's really the closest thing they have to a contractual agreement about the nonprofit remaining in control and all that stuff.

And that seems much more ambiguous. So Elon right now is in this awkward position where he has what seems like a pretty solid case, that's what the judge is telegraphing here, but he may not actually be the right person, he may not have the right to represent that case. The Attorney General might.

So there's speculation about whether the judge in this case is flagging the strength of the case to get the attention of the Attorney General, so the Attorney General can come in and lead the charge here. But everything is so politicized too. Elon is associated with the Republican side; California's Attorney General is gonna be a Democrat. So it's all a big mess. And now you have OpenAI kind of countersuing, potentially partly for the marketing value, at the very least.

But we're just gonna have to see. There seems to be a case here, or at the very least an interesting case to be made. We saw the judge dismiss Elon's motion to kind of quickly rule in his favor, let's say, and block the for-profit transition. I would be surprised if this initial move, this countersuit, would go through.

I mean, I imagine there'd be a pretty high standard that OpenAI would have to meet to show that these lawsuits are frivolous, and that'd be tough given that you now have a judge coming out and saying, well, the case itself seems pretty strong, it's 50-50 whether Elon's the right guy to represent it. So, anyway, it's a mess. Yeah, it's a real mess. I dunno how technical the term "countersuing" is, by the way; I guess it's in the document itself that they filed.

They have a bunch of counterclaims to the already ongoing case. And yeah, it makes for pretty fun reading. Just to find this one quote, early in the document, and this is a 60-page document, they say: Musk could not tolerate seeing such success from an enterprise he had abandoned and declared doomed. He made it his project to take down OpenAI and to build a direct competitor that would seize the technological lead, not for humanity, but for Elon Musk.

And it says the ensuing campaign has been relentless, through press attacks, blah blah blah; Musk has tried every tool available to harm OpenAI. So, very much a continuation of what we've seen OpenAI doing via blog, calling Musk out about his emails. They also posted on X with the same kind of rhetoric, saying Elon's never been about the mission, he's always had his own agenda; he tried to seize control of OpenAI and merge it with Tesla as a for-profit.

His own emails prove it. Yeah. OpenAI is definitely at least trying to go on the attack, if nothing else. Yeah, it's funny. It's very kind of off-brand, or I guess it's now their new brand, but it used to be off-brand for them to do this sort of thing. They had a very kind of above-the-fray vibe to them. Sam Altman was sort of this untouchable character, and it does seem like they've started rolling in the mud, and man, it's interesting.

Yeah. It seems like tactically they really just want to embarrass Elon Musk as much as they can. Yeah. So this is part of that. And the next story, also related to OpenAI, as you alluded to earlier: it is covering that OpenAI appears to have reduced the time and resources allocated to safety testing of its frontier models. This is apparently related to their next-gen model, o3, and this is according to people familiar with the process.

So some insiders, presumably. The safety evaluators who previously had months now often just have days to flag potential risks. And this kind of tracks with what we've seen come out regarding the split in 2023 between the board and Sam Altman, and generally the vibes we are getting from OpenAI over the past year. Yeah, consistent with people that we've spoken to as well, unfortunately, at OpenAI.

And, you know, the reality is, this is the exact argument, by the way, that was made for the existence of the nonprofit and it explicitly controlling the activities of the for-profit. Like, this was all foretold in prophecy: one day there's gonna be a lot of competitive pressure, you're gonna wanna cut corners on control, you're gonna wanna cut corners on security, on all the things. And we wanna make sure that there is

as disinterested and empowered a party as possible overseeing this whole thing. And surprise, surprise, that is the one thing that Sam Altman is trying to rip out right now. Like, it's sort of interesting, right? It's almost as if Sam is trying to solidify his control over the entity and get rid of all the guardrails that previously existed on his control. But no, I can't possibly suggest that; it's a ridiculous assertion.

Anyway, yeah. Some of the quotes are pretty interesting. You know, "we had more thorough safety testing when the technology was less important," this is from one person who's right now testing the upcoming o3 model. Anyway, all kinds of things like that. So, yep, no particular surprise, I wanna say. This is pretty sadly predictable. But it's another reason why you gotta have some kind of coordination on this stuff, right?

If AI systems genuinely are going to have WMD-level capabilities, you need some level of coordination among the labs. There is no way that you can just allow industry incentives to run fully rampant as they are right now. You're gonna end up with some really bad outcome, like people are gonna get killed.

That's a pretty easy prediction to make under the nominal trajectory, if these things develop, you know, the bioweapon and cyber-offensive capabilities and so on; that's just gonna happen. So the question is, how do you prevent these racing dynamics from playing out in the way that they obviously are right now at OpenAI?

I will say, it's very clear from talking to people there, and it's very clear from seeing just the objective reports of how quickly these things are being pumped out, the amount of data we're being given on the testing side. It's unfortunate, but it's where we are. And next, yet another story about OpenAI, and kind of a related notion, or related to that concern.

The story is that ex-OpenAI staffers have filed an amicus brief in the lawsuit that is seeking to make it so OpenAI cannot go for-profit. An amicus brief is basically like, hey, we wanna add some info to this ongoing lawsuit and give our take. And this is coming from a whole bunch of employees who were at the company between 2018 and 2024, such as Steven Adler, Rosemary Campbell, Neil Chow, and like a dozen other people who were in various technical positions: researchers,

research leads, policy leads. The gist of the brief is, you know, OpenAI would go against its original charter if it went for-profit, and it should not be allowed to do that. And it mentions some things like, for instance, OpenAI potentially being incentivized to cut corners on safety and develop powerful AI that is concentrated for the benefit of its shareholders as opposed to the benefit of humanity.

So the basic assertion is OpenAI should not be allowed to undertake this transition; it would go against the founding charter and, I guess, the policies set out for OpenAI. Yeah. And one of the big things that they're flagging, right: if OpenAI used its status as a nonprofit to reap benefits that it's now gonna cash out by converting to a for-profit, that itself is a problem. And one of the things being flagged here is recruiting, right? Recruitment.

The fact that they were a nonprofit, the fact that they had this very distinct, bespoke governance structure that was designed to handle AGI responsibly, was used as a recruiting technique. I know a lot of people who went to work at OpenAI because of those commitments; many of them have since left. But there's a quote here that makes that point, right?

In recruiting conversations with candidates, it was common to cite OpenAI's unique governance structure as a critical differentiating factor between OpenAI and competitors such as Google or Anthropic, and as an important reason they should consider joining the company. The same reason was also used to persuade employees who were considering leaving for competitors to stay at OpenAI, including some of us. Right? So this is not great.

Like, if you have a company that is actually using the fact of being a nonprofit at one time and then kind of cashing that out and turning into a for-profit. So, without making any comments about the competitors: Anthropic has a different governance structure.

They're a public benefit corporation, but with a kind of oversight board; xAI is just a public benefit corporation, which really all that does is give you more latitude, not less. It sort of sounds like it's just a positive, but it's complicated. It doesn't actually tie your hands; it gives you the latitude to consider things other than profit as a director of the company. Really, you're just giving yourself more latitude.

So when OpenAI says, oh, don't worry, we're gonna go to a public benefit corporation model, it sounds like they're switching to something that is more constrained, that is still constrained by, or motivated by, some public interest.

But the legal reality of it, as I understand it at least, is that it's just going to give them more latitude, so they can say, oh yeah, we're gonna do X, Y, or Z even if X, Y, or Z isn't profit motivated. It doesn't mean that you have to do specific things, I guess, unless there's additional legal structure around that. Anyway, the bottom line is, I think it's actually a pretty dicey situation from everything I've seen.

It's not super clear to me that this conversion is gonna be able to go ahead, at least as planned. And the implications for the SoftBank investment, for the tens of billions of dollars that OpenAI has on the line, are gonna get really interesting. Yeah, it's quite the story, certainly a very unique situation.

And as you said, I'm a little surprised; I thought OpenAI might be able to just, you know, not really be challenged in this lawsuit, but it seems like it may actually be a real issue for them. And one more story about OpenAI; it just so happens that they are dominating this section this episode. They are coming out with an ID system for organizations to have access to future AI models via its API. There's this thing called verified organizations.

They require a government-issued ID from supported countries to be able to apply. Looking at their support page, I actually couldn't see what else is required to be verified. The page says, unfortunately, a small minority of developers intentionally use the OpenAI APIs in violation of our usage policies, and they're adding the verification process to mitigate unsafe use of AI while continuing to make advanced models available to developers, and so on.

So it seems like they wanna prevent misuse, or presumably also competitive behavior by other model developers out there. I dunno, seems like an interesting development. Yeah, it looks like a great move from OpenAI, actually. It's on this continuum.

Like, I remember a lot of debate in Silicon Valley around, let's say, 2019, especially in the YC community. People were trying to figure out, how do you strike this balance between privacy and verifiability, and where are things going with bots and all that stuff? This is kind of shading into that discussion a little bit. And it's an interesting strategy, 'cause you're going at the organizational level and not the individual level.

It does take a valid government-issued ID from a supported country, so a couple of implied filters there, and then each ID is limited to verifying one organization every 90 days. So it all kind of intuitively makes sense. Not all companies or entities are eligible for this right now; they say they can check back later.

But yeah, it's kind of another axis for OpenAI to try their staged releases, where they're like, first we'll release a model to this subpopulation, see how they use it, then roll it out. This seems like a really good approach, and actually a pretty cool way to balance some of the misuse concerns with the need to get this in the hands of people so they can just build with it. And one last story. The title is Meta Whistleblower Claims Tech Giant...

Oh, this is a long title. Anyway, the gist of it is there's a claim that... Oh, I've never heard you give up on a title. Yeah, some of them, Fortune I find, can be annoyingly wordy. But anyway, the claim is that Meta aided in the development of AI for China in order to curry favor and be able to build business there, and apparently they make quite a lot of money there. This is from former Facebook executive Sarah Wynn-Williams.

She just released a book that has a bunch of alleged details from when she was working in a high-profile role there from 2011 to 2017. And in her testimony to the Senate Judiciary Committee, she said that that's what Meta did.

Yeah. And Senator Josh Hawley sort of led the way on a lot of this investigation and had some really interesting clips that he was sharing around on X. But yeah, it does seem pretty consistent, I'll say, with some things that I had been hearing about, let's say, the use of Meta's open-source models, and potentially Meta's attempts to hide the fact that these were being used for the applications that they were being used for.

Things that, let's say, would not look great in exactly this context. They were different from this particular story, but very consistent with it. One of the key quotes here is: during my time at Meta, she says, company executives lied about what they were doing with the Chinese Communist Party to employees, shareholders, Congress, and the American public. So, remains to be seen. Are we gonna see Zuck dragged out to testify again and get grilled?

I mean, there's hopefully gonna be some follow-on if this is true. This is pretty wild stuff. And Meta used, quote, a campaign of threats and intimidation to silence Sarah Wynn-Williams, the one who's testifying here; that's what Senator Blumenthal says. And anyway, she was a very senior director of Global Public Policy, all the way from apparently 2011 to 2017. So a long tenure, a very senior role. And this predates, right, the whole Llama period.

This is way before that. And certainly, anecdotally, I've heard things from behind the scenes that suggest that that practice is ongoing, if the people I've spoken to are to be believed. So anyway, this is pretty remarkable if true.

So Meta's coming back and saying that Wynn-Williams' testimony is, quote, divorced from reality and riddled with false claims, and that Mark Zuckerberg himself was public about our interest in offering our services in China, and details were widely reported beginning over a decade ago; the fact is, we do not operate our services in China today. And I will say, that's only barely true, isn't it?

Because you do build open-source models that are used in China, and that for a good chunk of time did represent, again, at least according to people I've spoken to, basically the frontier of model capabilities that Chinese companies were building on. No longer the case now, but certainly you could argue that Meta did quite a bit to accelerate Chinese domestic AI development.

I think that you could have nuanced arguments that go every which way there, but it's sort of an interesting, very complex space. So this is all in the context, too, where we're talking about Meta potentially being broken up. There's an antitrust trial going on; the FTC is saying, basically, we want to potentially rip Instagram and WhatsApp away from Meta. That would be a really big deal. So anyway, it's hard to know who's saying what.

There is a book in the mix, so money is being made on this. But it definitely would be a pretty big bombshell if this turns out to be true. Mm-hmm. Yeah, not too many details as to AI specifically.

From what I've read of the quotes, it seems that there was a mention of a high-stakes AI race, but beyond that it's just sort of more generally about the communications with the Communist Party that the executives had, and it wouldn't be surprising if they were trying to be friendly and do what they could to get support in China. For sure.

And I just want to add, for context, about what I've mentioned regarding other sources of information along these lines: I haven't seen anything firsthand, so I just want to call that out. But it would be consistent with this, generally, if it's to be believed. So just to throw that caveat in there.

Yeah, a lot of questions about a lot of different companies in the space, obviously, but Meta has been one, I think justifiably if this is true, to receive a lot of scrutiny. And that is our last story. Thank you for listening to this episode of Last Week in AI. As always, we appreciate it if you leave a comment somewhere; you can go to Substack or YouTube, or leave a review on Apple Podcasts.

Always nice to hear your feedback, or just share it with your friends, I guess, without letting us know. But either way, we appreciate you listening, and please do keep tuning in.
