
#190 - AI scaling struggles, OpenAI Agents, Super Weights

Nov 28, 2024 | 2 hr 37 min | Ep. 229

Episode description

Our 190th episode with a summary and discussion of last week's* big AI news! *and sometimes last last week's

Hosted by Andrey Kurenkov and Jeremie Harris. Note from Andrey: this one is coming out a bit later than planned, apologies! Next one will be coming out sooner. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Read our text newsletter and comment on the podcast at https://lastweekin.ai/.

Sponsors:

  • The Generator - An interdisciplinary AI lab empowering innovators from all fields to bring visionary ideas to life by harnessing the capabilities of artificial intelligence

In this episode:

  • OpenAI's pitch for a $100 billion data center and AI strategy plan outlines infrastructure and regulatory needs, emphasizing AI's foundational role akin to electricity.

  • Google's Gemini model challenges OpenAI's dominance, showing strong performance in chatbot arenas alongside generative AI advancements.

  • DeepMind's AlphaFold3 gets open-sourced for academic use, while new chips from NVIDIA and Google show significant performance boosts.

  • Anthropic and TSMC updates highlight strategic funding, regulation influences, and the complex dynamics of AI hardware and international policy.

If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form.

Timestamps + Links:

Transcript

Andrey

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual in this episode, we will summarize and discuss some of last week's most interesting AI news, and as always, you can also go to lastweekin.ai for the text newsletter with even more articles that we will not be covering. I am one of your hosts, Andrey Kurenkov. My background is that I studied AI at Stanford, and I now work at a generative AI startup.

Jeremie

And I'm your other host, Jeremie Harris, co-founder and CEO of Gladstone AI, a national security AI company. And, uh, yeah, if I look more disheveled than usual, it's because I'm not particularly sheveled. Also, recent dad, that's right. I'm going to start to just drop that every time. Sorry, sorry, new dad. I wonder how long it

Andrey

isn't working at a hundred percent. It's

Jeremie

just how it goes. It's nice, because that makes it seem like it ever was. But yeah, the thing that's on today is I've got to go and do a thing, I think at a bank, which has us starting about 20 minutes late, which means we're going to try to end this on time and actually do a one-and-a-half-hour-ish episode. I wonder how that's going to go. We tell ourselves this every time, every time, every time.

Andrey

We'll give it a try. So actually, moving on to a couple of things. As usual, we will quickly acknowledge some listener comments. There was a fun one on YouTube I really liked: "great podcast, love the detail," thank you for that, and "it makes me sound smart in exec meetings," which is a great outcome. Honestly, that is the

Jeremie

goal.

Andrey

Yeah, that's one of the goals for sure, to make people sound smart in daily conversation or in executive meetings. And aside from that, I do want to give a shout out to people who leave us ratings without leaving a review. We are now at 239 ratings on Apple Podcasts. So that number is creeping up, always nice. And we are still at 4.7 out of five, so hopefully that reflects us staying consistent in quality.

And now a quick preview of what we're talking about in this episode. In tools and apps, nothing huge, some previews of what's coming, more or less. In applications and business, a lot of stuff on hardware and data centers is the focus. We've got a few exciting open source things, including AlphaFold 3. And a few pretty nerdy research stories, I'm going to say, as opposed to our normal jokey fare.

Andrey

Well, I think these are a little more conceptual, let's say, more about how the internals of these things work. And in policy and safety, we're going to be talking a little bit about the EU and some conversations about nuclear and AI strategy, the usual kinds of things. No major news there. And one last thing before we start on the news: once again, we want to shout out our sponsor, which is The Generator, Babson College's interdisciplinary AI lab focused on entrepreneurial AI.

Babson has been a number one school for entrepreneurship for 30 years now. And last fall, professors from all across Babson partnered with students to launch this thing, The Generator, which is a lab organized into eight different groups, including things like AI entrepreneurship and business innovation, AI ethics and society, the future of work and talent, and so on. This group has already peer-trained a lot of the faculty at Babson on AI concepts and AI tools.

On their website, they say that The Generator accelerates entrepreneurship, innovation, and creativity with AI. They are fans of the podcast, and we thank them for the sponsorship. You can go to their website if you want to learn more, or just keep an eye out for any news coming out of them. And moving right along, getting started with the news: in tools and apps, we are actually going to start with a sort of follow-up to something we discussed last week.

Another article has come out in this ongoing conversation around a seeming slowdown, potentially, in AI development. This one is "OpenAI, Google and Anthropic Are Struggling to Build More Advanced AI" from Bloomberg, a nice overview article of the topic. And last week we discussed it a little bit.

Last episode, I should say; this one might come out a little closer. We discussed how the new model from OpenAI, Orion, is seemingly not hitting desired performance goals, according to, in this case, two people familiar with the matter at OpenAI who spoke on condition of anonymity. And this article also mentions that people from inside Google have said similar things: that the next iteration of Gemini is not living up to internal expectations.

And also that inside Anthropic there have been challenges with 3.5 Opus. In all of these cases, they just refer to people familiar with the matter. So what seems to be happening is that you train these larger models and they perform better on evaluations, but not by as much as would be expected. And this has led to a lot of conversation in the AI community. You've seen a couple of people being like, I told you so, everything is slowing down.

Gary Marcus, as always, loves to say that, and Yann LeCun also posted something to that effect. We discussed it a little bit last week, and I think we can talk about it a bit more. My take, as before, is that perhaps it's not entirely surprising that the pure scaling approach of a bigger model, more data, more compute is challenging to keep going. We know that

seemingly the scaling laws we have haven't broken, so you're still getting the expected improvement in perplexity, in being able to predict probabilities of words or letters. But how that translates to actual intelligence, to performance on benchmarks or just being smarter, is harder to say. And in some sense, even according to the scaling laws, we know that as you scale up more and more, you get something akin to diminishing returns.

It's harder and harder to get the same amount of improvement; you need to keep expanding to bigger and bigger sizes by orders of magnitude, roughly speaking. So the takeaway, I think, is that this shouldn't be read as us hitting a wall, as AI improvement necessarily slowing down.

What this is implying is that we might be at a point where it's challenging to keep improving at the pace we've been at by just doing scaling, which I think is not entirely surprising. I mean, at some point this seems like it would have happened sooner or later.

Jeremie

Yeah, I think this is really interesting. It's a conversation I've been having with friends at the labs for the last year or so, as they've started to see some of these things come out: the debate over what exactly this means for future scaling. One observation is, you pointed out this idea of diminishing returns, right? And those diminishing returns do play out at the level of these log plots.

So, like you said, you have to keep exponentially increasing the amount of compute and exponentially increasing the amount of data that you feed into these models to continue the same linear trend in performance. But this is the problem, right? What does it actually mean for a model to get, you know, X percent better at autocomplete?

That's what we're talking about, right? The thing that scales is roughly how good this model is at predicting the next token, if you're talking about an LLM. How that translates into actual concrete performance at tasks that we care about is the big question. In some sense, it's always been the big question. It's just that that mapping has historically been very tight.
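(To make the shape of that trend concrete, here is a minimal sketch of a Chinchilla-style loss curve. The functional form is from the scaling-law literature, but the constants below are illustrative placeholders, not fitted values from any of the labs discussed.)

```python
def loss(params, tokens, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Chinchilla-style loss: an irreducible term plus power-law terms in
    model size (params) and data (tokens). Constants are illustrative."""
    return E + A / params**alpha + B / tokens**beta

# Each 10x jump in parameters (with matching data) buys a smaller and smaller
# absolute drop in loss: the "diminishing returns" you see on a log axis.
for n in [1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params -> loss {loss(n, 20 * n):.3f}")
```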

So as you see the model get better, you know, GPT-1 to 2 to 3 to 4, you've consistently seen better next-word prediction accuracy translate into greater general capabilities, including even agentic capabilities with GPT-4, right? And that was kind of insane, right?

Like, this idea that you get really good at text autocomplete and at a certain point that gives you agentic reasoning capabilities would have sounded insane just a year before, and it seems to be true. So one of the big questions is, as we do this, are we really building a world model that is more usefully robust, one that allows for the building of better agents? Keeping in mind that we've now moved beyond that pure scaling LLM paradigm.

We're now wrapping things up in, obviously, post-training, which now includes reinforcement learning specifically for agentic behavior. And we now have all these inference-time scaling laws that give us a whole bunch more, and they compound multiplicatively with training-time scaling laws. I think that's the right way to think about it. So you can tap one out but not the other, and you can still continue this trend.

So there's a lot of uncertainty, I think, in general. But when you talk to people at the frontier labs, I don't think anyone is expecting a slowdown. In fact, I've heard quite the opposite, right? One of the big themes is we're seeing AI be used to automate AI research itself more and more, kind of closing that final feedback loop that gets you to fully automated research. So yeah, I think there's so much nuance here, difficult to unpack in one episode.

In fact, we have a hardware episode we have to do, and we could almost do a scaling law episode on the path to ASI. But I think fundamentally the big blockers now are starting to look industrial, right? The industrial base is struggling to keep up with the energy demands of this kind of scale. It's hard to find the 500 megawatts of spare grid capacity that you need for that next cluster, the one-gigawatt cluster.

You know, five gigawatts seems out of reach. So pretty soon the scaling runs of 2027, 2028 are looking challenging to pull off. We'll be talking about that as well today. But I think all these things exist together, and it's not enough for just scaling to get you there, because it's not economically viable from an energy standpoint.

Andrey

Exactly. So a lot of nuance here, right? And I think it is worth highlighting that I've had the feeling over the last couple of months that we're seeing with agentic AI, with agents, something akin to what we've seen with other paradigms of AI, like video generation and image generation, where we're seeing the takeoff. You're seeing the early days of agentic AI, like we saw with text-to-image, like we saw with text-to-video, right? We've seen a couple of demonstrations.

And that's what has happened over the last couple of years: we saw a few demonstrations, and then in a year, in a few months, in half a year in the case of text-to-video, we see more and more; there's kind of an avalanche of the trend realizing itself. And that is definitely what's going to happen next year with agentic AI. We're going to see more and more of it coming out, as we'll actually talk about in a bit.

So in that sense, there's not going to be any slowdown. We're going to see new AI tools that can do even more, regardless of whether we get these larger models that are even smarter. It is a very interesting question whether we can also scale up the models and get a major improvement in performance. I think there's a real question there about, for instance, not just the scale of a model, but maybe the data, right?

We haven't used up all the data per se, but we know that quality of data matters in addition to quantity of data. And it may be that you've sucked up all the news articles, you've sucked up all of Wikipedia, you've sucked up all of Stack Overflow, and, you know, how much is left of the good data, so to speak? So I'm sure we'll be seeing some research coming out exploring these topics of whether there is potential for a plateau in the scaling laws.

Because, also worth mentioning, as the CEO of Anthropic recently said on a podcast, these are empirical laws, right? These are not physical laws; there's no theory behind them, to my knowledge. So we may in practice see some sort of plateau, and the laws may not go on infinitely. And that'll be interesting as well.

Jeremie

Yeah. I mean, my intuition is that the laws persist, because the intuition behind them seems pretty robust. I think this is all stuff we need to talk about; we have to do an episode on scaling laws. But the one thing I'll say on the agentic picture is that the last mile problem is, right now,

the thing that people are working on that is the toughest nut to crack. For an awful lot of the long-horizon reasoning tasks you're trying to solve with these systems, the ones that often deliver the most value, it's not good enough to have an agent that gets the steps right 99 percent of the time, because you're going to have to string together dozens and dozens of steps. So on average, you'll expect the thing to go off the rails.
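(A quick back-of-the-envelope on why per-step reliability compounds the way Jeremie describes; the numbers here are just illustrative.)

```python
# If each step succeeds independently with probability p, a task that
# requires every one of n steps to succeed finishes with probability p**n.
for p in (0.99, 0.999):
    for n in (10, 50, 100):
        print(f"p={p}, {n} steps -> task success ~ {p**n:.2f}")
# At 99% per step, a 50-step task already fails roughly 4 times in 10.
```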

So that last mile problem, multitask coherence, is so, so important, and that's the thing that's challenging to train for right now, because there are so few examples of long-horizon reasoning traces that you can actually train on. To speak to your mention of the data wall issue, Andrey, there are potential ways around this; synthetic data actually does look quite promising for this.

You can have AI systems audit reasoning traces: generate and audit, generate and audit, in a sort of AlphaGo-style approach that's been tried and is being tried. And I think there's some promise there. But anyway, yeah, I think we need to do a scaling law episode.

Andrey

Okay, you heard it first. We are promising it, and we will deliver at some point over the next year. Yeah, that's right. Next up, a related story: OpenAI is apparently nearing the launch of AI agent tools to automate tasks for users. There's not much to the story, not much that we know yet. Apparently this new AI agent is codenamed Operator.

It will be able to use a computer to take actions on a person's behalf, very much similar to what Anthropic launched recently on the API. This is according to people familiar with the matter; the people familiar with the matter are the real sources of news, apparently. So there was a staff meeting just last week, and apparently OpenAI's leadership announced plans to release the tool in January as a research preview and through the API.

So it seems like it is coming relatively soon, at least if these plans come through, and that wouldn't be surprising given that Anthropic already launched this in their API.

Jeremie

When involuntarily pinged by an agentic form of the Operator model, OpenAI employees are said to have responded: Operator, Operator, don't call me, I'll call you later. Sorry, that was exactly as unfunny as it sounded in my head. That's a real niche joke, isn't it? There are like three people that laughed. Yeah, no, look, there's a bunch of speculation about what this model (sorry about that, everyone) might be.

You know, a web browser tool, task automation, all the usual stuff. I think right now the best way to think of these is as experiments in long-horizon task coherence. The problem we just talked about, Andrey, with that last mile thing, is I think going to be the biggest challenge going forward for this.

Andrey

And moving right along, a couple more stories. First up, Google has dropped a new Gemini model and it is performing pretty well. So there's this Gemini-Exp-1114, which has topped the LM Arena, uh, the Chatbot Arena. That's a new name, I think, for LMSYS. That's weird; I liked LMSYS better. But anyway, it has topped the Chatbot Arena, where AI models compete in a head-to-head format and users vote on which ones they like best.

And so this new experimental Gemini, whatever it is, has matched GPT-4o and outperformed OpenAI's o1-preview. Although, again, this is about which model people like better, which is sometimes kind of hard to translate into intelligence. So it's now up alongside the OpenAI models, and Grok-2 is up there as well.

Jeremie

Yeah, I mean, it is at least a more direct measure of human preferences than your standard scaling curve that shows cross-entropy or whatever. But yeah, it is difficult to assess how much this means. There's speculation about whether this model is a version of Gemini 1.5 or whether it's an early glimpse of Gemini 2. To some degree, obviously, version numbers are kind of meaningless.

This is something that, actually, Dario mentioned on his podcast with Lex Fridman: the difficulty of naming these models. And I remember talking to a friend who told me, months and months ago, oh yeah, OpenAI is training GPT-5 right now, and here's some of the stuff that we know. And then it turned out to be a different model.

Like, when it was released, the name just changed. These sorts of things happen all the time. I think the thing this is getting at, the thing to anchor on, is: what is the compute cluster that was actually used to train these models? When we're talking about Gemini 1.5, Gemini 1, Gemini 2, the thing that really matters is, is scaling working, or is the next training paradigm working?

It's possible that Gemini 2 is a fundamentally different training paradigm, so that would be a very interesting distinction. So these increments can matter, but the fundamental things that matter are the size of the training cluster, the amount of flops that go into it, and the bells and whistles that are added during training. And here it's very unclear what the answer is.

So we'll just have to sit back and wait for more information. But I do like the industry's trend of dropping these mystery models in the middle of some leaderboard and seeing people freak out, and then later you find out; sometimes you don't even know which company it is, right? That has happened. But,

Andrey

uh, yeah, there you go. Next, we are going to talk about image-to-video, a new trend as well. So this is Shengshu Technology, and we've covered their AI tool Vidu, I believe, which already could generate eight-second clips from text. And now they have an update to the tool: you can give it three distinct images, so for instance a shirt, a person, and, like, a vehicle, and it can then create a video for you combining those images.

So yeah, it's part of a trend we've seen: multiple tools that take an image and then create a video out of it. Sometimes they continue and generate a video that looks like the continuation of the image. Here's another example of a tool launching with something like that capability. And now to the last story: we have the Forge Reasoning API beta from Nous.

So we chatted about Nous Chat and Hermes 3 in the last episode, and they are also announcing this Reasoning API, which allows you to query an API like you do with an LLM, but built into it is a bunch of the known techniques for reasoning, like tree search, chain of code, mixture of agents, et cetera. It is now available in beta to just a few users, and I've seen some examples of people trying it out.

It does seem to empower kind of weaker models, like Llama 3 for instance, with these improved techniques to do better reasoning, in a way, again, showcasing the trend toward both reasoning and agentic AI, which kind of go hand in hand.

Jeremie

Yeah, it's actually pretty wild looking at these metrics, and I personally feel like I need to dive into this more. This seems like an absurdly effective framework. So on Hermes 3, the 70-billion-parameter version of Hermes 3, the model that Nous actually built themselves, they're claiming, geez, 81.3 percent on the MATH benchmark.

So that is outperforming Gemini 1.5 Pro, and GPT-4 obviously, as well as Claude 3.5 Sonnet. Like, this is pretty insane. And you see competitive numbers, maybe a little less performant, but competitive, on GPQA and MMLU Pro as well. The AIME benchmark is the weirdest one from what I'm seeing here.

I mean, it's tough because it's a very small benchmark; they don't show it here, but there aren't a ton of samples in it. But it's nominally outperforming even the released o1. I want to look into it; I'd love to have more information, specifically about the eval part of this story.

They are using Monte Carlo Tree Search, and chain of code, which is sort of like chain of thought, specifically where you connect your reasoning trace to a code interpreter so you can get actually grounded feedback that doesn't hallucinate. So every once in a while you get that clarity in your reasoning trace. And then they also have a way of setting up querying between multiple agents.

They call it MoA, or mixture of agents. So it's this compound framework. Monte Carlo Tree Search is probably part of what's going on in the o1 setup, at least that's my guess for now, though honestly no one really knows, and it could be anything. But these are the sort of intuitive things you might try if you're trying to replicate that, and it seems, superficially, like

they have. But again, I just want to see a lot more information about the actual eval side of things. On the AIME benchmark in particular, I'd love to see the reasoning traces and all that stuff, and qualitatively it'd be really interesting to compare to the few o1 reasoning traces that we actually do have, because I'd be curious to see how those two compare.
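(Nous hasn't published implementation details for Forge, so here is only a minimal sketch of the general "chain of code" idea described above: let the model propose code, run it in an interpreter, and feed the grounded result back into the reasoning trace. The `ask_llm` function is a hypothetical stand-in for whatever model API you use.)

```python
import subprocess
import sys

def ask_llm(prompt: str) -> str:
    """Hypothetical model call; swap in a real client (OpenAI, Nous, etc.)."""
    raise NotImplementedError

def chain_of_code(question: str, steps: int = 3) -> str:
    trace = f"Question: {question}\n"
    for _ in range(steps):
        # Ask the model to reason and emit a small Python snippet to check itself.
        snippet = ask_llm(trace + "\nWrite Python that computes the next intermediate result:")
        # Run the snippet in a separate process so the feedback is grounded, not hallucinated.
        result = subprocess.run([sys.executable, "-c", snippet],
                                capture_output=True, text=True, timeout=10)
        trace += f"\nCode:\n{snippet}\nInterpreter output:\n{result.stdout or result.stderr}\n"
    return ask_llm(trace + "\nGiven the verified intermediate results, state the final answer:")
```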

Andrey

Yeah, exactly. There are very few details in the blog post, really just these numbers. But I certainly think it's believable that if you combine the various techniques that we already know can aid in reasoning, and do it well, you can augment existing LLMs and make them much better at reasoning, perhaps matching o1 itself if you do it well enough, or at least getting closer to o1.

And onto applications and business. First up, we have OpenAI discussing an AI data center that could cost a hundred billion dollars. This is according to them having shared information with U.S. government officials about these potential plans. Apparently this would be five times larger than any data center currently being developed.

OpenAI's top policy executive, Chris Lehane, said at an event in Washington that the company has shared information about the potential impact of a data center like this. And we really don't know too much more. We know that they're talking to the federal government, calling on it to expand the energy grid to enable these kinds of things to happen, and they are suggesting various things like speeding up the permitting

process for AI data centers like this.

Jeremie

Yeah. One of the things they've called for in their briefings is to set up a national transmission highway act type thing to expand energy capacity. So basically the national highway act of the fifties, right, where we set up all the interstates and all that stuff. Basically we need that, but for energy infrastructure, is the argument here.

This is, by the way, seemingly for a five-gigawatt cluster, which would probably be the Stargate cluster, the partnership with Microsoft and OpenAI. I don't think they say it explicitly, but that's what it seems to be. You know, if you're looking at this through a kind of Washington national security policy lens: great, good shit.

We definitely want this in the United States. But you can probably think about tying a massive package like this to requirements that the labs that use this infrastructure adhere to certain security standards, right? There are a whole bunch of reasons; I'm saying this because we're working on an investigation right now into lab security and all that stuff.

There's some basic stuff here, and I think OpenAI is keen to rush this through and be like, yeah, give us goodies for free, basically. Or, well, somewhat for free: we'll let some spillover happen and academic institutions will be able to access compute and stuff like that. But fundamentally, just go and do the thing. I think this needs to be tied to some pretty intense conditions.

If we're starting to think of AI as a national security technology, the security situation is just abysmal right now and kind of needs to be fixed up. So, a $100 billion compute cluster at five gigawatts: this is a behemoth. There is not five gigawatts of spare capacity anywhere in the grid. Every company I've talked to is like, yeah, right now at most we're thinking about one gigawatt.

Realistically, we're in the few-hundred-megawatt range for stuff that's up and coming. So somehow that's got to change if we're going to be competitive with the CCP, and the best way to do that is obviously a big infrastructure buildout. The question is just, how do you tie incentives to the use of that infrastructure?

Andrey

And speaking of data centers, the next story has to do with another one, from xAI in this case. Apparently xAI has gotten approval to use 150 megawatts of power, which would enable all 100,000 GPUs in that massive AI data center to run concurrently. So far they've had an initial supply of eight megawatts, which is not sufficient to actually run everything; it would take an estimated 155 megawatts to run all of those H100 GPUs concurrently.

And so with this approval, it seems they are making progress toward being able to use all of that GPU power concurrently. There are still a lot of infrastructure considerations there, and it may not be the case that you would even want to run all 100,000 GPUs concurrently, but certainly now they have the option. This is the big

Jeremie

xAI Colossus cluster, the one sometimes called a gigafactory of compute. And it's also the one Jensen Huang had been talking about in terms of the insane speed of the build. I guess it was like 19 days from rolling the first pods onto the floor of the data center to doing a training run, or whatever it was.

Ridiculously, ridiculously fast. I've seen stories, by the way, of competitors flying planes, like Cessnas or whatever, over the site just to understand how the hell Elon pulled this off. That's the level of freak-out we're at right now in the space.

And what Elon did, by the way, is, because initially there was only eight megawatts of energy available at the time of the opening of the data center in July, he set up portable power generators to bridge the gap. There are a whole bunch of really interesting tweets, if you're into the hardware side; check them out, I thought they were really cool.

Tweets from these new xAI employees coming in and saying, yeah, he brought these in from Tesla and used them to bridge the energy gap. And right now the big question is how far this can go. There are plans to double the capacity beyond 150 megawatts, which would further increase the GPU capacity; 150 megawatts is enough for 100,000 H100s.

So anyway, this is going to be a really, really big facility. Obviously Elon has been referring to it as the largest cluster in the world. It wasn't the largest fully powered cluster in the world, but it's now truly going to be on its way to that.
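(A rough sanity check on why roughly 150 megawatts lines up with 100,000 GPUs. The TDP figure is the commonly cited spec for an H100 SXM; the overhead multiplier is a ballpark assumption for host servers, networking, and cooling, not a disclosed xAI figure.)

```python
gpus = 100_000
gpu_tdp_kw = 0.7     # ~700 W per H100 SXM at full load (commonly cited spec)
overhead = 2.1       # assumed multiplier for CPUs, networking, storage, cooling

total_mw = gpus * gpu_tdp_kw * overhead / 1000
print(f"GPUs alone: {gpus * gpu_tdp_kw / 1000:.0f} MW")    # ~70 MW
print(f"Whole facility estimate: {total_mw:.0f} MW")        # ~147 MW, close to the ~155 MW figure
```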

Andrey

Next story, also on AI hardware, but this time it's chips, not data centers. We have new results from the MLPerf benchmark dealing with NVIDIA and Google chips. So the latest generation of these chips, in NVIDIA's case the B200 GPU, and in Google's case the Trillium accelerator, both now have clearer results beyond just what the companies are saying.

So in Google's case, their sixth-generation TPU, Trillium, has shown a 3.8-fold performance improvement, about four times better. And the B200 GPU seems to be about twice as fast as the H100. So in both cases, it seems we're not hitting a ceiling on improving chip performance; we are still doubling or even quadrupling performance with each generation of chips.

Jeremie

Yeah, and there's also more energy efficiency, especially with the Google chips, apparently a 67 percent increase in energy efficiency. When you're talking about the difficulty of hitting that gigawatt cluster and so on, that's a really big factor. Increasingly, people often think about energy through the lens of climate and all that stuff, but the thing that really matters is how much can we

squeeze out of the grid power that we have? Increasingly that's the bottleneck. And so Google is starting to look very prescient for setting up their TPU program back in, I can't even remember, 2015, 2016. Back then it seemed sort of wild, and now we're on to TPU v5, we're on to Trillium and all that stuff, and it's really paying off.

One of the things Google is especially good at is the multi-data-center compute game plan. And it's relevant when you run out of power, because if you don't have enough grid power in one location, and that's the real challenge, then the question is: all right, I guess I'm going to have to set up data centers in different geographic locations and have them work together.

And that means I need to be able to do distributed training across a large set of data centers. That's something Google cracked early and hard, and they've got great papers; we've talked about some of them on the podcast. But don't sleep on Google on that: they're really good at this, and they're really good at data center design. These are not crazy geographically separated, by the way.

They're in the same general area, for technical reasons that are a little bit challenging to overcome right now. But still, this is a big competitive advantage. The other piece is on scaling: the new TPU, Trillium, can link up to 256 chips in a single high-bandwidth pod. So we're talking about high bandwidth.

Again, this will be punted to our hardware episode, but this is the GPU-to-GPU communication, the thing that on NVIDIA devices would be done over NVLink. NVLink is the super-high-capacity GPU-to-GPU interconnect, and usually you'll see, I don't know, 36 or 72 GPUs in a pod; this is 256. And it's really expandable beyond that as well, so very much designed to be super scalable.

And they've got all kinds of infrastructure for it. So yeah, I think Google is one to really watch. They were the sleeping dragon that got woken up by ChatGPT, and now they actually have quite a strong compute advantage over Microsoft and OpenAI.

Andrey

And moving away from hardware a bit, we have almost a gossip story, I would call it. So we're getting a bit of detail on the recruitment by Mira Murati for her new venture. She is the former CTO of OpenAI; she left pretty recently and has announced she's doing something, though we don't know at all what this new company will be. But now we do know that a fair number of people from OpenAI are joining this endeavor.

So we know that Mianna Chen, a research program manager, is joining this new company, and apparently also the ex-head of post-training, Barret Zoph, and former senior researcher Luke Metz, who also left OpenAI in recent months, are teaming up on this. No details beyond that, but it seems like a lot of talent being combined into this new company.

Jeremie

Yeah. I mean, the one thing you can pretty safely bet is that this is going to be a long AGI play with this team. Barret Zoph is one of the original co-inventors of the mixture of experts model, so very good on the foundation side, and pre-training is presumably going to be part of the game plan as well. So they have the productization side

in Mianna Chen, and they've got the more pre-training-oriented folks. Obviously Barret Zoph was actually the head of post-training previously at OpenAI, but he certainly has the pedigree to do pre-training stuff, and so does Luke Metz, a former senior researcher. So anyway, it'll be interesting to see. I think maybe it's just another AGI play; I'd be a little surprised if it wasn't, actually.

Andrey

And one more story dealing with business: we are seeing a few more rumors about Anthropic and their quest to get more money. Apparently Amazon is discussing a new multi-billion-dollar investment in Anthropic. There are apparently discussions of something similar to the initial $4 billion Amazon invested in Anthropic previously, last year.

It seems that this time, though, there might be a caveat, where Anthropic would need to use Amazon chips specifically to train their models. And these chips aren't NVIDIA, so these training chips could pose technical challenges. Again, not too much is known; this is just internal discussions, but it could be interesting to see if it does come true.

Jeremie

Yeah. And apparently, according to the article, any investment deal with Amazon could come in the form of convertible notes that become equity after Anthropic raises capital from other investors. This is a little weird. Usually this is something you see in early-stage startups, when you're trying to avoid having to set a valuation for a deal, right?

So essentially what you're doing is delaying the discussion of what the valuation will be until there's a priced round later. This usually happens at an earlier stage, because it's really hard to value a startup early on when you just have a couple of founders and an idea. So I'm not sure exactly why the structure of the deal takes that form. It sort of reminds me of the SAFEs that Y Combinator and angel investors use a lot.

But in any case, I think this is a really interesting deal to watch. The strings attached are obviously highly strategic for Amazon, trying to push Anthropic to use their hardware and their alternative to CUDA too. That's going to be really important, because Amazon has an awful lot of catching up to do, right? They were caught flat-footed, there's no other way to put it, on the whole AGI scaling race.

And now they're trying to catch up. They're not really developing their own models; they're leaning a lot on Anthropic to provide those. And now they're trying to get Anthropic to generate demand that can help drive the improvement of their own hardware. I think this is a really interesting one to watch to understand how stable, long term, the relationships between hyperscalers and model developers, like the OpenAI

and Microsoft relationship, can be in the long run. With Microsoft and OpenAI, we're already seeing cracks in that relationship. And to the extent that Amazon is now being kind of forced by business pressures to require Anthropic, to ask Anthropic, to use their stuff more and more, maybe to the point where it won't be workable, it kind of makes you wonder about a lot of these deals.

So yeah, this is maybe a canary in a coal mine, but it has been a very fruitful partnership for the two. We don't know the details. I think the big question is just going to be: what specifically are the requirements? What are the strings attached here? How much, uh, Trainium... Trainium? Come on, guys. It

Andrey

has to go into this. On to projects and open source. We've got a couple of stories here, starting off with the exciting news that AlphaFold 3 has been open-sourced. So we got the release of the source code and model weights. This is meant for academic use, so the license is a bit more restrictive. And there's really not too much else to say; we covered AlphaFold 3 previously.

Obviously, this was a major improvement in the ability to model proteins and to apply that to things like scientific discovery and drug development. It was not open-sourced when it was announced, and this seems to have come kind of out of nowhere.

Jeremie

Yeah, I think the law in AI is that the third iteration of your model is the one you have to withhold, right? That's how GPT-3 happened, and now AlphaFold 3. Well, no longer, I guess. But anyway, they have a deal where you can access the model weights if you have Google's explicit permission, for academic use only. So, you know, they may be giving themselves a little bit of wiggle room there too.

The partnership between DeepMind and Isomorphic Labs is really where a lot of this stuff is being monetized on the, I don't want to call it the Google end; I'm not actually sure exactly how the ownership structure works there, but they are partner organizations in some sense. Demis Hassabis is actually very involved with Isomorphic Labs, I think as an executive. But yeah, so they've got a diffusion-based approach.

Anyway, the whole thing about this is that it's not just about modeling the proteins; it's about modeling the interactions between proteins, proteins when they're modified, and ligands, things that bind to proteins. Which is actually very interesting when you think about medical impact, right? You're often concerned with how two things will actually interact; that's the only way you can make an effect happen in the human body.

And so this is really where, qualitatively, AlphaFold 3 is head and shoulders above AlphaFold 2. So it'll be interesting to see if there's tangible impact that comes from this. There has been some from AlphaFold 2, but I think it's fair to say a bit less than people expected when it first came out. So we'll see if that changes here.

Andrey

Yeah. And to get access to the weights, you actually need to fill out a Google form with a little set of questions, and DeepMind will then decide whether to give the weights to you. So they're being pretty cagey, pretty careful. And they really emphasize that this is not for commercial use.

If you're a university, nonprofit organization, or research institute, you can use this, but they highlight over and over that this is not for commercial use. And in fact, there's some sort of stamp, some sort of marker, in the weights related to your form submission, which is interesting. But certainly good news for scientists and researchers. And I think they did share this previously already in a sort of closed-off process, so this is expanding access to it.

And anyone can look at the source code for inference. So either way, even without the weights, this can inform people as to how to build these sorts of models. So does that imply they're

Jeremie

watermarking the actual, like, proteins, the amino acid sequences or something, that are generated by the thing? I'm not sure.

Andrey

Uh, each AlphaFold 3 model parameters file will contain a unique identifier specific to the form submission. Interesting. Okay. Yeah. And next story: NEAR plans to build the world's largest, a 1.4-trillion-parameter, open source AI model. This is NEAR Protocol, and it's just a plan they are aiming to kick off. They are aiming to crowdsource the training.

They want to have thousands of contributors, and for now you're able to start contributing to the training of what they say is a small 500-million-parameter model; they're kicking this off today. So it's very hard to say if this will come to be. A 1.4-trillion-parameter model is like three and a half times bigger than Meta's biggest Llama model, very, very challenging to train. Personally, I'm a little skeptical, but it will be cool to see if they can even get to a fraction of this.

Jeremie

Yeah. So it's actually kind of interesting that the two guys co-founding this are former OpenAI people who were actually part of the transformer research work, the Attention Is All You Need follow-up work that led to ChatGPT. And that's interesting, because this is actually one of the first times I've seen a kind of AI-meets-crypto project that... I mean, there's Bittensor as well.

Or was it Bittensor?

Andrey

Yeah. I think there have been many initiatives at the intersection of crypto and AI that we haven't talked about, so I'm not sure how meaningful any of them have been so far.

Jeremie

Well, that's the thing. So what I mean is, Bittensor was the first time I remember going, okay, this could actually work. An awful lot of it, to me, and it's just my bias, sounds a lot like Ben Goertzel getting really excited about SingularityNET's latest thing. It's not quite, it's better than, word association, but some of it is actually just word association.

This is a time where I was like, you know what, this actually makes some sense. So the pitch, and this is not investment advice, the pitch is something like: it takes a lot of upfront capital to train a big model, right? In this case, the biggest one they're contemplating training would be 1.4 trillion parameters; it would be $160 million to train, and so on. So why don't we create a new token for each model we want to train?

And what we'll do is auction away that token to raise money for the training run. And then if you hold those tokens, you can use them later to buy cheaper inference, or should I say you can use tokens to buy tokens? Anyway, the point is it actually kind of makes structural sense; this is not the craziest shit I've ever heard come out of the crypto space. And the other interesting thing is we had a little clue, kind of a glimpse, into the technical details.

They say that to have this decentralized network of compute, which is what they need to pull this off (you're not going to have tens of thousands of GPUs crammed into one place), you're going to need, as they put it, a new technology that doesn't exist today, because all the distributed training techniques we have require very fast interconnect. Okay, that's true. But they added that emerging research from DeepMind suggests it's possible.

Now, when you think about DeepMind and distributed compute in this way, the first thing that comes to mind for me is DiLoCo. We've talked about that quite a bit.

I'm trying to remember if it was Nous or another company that's doing an open source kind of version of this, relying on that kind of infrastructure. We are getting better and better at this sort of distributed training, doing it for really large-scale training runs, even getting multiple data centers, multiple clusters, to work together. Really, really hard. So I think DiLoCo, if that's part of the plan here, is only going to be part of the plan.

There has to be some other solution here. But anyway, I flagged it because it's a pretty wild-sounding story that's actually just sensical enough that I'm like, you know what, this is not the craziest shit I've ever heard when it comes to the intersection of crypto and AI. So there you have it.

Andrey

Moving on to research and advancements, we've got a super paper to kick things off, titled The Super Weight in Large Language Models. This is one of those papers that dives into the inner workings of large language models and discovers something, in this case, I think, very interesting. This is a partnership between the University of Notre Dame and Apple.

And what they say in this paper is that certain weights in an LLM, what they call super weights, are massively important. So we know that weights, the parameters of a neural net, vary in importance. You can prune, you can set a bunch of them to zero, and it doesn't really affect your performance. That's in fact how people scale down models, make them more efficient, compress them: a lot of it is by finding weights that don't matter and killing those off.

And what they show in this paper is that, while we know some weights are important and you don't want to kill them off, there are in fact these super weights that are even more important, such that if you zero out this one weight, literally one, it leads to a massive drop in performance.

In fact, even removing 7,000 of the other largest weights, the weights that contribute most to the activations of the neural net, isn't as damaging as removing this one weight. So you can say this one is more important than thousands of other weights that also matter. What I found interesting here is that this builds on previous research from earlier this year.
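(A minimal sketch of the kind of single-weight ablation being described, assuming a Hugging Face causal LM is already loaded. The parameter name and coordinates below are placeholders, not the actual super-weight location reported in the paper.)

```python
import torch

def loss_on(model, input_ids):
    with torch.no_grad():
        return model(input_ids, labels=input_ids).loss.item()

def ablate_one_weight(model, input_ids,
                      param_name="model.layers.2.mlp.down_proj.weight",  # placeholder path
                      row=0, col=0):                                     # placeholder coordinates
    """Zero a single scalar weight and report the loss before and after."""
    weight = dict(model.named_parameters())[param_name]
    before = loss_on(model, input_ids)
    original = weight[row, col].item()
    with torch.no_grad():
        weight[row, col] = 0.0        # knock out the candidate super weight
    after = loss_on(model, input_ids)
    with torch.no_grad():
        weight[row, col] = original   # restore it
    return before, after
```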

I actually was not aware of it, and I don't think we talked about it: there was this paper titled Massive Activations in Large Language Models that had already demonstrated there are these massive activations, which in this paper they call super activations, that are outlier outputs in the internals of a large language model. An activation is just the output at a given place in a neural net.

So it had been shown this past year, really just months ago, that you have these, and it seems that these super weights play into that. So lots of interesting ideas here. It's kind of surprising to find out that there are these special weights that are massively important; I don't think I've seen anything that hinted at that before.

Jeremie

Yeah, it is genuinely fascinating, and I think mechanistically it's really interesting too. So with these weights, it's not that they have really large values and that's what leads to the large activation; they can take a range of values. The way they're found is essentially by looking at all the layers in the transformer, and specifically the MLP layers.

But let me take a step back here. When you have a transformer, the transformer is made of blocks, and the blocks are all stacked together. Each block has two different kinds of layers: you've got a self-attention layer, usually at the beginning, and then an MLP layer, which is basically just a vanilla neural network that follows and kind of massages the data that comes out of the self-attention layer.

And if you zoom in on that MLP layer, the second layer, there are a couple of different steps in it. The first is taking the relatively low-dimensional outputs of the attention layer and mapping them up to a higher dimension. So maybe you have a 512-dimensional output and you amplify it to, say, 2,000 dimensions, and then you can kind of massage it in that higher-dimensional space.

This is sort of like spreading your papers out on your desk so you can work on them better. And then you do a down projection to recompress. And what they found was that super weights consistently appear in the down projection part of these MLPs. They keep showing up in the transformer blocks' MLP layers.
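(For orientation, here is a minimal Llama-style MLP block; the layer names and dimensions follow common open-source implementations, and the comment marks the down projection being described. This is an illustrative sketch, not code from the paper.)

```python
import torch
import torch.nn as nn

class TransformerMLP(nn.Module):
    """Gated MLP of the kind used in Llama-style transformer blocks."""
    def __init__(self, hidden: int = 512, intermediate: int = 2048):
        super().__init__()
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)    # blow up dimensionality
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)  # gating branch
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)  # recompress; super weights reported here
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))
```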

And specifically, they show up in the part of that layer that doesn't blow up the dimensionality but compresses it back down, after the mixing has happened. So it's sort of interesting why exactly that happens. The super weight, again, isn't necessarily the largest weight by magnitude, even at that layer. And another indication as to how this is working: these tend to appear very early on.

So, in very early transformer blocks of the model. What they'll do is look for really high activations that persist through all the layers of the model, and once they find the first instance where that happens, they'll trace it back and mess with the weights until they make it disappear, thereby identifying the super weight.

The way this plays out in practice, again on the mechanistic side, is that the super weights seem to suppress stop-word probabilities. So there are certain stop words, like "the", or the period character, or a comma character, that will cause the model to stop generating outputs. And the super weight seems to essentially suppress those stop words, so it causes the model to keep generating outputs.

And if you knock it out, not only do you see that change, you see the quality of your output just go to shit. You'll see some of the outputs they show with and without the super weight, and it goes from beautiful, coherent text to complete garbage. Anyway, I thought that was really, really interesting.

The super-high activation, the super activation you get the first time the super weight takes effect, actually persists across the layers of the transformer. So you end up seeing it persist, if you're tracking it at a high level of detail, through the skip connections; it keeps appearing over and over in this robust way. So I think there's a lot to chew on from a mechanistic interpretability standpoint about exactly how and why this is working.

But like you, I was not tracking the earlier super activation results, so it's nice that we're able to get a little bit more of an explanation now. And this is research coming out of Apple, too, which is not normally known to me as a big interpretability powerhouse.

Andrey

That's right. And they do break it down a little bit. It seems the two are not exactly identical: if you restore the super activation but still zero out the weight, that doesn't reverse the effect; you still lose a lot of quality.

And just to give a sense of the impact, they highlight one example with the prompt "Summer is hot, winter is X." A normal LLM would say winter is cold, versus if you remove the super weight, it would say something like winter is "vuh". And these are the probabilities: "cold" has a high probability in a normal LLM, and without the super weight the probabilities are much more spread out. So, not too much more information or interpretation of what this is.

What they also do in the paper, on the more practical side of things, is explain how to do what they call super-outlier-aware quantization. So when you quantize, when you lower the resolution of the weights, that affects your activations across the neural net. And it's already been shown that you can do better quantization by keeping an eye out for important weights.

So in this case, what they show is that if you are careful to preserve these particular activations and these particular weights, you get a much smaller drop-off in performance. So also from a practical perspective, this is very useful to know for reducing the size of models without losing performance.

Jeremie

Yeah, it's actually pretty wild. They show you can take all the weights in the model and do eight-bit quantization, but then restore just the original super activation value in 16-bit, so you keep the high resolution just for that one, and you'll recover an awful lot of the performance you would have lost from the eight-bit quantization of the whole model. Which, I don't know, to me is super counterintuitive.
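(Here is a minimal sketch of that "outlier-aware" idea on a single weight matrix: quantize everything to 8 bits, but hold out one scalar from the scale computation and restore it at full precision. It's a toy round-to-nearest scheme on random data, just to show the mechanics, not the paper's actual method.)

```python
import torch

def quantize_8bit(w: torch.Tensor) -> torch.Tensor:
    """Plain symmetric round-to-nearest 8-bit quantization, dequantized back to float."""
    scale = w.abs().max() / 127.0
    return torch.clamp(torch.round(w / scale), -127, 127) * scale

def quantize_8bit_hold_out(w: torch.Tensor, keep: tuple) -> torch.Tensor:
    """Same scheme, but exclude one outlier from the scale and restore it at full precision."""
    held = w[keep].clone()
    rest = w.clone()
    rest[keep] = 0.0                  # don't let the outlier blow up the scale
    out = quantize_8bit(rest)
    out[keep] = held                  # put the "super weight" back in full precision
    return out

w = torch.randn(2048, 512)
w[10, 3] = 40.0                       # pretend this is the outlier super weight
naive = quantize_8bit(w)
aware = quantize_8bit_hold_out(w, (10, 3))
mask = torch.ones_like(w, dtype=torch.bool)
mask[10, 3] = False
print("naive error on other weights:", (naive - w)[mask].abs().mean().item())
print("outlier-aware error:", (aware - w)[mask].abs().mean().item())
```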

To my mind, and I'll close with this, the weirdest figure in the paper by far, and again one of those things I never would have bet on in a million years, is figure six. What they show is that if you take the super weight and scale up its value by some amount, you will actually see the zero-shot performance of the model on certain tasks

Consistently go up so across a bunch of different model sizes They tried this and they consistently found there's some scaling where you get improved quality Like basically just like take this weight that's already been trained increase its value and that's consistently a good move That makes no sense to me. That seems weird.

I mean, maybe you could argue, and I'd have to think more about their training scheme, that there's some kind of regularizing pressure that artificially dampens the weight relative to what it might otherwise have been, that it's not elastic enough or something. But still, I find this fascinating and I wouldn't have expected it.

So a lot of weird, quirky results here that I think the mechanistic interpretability people should take a look at, because it's pretty cool.

Andrey

And now to the next paper, also dealing with understanding how large models work, in this case diffusion models that generate images. The title of the paper is Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task. So here they explore composing things.

If your model can, let's say, create an image of a square and create a rectangle, can it then do something like a triangle on top of a square, to give a very simple example; you can of course come up with many examples of the sort. And the question is, how do these abilities emerge? How can you compose different concepts in your outputs?

The details are a little nuanced, but at a high level, they say that the ability to generate samples for a given concept emerges depending on the actual data and the process by which you generate the data. And there is this sudden emergence of the ability to compose, to do well on tasks that require compositionality. To be honest, I haven't dug in deep enough to explain this in a ton of nuance.

Jeremy, I think you can probably do your deep dive explanation and do a better job on this one.

Jeremie

No, no, that's great. First of all, I think this paper is really interesting from a security-of-AI-systems, AI-risk, national-security standpoint. There's been this debate about the sudden emergence of capabilities in language models, where you train and train and train, and then, seemingly out of nowhere, this model can suddenly, I don't know, help you make progress on designing bioweapons, or it can write malware.

Where did that come from? Could we have predicted it? A natural model for the appearance of those capabilities is to say, okay, let's say designing malware requires skills X, Y, and Z, and over the course of training the model gradually becomes better at X, gradually better at Y, and gradually better at Z. But to do the overall task, you have to do all three of those things together.

If you're at 80 percent performance on X, 70 percent on Y, and 90 percent on Z, then your overall performance is probably going to be something like 80 percent times 70 percent times 90 percent, because you have to succeed at each of those together, in other words multiplicatively, to perform the dangerous capability properly.

That's kind of the threat model. I don't think the paper itself is national-security motivated, but I think that's probably the biggest implication here for national security. So basically, what they do is make this really concrete. They take a diffusion model, so basically an image generation model, and they get it to generate objects with different shapes, colors, and sizes.

So think a small blue sphere, something like that. And they check, okay, how good is this at capturing the shape, the color, and the size? What they'll do is specifically not train the model on certain combinations of those features, so the model is never trained to make, say, a large purple square; it has never been trained to do that.

So you train it on a bunch of other combinations, and then you check how well it can perform on that new out-of-distribution task it was never trained on. And it turns out its success rate on that task is essentially multiplicative in its success rates on the individual components, the shape, the color, and the size, that went into it.

They compare their multiplicative model to an additive model, which doesn't perform nearly as well at explaining the emergence of these capabilities. And this kind of makes sense.
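
To see why the multiplicative model produces this sudden-looking emergence, here's a purely illustrative toy, not the paper's fit, with made-up per-skill accuracies at successive checkpoints.

```python
# Compound success stays near zero until the weakest skill catches up, then jumps.
checkpoints = [
    # (shape, color, size) accuracies at successive (made-up) training steps
    (0.30, 0.60, 0.70),
    (0.55, 0.80, 0.85),
    (0.80, 0.90, 0.92),
    (0.95, 0.97, 0.98),
]

for shape, color, size in checkpoints:
    multiplicative = shape * color * size
    additive = (shape + color + size) / 3   # an additive-style baseline
    print(f"multiplicative={multiplicative:.2f}   additive-style={additive:.2f}")
```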

Look, mathematically, if you've got a background in engineering or physics, you probably know about the Dirac delta function: basically, in the limit, if you have a large number of chained events that all have to happen together, each with probability less than one, your overall success rate is going to be essentially zero almost all the time, because you only have to fail at one of those things.

You only have to be at zero on one of those things for the overall success rate to be zero. But there comes a point where all of a sudden you crack the last nut: you're at 60 percent on one thing, doing decently on all the rest, but there's one thing really holding you back, and suddenly you crack it. And then, seemingly out of nowhere, it looks in retrospect as if you've uncovered this amazing new capability.

When in reality, that capability was a compound of many sub-capabilities, each of which had to be chained together. And when you multiply a lot of numbers between zero and one together, you tend to get something close to zero, unless and until all those numbers hit a minimum threshold. Anyway, that's what this is all about.

In some sense it's an obvious consequence of the same kind of math behind the delta function, and I think it's really interesting for that reason. Cool paper for AI security.

Andrey

Yeah, and from a practical perspective, most of the results here are on this synthetic task where they specifically vary shape, color, size, things like that. They show that in that specific case you do see the emergence of combinations of these concepts, and it deals a lot with the idea of these distinct concepts and how well you do on them.

But they also have a more practical task where they look at CelebA, which is a bunch of faces with attributes like gender, expression (smiling or not), and hair color. In that example you can look at the concepts and the performance on things like how well you're doing at faces of men and women, and you see something similar to the synthetic case.

So this could also help in practice to mitigate biased outputs in image generation, just as one example.

Jeremie

Yeah, one last thought too, and I think this is pretty important for large-scale trends in AI. In this whole debate about emergence, the sudden emergence of capabilities, there have been a lot of papers back and forth, somewhat pedantically in my view; you've had people argue that emergence isn't a real thing because, in retrospect, we can find that a capability actually did start to emerge smoothly over time.

But all that relies on knowing the right metric to look for in retrospect. In practice, we get surprised: we train a new model and go, oh shit, we didn't expect it to be able to do autonomous cyber attacks, and it just can. Yes, in retrospect you can design a cyber-autonomy benchmark that retroactively explains how you got here, but you need to design exactly the right benchmark to elicit those capabilities. This paper helps to focus what that debate was actually all about.

Fundamentally, that debate has been about identifying the right set of capabilities that compound together to lead to the capability you're interested in. The reason emergence is an issue in practice is that we don't actually know the full set of capabilities that need to be strung together to give us a certain kind of overall dangerous, or just interesting, capability.

So I think it's really interesting through that lens; it gives a new language, a new way of thinking about that, I was going to say, age-old debate. It's a debate that's been going on for maybe two or three years, so maybe not age-old.

Andrey

And a couple more stories. The next one is titled Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models. So another "mixture" kind of approach; we have mixture of depths, mixture of experts, and this is another one of those. Mixture of experts, as we know, is a pretty major deal. The idea there is basically that you take an input and you have a router that activates certain weights for certain types of tokens or inputs.

That means you can use subsets of the entire network for certain kinds of things. As a result, you use less overall computation while generally getting better performance, because you've trained more overall weights that are specialized. One way you can apply this is when you're dealing with multiple modalities, where different modalities can be routed to different experts.

This paper presents a sort of specialized version of that for multiple modalities. The idea of mixture of transformers is that when you want multiple modalities, like images, text, and audio, what you do is have quite literally separate transformers.

When you have your input, you group it by modality: you still have attention over all the modalities in the sequence, but then you route each modality to its own transformer, so you literally split them up across different weights. And what they show is that by doing this you get the benefits of mixture of experts, doing less computation overall.

In those middle layers you typically have feed-forward layers and things like that which would otherwise need to process the entire sequence, so by treating each modality separately, the individual transformers in the mix can be smaller and have fewer layers. They evaluate this idea against a more traditional mixture of experts and against a dense transformer and show a major speed-up, roughly a doubling of training speed, while also getting good performance.
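
For a sense of what that routing looks like, here is a minimal sketch of one such block, with shared global attention and a separate feed-forward per modality. The dimensions and structure are simplifying assumptions, not the paper's implementation, which also separates other non-attention parameters by modality.

```python
import torch
import torch.nn as nn

class MixtureOfTransformersBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_modalities=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One feed-forward network per modality (text / image / audio, say).
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        ])
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, modality_ids):
        # Global self-attention over the full mixed-modality sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Route each token to the feed-forward of its own modality.
        out = torch.zeros_like(x)
        h = self.norm2(x)
        for m, ffn in enumerate(self.ffns):
            mask = modality_ids == m
            if mask.any():
                out[mask] = ffn(h[mask])
        return x + out

block = MixtureOfTransformersBlock()
tokens = torch.randn(2, 10, 256)             # batch of mixed sequences
modality_ids = torch.randint(0, 3, (2, 10))  # 0=text, 1=image, 2=audio
print(block(tokens, modality_ids).shape)     # torch.Size([2, 10, 256])
```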

So yeah, it's really building on explorations of how you do multimodality. There are two predominant approaches, and this is extending one of those.

Jeremie

Yeah, there are a lot of interesting potential advantages to this approach. They route distinct modalities to distinct, let's say, experts, but really transformers, and then they have global attention that spans across them. So you benefit from cross-pollination between modalities, right?

If there's one token that's interpreted visually, say, and another that's interpreted as text, you can still benefit from that interaction, which is really important because often there's emergent information that comes from, say, the description of an image coupled to the image itself. The other advantage they have is training stability.

One of the big challenges right now is that as you scale these models up, you pretty quickly run into issues with training stability, with getting the loss to drop stably over time.

Because the mixture of transformers uses simpler, modality-based routing, it avoids the additional complexity of having to learn the routing process, as you do in a mixture of experts, which is one of the things that makes training particularly unstable. So essentially this is really helpful for scalability. They actually show that MoEs, in their experiments, hit some diminishing returns above a certain scale.

Above around 7 billion parameters in their experiments, it's more notable. You can come up with engineering solutions for a lot of this stuff, but the bottom line is that mixture of transformers is easier for them to use at the larger end of that scaling spectrum, so maybe an interesting option as well.

Andrey

And the last paper is Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations. The idea here is that in certain evaluations you have tasks or inputs that may have different valid outputs depending on the context. One example they give is, "What is a transformer?" If your context is that you're an electrical engineer, that has a different answer than if you're a machine learning engineer.

They tackle that general topic and say that if you specifically provide the contextual attributes, why are you asking this question, what kind of background are you coming from, et cetera, that can lead to more reliable and meaningful evaluations.

Jeremie

Yeah, I think it's also quite interesting as a call-out of the issues we have with model evaluations right now. They found, and they ran some experiments on this, that if you just add that context, you can actually flip win rates between model pairs. So you might find, for example, that Gemini 1.5 Pro seems to outperform Claude 3.5 Sonnet on a particular benchmark, but when you provide this additional context, all of a sudden that can flip.

So it's a question of which models work best with ambiguity and which work best with more context about the user and who's asking the question. And part of that is just the baseline behavior of the model, right?

If the model's baseline is skewed towards responding to you as if you're five years old, because of the way it was pre-trained or fine-tuned, and you have a benchmark that's geared in that direction too, you'll find the model performs better. So I thought this was kind of interesting as a bit of a flag about some of our relative rankings.

It even makes you think about, for example, the LMSYS leaderboards and things like that a little differently, right? Because now you're thinking, okay, sure, on average people give a higher Elo score to one model or another, but everybody is different, and some models may just be better at meeting the explanatory needs of one kind of person rather than another. The upshot is that you want evaluations to factor in context.

And they do find that more reliable, more robust evaluations come from doing that. To show this, they create a synthetic dataset of queries: they take basic queries from a standard QA benchmark, automatically generate a bunch of context about who's asking the question, then feed that to the model and evaluate based on that context.
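
A sketch of that recipe, with `generate` and `judge` as hypothetical stand-ins rather than anything from the paper; the key point is that both the generator and the grader see the same context.

```python
QUERY = "What is a transformer?"

CONTEXTS = [
    "The asker is an electrical engineer working on power distribution.",
    "The asker is a machine learning engineer building language models.",
    "The asker is a curious high-school student with no technical background.",
]

def generate(model, prompt: str) -> str:
    # Stand-in for a real model call.
    return model(prompt)

def judge(question: str, context: str, answer: str) -> float:
    # Stand-in grader: replace with a human rater or LLM judge that is shown
    # the same context. Here, a trivial keyword check returning a 0/1 score.
    return float("transformer" in answer.lower())

def contextualized_eval(model, query: str, contexts: list[str]) -> float:
    scores = []
    for ctx in contexts:
        prompt = f"{ctx}\n\nQuestion: {query}"
        answer = generate(model, prompt)
        scores.append(judge(query, ctx, answer))
    return sum(scores) / len(scores)

# Trivial usage with a dummy "model".
dummy = lambda prompt: "A transformer steps voltage up or down between circuits."
print(contextualized_eval(dummy, QUERY, CONTEXTS))
```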

That results in a much more robust set of responses, which presumably is a better evaluation of the model's actual ability to adapt to the needs of a particular user.

Andrey

And moving on to policy and safety, the first story deals with the EU. The title is "The Code of Practice for General-Purpose AI Offers a Unique Opportunity for the EU," an opinion piece co-written by Nuria Oliver and Yoshua Bengio, Yoshua Bengio being a major AI researcher who is also a major proponent of AI safety. The gist of the article is that just recently, on November 14th, there was the release of the first draft of the Code of Practice for general-purpose AI.

This code of practice answers a question for general-purpose AI, which is things like ChatGPT: how can you build these models while also addressing the potential risks they create? The article is basically making the case for that code of practice. It addresses the potential criticisms, noting that there are many stereotypes about Europe's approach to regulation.

It calls out that although many do criticize the EU for over-regulation, for slowing down innovation, and things like that, they argue the code can create a flexible, targeted, and effective framework for ensuring thriving GPAI (general-purpose AI) innovation while respecting the rule of law and the shared rights and values of the EU.

They say this has global significance, as it is the first time that legal rules are being turned into more detailed guidelines for the responsible development and deployment of general-purpose AI. So not surprising, I suppose, that a code was published alongside the actual AI Act; the code of practice suggests how companies should act to follow EU law regarding AI and develop models in a responsible way.

Jeremie

Yeah, it was sort of funny. I saw something on X with people talking about how it sometimes feels like the EU AI legislation-and-regulation complex is just constantly teeing up the next policy document, and one policy document seems to bleed into the next almost effortlessly. And this definitely has that feel. Nothing too surprising, I guess, in the set of recommendations that Bengio is calling for in this document.

It's stuff we've seen him advocate for before: alignment with EU rights and values, alignment with the AI Act and international approaches, proportionality to risks. That last one is obviously the big question, right? You can say, okay, we're going to do this in a way that's proportional to the risks, but which specific compliance measures, which specific risks? That's almost the entire debate; that's where people disagree.

The piece also calls for a future-proof approach. Obviously, we've seen people talking about tiered, compute-based approaches to regulating AI systems, and I think those have come out looking pretty good, especially in the wake of the inference-time scaling laws, right?

This idea that regulation through the channel of compute is a really promising approach, because there's no other way, really, to future-proof what you're doing. The one constant trend in AI is that if you want to do something really powerful, something that could eventually have dangerous capabilities, you're going to go through a regime where you have to scale your compute in a really intense way. Compute is expensive. It's easy to audit.

It's got all these features that make it a natural focal point for regulation, much more so than models or applications, which are incredibly difficult to regulate. If you wanted to regulate at the application level, which a lot of people have been pushing for, you'd have to get all up in people's business in a much more intimate way, the end user's business rather than the model developer's, and the model developer has far more resources to comply.

So anyway, I think this is going to be the topic of a lot of debate. A lot of these things are directional points that everybody already agrees on; the question, again, is going to be how you actually instantiate them.

Andrey

And the next story, dealing with safety, is from Anthropic, and the title is Three Sketches of ASL-4 Safety Case Components. ASL is AI Safety Level; it's part of Anthropic's Responsible Scaling Policy, the RSP, which outlines exactly what these AI safety levels are, one through three.

These categorize the potential risks that come with models at different levels of capability, and also the actions and safeguards Anthropic in particular is committed to in order to responsibly develop advanced AI. They haven't yet defined ASL-4, and this blog post provides hypothetical ideas of what the issues for models that reach that more advanced level might be and how you might mitigate them.

It builds on the recent report, which I think we discussed, about sabotage evaluations for frontier models. This blog post contextualizes the idea that, in general, these issues would fall into the area of sabotage and would have implications for catastrophic misuse risk, as well as autonomous replication capabilities and autonomous AI research.

So basically, the model can go off and do stuff you don't want, make itself super powerful, and maybe catastrophically destroy humanity. It's quite a long piece detailing hypothetical approaches to dealing with these scenarios, released with the idea of getting some feedback and conversations going while they explore the idea of ASL-4. And Jeremy, of course, you certainly have more to say on this one.

Jeremie

Well, I thought this was an interesting read. I think Anthropic is doing a good job, I will say, unlike OpenAI it seems, of thinking in public about what intense levels of autonomy start to look like. One of the things I've talked to quite a few people at Anthropic about is how uncertain they are about what to do at ASL-4.

Once you get to systems that are genuinely autonomous... they've committed, by the way, to having a plan for ASL-4 by the time they have their first model that meets the ASL-3 threshold, which could be imminent. It could happen this year, and is very likely to happen next year, according to a bunch of people I've talked to; Dario, I think in his podcast with Lex, shared that too. So they need to have ASL-4 online soon.

This is one of their first attempts to sketch this out, and they are confused internally about exactly what ASL-4 will require. The reason they're confused, in part, is that we currently lack the techniques we need to successfully audit the behavior of AI models at that level of capability. And this comes through in a lot of the safety case sketches, say that three times fast, that they share here.

They have three different sketches of scenarios that they consider quite plausible, that we could run into in the coming years as we approach that ASL-4 threshold. I'll just mention one: their case study in mechanistic interpretability. The thing I like about this, by the way, is that it gives you a good sense of the bag of tricks they expect to lean on once you get to this stage.

That is, the stage of models that are autonomous enough, that have a long-term understanding of their goals and an understanding that it's to their advantage to do things like power-seeking: holding off on displaying certain capabilities while they're being evaluated so they can be deployed, get access to a wider action space, and ultimately achieve their goals in more extensive ways.

What they're looking at is using sparse autoencoders; that comes across very clearly. It's something I've heard a lot about, not just at Anthropic but elsewhere. Other labs are thinking about this too, but Anthropic has been really ahead of the game on SAEs.

Basically, and we've talked about this a lot, it's a way of probing the structure of the activation space of your model and identifying whether the model might be activating certain concepts related to deception, escape, things like that, and essentially leaning on that as one measurement. That's one theme: automated interpretability techniques that scale really well, which is something they're really focused on as well.
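
As a toy sketch of what that measurement looks like, with the training omitted and the feature index, dimensions, and threshold all hypothetical:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=4096, d_features=32768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # sparse feature activations
        return features, self.decoder(features)

sae = SparseAutoencoder()          # in practice, trained to reconstruct model activations
DECEPTION_FEATURE = 12345          # index a researcher previously labeled (made up)

resid_acts = torch.randn(8, 4096)  # residual-stream activations for 8 tokens
features, _ = sae(resid_acts)
if (features[:, DECEPTION_FEATURE] > 5.0).any():
    print("deception-related feature fired: flag for human review")
```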

They're also looking at activation engineering, right? We were talking about that about two years ago, I think, when the first cheese-maze paper came out, when Alex Turner, I think it was, posted it on a fairly obscure blog. We talked about that here.

At the time it was already clear it was going to be part of, not necessarily the diagnostics, though now they are looking to use it for that too, but the mitigation, the actual corrective action if you do spot indications of deception or bad behavior in your models: how you can use activation steering to change that behavior. So I thought it was really cool to see it pop up concretely here.
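
Here's a minimal sketch of that kind of steering, using GPT-2 as a small stand-in model; the layer, the coefficient, and especially the random steering vector are assumptions, since in practice the vector would be derived from contrasting activations (e.g. honest vs. deceptive completions).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Random vector for illustration only; a real one comes from activation contrasts.
steering_vector = 0.1 * torch.randn(model.config.n_embd)

def steer(module, inputs, output):
    hidden = output[0] + steering_vector      # shift the residual stream
    return (hidden,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)
ids = tok("The assistant decided to", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=10, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
handle.remove()
```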

Case study three is also really interesting. They talk about figuring out when your model might be relatively benign after pre-training, and then after fine-tuning you essentially cause one of its personas to emerge. The idea behind this is, again, something we've talked about. It's fairly nuanced, but it comes down to the fact that language models are generally pre-trained with cross-entropy loss.

Cross-entropy loss is what's known as a mass-covering objective. In other words, it assigns some amount of probability to all plausible completions of a sequence, rather than just focusing on the most likely one. It has that property because it penalizes the model heavily for assigning really low probabilities to sequences that do end up occurring.

You want that feature, in contrast to other objectives that could encourage the model to focus all its probability on a single best completion. That's relevant because they're arguing that pre-training naturally leads to models that can exhibit a lot of different behaviors or personas, because the model needs to maintain some probability on all the different ways the text could plausibly continue. So it's got to hedge its bets.
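
A tiny numeric illustration of that mass-covering pressure: assigning near-zero probability to a continuation that actually occurs gets penalized enormously, so the model is pushed to keep some probability on every plausible continuation rather than betting everything on one.

```python
import math

for p in [0.5, 0.1, 0.01, 1e-6]:
    print(f"p(observed continuation) = {p:<8}  loss = -log p = {-math.log(p):6.2f}")
```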

Naturally, that means it's got to be a composite of sub-models, each of which could pursue alternative strategies. So there's a risk, if that's the case, that through post-training you may end up amplifying one of these personas or sub-models that is more capable of, or inclined towards, strategic deception. They talk about their mitigation strategy, all that stuff. None of this is final.

None of this is meant to be the game plan, but I think it's an early hint of what the game plan could be as you approach more and more automated, drop-in-replacement researchers, which I think is ultimately where we're headed. Kind of cool.

Andrey

And next, shifting more into U.S. policy, we've got a duo of stories dealing with the TSMC fab in Arizona and the CHIPS Act. The first story deals with a finalized deal for funding under the CHIPS Act for this fab. We are told that TSMC will get $6.6 billion in direct funding for the fab and also $5 billion in loan guarantees.

Under this deal, TSMC is committed to making chips on its most advanced production node that we know of today, the A16, in the U.S., although that will only happen around the end of the decade, three years after it enters mass production in Taiwan. A related story that also came out is that Taiwan's Minister of Economic Affairs mentioned, I guess in conversation, that under current rules TSMC cannot make its most advanced chips abroad.

So there's actually a law or rule that whatever they produce abroad has to be a generation behind whatever is being produced in Taiwan. So it seems like those stories go hand in hand.

Jeremie

Yeah, it's kind of funny, because Taiwan's domestic policy is essentially designed to make sure the earth teeters on a knife's edge when it comes to the semiconductor supply chain. The worry is that Taiwan could be invaded by China, that its fabs get knocked out, and that the world's semiconductor supply chain gets nuked as a result.

So naturally the United States and other countries want to onshore, to get TSMC to build fabs in their countries so they're more robust. But from Taiwan's perspective, they don't want that, because it takes away their leverage; it would mean the most advanced nodes are no longer being made only in Taiwan. And now they have an actual policy to enforce that, which is kind of interesting.

Now, the way he phrased it was that they will not be producing two-nanometer chips abroad. But as you said, it's not actually about the two-nanometer chip per se; eventually they will. In fact, there are plans to do that. TSMC's first Arizona fab is going to be a four-nanometer fab, which is actually starting up pretty soon, in the next couple of weeks.

But there's a second fab and a third, and both of those are going to reach all the way to two nanometers and beyond, coming online more in the 2028 era. So we are eventually going to get two nanometers in the United States, but we are not necessarily going to get the leading node, under Chinese, sorry, under Taiwanese law. By then you're going to have the next generation of chips online, the A16 and so on.

This statement was apparently made in response to concerns that TSMC might be forced to produce some of these chips in Arizona ahead of schedule once Donald Trump was re-elected, and, who knows, but it is interesting how the election is having an impact. TSMC is claiming there's no impact, that all of this is business as usual, whatever.

But I suspect there's going to be some tougher negotiating around things like this in the future.

Andrey

And a couple more stories. This one is a callback to something we already covered in this episode; the article is "OpenAI to present plans for U.S. AI strategy and an alliance to compete with China." We already mentioned how, as part of this policy proposal, OpenAI made the pitch for a $100 billion data center. And as you mentioned, Jeremy, there's an overall blueprint, and it does propose some things that sound like acts, like laws.

For instance, there is a proposed National Transmission Highway Act to enhance power, fiber connectivity, and natural gas pipelines. They also have this North American Compact for AI that would form an economic bloc meant to compete with China. So I guess OpenAI is getting into the game of trying to shape policy with these kinds of ideas.

Jeremie

Yeah. One of the interesting things: overall, they're calling it their blueprint for AI infrastructure. They're talking about setting up artificial intelligence economic zones and tapping the Navy's nuclear power experience to support, essentially, government projects for power that would also be funded by private investors. So they're looking at how to align incentives there, which is really good.

The U.S. needs to rethink its energy strategy in a big, big way. AI is set to be double-digit percentages of total U.S. electricity demand in the coming years, by 2030, starting from more like 4 percent right now for data centers at large. And that may actually accelerate even faster.

The only thing that happens if you don't build energy infrastructure in the United States is that these data centers get built elsewhere. That is what will happen. We're already seeing that, or at least flirtations with it, with the UAE and other autocracies, governments like that, where you don't want to be building your key national security infrastructure.

I will say, OpenAI, in what is at this point pretty typically self-serving fashion, is picking their arguments pretty carefully here. They made the case that this would be great for job creation, you'd create tens of thousands of jobs, and it would boost GDP growth and lead to a modernized grid and all that stuff. I had a nice little chuckle about that, just because OpenAI is looking at automating the entire economy.

So sure, I'm sure that's the game plan; it's all about those jobs. Anyway, it may create those in the very short term, but the bottom line is that's not where the future is headed. There's a lot of bombastic and, I think, accurate stuff: they say AI will be as foundational a technology as electricity, promising similar distributed access and benefits, blah, blah, blah.

You can see them tweaking their language in real time to be more Republican-coded after spending years playing the Democratic side of the ticket, so it's just sort of amusing to see this happen. Anyway, there's a lot of thoughtful stuff there, but it is very obviously self-serving, and there are conditions that surely ought to be tied to building the kind of infrastructure we're talking about here, right?

The security situation in these labs is abysmal. OpenAI in particular has shown itself to be just not interested in engaging with whistleblower concerns over security, whether you look at Leopold Aschenbrenner or any number of things that we highlighted in our report from earlier this year. So I think you've got to carrot-and-stick it.

Yes, offer the power infrastructure, but it's got to come with concrete requirements for heightened security at these labs. And again, OpenAI in particular has historically shown it's interested in security only to the extent that it allows them to get by Congress and have decent relationships with the executive branch and so on, but not really at a fundamental cultural level among the executives at the lab.

And I think you've got to look at Sam Altman for that in particular.

Andrey

And now to the last story, again dealing with OpenAI and going back to safety: OpenAI loses another lead safety researcher, Lilian Weng. That's pretty much the story. There was an announcement by Lilian that her last day will be November 15. She's a long-time employee at OpenAI, seven years, was a leader on the safety team, and is also, I will say, famous to me as the author of many very detailed blog posts on Lil'Log, very deep, great overviews of AI research.

Again, we don't know how much to read into this. Of course, it's part of a trend of safety figures leaving OpenAI. It could mean there are some internal disagreements; it could also mean it was simply the right time to move on after many years. Either way, this is the news on this front.

Jeremie

That's true. I mean, I looked at her tenure, seven years, which is quite something. So yeah, it could just be that. It's also the case, though, that if you think your lab is on track to build ASI, AGI, whatever, and you're on the safety team, you'd probably try to stick around almost no matter how painful that is, if you think you have any kind of influence, if you think people will listen to you.

Certainly what I've heard talking to whistleblowers at OpenAI, some of whom have left recently, is that the feeling is very much that they were disempowered. This happened on the safety side and on the security side: systemically ignoring national security concerns that frankly were raised, and trying to shut them down.

So in that context, it could be another instance of that trend: nobody's listening to me when I say the CCP is a threat, or that we're not on track to solve our control problems, so I'm going to leave. Or it could just be that it's been seven years. Hard to know; speculators are going to speculate.

Andrey

Exactly. Yeah, I guess we'll see whether Weng moves over to Anthropic. And that does it for this episode. We did hit probably a little more than an hour and a half, but still not bad; I will pat us on the back for this one. So thank you for listening.

As always, you can go to lastweekin.ai for the text newsletter and for the links to the news we covered, which are also in the episode description. As always, we appreciate your comments, your reviews, and your sharing of the podcast with friends and coworkers, et cetera. But more than anything, we appreciate you tuning in. So keep on doing that, and enjoy this outro song.

AI Singer

Pray to say I state and re. Supervised at work, creating futures rare. Join us in this journey, through the bits and bytes. Coding our tomorrow, amidst the blinking lights. Each node and universe, in this ever growing chain. Last week in AI, we're breaking the domain. In the curve of dreams, where the logic flows.

The street in the club that grows faster Was hummed like the heartbeat of a new age Deep inside a code, a revolution to engage Fetched out deep into A. I. 's electric brace Reminds so agile in a digital space We'll sequence the song, we'll find our place Last week in A. I., a new era we embrace In circuits we fly, circling waves of thought In the halls of silicon towers, where dreams get caught I've been with proboscis, looking at you like St. Pete, sleep away, sleep night, in the blue top, St.

Pete. Whispers of code, the chorus is insane, exploring the unknown, the paradise zone. Last week in AI, where logic takes flight, in the tapestry woven by the mind's electric light. And in the light, the light, by a sound of the note, you say, the hell, when's the change, sir? We're standing on the edge of a universe underneath the stars. We're digital aliens, we're the only ones who can see the stars. Man, In a

Speaker

dance of minds, New horizons expand, On this evolving path, Let's chase the dawn, Last week in AI, With every bit to spawn, A world so bright. Circuit sweep the story, every charge it takes And now you're

AI Singer

gonna get us, we should take our stage Patterns in the data, the future turns its page In the balls, AI, we find our refrain Massive minds are forming on the digital plane And deceive electric dreams Let us sail tonight, let's make an AI Illuminated in the night

Transcript source: Provided by creator in RSS feed.