#175 - GPT-4o Mini, OpenAI's Strawberry, Mixture of A Million Experts

Jul 25, 2024 · 2 hr 47 min · Ep. 214

Episode description

Our 175th episode with a summary and discussion of last week's big AI news!

With hosts Andrey Kurenkov (https://twitter.com/andrey_kurenkov) and Jeremie Harris (https://twitter.com/jeremiecharris)

In this episode of Last Week in AI, hosts Andrey Kurenkov and Jeremie Harris explore recent AI advancements including OpenAI's release of GPT-4o Mini and Mistral's open-source models, covering their impacts on affordability and performance. They delve into enterprise tools for compliance, text-to-video models like Haiper 1.5, and YouTube Music enhancements. The conversation further addresses AI research topics such as the benefits of numerous small expert models, novel benchmarking techniques, and advanced AI reasoning. Policy issues including U.S. export controls on AI technology to China and internal controversies at OpenAI are also discussed, alongside Elon Musk's supercomputer ambitions and OpenAI's Prover-Verifier Games initiative.

Read our text newsletter and comment on the podcast at https://lastweekin.ai/

If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form.

Email us your questions and feedback at [email protected] and/or [email protected]

 

Timestamps + links:

Transcript

AI Singer

Last week in AI, we bring you news. Keep you in the know today. Trends and breakthroughs, all the headlines that will make your way. Stay updated, never fall behind.

Andrey

Hello and welcome to the latest episode of the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual in this episode, we will summarize and discuss some of last week's most interesting AI news. And as usual, you can also head over to lastweekin.ai for the text newsletter of Last Week in AI with even more articles. I am one of your hosts, Andrey Kurenkov. I did a PhD where I studied AI a while ago at Stanford and I now work at a generative AI startup.

Jeremie

Hey everybody, I am your other host. I'm Jeremie Harris. I'm the co-founder of Gladstone AI, an AI national security company that you have heard about on the podcast before, uh, cause I always say this at the beginning of every podcast. So this is fun. We're really getting meta now. Um, yeah, yeah. That's, that's what I got to say, actually.

Andrey

Hopefully it's useful to keep saying it. I figure it's good to give a background if there are any new listeners, but, uh, I think you're right.

Jeremie

It just feels, it feels weird and scripted every time, but that's just the information that, that the people hopefully need or want. I don't know.

Andrey

Yes. Uh, and Jeremy, you know, I think we did manage to improve your audio quality by just tweaking the recording, but now I am in a different space. In fairness, that's

Jeremie

because you're joining us from the bottom of a well,

Andrey

which

Jeremie

is,

Andrey

which is fair. It might sound like that. I guess we'll see if I can post process it to be a little less echoey, but if that's the case, just know it's because I'm traveling and not because I suddenly stopped knowing how to make the audio be good. And before we dive into the news, as usual, just want to shout out to a few new comments and reviews. Got one new review on Apple

Podcasts from blasting Fonda, which is, uh, yeah, a nice, uh, review. It says we are pretty balanced, says Goldilocks, just right amount of doom. So there's a

Jeremie

just right amount of doom. That's it. That's our,

Andrey

our main goal here over here. So thank you for that review. Appreciate it as always. And some more comments on YouTube as well. Uh, do appreciate the feedback there. Someone commented on the, uh, point regarding how much geopolitics we cover, kind of pushing back a little bit, saying that it's good to address, and we will keep bringing that up, just maybe a little bit less. Uh, and, uh, also someone mentioned that, uh, their main takeaway from last week was, and you're just going to come

Jeremie

back.

Andrey

Yes. Yes. Uh, our little weird opening joke. It was

Jeremie

a weird interrupt. Yeah. You know, when like you start talking, I don't know if you guys have this experience sometimes you start talking and then you just, you don't know how the sentence is going to end kind of a la Michael Scott and you just keep going and you're like, Oh no, this is really getting me in trouble. And you just keep going and you can like, I'm doing that right now. So I did that yesterday or very last episode. Um, I forget what I said.

I said I was on the side of mothers, specifically mothers of like three or four children or something, because we had a weird comment about that. So anyway, I remain pro mother, uh, for what it's worth.

Andrey

Yep. Still the case. Well, that's, uh, enough for comments. As always, we do look at them, even if we don't mention them. So thank you to everyone who does review or comment. Now on to the news, starting with the Tools and Apps section as usual. And the first one is one of the big stories of this past week. It is GPT-4o Mini, the latest release from OpenAI. So this is, uh, as the name implies, a smaller and cheaper version

of GPT-4o, and this would be replacing their kind of, uh, lowest tier of models. So they have GPT-3.5 Turbo, which was the smallest one they had, and now it will essentially be replaced by this GPT-4o Mini model, which is going to be priced at 15 cents per million input tokens. It's more than 60 percent cheaper than GPT-3.5 Turbo. So it's significantly cheaper than even, uh, Claude Haiku, the smallest model from Anthropic, which I believe is 25 cents per million input tokens.

So this is really showcasing the dynamics of the race to the bottom on pricing going on here. Uh, and yeah, I'm sure. It seems to be pretty comparable to GPT-4o and better than 3.5 Turbo, also showcasing this thing we've seen over and over this year, that the leading labs are finding ways to miniaturize their models, make 'em smaller while still retaining their performance to an impressive degree. So yeah, really, uh, significant announcement here.

Jeremie

Yeah, and there's a whole lot of stuff obviously going on in the background that we don't see here. Right. OpenAI has not come out with a deep technical report. They haven't done that in a long time with these models. Neither have the other labs, but, um, there's a lot of interesting stuff to note here. First off, the economic case, right? Priced at $0.15 per million input tokens, $0.60 per million output tokens.

So if you look at how cheap that is, an order of magnitude more affordable, as they say, uh, than previous frontier models, and more than 60 percent cheaper than GPT-3.5 Turbo. So one interesting thing that comes to mind when you look at numbers like that, right, there's no way that they're making money off, you know, simple kind of interactions, ChatGPT-style, with this model, right?

Once you get into that zone where you're talking about 60 cents per million output tokens, you're basically guaranteed that the bulk of your use cases involves sort of inference-time reasoning, you know, in some sense, like where you're not just doing one query to get one response, you're querying the model multiple times to kind of refine and iterate on its answer, right? You're in that kind of gray zone where you're maybe on the way more and more towards agents.

This is something that we've talked about a lot in the podcast as an inevitability as the price of inference gets lower and lower and lower. And that's exactly what we're seeing here. It's part of a sort of paragraph that you can think of as a mini manifesto that OpenAI put at the bottom of this announcement, where they're talking about, like, hey, yeah, essentially we're heading, in Sam Altman's words, towards a world with intelligence that's too cheap to meter. That is the goal here.

And that means that all the interesting stuff, all the ROI, if you're going to be a company like OpenAI, eventually starts to look a lot more like inference-time reasoning than just, you know, zero-shot responses to, uh, customer queries and things like that. So I think that's a really interesting marker that we're very much on the way there. You highlighted as well.

The impressive capabilities given how small, how cheap this model is: 82 percent on the MMLU, outperforming GPT-4 on chat preferences on the LMSYS leaderboard. So if you go to the LMSYS leaderboard, like, check the head-to-heads of which models people prefer, there's an Elo sort of rating system they use there where they basically compare, you know, uh, anyway, users' opinions about which models, uh, perform best on a chat basis.
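For reference, here is a minimal sketch of a classic Elo update driven by a single head-to-head chat vote; the Chatbot Arena leaderboard's actual rating computation differs in its details, so treat this purely as an illustration of the idea, with the ratings below being made-up numbers.

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """Update two Elo ratings after one head-to-head comparison."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

# e.g. a 1200-rated model beats a 1250-rated model in one user vote:
print(elo_update(1200, 1250, a_won=True))  # -> (~1218.3, ~1231.7)
```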

Um, and, and this one seems to be very competitive with absolute frontier models. So one of the things this tells us, to your point, Andrey: yeah, we're seeing smaller models kind of, like, move mountains, like do the things that much bigger models would in the past. Doesn't mean scaling is broken, quite the opposite. What this means, you can think of it as, you know, OpenAI's release cadence.

In fact, all frontier labs work this way: they'll aggregate a crap ton of compute, use it for a giant training run, produce something like GPT-4. And then for the next year or so, they're mining that model for all kinds of different use cases, including doing things like knowledge distillation. So taking that much bigger model, finding more and more clever ways to distill all that knowledge into smaller, more efficient, compact models.

That allows them to learn stuff they can then apply to the next training run, to do an even more efficient training run with even more compute. And this is why we're really seeing exponential increases in capabilities, because you've got more compute compounded with all those lessons learned from compacting these models and making them more efficient.
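As a rough illustration of the knowledge distillation idea mentioned here, a minimal sketch of the textbook temperature-scaled distillation loss in PyTorch; the actual recipes the frontier labs use are not public, so this is just the standard version under assumed inputs.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: push the student toward the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth next tokens.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```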

So there's a whole bunch of great information about performance in this thing, but, um, uh, yeah, I think this is going to be a really, really interesting shift in the economics around, uh, inference time compute.

Andrey

That's right. And, um, yeah, as you might expect, it's also faster. They say it's, uh, about twice as fast as GPT-3.5 Turbo and GPT-4o. And, uh, just to give some context or perspective, yeah, you say, like, 60 cents per million output tokens. So a million output tokens is about 750,000 words. And, uh, a paying ChatGPT subscriber is paying $20 per month.

So, for that $20, uh, to, I guess, be profitable, or to make full use of that cost at that, uh, inference output price, you would need to output like 25 million words. Um, and that's if you're paying; uh, you can access GPT-4o for free. But it's just an interesting thing to think about, how far $20 gets you with these types of prices.
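A quick back-of-envelope check of that arithmetic, using the prices quoted above and the rough rule of thumb that a token is about three-quarters of a word:

```python
PRICE_PER_M_OUTPUT_TOKENS = 0.60  # USD, GPT-4o mini output pricing cited above
WORDS_PER_TOKEN = 0.75            # rough rule of thumb: 1M tokens ~ 750,000 words
SUBSCRIPTION = 20.00              # USD per month for ChatGPT Plus

tokens = SUBSCRIPTION / PRICE_PER_M_OUTPUT_TOKENS * 1_000_000
print(f"{tokens:,.0f} output tokens ~ {tokens * WORDS_PER_TOKEN:,.0f} words")
# -> 33,333,333 output tokens ~ 25,000,000 words
```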

Jeremie

I think it also just completely makes the case that, at least for, um, uh, at least for your casual user, right, this is not something that's just being used to, I don't know, like, answer Q&A-style queries. This is not the interaction type that this is being built towards. You know, there clearly has got to be some way that a crazy number of tokens, a large amount of compute, is being dedicated to every query that's being asked.

There's just no other way the, you know, the economics here end up working out. So it's, it's really interesting.

Andrey

Yeah. And, uh, also worth mentioning, OpenAI, in addition to GPT-4o Mini, also announced new tools for enterprise customers, uh, dealing with, uh, things like regulated industries, uh, that require things like compliance. So these are timestamped interactions, uh, the ability to audit, uh, ChatGPT Enterprise data.

So that's one of those steps where if you want to make real money, you go to enterprise, and in enterprise you need these much more fine-grained things to deal with regulation and compliance. Uh, so presumably, you know, that's where OpenAI really wants to make their money. Not so much ChatGPT Plus, but, uh, these kinds of things.

Jeremie

Yeah, they do. By the way, just on the safety side, they do have a note at the end that about 70 external experts in a whole bunch of different fields, including, they say, social psychology and misinformation, tested GPT-4o to look at potential risks. And they've got a preparedness scorecard, which I guess is going to be consistent with their preparedness framework that they announced earlier this year, that they'll be putting out later. That's not out yet.

So if you're interested in the security or safety side, I guess that's forthcoming. Maybe keep an eye out for that.

Andrey

Next up, meet Haiper 1.5, the new AI video generation model challenging Sora and Runway. So there you go, it's a new text-to-video model from Haiper, a London-based startup. Uh, this was founded by former Google DeepMind researchers. And this is an upgrade to their previous model. So now you can go from, uh, just four-second generations to eight-second clips here. And it also is fully HD; before, it was a bit lower resolution. So yeah, yet another player in this text-to-video space.

I can't say that I was aware of Haiper, but it just goes to show how busy this is getting.

Jeremie

Yeah, well, and apparently, like, just like you, this is the first time hearing of them, and, um, apparently they claim to have onboarded over 1.5 million users on their platform, which is remarkable.

I mean, people often talk about generative AI use cases, especially in, uh, you know, you can think of it as the more speculative aspects of generative AI, especially those multimodal domains that don't quite yet have, like, you know, the same level of proven market value as ChatGPT, some argue, may. Um, so now, yeah, I mean, 1.5 million users.

I'd love to understand, you know, there is, uh, obviously a whole bunch of famous examples in Silicon Valley of companies that boom and bust really hard, you know, going way back in the day to, like, social camera or whatever, these companies that just, yeah, go vertical, but they just don't have enough value to keep users locked in. So I wonder if this is just sort of a waitlist explosion situation and, you know, maybe the value is not there, but, but it could well be.

And the eight-second video, these are short videos, right? Four seconds before; presumably they built their 1.5 million user base off the four-second version, because they're only launching the eight-second version now. Um, you know, I guess the only things that come to mind are sort of stock video. You know, you're, you're looking at things like Pexels and, and so on that, that you might be competing with. Maybe that's the play. Uh, but, uh, yeah, it, it's super impressive.

And there's a video obviously that they, they do show on the, um, sort of VentureBeat article here that they have embedded, and it's, it's pretty good. Uh, it's a, uh, a lady with red hair, and, you know, it's moving in the breeze as it should, and there are clouds in the background, I guess that's a thing. So, uh, yeah, really impressive.

Andrey

Yeah, and, uh, I guess as we have seen with other entrants in the space, like Luma, it's still not at Sora quality, sort of mind-blowing; from what I can see, it's, you know, still more obviously AI. Interesting that this company, uh, came out of stealth just four months ago, at the, uh, beginning of March, uh, with $13.8 million in seed funding. So a decent amount of money, but not, you know, a gigantic amount of money for training video generation models.

So we'll be interested to see if, if this remains one of those, uh, companies vying for dominance in the text-to-video space, uh, of which there are now, what, three, four? It's a lot.

Jeremie

And they're starting, the argument for them is starting to sound very similar as well, right? Like, I, I find every time I read one of these text-to-video, um, launch announcements, you just, you have to look for the AGI section where they're going to argue, as they do here, that, uh, you know, this is a physics engine.

Essentially they're, they're building a world model, as, as you know, Sora first very publicly, I guess, made the case that there was a link between video generation and world model development, which some have argued is already a thing with text data even. Um, but yeah, so, so the case is being made here. You know, this is an interesting, um, uh, an interesting step towards world understanding and creating something like AGI that could replicate the emotional and physical elements of reality.

The article says, um, you know, they do highlight a couple of bumps in the, uh, in the product to date. It's, by the way, 24 bucks a month billed yearly. So that's actually quite pricey. Um, but, uh, but yeah, it's, uh, apparently, you know, some, some issues in terms of blurriness, uh, of the videos or overuse of the subject, uh, and, and object detail.

So I guess a lot of finickiness with respect to the prompt, high prompt sensitivity, um, which is not always a great thing. But, uh, but still, yeah, another, another player, we'll see how they can compete. I mean, it's, again, it kind of makes me wonder, like, this stuff is at risk of getting commoditized. Where's the value actually going to accrue in this space, and what companies are actually going to survive the test of time? It's, it's not at all clear, but here we are. Yeah.

Andrey

I, I just remember just last week we covered Odyssey, another entrant in the text-to-video space. They were making the case of being, like, Hollywood grade and having, uh, these additional controls over lighting and stuff. So, uh, yeah, I think we'll see some evolution of the business case happening pretty soon here, cause, uh, a lot of competitors and not a lot of revenue being made yet, I'm pretty sure. Onto the Lightning Round, first up we have Anthropic releases Claude app for Android.

And there you go, this is just your ability to chat to, uh, Anthropic's ChatGPT-type, uh, chatbot Claude. It will allow you to have free access to Claude 3.5 Sonnet, and also some additional stuff if you are subscribed to their Pro or Team subscriptions. That's pretty much the story there, and Anthropic continuing to sort of try and commercialize, release, and get more users, I think in a bid to catch up to OpenAI.

Jeremie

Yeah, and apparently one of the key features, maybe unsurprisingly here, is the ability to sync conversations with Claude across devices, um, and also upload photos or videos, things like that, for real-time image analysis. I think the, uh, the cross-platform piece is interesting, especially as we start to get into this world where, you know, we're, we're looking at more and more agent-like models where you want some kind of consistency, some kind of continuity of your interactions.

You know, maybe you move from one room to another, uh, you know, you want to have that, uh, that sense that something is following you. Maybe you don't want the sense that something is following you, cause that sounds a little, like, terrifying. But anyway, um, that's, you know, to have this more natural user experience, that, that may be an important piece. So kind of cool.

And apparently the, the iOS app, you know, when it first launched, did see a pretty, as they put it, tepid reception. Um, that was two months ago, I think back in May, and they had about 150,000 total global downloads. Um, that compares pretty poorly to ChatGPT's initial launch, which had half a million during the first five days. And so, um, yeah, I think, uh, you know, Anthropic's trying to make up for, make up some ground here.

They have this narrow window, right, where Claude 3.5 Sonnet is the best performing model on the market for now. That may change, it may change soon, who knows. And so, you know, while you're ahead, uh, keep pressing your advantage. I guess it's a great time for them to be launching Claude and the, uh, Android app.

Andrey

Yeah. I just Googled it out of curiosity. ChatGPT on iOS is the number one app in Productivity; it has 1 million ratings. Claude is number 38 in Productivity and has 4,800 ratings. So yeah, fair to say they're a little bit behind, but also they're moving very fast with these kinds of releases, for sure.

Jeremie

Yeah, they are.

Andrey

Next up, Google Vids is available to test out Gemini AI-created video presentations. So this is a new productivity app in the Workspace Labs section of Google, and it will allow you to create presentation videos by dropping docs, slides, voiceovers, and video recordings into a timeline. So it, uh, doesn't generate a video per se. It generates a presentation based on user instructions to Gemini. And then, uh, yeah, it's currently in preview.

So you would have to have kind of a Workspace subscription, but presumably in the future this will be more widely rolled out.

Jeremie

Yeah. Um, apparently there are a bunch of key features they're flagging here, where you can get Gemini to automatically insert things like stock footage for you, um, make your script, uh, for the video, and, and even give it, as they put it, an AI voiceover, which is kind of interesting. Uh, as somebody who produces quite a few videos, it's, uh, you know, you're starting to look at, starting to look at the creep into, uh, into that space and being able to, to create these things.

Um, it's a more piecemeal approach, it must be said, than, uh, just, like, straight-up Sora or some of those other products that we've talked about, right? This is like, you know, one kind of subset at a time of these, uh, of the video production process steps that they're, they're automating in this case.

Andrey

And one more story, YouTube Music sound search is rolling out, and AI conversational radio is in testing. So YouTube Music, if you don't know, is the Spotify-type app from Google. And they have now launched the sound search feature, uh, letting people search by recording a little clip of a song that's playing, uh, something that has already, uh, been on, you know, other platforms, but not in this one.

And apparently they're testing an AI-generated conversational radio for U.S. Premium subscribers that creates custom radio stations by describing what they want to hear. And, uh, it sounds a little bit like Spotify's AI DJ.

Jeremie

Oh, interesting. Yeah, that's true. I'm just, I'm just stuck on this AI generated conversational radio. You know, what, what? So,

Andrey

uh, yeah, just for some context: in YouTube Music, you can do a thing called start radio, and radio is just, like, creating a playlist based on a song or artist or whatever else. So it seems like here you are able to just enter a prompt; it will let you ask for music any way you like. Just ask for a vibe, ask for whatever, and that'll be interpreted into a playlist for you.

Jeremie

So in this case, but they're saying AI generated conversational radio. So, so this suggests the same thing, but basically you're going to listen to a, like a podcast or a show or something.

Andrey

I think it's a conversation in the sense that you give it a conversational input. The output is just a playlist. I know.

Jeremie

I basically managed to misinterpret like this entire, because this is the very, it was the very first sentence of the article. So this directly kind of fixed my interpretation of the whole thing. I was so confused. Uh, okay. That makes a lot more sense. Yeah. Cause AI generated conversational radio. Now I'm wondering, is that something people would want? Cause it sounds like it actually could be, uh, anyway. Well, and

Andrey

to be fair, uh, the Spotify AI DJ is that, in some sense: there's a DJ with an AI-generated voice that, like, talks to you in between songs and whatnot. This is not quite that; this is just sort of describing the playlist you want to a chatbot, and it will generate a playlist for you. Uh, but yeah, an example of seeing more AI integrated into a music streaming service, similar to what Spotify has started doing.

Jeremie

Well, yeah, and becoming the service in a certain sense, right? Like, we're transcending the search paradigm, really. It's like, how, how can we do content generation on the fly? And, you know, back to our earlier conversation about inference time and inference costs, right? This is the sort of thing you can start to do when it costs, like, you know, 60 cents to generate a million tokens of output, or the equivalent for, you know, for music or video. So yeah, really cool.

And we'll see, maybe we end up with. Uh, sort of not, not live streaming, but, uh, you know, live generation YouTube at some point

Andrey

and onto Applications and Business. First up, we have a story about OpenAI and how they're working on new reasoning technology under the code name Strawberry. So this is, uh, kind of kept pretty well under wraps, with some reporting on this coming out of Reuters. Apparently they talked to someone familiar with the matter and looked at internal documentation. Uh, we don't know too much about it, but it does seem like essentially this is an internal project to lead to more intelligent systems.

We had a similar story about Project Q*, I think last year, where internally they are working on training AI to reason at a higher level. This is pretty much along those lines of not knowing exactly what they are doing, but they are working on something, presumably novel, to deliver advanced reasoning capabilities.

Jeremie

Yeah, this is actually, so, um, according to some people, this is actually the project that was formerly known as Q*. So this seems like it may be a different internal code name for that project. Um, and we covered that in, in some detail before; it's sort of worth exhuming that, uh, that particular thing, uh, just for the purpose of this conversation, because it will tell us something about where things are going at OpenAI.

The goal here, by the way, is to make a system that achieves what OpenAI internally refers to as deep research. So this is the act of not just doing, well, again, it's kind of going more in that direction of, uh, sort of deep reasoning and being able to do that System 2 thinking, so that you can solve complex problems in a way maybe more akin to what a human can do, uh, rather than just pattern matching, you know, the sort of, not that they are stochastic parrots, but the sort of

language model paradigm. Um, apparently OpenAI has been privately signaling to developers and other folks in the, in the kind of universe they're trying to sell to that they are on the cusp of releasing tech with a lot more advanced reasoning capabilities. And so it seems, and this is coming from four people, apparently, so they've really triangulated this, who've all heard the company's pitches.

So it seems like there may be a fairly imminent launch of a project, which may be this so-called Strawberry project, or maybe you can call it Q*, uh, whatever, by whatever name. But, um, yeah, so, so there's a couple of things that are going on here. We do have a little bit of information on what, what this might be.

Um, it, uh, so there's this method, we, we talked about this earlier, but Self-Taught Reasoner, this was the STaR part of Q*, uh, where basically you're allowing AI models to kind of bootstrap themselves into higher levels of intelligence by iteratively creating their own training data and sort of, like, having a reviewer model evaluate that, uh, you know, that, that data and kind of closing that loop.

And so, uh, this was some work that Ilya Sutskever actually, uh, pioneered and led at OpenAI. This was back when they merged their MathGen and CodeGen teams, uh, to work on this project; Let's Verify Step by Step, that was at least a paper that came out of, it seems, this team.

And, um, yeah, it was basically based on this idea of verifying intermediate reasoning stages in a calculation and using a verifier to kind of make sure that, you know, those intermediate stages are correct. They seem to have been grounding this in mathematical problem solving. Why math? Well, because it's verifiable; you actually have a ground truth answer that is known to be correct. And so a verifier can essentially prove that

a particular reasoning step in a reasoning chain is correct, so that gives you a good training signal. Um, once you train a model to optimize in that kind of mathematical reasoning paradigm, you've sort of grounded some important aspect of its reasoning capabilities. And so now, you know, if you have that, that model, you apply it in a kind of broader setting, like a more general language model setting, you may not suffer from the same sort of logical

traps that, that sort of vanilla language models tend to fall into. And so this is really important because you have a foundation here for a sort of self-play dynamic where you have, you know, models training themselves, generating their own training data with good grounding, in a way that's not data bounded. You know, there's a lot of talk about the data wall that training is hitting. This may be one way around that in part. So there's a lot going on here.
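To make the loop described above concrete, here is a minimal sketch of a STaR-style self-improvement iteration. The function names (generate_solutions, extract_answer, finetune) are hypothetical placeholders rather than any lab's actual API, and the real Q*/Strawberry training recipe is not public.

```python
def star_iteration(model, problems, n_samples=8):
    """One round of self-taught reasoning on verifiable math problems (sketch)."""
    kept = []
    for problem in problems:
        # Sample several candidate reasoning chains for each problem.
        for chain in generate_solutions(model, problem.question, n=n_samples):
            # The "verifier" here is simply ground truth: math answers are checkable.
            if extract_answer(chain) == problem.gold_answer:
                kept.append((problem.question, chain))
                break
    # Fine-tune on self-generated chains that reached a verified-correct answer,
    # then repeat the whole loop with the improved model.
    return finetune(model, kept)
```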

Apparently OpenAI has tested internally a model, which may be this Q* or Strawberry model, that hit over 90 percent on the MATH dataset. Um, that is a big deal. The only other model I'm aware of that's hit anywhere close to that is a Gemini version that was fine-tuned on the MATH dataset. Here we have a general-purpose model that's doing that as a side effect of its wider set of capabilities.

Though it's unclear how that maps onto, you know, the specifics of its training, because we don't know anything about that. So there's a lot of really interesting stuff going on. Apparently the big goal here is long-horizon tasks, these tasks that require, again, deep research, System 2 thinking. That's where OpenAI is heading. Uh, it's consistent with their overall strategy, right?

Of essentially bootstrapping AI research by getting AI to automate AI research, and hopefully safety research as well, though a lot of the whistleblowers I've spoken to aren't too optimistic that OpenAI will, uh, will do the right thing on that. But, uh, certainly the capability side seems to be moving forward pretty fast at OpenAI.

Andrey

Right. And it relates pretty closely to a story we covered, I think, last week about the ranking system OpenAI unveiled internally at an all-hands, which was also covered, with these five levels, uh, five stages of artificial intelligence. Level one is where we are now: chatbots, AI with conversational language. And then level two would be reasoners, with human-level problem solving. Presumably that's what they're trying to get to, and probably are nearing, with Strawberry.

Then level three would be agents. Level four would be, uh, innovators, AI that can aid in invention. So that's the long-term goal of having AI that can do the research to improve AI. Uh, so it makes a lot of sense that they chose this kind of ranking system. And then this project is trying to get them to level two from level one.

Jeremie

Yeah, it's worth noting as well, the only reason we know about this is OpenAI sort of announced this internally at an all-hands meeting. Apparently they showed a demo of the project that they claimed to have new human-like reasoning skills. That was according to a Bloomberg report.

Um, you know, a lot of people I talked to, um, at, at OpenAI, in the, in the frontier labs generally, but in particular OpenAI, um, do see this as something that may change over time, you know, that you're not necessarily going to see the same, even internal, openness. They expect silos to, to pop up at OpenAI because they're developing a pretty secretive culture.

And so, um, you know, the sense is that, you know, you might end up with loyalists to, you know, Sam and Greg or whatever, being set up in silos that are overseen by only a small group of very loyal people, because they have had, you know, uh, quite a few leaks. So just to kind of plant this flag here, that we may not see the same level of openness, through leaks or through other means, going forward.

Um, so, you know, I don't know if this will continue in the next few months, but anyway, just to kind of, kind of flag that, that idea that, yeah, this is an unusually high level of visibility, even though it is incredibly murky, uh, into the internal workings of OpenAI at this stage.

Andrey

Next up we have Inside Elon Musk's Mad Dash To Build A Giant xAI Supercomputer In Memphis. I believe we mentioned last week that we got the news that Elon Musk and xAI are planning to build their own supercomputer for something like 10 billion dollars instead of working with Oracle on one. And this article delves into some of the interesting and, I don't know, slightly spicy details of how that's coming together. So this supercomputer is set to be built in Memphis.

And apparently that deal came together very quickly. So there were negotiations happening in March, and, uh, the kind of city, uh, officials who were negotiating with xAI really tried to move fast to green-light this. And this article goes into how that might've been a bit too fast, how there's a bit of pushback. For instance, the Memphis City Council is commenting on how they might want to pump the brakes on that, how it might be going too fast.

Council member Pearl Walker told Forbes that there has been, quote, hysteria among her constituents, people being afraid due to what might happen with their water and the energy supply, which, to be fair, is a very reasonable consideration given this will presumably use a ton of power. And, uh, the construction on this has already begun on what Musk is calling the gigafactory of compute.

So yeah, lots of details here going on at kind of a local government level, uh, reacting to this deal coming together.

Jeremie

Yeah. And it's, it's another classic example of the sort of, like, red tape paradox, right? That a lot of companies face as they want to build out capability internal to the U.S. We've talked about this in the context of national security, right? Like we've got, you know, OpenAI, Sam Altman, turning to the UAE.

Looking to base data centers there because there's no electric infrastructure, no power infrastructure that can handle the kinds of buildouts, data center buildouts that they, they need for this, you know, projects like the Stargate cluster and so on and so forth. And so, you know, this is kind of the flip side of that. This is what happens when you try to do this domestically.

Um, there's, you know, pushback, in many cases on, you know, sound, sound grounds, where you have this small community, they don't have a lot of power, uh, to, to spare. And here comes, really, I mean, there are a couple stats that are worth flagging here. So xAI is going to initially need up to 50 megawatts of electricity. So for context, one megawatt is, so, it's a thousand kilowatts, and that's roughly a thousand homes, right? A home is roughly one kilowatt

on average. Um, so xAI has requested an eventual capacity of up to 150 megawatts. We can also, from this, by the way, the article doesn't do this, but you, you could do some back-of-envelope math to figure out roughly how many GPUs we're talking about here. So roughly speaking, once you factor in things like cooling and, and other infrastructure that you need to run data centers, you can roughly think of one GPU as being about 1,600,

uh, 1,600 watts in terms of energy, uh, energy spend. So, so roughly order of magnitude, actually, one GPU is approximately the same as one house. Uh, that's, that's roughly how it goes. So if you look at the 50 megawatt version of this, we're talking about 30,000 H100 GPUs roughly. And the 150 megawatts that they might go up to, we're talking 100,000 H100 GPUs. That's actually, yeah, that's quite a lot. Uh, you know, the equivalent of,

you know, order of magnitude, 100,000 homes in a relatively small community. And that's part of the reason why you're seeing this sort of pushback, also on the basis of just the sheer water demands that it takes to cool these systems. It is significant.
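As a rough check on those back-of-envelope figures, assuming roughly 1.6 kilowatts per H100 once cooling and other overhead are included and about one kilowatt per average home (both assumptions, not xAI's published numbers):

```python
KW_PER_GPU_ALL_IN = 1.6  # assumed H100 draw including cooling/networking overhead
KW_PER_HOME = 1.0        # assumed average household power draw

for megawatts in (50, 150):
    kilowatts = megawatts * 1_000
    gpus = kilowatts / KW_PER_GPU_ALL_IN
    homes = kilowatts / KW_PER_HOME
    print(f"{megawatts} MW ~ {gpus:,.0f} H100s ~ power for {homes:,.0f} homes")
# -> 50 MW ~ 31,250 H100s ~ power for 50,000 homes
# -> 150 MW ~ 93,750 H100s ~ power for 150,000 homes
```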

I think, if I remember from the article, it's something like 1 percent or so of the water, one or two percent of the water, that they have going through their, um, uh, oh, I forget if it was, like, a body of water or something they're using right now to pull their water from. So, yeah, I mean, I think it's one of those things where you've got to kind of make a call. Obviously, massive economic benefits for this community.

You're going to have a lot of quality jobs coming in, data center builds, a lot of maintenance and upkeep. But this, you know, does raise the question of what happens to these communities. Do they have the right to have, you know, full engagement across the board? And certainly there have been, you know, folks within this ecosystem who say, hey, you know, I wasn't consulted.

This was all done in a fairly secretive way, um, as a, as a thing between the Memphis City Council and, uh, xAI, and, and, and also, uh, under NDA as well. So nondisclosure agreements that would have presumably prevented folks from, from engaging maybe the wider, uh, the wider city, the wider community, uh, to figure out the ins and outs of this deal. So, yep. Uh, this is, uh, this is an interesting, interesting kind of challenge and paradox.

And, uh, the other thing that's, by the way, shaping this is there is some skepticism from sort of the history that some people are arguing, uh, Elon has of, like, over-promising public infrastructure to the places that host his companies. So he's previously said things, as he has here; he said, look, there are going to be a whole bunch of quality jobs, there's going to be infrastructure that we'll build to help this community, uh, kind of grow and build out.

Um, but some folks are pointing to back in 2018 when the Boring Company said they'd build a mass transit hyperloop beneath Las Vegas. And that just didn't turn out to happen, or at least it's been very slow going. There's been a bunch of safety violations people have, have called out, and sort of similar things happening, um, uh, anyway, with, with some of the other companies that he's been involved in. So, you know, who knows?

Um, but certainly this is going to be a big old build and this will come to define this community going forward. If, if the project proceeds.

Andrey

And to those points regarding infrastructure, apparently xAI has made verbal pledges to improve public infrastructure. They want to build a new power substation and a greywater processing facility. And they want to do it themselves, not leave it to, uh, whoever is in charge of that, I guess, because they could just do it faster. Musk has said that this, uh, place would be up and running in August. So clearly, you know, they're really trying to build out a very, very significant data center very quickly.

So yeah, an interesting article kind of highlighting the, uh, practical realities of doing something like that. Onto the Lightning Round. First up, we have Apple, NVIDIA, and Anthropic reportedly used YouTube transcripts without permission to train AI models. This is, uh, talking about how apparently these companies used transcripts from over 173,000 YouTube videos without permission.

Uh, this is according to an investigation by Proof News, and this is of course coming after we've seen that already happen with OpenAI; previous, uh, questions have arisen as to whether they used transcripts to train. And YouTube CEO Neal Mohan has already stated that using YouTube's data to train AI models would violate the platform's terms of service. So yeah, a reminder, I guess, uh, not that we need a reminder, but all of this AI evolution, revolution, all these models are kind of happening

with just everyone grabbing all the data they can, usually without regard for things like, are you allowed to or not? And I guess we'll see how it shakes out.

Jeremie

Yeah, for sure. It's the cost of innovation in some way, but it's also, you know, should it be, is the question everybody's trying to answer. And in this case, I think it's worth mentioning, um, so of course it is 173,000 YouTube videos. It's not the videos themselves; it is, of course, as you said, the transcripts, right? So, you know, we should kind of, uh, double-click on that. It is from an EleutherAI dataset.

EleutherAI being, uh, this group that's, they, I think they refer to themselves as something like an open source grassroots collective of researchers. So they're, they're sort of, they've got that vibe anyway. They're very big into open source.

Uh, actually the first group I'm aware of that replicated GPT-2. Uh, yeah, it was actually their, their, uh, co-founder, Connor Leahy, who now is over at Conjecture, who sort of, uh, got famous by replicating GPT-2 back in the day before OpenAI actually released it. Um, he went on to found EleutherAI to be kind of this open source play for AI safety.

But anyway, um, so it's, it's, yeah, 48,000 channels that they pulled this data from, and it has been used apparently by all these companies, you know, Apple, NVIDIA, you know. So kind of, you know, kind of an interesting, uh, issue here. You have EleutherAI, of course, famous for making the dataset known as The Pile, uh, which has been used in a lot of other projects too. They've since, you know, made other datasets, open source datasets, that have been a lot more

kind of cautiously developed through the lens of copyright. We covered that in previous episodes. Um, but, but apparently not this one. So this is kind of an interesting challenge. Worth mentioning too, uh, they, they, they do this in the article, is that earlier this year, OpenAI's chief technology officer Mira Murati evaded questions from the Wall Street Journal about whether they used YouTube videos to train Sora. Of course, we covered that too.

Um, but it just goes to show you, I mean, YouTube is such a juicy, tempting target with all this high quality video data. It's, it's hard to stay away. And I guess that's part of the issue here.

Andrey

Right. And, uh, yeah, the, uh, founder Sid Black actually wrote on GitHub that he used a script and just downloaded the subtitles of these videos. So this is basically the subtitles you would get if you were watching the video online; that's what they used. There was some speculation that, uh, OpenAI actually generated transcripts with Whisper at one point, but in any case, uh, yeah, everyone's training on YouTube data, it seems.

And next up, after Tesla and OpenAI, Andrej Karpathy's startup aims to apply AI assistance to education. So Andrej Karpathy, a pretty well known figure in the AI world, he was former head of AI at Tesla, a researcher at OpenAI, also an educator at Stanford who created one of the first courses on deep learning for computer vision, and he has announced on Twitter that he is launching Eureka Labs, which is an education platform built with AI at its core. Not much else is really known yet.

This was just an initial announcement, and it sounds like their first product will be an AI course, LLM101n, an undergraduate-level class that will help students train their own AI. So this is, uh, yeah, very much in keeping with what Karpathy has been doing for a while now; he's published many blog posts. He has been publishing YouTube videos, uh, educating people about AI.

It seems like, you know, he left OpenAI where he would have to manage and create like reports to do this kind of thing that he is probably more passionate about.

Jeremie

Yeah, it's also really interesting, you know, so he's, yeah, famous for setting up all these sort of technical, uh, deep dives and, and explanations, uh, that are really high quality. Like if you're, you know, if you're into, if you actually want to build these systems yourself, they are just a great way to, uh, to get started. Um, but yeah, it, it sort of makes me wonder what is his vision Of AI timelines, right? Because this is not the sort of thing that you do if you think AGI is imminent.

Presumably it's not at least, um, it's, it's at least to me strikes, it strikes me as something very specific rather than, you know, if you think AGI is imminent, if you think OpenAI or DeepMind or, or Anthropic or somebody else is about to do it, um, in the next couple of years, at least, you know, maybe you, you focus on, on those efforts or something like that. This seems like an interesting product play. Um, so yeah, I haven't, I'm not aware of it.

of his, like, recent statements on AGI timelines. Uh, so this, this is really interesting and I think, uh, a great update because he is obviously extremely knowledgeable, right? He's worked at Tesla, he's worked at OpenAI, helped co found OpenAI, um, so he, he would, uh, have a lot of interesting insights and, and knowledge about, uh, well, that might inform the timeline piece. Anyway, that's my two cents. I think it's, uh, really interesting and I'm, I'm excited to try this out.

Andrey

Next story, Menlo Ventures and Anthropic team up on a $100 million AI fund. So Menlo Ventures is a Silicon Valley venture capital firm, and apparently they're partnering with Anthropic, in whom Menlo Ventures has invested, to launch this $100 million initiative called the Anthology Fund to invest in early-stage AI companies. Uh, and uh, Menlo Ventures is, uh, as of recently, the top backer of Anthropic, closing an over $750 million funding round in the company.

So, uh, yeah, this would be a fund that will provide initial investments starting at $100,000, combined with $25,000 worth of credits to use Anthropic's models, for startups. So I guess one way to get people to use your models is to just give them money to do it.

Jeremie

Yeah. And it also seems like a very reasonable, kind of rational collaboration here, right? If you're, if you're Menlo Ventures, you're going to stand to benefit a lot from having people who can really help pick your winners and losers in the startup ecosystem. And Anthropic certainly is going to have that, that skill on hand.

Um, it also sort of mirrors, it's not quite the same as, you know, OpenAI's venture arm; they do have a sort of investment arm of, of the, uh, the larger company. Um, but this is a partnership, and it seems that Anthropic, as a smaller team, is achieving through partnerships what OpenAI historically has sort of done internally.

Um, I think it's actually going to be good for both, because it gives, it gives Menlo access to really great capital allocation talent, the people who can really help pick winners and losers. And then it gives Anthropic great visibility into the early-stage startup ecosystem. As you said, it's also a great way for them to drive the use of their products preferentially. Um, and yeah, they're, they're, they're big into Anthropic; 750 million they, uh, contributed to, uh, Anthropic's latest fundraising round.

So that's a, a pretty, uh, pretty, uh, decent chunk of change, at least for, for right now. Just, you know, it's not a, a defining amount of money, but it helps keep the lights on. Menlo is known for a lot of impressive investments too. This is no slouch firm: uh, Gilead Sciences, Uber, uh, Roku, Credit Sesame. So those are all, um, those are all in the portfolio. So yeah, interesting. We'll see if this partnership deepens too.

And we start to see the check sizes grow if it's successful going forward.

Andrey

Yeah, interestingly, they're accepting applications from startups through an online form, and Menlo will use machine learning tools to score and rank applications. And, uh, the diligence process for the companies is expected to be more, quote, lightweight than a typical investment. Presumably also lower in value. So if you're in the Bay Area and want to do something with machine learning, there you go. And onto Projects and Open Source.

First, we begin as we often do with Mistral, and Mistral has now released Codestral Mamba for faster and longer code generation. This is part of two models; we'll cover the next one later. But first we have this Codestral Mamba 7B, a model that can handle up to 256,000-token inputs. That's more than GPT-4o. And, uh, it is outpacing rival open source models on benchmarks. They also released, uh, Mathstral 7B, which is designed for math-related reasoning and scientific discovery.

And these are coming out under the Apache 2.0 open source license. So, yeah, more open sourcing from Mistral.

Jeremie

Yeah, it's, um, you know, one of the big things. So, so this is, as you said, a Mamba-based architecture. They call out, this is especially relevant for code generation, where you really want those large context windows, right? Like, this is where it's helpful, disproportionately helpful, to fit that entire code base, for example, into your model's context so it can reason intelligently about what the next function is you need and so on. So, you know, this is

a clear, sort of, particularly interesting use case for Mamba. Mamba, of course, being known as an architecture that does generalize well to larger context windows. That's kind of the comparative advantage, and there's a lot of detail and asterisks there, but high level, that's the picture. So interesting to see them pushing that direction. I will say, the subtitle of their announcement page, uh, confuses the hell out of me.

They, they write, um, uh, I like it, by the way, it just, it still confuses me. They write, as a tribute to Cleopatra (they're, they're, I guess, like, kind of dedicating this, this model release), as a tribute to Cleopatra, whose glorious destiny ended in tragic snake circumstances, we are proud to release Codestral Mamba, a Mamba 2 language model specialized in code generation. So I don't know, like, is, is like Mamba, is that like the name of a snake or something? Am I, am I missing something?

Uh, yes, I've seen, I've seen snake logos associated with Mamba. Mambas

Andrey

are a type of fast-moving, highly venomous snake.

Jeremie

That's so embarrassing. I like literally never, oh my God. I'm on Google Images. This is terrible. This is terrible. Okay. Well, then I get it. It's funny. I appreciate it. Thank you, Mistral, for lightening our days with this. Uh, yeah. Mistral,

Andrey

uh, is, is a bit more on the fun side, they are. Uh, they, they have their, uh, platform called la Plateforme. Yes. And, uh, they do have some benchmarks. So, uh, this model at 7B, uh, 7 billion parameters, isn't quite as good as Codestral 22B, but, you know, pretty close numerically. And given that it's one third of the size, it does seem to be pretty promising. And it also

outperforms, uh, CodeLlama 34B, CodeGemma 7B, uh, a bunch of these other open source ones, so certainly a cool release from them, as well as Mathstral. A lot of 7B models as usual being released, I guess.

Jeremie

Yeah, all part of that trend that we were talking about, right? Where people are sort of taking a step back from scaling and, and honing in on those architectures before, before, probably in this case, before the, um, H100s either arrive and are hooked up, or before the B100s and B200s start shipping, right? So we're in that, that, uh, awkward, uh, um, middle ground before, well, actually, as Mistral might say, le cul entre deux chaises,

as the French Canadians like to say. It means, uh, butt between two seats, if you're, uh, if you wanted that picture in your head. Uh, but anyway, we're sort of like in between these releases, uh, in, in many cases, of new hardware. And so you're kind of trying to figure out what to do with the existing hardware in the meantime. So. Cool, cool.

Andrey

And speaking of Mistral, we have another model. Well, this one is not open source, uh, I think, but worth covering. Mistral AI and NVIDIA unveil Mistral NeMo 12B, a cutting-edge enterprise AI model. So this is a collaboration between Mistral AI and NVIDIA, and they have released this Mistral NeMo 12B for use with NVIDIA, uh, cloud tooling and so on. So evidently this was, uh, optimized to run on NVIDIA hardware, uh, trained on the NVIDIA DGX Cloud AI platform using NVIDIA stuff.

Anyway, I'm just reading from the, uh, NVIDIA press release, and there's a whole lot of detail about how NVIDIA was used for this. Um, and it is being released under the Apache 2.0 license and being released with FP8, 8-bit floating point, precision, reducing memory requirements. So it's pretty portable, as with other open source models. Uh, so an interesting collaboration here from NVIDIA and Mistral.

Jeremie

Yeah, yeah. And, you know, another, another Apache 2 model, with a 128K token context window. And I guess this one's FP8, but, like, you know, it's interesting to see just how, uh, comparable a lot of these models do end up being. Um, there are obviously narrow differences here and there, and the model size matters, but here they're compensating for the slightly larger size, you know; it's not a 7 billion parameter model, it's a 12 billion parameter model.

So we're going to quantize it down to FP8, and that'll probably make the memory footprint, uh, fairly similar to, uh, what, what you might see with a 7 billion parameter model. But anyway, um, all kind of, uh, uh, yeah, part of the commoditization of this space. Like, I'm really curious if this strategy of, of Mistral's and all these other companies ends up actually being beneficial. Like, are they just, like, lighting VC money on fire right now? Or does this play end up working out?

Like, are we going to see them go the way of Stability AI with all the struggles they've had? I just, I don't know. I'm still sort of scratching my head at this one, but, uh, they, uh, they keep plugging away and these models are impressive. So, so kudos to them.
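For a rough sense of why the quantization point above works out, here is a weights-only memory estimate (activations, KV cache, and runtime overhead excluded, so real footprints will be somewhat larger):

```python
def weight_gb(params_billion, bytes_per_param):
    """Approximate memory needed just to store the model weights."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(weight_gb(12, 1))  # 12B parameters at FP8 (1 byte each)       -> ~12 GB
print(weight_gb(7, 2))   # 7B parameters at FP16/BF16 (2 bytes each) -> ~14 GB
```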

Andrey

And at the Lightning Round, we got some smaller models. The first story is that Hugging Face releases SmolLM, that's S-M-O-L-L-M, fun name there, a series of small language models that beat Qwen2 and Phi-1.5. So these are coming in three sizes, 135 million, 360 million, and 1.7 billion parameters, optimized for use on local devices like laptops and phones. And these are fully, fully open source. So you've got the weights, you've got the dataset, you've got the training code.

All of that is available to you. So we know that they have been trained on FineWeb-Edu and Cosmopedia, uh, V2. And uh, yeah, as you might expect, if you evaluate them, they outperform existing models in their size categories.

Jeremie

Yeah. And I mean, if you're looking for, I know it can be overwhelming. Uh, God knows we feel this, you know, all these open source models coming through and they're all at like 7 billion or, you know, 2 billion or something like that. Um, so first of all, definitely on the smaller side, that's, that's one. Well, okay. Small, small, small, ha ha. Anyway.

If this is definitely on the smaller side of open source language models, um, which is important that, you know, there are a lot of especially edge device use cases that depend on that things where you need blazingly fast inference. Um, but another kind of trend to highlight if you're trying to think of, like, what are the things I can distill from this one is really this data curation thing. This is something we saw with the five series of models that Microsoft.

Put out the last couple of months. You know, one of the big things that people are looking at, especially when you start to go through smaller models and you're more, let's say parameter constrained than compute or data constrained, now you're going to care a lot about building really good data sets. You know, that was the thing that, that's made the five models so successful and they, they really are, especially at this sort of like 2 billion parameter range, which they compete at. Um, yeah.

So this is a big play that's gone into making these models as good as they are. They put together this dataset called Cosmopedia V2, which is an enhanced version of a previous dataset called Cosmopedia. It is the largest synthetic dataset that exists for pre training. So this is, uh, it's got they say 30 million textbooks, blog posts and stories that have been generated by a language model.

And in this case by Mixtral, um, 8x7B. This is the mixture of experts model that Mistral put out a little bit earlier and that we also covered — um, the instruction fine-tuned version of that. So they're using this Mixtral model to generate a synthetic dataset that they're going to use to really cram as much knowledge as they can into this much, much smaller model. Um, so that's really the play, right?

Just because you're constrained by the size of your model, it doesn't mean you can't pour a sort of, um, disproportionate amount of compute and data into it. You absolutely can. The model will keep improving. Uh, its plateau is just gonna hit earlier. So that's the trade off there. You could make the model bigger and, for the same amount of compute, same amount of data, it would perform better.

Yeah. But then the trade off is you can't use it on edge devices and things like that. So that's kind of what's going into this, this whole landscape.
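For a flavor of the synthetic-textbook recipe being described, here is a minimal sketch of generating that kind of pre-training data with an instruction-tuned teacher model. It's an illustration only: the prompts and seed topics are made up, and this is not the actual Cosmopedia v2 pipeline.

```python
import json
from transformers import pipeline

# Heavyweight teacher model; loading it requires a lot of GPU memory.
generator = pipeline(
    "text-generation",
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    device_map="auto",
)

seed_topics = ["photosynthesis", "binary search", "the water cycle"]  # illustrative
samples = []
for topic in seed_topics:
    prompt = (
        f"Write a clear, self-contained textbook section explaining {topic} "
        "for a motivated beginner, with one short worked example."
    )
    out = generator(
        prompt, max_new_tokens=512, do_sample=True, temperature=0.8,
        return_full_text=False,
    )
    samples.append({"topic": topic, "text": out[0]["generated_text"]})

# The resulting JSONL is the kind of corpus a much smaller model can be pre-trained on.
with open("synthetic_textbooks.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")
```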

Andrey

And the next story is about stability AI. We have a big player in open source and it has to do with licensing. One of the existing, uh, very important and exciting bits about open source. The title of the story is stable diffusion, free license revamped amid blowback promising a better future. model. So when stable diffusion free medium was announced that came with a specific community license that was fairly restrictive.

So it, uh, kind of was open source, but under very stringent conditions where Stability AI could, uh, limit your use of it. And the blowback was so bad that, in fact, there was a ban on Civitai, a community hub, which barred all Stable Diffusion 3 related content due to licensing concerns — you couldn't actually have it on that platform.

So they announced that there's a new set of terms, explicitly granting free use of Stable Diffusion 3 for research, non-commercial, and limited commercial purposes. Uh, so that means that individuals and businesses with annual revenues under 1 million dollars can use the model without charge, and those exceeding that threshold must obtain an enterprise license. So, yeah, interesting to see that there was a significant amount of blowback on the license, to the degree that Stability AI had to revise it and put out this new version.

Jeremie

Yeah, there's a whole bunch of sort of detailed legal back and forth about the specifics as well. Like it does go pretty deep. People could have concerns, for example, about whether the license can basically be revoked, terminated at any point by Stability. That was the, the big question.

To the concern that at least one person expressed publicly, Stability came back and said, well, you know, just to clarify, the community license agreement says that it can only be terminated if the license — sorry, if the licensee — is in breach of any term or condition. So anyway, there's a whole bunch of layers like that to this onion.

Um, but at the high level, I mean, I think this is one of the challenges — like, you know, Stable Diffusion, or Stability rather, sort of being hoisted by its own petard here. Like, they set the bar on open sourcing things and decided to turn this into their business model. This reminds me a little bit of some of the blowback that OpenAI first faced when they were open sourcing all their models and made a big hubbub about that.

Then, you know, for understandable business and safety and security reasons, they kind of closed down shop a little bit, or a lot. Um, now we're seeing, for business reasons I suspect, you know, Stability — it's really hard to make a business when you're giving away your crown jewels. And so hosting can only go so far, especially in a world, again, where a million tokens of inference time output is like 60 cents or whatever.

So I think this is going to be an interesting challenge for Stability going forward. Like, how do we figure out a world where there's, you know, permissive licensing and also good revenues? It's not yet at all obvious that that world even exists, you know, or is even reachable for businesses. So they have a lot of fundamental structural questions that they're facing that are pretty existential. They are under new management.

So, you know, maybe that is going to lead — I guess it's already led — to a change in direction with this attempt to clamp down a bit on the open source side. But I'm really curious if, you know, two years from now, we're still talking about Stability, at least in the same way, and if they're going to be forced to pivot or find a new business model. Because, you know, the open source community is a certain way, and understandably so.

They've had a long history of success with that model. Uh, it's seen as the core driver for a lot of, you know, early web progress and early AI progress. So, you know, championing it and then turning around — it's tough to anger your core user base.

Andrey

Yeah, and, uh, just looking back at the previous version of the license out of curiosity: evidently in the original version, uh, for this creator's license, there was a 20 dollar per month fee even for those running the models locally on their computer, and the license limited image generation to 6,000 images per month. And there were also things like derivative works, which are modifications to the model — that, of course, is a very important aspect of these models, being able to build on top of them.

So it makes a lot of sense that there was this amount of pushback for those kinds of conditions in the original license.

Jeremie

Yeah, I will say, one thing I like about their attempt to split the baby here is they have this condition that says, you know, if you're a business that's making over a million dollars, then you have to pay — like, you can't use it commercially for free. But if you're making less than that, then you can. And I don't know, I guess I kind of like that idea as a way of, you know, making sure that if you're making more than a million dollars — actually, a million may even be on the low side.

But if you're making, you know, more than a certain amount of money, you can probably afford to, and probably should, build your own tech in house at that point, from almost a moral standpoint, to keep competition alive. Um, yeah, so this is their attempt to make it pseudo open source. Yeah, well, it's a bold move, Cotton. We'll see if it works out.

Andrey

On to research and advancements. And the first story is about FlashAttention-3. So FlashAttention and FlashAttention-2 were things developed over the previous two years to make LLM training and inference faster, basically by optimizing the implementation of attention, one of the core bits of the internals of LLMs. So here we have an announcement of FlashAttention-3, coming from a big collaboration between Colfax Research, Meta, NVIDIA, Georgia Tech, Princeton University, and Together AI.

And FlashAttention-3 is specifically introducing new optimizations for NVIDIA Hopper GPUs. So FlashAttention-2 could only use about 35 percent of the H100's maximum capacity. This one, due to various pretty technical tricks in scheduling operations and so on, achieves up to 75 percent utilization — so a 1.5 to 2x speedup compared to previous versions of FlashAttention. So pretty huge, pretty significant for this to be coming out.
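As a quick sanity check on those numbers (taking the 35 percent and 75 percent figures at face value), the implied ceiling on the attention-kernel speedup is about 2x, which lines up with the 1.5 to 2x figure quoted above.

```python
# Taking the quoted utilization figures at face value.
fa2_h100_util = 0.35   # FlashAttention-2 on H100
fa3_h100_util = 0.75   # FlashAttention-3 on H100
print(f"Implied attention-kernel speedup: {fa3_h100_util / fa2_h100_util:.2f}x")  # ~2.14x
```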

Jeremie

Absolutely. And you think about data centers and the spend on these GPUs — if you're able to, like, double the amount of usage that you're squeezing out of these H100s, yeah, you're basically getting half off on your GPU, your data center spend, right? That's a really big deal. Um, so yeah, I mean, just to give a little taste of the kind of thing they're up to here.

Uh, you know, the process of training models — or inference as well, actually, the process of running these models — involves a lot of matrix multiplications. So the matrix multiplications are the backbone, the vast majority of computations that go into, um, you know, the attention mechanism, the feed forward layers of these models.

Um, the challenge is that there are a couple of key computational bottlenecks that are not matrix multiplications. Because matrix multiplications make up the vast majority of these computations, hardware is optimized to blazes around matrix multiplications. So even though they account for the vast majority of the computation, they are not where the bottlenecks actually show up in practice.

Special functions — um, you know, softmax is a great example — just end up being this horrible drain on, yeah, the time required, the wall clock time required to pump out these training runs.

And so you want to find ways to optimize around those special functions. FlashAttention, which was the original technique first developed for this, like two years ago or something like that, um, found ways to do this by reducing, basically, the amount of data exchange required between the SRAM — which is a kind of memory that sits right next to the logic on the chip; we've talked about that before on the podcast — and the high bandwidth memory, which is sort of less frequently accessed but used for really large amounts of data.

The SRAM, you can think of it as very rapidly exchanging information with the logic — it's right there, right next to it on the chip. High bandwidth memory is for bigger batches of data that you're moving back and forth.

Um, and so you want to reduce how much you need to move data around between those two memories. And that was what it was all about. FlashAttention-2, uh, was all about the A100 — it was just an optimization for the A100 GPU. They were able to get up to 70 percent of the maximum performance, basically get the A100 working at 70 percent capacity. As a rule, you never get to a hundred percent capacity on these GPUs, right?

So you're always trying to get as close as you can, and FlashAttention-2 radically improved performance on the A100 but did not port over so well to the H100, which is the current state of the art. Um, and so, you know, it was only getting about 35 percent of the H100's maximum capacity. Now, Andrey, as you said, 75 percent on the H100 with FlashAttention-3 — that is roughly a doubling, a sort of 1.5 to 2x speedup. That's a big, big deal, right? It reduces training costs in a big way.

That also helps with latency — means you get your outputs a lot faster — and then presumably also helps with context window sizes, because those things all sort of interact. So this is all open source now. It does compare favorably to OpenAI's Triton and, um, cuDNN as well, which NVIDIA puts out. So there are a lot of these sort of deep learning compilers and kernel libraries that exist.

And, uh, FlashAttention-3 is now the latest and really the most impressive, looking at these performance specs that they share in the announcement. Like, it's a fair bit better than even cuDNN. So there you have it.
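A minimal sketch of the memory-traffic point: naive attention materializes the full sequence-by-sequence score matrix and runs a softmax over it, which is exactly the non-matmul, memory-bound step described above, whereas a fused kernel (here via PyTorch's scaled_dot_product_attention, which can dispatch to FlashAttention-style kernels on supported GPUs) avoids writing that matrix out. Shapes are illustrative; this is not FlashAttention-3 itself.

```python
import math
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 2048, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Naive attention: the score matrix alone is seq_len x seq_len per head, and the
# softmax over it is exactly the non-matmul "special function" step discussed above.
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)   # shape (1, 8, 2048, 2048)
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused attention: same math, but the kernel can avoid materializing `scores`
# (and on recent NVIDIA GPUs can dispatch to FlashAttention-style implementations).
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-4))   # True, up to float error
```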

Andrey

And the next paper has the short and sweet title "Mixture of a Million Experts." This is coming to us from DeepMind. And the idea here is that mixture of experts is great, right? This is one of the things that has been shown over and over again: if you take your model and sort of subdivide it, make it so only parts of it are used in computation for any given input, that can lead to better scaling, better inference performance, and so on.

More recently, we saw the introduction of scaling laws for fine-grained experts. Basically, you know, the norm is to have maybe four or eight experts to choose between during a given inference run. Well, people have started experimenting with: what if you have way more — what if you have 64 or 500 experts — and you just decompose your neural net into a whole bunch of different bits that can be used, one over another?

And so this paper is looking at parameter efficient expert retrieval, a new layer that uses, uh, you know, some fancy computation to be able to retrieve tiny experts from a very large pool. And when you do that, you're able to scale to pretty absurd numbers — as the title says, a million experts. And as with mixture of experts in general, what that leads to is better efficiency, better use of your compute, better scaling.

So at a given number of floating point operations, you're able to achieve significantly better performance.

Jeremie

Yeah, I, I really like this paper, uh, conceptually, by the way, this is a total badass paper. It has a solo, a single author. This is one, one dude who put this thing together, uh, so pretty, pretty impressive and we'll see if it holds up and, and stands the test of time, but the results certainly seem very impressive and promising.

Um, so yeah, there's a couple of reasons that they flag for wanting to have a large number of experts, and one of them is just this problem of catastrophic forgetting. If you're going to have a model that learns — that does lifelong learning, just keeps learning and learning and learning — what you find is the things that it's learned in the past, it will tend to forget over time. And so that's the problem of catastrophic forgetting.

Um, the way, you know, humans ostensibly solve this problem is by being judicious — let's say our brains are fairly judicious in deciding what information is worth retaining. So we'll tend to retain information that's important more than information that isn't. We've seen variants of the Mamba architecture, actually, we talked about one last week, that did something like that. Um, but here, what they call out is, well, wait a minute.

If we have just a huge number of these tiny experts, we could freeze a large number of them and just add new experts to, to, you know, keep training those, but we retain the old experts that we had. And that helps to prevent this catastrophic forgetting problem. You're not training the old experts, you just kind of bring in some new ones. That can accommodate the new knowledge that you bring on board.

And that's all part of the sort of like expert selection mechanism that they develop in this paper. It is somewhat nuanced, but it is really interesting and worth reading if you're, if you're technical. Um, I guess, uh, yeah, so, so that's one piece, lifelong learning piece. Um, was I going to say there's another, uh, Oh, that's right. Yeah. The big challenge that they had to solve, though, is now you've got a million experts that you have to select from, right?

You got to find a way to really efficiently, really quickly — because you're doing this during training and inference — identify the experts that need to be used for a given inference or training run. Um, this is really where a lot of the secret sauce comes from in the paper. Again, it does get fairly technical, but they use this technique called product key retrieval, and this is sort of inspired by a lot of what search engines might do, or

if you're going to try to identify an answer to a user query very quickly and efficiently from a very large data set, so they're calling on that literature to help inform their approach so you can identify these experts and call them forward really effectively. Um, yeah, so I, yeah. I just thought this is a really great paper. They do identify and leverage a couple of really interesting scaling laws. And they make this argument again, you can read it.

It's in a section called "Why a large number of small experts." This is a really interesting — I thought nuanced and valuable — discussion that gives you some insights into the math behind why you might want a giant number of very small experts rather than, you know, a small number of big experts. Uh, it turns out to just make a lot of sense. But then of course you run into that problem of retrieval, right?

How do I choose my experts from this large pool? Um, yep. So they do a bunch of scaling law work as well on their model and show that it outperforms dense models, um, as well as mixture of experts models with like 128 experts and things like that. So anyway, really interesting paper. I do recommend it, again especially if you're technical — there's a lot of meat on the bone there.
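To give a sense of the product key retrieval trick described above, here is a minimal PyTorch sketch in the spirit of the paper's PEER layer: the query is split in half, each half is scored against a small codebook of sub-keys, and the Cartesian product of the per-half top candidates indexes into a roughly million-expert grid, so you only ever score about 2*sqrt(N) keys. The sizes and the retrieval-only scope are illustrative; this is not the paper's implementation.

```python
import torch

num_experts_side = 1024          # grid is 1024 x 1024 ~= 1M experts
d_model, topk = 256, 8
half = d_model // 2

# Two small sub-key codebooks, each with only `num_experts_side` entries.
sub_keys_a = torch.randn(num_experts_side, half)
sub_keys_b = torch.randn(num_experts_side, half)

def retrieve_experts(query: torch.Tensor) -> torch.Tensor:
    """Return indices of the top-k experts out of ~1M for a single query vector."""
    q_a, q_b = query[:half], query[half:]
    scores_a, idx_a = (sub_keys_a @ q_a).topk(topk)      # best "rows" of the grid
    scores_b, idx_b = (sub_keys_b @ q_b).topk(topk)      # best "columns" of the grid
    # The score of expert (i, j) is score_a[i] + score_b[j]; only the topk x topk
    # candidate grid has to be examined, never all ~1M experts.
    combined = scores_a[:, None] + scores_b[None, :]      # shape (topk, topk)
    flat_best = combined.flatten().topk(topk).indices
    rows, cols = flat_best // topk, flat_best % topk
    return idx_a[rows] * num_experts_side + idx_b[cols]   # flat expert ids

expert_ids = retrieve_experts(torch.randn(d_model))
print(expert_ids)   # 8 expert indices in [0, 1_048_575]
```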

Andrey

Yeah, if you're non-technical, maybe a bit of a tougher read, but, uh, yeah, it seems like the sort of thing that could be impactful. In fact, uh, I don't think we covered this, but just a few weeks ago the company Lamini introduced Lamini Memory Tuning, with the claim that it was getting 95 percent accuracy with one tenth of the hallucinations, and that was done through tuning a million expert adapters with precise facts on top of open source LLMs.

So I think in some sense it's a bit similar, where you sacrifice some generality to get a lot of variations of your model. And that, as you said, can ameliorate catastrophic forgetting and potentially also hallucination and things like that. Moving on to the lightning round. The first paper is "AutoBencher: Creating Salient, Novel, and Difficult Datasets for Language Models." So they begin by presenting three things you want in a good benchmark dataset.

Uh, salience: things that you want your model to know. Novelty: uh, what can the benchmark reveal that isn't, uh, revealed already. And difficulty: the benchmark should be difficult for existing models. So this is thinking about when you want to create a new benchmark for a language model, beyond what has already been created. And so they take these considerations and actually make it a computational problem.

Can you search over a bunch of potential benchmark tasks to create a good benchmark? And so they create datasets for math, multilingual, and knowledge-intensive question answering, which leads to, apparently, datasets that are 27 percent more novel and 22 percent more difficult than existing benchmarks. So there you go. Benchmarking is so important, we now need to automate it, with kind of meta-metrics to optimize for.

Jeremie

Yeah, which, I guess — I don't know why I'm always surprised when somebody automates another layer of a thing that I was like, that can't be automated. But of course it can. Um, it is kind of interesting to see how they define these three metrics that you mentioned — salience, difficulty, and novelty — those three objectives that they want to optimize for. Um, you know, salience: so, yeah, how practically important the capabilities being tested by a given benchmark are.

They define that as a binary variable that captures whether a benchmark on a specific topic is important — so, you know, one or zero. Um, and, uh, yeah, difficulty: this is just how low the lowest error rate achievable by existing models is. So you can quite readily see, you know, you want a benchmark that's nice and challenging so that there's actually resolution at the sort of upper end.

So you can tell which models are better and which are worse. It does no good if you're playing with a dead easy benchmark, like MNIST or something that all models can crush. Um, and, uh, so that's, I guess, pretty clear how you do that. Novelty I thought was especially interesting. So how do you actually measure the novelty of a benchmark? What they do is they actually use predictability.

So they take, let's say, the predictability of model performances on the new benchmark from other benchmarks. You basically look at, like, I don't know, MMLU — what's the stack ranking, what's the ordering of model performance on that benchmark? Then you look at another benchmark: okay, what's the ordering of model performances on that one? So you look at, like, you know, how models tend to perform on other benchmarks.

And what you're going to try to do is identify benchmarks where that ordering is different in some meaningful or interesting way. You're going to try to find, you know, optimizations so that a model that tends to perform really badly on one benchmark tends to perform really well on yours relative to other models. And that tells you that something interesting, something new must be going on with this new benchmark. So that's in practice how they're going to instantiate it.

Uh, they use this interesting adaptive search technique where essentially they'll do an iteration of kind of generating a benchmark and see how the benchmark performs based on all these metrics, and then they'll keep track of that trajectory.

So the agent — and it is an agent, the AutoBencher agent that's generating these benchmarks — will have a running memory of what the previous performances of the models were on the previous versions of the benchmarks it had created, and will use that to inform what the next version will be. That they call adaptive memory.

So, um, yeah, I thought that was really interesting. That's definitely kind of the bell and whistle that stands out in the implementation here. They do say that, you know, they run ablation studies and get rid of that feature, and it turns out that when you do, all of a sudden your benchmarks are no longer nearly as good.

The novelty benchmark — or sorry, the novelty metric — that they use is actually really interesting, because it does allow them to surface specific cases where they've identified some well known models that don't perform as well as you might expect on certain tasks, right? That's a side effect of selecting for novelty in these benchmarks: you end up with benchmarks where, yeah, the ordering of model performance is weird or surprising.

And so they say, for example, that while Gemini Pro is one of the best models on existing history benchmarks, it performs quite poorly on AutoBencher-discovered topics like the Permian extinction and Fordism — I don't even know what Fordism is, but Permian extinction — um, performing even worse than some 7 billion parameter models such as Mistral 7B.

So you can kind of surface through this process some interesting quirks about these models that I wouldn't have expected you'd be able to surface. So it is a really cool and cute paper, again very much worth a look — and especially, you know, take a look at some of the results around novelty. I found those to be the most interesting: some of the surprising patterns they surface in ways in which well known models maybe don't perform as well as you might expect on some narrow tasks.
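Here is a rough, illustrative proxy for the novelty idea just described: rank the same set of models on a candidate benchmark and on existing benchmarks, and treat a low rank correlation (an unusual ordering of models) as high novelty. This is only the intuition in code, not AutoBencher's exact formulation, and the model names and accuracy numbers are made up.

```python
import numpy as np
from scipy.stats import spearmanr

models = ["model_a", "model_b", "model_c", "model_d", "model_e"]

# Accuracy of each model on two existing benchmarks and one candidate benchmark.
existing = {
    "bench_1": np.array([0.82, 0.75, 0.70, 0.66, 0.60]),
    "bench_2": np.array([0.79, 0.74, 0.69, 0.61, 0.58]),
}
candidate = np.array([0.55, 0.72, 0.48, 0.70, 0.65])  # unusual ordering of models

def novelty(candidate_scores: np.ndarray, existing_scores: dict) -> float:
    """1 minus the mean Spearman rank correlation with existing benchmarks:
    higher means the candidate ranks models in a more surprising order."""
    corrs = [spearmanr(candidate_scores, s)[0] for s in existing_scores.values()]
    return 1.0 - float(np.mean(corrs))

print(f"novelty ~ {novelty(candidate, existing):.2f}")  # ~1.1: a very unusual ranking
```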

Andrey

And the last paper for the section — we have a very exciting topic: spreadsheets. The paper is from Microsoft, and it is "SpreadsheetLLM: Encoding Spreadsheets for Large Language Models." So the basic problem is that when you have a spreadsheet, how do you convert it to an input that a language model can take in? You can do the, uh, I guess vanilla approach, or the simple approach, where you just have the address of every single cell with its corresponding value.

But the issue is that gets very, uh, verbose, so you end up using a large number of tokens and it just doesn't work so well. So this paper introduces SheetCompressor, which is an encoding framework that presents spreadsheets in a compressed fashion. They do some sort of fancy things — structural-anchor-based compression, inverse index translation — things related to how you present spreadsheets.

But the gist of it is that they're able to compress things by, on average, 25 times, and that leads to better in-context learning and generally better usability of large language models with spreadsheets. So yeah, maybe not the most exciting, but I do think a pretty impactful type of research.
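A toy illustration of why the naive cell-by-cell serialization is so token-heavy, and of the flavor of the inverted-index idea: emit each repeated value once, followed by the cells that hold it, instead of repeating it per cell. This is only the intuition — the real SheetCompressor also adds structural anchors, range merging, and format-aware aggregation — and the example sheet here is made up.

```python
from collections import defaultdict

# A toy sheet, (row, col) -> value, with lots of repeated filler values.
sheet = {(r, c): ("Q1" if c == 0 else 0) for r in range(1, 51) for c in range(4)}
sheet[(1, 1)], sheet[(2, 2)] = 120, 340   # a few distinct "real" entries

def cell_name(r: int, c: int) -> str:
    return f"{chr(ord('A') + c)}{r}"      # e.g. (3, 1) -> "B3"

# Naive encoding: one "address,value" pair per cell -- verbose and token-hungry.
naive = "; ".join(f"{cell_name(r, c)},{v}" for (r, c), v in sorted(sheet.items()))

# Inverted-index style encoding: each value appears once, followed by its cells.
index = defaultdict(list)
for (r, c), v in sorted(sheet.items()):
    index[v].append(cell_name(r, c))
inverted = "; ".join(f"{v}:{','.join(cells)}" for v, cells in index.items())

print(len(naive), len(inverted))   # the inverted form is noticeably shorter
```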

Jeremie

Oh, absolutely. And it's so many of the, um, the, the breakthroughs, the inglorious breakthroughs that give us the best results in the space often come from the sort of earliest stages, the sort of pre processing and early processing where, you know, you have better tokenizers that come out or you have better embedding strategies. And so this is one of those cases where you can really, uh, you can really see the payoff.

So, um, yeah, no, I mean, it's, it's something to make you hate, um, uh, let's say PowerPoint. Not Google Sheets. I got to work my way backwards. Eventually I'll get there. This is like that V named company that I kept forgetting. Uh, Excel. Yeah.

Andrey

I'm pretty sure you mean Excel, yes.

Jeremie

You know what? Cause I use a Mac, you know, cause I'm like a dirty, dirty elitist. And um, so I, I have numbers. I use numbers on my Mac because I'm, I'm, I'm better than everybody else. And so it's been a while since I've had to use Excel. But anyway, uh, yeah, it makes you, it makes you hate numbers and Excel and all that crap. Uh, a little bit, a little bit less, maybe. Hopefully,

Andrey

hopefully. On to policy and safety, and the first thing we have is from OpenAI; the blog post title is "Prover-Verifier Games Improve Legibility of Language Model Outputs." So this begins with the motivation that language models that are optimized for correctness can produce solutions to problems that are harder to understand, potentially making it impossible for humans to evaluate them. And this can come up in the context of, let's say, super intelligent AI.

If they come up with solutions to problems that humans cannot comprehend, that's probably not ideal. And so here they introduce a method where you have this game between a prover and a verifier, where the prover generates a solution and a verifier checks it for accuracy. And this game setup is used to basically kind of trade off how, uh, complex or hard to understand your solution is.

So there is kind of a trade-off that they explore, and as a result they're able to arrive at solutions that are correct but also checkable, yeah, in an easier manner than what you would get from a vanilla approach.

Jeremie

Yeah, I think this is a really interesting piece of research. By the way, one of the, if not the, last research projects from OpenAI that we'll see with Jan Leike's name on it — he's one of the sort of advisors on this piece of research; he's obviously since moved to Anthropic. But, um, yeah, it's an interesting piece of work. It hearkens back to a lot of these questions about what's known as scalable oversight, right?

How do we build our way to superintelligence — not just artificial general, sort of human level intelligence, but outright superintelligence — in a way that's safe, in a way where we can actually, like, understand what is being generated by these systems and make sure that they're not, you know, making, say, dangerous or risky plans? Um, this whole game plan reminds me a lot of generative adversarial networks, right?

GANs, which came to prominence a long time ago, um, in the context of vision models, where you would have a generator model that produces an image and then a discriminator that would look at the image and try to determine whether it was real or AI generated.

And then, you know, that kind of dynamic balancing act between those two, that escalation of capabilities as those two systems are trained together leads you to having a really, really good generator and a really, really good discriminator, sort of the same thing here. So you have a verifier that is initially this really weak model that's meant to, uh, take the outputs of a prover model, which is the big powerful one.

And the prover model is going to try to, in some cases, be helpful, and in other cases, be sneaky. And it's the verifier's job to figure out: is this a helpful or a sneaky prover? And again, just like the generative adversarial network, they kind of go back and forth. And, um, the idea is the prover, if it's in helpful mode, is going to try to make a strong case for the correct solution to a math problem.

If it's in sneaky mode, it's going to try to make a strong case for the incorrect solution to a math problem — it's going to try to fool the verifier into, you know, kind of accepting the wrong answer. And so this dynamic dance seems to lead to, over time, greater interpretability of the prover, right?

So you have this interesting artifact of the verifier that can help you validate, you know, what the helpful prover is generating, and then you have the helpful prover that is being optimized to be even clearer. Cause that's one of the interesting things here, right? Again, sorry, we've got two versions of the prover here: the one that's trying to be helpful and the one that's trying to be sneaky.

If you look at just the helpful version of the prover, it's almost like it's secretly trying to collaborate with the verifier and get so good at explaining clearly why the answer it's offering to the math problem is correct that the verifier will correctly go: okay, yeah, you're in helpful mode, I see it, that is the correct answer, right? So there's this pressure on that helpful prover model to actually get really good at explaining itself.

And what they find is this leads to explanations that are much more legible to humans. Um, the challenge though is that although the explanations are more legible to humans, you're also going to sacrifice some of the raw capabilities of the prover model in this context. So, you know, it's going to cost you a bit. You're going to have this wicked good prover out of the gate that might have outlandishly complex solutions that it will walk a user through.

And through this process of iterating with the verifier, eventually the helpful prover becomes good at explaining itself — but in the process, it sacrifices some of its accuracy, its actual performance on the math problems. So they sort of surface this as an example of a kind of tax that you're going to pay.

If you want interpretability in your solutions, you're going to have to pay this interpretability tax — this legibility tax, as they put it. Um, this is reminiscent of the alignment tax. This is another idea that Jan Leike would talk a lot about back in the day: you know, you can have a very powerful system, but it's going to cost you some of that power — you're gonna have to redirect some of that power in order to get that system to behave safely and appropriately.

That is called the alignment tax. Basically, this is a cost that is incurred by companies that want to align their models and make them safe, but that companies who don't care about that don't have to spend. So, anyway, um, they have a whole bunch of interesting results in here, including a theorem that they prove, which is really interesting.

Uh, basically, they show that for provers from any model class, you can find an equilibrium — ultimately, things can converge into a situation where you have outputs that are perfectly legible to the verifier. So eventually the equilibrium point of this dynamic is that, yes, you will get to a point where the prover and verifier are perfectly in sync when the prover is trying to be helpful: the verifier will always correctly identify what it puts out as correct.

And that's a good result for AI safety. Um, it's a positive outcome. Hopefully it suggests that scalable oversight is maybe just a little bit more doable than it otherwise might be.
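For readers who want the shape of the game rather than the theory, here is a schematic sketch of the alternating training loop as described above. Every function body is a placeholder standing in for real supervised or RL updates on large models; this only shows the round structure, not OpenAI's actual training code.

```python
import random

def sample_solution(problem: str, role: str) -> dict:
    """Placeholder: a 'helpful' prover argues for the correct answer,
    a 'sneaky' prover argues convincingly for an incorrect one."""
    return {
        "problem": problem,
        "answer_is_correct": role == "helpful",
        "explanation": f"({role}) step-by-step argument for {problem}",
    }

def verifier_correctness_score(solution: dict) -> float:
    """Placeholder: the small verifier's estimated probability the solution is correct."""
    return random.random()

def train_verifier(labeled_solutions) -> None:
    """Placeholder: supervised update teaching the verifier to spot sneaky solutions."""

def train_prover(rewarded_solutions) -> None:
    """Placeholder: RL update; helpful provers are rewarded for convincing the verifier
    of correct answers, sneaky provers for fooling it with incorrect ones."""

problems = [f"math_problem_{i}" for i in range(100)]
for round_idx in range(5):  # alternate between verifier and prover phases
    solutions = [
        sample_solution(p, role) for p in problems for role in ("helpful", "sneaky")
    ]
    # Verifier phase: learn to distinguish correct (helpful) from incorrect (sneaky).
    train_verifier([(s, s["answer_is_correct"]) for s in solutions])
    # Prover phase: reward each prover for succeeding at its role against the
    # current verifier -- this is the pressure that makes helpful provers legible.
    rewards = [
        (s, verifier_correctness_score(s) if s["answer_is_correct"]
         else 1.0 - verifier_correctness_score(s))
        for s in solutions
    ]
    train_prover(rewards)
```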

Andrey

Right. And that kind of comes about a little bit indirectly — I found this interesting — by making the verifier a weaker model. Uh, they say about a three order of magnitude difference in capability, where both of them start from a GPT-4 type model. And then through that kind of adversarial training, you wind up making it so the explanations just start being very clear, instead of being super, um, I guess, concise and lacking in detail, which is what you would get at the beginning.

So yeah, really neat piece of research and results. The next story is on policy. The title is "Trump allies draft AI order to launch Manhattan Projects for defense." So this is, uh, looking like allies of

the former president, Donald Trump, are drafting an executive order dealing with AI that would initiate a series of "Manhattan Projects" to develop technology and review existing regulations. And the framework would establish an industry-led set of agencies to evaluate AI models and secure systems from foreign adversaries.

This would include things like "Make America First in AI." And presumably — I haven't dived into all the details, but, uh — this would be quite a bit different from the AI executive order that came from President Biden's team, I believe just last year or relatively recently, with a lot of government initiatives having been a result of that executive order.

Jeremie

Yeah, absolutely. No, you're totally right, it would be quite different. Um, this is such an interesting moment for this sort of thing. So for a little bit of context for our listeners: executive orders, right, are orders that the president can give to require the government to do certain things. Now, they're fairly lightweight in the sense that, uh, an executive order that's passed by one president can be repealed by the next president, right?

And they're called executive orders because the president oversees what's known as the executive branch of government. This is the branch of government that's responsible for all the kind of day to day activities, the goings on, responding quickly to emerging events and things like that, as distinct from the legislative branch, right? Legislative branch, that's like Congress, that's the House and Senate, they pass laws. And those laws have a lot more staying power.

Cause you got to get a whole bunch of people, all the people in Congress and Senate to agree to repeal or add legislation. So it just gives them a little bit more, more stickiness. Um, so for that reason, executive orders are a little bit looser. They also don't come by the way with their own funding. So when Biden passed the executive order on, uh, safe, secure and trustworthy AI or whatever it was back in November of 2023, Uh, there was no funding that actually accompanied that.

So, so that executive order calls for, you know, NIST, Department of Commerce, Department of Energy, and so on to start doing a whole bunch of things, but it doesn't actually fund those requirements. So that's always kind of a bit of a source of tension in the background as people try to figure out, okay, how do we do what the president wants us to do without that funding?

The going assumption, politically, has been for the last year that if Donald Trump wins the next election, his first move — or one of his first moves — will be to repeal Biden's executive order. Now, there are a lot of good things in Biden's executive order that Trump's team, correctly is my understanding, has identified as being good and desirable.

So there's a lot of, uh, you know, big rhetoric here, as you'd expect in a political campaign, but it doesn't appear as if everything's going to be thrown out with the bathwater. Uh, there's a, you know, decent amount of nuance on the Trump team already when it comes to parsing what's in and what's out. What's interesting here is this Manhattan Project piece, right? Manhattan Projects for defense.

The Manhattan Project, of course — like, that's how you built the A-bomb back in, uh, you know, the 1940s and a little bit earlier. And essentially the case that's being made here is, well, we need something similar for AI. Now, remember that thing I said about executive orders not coming with their own funding — only Congress can appropriate funds for things like this. And so until Congress, like, loosens the purse strings, these things are unfunded.

Manhattan Projects are super expensive, and when you're talking about, for example, you know, hundred billion dollar plus compute clusters, I'm really curious how this is going to play out in practice. Worth noting: this is not coming from the Trump campaign. This is coming from, as you noted, Trump allies at the America First Policy Institute.

This is a nonprofit that's led by a whole bunch of former advisors to Trump, including Larry Kudlow and a whole bunch of ex-Trump officials. So it's not actually clear what relationship, if any, this actually has with the, uh, sort of potential next Trump administration that may be coming in. Um, so, you know, this is all part of the challenge.

There certainly is an eagerness to throw out a lot of aspects of the executive order, um, that, you know, are viewed as being political by the Trump campaign. Um, and, and of course, it's the longest executive order in living memory, right? It was like over a hundred pages. There's a whole bunch of stuff there about, you know, social consequences, bias, this and that. But then there were more narrowly scoped things about weaponization and risk from a sort of autonomy and, and so on.

And so, you know, there's a lot of room here for nuance, even though these are, as you'd expect in any presidential campaign, sort of bombastic statements and things like that. So, um, the America First Policy Institute, by the way, was quick to say this does not even represent their official position — this is just a policy proposal issued by people kind of linked to them, or something like that.

And the Trump campaign also in response, shared a link to a blog post that said, Uh, no aspect of future presidential staffing or policy announcements should be deemed official unless they come directly from Trump or an authorized member of his campaign team. So in both cases, we got folks kind of distancing themselves from this, you know, pretty radical idea of Manhattan projects for defense. Those things could go really, really wrong if you have the wrong people heading them up.

Um, and, uh, you know, there could be value there depending on the safety case you want to make. But, uh, boy, you'd want to have people with a lot of expertise in superalignment, yeah, um, and AI safety to kind of oversee anything like that. Um, fortunately, there are, I think, an increasing number of people with deep expertise associated with and kind of indirectly connected to that campaign. So we'll see, but this is definitely a big move.

Andrey

Yeah, it's definitely worth noting these are not official, uh, policy proposals, I guess, coming from the campaign, although the GOP platform does include repealing the Biden AI executive order. So yeah, we'll definitely see some deregulation if nothing else, and some more friendliness towards industry. Uh, real quick, I gotta be running in like maybe 10 minutes, so we can run through the rest

Jeremie

for sure. For sure. Let's do it.

Andrey

And onto the lightning round. First, we have another research paper dealing with, uh, how we will kind of scale to superintelligent AI. The paper is "On Scalable Oversight with Weak LLMs Judging Strong LLMs." So in a way quite similar: it's studying the ability of less intelligent beings to provide oversight over more intelligent beings, and it found that a debate format, where two AIs compete to convince a judge, can be more effective than some sort of consultancy format.

So really evaluating how we might make it so weak LLMs can be used to judge, um, stronger LLMs.

Jeremie

Yep. They, uh, compare this idea of consultancy — so this mode where you're just interacting with a chatbot and getting answers from it directly — to debate, where you see two chatbots interacting with each other and debating, trying to convince you of the correctness of whatever side of the debate they're on.

Um, there's a whole bunch of interesting nuance and detail here. Really the differentiator with this paper — because we have seen other papers on debate — is just the breadth of experiments that they run. Um, there's a whole bunch of really good stuff here about, you know, how exactly you set up these experiments — um, the number of chain of thought steps that you go through, best-of-N sampling, all that stuff.

So yeah, check it out if you're interested. This is very much an encyclopedic result. In other words, it's not some, like, conceptual thing that's going to break the way your brain works, but it's just a good overall summary of results that span a really wide domain.
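For a concrete sense of the two protocols being compared, here is a minimal sketch of a debate round with a weak judge. The `generate` function is a placeholder for whatever LLM call you would actually use, and the prompts are illustrative, not the paper's.

```python
def generate(model_name: str, prompt: str) -> str:
    """Placeholder for a real LLM call (local model or API); returns a stub string."""
    return f"[{model_name}] response to: {prompt[:40]}..."

def debate(question: str, answer_a: str, answer_b: str, turns: int = 2) -> str:
    """Run a fixed-turn debate between two strong debaters, then ask a weak judge."""
    transcript = []
    for _ in range(turns):
        for side, answer in (("debater_A", answer_a), ("debater_B", answer_b)):
            prompt = (
                f"Question: {question}\n"
                f"You must argue that the answer is '{answer}'. "
                "Rebut the transcript so far:\n" + "\n".join(transcript)
            )
            transcript.append(f"{side}: {generate('strong-debater-model', prompt)}")
    judge_prompt = (
        f"Question: {question}\nDebate transcript:\n" + "\n".join(transcript) +
        f"\nWhich answer is correct: '{answer_a}' or '{answer_b}'?"
    )
    # The judge is deliberately a weaker model than the debaters.
    return generate("weak-judge-model", judge_prompt)

print(debate("Is 91 a prime number?", "yes", "no"))
```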

Andrey

And the next story is delving into some geopolitics once again, yay. The story is "Google, Microsoft offer Nvidia chips to Chinese companies." So this is saying that Google and Microsoft do have cloud offerings, and so companies in China are able to get access to GPUs, including those that are under export restrictions, by using data centers outside of China.

So this seems to comply with U.S. export controls, and now the Biden administration is looking to require cloud companies to determine if foreign entities are using U.S. data centers.

Jeremie

Yeah. Always this question. This came up a lot when we were doing our investigation and building our action plan for the State Department that we did a few months ago. But, you know, this question of: do you tighten — so you can tighten export controls, as we are, preventing China from getting the actual AI hardware. But now you have this question of what to do about cloud compute.

Do you just let Chinese companies use US-domiciled cloud infrastructure, or foreign-domiciled cloud infrastructure that uses, um, US-designed chips, for example? And that's a double-edged sword. If you say yes, they can use it, then the benefit is you're actually going to reduce domestic demand in China for cloud computing and therefore make it a little bit harder for companies like Huawei to make headway.

There's less revenue for them to generate because you're siphoning off some of the consumer demand. Um, but of course, on the flip side, you're actually helping bootstrap their domestic usage of AI technologies with your compute. Um, so this is a big kind of paradox that you've got to resolve.

Um, you know, I tend to think shutting down access to the cloud infrastructure kind of makes sense, because China's already made it a policy priority — or the CCP has — to get access to this stuff anyway. Um, and there's so much state funding of Huawei and other entities like that, that, you know, this can start to make sense. Um, but anyway, a lot of nuance there. I'll park it for the moment.

Andrey

And the next story, also on this topic, the title is U. S. planning draconian sanctions against China's semiconductor industry.

So, in addition to all the restrictions that are already in place, apparently the U.S. government is considering even stricter restrictions, like applying the foreign direct product rule. That would mean that allies would have to limit service and repair of equipment in China, uh, which would put pressure on companies in Japan and the Netherlands — basically making it so the U.S. can pressure those countries to impose further sanctions and limits on business with China.

Jeremie

This is a really, really challenging problem for the U.S. to navigate. They do have this instrument, yeah, the foreign direct product rule, that they can apply right now. You got companies like Tokyo Electron in Japan and ASML in the Netherlands that are crucial parts of the semiconductor supply chain internationally. Um, you know, do you go to them and say, hey, look — so this is what the foreign direct product rule basically says:

if your product has any components that are made in the United States, then we can basically, you know, tell you this can't be shipped to certain end users. And so this is a very wide-ranging level of control that the U.S. government could apply. So far, they have not done that to ASML, to Tokyo Electron, to Japan and the Netherlands — basically trying to, you know, be diplomatic about it.

Um, and you know, between the Netherlands and the U.S. there's been this tense dance, as the Netherlands wants their domestic champion ASML to get more access to the Chinese market, and the U.S. doesn't want that. Um, and so now the persuasion game is on. It's clear that the companies do not want the administration to invoke the FDPR.

They're worried that it'll provoke — by the way, when I say companies, not just companies in the Netherlands and Japan, but even American companies — they're worried it'll provoke backlash from Japan and the Netherlands, um, and basically have them stop cooperating. And so there's a lot of anxiety.

This is really, really challenging, because the technologies that ASML, um, you know, could be shipping to China are these very advanced lithography machines that could meaningfully move the needle on China's domestication of the semiconductor supply chain. So, um, yeah, it's a big story with lots of moving parts, but, uh, there you have it.

Andrey

And the last story, dealing with OpenAI — seeing some more drama about the internal processes. The title is "OpenAI illegally barred staff from airing safety risks, according to whistleblowers." These are whistleblowers who have filed a complaint with the Securities and Exchange Commission, the SEC, that alleges that the company OpenAI prohibited employees from warning regulators about potential risks of its technology.

So, according to, uh, the, uh, sources, agreements that employees had to sign, meant that they would need to waive their federal rights to whistleblower compensation, and that they would have to get prior consent from OpenAI if they wish to disclose information to federal authorities, which seemingly violates federal laws and regulations meant to protect whistleblowers who want to reveal information anonymously and without fear of retaliation.

This is, of course, coming after some other information came out with regards to the very strict NDAs that employees had to sign when they leave the company, where essentially they would not be able to say anything negative about OpenAI ever again. Uh, so clearly it seems like there was more to what OpenAI imposed on employees beyond that NDA. This also seems quite, uh, you know, non-traditional — pretty aggressive compared to what you usually see in the tech sector.

Jeremie

Yeah. I mean, you know, you and I both know, right, having signed a lot of these contracts ourselves, it is not the norm to sign an NDA that says you can never, like, surface issues with your company, indefinitely into the future. And, in fairness, OpenAI has had to walk that back after having been caught with their pants down in one of the scandals of the year — that, I think it's fair to say, really just forced their hand.

Um, there was no sign that they were going to kind of change course. There was all this denial about whether they knew, which, yeah, you know, to me seems pretty implausible. But anyway, um, so yeah, the rationale here is that these severance and nondisclosure agreements could have led to penalties against workers who raised concerns.

The letter is going to argue that this also could run afoul not just of whistleblower protections, but also of the White House executive order that demands that AI companies develop technology safely, broadly understood. Uh, the letter's being sent to the SEC — um, it was, I guess, pulled up by a journalist, or requested by a journalist, through some, like, FOIA-type mechanism, I think.

Um, and they're arguing that the SEC should require OpenAI to produce every employment, severance, and investor agreement that contains non-disclosure clauses. Uh, they also say that the SEC should issue fines to OpenAI for, quote, each improper agreement under SEC law, and direct OpenAI to, yeah, cure the chilling effect that they say its past practices have led to.

Um, one, one note here, last thing I'll say, so there is this, uh, OpenAI spokesperson who comes out and says, okay, you know, um, we're, we believe in rigorous debate about this technology, blah, blah, blah. I mean, at a certain point with all these whistleblowers, it's like, okay, uh, you know, there's, there's, uh, the public statements, sure, um, But the interesting thing here comes from, uh, Chuck Grassley, who's a senator, Republican senator from Iowa.

So he's quoted in this article as saying that OpenAI's policies and practices appear to cast a chilling effect on whistleblowers' right to speak up and receive due compensation for their protected disclosures, and that in order for the federal government to stay one step ahead of AI, OpenAI's non-disclosure agreements must change. I can't agree with this enough. Uh, yeah, I think Senator Grassley is putting his finger right on the issue here. There needs to be more transparency.

The stakes are too high for OpenAI to be kind of playing these games to the extent that they are here. So, um, yeah, I think this is another big — not legislative, but regulatory — step here, getting the SEC and its authorities involved in this whole process.

Andrey

And we are going to finish up with that — some more exciting drama out of OpenAI, as we often do have. So thank you for listening to this week's episode of Last Week in AI. As usual, you can find the articles and the text newsletter at lastweekin.ai. You can find our emails down in the episode description, and you can find us on YouTube to leave comments. And, uh, aside from that, of course, we would appreciate it if you review the podcast and share it with everyone you know.

All that aside, we would mostly appreciate it if you do keep listening. And you do enjoy the AI generated song that will close this out

AI Singer

Last week in AI, we bring you news, keep you in the know today. Trends and breakthroughs, all the headlines that will make your way. We chat, chat. Chatterbox, chatterbox, chatterbox, text scene. Stay updated, never fall behind. With the latest stories of AI on your mind. It's last week in AI, we'll keep you wired. From breakthroughs to trends, we've got you inspired. So tune in, join us, let's dive in deep. With a beat that, that will keep you wired. On your feet.

From the labs to your ears, It's a journey we share, Analyzing code with a rhythm so rare, Every byte, every line, We'll break it down. In the AI world, We'll help you win the crown. Plug in and get charged. Stay ahead of the curve with insights and lessons that you truly deserve.
