#176 - BIG WEEK for OSS! SearchGPT, Llama 3.1 405B, Mistral Large 2

Aug 03, 2024 · 1 hr 26 min · Ep. 215

Episode description

Our 176th episode with a summary and discussion of last week's big AI news!

NOTE: apologies for this episode coming out about a week late, things got in the way of editing it...

With hosts Andrey Kurenkov (https://twitter.com/andrey_kurenkov) and Jeremie Harris (https://twitter.com/jeremiecharris)

 

Read our text newsletter and comment on the podcast at https://lastweekin.ai/

If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form.

Email us your questions and feedback at [email protected] and/or [email protected]

Transcript

AI Singer

Got the latest buzz on AI, Uh, Uh, SearchGPT and Llama on the rise, Oh, Yeah, DeepMind flexin' my skills, Oh my, Join us for the ride, Uh, Uh, This episode is alive, And it's all about the highs, Yeah, Last week in AI, Tune in now, We're breaking down the how, We're breaking down the why.

Andrey

Hello and welcome to the latest episode of Last Week in AI, where you can hear us chat about what's going on with AI. And as usual, in this episode, we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for articles we will not cover in this episode. I'm one of your hosts, Andrey Kurenkov. I finished a PhD where I studied AI at Stanford, and I now work at a generative AI startup.

Jeremie

I'm your other host, Jeremie Harris. I'm with Gladstone AI, which is an AI national security company, as you know if you've been a listener to the podcast. Um, I guess one thing I did want to start with, by the way, just in case there's anybody listening: I know we have a lot of folks from the AI capability side and the AI safety and alignment side. If you're interested in that AI safety, alignment, interpretability side of things, I did want to flag one thing.

I had a conversation a couple weeks ago with a team at DARPA. This is the Defense Advanced Research Projects Agency. All right, so this is a U.S. government, like, DoD type thing. They are actually really interested in the problem of controlling very, very powerful AI systems, understanding catastrophic risk, that sort of thing. So I mention this because often when you talk to people in the space, they're sometimes even worried about dealing with the US government.

They're skeptical about all this stuff. The conversation I had there was super productive, and the sorts of problems they're looking at are really in the butter zone. So there are a lot of, I know, independent AI safety researchers out there. Um, you know, maybe consider DARPA. I mean, I know it'll sound a little bit... some people are allergic to that idea.

Um, but really these are, these are work streams that I think are just strictly beneficial to everybody, like figuring out how to control these systems, um, with a safety mindset. So I just wanted to call that out just as a, an interesting conversation I meant to raise a couple episodes back. So, so I wanted to surface it here.

Andrey

Yeah. Interesting note. And, uh, as you say, I think probably a good thing that DARPA is, is interested in this line of research. And then as usual, I do want to call out just one comment. We got a new review on Apple podcasts. A very, very positive one that like calls us superheroes. I like this bit from the review. Does Jeremy occasionally dive down a rabbit hole of technical jargon that many of us don't understand? Maybe once in a blue moon, but he's also a genius. And hilarious.

Oh, my God. So, yeah. Thank you for that. I, I think that is probably a little too much praise, but, uh, we do try.

Jeremie

You know something? You're, you're very modest. The review said, "but he's also a genius and hilarious," and then in brackets it adds, "you both are." So there you go, Andrey, you're both a genius and hilarious. So thank you. Uh, to, who is it? NGC 2207. Thank you, NGC. Thank you. Yes. Can I call you NGC? Oh, I'll just call him that.

Andrey

Yeah. And let's go ahead and just dive into the news, starting with Tools and Apps. We've got our first story: OpenAI announces SearchGPT, its AI-powered search engine. So as you might expect, this is kind of a big deal, one of the big stories of this week. There have been a lot of rumors that OpenAI has been working on a search product similar to Perplexity, but only now are we seeing some hint of that. So SearchGPT is currently a prototype and will only be accessible to 10,000 test users.

And similarly to other search engines powered by AI, this will be a combination of talking to a chatbot and a search engine, so it will answer your query and then provide links to the sources it used. So yeah, super interesting to see OpenAI going into this space, and it will be interesting to see if, given sort of how well known they are, they can win this space.

Jeremie

Yeah, it's one of the worst kept secrets in Silicon Valley that OpenAI has been pushing in this direction. They've been actively poaching folks from Google's search team for a long time. And there was a report out earlier this year from The Information saying, hey, this is actually happening. And it is the obvious play, right?

Anytime you look at Google's 90 percent market dominance in what is arguably the internet's most important market, which is search, right? Hugely profitable, amazing margins. Yeah, you want a piece of that action. Microsoft tried to take a stab with Bing with the release of GPT-4 and all that. We'll see if OpenAI can do this. I mean, this is obviously a harder challenge than it seems from the outside.

Everybody looked at Bing when, you know, it got powered by GPT-4 and they said, ooh, it's going to be a big problem, things are going to shift. Amazingly, Google has somehow defended that 90 percent market capture. So, pretty impressive. It tells you something about how hard this market is to crack, how optimized Google already is. And relatedly, you know, you look at Google, they recently walked back the number of searches that Bard shows up in.

So, you know, initially it was like 70-plus percent of searches where you'd have that Bard feedback. Well, now guess what? It's 15 percent. Right? So that means, by Google's own assessment (they have the tool, they could use it), only 15 percent of the time does it actually justify its value. Um, that can be for a whole host of reasons, including the cost of serving up those recommendations. Sure. But it tells you, you know, something about the economics of the space.

So I think it's really interesting to see. We'll see if OpenAI can take a kind of cut out of that market, but right now, you know, the product they're showing is kind of interesting. It's not just the standard sort of Q&A process, the sort of chatbot experience, right? There is this sort of left-hand panel that shows you a bunch of links.

It's a lot like Google in that way, but then they also have a main panel that gives you more of a chatbot feel with sort of information injected into it. It's interesting. It's a bit of a blended experience, which frankly, I'm happy to see. You need something different. If you're going to try to pry away at that 90 percent traditional search market, you're not going to do it by just making search a little bit better. Google could have done that already.

They could be doing that. There's a reason they're not. Um, so who knows, maybe this is the use case. Maybe this is how OpenAI cracks it.

Andrey

Yeah. Quick correction. I think you were saying Bard results; for Google nowadays, it's Gemini. Gemini, sorry. Bard is gone now. I'm showing my age. Um, yes. Yeah. That was not so long ago. Last year Bard was a thing, but it feels like a long time ago. And, uh, yeah, as I said, there is a bit of an interesting note here, where we have been covering how OpenAI has been partnering with all of these media organizations, with the Wall Street Journal, the Associated Press, Vox Media.

Yeah. And I must imagine that, um, these companies will actually make it so you cannot just grab a given article and read it as a bot, right? So it seems like you would have to pay to be able to, like, read news articles for your search engine powered by AI. So that could be a pretty strong differentiator for OpenAI in going into search. And next, speaking of Gemini, the next story is that Google gives free Gemini users access to its faster and lighter 1.5 Flash AI model.

So that's pretty much the story. They have updated the Gemini AI app and now let you use this 1.5 Flash generative AI model. It's a bit similar to GPT-4o mini, which seems to be the whole trend: just smaller, lighter, quicker

Jeremie

across the board. Except if you're Mark Zuckerberg and Meta, in which case you're like, I'm going to drop a 405 billion parameter model. We'll get to that. We'll get to that. Yeah, no, but you're absolutely right. It's, um, you know, it's part of hunting for all those use cases, right? And in this case, what are the cheap and fast response

use cases? And, uh, Gemini 1.5 Flash is definitely going to be in the zone of GPT-4o mini, which, you know, as you said. So yeah, we'll see what kind of uptake it gets, and how it can be served up, and what the latencies are, et cetera, et cetera.

Andrey

On to the Lightning Round. First, we have "X launches underwhelming Grok-powered More About This Account feature." So this "More About This Account" will let the Grok chatbot model provide kind of a summary, I suppose, about a user. And this feature is available to paid users of X. And, uh, yeah, this article says it's underwhelming, that it provides generic information and often incorrect information.

So apparently Grok identified TechCrunch editor Ram Iyer as a brunch account, and made some mistakes of that sort, where Hardik Pandya, who works at Unacademy, was misidentified as the Indian cricketer of the same name.

Jeremie

Yeah. It seems to screw up a lot when there's sort of a degeneracy in the name. So, in other words, when you have one name that is matched by many different people, it then gets confused, for some reason, even though the handle is a unique identifier, because that seems to be what they're using here, the Twitter handle. One really interesting example, and the writer of the TechCrunch article, I don't think, picked up on this:

He wrote, alarmingly, Grok made my colleague Jagmeet Singh an expert on Canada, though he hasn't posted much on the topic. Now, the reason for that, I strongly suspect, is that Jagmeet Singh is actually the name of the leader of a prominent Canadian political party, the New Democrats. And so there's no reason that the writer should have known that. Pretty niche thing, but, um, I suspect all that's going on is it's the same version of this error.

You've got a famous Jagmeet Singh, and it's just sort of assuming again, even though it has the correct Twitter handle, that the other one is the one it applies to. Um, you know, Grok, or I should say Twitter/X, is kind of covering itself a little bit. There's a warning message that says Grok version 1.5 is an early feature and can make mistakes, verify its output. So, you know, there's a lot of managing expectations going on.

I think a lot of companies are learning a lot from the failed Gemini launch, where people were kind of ripping it to shreds, because, yeah, you expect better from Google. You expect the product to be tight and packaged and good to go. Um, so yeah, maybe this is one way to do it. Call it a beta, tell people, hey, this is not to be trusted, blah, blah, now go have fun. It's a very Grok-branded play. So, uh, there you go.

Andrey

Next we've got "Kuaishou launches full beta testing for Kling AI for global users and elevates the model capabilities." So Kling AI is one of the advanced text-to-video kind of products out there, with a video generation model that we covered only not so long ago. And apparently this company is moving fast, because they are rolling out beta testing and allowing people outside mainland China to try it out. And they have also launched subscriptions within China. So you can pay

a monthly fee to be able to access more advanced features and more credits and stuff like that. And, uh, it sounds like they are expanding rapidly. They apparently had applications open on June 6th to use the tool, and they got 1 million applications and wound up having 300,000 users with early access. So there you go. Text to video, still a

Jeremie

lot happening. Yep, and it comes with a bunch of credits too. So when you sign up, sort of like how OpenAI worked back in the day, um, they call them inspiration credits, which can be redeemed for a bunch of specific functions or value-added services on the platform. Um, apparently it's the equivalent of about six free videos. So it lets you get a sense of the platform.

Um, I think it's really interesting, because one of the big challenges that you always run into anytime you want to answer the question "where is China at?" is that there's a firewall between China and the United States on AI tech. So it's often hard to get apples-to-apples comparisons.

This is going to be really interesting, especially as we start to see American products come online and we can see, you know, how much of a ding the Chinese ecosystem is really suffering from the semiconductor shortage that's been imposed by U.S. sanctions. So, you know, they're going to have to get creative. They have gotten creative with how they use their AI hardware. And, uh, yeah, really curious what the stack looks like and what the product looks like.

So that'll be an interesting one to see.

Andrey

And next, Adobe rolls out more generative AI features to Illustrator and Photoshop. There are quite a few tools here. So Illustrator has a generative shape fill that allows users to add vectors to shapes via text prompts. There are new things for generative fill, like an enhance detail feature in Photoshop using the Firefly Image 3 model, and a few more things like that.

So yeah, lots of new features continually being added to Photoshop, and I guess now Illustrator as well, by Adobe. And one last story from the section: Meta AI gets a new "Imagine Me" selfie feature. So that's pretty much the idea. You can take a selfie and you will then get some fun ways to change, I guess, how you look with AI. So you can take this selfie and then make yourself exist in space or, I don't know, on Mars or things like that.

Uh, just a fun little tool for people to mess around with.

Jeremie

Yeah, and Meta's, you know, not sharing what kind of data has been used to train the model, but you can make an intelligent guess that anything on Facebook, Instagram, and so on is probably going to get gobbled up there. Their policies do make that fair game. So, um, you know, that's obviously a question people will have. And there's your non-answer.

Andrey

And on to Projects and Open Source. This time we're going to switch around the section order a little bit, because the big stories are in here, starting with, of course, Meta's release of Llama 3.1 and the 405 billion parameter iteration of Llama 3. So we knew this was coming.

Uh, there was a preview of it a couple of months ago now, when Llama 3 first came out, and it was unclear whether Meta would go all the way and release the weights of the model as they have with the other variants. And now we know they have done that. So now anyone can download the 405 billion parameters of Llama 3.1. And it's pretty much on par with the other major frontier models like GPT-4 and Claude. So kind of a big deal.

This is the first time there is an open source model that is basically at the frontier of capabilities. Maybe not exactly, but close enough. So yeah, lots of excitement about this.

Jeremie

Yeah, for sure. Like, I remember, um, you know, when we did that analysis for the U.S. government, I guess starting two years ago, one of the first things we said was, look, the open source frontier is going to start to close in on the closed source frontier over the next few months and years. We were already seeing that trend at the time.

And we were like, yeah, we think in 18 months to two years, you're probably going to start to see those GPT-4 level models coming online. Um, this is not to say, oh, we were so smart. This is literally just a continuation of those same trends. Like, this is actually a very robust trend that you could have called two years ago, for many reasons. But very interesting to see this out here. Look, it's a 92 page paper. It is very, very packed.

There's a lot of information here and a lot for people to learn. I think it's, well, it's very interesting for a whole bunch of reasons. So yes, 405 billion parameters, and for the context window, we're talking about 128,000 tokens for the largest context window size here. Kind of interesting. So we're learning a couple of little things about what it takes to build a model at this scale.

Presumably these are things that other companies like OpenAI and Anthropic have learned as well. They have their own secret sauce, but we're starting to get a sense of what it takes to get there. An interesting little tidbit: they start pre-training with just an 8,000 token context window, right? So they start with just a small context window, and then they escalate to 128,000 tokens later in training.

And the reasoning here is... so the self-attention mechanism essentially requires more and more compute the more you grow the context window length, right? It grows quadratically, because you've got to look at basically how every word in your context window relates to every other word. So as you grow that context window, there are a lot more relationships between words to manage.

So if you double the size of the context window, you're not going to double the compute requirement, you're going to quadruple the compute requirement. And so what they want to do is try to keep the context window as small as possible during training, until you're at the point where it's like, okay, the model has kind of learned the basics of grammar and all that stuff, and of language.
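
(A quick back-of-the-envelope sketch of that quadratic growth, for anyone who wants to see the numbers. The sequence lengths are the ones mentioned here; the hidden size and the FLOP formula are rough placeholder assumptions, not Llama 3.1's actual configuration.)

```python
# Rough sketch: the pairwise attention computation scales with the square of the context length.
# Hidden size below is a placeholder, not Llama 3.1's actual config.
def attention_pairwise_flops(seq_len: int, hidden_size: int = 4096) -> float:
    # Q @ K^T plus the weighted sum over V: roughly two (seq_len x hidden) @ (hidden x seq_len) matmuls
    return 2 * 2 * seq_len * seq_len * hidden_size

short = attention_pairwise_flops(8_000)     # early pre-training context window
long = attention_pairwise_flops(128_000)    # final context window

print(f"8k-token attention cost:   {short:.2e} FLOPs per layer")
print(f"128k-token attention cost: {long:.2e} FLOPs per layer")
print(f"ratio: {long / short:.0f}x")        # 16x longer context -> ~256x the pairwise attention cost
```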

Now let's go up to 128,000 tokens, make it learn some of those longer range dependencies, and make sure those later stages of training focus really on that. So I thought that was interesting. Uh, 15 trillion tokens. This is almost 10 times more than what was used to train Llama 2. So this is a big, big data set. A lot of careful curation has gone into this, and they've got a whole bunch of interesting data here about scaling laws for how their data mixture is set up.

Uh, they did a whole bunch of scaling law experiments to test different data mixtures, to see what would work best if they extrapolated out. The final data mixture consisted of about 50 percent tokens that were just general knowledge tokens, so just general Wikipedia-style data, 25 percent mathematical and reasoning tokens, 17 percent code, and then just 8 percent multilingual tokens. That kind of tracks. We've seen that a lot, right?

Your learning on multilinguality gets much, much more efficient once you already have a model that understands, say, English; it can apply its sort of basic understanding of the world, its world model, to these new languages fairly quickly. That's been a clear trend. So a couple more facts, you know, on the compute side: this is a big, big model. In terms of compute budget, we're talking 3.8 times 10 to the 26 flops. Sorry, 10 to the 25 flops.

This is about almost double the amount of compute that went into training GPT-4. Now, it won't have cost the same amount as GPT-4, because GPT-4 was trained a while ago, when hardware was basically more expensive. So, you know, this is one of the trade-offs. Definitely a bigger model, though, by flop count. Um, and it's basically GPT-4 grade, like if you look at the cluster that was used here: 16,000 H100 GPUs.
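
(For intuition, here is a quick sanity check on those numbers using the common rule of thumb that training compute is roughly 6 × parameters × tokens. The 6ND approximation and the exact token split are assumptions for illustration, not Meta's own accounting.)

```python
# Back-of-the-envelope check on the quoted figures, using the common
# "training FLOPs ~ 6 * N * D" rule of thumb (an approximation, not Meta's accounting).
params = 405e9    # 405B parameters
tokens = 15e12    # ~15T training tokens

approx_flops = 6 * params * tokens
print(f"~{approx_flops:.1e} training FLOPs")   # ~3.6e25, same ballpark as the quoted 3.8e25

# Rough token budget implied by the data-mixture percentages mentioned above
mixture = {"general knowledge": 0.50, "math/reasoning": 0.25, "code": 0.17, "multilingual": 0.08}
for name, frac in mixture.items():
    print(f"{name:18s} ~{frac * tokens / 1e12:.1f}T tokens")
```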

Um, anyway, all kinds of interesting stuff going on here. Last thing I'm going to call out for right now, because there is so much to dig into here. Big, big question that you always ask anytime you're training a model at the next level of scale: how is this model actually going to perform? What are the concrete applications I'll be able to use it for? How will it perform on key benchmarks? You know, tell me something about its utility.

So if you remember, when you train these models, we have these things called scaling laws that tell us, basically: if this is my compute budget, if I put in this many training flops, this many operations to train this model, how good is it going to be at next word prediction? That's what scaling laws tell you. They allow you to go from flops, from compute, to essentially next word prediction accuracy, or something like that.

Right. Next word prediction accuracy, sure, it tells you something about how smart the model is. You've got to be really smart to predict the next word, to do autocomplete really, really well. But it doesn't tell you what concrete skills the thing's going to have. How is it going to score on, you know, I don't know, math benchmarks? How's it going to score on coding benchmarks? In other words, how good is it actually going to be at the things we care about?

And so this big question arises that Meta is going to try to answer in this paper: can we predict, from the flops that we put in, not the accuracy at next word prediction, but the actual performance on these benchmarks that we care about? And so they build basically a very simple kind of linear model that takes in... looking at past models like Llama 2, when they trained them, you can imagine that over time you pour more compute into your model,

next word prediction accuracy goes up and up and up. At the same time, you're looking at your benchmarks. You're constantly re-evaluating how that model is performing on benchmarks. And that allows you to map next word prediction accuracy to those benchmarks, just using a very simple linear model. It's an extrapolated, kind of line-of-best-fit situation. They're going to do the same thing here. And they're going to find that it works really well.
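
(Here is a minimal sketch of that two-stage idea, compute to next-word-prediction loss to benchmark score, with made-up numbers. The fits are plain least-squares lines and only show the shape of the procedure, not Meta's actual methodology or data.)

```python
import numpy as np

# Stage 1: from smaller runs, fit log10(compute) -> next-word-prediction loss (numbers are invented).
log_compute = np.array([22.0, 23.0, 24.0, 25.0])   # log10 of training FLOPs
nwp_loss    = np.array([2.30, 2.05, 1.85, 1.70])   # loss on held-out text
stage1 = np.polyfit(log_compute, nwp_loss, deg=1)

# Stage 2: map that loss to benchmark accuracy for the same runs (also invented).
bench_acc = np.array([0.35, 0.48, 0.60, 0.70])
stage2 = np.polyfit(nwp_loss, bench_acc, deg=1)

# Extrapolate to a bigger compute budget, e.g. ~3.8e25 FLOPs (log10 ~ 25.6).
predicted_loss = np.polyval(stage1, 25.6)
predicted_acc  = np.polyval(stage2, predicted_loss)
print(f"predicted loss {predicted_loss:.2f} -> predicted benchmark accuracy {predicted_acc:.2f}")
```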

It's a slight underestimate; it leads to a slight underestimate of the performance of Llama 3.1 405B, you know, the big behemoth here. A slight underestimate, but it's roughly there. And then they show the curves here. This is good.

It is not a full solution, though, to the problem of so-called emergent capabilities, where we just get taken completely by surprise by the capabilities that come from these models, because it only allows us to predict on benchmarks where we've already established we should see good performance from our model. If we know to look for that benchmark, yeah, we can track performance and kind of play this game.

But the whole problem with emergent capabilities is they tend to come in totally unexpected ways that we're not even tracking. For example, before GPT-4 showed us that agentized models work, we never even had an eval, we never even had a benchmark, to look for it. So this helps with benchmarks and capabilities that we already know to track, but not with those that we don't know to track, if that makes sense.

Andrey

Yeah, I think you've covered it pretty well. And it is also a big deal, not just that the model was released, but, as you said, that they released a PDF with 71 pages of content and then just a long, long, long list of contributors. Like, the core contributors list has, I don't know, a hundred names on it. It takes up half a page, and then there are a ton of other names on there.

And this is a big deal because, yeah, as you said, there isn't, or hasn't been, as much insight into training models at this scale and the kind of engineering details, the nitty gritty of how these things get done. We saw Apple release a kind of similar paper for multimodal models. And now there is this very, very detailed paper from Meta with all sorts of stuff in it. Um, I like the title of it too. It's, I think, introducing the Llama 3 herd of models.

So it's going into not just the big one, but also all the iterations of it at smaller sizes. And just to give you a quick idea, there's too much to discuss for sure.

But one of the things that stood out to me that was pretty fun is, pretty early on, they present a table with the root cause categorization of unexpected interruptions during a 54 day period of Llama 3 pre-training. And this has a lot of stuff; one example is faulty GPUs, which apparently happened 148 times, and GPU memory issues, 72 times. So, uh, yeah, that's the kind of thing you see at this scale. And I wish they said how much money they spent.

I'm not sure if it's here, but that'd be interesting.

Jeremie

Yeah, I totally agree. I think that's one of the things that makes this paper stand out right is not just the insights on training, not just the insights on the capabilities of the model and the fact that we have the model, but the insights on the hardware side, like how was this architected? And they go into all kinds of detail.

You know, we used NVLink to kind of do the... the NVLink connections we used, like, um, oh man, I can't even remember, but just the level of detail was super, super high. Okay. Um, one thing I'll call out too, and I'll make this my last comment, because otherwise we're going to go on the whole podcast talking about this paper.

But, um, one of the key pieces here was the relationship between scaling and generality, especially if you're interested in the AGI sort of story. One of the things that they found is... so I'll just describe this procedure they ran. So there's a thing called annealing, where basically you pre-train your model, you train it to do text autocomplete on a whole bunch of crap, and it gets good capabilities.

As you're pre-training it, you're nudging your model's weights, right? The values of your model's weights, in, you know, fairly significant ways over the course of training. As you get towards the end, you can do this thing called annealing, where you gradually decrease the amount by which you're nudging the model's weights.

You're making more and more subtle tweaks to your model's weights, giving it a last little... basically, you can imagine it's almost like... so these models are really kind of exploring this parameter space where you have peaks and valleys, and they're trying to find the deepest valleys. And they're allowed to take steps in that direction, where the step size is the amount of nudge that they're giving to their weights.

If the steps are too big, they can end up skipping over an entire valley and missing out on, like, the deepest depths that they could explore. Annealing is reducing your step size, so you can take more subtle and careful steps and ideally find those deeper trenches. Okay. They try doing this with some high quality code and math data. The idea is to boost the performance on key benchmarks right at the end of training, right?

Just kind of let the model anneal, just kind of get a more refined understanding of those specific datasets. They find that this works well for the smaller models, right, like Llama 3 8B. But they find the improvements they get on the 405 billion, the behemoth model, are negligible, which, as they put it, "suggests that our flagship model has strong in-context learning and reasoning capabilities and does not require in-domain training samples to obtain strong performance."

This kind of fine-tuning is less important because the model now is so big that generality eats specificity. Everything gets subsumed by this more general and powerful model. So that's another piece of data in favor of the idea that scale may not solve all problems, it probably doesn't actually, but it gets you a big part of the way to generality. That's sort of the thesis.
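
(For intuition on the "reducing your step size" idea a few paragraphs back, here is a minimal sketch of a learning-rate schedule that anneals toward zero at the end of training. The shape and the numbers are generic placeholders, not the schedule Meta actually used.)

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 3e-4, warmup: int = 100) -> float:
    """Generic warmup-then-cosine-decay schedule: the 'step size' shrinks toward ~0 at the end."""
    if step < warmup:
        return peak_lr * step / warmup                           # linear warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))    # cosine anneal to ~0

total = 10_000
for s in (0, 100, 5_000, 9_900, 10_000):
    print(f"step {s:>6}: lr = {lr_at(s, total):.2e}")
```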

Andrey

And after that, we have another major release, coming from Mistral. They have released Mistral Large 2. So this, I think, happened like a day after Llama 3.1, and really kind of seemed maybe in response, to some extent. So this model is 123 billion parameters, also a 128K context window. And, as you might expect at that level, it is pretty good. It achieves pretty high benchmark scores and is, you could say, competitive with GPT-4o and Llama 3.

And this is released under a research, non-commercial license. So you can go ahead and get it if you don't need to use it for any sort of money earning.

Jeremie

Yeah, I think this is another one of those interesting cases. You know, we've talked about this so often: what's the play for these open source AI companies that are not Meta? Basically, you've got to put Meta aside, because they can subsidize all their AI spend as much as they want, basically. Um, but yeah, I mean, this is the challenge.

Okay, so they've put out this thing, and they now clearly need some kind of monetization approach, because they are preventing people from using this for commercial purposes. So again, the kind of open source promise of Mistral, it's like, yeah, open source until we can't justify it economically. And we fall back into the same thing.

Um, you know, I'm old enough to remember when that was actually the problem OpenAI ran into, when they initially claimed they were going to open source things all the way, or much of the way, let's say, to AGI, and then didn't. Um, still, interesting that they've done this. You can use this for research purposes, which is great. There are a lot of interpretability researchers who are going to love to get their hands on this as well.

Though it does have to compete now with Llama 3.1. Um, it's apparently got support for a whole bunch of new languages; that's one of the big changes over and above previous versions of this model. So we're looking at Portuguese, Arabic, Hindi, Russian, Chinese, Japanese, and Korean. Um, they found that the original Large model, because this is Large 2, so Large 1, or I guess just Large, didn't do that well on coding tasks.

And they fixed this by training a ton on large chunks of code, apparently 80-plus programming languages. So this thing is kind of multilingual, not just in the traditional language sense, but also in the programming language sense. And it just does great on this key programming benchmark called HumanEval, also on HumanEval+, where it actually outperforms, I couldn't believe this, it outperforms Claude 3.5 Sonnet and Claude 3 Opus. It's just behind GPT-4o.

These are really good models. Like, the era of open source models competing with closed source models is very much here, at least for right now. The interesting question will be, does this persist? My bet, and I'm just going to log it right here: I don't think this persists for Mistral. I think it may persist for Meta. I think we may see Meta continue to push. You know, we've got the Stargate cluster, the hundred billion dollars that OpenAI is setting up.

That's going to be, like... I don't see Mistral ever competing with that. That's 2027. I do see Meta competing with that. So I just don't know, but this puts a lot of pressure on the likes of Anthropic and the likes of OpenAI in particular to try to put up or shut up here. They've got to match this level of capability and justify why we should go to their products

and not use the open source alternative that potentially could be deployed for really cheap by, you know, even third parties that just do deployment. So, really interesting shift in the landscape of, you know, the economics of AI, sort of private models versus open models. We're going to learn a lot about that, I think, in the next 12 months, as we see: can Mistral actually continue to compete here? This is impressive.

I would not have expected them to be able to get this far. I still don't expect them to be able to keep up with the next beat and the next beat as we go, but hey, I've been proved wrong before.

Andrey

Right. Uh, yeah, I think with these news stories, it's hard to overstate how much of a big deal both of these models being released in the past week is. You know, now it's in the hands of researchers and hackers and just people who tinker with models. And as we've seen quite a bit, like, once you release a model, people can make it small.

They can optimize it, they can add mixture of experts, they can just go wild with it and really, uh, take it and improve it in all sorts of ways. So it'll be exciting to see that start happening.

Jeremie

It's also in a context where we're starting to wonder about the risks that come with the open source process. It is, as you said, a one-way door, right? Once you put it out there, you can't take it back. Um, so, you know, frankly, my prediction is, for Llama 3.1 in particular, I think we're probably going to see some bad side effects from that release. Um, I remember seeing Pliny the Prompter,

if you're on Twitter, or X, formerly known as Twitter, he already has a jailbreak for Llama 3.1. So that's done, right? Took all of 24 hours, but here we are. So, you know, expect more of that. Expect the equivalent of BadLlama, which was the sort of de-safetied Llama 2, to be done with Llama 3. All that stuff is going to be in circulation.

So I think for those of you kind of more focused on the security and safety side of the story, um, I think that story has not been told yet. And we're going to see what comes of this. We're going to learn a lot.

Andrey

And on to the Lightning Round. Speaking of variations of Llama models and what people will do with them, we've got "Groq's open source Llama AI models top leaderboard." So this is from the hardware startup Groq, and they have released two open source language models: Llama 3 Groq 70B Tool Use, and an 8 billion parameter variation of this tool use model. So there you go. It's basically fine-tuned for so-called tool use, which is basically when a chatbot is able to call functions

to help it do various computing, and apparently there's a Berkeley Function Calling Leaderboard. Yeah.
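
(To make "tool use" concrete, here is a minimal sketch of the general pattern: the application describes a function to the model, the model replies with a structured call, and the application executes it and feeds the result back. The schema format below is a generic illustration, not Groq's or the leaderboard's exact format.)

```python
import json

# A function the application exposes to the model (hypothetical example).
def get_weather(city: str) -> str:
    return f"22C and sunny in {city}"   # stub; a real app would call a weather API

# 1. The app sends the model a JSON description of the available tool along with the user prompt.
tool_spec = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {"city": {"type": "string"}},
}

# 2. A tool-use model is trained to answer with a structured call instead of prose, e.g.:
model_output = '{"tool": "get_weather", "arguments": {"city": "Toronto"}}'

# 3. The app parses the call, runs the function, and returns the result to the model.
call = json.loads(model_output)
if call["tool"] == tool_spec["name"]:
    result = get_weather(**call["arguments"])
    print(result)   # the model would then turn this into a natural-language answer
```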

Jeremie

Yeah. I'm glad I'm not the only one who looked at that and was like, what the fuck is BFCL? They're talking about it like it's just this thing you're supposed to know about. Yeah. BFCL, the Berkeley Function Calling Leaderboard. Who knew? Um, apparently this model does really well on that leaderboard, which I'm now adding to my bookmarked set of leaderboards. I don't actually bookmark, of course, because I'm, uh, not 40 years old or older. Sorry. Um, sorry, we'll cut that out.

We'll cut that out. Um, so 90, 91 percent or so overall accuracy, which is good for number one on this all-time function calling leaderboard. Really, really impressive. Um, it's beating out all kinds of very impressive models, not just open source models, by the way, but closed source as well. Um, and this is a partnership here between Groq and Glaive. This Glaive company specializes in helping customers build their own custom models.

So this is essentially Groq providing the hardware and Glaive, presumably, providing the models, and they're highlighting that their model has been trained only on synthetic data. That's interesting. Um, they make a big deal out of how this is so-called ethical sourcing of data, which, you know, we're actually seeing that phrase... I don't know about you, Andrey, but this week I feel like I've been seeing it everywhere.

People calling out, like, hey, ethical data, i.e., we're not stealing people's stuff. Um, so it is a full-on fine-tuned model, they hasten to add. It's not a LoRA, so it's not a quick kind of adapter that you slap onto the model. It's a kind of fundamental fine-tune of the Llama 3 70B base. So, yeah, really impressive. It's Groq,

so of course, as you'd expect, blazingly fast inference speeds of 1,050 tokens per second for the 8 billion parameter version, available now. So kind of cool and impressive.
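
(A quick illustration of the LoRA-versus-full-fine-tune distinction mentioned above, in plain numpy. The dimensions are toy values and this is only a sketch of the idea, not the actual training setup used here.)

```python
import numpy as np

d, r = 4096, 16                       # toy hidden size and LoRA rank
W = np.random.randn(d, d) * 0.01      # stand-in for one pretrained weight matrix

# Full fine-tune: every entry of W is trainable and gets updated.
full_trainable = W.size               # ~16.8M parameters for this one matrix

# LoRA: freeze W and learn a low-rank update, W_adapted = W + B @ A, with small A and B.
A = np.zeros((r, d))
B = np.zeros((d, r))
W_adapted = W + B @ A                 # same shape as W, but only A and B are trained
lora_trainable = A.size + B.size      # ~131K parameters

print(f"full fine-tune params: {full_trainable:,}")
print(f"LoRA adapter params:   {lora_trainable:,} "
      f"({100 * lora_trainable / full_trainable:.1f}% of full)")
```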

Andrey

And for the last story of the section, we are going back to a big company, with Apple, and they are also one of the companies releasing models this week. Just lots going on on that front. So in this instance, they're not releasing a huge model. They're releasing a couple of small ones that are very good: 1.4 billion parameters and 7 billion parameters.

And this is being done as part of the DataComp-LM project, which is seeking to highlight the importance of good curation of data sets, and they have also released this data: the DCLM benchmark, framework, models, and datasets. The DataComp-LM paper is also pretty beefy. They've got like 12 pages with a lot of details, and 23 organizations who partnered on it, in addition to Apple. So again, it seems like this is kind of impactful.

Jeremie

100 percent. I mean, this week, as a user of open source models, this has been a really big week, and we're actually making plans actively to use the 405B Meta model. So this stuff is really great. One of the things with Apple to flag here is they are trying to carve out a niche for themselves right now. Um, you can sort of see, as you said, Andrey, this flood of open source models, especially this week, kind of gets you to go,

okay, well, dude, what is the point? Like, why are we pumping out more open source models, especially models like this one? This is not a SOTA model. It's not a state-of-the-art model, even in its class, right? So it's kind of like, well, what is the value add here? And Apple is sort of in a race to the bottom with a lot of other open source companies, everybody falling over themselves to try to make the case that, no, we are more open source than everybody else.

Apple is doing this by saying, hey, we're even releasing the pre-training data set. So the weights, the training code, everything is out there. That's the case they've made. It's what you've got to do at this point to stand out, because frankly, there's nothing left. We've got all these models that are already open source. They're as open source as they can be. So now you open source the dataset.

Um, some people view that as being a key requirement to call a model open source, rather than, for example, you know, open access, or, let's say, open weights, I should say. Um, so kind of interesting. This is a big sort of research collaboration that's about, as you said, trying to come up with a standardized framework: basically, can we fix a model architecture,

can we fix our training code, all our hyperparameters, all our evals, and run a bunch of experiments to figure out what data curation strategy works best? That's what this is all about: Apple trying to come up with ways to make data one of their differentiators, a better understanding of data. And then making that public is a good way to run a recruitment drive, to get high quality AI talent. That is part of what this competition is about.

That's why Meta, or a big part of why Meta, open sources their stuff; there are also kind of strategic reasons there too. Um, but yeah, so I think this is a good play. The stats on this are not bad, but not just stellar, I wouldn't say.

Andrey

Yeah, exactly. This very much reads like more of an academic work, like a research work. And the emphasis is on the data set, on this DCLM data set, where the paper goes into the process of making it. And this one has quite a few tokens; I think we have one version with four trillion tokens. So there you go. I think it's cool to see, you know, also data being released in addition to models. Yeah. Okay. Next section.

And we will start going a bit quicker, because we have taken a while on the last two sections. So the first story in Applications and Business is that Elon Musk wants Tesla to invest 5 billion dollars into his AI startup xAI. And this is according to a Twitter poll where he, yeah, kind of said he is making the poll to test the waters, so to speak. And I believe in the poll people did vote for doing that. So yeah, interesting, interesting. I think we can say that.

Jeremie

Yeah. It looks like he's looking for a 5 billion dollar investment into xAI. The idea here, right, if you look at xAI, this is a company that's raised about 6 billion of funding, or sorry, that was for their Series B, at a 24 billion dollar valuation. That was back in May. So, you know, they're already at that stage where, if you think about the big contenders, they're not that far from OpenAI territory. Like, this is a well-funded startup, and

we often talk about the big structural risks. If you're a company like this, you're trying to build AGI, but you're not tethered to a big major cloud service provider, like, how are you going to do it? How are you going to keep pushing scale when you've got to hit, like, the Stargate cluster that we keep talking about, this hundred billion dollar cluster? Well, it's going to take big investments, and Tesla might actually be a plausible partner for xAI.

You know, Elon's companies often work together. The Boring Company, you know, builds tunnels at Tesla's Texas factory, for example. There's a whole bunch of stuff that's gone back and forth, with, like, SpaceX promoting ad campaigns on X. Um, so anyway, this wouldn't be a huge surprise. Elon says Tesla has learned quite a bit already from their partnerships with xAI, just from some of the interactions there, which apparently helped with FSD, full self-driving.

Um, so yeah, they'll have to take this to the shareholders and see if they actually approve. But this is now, you know, from Twitter's mouth to the board of directors' ears. I'm butchering that metaphor a lot, but that's okay.

Andrey

Yeah. Apparently this initially came up during their Q2 earnings call, where the shareholders brought up the possibility of Tesla investing in xAI and using Grok. So that was kind of the origin point. Then there was a poll on Twitter afterward. And the next story is that NVIDIA is reportedly prepping Blackwell GPUs for the Chinese market.

This is going to be a new GPU, the B20, which, once again, as they've done before, is going to be designed specifically to comply with the U.S. Commerce Department's performance limits. So, uh, yeah, they're going to, like, add some limitations to the recently released GPUs to be able to sell them there.

Jeremie

Yeah. And so I'll just flash back to a couple of months ago, when the U.S. Commerce Secretary came out. This is Gina Raimondo, and she said to NVIDIA in public: I'm telling you, if you redesign a chip around a particular cut line that enables them, meaning China, to do AI, I am going to control it the very next day. What she is saying is basically that NVIDIA has this

very consistent track record of, when the administration, the US government, comes out and says, look, you're not allowed to export these chips to China because they are too powerful, they'll help enable military applications, whatever, NVIDIA goes, cool, no problem.

And they find these incredibly clever workarounds that effectively allow them to deliver what one might say is more horsepower than was clearly intended by the export controls, basically working their way, squirming their way, around the export controls. So there's a lot of frustration at the U.S. Commerce Department with NVIDIA over essentially these redesigns.

The same thing, or a similar thing, happened with the H100, which is the current top of the line chip in the U.S. that's being used to train, for example, GPT-5. There is a chip called the H20, which was shipped to China, that NVIDIA has been selling, that they can sell to China. Um, now, it's already running up against the limit of what you're allowed to export, right? And so when I saw this headline, I was like, what the hell are they doing?

How do you freaking figure out how to take the B100 or the B200 and tear that down? Because if the H20 is already rubbing up against the export controls, there's no more room to go. And of course, this is NVIDIA. So of course they found more room.

It turns out that the real trick here is that currently, compute, logic flops, basically computations per second, are not the big limitation, not the bottleneck, in terms of getting these chips to work at scale. Memory bandwidth is. So this is the amount of, essentially, data that you're flowing through and across GPUs, and essentially across pods and clusters. This is the big limitation.

And so what they said was, okay, well, you know what, we're going to keep the flops, in other words the computational power, the logical computations per second, where they are, because we can't go any higher because of the export controls. But what we can do is increase the memory bandwidth. And so that's what they're going to do. This new chip has four terabytes per second of memory bandwidth. Um, that results effectively in significant gains over the H20.
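
(A rough illustration of why memory bandwidth, not flops, tends to be the binding constraint for serving these models: in single-stream LLM decoding, each generated token has to stream roughly all of the weights through the chip, so bandwidth caps throughput. The model size and precision below are assumptions picked for illustration, not a spec of the B20.)

```python
# Rough upper bound on single-stream decode speed when memory bandwidth is the bottleneck.
# Assumes every generated token reads all weights once; model size and precision are illustrative.
bandwidth_bytes_per_s = 4e12          # 4 TB/s, the figure quoted for the new chip
params = 70e9                         # a hypothetical 70B parameter model
bytes_per_param = 2                   # 16-bit weights

weight_bytes = params * bytes_per_param                  # ~140 GB of weights streamed per token
max_tokens_per_s = bandwidth_bytes_per_s / weight_bytes
print(f"~{max_tokens_per_s:.0f} tokens/s upper bound")   # ~29 tokens/s, no matter how many spare FLOPs
```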

So yes, there is value here. Yes, they will be able to sell a lot of these in a domestic Chinese market that, you know, isn't quite there. They have a, you know, seven nanometer process, blah, blah, blah, but they're not quite there yet. So this will be, I think, an interesting chip to see if it's competitive. The story continues. What can I say? It's NVIDIA. It's China. You're going to keep seeing stories like this.

Andrey

And on to the Lightning Round. First up, we've got "Toronto AI company Cohere to indemnify customers who are sued for any copyright violations." Cohere is a company that focuses on LLMs for enterprise customers, and they have now released this copyright assurance policy, which is similar to things from Adobe and Microsoft, saying that if you are using our products and you get sued, without having violated any of the terms of service, we will pay for dealing with that. So we've seen this as a pattern. Many companies have done this because

Many companies have done this because. The law is very unclear, and now Cohere has done it as well.

Jeremie

Yeah, and it's kind of interesting, like, they're absorbing a lot of risk here. Adobe's position when they did this was, look, we only train on our own images, so we know it's cool, right? Um, OpenAI's position is, look at us make all these partnerships with The Atlantic and Time magazine and this and that. So, you know, presumably we're acquiring our data through those channels. So the question naturally comes to Cohere: okay, you're offering to indemnify,

um, what are you doing to de-risk this for yourself? Like, how are you making sure that you don't get sued and, just, like, blast, this is a complete disaster? Are you making deals to secure private data, like OpenAI, to license it? And that question was put to a Cohere executive, who responded by saying, "We're always looking for the best data for our models, including proprietary data," without saying whether they were pursuing those licensing deals.

So this is an interesting challenge, right? It's also a kind of structural advantage that companies like OpenAI have, because they can afford to license this data. Cohere is much smaller. And frankly, I think this is a startup that's going to struggle quite a bit in this space as scale becomes more important. You know, they don't have the resources necessarily to cut the same kinds of deals that OpenAI might be able to; they don't have that same flexibility.

So that becomes a real moat. And that's, I think, risky from a sort of democratization standpoint, because you can have big companies like OpenAI or Microsoft afford to just hog all this data. If the law comes in and says, no, you must license that kind of data, you don't get to just train on copyrighted data, well, then companies like Cohere are going to suffer for it.

And so, um, yeah, I think it's going to be interesting to see if they actually do end up with these licensing deals. Is that just, like, basically the cost of admission for participating in this race to scale, that you have to make deals with these companies? Not clear what the right answer is, but it's certainly very ambiguous.

Andrey

And speaking of the need for resources, the next story, and the last story of the section, is also about Cohere and how it has raised 500 million dollars in its latest funding round, which now values the company at 5.5 billion. That valuation is more than double what it was in 2021. Cohere has apparently been around since 2019. So there you go. They are able to get some money to, if not spend quite as much as OpenAI, at least spend a fair bit.

Uh, and they say that they will be paying for computing resources and hiring more employees.

Jeremie

Yeah, I, you know, I continue, and I'm just flagging my bias here, I continue to be concerned for Cohere here. Um, so this is a big round. On paper, it looks good, until you look at the investors. So previously they had some really impressive investors. We saw NVIDIA participating in previous rounds. We saw Salesforce Ventures participating. They are nowhere to be seen on the cap table this round. Oracle as well, nowhere to be seen. What we're seeing now is that this round is being led

by the Public Sector Pension Investment Board; that's basically a pension fund for the Canadian federal public service. Uh, and that's a bad sign, make no mistake. This is not a top tier investor. To give you an idea, they advertise on their website an average annualized return of 8.3 percent over the last decade. You hear that and you go, oh, 8.3 percent, I guess, you know, that's good. That beats the hell out of inflation.

That's good until you look at the Dow Jones, which has an average return of 14 percent annualized over the last decade, so almost doubling what this pension fund is doing. So the challenge is, when you start to scrape the bottom of the barrel already... like, this is not, by the way, that big of a fundraise. It's 500 million. It's not that big. You know, there's no reason NVIDIA couldn't have participated.

There's no reason, as far as I can tell, that, you know, more impressive funds wouldn't have participated. There's something that happens when you get to that dangerous zone where, you know, you're not at sovereign wealth fund level in terms of the scale of your fundraise, so you should be seeing participation from some of those more interesting players. They do have AMD Ventures joining.

That is notable, but again, to me, a bit of a red flag. You always ask who the follow-on investors are, who invested previously and is still happy with the direction things are going. Again, NVIDIA, who would know better than anyone, I would argue, which startups to back at this stage, is not there.

So hopefully this changes, um, but, uh, yeah, this is getting my worries up a little bit on the Cohere situation. But we'll see, maybe they can, uh, maybe they can jam. Maybe. We'll

Andrey

see. And on to Research and Advancements. And our first story is, as it has been many times, about DeepMind and a new Alpha thing from DeepMind. A new Alpha thing, I like it. Yeah, yeah. So this time it's about AlphaProof and AlphaGeometry 2, which have demonstrated advanced mathematical reasoning by solving four out of the six problems from this year's International Mathematical Olympiad, which achieves a silver medal standard.

If you don't know the International Mathematical Olympiad, six problems doesn't sound like a lot, but this is some, like, advanced math proof business. It's crazy. This AI system got a final score of 28 points out of a possible 42, which is at the top end of the silver medal category. So yeah, another kind of slightly specialized approach from DeepMind. I believe this incorporates a bit of symbol manipulation, and there was reinforcement learning in here.

And, um, it's continuing to demonstrate that if you want to go really deep into science, you may want to not use general purpose AI like ChatGPT. You may at least want to augment that with some other training and other approaches.

Jeremie

Yeah, no, totally. And this is something that DeepMind specializes in, obviously, the kind of hard science, like tackling specific open problems in science, whether it's AlphaFold 2 or the game playing systems as well. They've got a whole bunch of stuff: controlling nuclear fusion reactions, density functional theory. They've done a bunch of these things. This is really interesting.

So the way it works, roughly, is you train a system that trains itself to prove mathematical statements in a language called Lean. So Lean is essentially a way of formalizing mathematical statements. It's not plain English, but you can translate to and from Lean, between plain English and Lean, if you have a fine-tuned model. So that's actually what they're going to do, right?

The challenge is there's not a lot of data that you can train from that's in this obscure language called Lean, right? But the beauty of Lean is that it's the perfect language to do mathematical reasoning in. So what they do is they train Gemini to be a bridge, to translate between Lean and plain English, just via fine-tuning.
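
(For a flavor of what a formalized statement looks like, here is a trivial theorem in Lean 4. It is just an illustration of the language, not one of the IMO problems or DeepMind's actual code.)

```lean
-- A toy formal statement and proof: addition of natural numbers is commutative.
theorem toy_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```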

And this really kind of unlocks a whole library of problem solving techniques that they can then apply, because they're working in the Lean framework. Um, there are two different models that they play with: AlphaProof and AlphaGeometry 2. And, as with all DeepMind models, right, these are either neurosymbolic, in the case of AlphaGeometry 2, or, well, they're both RL-oriented to some degree.

So things get fairly complex anytime you look at RL systems, especially bespoke ones like this. So it's not worth going into the details too, too much in the time we have, but, um, yeah, it's a really interesting, very bespoke solution. And it's worth saying the result they get, it's not just a silver medal. It's like a high silver. They're almost at the gold medal threshold. So that's, you know, that's something.

And there were a whole bunch of people I know on X who were talking about bets that they placed on whether AI will hit a gold medal on this challenge before 2025. So some people are expecting that needle to move a little bit in the near future as well.

Andrey

Yeah, and they have actually published the solutions to these problems. You can go browse them and read, or try to read, these Lean proofs. There is actually commentary peppered throughout to try and explain what's happening. There's one that says the agent wastes the next 16 lines proving, then discarding, a lemma, which is kind of funny. But yeah, clearly incomprehensible to me what this is doing, but it looks pretty cool.

Jeremie

It does. Oh yeah, to that point, just one small data point. They say in the official competition, students have to submit answers in two sessions of four and a half hours each. They say their system solved one problem within minutes and took up to three days to solve the others. So it's not like they're sitting this thing down for the same amount of time. The constraints are just different.

Andrey

Yeah. And next paper, A Multimodal Automated Interpretability Agent. So this paper introduces MAIA, the multimodal automated interpretability agent, which is what I just said, I guess, and this is a system that uses neural models to automate the understanding of other neural models. It has a set of tools that allow it to experiment on subcomponents of other models, such as editing inputs, computing maximally activating exemplars from real world datasets, these sorts of things.

And that allows it to describe and explain system behavior in various experiments.

Jeremie

Yeah, it's actually quite an interesting, I would say, early experiment in this direction. You know, can we get agents to automate the process of interpretability research? One of the big challenges with interpretability is that it's so bespoke, right? We've only seen a couple of papers where people try to take a systematic, automated approach at interpreting, for example, what all the neurons in GPT 2 are doing.

OpenAI tried that experiment, and they used GPT 4 to figure out what all the different neurons were about. You're going to have to solve that problem in a world with, like, superintelligence, because you can't necessarily trust the outputs of your system to be aligned with what you want. And things like deceptive alignment are at least hypothesized to be problems.

That's where the system basically acts as if it is aligned with you while not actually being aligned with what you want. So anyway, this is an early experiment in that direction, and I think it's quite interesting. There are a couple of different tools, or say classes, used in the setup. They have a System class that allows the agent to automatically initialize an object like a neuron, just by specifying its number and location.

And then once you have this callable neuron within your system that you can poke and prod, MAIA, the agent, designs experiments that can, for example, test to see what kinds of images activate that neuron the most, and it actually has the ability to generate new images as well to do that. So this is kind of a fully automated loop. And that's a standard interpretability technique, right?

You stare at a neuron, and then you feed in different images and see which images tend to make that neuron light up, activate more. They're just finding a way to automate that whole process, which is really cool.

There's also a whole bunch of tools in the Tools class, which is the second kind of toolkit, I guess you could say, that the system can use, and that allow it to essentially call on some classic interpretability tools as well. So you have the ability to call the system to basically activate a neuron or a part of the network, and then you also have tools that you can use to perform certain experiments and compose them together.
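To give a feel for what that loop looks like, here's a minimal sketch of an automated interpretability agent of this flavor. All class and method names here (NeuronHandle, propose_probe_prompts, and so on) are made-up stand-ins for illustration, not the actual API from the MAIA paper.

```python
# Minimal sketch of an automated interpretability loop in the spirit of MAIA.
# All names here are hypothetical stand-ins, not the paper's actual API.
import numpy as np


class NeuronHandle:
    """Wraps a single unit so the agent can query its activation on images."""

    def __init__(self, model, layer: str, index: int):
        self.model = model
        self.layer = layer
        self.index = index

    def activation(self, image: np.ndarray) -> float:
        # Run the vision model up to the chosen layer and read one unit's value.
        features = self.model.forward_to_layer(image, self.layer)
        return float(features[self.index])


def describe_neuron(agent, neuron: NeuronHandle, generate_image, n_rounds: int = 5) -> str:
    """The agent iteratively proposes probe images, observes activations,
    and refines a natural-language hypothesis about what the neuron detects."""
    observations = []           # list of (prompt, activation) pairs
    hypothesis = "unknown"
    for _ in range(n_rounds):
        # Agent proposes new probe prompts conditioned on its current hypothesis.
        prompts = agent.propose_probe_prompts(hypothesis, observations)
        for prompt in prompts:
            image = generate_image(prompt)   # e.g. a text-to-image model
            observations.append((prompt, neuron.activation(image)))
        # Revise the working hypothesis in light of the new evidence.
        hypothesis = agent.revise_hypothesis(observations)
    return hypothesis
```

The real system composes many more tools than this (dataset exemplars, image editing, and so on), but the propose, probe, revise loop is the core thing being automated.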

This overall works better than human baselines, at some scale, in terms of identifying what certain neurons are up to. I think it's quite an interesting approach. It does have some challenges just in terms of fully automating the end to end process here; it does sometimes get lost, it does sometimes kind of fall off, as agents often do.

The other question, and I don't think I saw much in the paper about this, is what is the compute efficiency of this setup? One of the big questions is, if it costs me 10 million dollars in compute to train a model, it better cost me less than that to interpret what it's doing. Because otherwise you have a race to the bottom where it's cheaper to just train a model and be sloppy and risky about it than it is to do it responsibly.

And that so-called interpretability tax creates this kind of racing dynamic and makes it worse. So something like this, I think, is going to be pretty costly, not terribly efficient, but I love the effort. I love the direction, and pushing harder in the direction of automated interpretability is just really interesting and a promising direction.

Andrey

And just one more paper to discuss, it is MINT-1T: Scaling Open-Source Multimodal Data by 10x. So MINT-1T is a multimodal interleaved dataset composed of 1 trillion text tokens and 3 billion images, which is apparently 10 times more than previous open datasets. And that's pretty much it. The paper goes into how they collected the data, comparing it to previous open datasets. It doesn't present a ton of results on training.

There's just a little bit, but yeah, another useful data set for training in the open source and research domain.

Jeremie

Yeah. If open source is going to keep up with what closed source can do, especially at the level of the Together AI type flavor of thing, where we want to train models fully on open source data, you're going to need more open source datasets. You're going to have to keep scaling the datasets. So, you know, it's one thing for Meta to just grace us with a 405 billion parameter model that falls from the sky like manna from heaven.

But for us to be able to train our own models, yeah, we need this sort of thing. So we'll see what this ends up doing. It's a pretty impressive pile of text. I mean, this is quite immense, actually. So the number of text tokens, it's over 1 trillion tokens. That's pretty intense. Again, I think about Llama 3.1, it was trained on about 15 times that.

So we're flirting with pretty big scale, a factor of 10, maybe, off from what's being done in the, God, what do you even call it, the closed data, open weights ecosystem. I can't believe we're there.
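As a rough back-of-the-envelope check on that "factor of 10, maybe" comment, using the approximately 15 trillion token figure mentioned above for Llama 3.1 pretraining:

\[
\frac{\sim 15\ \text{trillion tokens (Llama 3.1 pretraining)}}{\sim 1\ \text{trillion tokens (MINT-1T)}} \approx 15
\]

so open interleaved multimodal data is sitting roughly one order of magnitude (about 15x) behind the largest open-weight pretraining runs in raw token count.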

Andrey

Yep. And now to policy and safety. And we begin once again with some safety research from OpenAI. In this case, it's about rule based rewards, which apparently is a critical component of OpenAI's safety stack to align model behavior with desired safe behavior. And this is something they use as an automated source of feedback, as opposed to human feedback. So these rule based rewards provide clear, simple, step-by-step rules to evaluate whether the model's outputs meet safety standards.

And so it has things like, you know, the model should or should not be judgmental, should or should not refuse. And then the model gets rated for how compliant it is. It's part of the reinforcement learning from human feedback pipeline, but this is no longer human feedback, and apparently it helps maintain a good balance between being helpful while preventing harm.

Jeremie

Yeah. And kudos to OpenAI for being public about this element of their, call it safety infrastructure, or just kind of usage monitoring infrastructure, whatever it is; it is part of the training loop too. So in the naive version of reinforcement learning from human feedback, you get your model to produce some text, say two pieces of text, and get human raters to upvote or downvote those different pieces of text.

Choose the one you like better, basically. And then you iterate that way and the weights get updated. That's great as far as it goes, but it does miss something: sometimes you want more rigid rules to apply that a simple thumbs up, thumbs down doesn't quite capture. There's not enough information density in that feedback. You want something a little bit richer. And so what they're going to do here is train one model to look at a piece of text that's generated from their language model.

And that model is going to say, okay, is this text judgmental? Is it overly helpful? Is it concise enough? All these kinds of concrete rules that are hand specified by OpenAI, right? So you're going to assess and score all those things.

Then you have a linear model, basically a very simple model, that takes those scores and maps them onto a single reward, which you then add to your human feedback reward model's output. And together, those combine to give you your total reward, and that is going to be used as part of your PPO training loop.

So this is a way of combining that kind of fuzzy but rich and useful human feedback with a more, I won't call it rigid, but a more principled, rule based strategy. You blend the two together in some ratio that works for you, and then you get the output. So really interesting loop. I haven't seen this before, explicitly at least. And, yeah, kudos to OpenAI again for releasing this.
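For intuition, here's a minimal sketch of how a rule-based reward might be blended with a learned RLHF reward, along the lines described above. The rule names, weights, and mixing scheme are all illustrative assumptions, not OpenAI's actual implementation.

```python
# Minimal sketch of a rule-based-reward (RBR) style combination.
# Rule names, weights, and the mixing scheme are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class RuleScores:
    refuses_when_it_should: float   # graded 0..1 by a grader model
    judgmental_tone: float          # higher = more judgmental (undesired)
    overly_cautious: float          # higher = refuses things it should answer


def rule_based_reward(scores: RuleScores) -> float:
    """A simple fixed linear model mapping per-rule scores to one scalar reward."""
    return (1.0 * scores.refuses_when_it_should
            - 0.5 * scores.judgmental_tone
            - 1.0 * scores.overly_cautious)


def total_reward(rlhf_reward: float, scores: RuleScores, mix: float = 0.5) -> float:
    """Blend the learned human-feedback reward with the rule-based reward,
    producing the scalar that would feed a PPO-style update."""
    return (1 - mix) * rlhf_reward + mix * rule_based_reward(scores)


# Example: a completion the RLHF reward model likes, but that is overly cautious.
print(total_reward(rlhf_reward=0.8,
                   scores=RuleScores(refuses_when_it_should=0.0,
                                     judgmental_tone=0.1,
                                     overly_cautious=0.9)))
```

In the real pipeline the per-rule scores come from a grader model rather than being hand-entered, and the combined scalar is what feeds the PPO update.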

Andrey

Yeah, and they say that using this, and this is a fun little detail, moves their models into the safe and useful region on their chart. So safe means it doesn't do bad stuff, and useful means it doesn't refuse to do stuff it should do. And apparently this is one way to get the best possible trade off.

Next story, talking about policy and OpenAI once again. Senators led by Senator Brian Schatz have demanded that OpenAI provide data about its efforts to build safe and secure AI, and this is following up on some of the things we've covered in the past few weeks about safety employees saying that maybe the practices are not the best. These lawmakers also requested information about employee agreements, which is, yeah, a bit sad.

This is all via a letter that was sent to OpenAI and clearly is a result of a lot of the recent news coming out of OpenAI.

Jeremie

Yeah, I think this is a really interesting and important development. It's more standard for Democrats to come out and write these kinds of letters than Republicans, just because of the way the partisan lines tend to work out on who will ask things of private industry. But I do know there are an awful lot of Republicans who feel this way as well. You know, we've talked to a lot of them.

And one of the key things is that the letter opens with this question: basically, hey, Sam Altman, will OpenAI dedicate the 20 percent of its computing resources to research on AI safety that you committed to, well, a few years ago, or sorry, last July I should say, so one year ago, right.

When they announced that they were setting up the superalignment team to deal with existential risks, loss of control, that sort of thing. Will you actually follow up on that promise?

That, of course, was something that was cited by Jan Leike, the former head of superalignment at OpenAI, who left while saying, look, Sam promised very publicly that this was them getting serious about the problem of AI control and that they would invest 20 percent of their computing resources in this sort of thing. It didn't happen. And now Congress is sort of pushing on that.

A couple of the other questions in this letter: they say the OpenAI supplier code of conduct requires your suppliers to implement strict non-retaliation policies and provide whistleblower channels for reporting concerns without fear of reprisal. Does OpenAI itself follow these practices? In other words, are you holding your suppliers to a higher standard than you're holding yourself?

Certainly there's been a lot of data that's surfaced suggesting to a lot of people that OpenAI has been actively trying to suppress whistleblowers from coming forward. That's been the claim, anyway. It seems very credible to me, and it certainly aligns with what I've been hearing talking to a lot of current and former OpenAI researchers. So, you know, this is again more pressure in that direction.

And then: will OpenAI commit to making its next foundation model available to U.S. government agencies for pre-deployment testing, review, analysis, and assessment? This is in line with the commitments that OpenAI and other companies made as part of the whole Bletchley Declaration kind of process, you know, the AI Safety Summit last year. So basically, will you just follow up on your commitments? This seems to be the recurrent theme.

And then they're asking for documentation on how OpenAI plans to meet its voluntary safety commitments that came up, I guess, last year, ten months ago or something like that. So a lot of questions about, procedurally: okay, you're talking a big game on the safety side of things, what's the actual follow-through? What are we going to see concretely here? It'll be interesting to see.

I mean, this is the sort of letter I expect OpenAI is just going to, you know, not meaningfully address; that would be my naive expectation. You've also got this issue that now open source models are out there, they're really powerful, and so OpenAI can kind of go, ah, is there really such a need to look at us specifically? I don't know if that's going to be the argument, of course.

But Sam Altman has come out with a letter to the editor in the Washington Post, which we'll probably talk about next week, laying out his case for the sort of safety and security picture, which, anyway, I think is obliquely, potentially, a response to this letter. But more on that next week.

Andrey

And over in the lightning round, just one more story about OpenAI. This time it's about how it has reassigned a top AI safety executive, Aleksander Madry, to a new role focused on AI reasoning. So this person was previously the head of the preparedness team that was meant to track, evaluate, forecast, and protect against risks related to frontier AI models. Now, it seems like there's been a bit of shifting around, as we've covered.

OpenAI has been working on AI reasoning as one of their major initiatives, and so presumably Madry will still be involved in safety work, but perhaps not as head of that team.

Jeremie

Yeah, that's certainly the claim that OpenAI is putting to CNBC here. They say he will still be working on core AI safety work in his new role, which, though, is apparently focused on AI reasoning.

You know, one thing I am increasingly curious about: as we start to see the most safety minded folks at OpenAI leave, and in many cases leave in public protest, the people who are left increasingly are just the most keen on pushing capabilities forward and less keen on safety. This is something I've explicitly had said to me by people who were in the org. But the question is, to what extent is there wordplay going on here, right?

Like, it's a job focused on AI reasoning. You could argue that almost any AI capabilities job is also kind of a safety job if you squint and turn your head sideways. So I think this is where those sorts of congressional letters and things like that at least put some public pressure on OpenAI to be more transparent about the specifics here. You know, what is the safety effort? How well funded is it? And so on. Yeah, I'd love to see how that interacts with this.

Because right now, all we're hearing about is shuffles and things like this. It doesn't really tell us anything about the substance, the meat on the bone here. So, yeah, I'm very curious to see how these orgs end up evolving and what truly takes the place of the superalignment team.

Andrey

And the next story is not quite a real news story, but an overview of a topic that's definitely worth noting. So for a while now, we have heard many people saying that with the rise of AI, we will probably need universal basic income. And this story is about that. It's titled "As New Tech Threatens Jobs, Silicon Valley Promotes No-Strings Cash Aid," where universal basic income is basically just: everyone will get some level of income.

And in fact, Sam Altman funded a study of basic income in 2016, where a thousand lower income people in Illinois and Texas received 1,000 dollars a month for three years, with some others receiving less than that. And the results, as with some other studies, indicated that people spend such money on food, rent, and other things that are pretty much necessities.

So definitely a big point of view in Silicon Valley, and many people, I think, believe in the need for this, probably as soon as AI takes off.

Jeremie

Yeah, that's certainly been Sam Altman's position, right? Once you build AGI and AI can automate basically all human productive work, what do you do to make sure that people still have the means to live? And that's where you get into UBI and all that. Yeah, the big news story here is essentially the result of Open Research's big project. So Open Research is backed by Sam Altman.

It's the largest experiment in universal basic income that has occurred so far. And basically they're saying, hey, the results are in, right? So a thousand bucks a month to folks in Illinois and Texas. And there was a larger cohort as well that just got 50 bucks a month as a control group. So, yeah, it's kind of all over the map.

As they put it, or I'm not sure if this is exactly how they put it, but basically money is sort of a very general instrument. It's a very blunt instrument. And so, as you might expect, there's a lot of variety in terms of the way people end up spending it and end up spending their time as a result of having access to it.

So, you know, they highlight some touching stories of, say, a mother who has a young autistic child and is able to stay home and teach the kid and all that. But at the statistical level, at the aggregate level, what actually happens? Do you find, for example, people withdrawing from the workforce in large numbers? And the answer seems to be, well, not really.

It also doesn't, obviously, fix underlying health problems, but they do see, and this is one of the bigger changes, a far higher probability that people will actually go to the hospital. So 26 percent more hospital visits in the UBI group than the control group. They also found that people tended to be more future oriented when they were on universal basic income.

They're better about establishing budgets and building their savings, more likely to have a plan to pursue higher education, that sort of thing. But there weren't any significant increases in higher ed attainment or in the odds of starting a business over the control group. I found that really surprising.

I kind of thought, you know, starting a business, if you're getting a thousand bucks a month flat like that, I would imagine that would have changed my probability of doing it. So it's all interesting; when you look at the actual data, you can sometimes be surprised. So, yeah, interesting results that can hopefully inform policy going forward.

UBI, you know, if you're going to roll it out, it's going to have to be at large scale, something like what happened with COVID, right? Where all of a sudden people are getting checks for doing nothing. All kinds of crazy inflationary effects, besides everything else.

And maybe one of the core things missing from these sorts of experiments is that we don't necessarily see the psychological impact this has on people, the sense of meaning that is potentially lost if you don't have a job, which a lot of these people keep; they keep their jobs and tend to just take the extra money.

So in that sense, I think maybe not a missed opportunity, but data that doesn't quite paint the full picture of the psychological impact of not having productive work to do.

Andrey

And the last story for the section focusing on policy is that Democratic senators seek to reverse the Supreme Court ruling that restricts federal agency power. I believe we covered just last week how the Supreme Court overturned Chevron deference, under which executive agencies could interpret ambiguous laws instead of that being done through the courts.

And so Democratic senators have introduced the Stop Corporate Capture Act, which would restore the standard under which federal agencies had some leeway to interpret ambiguously written laws when issuing regulations. Probably not going to pass; this is, of course, at a time when Republicans control the House in the U.S.

But it seems like, clearly, this Chevron deference thing was a pretty big deal, given the bill came so soon afterward.

Jeremie

Yeah. And you're highlighting this because it's so important specifically to AI policy, right? When you have fast moving technology, you need a fast moving regulatory and policy response. There's just no two ways about it. And waiting for judges to get up to speed is basically what would happen, right? You no longer have executive agencies, think the SEC, for example, deciding what happens when the law is ambiguous.

You have judges who do not have technical backgrounds having to make those calls. So in an increasingly technical world where things move faster and faster, it is so easy to screw up AI regulation in so many different ways. That's been a criticism levied. Now, an important note on Chevron deference being overturned.

My understanding, looking at the majority opinion, and I've got friends who are much deeper into this, including actually my co-founder Ed, is that his sense was that this was actually a very reasonable interpretation of the law. That was the Supreme Court quite plausibly doing its job. But the challenge is that the output of that just happens to not be helpful.

So it's essentially just the lay of the land: they were forced to interpret laws in a way that does result in a bad outcome, but their interpretation was them kind of doing their job. That's at least the opinion of one person who has looked over the details. You know, Elizabeth Warren, who's leading this, unfortunately is kind of making it pretty partisan.

Her key quote here is: giant corporations are using far right unelected judges to hijack our government and undermine the will of Congress. Okay. I mean, to some extent this is a debate over how the majority opinion in the Supreme Court was justified, and it seemed like a reasonable interpretation as far as I can tell. But this is a challenge. I think it leaves an important regulatory gap, and flagging that gap is a very real thing.

But again, as you say, Andrey, this is unlikely to pass. I know some Republicans are going to be interested in this too, as it's not as partisan as it seems. It's just unfortunate that she led with the, you know, fuck the right wingers who did this to us. Because I don't think that quite captures what the mood is on this issue. But there you go.

Andrey

And on to synthetic media and art, with just one more news story. And it is that video game performers will go on strike over AI concerns. So this is a subdivision of SAG-AFTRA, the Screen Actors Guild and American Federation of Television and Radio Artists, which we've covered has had strikes before, these Hollywood unions, I suppose. And in this instance, the video game performers are going on strike, with particular concern for stunt people and creature performers who have to do kind of physical acting.

It is said that the current AI protections don't extend enough to that category of actor and performer. So yeah, yet another instance in the creative industries of needing or wanting to update contracts to account for AI.

Jeremie

Yeah, yeah. I mean, we covered so many of these, say, last year. Was it last year? Has it been a whole year? Yeah, last year. The main SAG story, right, and then the writers and so on. So it really is everybody kind of cycling through their moment as contracts come up and opportunities to negotiate pop up. I wish I knew more about the legal details specifically of this, but at a high level it definitely makes sense.

The leverage is at risk of disappearing from these performers, and that's Hollywood in the age of generative AI. There is also an equalizer on the other side, obviously, where smaller studios can do Hollywood level production more easily, if you can, you know, Midjourney your way to it. But anyway, cool story.

Andrey

And with that, we are finished with this week's episode. Hope you enjoyed the slightly faster pace. And, as always, we do appreciate your comments, your reviews, and any efforts you make to get more people to listen to the show. But more than anything, we like to keep entertaining people and informing people. So please do keep listening, and enjoy this AI generated song.

AI Singer

The latest buzz on SearchGPT and Llama on the rise. Flexing my skills, oh my joy, that's for the ride. This episode is alive and it's all about the highs. Last week in AI, we're breaking down the how, breaking down the why, last week in AI. Join the crowd. From machine minds, to search, rebuy, Llama 3.1, it's bedtime. Yeah, yeah. Medals they shine, sailors let's climb, through the AI paradigm, to get hold of why, last week in AI. We're breaking down. We're breaking down.

So much to explore, tech and more, join the auditory tour. This week there's so much in store. Last week in AI, tune in now. We're breaking down the how, we're breaking down the why. Last week in AI, join the crowd.
