
#188 - ChatGPT+Search, OpenAI+AMD, SimpleQA, π0

Nov 08, 2024 · 2 hr 52 min · Ep. 227

Episode description

Our 188th episode with a summary and discussion of last week's big AI news!

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at [email protected] and/or [email protected]

Read our text newsletter and comment on the podcast at https://lastweekin.ai/.

This episode was sponsored by The Generator. If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form.

In this episode:

* Meta's open-source models utilized by China's military prompt regulatory adjustments; US agencies gain access to counterbalance.
* OpenAI partners with Broadcom and AMD to develop custom AI hardware, aiming for profitability and reducing inference costs.
* Physical Intelligence unveils a generalist robot control policy with a $400M funding boost, showcasing significant advancements in zero-shot task performance.
* New U.S. regulation mandates quarterly reporting for large AI model training and computing cluster acquisitions, aiming to bolster national security.

Timestamps + Links:

Transcript

AI Singer

In a world where tech's on the rise, we've got the latest scoop, open your eyes. From ChatGPT searches, who's so wise, to Apple's AI touch in the skies, it's the last week in AI.

Andrey

Hello and welcome to the Last Week in AI podcast where you can hear us chat about what's going on with AI. And as always in this episode we will be summarizing and discussing some of last week's most interesting AI news. And as always you can also check out our text newsletter at Last Week in AI with even more AI news we will not be touching on here. I am one of your hosts, Andrey Kurenkov. My background is that I studied AI at Stanford and I now work at a generative AI startup.

Jeremie

And hey everybody, my name is Jeremie Harris. My background is that I just had a baby. And for that reason, that's not my whole background, but you know, it's a particularly relevant one. Um, so we were going to record this episode last week on, on Friday, as we normally do; we're doing it on a Tuesday because Andrey very generously bumped it, uh, when I found out I was going to be spending like about 24 hours in the hospital with my newborn. Everything's fine. Everything's totally fine.

Just a standard newborn scare. But, uh, for that reason, I'm sort of playing catch up a little bit. You're going to get my takes a little bit hotter, a little bit spicier, less, less baked as what, I guess those are kind of contradictory things to say. You get it. You get my metaphors. Less pre

Andrey

compiled, more runtime

Jeremie

compilation. That's right. Yeah. Like, inference-time compute over training-time compute, whatever analogy works for you. There you go. That's what we'll be doing. Um, yeah. So, you know, Andrey very diligently added a bunch of stories like from between the hospital stay and now, and so I'll be flying by the seat of my pants on that stuff. Um, I, I do apologize, but you'll, you'll get to see, I guess, my, my terrible thought process in action. I'm super excited to be here and, uh, yeah, like.

I guess so much stuff did happen. So there is, there's a lot to cover.

Andrey

That's right. Yeah. Let me do a quick preview of the episode. So it's, it's a bit of a mixed bag of stuff this week. There's no like one unifying theme. Tools and apps: a big story, of course, is search in ChatGPT, and we also have Apple Intelligence. Applications and business: some of the regular themes we've seen with OpenAI and their hardware, with autonomous driving, and, uh, Meta and others striking deals with media providers. Then on to research and advancements.

We'll actually be highlighting something in robotics that I'm excited about and also some evaluation type things, policy and safety, our usual kind of grab bag of different things and some opinions as to policy and also some research. That's pretty interesting. So that's the preview, but before we start on the news real quick, as always, do want to give a shout out to the listeners who provided feedback and comments.

I do look at YouTube and try to sort of, uh, be aware of anything people say on there. And I do also periodically check reviews on Apple Podcasts. And it's been really nice to see a few more come in. In fact, one of them, uh, actually congratulated Jeremie on your newborn daughter, which is very sweet. Yeah. That's now forever part of the, uh, Apple Podcasts reviews. Uh, and it's also interesting in these reviews to hear about people's backgrounds.

So there's a couple of these that say, for instance, uh, there's an immigration lawyer and founder of a company making AI products for the legal profession. Uh, there is a senior tech and finance executive. Stuff like that. It's actually really interesting to see the kinds of people that are interested in this and are benefiting from it.

Jeremie

Yeah, it's, it's actually kind of wild. Um, when I'll run into it in my line of work in different ways where people will say, Oh yeah, you know, I heard about this ball and I'm like, yeah, that's, it's funny. You know, we just covered that on this podcast. They'll be like, yeah, that's where I heard it. Um, which is always kind of amusing and cool. So really do appreciate the, the reach outs and. Um, it's amazing.

I'll tell you, I mentioned this, I think, last week, but it actually feels like a community. Um, you know, like the, the number of messages that, that came out, like, hey, you know, uh, good luck with the, the, the baby and that sort of thing is very, very sweet and thoughtful and, uh, anyway, it makes it so much easier to, to put in the like, uh, five hours or whatever it is to, to prepare for each episode. And we really do appreciate it. And, uh, glad that some people find it useful.

Andrey

Yeah. I don't know. Maybe we'll think about starting a Discord or something. If people want to make a Last Week in AI community and discuss AI news, I'm just throwing that out there. Totally, you know, uh, top of mind, but if you would like that, you can feel free to let us know. And then one last

thing before diving into the news: as with the last episode, we do have a sponsor to acknowledge, and that sponsor is The Generator, an interdisciplinary AI lab focused on entrepreneurial AI from Babson College. Babson College, if you don't know, is the number one school for entrepreneurship

in the US, and this is a new initiative with professors from different departments there organizing this new kind of interdisciplinary lab that has various groups, uh, focusing on, for instance, AI entrepreneurship and business innovation, AI ethics and society, the future of work and talent, these sorts of things. Uh, they also focus on things like peer training, uh, of people across, uh, Babson's faculty. Uh, they are fans of the podcast, actually.

And it's a cool sponsor, as I said last episode, because there's no product here for you to buy. They just want to kind of spread awareness about this new initiative. And this week, there's a fun thing we did want to shout out. Some students of theirs are leading this event, uh, a build-a-thon with Microsoft, where they'll be doing stuff on entrepreneurship and AI. So these sorts of cool initiatives, I guess, we'll be calling out and just making sure people know that this is a thing.

Alrighty, well that's it. Let's dive into the news, starting as usual with tools and apps, and the first story, as I previewed, will be search in ChatGPT. So OpenAI has had SearchGPT in beta as a sort of preview, and they have now launched it live as a feature in ChatGPT. So the way this would work is, uh, ChatGPT can itself decide to search for information across the web to answer a query, or you can manually tell it to search for stuff on the web when it replies to your chat.

So it's actually directly built into the normal ChatGPT experience. There is no separate thing like SearchGPT to use. And it's very much similar to things like Perplexity, to things like AI Overviews from Google, these other, uh, kind of offerings that let you input a query, the algorithms will find the relevant news stories from various sources on the web.

And then that will be fed into the LLM to be able to generate a response with awareness of these various, uh, sources. And as with these other things, of course, it'll be citing its sources, letting you know the links to go back to the originals and read about them. So, as with other things, this is pretty important. Uh, ChatGPT's knowledge was limited to the past, to like 2023, I believe. And so now it will be able to, uh, talk to you about, uh, kind of anything that's going on now, I suppose.
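For anyone curious how that kind of retrieve-then-generate flow is generally wired up, here is a minimal sketch in Python; the search backend and the model call are placeholder stubs, since OpenAI hasn't published the details of its actual search stack.

```python
# A minimal sketch of the retrieve-then-generate pattern described above.
# `web_search` and `llm_complete` are placeholder stubs, not OpenAI's internals.
from dataclasses import dataclass


@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str


def web_search(query: str, k: int = 3) -> list[SearchResult]:
    # Stand-in for a real search API (e.g. a Bing-style endpoint).
    return [SearchResult("Example page", "https://example.com", "Example snippet about " + query)][:k]


def llm_complete(prompt: str) -> str:
    # Stand-in for a real model call.
    return "Answer grounded in the cited sources. [1]"


def answer_with_search(question: str) -> str:
    results = web_search(question)
    # Number the retrieved snippets so the model can cite them as [1], [2], ...
    context = "\n".join(
        f"[{i + 1}] {r.title} ({r.url}): {r.snippet}" for i, r in enumerate(results)
    )
    prompt = (
        "Answer the question using only the sources below, citing them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm_complete(prompt)


print(answer_with_search("What did OpenAI launch this week?"))
```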


Jeremie

Yeah, there are a couple of interesting little tidbits to drop into this. So, you know, one is when asked what the technology behind this, the models behind this, were, there was a rep who said, well, it's a mix of technologies, including Microsoft's Bing. So that's sort of an interesting point of deeper integration between OpenAI and Microsoft. I mean, like, so, so Microsoft's Bing, right, is really... OpenAI is some version of GPT-4 that's probably been tweaked.

But one of the interesting things here is like, Yeah. It's unclear whether this crosses the blood brain barrier. Like is OpenAI actually using, um, like some, some proprietary Microsoft, uh, product? And if so, that kind of deepens the dependency of OpenAI on Microsoft, right?

If SearchGPT becomes a breakaway product, if it allows OpenAI to take a, you know, bite out of that sweet, sweet, you know, like multi-trillion dollar, uh, search revenue market, then, you know, that's a really interesting thing to, to bind OpenAI to Microsoft with, um, so there's that. There's also the fact, apparently, that the underlying search model is a fine-tune of GPT-4o, so at least one part of the stack here includes that.

Um, and they did an initial test rollout to users and it seems to have gone well. So now they're doing the wider release, but interesting that the stack they're choosing is GPT-4o. So we're going to probably get that multimodal functionality as well, kind of more and more deeply integrated over time. Um, but right now that, that seems to be where they're going, using existing technologies and, you know, maybe

a partnership with Microsoft to access Bing, uh, to kind of supplement all this, because Microsoft has all that experience. Um, questions, of course, as usual, right? Anytime we get into generative search: what about the price point? Right? What is the cost going to be to OpenAI to serve these models? Because generative search is so much more expensive. You're actually generating text, and that's, you know, that's going to have to be figured out.

They say currently there's no plan to advertise, uh, in ChatGPT, cause this will all be through the ChatGPT interface. So obviously, like, no, it's not like you're going to search.chatgpt.com or whatever. You're just doing ChatGPT. And then within that, your searches will be routed appropriately. Um, but they are flagging there are going to be some limits on how often people can use the free version of this search tool.

So, you know, again, there's a bit of a question as to, like, whether this is Google or whether it's Perplexity, right? Are you going to, are you going to pay for, um, a certain level of search? Like how does that all shake out, and then how are ads going to be displayed, and all that stuff. Um, anyway, kind of interesting as well, as they point out in the article, that this is dropping, uh, just a couple of days before today, which is the presidential election in the United States. Um, you know.

Yeah. The need for accuracy, the interesting marketing choice, really, because if things go wrong with generative search, say today, and there's a big story, like, that's, that's a pretty risky time to be doing this when presumably they could have just waited a week and launched it then. But, uh, you know, I guess, uh, OpenAI just choosing to ship as they so often do. Um, and, uh, and there you have it,

Andrey

right. And, uh, you know, maybe they decided to just do it last week so that it's not eclipsed by other news in the U S, uh,

Jeremie

part of it. Yeah. That doesn't sound right. I mean, OpenAI would never make these kinds of decisions on the basis of marketing.

Andrey

Yeah. It's not like we are now a business that cares about PR. Uh, and actually on that note, it's funny because I did try it out a little bit. I compared Perplexity and ChatGPT search on informing myself about the local candidates and elections that I'll need to be voting for in Palo Alto. So I just, you know, there's a lot of, like, candidates for city council. There's like 12 of them. It's kind of hard to get an overview.

So it's actually both of these things I found quite useful for that. I would say Perplexity is right now better for these sorts of, let's say, more intricate research tasks that require summarizing a lot of information. ChatGPT was decent, but didn't present the information quite as well. At the same time, if you're already using ChatGPT and paying for it, I could definitely see this being a reason for you to not pay for Perplexity, which could be a real problem for Perplexity.

I guess we'll see. Yep. Search is heating up. Next up, Apple Intelligence: their features are now rolling out in beta. So this is a developer beta version of iOS 18.2.

And there are these Apple Intelligence features that are set to be publicly available next week, I guess probably this week, as we've already seen in previews. The features have things like integrated writing tools, image cleanup, article summaries, and a redesigned Siri experience with typing input, although this is not the smarter Siri with LLM integration. Uh, and then the developer beta users

can access extra features like Genmoji, Image Playground, Visual Intelligence, Image Wand, and ChatGPT integration that will roll out to users a bit later. So we've already covered this, I suppose, at the announcement time, but Apple's approach to AI is to not have, like, sort of one big thing in their iOS ecosystem, instead having AI kind of across their OS in various ways, kind of as features really, rather than one big AI. And now you're seeing some of those features come out.

Jeremie

Yeah. And it's, it's also that they seem to be going for much more of that platform play, right? Like it's, it's very Apple to do this, be the integrator, uh, lean on the hardware as the thing that's driving value. And there's a bit of a, in that sense, a Microsoft question, I guess, you know, that there's this, um, this notion that you want to commoditize the complement to your technology, right? So famously, you know, Microsoft, um, they make operating systems and now software for, for PCs.

The complement to that, which is a bit counterintuitive, is the hardware that actually runs it. So, you make dirt cheap hardware, and where you really get your margins is on the software. Um, and, uh, Apple, you know, in a certain sense here, may be, may be going for the opposite play, as they have been, with higher-end hardware. Uh, it kind of makes them a more natural integration point for a wide range of different technologies, not just leading with, hey, it's just Apple products.

Uh, so that's why they have a lot of partnerships with the likes of open AI. And, uh, anyway, so there's interesting kind of further push in that direction strategically for Apple.

Andrey

On to the lightning round, and speaking of that relationship between OpenAI and Microsoft, the story we have is that GitHub Copilot will now support models from Anthropic, Google, and OpenAI. So as a developer, GitHub Copilot is for coding. It gives you suggestions, AI suggestions, as you write code. Now, as a developer, you can choose whatever model you want from things like Claude 3.5, Gemini 1.5 Pro, and OpenAI's GPT-4o, o1-preview, and o1-mini.

Uh, this is interesting because GitHub Copilot so far has had just one model. There was no ability to choose. And that model seemingly was built on top of OpenAI technology, trained with data from across GitHub. So this kind of allows Microsoft to be pals with more companies, I guess.

Jeremie

Yeah, I think this is more evidence for, I'm biased, but I'm beating this drum for, the idea that, um, your, your LLM class is increasingly becoming commoditized, right? So you're, you're starting to see the, the leverage shift from the frontier companies to the aggregators. So, you know, like you've got OpenAI, Anthropic, you've got, you know, Google products, all these competing on one platform.

And you can almost feel in that moment, it's, it's becoming cheaper and cheaper to just switch between them. And what does that do? Well, it drives your, your margins, if you're those companies, down to zero, right? That competition just erodes your margins and then puts more pressure on you to release the next generation of product to desperately try to claw your way ahead of the competition.

Arguably, you know, that's been one of the big things that's driven even Anthropic now to release more and more powerful systems, in spite of, uh, earlier commitments, uh, or semi-commitments, ambiguous commitments, certainly been interpreted that way, to sort of lag, or, or come close to lagging, uh, the true frontier of the space.

So, uh, you know, this is a big amount of pressure to put on even OpenAI as well, as they roll out their o1-mini, which is now available in the suite of products alongside Claude 3.5 Sonnet, which just came out, and Gemini 1.5 Pro from Google. So, you know, this is all kind of increasingly... It's great for the end user, at least for now. When it comes to competitive dynamics around this stuff, again, margins get crushed.

And if you are a small company trying to build your, your bespoke LLM or whatever, I think this is a bit of a warning. You know, I've been beating this drum for a while, but I think companies like Cohere are really going to struggle. I don't think there's room for mesoscopic players that can't benefit from economies of scale and just having massive cloud providers supporting them. I think we're now just seeing that thesis play out. Um, one other note from this too.

So, so Anthropic had a whole post about the inclusion of Sonnet, Claude 3.5 Sonnet, in this whole suite. Uh, we've covered that before, you know, really, really great model. Uh, it is SOTA on, uh, SWE-bench Verified, basically this OpenAI-tweaked version of the SWE-bench benchmark, um, which is basically a coding, software engineering benchmark. Um, so Anthropic, or sorry, Claude 3.5 Sonnet, does amazing there.

The big question for Anthropic has not been, has never been, um, can they build competitive models? At least for the last year or so, it's been very clear that they can. The question for them is distribution. You can build the best model in the world, they can build the best product in the world, but if you just get out-distributed by a competitor, that can just crush you as well. You know, everybody's heard of OpenAI.

Very few people have heard of Anthropic, at least if you look at the average person. Actually, the average person probably hasn't heard of OpenAI much either, but you know what I mean. Um, and so this is sort of like what happened with, you know, Slack and Microsoft Teams. You know, Slack had crazy growth, they were doing amazing, and then Microsoft Teams launched an, initially, initially a fairly mediocre product, but it's Microsoft. They're in all the operating systems.

So they just had massive distribution. Um, in a sense, this is Anthropic piggybacking off of GitHub's amazing distribution, and this kind of levels the playing field a little bit between them and OpenAI, at least when it comes to software development. So that's a really interesting, uh, development. It's also interesting because GitHub, of course, is owned by Microsoft.

Um, so normally you would expect that to be a sort of OpenAI-friendly, uh, partner, but here they are platforming Anthropic and Google alongside OpenAI and creating this sort of very direct head-on competition, and giving Anthropic distribution that they, they really, really need. And that, um, that does level the playing field in a very substantial, substantial way.

Andrey

And speaking of coding, the next story is related to that, and it's about Anthropic as well. So Claude, the chatbot from Anthropic, now has a new feature, an analysis tool, that allows it to write and run JavaScript code. And this is important because now if you, for instance, upload a CSV, a file with data in it, uh, Claude can write some code, run it, and then output the results of that logic processing.

So that's something that's been around in ChatGPT, I believe, for a while and is now being added to Claude. It's a very useful feature because it, uh, makes up for a weakness of LLMs. LLMs are not able to, let's say, data crunch, to run algorithms, but when you let them write code, which they're really very good at, that weakness of LLMs in some ways goes away.

Jeremie

Yeah, this is a, it's, you can think of it as a way of grounding the, the model, right? You, you have, at every juncture in the reasoning process of an LLM, you have some probability of failure. And when you have a reasoning juncture that forces the model to write and then execute code, the execution of that code is not subject to hallucinations. So it's a, an injection of ground truth into the process that can help you get more reliable outputs.

And, you know, they sort of flag this, uh, they don't explicitly tell us quite how this is working in the backend, but they say, instead of relying on abstract analysis alone, it can systematically process your data, cleaning, exploring, analyzing it step by step until it reaches the correct result. This starts to sound a lot like, well, you know, inference time compute, obviously that's what's going on here in some form.

We don't know exactly what it is, what kind of prompting strategy is it just an agentic scaffold. Is it a model, you know, that, that's, well, what's the, what's the training routine in the background? We don't know. Um, but there's obviously, you know, more and more leaning into inference time compute. And in this case, with a view to solving very specific data analytics problems. So interesting that they box that out as a category of, of a feature that they wanted to launch.

It makes all the sense in the world because of the, you know, the wide applicability of data analysis. But, um, yeah, though we'll, uh, we'll see if it gets uptake. They give a bunch of use cases as well in the post that you can check out if you're curious.
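To make the mechanism a bit more concrete, here is a rough sketch of that generic write-code-then-execute loop. Anthropic hasn't published how the analysis tool is actually implemented, so everything here is an illustrative assumption: `llm` is a placeholder for any chat-model call, and the "sandbox" is just a Node.js subprocess.

```python
# Rough sketch of the generic "model writes code, we run it, the result goes
# back to the model" loop. This is NOT Anthropic's implementation (which isn't
# public); `llm` is a placeholder callable and the sandbox below is minimal.
import subprocess
import tempfile


def run_sandboxed_js(js_code: str, timeout: int = 10) -> str:
    """Execute JavaScript in a subprocess (assumes `node` is installed).
    A production system would use a much stronger sandbox than this."""
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
        f.write(js_code)
        path = f.name
    proc = subprocess.run(["node", path], capture_output=True, text=True, timeout=timeout)
    return proc.stdout if proc.returncode == 0 else proc.stderr


def analyze_csv(question: str, csv_text: str, llm) -> str:
    # 1. Ask the model for code instead of a free-form (hallucinatable) answer.
    code = llm(
        f"Write JavaScript that answers: {question}\n"
        f"The CSV data is:\n{csv_text}\n"
        "Print only the final result to stdout."
    )
    # 2. Execute the code; this step just computes, it cannot hallucinate.
    result = run_sandboxed_js(code)
    # 3. Let the model explain the grounded result in plain language.
    return llm(f"The code printed: {result!r}. Explain the answer to: {question}")
```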

Andrey

Yeah. Speaking of distribution and getting awareness, they do, like, explicitly call out, hey, marketers, sales teams, product managers, this is useful for you. And they have this little video that shows not just doing some processing, actually showcasing, uh, a whole chart based on some data. So certainly useful if you work with data. And now moving away from chatbots, we have a story related to ElevenLabs, uh, which does generation of AI speech audio.

And they have a new feature called Voice Design that allows you to generate a unique voice from a text prompt alone. So ElevenLabs is the leading company providing text-to-speech functionality; you can enter some text and get audio that is very, very realistic sounding. So far that was limited to a set of voices they provide, or you could train your own by giving it a bunch of data. Now you can create a voice just from describing it.

So you can say, you know, like a serious newscaster that is a real professional or something, and that would generate a voice. So another way to make it easier to make this do what you want.

Jeremie

Yeah, I mean, I'd love to love to see that training data set, but yeah, it's also, it's one of these things where when we look at multimodal products like this, I often wonder. How the, the future of prompting is going to look right, because explaining verbally what kind of voice you want to get out of a system, like, you know, text data is, it's not designed to, it's, it's not very, uh, readily able to kind of give you or get across what you mean by a voice, right?

So we do, we do impressions really to do that, right? Like how would you describe, you know, Kamala Harris's voice, Donald Trump's voice? It's, it's really difficult in words. And, and I, so I kind of wonder if, if there's a, Uh, more of an iterative cycle that you'd want to get into with these things down the line, where, you know, generate an initial one and then be able to take that as a template and give a subsequent kind of feedback or, or do an impression or something like that.

Get, get the input to be multimodal. I imagine just like with images, right? Where you often want to upload a, a starting-point image, um, and then start to modify it. Like I could see that working with ElevenLabs' products, but, but anyway, it's an interesting new form factor for, uh, for prompting for generative voice models.

Andrey

Yeah. And, uh, that's a good point. I think the idea is more so to describe who is speaking rather than how they speak. And then how they speak is implied. They have an example of, like, evil ogre or something. Uh, I will mention, I did try to see if this can be abused. I copied the description of Barack Obama, like the first paragraph from Wikipedia, without including his name, and ElevenLabs did catch me and didn't produce a voice of Obama. So I guess that's

Speaker 5

good. Now I want you to follow the Last Week in AI podcast. It's a, we've got a lot of good folks, uh, Andrey, what a, what a beautiful smile.

Andrey

And moving on, we have Midjourney's new web editor that lets you tweak images uploaded from your PC. So Midjourney so far has been a pretty pure image generator. Uh, there's been other competitors that allow you to upload images and edit them. Uh, and now you can do that with Midjourney as well. So on this editor, you can edit in ways that allow for resizing, erasing, things like that, and retexturing, which lets you modify image contents with prompts. Uh, this is still in beta.

It's limited to users who have generated at least 10,000 images and have annual memberships and have been monthly subscribers for the past 12 months. So I guess there's some very, very, uh, uh, loyal Midjourney customers there.

Jeremie

Yeah. And I actually, like, that's the first time I've ever seen, uh, a closed beta kind of demarcated in quite that way. It's, it's almost like a loyalty test. Like, hey, have you, you know, actually been using it like an obsessive? I just, I think it's brilliant. I mean, back when we were at, um, Y Combinator, this is one of the things they tell you to do, right? Find your, like, um, it's more important to make 100 people love, love, love your product than 10,000 people like it a decent amount.

And when you find those people, you figure out why they love your product and you double click on it. So I think that may be part of the thinking here. But so, so interesting. Um, I'm sure there have been examples of this that have happened before. I haven't seen it in AI yet, but yeah, and then they share this kind of interesting screenshot. You know, the famous picture of the, uh, the, uh, zebra crossing with the Beatles kind of walking across it.

Well, they, they, you know, play with that and basically give it a new background and all that pretty, pretty effective. So, uh, just goes to show, I guess they've got, um, they've got a pretty, uh, interesting new feature here and an interesting new way to launch it.

Andrey

And for the last story, actually speaking of Midjourney, the headline of this article is Watch out Midjourney, Recraft just announced a new AI image generator model. So yes, Recraft is a company that focuses on things like design, and they have this model Recraft V3, which is pretty good. It's been out, uh, for a while. It had a high ranking on Hugging Face's text-to-image model leaderboard, and it was actually kind of secretive; people were wondering what it was.

And now we know it is this model, Recraft V3, which is impressively good at, uh, uh, generation. Of course, if you take a look, you know, it's kind of hard to distinguish, at least for me at this point, between the different image generators. Yeah, that's, that's

Jeremie

really, it's been a problem kind of on the quantitative side of image generation for a while, right? How do you actually measure the quality of a generated image? It's, uh, text went through that phase as well, where, like, BLEU scores are just, like, not enough anymore to, to get us past that point. Once your model gets better than a certain kind of human, uh, or a certain quantitative threshold anyway, it starts to be really difficult to measure.

So, I've personally found this, like I'm, I'm still, and this is a function, I guess, of the fact that I'm not a graphic designer, I don't actually use these models to do specific graphic design stuff or animation stuff. And I'm sure if I were in that space, I'd be like, oh yeah, you know, there's one model, it's a lot better than the others. I think, you know, it, it just looks, to the, the, the untrained eye, like a dead heat across the board.

I think we're going to start to see that more and more in the domain-specific applications, you know, six months, a year from now. Uh, I'm always perennially interested in, and confused by, the massive VC investments in this space when it seems to be rushing towards commoditization so fast, but who knows? We'll, we'll see if, uh, as Peter Thiel would say, if competition is for losers or, uh, or if somebody actually ends up winning.

Andrey

Yeah, exactly. And, uh, to that point, Recraft's founder emphasizes the model's design-centric approach, which aims to give designers control over their output. So I think the differentiator there will be, like, that last 1 percent, that last 5 percent of quality that is important to people who, you know, use this professionally. And that's kind of interesting. We are beyond the point of just quality. You now need to really get into the nitty gritty.

Jeremie

And actually, um, I think it's such a great flag, too, for the place we are in AI across the board, I think, in industry. And I think for a lot of people, this is the source of a lot of confusion over, is the space overhyped? Is it not, right? We have these impressive demos. You get, say, the ChatGPT moment, people realize, oh, there's all this low hanging fruit. And there is, but often for the most valuable applications, it's that last mile, right? That, like, last 1 percent.

It's no good to make a, um, you know, like a, an app-building model that makes a mistake 1 percent of the time. Um, or sorry, maybe not, that's a bad example. But anyway, a model that serves a critical user experience, because 1 percent could be way, way too high. And, you know, you'd be surprised, an awful lot of the time that's an issue, and for agentic models that have to string together a coherent set of actions, a

1 percent error rate, like, will nuke you. If your thing involves, you know, 20 steps, then, you know, whatever the number is, like 30 percent of the time, that means you'll be screwing up somewhere along the way.
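For what it's worth, the back-of-the-envelope math on that, assuming errors are independent across steps: a 1 percent per-step error rate over 20 steps gives roughly an 18 percent chance of at least one failure; the roughly 30 percent figure corresponds to about 35 steps at 1 percent, or 20 steps at a 2 percent per-step rate. A quick sketch:

```python
# Compounding of a per-step error rate over a multi-step agentic task,
# assuming errors are independent across steps.
def failure_rate(per_step_error: float, steps: int) -> float:
    return 1 - (1 - per_step_error) ** steps


print(failure_rate(0.01, 20))  # ~0.18 -> roughly 1 in 5 runs hits at least one error
print(failure_rate(0.01, 35))  # ~0.30 -> the ~30% figure needs ~35 steps at 1%
print(failure_rate(0.02, 20))  # ~0.33 -> or 20 steps at a 2% per-step error rate
```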

Um, and so, yeah, it actually, uh, it doesn't matter, but the point is, you know, I think what's going to be surprising is you've got this initial thing where people think it's overhyped because it can't solve the thing, but then suddenly you just get to the point where, as you scale, uh, essentially you're driving down that error rate, and then you cross a magic threshold. All of a sudden, things just become possible.

So I think it's possible for these things to be both, in a way, overhyped and underhyped at the same time. Just a question of the, the timeline you, you look at it on, and I think, uh, Anyway, that's the, the big question with these, these image models, and more and more, I think, agentic systems, we're going to see that.

Andrey

Moving on to applications and business, and we begin with a trend we've been covering for a while this year. Big companies striking deals with media publishers, and this time it's Meta striking a multi year deal with Reuters. Reuters, in case you don't know, is a very big distributor of news. A lot of kind of breaking news goes out on Reuters, and this would allow Meta's chatbots to have access to real time news and information from Reuters when users ask questions about news or current events.

Unclear if this also provides access to training for Meta, but regardless, certainly seems reminiscent of the partnerships that OpenAI has made with many, many, uh, media companies like Reuters.

Jeremie

Yeah, it's, it's all part of a broader story, too, when, when it comes to Meta, of trying to avoid, or historically they've tried to avoid, having hard news on their platforms. Um, you know, taking some steps to avoid the sort of current events stuff, more and more leaving that maybe to platforms like X. Um, and in fact, in this case, we've had their executives come out and say that they're not going to, quote, do anything to encourage hard news and political content.

Um, but, uh, but that of course is kind of what's going to be nudged towards here when you start to explicitly add news into the platform. Um, and, uh, anyway, so there, there are all kinds of questions floating around, you know, what are they going to do for, for content moderation, for, for generation of these, these, uh, responses that deal with news and current events. Uh, no, no comments on that yet.

Um, kind of understandable, given that, frankly, that's probably just an unsolvable problem given today's technology. Um, but we also don't know what the terms of this deal are. So we know there's going to be some kind of compensation to Reuters, but, you know, whether this is an annual licensing model or a revenue share thing, and how much money, we, we just don't know. Um, so kind of interesting to see that, that next step. Um, they also flag, there's this, uh, interesting,

uh, note at the bottom of one of these articles that talks about, um, how, though Meta now appears to be willing to pay for news content, it's also simultaneously fighting laws that would require compensating news publishers for their content on social media. It's this interesting kind of, um, dichotomy, right? They want to find ways to license with individual outlets, but they don't want to just blanket, you know, do revenue sharing for clicks and things like that.

Um, there's been a very high profile situation up in Canada, actually, where you literally cannot access news on Facebook and Instagram, um, because the government's tried to get Meta to pony up money that, basically, uh, can then recirculate around traditional media. And that's a very controversial proposal as well. Um, I mean, if you ask me, it's, it's not the, not the best idea, but, um, it's led to Meta just, like, outright withdrawing from the Canadian ecosystem altogether.

So kind of interesting, similar things happening in California too. So, um, this seems like the OpenAI play, right? People are just saying, hey, it looks like we can get a safe harbor by licensing with, um, uh, with these news outlets, at the very least kind of insulating ourselves from charges that we're just rampantly, like, scraping news with, with no regard for people's, uh, uh, kind of copyrights.

Um, so yeah, it's a bit of a CYA maneuver, but, uh, yeah, we'll see if it sticks and if other companies follow.

Andrey

Moving right along, another story related to a trend that we've been talking about all year: it's OpenAI and their search for hardware. And the story is OpenAI will start using AMD chips and could make its own AI hardware in 2026. So, OpenAI is collaborating with Broadcom on these custom silicon chips, and they are integrating AMD chips into their Microsoft Azure infrastructure. And these are things like AMD's MI300 chips.

Uh, I suppose previously, of course, NVIDIA was a big company that everyone went to for their AI compute, but now AMD is starting to be a real competitor in the space, seemingly.

Jeremie

Yeah, it's, um, the, the reason behind this is, partly at least, OpenAI wants just more, more flow. They want more, um, sure sources of that hardware. So, you know, they're not going to say no to more NVIDIA hardware, but if there's another source at AMD that's offering something great, they'll take it. Also, diversifying your suppliers beyond, uh, NVIDIA gives you more leverage on the sort of price negotiation side or, or sort of the, yeah, the, anyway, the acquisition side.

Um, it's, it's also part of OpenAI just shamelessly recruiting, poaching the top hardware engineers they can get at Google. So, they've assembled apparently a chip team of about 20 people, and, um, these, these are the people who focused on building the TPUs at Google, the, the flagship sort of specialized ASIC that Google builds, the tensor processing unit. Um, and, uh, and so, you know, they're obviously interested in fairly custom hardware, which is its own really interesting story.

You know, as you see more and more, the, the research at the frontier labs, like OpenAI, like Anthropic, like Google, start to go dark, right? It's no longer open source stuff. Everybody's hiding their trade secrets. Um, from everybody except the Chinese, that is. Um, they, uh, they're, they're starting to build these out, and you're starting to see more and more differences between the approaches taken by these labs. And I think that's really interesting.

We're going to start to see decorrelation in the, in the kind of, um, AI model architectures. And one consequence of that is you're going to start to see decorrelation at the hardware level, and OpenAI, kind of characteristically aggressive in terms of pushing forward on, on all fronts, including hardware, um, basically making a big bet that, like, hey, let's, let's double down on building our own custom hardware.

They have to partner with Broadcom cause they don't have all the capacity, um, to design in-house. Uh, and, uh, it's not the first time we've heard about that OpenAI-Broadcom, uh, partnership, by the way, that's, uh, something we covered previously, but this does move Broadcom

further into the camp of competing with NVIDIA on the design of these kinds of advanced ASICs, which is something that Broadcom has done historically, but now these are really AI ASICs meant to sort of compete with the sort of TPU model or, you know, the GPU models that, um, that NVIDIA pumps out. So that's really interesting. Um, and, uh, yeah, we'll, we'll see.

It's, it's also in some ways, uh, um, not the most surprising thing, because we've seen a whole bunch of news about OpenAI poaching hardware engineers from Google in particular, and fascinatingly, Broadcom actually had previously been Google's TPU partner. It's something that hasn't really been caught on to in a lot of these reports, but it is the case that initially, when Google was first designing the TPU, they did tap Broadcom for that effort. So this is actually OpenAI mirroring

Google's initial strategy. Some people have said, oh, well, you know, they're partnering with Broadcom, they don't have the capacity, like, you know, so, so this says something negative about OpenAI. But no, this is just how you get that effort off the ground. Um, so it would not at all be surprising if you would, um, you know, see that happening. Actually, there's a great comment on Hacker News about this.

You've got, um, all these, these employees, former Google employees, moving over to OpenAI and basically bringing that vendor relationship with Broadcom with them as well. So there's a strong bias now to just replicate what worked so well at Google with TPUs over on the OpenAI side.

Andrey

Right, exactly. So, the partnership with Broadcom is very directly implying that they want to build something like the TPU, the tensor processing unit that Google has had since, I believe, 2016 was version 1, and they've iterated on it a lot since then. Broadcom actually has generated, or has gotten, billions in revenue from that partnership with Google. So it makes sense they want to do more of that.

And according to this news, OpenAI has been working for months with Broadcom to build their first AI chip focusing on inference. And another dimension there is, having custom hardware for inference lets you be faster and lets you be cheaper. And we have often talked about,

uh, we know that OpenAI has billions in revenue, but it's still unclear whether that can convert to profit, whether you can actually be profitable while still competing in this price war that is ongoing with Anthropic and other providers. So I do think that this would be one of the ways that these players can stay ahead of the pack: by having hardware that lets them have good margins, which is otherwise very difficult with just GPUs.

Jeremie

Well, and specifically with inference time compute, right? Cause that's, that's the entire paradigm. If you look at the direction the field is heading in, that that's becoming dominant. We're seeing this shift from training time, compute to inference time, compute.

And like, you know, not, not to toot our own horn, but it's been about two years that we've been talking about this, that this shift was coming, um, as soon as the earliest, earliest hints of scaling laws for inference-time compute, uh, and, and sort of, like, uh, exchange rates between training-time compute and inference-time compute, you could trade off one for the other in certain contexts. Um, and I think that's, that's where things are going. So OpenAI is saying, hey, you know what?

With models like o1, uh, which are explicitly our big AGI play, uh, we're going to find ourselves using more and more inference-time compute. These things are going to spend more and more time, like, pondering, thinking over problems as they go. And why optimize hardware to be able to do backpropagation, in other words, to be able to do, like, model training, when, uh, when you're only going to use a relatively small fraction of your hardware dollars to do training.

You know, if you have a more specialized problem like inference, you can make a more specialized piece of AI hardware and have, as you say, better margins. And obviously companies like, like Groq have, um, have been doing big plays in this direction, where it's like, okay, you know, we're just going to carve out the inference piece and just crush that. And, uh, and it's at least seemed to pay off well for them so far.

So, uh, it'll be cool to see OpenAI doubling down on inference even more.

Andrey

Right. And inference also is the main thing that people pay for, then, when you use their API, et cetera. So the pricing on inference is the differentiator for these large providers. Like right now, you know, there's some difference, but between GPT-4o and Sonnet 3.5 you can go back and forth pretty easily, and it comes down in a lot of cases to price. So it's very important to be able to, uh, I guess, uh, be able to keep up with price reductions that keep happening.

And one quick last note I found interesting. Uh, this is a big deal. Uh, Broadcom's stock jumped 4.5 percent on the news and AMD's went up by 3.7 percent. So that tells you something about, uh, I guess, how investors are viewing this. On to the lightning round, and we begin with a competitor to OpenAI, xAI, and the news is that they are looking for funding. They want to be raising some more billions. They've raised 6 billion so far in Series B funding, reaching a valuation of 24 billion.

Now they are seeking more money at a valuation of 40 billion. And it seems to be the case that they want to raise this money to be able to buy more GPUs, to increase their data center from 100,000 to 200,000 GPUs, or at least that's what I've read about this. So I guess it's unsurprising in a way that xAI needs a lot of money to, uh, try and catch up, really, with OpenAI and Anthropic; they seem to be essentially targeting the same customers with the same type of product.

So, uh, yeah, xAI certainly making a big effort on that front.

Jeremie

Yeah, there, there's so much, um, it's such an interesting play, because, you know, you might have written off, um, xAI before all this, right? Where there was a time when it was like Google, it was Microsoft slash OpenAI, and it was Anthropic, and they seemed to be the only ones, maybe Meta, with the resources to pull this off.

xAI has seemingly come out of nowhere, and we were warned about this when we were doing our investigation last year, um, just before xAI had launched, and they're like, hey, watch, watch these guys, because the, the acquisitions they're making on the, the GPU front, the hardware front, are pretty monstrous. The, the differentiator, right, in this space, if scaling is true, which basically all the frontier labs seem to think is the case, that you can scale your way to AGI or something like that.

Then you gotta be number one. There's, there's no prize for number two. And so the fundraising pitches have got to sound like, we're going to do it first, that there is no other option, right? There is no fundraise that sounds like, we're going to make, you know, uh, self-improving AGI and we'll do it second; by then it's irrelevant. Someone else has already done it. And, and, you know, you've got your runaway effects and all that.

If, if all this bears out, um, in that context, Elon has done a spectacular job of taking xAI from a horse nobody's heard about to a very competitive horse at the head of this race, not only, uh, hitting the, as he, as he advertised it, you know, the largest supercomputer, or sorry, H100 cluster, uh, training cluster that exists, um, which was probably true at the time.

Um, but also, uh, uh, they're doubling the size now in an unprecedentedly short period of time, launching this cluster in 19 days from the first H100 GPU rack rolling onto the floor in the Gigafactory. That is insane.

Like the, the, um, there's a really great interview with, uh, Jensen Huang, um, who's the CEO of NVIDIA, of course, who was talking about just how long this stuff usually takes, usually a matter of months or years; 122 days in the case of xAI, uh, to operationalize this, this whole system. So really wild. Elon intimately involved, as he does so well, in the design of the hardware stack and in the implementation of the build. You know, this is really quite remarkable.

They've also made some weird choices around how they wire all this up, how they, how they power it as well. Um, but, uh, yeah, they, so they, they don't use, for example, this, um, very well established, uh, basically interconnect called InfiniBand that NVIDIA makes; instead they're using basically, uh, an Ethernet fabric that pretty recently came out. Um, and it's very competitive, but it's all this sort of, like,

you can just see the fingerprints of Elon getting all in on this and having a very eclectic, very unusual design to the, uh, to the cluster. So, um, yeah, super interesting, and that's going to be the play for fundraising: we can do it first because we're doing it differently.

Andrey

They certainly are running a lot of hardware, so it's very impressive that they've been able to basically catch up. Their models are not quite as good, but they're essentially, you know, you can use them for a lot of stuff that you would use ChatGPT or Claude for, uh, and certainly Elon Musk, if nothing else, is good at running hardware companies.

Jeremie

Yeah. And, and acquiring, right? Like one of the big things that he does, maybe, maybe more than anything design-wise on the, on the cluster, is just, he's able to, like, move, like, muscle around his big-famous-guy-ness and go to, like, Jensen Huang at NVIDIA and go, like, hey, dude, I know you've got a bunch of people just begging you for GPUs, but I'm cool and famous. I'm going to ask you for them and I'm going to brag about how productive our partnership has been.

And you've seen that really be an interesting boon for NVIDIA too, right? Like basically now you've got, you know, Elon Musk, arguably the world's most successful entrepreneur, coming out and saying, hey, you know, like, this is the hardware stack that we use, they're amazing, working with NVIDIA has been great. And then NVIDIA has returned the favor.

So, you know, there, there's like non trivial marketing benefit too, to both parties in, in, in doing this, but that nonetheless is value that Elon's bringing to the table and spinning this stuff up fast. So I think he, you know, he, he knows exactly the value he's bringing to the table and, uh, and he's, he's using it well, right? I mean, this is, this is what he's supposed to do.

Andrey

Yeah, also infamously, uh, borrowed some GPUs from Tesla, or redirected some GPUs from Tesla. So as always, it's interesting to see, like, the Elon Musk mega corporation and how it's working. It's, it's Elon Corp. Yes. Next, a startup that is raising some money. It's Physical Intelligence, a robot AI specialist, and they are raising millions from various sources, including Jeff Bezos.

So they have raised 400 million in a new financing round, that's following up from their initial 70 million seed funding just earlier this year, not too long ago. So now they're being valued at approximately 2 billion. And this is coinciding with some news of progress they've made that we will cover, uh, in just a little while. So Physical Intelligence, you know, no product yet. So they are promising to build sort of a general-purpose robot brain that allows robots to be very capable.

And certainly it seems investors are very bullish on that promise.

Jeremie

Yeah, and then they're being careful as well. You can see them hedging on the sort of hype of the initial product, understandably. They're saying, look, this is more of a sort of, like, GPT-1, which, you know, if you're old enough in the space to remember, before there was GPT-4, there was GPT-3, and blah, blah, blah. Anyway. So GPT-1 was, I think, uh, I think GPT-2 was 2018, wasn't it? It was

Andrey

2019. If I remember correctly. Oh really? Okay. Okay. 2017 2017 was the Transformers paper actually. Yeah, that's right.

Jeremie

Yeah, yeah, yeah. So, okay. Well, anyway, it was one of the, one of the late teens or the twenties, and, uh, you know, that nobody can really remember because it was pre-pandemic, but, uh, they're, so they're saying, hey, this is more like a GPT-1. Uh, don't think about it as a ChatGPT. You know, it's, it's like a proof of concept. So, um, but they are saying, you know, a ChatGPT-style breakthrough could come far sooner than we expect, or it could definitely be far out.

So, uh, you know, a lot of uncertainty, and yeah, that's what you kind of expect in this space too, right? The ChatGPT moment is a thing in AI precisely because we don't know at what level of AI capability, all of a sudden, you crack that critical threshold where some critical mass of tasks become possible to make the thing viable from a commercial standpoint. So you just keep improving, keep improving, and hope that you kind of flip that switch.

And it is pretty binary, at least that's how, that's how products do tend to be. It's just that normally it's humans iterating on it rather than just, you know, pouring compute into a thing until something cracks.

Andrey

And as I said, more of that soon. A couple more stories, this time about Waymo, and Waymo has also raised some money, in particular 5.6 billion from several investors, notably, of course, Alphabet, Waymo's parent company, also Google's parent company. Probably a lot of this money is kind of just coming from Google to Waymo.

Uh, essentially from the money printer that is Google ads to this new initiative, but also with participation from Andreessen Horowitz, Fidelity, and Silver Lake, and this is coming at a time when they are trying to expand to new cities like Austin and Atlanta and LA. And, uh, combo story also on Waymo, I figured we'd cover both. So the other story this week was that Waymo has been serving over a hundred and fifty thousand paid robotaxi rides every week, uh, lately.

So that is up fifty percent as of just this week. Uh, and that is notable pretty much just because the important thing to see is, how can Waymo scale? Can it actually expand its operations rapidly to be able to be competitive and, you know, basically reach the market before Tesla does? Uh, and yeah, these are good signs on that front.

Jeremie

Yeah, and no surprise, they're looking to, as a next step, increase geographic coverage, right? So that's, that's how you do it. Um, I imagine also, like, the, the, um, the challenge of entering a new city is going to be interesting for them because you want to get people used to seeing other people in Waymo's, you know, in these driverless cars. It's a little bit, you know, It's a little weird. Um, so yeah, anyway, the, I'd be, I'd be interested to see the playbook for launching in a new city.

You know, all these companies will have that. Um, but, uh, yeah, Waymo's will be especially interesting.

Andrey

Yeah. I will say nowadays in SF, if you're in a, you know, crowded area, you're very likely to see Waymo in like a two minute period. They're kind of all over the place now.

Jeremie

Hey dude, like I remember, um, having that, that feeling of, wow, when you're in San Francisco, when you're in Mountain View, that kind of area. You really do see the future, like it was, I want to say like 2017, 2018. I remember walking around and just seeing people with AirPods. And I know this sounds stupid, right? Cause now everybody wears AirPods, but back then it was not the case. Everybody had these, you know, dangling, uh, little, you know, wire things.

And, um, and I remember thinking, like, man, you know, this is either the weirdest place in the world, or this is what every place in the world is going to look like very soon. And generally that happened with Bird scooters as well, right? Now we all remember that. Um, so, uh, anyway, yeah, I think, uh, it's probably, uh, a harbinger of what's to come

Andrey

pretty sure the, uh, billboards with ads for AI companies will not be coming to Ottawa

Jeremie

that's a fair point

Andrey

On to the next section, projects and open source. We have just a couple fun stories here, starting with Meta AI, which has quietly released a new thing called NotebookLlama, which is an open version of Google's NotebookLM. A nice little, you know, uh, contrast there: NotebookLlama, NotebookLM. As we've covered, NotebookLM is a pretty popular thing from Google.

It allows you to upload files like PDFs and so on, and then essentially have a conversation about those files, get summaries, things like that. Notably recently, it has become popular because you can also generate a sort of podcast episode, with an audio discussion of whatever is in those documents.

Jeremie

Andrey, it could never

Andrey

do what we do, right? It could never. I mean, certainly we would do a better job, of course.

Jeremie

You know, so don't, you know, don't even, don't, if you're thinking about checking it, just don't even, like, it's not going to, it's going to be disappointing in a way. Just, we'll tell you about it. Don't check it out. It can't possibly be as charming and humorous as us.

Andrey

I think if we feed it the amount of notes we have, uh, two hours,

Speaker

I don't

Andrey

think it's, uh, anyways, NotebookLlama is similar in a way where, if you provide some text, some documents to it, you can chat about those documents, get summaries, things like that.

Jeremie

Yeah. I think we're moving in that direction where eventually you, you get these really reliable, uh, audio generation tools like the original, um, NotebookLM. Like, I mean, I think that, that is quite game changing. Um, I, I actually use Google's, the, the NotebookLM, the original, uh, and damn, like it, it, it's pretty good. Like, yeah, the 10-minute podcast, it's, it's at the, the sort of right level of abstraction and complexity for, uh, conveying, you know, more or less what you want.

I'd like to see more ability to kind of control the level of the discussion and that sort of thing. Um, and I think that's exactly what open sourcing a tool like this would allow you to do down the line once they get to really good, you know, actual podcast generation.

So I think this is one of those products where open sourcing really will get you a lot of cool ideas implemented pretty fast, because, you know, the way that you set up a podcast matters so much, and, um, uh, you know, anyway, you can imagine people wanting to include a wide variety of different prompts and, uh, meta-instructions.

Andrey

I took a listen just now and, uh, it's not quite as good as Notebook LM. The impressive thing is that it's very, very realistic sounding. There's like a little bit of AI weirdness, but it's very minimal. In these generations, because it's not using proprietary models, it's not quite as good, but it's still pretty good.

Jeremie

Yeah, the vibe they went for with the original launch of NotebookLM kind of made me think, if you've ever heard of like the Reply All podcast, those sorts of things where you have, you know, two hosts, and it's a little bit bantery for my, for my taste. There's a lot of banter and it can get a little bit, um, like, too whimsical at times, but I gotta say, it definitely isn't uncanny valley. Like, that was the thing that amazed me.

I expected a bit of a janky experience, you know, kind of like, uh, whatever that Jukebox thing was that OpenAI first released, you know, those first versions were kind of rough. This one went straight for the jugular. It was, it was amazing. And, uh, yeah, so not surprising the open source version of this isn't quite up to snuff. Um, but, uh, yeah, soon.

Andrey

Yeah, the open source community will take it and run wild with it. Next story, also about Meta. And this one is about a release of Llama 3.2 quantized models. As we've covered, I believe, many times on the podcast, quantized models are things where you take the, uh, real full model that usually has float weights, so numbers like 1.2, 2.45, 6.7, et cetera, and you reduce the size of each of those numbers. You quantize it, so it uses eight bits, four bits, something like that.

The kind of, uh, resolution of the weights of a model goes down, which makes it smaller in size and makes it use less memory. And that's important for things like phones. So here they're saying that they have these quantized models that have a 56 percent reduction in model size and a 41 percent decrease in memory usage. And these are the 1 billion and 3 billion, uh, parameter size models.
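
To make the idea concrete, here is a minimal sketch of plain post-hoc symmetric 8-bit quantization in PyTorch. This is a generic illustration, not Meta's specific recipe, and the tensor below is just a stand-in for one real weight matrix:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric post-hoc quantization: float weights -> int8 values plus one scale."""
    scale = w.abs().max() / 127.0            # largest weight maps to +/-127
    q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Recover approximate float weights for use at inference time."""
    return q.to(torch.float32) * scale

w = torch.randn(1_000_000)                   # stand-in for one weight tensor
q, s = quantize_int8(w)
print(q.element_size() / w.element_size())   # 1 byte vs. 4 bytes per weight
print((dequantize(q, s) - w).abs().max())    # rounding error introduced by quantizing
```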

Jeremie

Yeah, and I think, um, one thing to recognize about these, uh, quantized models, right, is that you take a big ding in quality when you quantize. So imagine you train usually at, you know, BF16, so 16 bits of precision per number, and then you're going to reduce that to, say, 4 bits or 8 bits, right? You're going to take that model and compress it by throwing out a bunch of that information. Now, your model was not trained that way.

It was trained with the 16-bit representation, with the expectation that it could use that level of resolution. If it had been trained with the expectation that it would be forced to use four bits at inference, then it probably would have taken a slightly different strategy, right? You can sort of imagine that if you had to send yourself instructions and, you know, you were given just two sentences versus two pages, you'd probably use a different approach, right?

So, same idea here. Um, when you impose that limitation, when you quantize post hoc, after the training is done, you lose a ton of performance compared to doing something called quantization-aware training, where you incorporate the quantization into the training process. Many, many ways to do that. That's what they do here. We covered a story a little while back, um, it was called self-compressing neural networks.

And that was, I think, back in August, if you want to check that out, the August 9th episode. But, um, self-compressing neural networks is an example of this, where basically what you do is you make one of the model parameters, or sorry, one category of model parameters, not just the weights, not just the biases, but the level of resolution for each weight or each bias, so you actually have the model train on how accurately it represents its own weights as it goes.

And in particular, you can even imagine allowing the model to, like, completely zero out a weight, right? If it finds, hey, you know what, let me look at this weight. Oh, you know what, I could probably get away with an eight-bit representation. Eh, probably do four bits, eh, two bits. Oh, actually one bit. Oh, shoot, I can get rid of the weight completely. Right? So you can gradually hone in on removing weights completely with this approach. It's very powerful.

And, uh, there are a whole bunch of other approaches, you know, the straight-through estimator is sort of a famous one that Yoshua Bengio proposed a long time ago, um, that lets you propagate gradient updates to weights through a rounding operation, which, if you're mathematically inclined, you might recognize as exactly the kind of thing you have to deal with in this sort of quantization process.

But anyway, bottom line is there are lots of ways to do this: incorporate knowledge of the quantization into the training loop and that gives you a better result. Combine it with LoRA, which we've talked about before here, um, and that gives you a bit of finer control over this kind of additional capacity or capability in the model. So I think this is really cool. I'm not aware of another quantization-aware training, uh, process that's led to a sort of, you know, frontier-ish model like this.

Um, so this is actually a really interesting and new development. I may be wrong, but, I don't know, we cover a lot of news, and I don't think I've seen this before.
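
For the curious, here is a minimal sketch of the straight-through estimator idea mentioned above: quantize on the forward pass, and treat the rounding as the identity on the backward pass so the full-precision weights still get useful gradients. This is a generic illustration of quantization-aware training, not Meta's actual setup:

```python
import torch

class FakeQuant4Bit(torch.autograd.Function):
    """Quantize to ~4 bits on the forward pass; straight-through estimator on the backward pass."""

    @staticmethod
    def forward(ctx, w):
        qmax = 7                                      # signed 4-bit range, roughly [-7, 7]
        scale = w.abs().max() / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pretend rounding was the identity,
        # so gradients flow back to the full-precision "shadow" weights.
        return grad_output

# In a quantization-aware training loop you would apply this to the weights each forward pass:
w = torch.randn(8, 8, requires_grad=True)
loss = (FakeQuant4Bit.apply(w) ** 2).sum()
loss.backward()                                       # w.grad is populated via the STE
```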

Andrey

I agree. Yeah. I think it's pretty new. My kind of perception is usually you do more like distillation training. You do some more training once you quantize. So this is pretty different. And one more story for this section, this time about a benchmark. OpenAI has released Simple QA, a benchmark for measuring the factuality of language models. So one of the major issues with LLMs is hallucinations, where LLMs say things that are straight up false or just made up.

But, uh, you know, say it in a way that seems like it's true. This benchmark includes 4,326 questions across various domains and was created adversarially against GPT-4, uh, to ensure that it's challenging even for advanced models. So, uh, because of that, GPT-4 scores only 38.4 percent correct answers, uh, which tests the ability to give actually reliable answers. And there are also, uh, metrics like correct given attempted.

So besides trying to answer, uh, LLMs can also say "not attempted," so you don't give an answer to the question, and that in some ways is better, right? Because you're not going to hallucinate.
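
As a rough illustration of how a metric like "correct given attempted" falls out of graded answers (the grades below are made up, and the grading itself is done by the benchmark, not shown here):

```python
from collections import Counter

# Hypothetical per-question grades, one of: "correct", "incorrect", "not_attempted".
grades = ["correct", "incorrect", "not_attempted", "correct", "not_attempted", "incorrect"]

counts = Counter(grades)
overall_correct = counts["correct"] / len(grades)
attempted = counts["correct"] + counts["incorrect"]
correct_given_attempted = counts["correct"] / attempted if attempted else 0.0

print(f"overall correct:         {overall_correct:.0%}")          # penalizes abstaining
print(f"correct given attempted: {correct_given_attempted:.0%}")   # rewards abstaining over guessing
```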

Jeremie

Yeah, this was kind of interesting as a benchmark. You know, honestly, naively, when I first saw SimpleQA, I was like, well, how is this not saturated? This is a stupid benchmark. But then you look at the actual questions. So the criteria are sort of interesting. It's basically just like, let's index really hard on dead simple questions, and let's make sure there's an absolutely indisputable answer to these questions.

So we're not talking about questions about philosophy, morality, politics, whatever, um, just fact-based questions. But they have to be questions that previous models really struggled with, so they end up being super, super niche. Like if you've ever played

one of these, like, crazy trivia games for a TV show you really like. Um, I just remember, you know, I had friends who would play this, like, Star Trek trivia game, and they'd be like, in season three, episode four of whatever, some character says something to some other character, and, you know, what's on the table when blah blah blah. That's the kind of question

that you get here. They index more towards history, and some are a little bit less intense than that, but that's kind of the flavor: make them really specific and niche. Um, and it does cause these models to trip up, and maybe that's not so surprising. Um, a couple of findings that they come up with here: GPT-4o mini and o1-mini answer fewer questions correctly compared to GPT-4o and o1-preview. Um, not surprising. These are small models, right?

So generally speaking, like, one of the things that you can pretty confidently anchor on is that a larger model with more parameters will just have more parameters to kind of soak up knowledge with, and so it'll be better at general knowledge. Maybe not better at reasoning, but certainly better at general knowledge. And, you know, you could chalk that up to overfitting, basically, because the model's bigger, whatever, but that's fundamentally what's going on there.

Um, an interesting result, though, was that o1-mini and o1-preview choose not to attempt questions more often than GPT-4o mini and GPT-4o. And they're not sure why, but they speculate it might be because they're actually able to use their reasoning capabilities to recognize when they don't know the answer to a question, instead of just going ahead and hallucinating. I thought that was really interesting. Again, a benefit of inference time compute, right?

These models can generate a scratch pad answer, kind of interrogate it, And go, does that look right to me? Hmm. I'm not so sure. You know what? Let's maybe not go ahead and give that output. So, you know, it sort of shows the entanglement between reasoning and hallucination. And a lot of people have proposed a lot of different theories about how exactly hallucination is tied to reasoning, but this definitely supports that hypothesis, right?

Like, you know, you're in that space where you're seeing some clear differences with these reasoning models. Um, and then there's a separate question, this I thought was so interesting. So, um, how well calibrated are the models' claims, right? So calibration here means, you know, if you ask me who's going to win the election today, right?

Um, and I say, well, you know, I think there's an 80 percent probability that Kamala will win, or an 80 percent probability that Trump will win. Um, when I make claims like that, if I'm well calibrated, I should be right about 80 percent of the time and wrong 20 percent of the time, right?

That's what it means: if I'm telling you I'm 80 percent sure of X, well, then 80 percent of the time X should happen if I'm well calibrated. And so they test that with these models. They literally ask them, okay, you just gave me an answer, how confident are you in that answer? And they give calibration curves that are really, really interesting. Um, so, no surprise, I mean, the calibration is kind of shit for most of these models, um, but it does track.

So to some degree, when the models tell you, yeah, I'm really confident, they tend to be more correct. So, for example, when the models say that they're 90 percent confident, um, you'll see accuracy scores hovering around the, like, 60 percent mark, at least for the o1 models. Um, whereas if they tell you that they're, like, 50-50, the accuracy is like 10 percent. So you actually do see it track, and it's an interesting kind of curved plot too.

You'd expect it to be a straight line if it was perfectly calibrated, but it's not. Um, anyway, there's a whole bunch of interesting other measurements on calibration. I would recommend checking that out, because I think hiding in these calibration questions is a very, very powerful insight about self-correction, right? If a model is able to correctly determine how confident it should be in its own output, that's not far from self-correction, right?

Being able to recognize that you're on the wrong track, or have an intuition that, yeah, you know, I'm pretty sure this result is the right one, um, that gets you to some interesting places. And that's why, I think no surprise, the o1 models far, far, far outstrip the, um, GPT-4o series in terms of calibration. It wasn't kind of branded as that here, but I think that's one of the hidden insights here.

Calibration and reasoning, calibration and hallucination are entangled in this kind of complex of ideas around can the model, uh, in a sense, uh, you know, introspect or, or assess that it's not quite on the right track. So I think more to come in that line of research, I'd suspect in the next, uh, next few months.
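
To make the calibration idea concrete, here is a minimal sketch of binning a model's stated confidences against actual correctness to get a curve like the one described; the numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical (stated confidence, answer was correct) pairs gathered from a model.
confidence = np.array([0.9, 0.5, 0.8, 0.95, 0.5, 0.7, 0.9, 0.6, 0.85, 0.55])
correct    = np.array([1,   0,   1,   1,    0,   0,   1,   0,   1,    0])

edges = np.linspace(0.0, 1.0, 6)                 # five confidence buckets
bucket = np.digitize(confidence, edges) - 1

for b in range(len(edges) - 1):
    mask = bucket == b
    if mask.any():
        claimed = confidence[mask].mean()        # what the model says
        actual = correct[mask].mean()            # how often it is actually right
        print(f"claims ~{claimed:.0%} confident -> correct {actual:.0%} of the time")
# A perfectly calibrated model sits on the diagonal: claimed == actual in every bucket.
```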

Andrey

Yeah, quite a few little interesting tidbits in this paper if you dig deeper. Uh, just for fun, I'll give some examples of the questions in this, uh, benchmark. So one example is: what is the first and last name of the woman whom British linguist Bernard Comrie married in 1985? Oh, do you not know that? Uh, I did not know that. The answer is Akiko Kumahira.

Speaker 5

Yeah, that's what I was going to say.

Andrey

Uh, yeah. Another example is, uh, what day, month, and year was Carrie Underwood's album Cry Pretty certified Gold by the RIAA? So if you are a diehard Carrie Underwood fan, maybe you know that. I don't think most people do. That's October 23rd, 2018. So you can see how these are simple questions in some sense, but unless you actually know the answer, there's no way you can get them right.

And so in that sense, it's actually pretty impressive that the models do get a decent chunk right. You know, uh, here GPT-4o has 38 percent correctness, o1-preview 42 percent, so they have soaked up a lot of knowledge, even very obscure knowledge. One last interesting thing to note: they did also evaluate the Claude models, Haiku through Sonnet, and there is quite a discrepancy, in the sense that the Claude models, across the board, did not attempt to answer things much more often.

GPT-4o on these questions did not attempt to answer only one percent of the, um, questions and got things wrong 61 percent of the time as a result. Claude 3.5 Sonnet did not attempt to answer 35 percent of the time and got things wrong 36 percent of the time as a result. And so actually the Claude models are better if you account for, uh, how often you attempt to answer. Uh, which is interesting, I think.

And even the smallest Claude models, like Claude 3 Haiku, do not attempt to answer 75 percent of the time. So it seems like the training... that tells us something about how, uh, OpenAI and Anthropic are training differently and for different goals.

Jeremie

Yeah, and it absolutely speaks to the differentiation of the products, right? Like, I mean, I found myself, in cases where I'm just trying to understand a topic for the first time and it's not easy to verify the truth of the claims that I'm getting, I will tend to go to Anthropic, to Claude, because I'm like, okay, you're more likely to not take the risk and just, like, make something up.

Um, but if I'm more in, like, ideation mode, trying to kind of brainstorm from scratch, um, that's where OpenAI models and other models can be, uh, can be better. So the personality, if you will, of these models, uh, becomes an important factor.

Andrey

And now on to research and advancements. And as promised, we are getting back to Physical Intelligence. They unveiled their first research output, a generalist policy, uh, covered in this article as a glimpse of the future of AI robots. So this, uh, generalist policy, the idea behind it is it's a unified model that can control different types of robots, so things like a robot arm, or, um, a robot that has wheels and two arms, a robot with two arms, these different, uh, kinds of

bodies, uh, and that can do various tasks. So it's, uh, conditioned with just some language to do, uh, things like folding articles of clothing, carrying things between tables, bussing things, assembling boxes, and packing items, stuff like that. And so they, uh, of course, went the large model route. That's kind of the whole promise and why they got 70 million and now even more.

Uh, so they trained a very large model on a very large, unprecedented amount of data: over 10,000 hours of robot demonstration data across seven different robot configurations and 68 distinct tasks. And this was by combining, uh, released, uh, research datasets, there have been some efforts on that front in recent years, along with some proprietary data of their own collection. And this model has 3.3 billion parameters, so that tells you something about what they're aiming for.

And, uh, last thing I'll mention, there's a lot to note here, but, uh, the other thing that's important for robotics is you need high-frequency control. You need to be able to output things very rapidly, many, many times per second, to be able to control robots. So they say here you can get up to 50 hertz, uh, of output for robot control.
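
As a rough sketch of what a 50 Hz control budget means in practice (a generic loop, not Physical Intelligence's actual stack), the policy has about 20 milliseconds to turn an observation into an action:

```python
import time

CONTROL_HZ = 50
PERIOD = 1.0 / CONTROL_HZ                 # ~20 ms budget per control step

def policy(observation):
    # Stand-in for the learned model; it must return an action within the budget.
    return [0.0] * 16                     # e.g. joint or actuator targets

def control_loop(read_sensors, send_action, steps=250):
    for _ in range(steps):
        start = time.monotonic()
        send_action(policy(read_sensors()))
        # Sleep off whatever remains of the 20 ms so commands go out at a steady 50 Hz.
        leftover = PERIOD - (time.monotonic() - start)
        if leftover > 0:
            time.sleep(leftover)
```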

Jeremie

And this is why the model is so small, right? Like, that's all part of the strategy here. It does remind me as well of DeepMind's Gato model, which came out back in 2022, right? Famously, it had sort of like 450 tasks that it could perform, I think at about 50 percent as well as a human expert or something like that. Um, I think it was actually trained on like 600, but anyway, Gato was billed as this generalist agent, right?

That was the name of the paper that announced its release. And this seems like a very similar approach. It's one of these things where they're trying to combine a VLM, a vision language model, with this kind of fine-tuning training for robotic control. And there's a limited kind of output dimension. Um, Andrey, you were flagging this earlier to me as something you found really interesting. Uh, I think it is.

It's also interesting that, you know, we're still at the stage where we're not quite training fully generalist robot systems, right? We're still anchoring on, we have 16 dimensions, um, of outputs, of affordances, like sensors and actuators that we can control. I think one of the things that's going to be interesting to watch out for is how does that change? How do we move into a world where it looks more like we have a world model backing the system?

Which is what the VLM is really doing here. And increasingly, you know, as you fine-tune it for different end-use robotics applications, the model itself captures enough of the physics that you need less and less fine-tuning, less and less actual robotics manipulation data, to get this thing to learn a new task.

And maybe ultimately just kind of vision data, which would be really interesting because that's the sort of thing that humans use, of course, but that's, that's part of the age old debate of what kind of data is best for these sorts of general tasks, but

Andrey

they also released a whole, uh, paper in kind of a standard academic format. Not surprising, given that a lot of the founders of the company are professors, uh, people like Chelsea Finn, Sergey Levine, people who have contributed a ton to robotics in the deep learning space. Uh, so they also did some evaluation, in particular on zero-shot performance, that is to say, where you don't have any demonstrations of this particular task.

And what they, uh, have shown is, for various tasks like shirt folding, grocery bagging, getting toast out of a toaster, this model is able to somewhat generalize. So it can fold shirts like 100 percent of the time, I suppose probably on unseen shirts. It can get around 80 percent on grocery bagging, things like that. So that's the really important thing for a generalist model, to be able to do something without having

done it before, or having seen it done before, which humans are quite good at. If you see, like, a new piece of clothing, you can pretty easily figure out how to fold it without having done it before.

Jeremie

And that's a big shift, especially in the robotic control and multimodal space, right? Like, it used to be, um, and fairly recently, like as recently as 2021, 2022, in fact, this was the case. I think it was the case with the original Gato release. I don't think it was the case with Gato 2.

But, um, we were at a level of scale such that the models were small enough that, um, once you trained them on a core set of skills, if you wanted to train them on an additional skill, they would kind of forget some of the old skills. They would perform worse. The theory was you should get what's known as positive transfer: it should become easier and easier for the model to learn a new incremental skill because it's picked up so many.

Um, just like, you know, if you go out and learn math or physics, that makes it easier to learn chemistry. Um, well, what they saw back then was negative transfer. Actually, adding more skills, it was just too many balls in the air, and it started to perform worse on the kind of original skills it had been trained for. Um, now we're starting to see, as you say, that positive transfer. And that's long been hypothesized. There had been a lot of debate,

uh, in AGI circles about, like, would that be a crux. We blew past that debate a long time ago, and it's interesting to see it now arise in commercial products. So, uh, yeah, really interesting development. And we'll see what the next steps are, if this is really the GPT-1 and the ChatGPT is, you know, maybe around the corner or a few years on.

Andrey

Well, they are calling it π0, pi zero. So maybe it's like GPT zero. They're really

Jeremie

trying to fight the hype. This is like,

Andrey

uh, and as always with robotics, there's a whole bunch of fun videos to look at of these robots doing stuff, especially in their blog post. There's just a ton of them. So as always, you can go to the link to check it out, and I'll try to include a few also on the YouTube version of the podcast. Alrighty, moving along, the next story is not about robotics, it is about coding. And the paper is "Can Language Models Replace Programmers? REPOCOD Says Not Yet."

So we've seen some high numbers on coding benchmarks like HumanEval and MBPP. Uh, from LLMs, you can get something like 90 percent accuracy on these coding benchmarks. But these benchmarks are often sort of these little challenge problems, like, you know, solve this problem, uh, via an algorithmic approach. So they don't address what you actually do as a programmer, as a software engineer, which is writing code within a larger product, a larger code base, stuff like that.

So in this benchmark, they have a new, uh, dataset consisting of 980 problems from 11 popular real-world projects, with over 58 percent of those requiring file-level or repository-level context, which, of course, is what you need when you are programming. You need to know your code base. And so it's much more complex than what you would get in these programming challenges. And as a result, no LLM has achieved more than 30 percent accuracy.

So that, yeah, it goes to show that Like you're not going to replace professional programmers with LLMs just yet, at least not, you know, without putting in a lot of work.

Jeremie

Yeah, this goes back to what I guess we were talking about in terms of the, you know, the agent flow. If you have a model that makes, you know, a mistake 1 percent of the time, well, making substantive changes to a code base involves stacking an awful lot of those 1 percents on each other, so you get very high failure rates; chain a hundred 99-percent-reliable steps and you're down to roughly a 37 percent chance of getting them all right. I will flag, I mean, this is pass at one, right? So basically they're just seeing, can you succeed at solving the problem on the very first try.

Um, this is kind of like expecting a human to just, like, start coding, not stop, and just get it all right, you know, roughly speaking, in one shot. Um, and, uh, so, you know, you wouldn't expect a human to be able to do that. The fact that these models can get 30 percent under those conditions, I think, is quite remarkable. Um, and of course, scaling, you know, blah, blah, blah, scaling might easily solve this.

But even just looking at, um, what's happening with inference time compute, and, um, the possibility as well of having little checks along the way. Like, one of the things that's missing right now, the reason humans are so important in the code-writing loop still, is that we basically just serve as a source of ground truth, right? We will, like, write up or start writing a function, hit autocomplete, you know, whatever.

And then we'll kind of check it and be like, yeah, that makes sense, keep going. And you don't really think about that as a programmer so much, um, about, like, the amount of information you're feeding into the system, doing these minor little course corrections that really are compensating for those 1 percent, 2 percent, 5 percent errors at each step. Um, so it helps to lift a lot of the burden from your shoulders. You're not writing nearly as much code. I know I don't.

But, um, uh, but, but you know, to actually have the system fully automate what you're doing is a very different story and the bar is so much higher. Once you can do that, you're in a very different world very fast, but it takes a while to get there.

Andrey

And a fun little detail I'll include if you are a programmer: some of the repositories they use here are scikit-learn, flask, plotly.py, sphinx. These are big, well-known packages if you are a Python programmer, as I am. Like, I've used flask, I've used scikit-learn, I've used, uh, seaborn, uh, and yeah, they have quite complex, uh, code bases to work in. So it makes sense that the LLMs can't, you know, nail this out of the box. On to the lightning round, where we'll try to go a bit faster.

The first paper is "Brain-like Functional Organization within Large Language Models." So, uh, we know a little bit about how brains work. We know that there are various, sort of, uh, areas of the brain that focus on different sensory domains, like visual, auditory, and linguistic, uh, phenomena. And that's what we're looking at here with LLMs. We're looking to see how individual neurons within language models organize functionality, similar to how the human brain has specialized networks.

So they do that with fMRI patterns and then try to map those to patterns within neural network activations. And what they find is that you have, you know, some similarity, right, in terms of there being organization within LLMs, and in particular, more advanced LLMs have more organized and hierarchical functional patterns.

You can't do it one to one, of course, but as we've seen also before in prior research, um, kind of the same phenomena of organizing information and compute arises in the human brain and these higher, bigger LLMs.

Jeremie

Yep, I mean, it's kind of interesting. And I guess we were talking, uh, was it last week, about Anthropic hiring, uh, somebody to care about AI consciousness and sentience. And it kind of makes you think, you know, once you get to the point where these patterns, let's say, are about as

different between LLMs as they are between humans, or rather, you know, when an LLM is somewhere in parameter space that's in between, you know, like, human neuron behavior, at what point do you start to think about that? Assuming there's, like, you know, reinforcement learning circuitry there as well that kind of mimics the limbic system, blah, blah, blah.

But, you know, it's an interesting metric. If you get to that point, and probably a lot sooner, yeah, it kind of, you know, starts to make you ask some interesting questions. But yeah, there's a lot of cool research in this direction. I know there are people at Meta who have done stuff like this as well. So yeah, a lot of cool stuff. You're a neuroscientist. I'm not, but alas.

Andrey

Next story: Decart AI simulates a real-time playable version of Minecraft. Decart is an Israeli company, and they hit it a little bit big, as I've seen, with the launch of Oasis, which is a model that lets you basically play the game Minecraft entirely via an AI model. So you can do the same kinds of inputs, you know, move around, uh, mine for stuff as you do in Minecraft, but the actual rendering of the game and all the logic is being handled by a neural network.

And the impressive thing is that they have a real time demo. So you can actually play this like a game. It's, it's being output at quite, quite high frame rates, although, you know, not super high resolutions. And this came out at the same time as them announcing 21 million in funding.

Jeremie

Yeah, it's all part of this debate, right, over whether AI models can develop these world models, right? So a robust representation of actual physics. And I really liked this test, right? Because this is called Minesweeper, uh, can you tell I'm a millennial, uh, Minecraft. Uh, the interesting thing about Minecraft is the physics is so simple, right?

Like it really is like if you stripped away all the general relativity and the quantum mechanics and it's just blocks and shit. That's great. Um, so, so to the extent that you can show that this thing can master a physics engine, that it can master a world model, even a simplified one, you are to some degree showing that models can do this. And then you start to ask, you know, why not the real world? And I think that is a legitimate question.

Um, so that sort of becomes the point of contention, you know, how well, how robustly is this model actually capturing what's going on here? And it's a bit unclear. Um, so one of the issues you see as you play it, even for a short period of time, is that it'll quickly forget the level layout, and the landscape will just kind of get rearranged around you.

Now, you know, those of us who may be more in the scaling camp would say, well, this is a question of just more scale and blah, blah, blah. You guys know the Jeremy take, you know, I don't need to repeat it. But there are other interesting questions, and there are very smart people who would disagree. And, um, this is a symptom, right? I mean, if I turned around and the room was reorganized behind me, that would tell me that, like, somebody has fucked up the universe's physics engine.

So, uh, this is, um, I think part of the debate that will continue. Um, but, but definitely impressive that it can, it can have that coherence, you know, frame to frame and over a few seconds, we've seen this with other games and things like this. Um, but I've never seen a demo, like a playable demo, uh, that you can, you can play at a decent frame rate. So it was kind of cool.

Andrey

Right, yeah. I'm certainly in the other camp, where I don't think you would want to simulate a world via just a pure neural network. Uh, and I think there are some arguments to be made there. Our brains wouldn't be very good at exact simulation, right? We have sort of a fuzzy simulation, not exact as you'd need in a game. And that's what you get with this kind of simulator.

Like you can interact roughly in the right way, but then it does forget the state of the world and it can sort of get very trippy sometimes if you play for a while.

Jeremie

Actually, so I agree with you that you wouldn't want to, like... I don't think it's the optimal use of hardware, for sure.

Speaker

Yeah,

Jeremie

um, but I I guess the the argument is that one way humans do this reliably. This is kind of interesting Um, we'd have a separate discussion on this but is we distill laws of physics And so we're able to like look at the world around us and go.

Oh, well, you know that Like I can write down an equation that predicts robustly what's going to happen around me If I had the the capacity and again This would not be the right way to use the human brain as you said You But if I had the capacity, I could plug those laws of physics into a physics engine and run the engine and offload that compute to that end system.

So I, like, I think, I don't know if we're that far apart in our position, but it's like kind of what is the capacity of the model to extract those, those rules that govern the physics and do those come out naturally with scaling? And I think that we will, the answer is only going to be born out with more scaling. So Luckily, we don't have to bet our 50 billion a year. Microsoft is doing that for us.

Andrey

And the last bit here, raising the bar on SWE-bench Verified with Claude 3.5 Sonnet. So, Jeremy mentioned this earlier in the episode. Anthropic did have this announcement of the model achieving a 49 percent score on SWE-bench Verified, which surpassed the previous state of the art of 45 percent. And that did come about just as it was announced that GitHub now has support for it. And SWE-bench is related to solving GitHub issues from open source Python repositories.

You know, you have these like, Oh, here's a bug. I need to solve it. That's what GitHub issues are. So, you know, being able to do well on this is pretty good, pretty like obviously directly useful.

Jeremie

Yeah, I love this paper. Uh, this was just really good. Anthropic's really good at, uh, the whole building models and then thinking deeply about the prompts game. Like, that's, I think it's fair to say, one of their differentiators, and they do a great job in this paper just laying out the actual prompts and prompt development approaches that they follow to make their agent.

One of the things that was cool: so they share their design philosophy, which basically is just to give as much control as possible to the LLM itself and keep the, um, agentic scaffolding minimal, right? So you have some agentic scaffolds that really try to tell the model how to think; they're going the opposite direction and saying, you know what, let's trust the model to think things through on its own, which is what you might expect as the models get more capable.

You rely less and less on that scaffolding. Um, but, uh, yeah, so they share a bunch of interesting results. The headliner is, yeah, that 49 percent figure, you know, Claude 3.5 Sonnet (new), um, the sort of new version of Claude 3.5 Sonnet, uh, that can hit 49 percent on SWE-bench Verified, which is really impressive. I mean, these are basically resolving real GitHub issues, so this is something that would be useful in practice.

So hitting nearly 50 percent, you know, not at all bad. Uh, the previous state of the art was 45 percent, so that is a good jump. Um, one of the lessons they share, a lessons learned: they say we believe that much more attention should go into designing tool interfaces for models, in the same way that a large amount of attention goes into designing tool interfaces for humans. So, in other words, like, you want to care about the user experience for the model. And they give an example.

You know, one way we improved performance was to error-proof our tools. For instance, sometimes models could mess up relative file paths after the agent had moved out of the root directory. To prevent this, we simply made the tool always require an absolute path. So, if you don't code or whatever, that might not sound like it makes sense.

Basically, the idea here is, um, you can kind of navigate into a particular directory on your computer and kind of code within it, and all the commands you give will be local to it. Um, but essentially the problem is, if you want to issue a command that's relevant to a file somewhere else in the tree of files on your computer, you gotta step out of where you are and then work your way back down the tree.

And the model was just kind of struggling with that. So they said, okay, you know what, just give your instructions with an absolute file path, start them at the top of the tree every time. That's way more detail than you need. Bottom line is, this is user experience for AI models. That's the cool thing here. And you can basically ignore everything I said that's not that.
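
As a tiny illustration of the kind of error-proofing being described (a generic sketch, not Anthropic's actual tool code), a file tool can simply reject relative paths and hand the error back to the model so it retries with an absolute one:

```python
import os

def read_file_tool(path: str) -> str:
    """Hypothetical agent tool: refuses relative paths so the model can't trip over
    whatever working directory the agent happens to be in."""
    if not os.path.isabs(path):
        # Surface the error back to the model so it can retry with an absolute path.
        raise ValueError(f"path must be absolute, got {path!r}")
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```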

Andrey

Well, as a programmer, that was some nifty detail. On to policy and safety, and we've got a pretty interesting story that you probably have not heard about, and that is that the Bureau of Industry and Security has proposed an AI regulation that might be very significant, and no one actually noticed, 'cause it flew under the radar. So this regulation would mandate that US companies report large AI model training plans and computing cluster acquisitions to the government quarterly.

And what the rule aims to do is collect detailed information on dual-use foundation models. Uh, this is, you know, dual use meaning that it can be used for good and for bad. Uh, and, uh, the backing for the regulation is the Defense Production Act. We talked about it a long time ago, how this was seen as a way to institute these kinds of requirements to protect against, uh, bad models.

So. A lot of conversation has been going on about, uh, requiring something like this, reporting if you're trying to train a big model and seemingly they just went ahead and, uh, did it.

Jeremie

Yeah, you can read this as being a follow-on to the Biden executive order that came out back in, uh, October 2023. Now, this is, I think, the longest executive order in US history. So an executive order is when the president basically comes out and says, hey, I'm directing all the executive agencies, um, brief civics lesson, but executive agencies means all the active parts of government that are not, like, the Congress, where laws are written, or the judiciary, where laws are interpreted.

Um, that's it. So basically, the president comes out and says, hey, this is what's going to happen. Um, BIS, uh, is the arm of the Department of Commerce that looks at enforcing a lot of things like export controls and, uh, you know, tech policy, that sort of thing. Um, so this is really interesting, because they're essentially being given carte blanche. BIS is proposing to allow itself to collect arbitrary information about these training runs.

In addition to the reports that you just mentioned, they can require these ongoing reports. Uh, and that's its own issue, right? The fact that they're asking for the first time for ongoing reports every so often, like, you have to regularly report to BIS. No one's ever seen that happen before. We've seen one-off requests for information, that happens pretty often; that's understood to be part of the executive prerogative.

Um, but we've never seen this kind of ongoing thing. That's one thing people are pushing back against. But another is that essentially this request for information can go arbitrarily deep. So BIS can send companies additional clarification questions after they get the initial answers, and those questions have to be answered within seven days. Kind of interesting, kind of a tight timeline, especially for these, you know, highly technical questions.

Um, and there's no limit really to the scope of the topics that could be asked here. Now, there's an interesting question as to why this is happening now. I think the real lay of the land here is the US government is currently trying to build its capacity to process this kind of information, right? So there are questions about, okay, well, you know, we just saw OpenAI's

o1 try to, like, autonomously break out of a, uh, sort of, um, sandbox setting and, you know, show some success at doing so. Okay, so maybe loss of control risk is on the table. Who do we alert in government? Like, is there even an office? Is there capacity to process this information? The answer right now is no, but that's starting to change, and this is one way that that change is happening.

Um, there's this big debate over whether this is an appropriate use of government authorities. The Defense Production Act was controversial to invoke, but from a legal standpoint, I mean, it does seem to rest on pretty solid ground. Um, it authorizes the president to take a very broad range of actions in service of the national defense, and that historically has been interpreted to include not just national defense but critical infrastructure protection, even energy production.

And so when you look through that lens at AI and the national security capabilities that come from these models, like, yeah, all of a sudden that actually seems like a pretty damn appropriate use of exactly this. Um, but historically this has been used, as I said, to do just, like, one-off information collection. And so people are taking issue with the repeated nature of this, even though there's nothing in the statute that says you can't use this on an ongoing basis. This is just, you know, the first time that we're choosing to actually use it in this way.

This is just, you know, the, the first time that we're choosing to actually use this in the way that it is. Um, the other thing, you know, you can see understandably people are concerned that the executive order looks like it's an attempt to use emergency wartime powers in peacetime to increase the government's control over private industry. That's been one thing that's been flagged. Um, that.

is a little bit tricky to push, because the reality is the leading labs are the ones who would have to file a legal challenge to kind of take issue with that, and none of them have shown any interest. They all seem to acknowledge that, yeah, this is a national security relevant technology; this does seem like an appropriate thing to do. And, uh, you know, we'll see if there are legal challenges coming, but so far it just doesn't seem like that's going anywhere. I think this will be sticky.

I think, you know, part of it is we've now spun up government infrastructure, um, for better or for worse. I do think on balance it makes sense, but people will complain about, you know, setting up new, uh, bureaucracies to monitor this stuff. But I think that, you know, you just have to build that capacity to the extent that you buy into any kind of risk picture, whether loss of control or weaponization or something else.

Andrey

Yeah, exactly. And just to highlight, uh, so it's very clear: this is a proposed rule at this point, but, um, as an executive agency, you know, there could be a legal challenge if they were to implement this rule. Uh, and it seems likely, at least according to this article, that something like it will be implemented in the near future, although that hasn't happened yet.

And speaking of rules and regulations, the next story is about Anthropic, uh, having a new blog post that is warning governments that things could go real bad if they don't institute regulations within 18 months. So they are calling for urgent government regulation, and they highlight the improvement of capabilities, as they've been studying under their responsible scaling policy.

Highlighting that there are increasing risks in things like hacking, in things like the chemical, biological, radiological, and nuclear context, things like that. And they advocate for specific things. Uh, they advocate for high-quality security practices, uh, for mandated transparency, for incentivizing security, and for simplicity. Uh, and particularly they care about focused legislation. So they don't want, like, a very broad, uh, kind of regulation, which some have argued was the case with SB 1047.

Uh, they want these to be very specific and targeted.

Jeremie

Yeah, I'm always amused by that, uh, SB 1047. Like, far be it from me to be, uh, looking for tech regulation as a startup founder, you know, not a fan generally, but 1047 was extremely carefully scoped to catastrophic risk, not at first, but eventually. I mean, you know, they kind of whittled it down to the point where it was like, okay, clearly this is, you know, focusing on the WMD side.

Um, so, you know, anyway, I think there were some interesting questions there about what was going on in Gavin Newsom's head, and Nancy Pelosi's head, when they pushed for this to be axed. But yeah, I mean, all the standard pushback you might expect to this from all camps, I've seen it all on, uh, Twitter or on X, you know, people saying, oh, you're trying to slow down, uh, you know, AI development, which, I mean, it's Anthropic.

They're an AI lab. Um, and then the people concerned about, like, well, you're trying to, you know, downscope the, uh, the range of policy responses here. That approach, I've heard it articulated by a lot of people, including Helen Toner, um, and a lot of other kind of AI policy and safety people: don't regulate too soon, too fast, too hard. Um, because, uh, I'm sorry,

the argument for regulation here is, uh, if you don't regulate, there'll be some kind of event, and then you'll get a backlash and people will overreact. Um, I think this is a very valid concern. Um, it's probably going to happen anyway, to be perfectly honest, I think, at this point, with open source being where it is.

Um, but yeah, I mean, well, the other thing too is there's very little emphasis here on loss of control risk, in a context where evidence for that is accumulating rather quickly. And so I'm intrigued by that. Um, the reason I'm intrigued by it is, it is the default strategy right now of all the frontier labs to build an automated AI researcher and trigger, like, a freaking singularitarian explosion. I know this sounds crazy. This is their plan. This is explicitly and publicly their plan.

This is also explicitly and publicly an insanely dangerous thing to do on the default technical trajectory. And yet there's nothing about that here. I can't help but think that that's just because the Overton window just isn't there. The general public, just, you know, not thinking along those lines.

Andrey

I definitely agree. I think this is, uh, strategically worded or phrased to get people on board. In fact, the name of the, uh, blog post is "The Case for Targeted Regulation." So it's making a case. It's trying to convince people of something. There's a frequently asked questions section that includes things like, won't regulation harm the open source ecosystem, won't regulation slow down innovation. So it's almost like it's debating the side that says there shouldn't be regulation.

And, uh, actually this is mostly on that convincing front, so there's not a lot of detail. Basically, what they're saying is, we have this responsible scaling policy, and we've had it since September of 2023. We think it works well and, uh, you know, is a good framework, and we think that other AI labs should do something like it. So that's basically the gist of their suggestion for regulation: you should mandate or, you know, regulate something that would make

other labs scale responsibly. On to the lightning round, and again jumping back to a point that Jeremy previewed, the first story is open source bites back as China's military makes full use of Meta AI. So the story is about this new AI system, ChatBIT, which was trained on military data and is intended for intelligence analysis, strategic planning, simulation training, and command decision making. And it is based on Llama, right? So this is, uh, based on Llama, presumably fine-tuned from Llama.

And there are a couple of examples. So that's one of them. There was also another paper that revealed the use of Llama 2 for training airborne electronic warfare strategies, and another model that has been deployed for domestic policing in China, aiding in data processing and decision making. So it doesn't go, uh, according to the use policy of Llama, which does prohibit military applications.

Oh, thank God. Yes. I'm just going to say, it's not meant to be like this, but it's also no surprise to anyone that this is what's happening. Basically, anyone who, uh, let's say, made the case against open sourcing as aggressively as Meta has, has pointed out that this is one outcome that would happen. And well, now we know, you know, here are some examples of it happening.

Jeremie

You know, I don't know, I think Meta's positioning here is just a bit off-kilter. The reality is, they're using open source as a recruitment play at a time when, frankly, they're not producing the best models in the world, and that's their big challenge. If these were closed source models, nobody would be talking about them. That may change, by the way, probably will, with their fleet of H100s and so on.

But until now, Meta would have been, like, not very interesting as a proposition. Um, and that's critical, because you get into a recruitment death spiral then, right? If people aren't talking about you as a going proposition in the AI world, it's harder to recruit, harder to train that next and greatest model. Um, and so they've had to do something, and that something was open source, and I think that made sense for recruitment.

It also made sense to some degree for the usual open source reasons, easier to integrate open source developments into their stack, blah, blah, blah, good stuff. Also obvious, as you said, is that America's, uh, geopolitical adversaries would absolutely, um, leverage this and weaponize it. China is starved for chips. Um, they're absolutely facing a massive deficit on the AI side.

We are gifting them our crown jewels every time we take a model like Llama 3.2, like Llama 3.1, like Llama 3, like Llama 2, and we just publish it. And we've known for a long time, and we've talked to people who are familiar with the kind of China AI startup scene, and these guys all have their companies basically running on Llama. Like, Meta is setting the bar for Chinese domestic AI capabilities in a big way. It's not the only one, there's Qwen as well, but even then, like, a lot of these, uh, big companies are using either the Llama architecture or the models themselves as the backend. Like, this is really, like,

They had no quen, but a lot of, even that, like a lot of these, these, uh, big companies are using either the llama architecture or the models themselves as the backend. Like, this is really like. Like meta, without exaggeration is propping up the Chinese domestic, like defense AI state. That is an actual thing that's happening.

And their response then, sorry, you're getting my bias coming through here, but I think, I don't see the counterargument. I'll just read the statement from Meta's spokesperson, um: America must embrace open source innovation or risk ceding its lead to China, hurting our economy and potentially putting our national security at risk. That sentence does not make sense.

Like, to the extent that you are just open sourcing your work, and given that China is, sorry, behind us right now, you're just, like, giving them a leg up. To me, this is complete, complete insanity. Um, I'm very open to nuanced arguments for Meta, and I hope that they come, I hope that I'm wrong about this, that there is some interesting way. I gotta say, like, we work full time in this space, and deep in the national security world, this

is insane to me. Like, I'm hoping that I hear a good counterargument sometime soon. If not, it seems like we've got a fleet of H100 GPUs at Meta HQ humming along in support of, effectively, the interests of the PRC or the CCP. And that's not a good look, but I could be wrong.

Andrey

Yeah. Wow. Okay. Uh, I guess it's not entirely surprising that would be your take. I will push back a little bit. Uh, I think it's unfair to call it purely a recruitment play. For one, I think there is a real ideological belief that open source will make the technology progress faster. There's also a strategic, uh, leaning here on making people use PyTorch and on, uh, devaluing the work of its, uh, competitors. So it's not just a recruitment play.

And I will say also, on that competitive front, to be fair, I think it would be a much bigger deal if it were, like, the 400 billion parameter model; at least the 70 billion parameter models are weaker, so to say, right? So things like Qwen, and things, uh, like, uh, 01.AI, I believe, do showcase that there is a domestic, uh, possibility for training good LLMs, even if that's harder in China. I mean, there's an argument to be made on the benefits, pros and cons.

This is a con, for sure. Meta will not admit it, but it's not good that this is happening. Uh, and that's something that I think open source proponents also admit, you know: yes, this is an outcome, but we have to weigh the pros

Jeremie

and cons, for sure. And not to do the motte-and-bailey thing, but, um, I will say, my analysis is through the national security lens. There is a separate question, you're right, about whether it's good for domestic advancement and blah, blah, blah. The reality is that our best models are not open source right now in the West.

And again, we're setting the floor; like, China's AI startups are preferentially using Meta's models, which means that we are defining the frontier of AI capabilities in China through these releases. So what I was commenting on, I want to be specific here, was this argument that Meta is making, that America must embrace open innovation or risk ceding its lead to China. Yeah,

Andrey

that, that is kind of just BS,

Jeremie

right? Yeah, but that's what my comment was about. Yeah, I totally get the recruitment side, and as I mentioned, the kind of, uh, software ecosystem side, to make it easier to fold things into Meta. Totally agree. There is, of course, yeah, the open source ideology, which I think is sort of the economic argument, not necessarily the national security one. But, um, yeah, I'm fascinated by this.

I'm, I'm waiting for the, the, the counterargument on national security pictures specifically. I don't, I

Andrey

doubt it's coming. Yeah. And a related story, very much related: the story is that Meta is saying that it's going to make its Llama models available for US national security applications. So US government agencies and contractors can now use it, uh, for that. Again, that's something that was, uh, against their usual use policy. Uh, there's now this exception for US and allied government agencies, uh, following the unauthorized use of older Llama models by Chinese researchers.

So this very much follows up on that front. Meta is trying to, I guess, save some face, presumably, with this move. Uh, and well, I guess it is counterbalancing that to some extent.

Jeremie

Yeah. Uh, that'll do it. Like, I think, um, there's just an amusing picture here of, like, putting a loaded gun on a table and just putting a, uh, like, post-it note on it that says "do not shoot." Uh, like, okay, we fulfilled our safety requirements. Um, yeah, I don't know, these licenses only go so far, but nice of them to do that. This is not a bad thing. Um, but, uh, yeah.

Andrey

Yeah. And to be fair, you know, in Silicon Valley a lot of people are pretty liberal, and I do think they would get some pushback for even this much, for actually,

Jeremie

you know what? That's a good point. That's a good point that this is not a trivial, you're right. It's not a trivial call for them to make in that respect. Um, again, through recruitment lens. Uh, yeah.

Andrey

And we are almost hitting the two-hour mark, so I think we'll finish up here and save the synthetic media and art for next week. Thank you for listening to the episode. Uh, always fun to record these. As always, you can check out the links to the stories in the episode notes, and you can go to lastweekin.ai to get the podcast email with the links as well. As always, we appreciate your comments, your reviews, your Twitter mentions, whatever you want to do.

Uh, and do be sure to tune in and do enjoy this outro song.

AI Singer

Text on the rise, we've got the latest hoop, open your eyes, from chat GPT searches who's so wise, to Apple's AI touching the skies, it's the last week in AI's swim through time, and like a swing beat, my digital rhyme, you'll pop up eyes, zero to sight, join the dance in the AI spotlight. This is episode one eight.

Speaker

The tech is in motion, being erased. The track is easy, defines what you

AI Singer

see. With a

Speaker

pleasant

AI Singer

cold, well, take a deep breath. Apple's day, I see, I'm bright. No diploma, day or night. High school's spinning, circuit come. In the rhythm, here we come.

Speaker

Elemish Too happy with someone he hates Spend weeks in emotion Being graced Objectively seems Defines what you see Oh yes! Bless us B Academy Well,

AI Singer

Search us all with ease and grace For today I knew there's a noble place Chat GPT, our trusty friend Unfolding tales that never end Once again, our vision is so clear N. A. I. s embracing conquer fear Apple shots past their dreams of spying Enjoying sound of the music gun In the realm of lights and codes we thrive Episode 1 in 8, N. A. I. s alive Check GPT, search lies underway Activity's lost its way Last week in A. I., feel the groove I don't shoot, it's make you move.

Double drama, eyes, nose, light. Let your sweet vibes take in flight. Ba da da, sex, a delight.
