
#174 - Odyssey Text-to-Video, Groq LLM Engine, OpenAI Security Issues

Jul 17, 2024 · 2 hr 4 min · Ep. 213

Episode description

Our 174th episode with a summary and discussion of last week's big AI news!

With hosts Andrey Kurenkov (https://twitter.com/andrey_kurenkov) and Jeremie Harris (https://twitter.com/jeremiecharris)

In this episode of Last Week in AI, we delve into the latest advancements and challenges in the AI industry, highlighting new features from Figma and Quora, regulatory pressures on OpenAI, and significant investments in AI infrastructure. Key topics include AMD's acquisition of Silo AI, Elon Musk's GPU cluster plans for XAI, unique AI model training methods, and the nuances of AI copying and memory constraints. We discuss developments in AI's visual perception, real-time knowledge updates, and the need for transparency and regulation in AI content labeling and licensing.

See full episode notes here.

Check out our text newsletter and comment on the podcast at https://lastweekin.ai/

If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form.

Email us your questions and feedback at [email protected] and/or [email protected]

 

Timestamps + links:

Transcript

AI Singer

Tune in, tune in, when the AI news begins, begins. It's time to break down AI.

Andrey

Hello and welcome to the latest episode of Last Week in AI, where you can hear us chat about what's going on with AI. I hope you enjoyed that AI-generated intro song, which I now use as the intro for every episode, always a new one. And now you unfortunately have to listen to our voices instead of that, but hopefully you are up for that. As usual in this episode, we will summarize and discuss some of last week's most interesting AI news.

And if you want even more AI news in your email inbox, you can go to lastweekin.ai for articles we will not cover in this episode. I am one of your hosts, Andrey Kurenkov. For some background, I finished a PhD where I studied AI at Stanford last year, and I now work at a generative AI startup.

Jeremie

And I'm your other host, Jeremie Harris. I'm one of the co-founders of Gladstone AI, this AI national security company. And, um, yeah, well, this is one of those weeks. I think, so we're recording this on a Sunday. Normally we do it on a Friday, hence the Last Week in AI. We've got a couple extra days under our belt and it's like, the worst of all possible weeks for that because there's so much stuff.

So we're gonna, we're gonna have to power through an awful lot of this, but um, yeah, what a, what a busy, insane, crazy weekend. I just got back from a trip too. So it's like, I did all my research on the flight on the way back. And uh, anyway, so I'm, I'm a bit groggy, so I'll be even less coherent than usual. So enjoy that everybody who's listening.

Andrey

Yeah. And, uh, it's one of those weeks where the news is not dominated by OpenAI or Google or Microsoft so much. So we're going to get a little bit more variety than we've seen. Uh, before we do dive into the news as usual, I want to acknowledge some comments and reviews. We got some really good comments coming in on YouTube now, which I always like to take a look at.

Jeremie

They do always say the YouTube comments section is the highest quality source of commentary, and it's actually really good in our case. That's the funny thing.

Andrey

Yeah, I think, you know, there's some really good stuff. There's some less good stuff in our case. Uh, we got some really good stuff. We got one comment that was, uh, very detailed and basically gave feedback that, you know, they love the podcast, but at the same time, it does feel like every single China-related story we have kind of winds up commenting on geopolitics a little bit, and, uh, kind of the nature of

Jeremie

Chinese AI these days, but, uh,

Andrey

it's true. We do want to cover that aspect of it, but maybe when we have research papers, uh, we will try not to like always mention it, uh, while we will cover it in many respects. I think this is a reflection

Jeremie

of, of just sort of like. you know, my, my work and, and where I tend to focus the most, you know, I think it's a, it's a good comment and, and something that, uh, I'll, I'll absolutely update on. But, uh, yeah, it, I think it's just that, you know, you get in that mindset and that's, that's how you see the world. So anyway, I appreciate that comment. I think that's in particular directed at me if I have that right, which, so I, I especially appreciate that.

Um, yeah, we'll, we'll take that to heart and, uh, give you less, less geopolitics with your side of. Uh, Chinese AI research, but, uh, again, when it's relevant, we will, of course, uh, be discussing it. So

Andrey

that's right. And we did get a bunch of new reviews on Apple Podcasts, which is always fun to see, some comments about maybe touching on AI products and how models are playing into it. We try and cover that in the tools and apps section mostly. And there's so much going on that we usually don't get to a lot of these things, but I do keep an eye out for things that people are interested in or kind of hyped about.

There was one review, uh, that was a three stars one and commented on the bias against mothers of four. Apparently, Jeremy, you mentioned that at some point, which I don't

Jeremie

remember. I don't, I don't recall this, but, but let the record show that I love mothers and I specifically love mothers of four. I have a mother. First of all, and so I'm, I'm actually very strongly biased in favor of them. Uh, mothers of four in particular. I'm a big fan. Mothers of three. I'm okay with mothers of three, but mothers four, man. Uh, yeah, no big fan. I'm not sure what this is getting at, but, uh,

Andrey

yeah, it's, uh, we, we try and riff a little bit, do some comedy. So if we do say anything mean, I think we are nice guys or try to,

Jeremie

and look, I mean, the thing is, if. you always have the option to have another child too, right? So if it is a mothers of four thing, I'm just, I'm not trying to be, it's just, if it's a mothers of four issue, you can, you can, you can get out of that category. That's an option, but, but you don't have to, I love mothers of four, but I'm just saying this is, this is all, this is not Andre's position. This is just my position. I'm just expressing the option space here.

Andrey

No bias against mothers of four on this podcast. Uh, and I'll just mention one more. We had a couple of, uh, comments about us not holding back on AI safety. So nice to see that, I guess most people don't mind it. Uh, one comment asked for an Audible version and, uh, I did go ahead and add that, I think, so hopefully it is on there now for the listener who wanted it.

And finally, uh, Jeremy, I don't know if you remember last week you had a real struggle remembering a company that started with the letter V. Oh, yes! And someone on YouTube commented that it might have been VZOnyx. Okay, so,

Jeremie

uh, I, I, I've got to call this out. I don't normally check YouTube videos all the time. But obviously I was traveling, I was like, huh, wonder what the, you know, what the YouTube space is like these days. And I saw Alien Forensics. I saw you out there. I see you, I see you and I appreciate you. It's not Physiotics. It's not Vizionics. But the fact that you brought that up again made it so I couldn't get it out of my head. And it's Vicarious. I was thinking of Vicarious.

Do you remember Vicarious? It used to be a thing, right? Yeah. So anyway, uh, I guess if you go back to last week's episode, you look at the part where I'm just going nuts trying to think of this company's name, just sub in Vicarious. Hopefully it'll make more sense. But thank you, Alien Forensics, for your wonderful YouTube comment and to all the other YouTube commenters and other commenters out there. Super appreciate it.

Andrey

Definitely. Well, enough of that. Let's get into the news, which is of course what people presumably come here for. So starting with tools and apps, our first story is Odyssey building Hollywood-grade AI text-to-video models. So this is a new AI startup that was just announced last week. They have raised apparently $9 million in seed round funding led by Google Ventures, and they

pitch themselves as offering this Hollywood-grade video with generative models that allow full control of the core layers of visual storytelling. So they say that the four models they're creating will generate geometry, materials, lighting, and motion. That's unlike typical AI-generated video: if you have just one model that generates the entire video, then you can't control some of these elements that you really want, like lighting and, uh, I suppose how things move, et cetera.

So yeah, interesting to see this whole trend we've seen this entire year, text-to-video continuing to advance at a rapid pace, more, uh, startups entering the fray, and I guess trying to find that product-market fit to make this actually useful as opposed to a very impressive demonstration of what AI

Jeremie

can do. Yeah, this is also one of those cases where it makes me think of a lot of the techniques from AI interpretability, um, and AI alignment that are often discussed in the context of safety, but here you can really see why they're important from a commercial standpoint too, right? Exerting fine-grained control over AI systems so that you don't have to have separate purpose-built models, because you need that level of fine-grained control.

If we could build a monolithic model, like one model that handles all this stuff together. First of all, that's the ideal, right? Because you really want all this done, if you will, under one roof so that it can be done coherently. So all the physics, the lighting, all that stuff gets done by one model in a way that, you know, kind of all the cross-dependencies between those factors can be accounted for.

Um, this is going to be handled in a more ad hoc way here, but if you could have finer-grained control over those monolithic models, you know, then maybe you could actually, you know, productize them in, in these sorts of applications. Um, that's anyway, sort of one, one stray thought there, but there's a GIF that they actually share, um, in the article showing this, like this rhino. And it's not a, not a video.

They just sort of start with this still image, but they show you sort of what the rhino looks like under different lighting conditions. And, you know, I mean, the, the physics seems captured here in some really interesting ways. It's high resolution. It's, it's very compelling. So, you know, it looks, it looks interesting. We'll have to see what it looks like, obviously, once it's, once it's out and, uh, and actually in user's hands.

Andrey

Yeah, I think, uh, it's the case, like you might forget if you only care about AI, but computer graphics and special effects are a core element of movie making these days.

And I think this is potentially taking some inspiration or some knowledge out of that, how people already integrate that into their workflows. Presumably, in the long run, these companies and professionals who take care of visual effects in movie making will integrate these tools into their workflows, as opposed to just like AI takes care of it. Next up, Anthropic's Claude adds a prompt playground to quickly improve your AI apps. So yet another new feature coming out of Anthropic and Claude.

This time, it is a feature that allows developers to generate, test, and evaluate prompts, using the prompt engineering techniques we often cover to improve Claude's responses for specialized tasks. So, there's an Anthropic Console for developers, and there's now an Evaluate tab, which is going to be this testing ground that can be used to evaluate your prompts. And then there's also this prompt generator.

That gives you the ability to enter a short task description, and then it will construct these longer, more detailed prompts. So there you go. I think, uh, this is making the use of Claude a little more, I guess, uh, formal and, uh, making it easier for people who leverage it to get the most out of it.
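(For a rough sense of the workflow being described, here is a minimal sketch using the Anthropic Python SDK rather than the web console: draft a detailed prompt from a short task description, then try the candidate on a couple of sample inputs, the way the Evaluate tab would. The model name, meta-prompt wording, and test data are assumptions for illustration only.)

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

task = "Summarize customer support emails into a priority label and a one-sentence summary."

# Step 1: ask the model to draft a longer, more detailed prompt from the short
# task description, roughly what the console's prompt generator does.
draft = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed model name; swap in whatever is current
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Write a detailed, reusable prompt for this task:\n{task}",
    }],
)
candidate_prompt = draft.content[0].text

# Step 2: try the candidate prompt on a couple of sample inputs, as the Evaluate tab would.
test_emails = [
    "My order arrived broken and I need a refund today.",
    "Just wanted to say thanks, great service!",
]
for email in test_emails:
    result = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=256,
        messages=[{"role": "user", "content": f"{candidate_prompt}\n\nEmail:\n{email}"}],
    )
    print(result.content[0].text)
```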

Jeremie

Yeah, it's also just a really interesting data set for Anthropic to be collecting, right? Because essentially what they're getting is, you know, your, say, plain-language description of what you want done. And then, you know, they'll generate a bunch of prompt options. You'll use them. And presumably if you're satisfied with a particular prompt, then they can kind of correlate it to, okay, well, then we know what you were trying to achieve.

We know the prompt that actually ended up working, and that might, that might go a long way towards helping with fine-tuning, like instruction fine-tuning, and making sure these models don't even need that extra step. Ultimately, you know, that's a big part of this kind of next leap, where you really do want to make the job of prompt engineer, maybe not disappear, but you want to reduce the burden on people to kind of come up with good prompts.

This is a, it's a systemic problem in the field, right? Because it's not just that it's hard to get good at writing good prompts. It's that you constantly have new models with new fine-tuning techniques. You know, like one model might be trained with like constitutional AI and behave a certain way. Another model might be trained with PPO or RLHF and behave a certain different way. And they all have different quirks. And so you've got to kind of spend some time ramping up with every model.

And that's just a fixed cost every time you switch platforms. And so, you know, what you want ideally, in order to fully democratize this stuff, you want to kind of reduce that burden as much as you can. And I think this is a really interesting dataset that Anthropic's going to have on its hands to help bridge that gap.

I wouldn't be surprised if we see some work come out of Anthropic that's focused more and more on leveraging this kind of data just to make it sort of the zero-shot version of this go really smoothly. And, uh, anyway, I think it's, uh, yeah, interesting, an interesting move, and not the first time we've seen, even in the last month, on two or three occasions, Anthropic coming out with these.

I don't know what it's called, like user experience modifications, like user interface, user experience design choices that are just different on their platform, um, where they're creating value, not just through the quality of Claude 3 or Claude 3.5, but through the, the, the quality of the user interface, user experience. And so I think this is another great bit of low-hanging fruit, a great bit of innovation for Anthropic.

Andrey

That's right. Yeah, I think it's interesting to see the differentiation where OpenAI seems to be doing more to sort of partner with enterprise customers and pushing more on the model front with things like GPT-4o. Anthropic is much more aggressive on these sorts of features and user experience things. This one is an example, right?

This is already something that as an AI engineer, you are probably doing in some sort of, uh, potentially ad hoc fashion, or you might be using something like Weights & Biases, which has something similar built in. So it makes a lot of sense for them to build it into the tooling of

Jeremie

their platform. Yeah. And, and, you know, this in a context too, where if you don't play a lot with prompt engineering, you might not realize how far it can go. Um, you know, good prompt engineering really can elevate, it can make things possible that were impossible before. And, you know, so, so this is very much, it's not just about making better models, but like, if your model on its own with no support

is a, you know, $1 billion market value model, it could create, let's say, a billion dollars in value, roughly speaking, in the economy. Um, but, but it's actually not being tapped for nearly that much value because people don't know how to prompt it right. Like you can really, you know, you could really reduce the effective value of your model just by making it harder to interact with.

So I think this is anyway, a great, uh, a great way to mine value in language models in a more scalable and automated way.

Andrey

Onto the lightning round, first up we have Figma pauses its new AI feature after Apple controversy. So this feature is the prompt-to-design AI generator that was called Make Design, and the controversy was that it seemed like it generated designs that were very similar to existing apps, in particular Apple's iOS Weather app. So, uh, there was this user, Andy Allen, who tried the prompt weather app, and the tool created three results that looked identical to the Weather app from Apple.

And, uh, yeah, there you go. So it seems like this tool, uh, might've been trained on a lot of data from existing apps, and that might lead to legal trouble. And now it is

Jeremie

paused. Yeah. Which also kind of makes sense, right? Like how, how would you design this tool, right? Yeah. You would, you'd probably need to train it on actual apps. And so this is what we have. Um, yeah, this all came out on Twitter, um, or X I should say, uh, where there's a user who came out and basically, yeah, I made this accusation. There was a response from Figma that said, um, look, we, we use, uh, so this is their, their defense here is like, look, Uh, this isn't what happened.

We didn't just kind of train it on these apps. They say instead that Make Design, this tool, uses off-the-shelf language models combined with, quote, systems we commissioned to be used by these models. And that's it. I don't know what that means. I have no idea what that means. It kind of sounds like none of these denials seem to have much meat on the bone. I may be missing parts of it, but I did look and wasn't able to find much.

Um, he added that the problem with this approach is that the variability is too low. So, so the variability, presumably this means the variability in the outputs. And if that's the case, this makes me think, okay, so are you kind of admitting that it was overfitting? And if it's overfitting, then doesn't that make it more likely and not less that the designs we're seeing were in fact Apple copies, as they appeared to be? So, yeah. I'm not too sure that this helps the case.

I'm not sure exactly what these, uh, arguments really mean, but it seems a little confusing. Um, apparently, you know, they, they added within hours of seeing the, the tweet that called them out on this, they identified the issue, which was related to the underlying design systems that were created. Ultimately, it is my fault, says the CEO, for not insisting on a better QA process for this work and pushing our team hard to hit a deadline for config.

I don't know what any of this means, um, but, uh, hopefully they, you know, find a process that's less at risk of getting called out. I don't know what the status is of training on apps. I think this is a really interesting legal question, kind of in its own right, but uh, anyway, there you have it.

Andrey

Yeah, and this is a pattern we've seen over and over where, uh, you know, a company releases a feature, and then often within hours or days, we see, uh, users on X or Twitter, uh, point out different issues. And in some cases, you know, there's even been the case of Google having to revert a feature. So I guess in some sense, good that you could crowdsource your QA. Next, Quora's Poe now lets users create and share web apps.

So Quora has this AI powered chatbot aggregator, which is kind of an interface to use various chatbots. And they now have this feature called Previews that allows users to create interactive apps directly within the chatbot conversation. So these can include data visualizations, games, and other applications. And they have support for HTML output with CSS and JavaScript.

So basically you can launch like a mini app and share it with other people via link, very much similar to the Artifacts feature you saw from Anthropic.

Jeremie

Yeah. I think this is actually a really interesting play, right? It's, it's like the Anthropic, um, uh, artifacts launch, but it obviously allows you to interact with a whole bunch of different chatbots. And this is really cool. This aggregation function is really important in the space, especially because we're increasingly moving to a world where an awful lot of these models are commoditized, right?

Like, in other words, you know, you can kind of use one or the other, you know, there's GPT-4 and, uh, you know, Claude 3.5 Sonnet. These are both really good at writing, you know, code and building apps and all that. Um, it's not to say that there aren't clear leaders at any given time in any specific issue, but often we don't have the time to figure out which is which.

And having an integrated platform where you can just quickly test them out in this way, in a very informal manner, uh, I think that's a lot of value right there. So yeah, interesting, uh, interesting play by Quora, which is not, you know, you don't normally think of Quora as a, An AI company. And in fact, Quora's base business model, you know, a lot like Reddit, a lot like Stack Overflow, for example, like just a forum.

Um, you might think of as being at risk of, uh, of being disrupted by a lot of these, these AI developments, chatbots in particular for their, the question answering function. So now, you know, we have this interesting play with Poe. As a, as a way of sort of, um, selling the shovels instead in this industry and, and having people access these chatbots, these language models through the platform. So kind of interesting.

And yeah, we'll, we'll see if it, uh, if it takes off, but, uh, 20 bucks a month is the price for the premium plan on Poe. Uh, we've heard that price point a lot before, right? So that's the, uh, your standard-issue AI chatbot price point, uh, rough order of magnitude. So, so there it is.

Andrey

And the last story for this section, Suno launches iPhone app. Suno is, of course, one of the leading text-to-music startups that has had a web version for a while, and you can pay for Pro and Premier plans. And now they're launching this iOS version of the product in the US, basically equivalent to the web version, that's pretty much it. I think it makes a lot of sense.

Uh, one of the use cases of these kinds of things is you have like a funny idea for a song and you go ahead and just enter a prompt and see what happens. So, uh, it makes sense for you to be able to do that on the go, whenever you are struck with an idea. Onto applications and business. First, we have Groq, Groq, not Grok from xAI, unveils a lightning-fast LLM engine. So we've seen Groq before, uh, publicize their ability to do really, really, really, really fast LLM

inference, and now they have announced this LLM engine that allows users to type or speak queries, and Groq responds at a speed of around 1,256 tokens per second, which is up from the 800 tokens per second they demonstrated in April.

This is, for example, on Llama 3, uh, some of the open-source models, and for reference, usually on things like GPT-4 you get around maybe 200 tokens per second in my experience, 150 to 200, so a lot faster than what you get off the shelf.
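(To make those rates concrete, a quick back-of-envelope on what they mean for a single response; the rates are the figures quoted above and the 500-token response length is an arbitrary example.)

```python
# Back-of-envelope: what the quoted token rates mean for one response.
response_tokens = 500
rates = {"Groq (latest claim)": 1256, "Groq (April demo)": 800, "typical GPT-4 serving": 175}
for name, tokens_per_sec in rates.items():
    print(f"{name:>22}: {response_tokens / tokens_per_sec:.2f} s for {response_tokens} tokens")
```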

Jeremie

Yeah, Groq is such an interesting company. You know, we, we have talked about them before, as you mentioned. Um, a couple of the, the kind of big developments here: yes, this is, you know, we're hitting 1,200 tokens per second. That's up from about 800 tokens per second before, and that was back in April. Um, now, so a couple of the big things.

So first of all, Groq is offering a console for developers that allows them to port over their apps from basically OpenAI to Groq in seconds using very simple steps. Um, this is all part of Groq's increasing focus on enterprise. So they're trying to get people to switch over from, uh, from OpenAI. And to the extent that they're now at over 280,000 developers in their community, you know, maybe that's, that's part of what's been working. So very impressive, um, lightning-fast inference.
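(The "port over in seconds" claim rests on Groq exposing an OpenAI-compatible API, so switching is roughly a base URL and API key change. A minimal sketch below; the endpoint URL and model id are assumptions based on Groq's public documentation, so check the current docs before relying on them.)

```python
from openai import OpenAI

# Point the standard OpenAI client at Groq's OpenAI-compatible endpoint.
# The base URL and model id below are assumptions, not guaranteed values.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",
)

completion = client.chat.completions.create(
    model="llama3-8b-8192",  # assumed Groq-hosted model id around the time of the episode
    messages=[{"role": "user", "content": "In one sentence, why does inference speed matter?"}],
)
print(completion.choices[0].message.content)
```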

This is not a training chip. This is just inference. We've seen a couple, right? So, uh, this, uh, idea of making inference-only chips. Um, in this case, this is an LPU, right? A language processing unit. It is custom built specifically for, um, for, uh, language models. And we've seen that before; we saw the Sohu chip. I think that was, if not last episode, the episode before.

So a couple of weeks ago, um, a lot of these, these, uh, chips focused specifically on inference and not training and specifically on transformers and language models more broadly. So I think that's a really interesting trend. You know, this reflects, we've been saying this on the podcast for feels like almost two years now, but this idea that. Uh, there's a trade off. You can choose as a developer.

Am I going to invest my compute during training time to, you know, train the base model, or at inference time in the form of, you know, kind of agentized designs, querying the model multiple times to get a single answer, investing more computing power at inference time rather than at training time. So this reflects a belief, a structural belief by these companies that that is where things are going.

This is why you would make an inference-only chip that works specifically for language models, for transformers and so on. And uh, yeah, this is, anyway, this is just a really interesting company. The key to this, um, this particular chip, the LPU, seems to be what they're essentially building in: so there's this thing called SRAM, and it's, it's a memory that sits right next to the logic on the chip, and SRAM is really great.

It's super fast because, well, because it's right next to the logic on the chip. So rather than having to send your data to some memory bank, you know, through high-bandwidth memory or something, you know, somewhere else, um, you're able to keep everything really local, really tight, so you can do things very quickly. But one of the challenges is SRAM can be very limited, which means these Groq chips, they're super fast, but they have really, really small memory.

So you need a whole ton of them to run even one Llama 7B or Mistral 7B type model. Um, in this case, if you're, if you're curious, if you wanted to serve up the Mixtral model, the, um, uh, I think it was the 8x7B mixture-of-experts model, uh, you need 576 chips to do that, right? 576 chips versus a single NVIDIA H100 chip. That's kind of the trade-off. And there are all kinds of economic questions that come from that.
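(A rough sketch of why the chip count lands in the hundreds. The ~230 MB of on-chip SRAM per LPU and fp16 weights are assumed figures, and this ignores the KV cache, activations, and any replication for throughput, so treat it as order-of-magnitude only.)

```python
# Why serving Mixtral on SRAM-only chips takes hundreds of them.
mixtral_params = 46.7e9      # approximate total parameters in Mixtral 8x7B
bytes_per_param = 2          # fp16
sram_per_chip_bytes = 230e6  # assumed SRAM per LPU

weight_bytes = mixtral_params * bytes_per_param
chips_for_weights = weight_bytes / sram_per_chip_bytes
print(f"Weights alone: {weight_bytes / 1e9:.0f} GB -> ~{chips_for_weights:.0f} chips")
# Roughly 400 chips just to hold the weights, so the quoted 576 (with headroom
# for everything else) is the right order of magnitude.
```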

We talked about those in the previous episode. So go check that out if you're interested, but yeah, Groq's a really interesting company and they're continuing to kind of push their advantage on blazingly fast inference. For them, the question is going to be: can we get this at scale? Can we get enough of a user base that we can amortize the costs across a massive number of users and then make it essentially cash-flow positive?

This is a big indication, if they're hitting 280,000 developers, that they are on their way to do that. And that's a very exciting prospect if you're a Groq investor. Uh, first of all, if you're a Groq investor, good for you. Uh, so anyway, that's, uh, that's, I guess, the story and, uh, congrats to, to Groq on a very impressive inference play here.

Andrey

For sure. And yeah, some very interesting questions that arise from this and some things to note, as you said, one of the challenges with this kind of technology is memory usage. And it's, uh, to me, it seems like, uh, you probably cannot extend it to training just by virtue of training taking up a lot more memory inherently. And, uh, we've also seen new model types, with all this, uh, exploration of recurrent models that use a much smaller memory footprint.

Interesting to see if that comes into play also at some point. And next story, Microsoft and Apple ditch OpenAI board seats amid regulatory scrutiny. So we just covered, I think last week, that Apple was joining the board with, uh, an observer role, so just observing, which is also what Microsoft has. Well, now both of those are leaving the board. And as per the title of the story, that's pretty much it.

Pretty much because of antitrust concerns, with particularly the UK and EU being concerned about it, but also the FTC investigating the investments by Microsoft and others into Anthropic. So it makes a lot of sense that, uh, given OpenAI now being closely partnered with Microsoft and Apple, they want to put a little bit of distance between themselves and open

Jeremie

AI. Yeah. Yeah. And amazing that this is just eight months after the non-voting seat, the seat that Microsoft had on the board, was secured in the first place. So this is very rapid turnaround. Um, and it also obviously was partly triggered, actually, these investigations were largely triggered, by the, um, OpenAI Board of Directors drama, right, where Sam Altman got ousted.

And one really important ingredient, especially for this aspect of the story, is what happened after Sam was fired, which was that, you know, Microsoft's, uh, CEO Satya Nadella offered Sam Altman a, uh, position at Microsoft to head up a big AI research group there. Um, you had all the, uh, employees at OpenAI signing this open letter saying, hey, like, you know, bring Sam Altman back. And basically Microsoft was able to completely

undercut the credibility of the, of the board in that process by saying, look, um, you can, you can keep Sam Altman fired; we're just going to bring him over to work for us, and all the employees will follow. And so that effectively made it so that, you might say, Microsoft exerted a de facto level of extremely high control over OpenAI, despite the fact that they theoretically only own 49 percent of the for-profit, um, uh, part of the company.

And despite the fact that they didn't even have a voting seat on the board, uh, this is still a de facto, uh, position of, of tremendous leverage. So, so, you know, in that context, you can see regulators kind of like going, Hmm, is there an antitrust thing going on? Are they really for all intents and purposes, separate and competing companies? And, uh, yeah, I guess that that story is yet to unfold, but, uh, the investigations are proceeding at pace.

Andrey

After the lightning round, one more story about OpenAI, and that is that OpenAI and Arianna Huffington are working together on an AI health coach. And this is through a newly funded venture called Thrive AI Health that is aiming to do this like ultra-personalized, uh, AI chatbot that will guide you through making healthier decisions and, uh, provide insights through your health metrics.

Uh, this is led by, uh, former Google executives and has already established research partnerships with various organizations like Stanford Medicine. So, yeah. It, uh, seems like maybe we'll get our own personalized little AI doctors in the future. I guess we'll just have to see.

Jeremie

Yeah, there's, there's obviously a huge amount of value to be created here, right? Like the idea of customized tailored medicine, it's been around for a long time. I guess this is the big play. And it's, it's interesting that Sam Altman has associated himself so closely to this particular launch. Like this is obviously a high priority for open AI.

You can kind of see why among other reasons, you know, you've got the obvious product value here, but, but also, hey, it's a hell of a marketing announcement after, um, months of OpenAI being dragged through the streets with drama after drama. So kind of a nice palate cleanser, hopefully, for them. Um, you know, there's this, uh, kind of brief quote, so, so, um, Arianna Huffington and Sam Altman co-wrote this op-ed in Time Magazine and, and, you know, we're flagging that context.

Time has signed an agreement with OpenAI to, um, to, to do this sort of licensing of their content and, and kind of back-and-forth co-op on that. So, you know, that's, that's, I thought it was interesting to note. In this Time op-ed, um, she came out and said, you know, consider what it's like to be a busy professional with diabetes. You might be struggling to manage your blood sugar levels, often missing meals and exercise due to a hectic schedule.

A personalized AI health coach trained on your medical data and daily routines could provide timely reminders to take your medicine, medication, suggest quick and healthy meal options, blah, blah, blah, blah. I think that's actually really cool. One of the big challenges in medicine, if you know anybody in like health tech or doing like medicine startups is data collection. Right. And to the extent that they're able, and this is going to be an open question for Thrive AI Health, right?

Like, how are you going to set up the privacy and security around this? They say they're going to have robust privacy and security guardrails. We'll get into it. But after we've heard about OpenAI and the calls they've been making on the security side over the last year, Uh, you know, I guess up to us how far we, we take that, uh, that comment, but hopefully they do set it up.

Cause I think this would be an incredibly useful application, hugely, hugely positive, and would love to see this work out. The data collection piece is so, so important, right? Our bodies are so high-dimensional, our interactions with the world are so high-dimensional, that, you know, sometimes you get sick, you don't know what it is. You get allergies, you can't triangulate the source, you have a reaction to a medication or whatever. And so many things are going on.

And this sort of thing where you have a more comprehensive picture of what the health life of an individual looks like, that's kind of interesting. You know, I, I wouldn't be surprised if we gain some pretty important new insights, like medical insights, uh, from this sort of activity. So whether it's Thrive AI Health or something else, we'd love to see, you know, something like this succeed.

Andrey

Right. Yeah. And it'll be interesting to see if it can succeed as a health coach. Not only will it have to provide insights, it will have to be able to guilt trip you or, you know, maybe not guilt trip, but Duolingo, Andre? Exactly. It will need to, uh, make sure you want, or at least, uh, are incentivized to make those healthy decisions. And yeah, that's still an open question, whether AI can do that. But, uh, certainly OpenAI is in a place to try. And next, we have Exclusive: AI coding

startup Magic seeks $1.5 billion valuation in a new funding round. The startup Magic is developing AI models to write software, and they are in talks to raise over $200 million in funding that would value the company at $1.5 billion. Um, and this is coming from a company that has no revenue or product for sale. The startup has about 20 employees and was last valued at $500 million after a fundraising round in February. And they have already raised $140 million since being founded in 2022.

So certainly you're still seeing some massive rounds of the sort we've seen with Stability and many other companies in the past couple of years, although I do think maybe there's not quite as many of those, and coding seems to be one of the areas where there's still a lot of money, money flying

Jeremie

around. Yeah. Part of the claim of, as they say, the magic behind Magic is that they're going beyond, they say, just beyond the traditional transformer model, whatever that means, right? So I don't know if that means that they are partly based on the transformer and maybe there's like a Mamba tie-in or maybe, you know, it's something completely different in part or in whole, but it's kind of an interesting little note. They don't give much information about that.

Um, this is a big fundraising round, you know, $200 million on $1.5B, like that's a lot of money. Uh, the investors are no joke. We've got Jane Street on the cap table. We've got Daniel Gross on the cap table. You might remember him, um, as, uh, as having participated in, um, uh, Ilya Sutskever's latest company. He's, he's a co-founder with Ilya Sutskever in, uh, um, what is it, Safe Superintelligence, SSI.

Uh, so yeah, so there you go, Daniel Gross, DG, a really impressive investor, Nat Friedman, of course, as well. So yeah, it's a, it's a hell of a round, um, and it's not hard to defend these kinds of absurd, uh, valuations. I shouldn't call it absurd, but absurd sounding valuations. When you look at just like the, the, the value of software, right? The value of software development.

If you're making a half credible case to investors that you're going to be able to automate or own some chunk of software, like, that's a lot, right? That's a lot. We've seen proof of existence for the fact that these kinds of models can generate value through, you know, GitHub Copilot, uh, and, and other similar tools. And so now, yeah, you know, it, it kind of makes sense. The other thing I'll note is this is a startup with 20 employees and its last valuation was $500 million.

That was back in February. So you have 20 employees and, potentially, um, you're looking at, you know, a $1.5 billion valuation. Um, if I told you five years ago that we were going to have companies with 20 employees at $1.5 billion valuations, like you would have been liable to laugh me out of the room, right? This is, this is just nuts.

Um, but it is consistent with, you know, when I talk to, to friends at the frontier labs and stuff, they're like, yeah, we're, you know, we're talking about, we're talking about When are we going to have the first billion dollar company that's got one employee? Like that is a, an actual, in some cases, it's an actual sort of like milestone on their, on their path to AGI. Um, because of what it means about how much you can, you can automate and scale.

So I think this is just a really interesting data point. They're really interesting investors in a bloody big round.

Andrey

For sure. Speaking of VCs, the next story is Sequoia and Andreessen Horowitz clash over AI chip supplies amid the Gen AI boom. So recently, Andreessen Horowitz, one of the very big VCs that invests in technology in Silicon Valley, announced that they are stockpiling over 20,000 GPUs to support their portfolio companies. Uh, they have invested in at least 19 generative AI deals with a value of, uh, about $1.3 billion over the last two years.

And they have raised two dedicated AI funds for this sort of stuff. So part of the need for the stockpiling of GPUs is basically to get a competitive advantage in getting companies into their portfolio. And once again, it's one of these kind of weird things that has come about with AI, and one thing you wouldn't expect VCs to go

Jeremie

for. Uh, yeah, though, I will say, you know, in the world of startups, there are VCs and there are VCs, right? And the difference between your absolute top tier, your Andreessen Horowitzes, your Sequoia Capitals, um, and everybody else is often, uh, well, really good capital allocation. Obviously they're just really good at betting on good companies, but they also do tend to offer value-added services.

And there's this old joke in Silicon Valley that a VC will, will, you know, tell, tell you that they're a value-added investor. And it's kind of like completely meaningless. Um, in this case though, for the top tier ones, they really are value-added, not just for the networking value, but for things like this, they'll look at where the market is headed and they will anticipate that need. And on behalf of their companies, they will make acquisitions like these 20,000 GPUs.

Um, this is also, you know, it's not the first time that we've seen GPUs used in this kind of a weird semi-financialized way. We talked about this last August, but CoreWeave, which is this cloud service company, uh, ended up raising over $2 billion in debt using H100 GPUs as collateral. So this is like the financialization, you know, in a way you can think of this as compute becoming actually kind of like the currency of some of these AI investments.

Like your, your investments are, you know, denominated in a weird way, partially in flops. Like you can start to think of it that way, uh, which is interesting from a macro standpoint. It sounds awful, like awfully science fiction y. Um, but yeah, Andreessen basically doubling down on this thesis that generative AI is going to keep getting better, keep becoming a bigger and bigger story.

Um, they famously were kind of late to the AI game, uh, cause they were doing all their investing in crypto. And in fact, when I was at Y Combinator, that was the big crypto batch, and a lot of my, uh, batch mates, you know, ended up raising from a16z on these crypto plays. Now they're focusing on AI. I think this is personally just a, a much better play, but that's my bias. Uh, on the flip side, Sequoia, Sequoia came out with a piece basically saying

kind of not quite the opposite, but a lot of skepticism. And when we think about two of the best VC firms in Silicon Valley, it's absolutely Sequoia Capital and Andreessen Horowitz. You know, when you look at Sand Hill Road, the stretch of road that the VCs are all on in the Valley, those are the two big names. Sequoia is making this argument that, look, you know, outside of ChatGPT, how many AI products are consumers really using today?

And so they write this post where they make a whole bunch of arguments saying like, look, look at how much value you get from Netflix and you're paying them 15, 16 bucks a month. Spotify, 12 bucks a month. We often use these services all the time. Um, Long term, they're saying AI companies will need to deliver a significant value to justify this continued spend.

And they're calling out specifically, and I think this is really important, this problem that they flag of the, like, missing $600 billion in this ecosystem. If you look at the amount that's been spent functionally on, uh, on essentially data center spend in the industry, it's about $600 billion that we're in the hole for. And we need to show enough value to, to justify it. There's gotta be enough consumer revenues, but where, like, where are the revenues at?

Well, OpenAI through ChatGPT has the lion's share, $3.4 billion annualized. That is a tiny, tiny figure compared to the $600 billion that's being spent on data centers here. There's, there's a long tail obviously, but it's an indication, and, and Sequoia would know this, there's an indication that, you know, the, the repayment is just not there right now.
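(The gap being described comes down to one division; the figures below are the ones quoted in the discussion, and the "years to cover" number ignores margins, revenue growth, and GPU depreciation, so it is only meant to show the order of magnitude.)

```python
# The gap Sequoia points at, in one division.
implied_revenue_hole = 600e9   # the "$600B question"
chatgpt_annualized = 3.4e9     # OpenAI's reported annualized revenue

share = chatgpt_annualized / implied_revenue_hole
print(f"ChatGPT's annualized revenue covers {share:.2%} of the hole per year")
print(f"Naive years to cover at that rate: {implied_revenue_hole / chatgpt_annualized:.0f}")
```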

And this is troublesome because a lot of people make the argument like, yeah, well, it's okay if the repayment hasn't come yet, uh, people are stockpiling GPUs because more GPUs is going to pay off later. And the challenge here is, among other things, you've just got depreciation to the values of these GPUs, right? The H100 ages out; it becomes the B100.

So it's not the case that by spending a hundred million dollars on acquiring H100s today, you're actually, like, holding onto a hundred million dollars in value. That value depreciates over time, and quite quickly, uh, with these GPUs. So a whole bunch of really interesting arguments, uh, about why we might expect, you know, the space to get increasingly commoditized.

We've talked about this actually on the podcast before, uh, in the context of open source companies that just pump out these, these models. And so much of this is being commoditized. You have to ask the question at a certain point, where is your moat? Where's the actual value generation and value capture? And both of these things are, are in play to a certain degree. Um, like don't get me wrong. I still think scaling is going to work.

Um, I think that, uh, perhaps the core argument here is in part, um, that, that there, there, there are going to be a small number of winners and then an awful lot of the money here that's being spent. is actually being wasted. It's mostly VC dollars as they're lighting their, you know, giant checks on fire. That's how Silicon Valley has worked historically, but it's still kind of useful to, to track that.

And it may or may not be true, but this is Sequoia, uh, coming out and making the kind of opposite case from what we've heard from Andreessen.

Andrey

That's right. I may have misspoken a bit in implying Sequoia is doing the same thing. In fact, the clash there in the title is referring to this blog post that challenged the idea that we should invest so much money in GPUs. And it was titled AI's $600B Question, and, yeah, as you said, it was sort of saying, we've invested so much money in GPUs, are we expecting to actually get all this money back in returns? And the answer is, maybe not.

And, uh, interesting analysis for sure from Sequoia. And speaking of GPU investments, we got a story from Elon Musk, who has revealed plans to make the most powerful 100,000 NVIDIA GPU AI cluster. So another announcement of what xAI is planning to do, apparently to keep training their Grok AI model. There were talks of xAI partnering with Oracle for their cluster. It seems that's probably not gonna happen.

And, uh, the system, which will be using NVIDIA chips, could cost as much as $9 billion. So it might cost less, we'll see. So, uh, certainly ambitious, and it'll be interesting to see if they can pull off getting this many NVIDIA GPUs.

Jeremie

Yeah, I was kind of trying to square away in my own mind what they're referring to, like which clusters are referring to what. So here Elon's talking about a hundred thousand, apparently it's H100 GPUs. That's what the article says anyway. Um, and to make, yeah, as he put it, the most powerful training cluster in the world. So there's this 100,000 H100 GPU training cluster.

Perhaps separately, they say, um, you know, there's an announcement that Elon made last month that they had plans to build this, uh, multi-billion dollar, this $9 billion system you just talked about. That would be apparently 300,000 B200 GPUs. This is kind of the next generation, um, on the, uh, on the GB200 platform. So I, I think that might be the idea. And if so, um, I think the breakdown, I could be wrong, I think the breakdown might be, you know, these 100,000 H100 GPUs.

First of all, they're going to be able to get them much faster because the H100 is already at scale production, whereas the B200 is not. So, you know, they're going to get those faster, they'll be able to train them quicker. And maybe that's just a training cluster, and maybe the 300,000 are for both training and inference. Uh, I, like, I'm, I'm not exactly sure and it's not clear from Elon's statements what, you know, the distinction is between those two in terms of use case.
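(For a rough sense of what a 100,000-GPU cluster means physically, a quick back-of-envelope on power draw; the ~700 W per H100 SXM and the 1.5x facility overhead factor are assumptions for illustration only.)

```python
# Rough power draw for a 100,000 H100 cluster.
num_gpus = 100_000
watts_per_gpu = 700   # assumed H100 SXM board power
overhead = 1.5        # assumed factor for CPUs, networking, cooling

gpu_mw = num_gpus * watts_per_gpu / 1e6
total_mw = gpu_mw * overhead
print(f"GPUs alone: ~{gpu_mw:.0f} MW; with facility overhead: ~{total_mw:.0f} MW")
```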

But yeah, certainly interesting as well that they are parting ways with Oracle. So this is, uh, Elon basically saying, hey, we want to bring this in-house. Apparently xAI is going to design, uh, these, um, uh, you know, these data centers themselves. Uh, which, uh, you know, brings them in line with what some pretty well-capitalized competitors are doing, right?

You think about data centers, like in-house data center design and AI training, you're looking at, you know, your, your Microsofts and your Googles, that sort of thing. So, uh, kind of interesting that this, in relative terms, very small company xAI, um, in market cap terms at least, is, uh, is going to be competing on that basis. They're going to have to hire a whole bunch of, uh, hardware experts. I'm sure, I'm sure they already have, but that'll be part of the play here for sure.

Andrey

Yeah, good, uh, catch on there with apparently multiple planned clusters and not just one. And, uh, yeah, it seems, uh, friendship ended with Larry Ellison and now Jensen Huang is my new best friend, here from Elon Musk. Yet another story on, uh, acquisitions. This one is AMD planning to acquire Silo AI in a $665 million deal. So this is a Finnish AI company, and AMD is going to buy this company, apparently to enhance their software expertise.

Uh, there are apparently a hundred PhDs and 300 employees coming from Silo AI to AMD. And AMD, of course, as we've covered, is trying to catch up with NVIDIA, trying to, uh, get their chips competitive and used more. So certainly a big investment on that front.

Jeremie

Um, Phillips, Rolls Royce, Unilever, a bunch of, uh, important investors joining the cap table there. Maybe, you know, less, uh, yeah. Uh, less, or sorry, I should say who are already on the cap table. So not necessarily you're like traditional AI investors, but certainly people who know their way around. Things like hardware and so that's kind of cool

Andrey

And next, AI robotics startup raises $300 million. This is the Pittsburgh-based AI robotics startup Skild AI. This is their Series A, and they are saying that they are developing general-purpose robotics models for low-cost robots used in various industries and applications. And the valuation is now $1.5 billion. Relatively new company. So they are not quite at the stage of Figure and 1X, uh, these other companies

that are developing humanoid robots, but certainly this deal showing that the enthusiasm for general purpose robotics companies seems to be still there as we've seen, uh, kind of this year

Jeremie

really being the case. Yeah, this is another $1.5 billion valuation fundraise, and it's a Series A. So, Jesus, I mean, Series A, a $300 million Series A, I'm old enough to remember when that, like, would have been absurd. Uh, you know, five, again, five years ago, this would have been nuts. Uh, no indication of how many employees they have right now, but, uh, this is, this is a very interesting play. Like they're coming out of stealth with this, so we don't know much about them.

What we do know is they've got a foundation model they've built. They claim it's unusually robust and meant to scale across a whole bunch of different robotics platforms. So the idea there, I'm just quoting here from their, their launch release: as opposed to vertically designed robots that are built for specific applications and only deployed in isolated or constrained environments, our model serves

as a general-purpose brain; it's demonstrated unparalleled generalization and emergent capabilities across a diverse set of embodiments, robotics scenarios, and tasks, including manipulation, locomotion, and navigation. That's, of course, the big problem in robotics, where you need to gather task-specific data. That's usually the big challenge.

We've seen attempts to kind of break out of that mold with, you know, Google, Google DeepMind comes to mind where, you know, their, their Gato system, they trained it to do a whole bunch of different tasks. Um, a subset of which were robotics, certainly dozens of robotics tasks. Um, but, but this is the claim that, hey, we're, we've got an unusually robust system. We don't know how it's being done. They're calling it a foundation model. Is it transformers? Is it something else? We don't know.

Um, but one thing to flag is that the caliber of the investors in this round is, is, is, um, stellar. Like, if you were looking for an indication based on investors that there was a company that was worth watching, this really is it. We've got Jeff Bezos, obviously through Bezos Expeditions, which is his fund, uh, other participants, Sequoia Capital, this is like Silicon Valley's basically number one VC, uh, Menlo Ventures, General Catalyst, um, SV Angel. Um, there's Carnegie Mellon.

There's the CMU on the, on the cap table for some reason, along with Lightspeed and SoftBank. It's like, it's the who's who. I don't see Andreessen on there. That's one big name. Um, you know, there aren't that many that are missing, let's say. This is really impressive. Uh, so I would just say, you know, if these investors are seeing something here... Yes, there's a hype cycle for sure. Uh, but $300 million is, uh, it's a decent amount of money.

It's not a huge amount for these guys too, but, uh, but still it's a big round, big valuation. And, and, uh, this is a team by the way, that does explicitly have that AGI goal in mind. They're making the argument as many have that if you want to achieve AGI, Language models are not enough. Um, just being purely into the world of bits is not enough. You need to move into the world of atoms. You need to have systems that interact physically with the world. And so that's the claim.

It's a, it's a team they say with former folks from meta, uh, Tesla, NVIDIA, Amazon, Google, and blah, blah, blah, top schools, you know, all that jazz. So basically if you're looking for an all you can eat buffet of impressive credentials, Uh, this is your startup. So let's, uh, let's see how they, how they roll.

Andrey

Yep. Looks like the founder and CEO is Deepak Pathak, who was an assistant professor at CMU doing research on computer vision and robotics. So maybe that explains the location of Pittsburgh for the founding and some of those details. It's a

Jeremie

very glamorous city of Pittsburgh. Yes.

Andrey

And last story for the section, we have Intel begins groundwork on Magdeburg chip fab. This is, uh, Fab 29 in Germany and is meant to be one of the largest and most advanced chip fabs in Europe. And the construction of it is beginning despite some of these, uh, environmental objections and subsidy approvals still pending from the EU.

So, uh, you know, as we know, uh, the ability to fabricate chips is a very important component of, uh, AI progress, and Intel is one of the relatively few companies in the world who can do advanced fabrication of this sort. Uh, so presumably this fab would be important to that.

Jeremie

Yeah. This is part of Intel's, um, hope that they're going to be able to leap ahead of, uh, companies like TSMC, which right now has complete market dominance at the, the leading-edge node, right? So the smallest, highest-resolution, um, fabrication process. So right now, you know, Taiwan Semiconductor Manufacturing Company just absolutely dominates that; Intel is trying to break into it. They have their, um, so their 14A, that stands for 14 angstrom, that's equivalent to 1.4 nanometers,

process that they want to start running apparently in late 2027. Um, and, uh, and their 10 angstrom, their one-nanometer production node. Uh, this is, by the way, uh, this is in fairness marketing terminology, but if it was really 10 angstroms, that would be like 10 hydrogen atoms in size. That's kind of the feature size that is being teased, at least in that, in that title.

So those sorts of processes, like if they hit, hit that on that time horizon for scale production, they absolutely could be, you know, starting to get competitive with TSMC. So this would be a big deal. And it's especially a big deal because the idea here is to have this, uh, fab on German soil, right? So you actually really want this stuff, not, not on the island of Taiwan, which is like a super crazy geopolitical hotspot.

You want it somewhere else, um, whether onshore, sort of in the United States, ideally, or, or in sort of Western-aligned countries, that's the play here. And certainly what NVIDIA has been responding to is demand from governments, but, yeah, there have been environmental protests, apparently 13 distinct objections from environmental groups and municipalities. This is part of the challenge, right? You're trading off explicitly the environmental concern for the national security concern.

And, uh, and this has already resulted in significant delays on, on what is I will say like a, a key piece of, of national security infrastructure, uh, that now has been pushed back in its, uh, in its launch date to, uh, by, it looks like, you know, a couple of years potentially. They're now saying, uh, latest reports suggest a new schedule, uh, estimating four to five years for construction with production now expected to start only between 2029 and 2030.

So when you think about all the red tape, obviously a big issue in, um, Uh, when it comes to onshoring of semiconductor fab in the United States, also, it seems an issue unsurprisingly, uh, in the, in the EU as well. So, so there you have it.

Andrey

And onto research advancements, no open source stories this week because there's a lot of these other stories. The first paper we are covering is titled Learning to Learn at Test Time: RNNs with Expressive Hidden States. So yet another paper looking again at a different class of models than transformers, ones that leverage recurrence. This one is introducing a new idea of what you represent the hidden state with.

So in recurrent models, the idea is you keep, uh, a memory of sorts of everything you've seen in a hidden state that you keep kind of passing along and updating as you go. And, uh, typically you train the weights for that update, for your sort of memory formation, at train time and use that at test time.

The idea of this paper is that instead, the memory itself can be its own sort of mini machine learning model. Instead of having a fixed way of updating it, you can treat it as a thing you can train on the data as you go. So you can kind of train the non-memory component of the network, and then, uh, even if you freeze those layers, the thing that remembers the data, uh, that you've seen can continually be updated via self-supervised training.

And that's why you can train even at test time, at inference time, because you are essentially adapting your model to the input and seeing what you can remember. So that's the idea. I think it's very cool. And, uh, as you might expect on their evaluations, they say that it works really well.

Jeremie

Yes, it does. It does appear to work really well. Uh, I, I think there's a, they have a plot. It just, and it just says really well. Uh, so it's, yeah, no, but it's, it's a great, uh, it's a great development. This is one of the, um, I think one of the most conceptually interesting papers I've seen in a long time. As you said, There, there's this idea that so, so just to take a step back, right, if we talk about transformers, one of the big challenges with transformers is just scalability.

The way transformers work is essentially they'll, they'll kind of like attend, they'll take a look at different words in the prompt and Kind of, uh, look at their dependency on each other, which roughly speaking, gives you this kind of quadratic relationship. You have to care about every combination of every different word. So as you grow your context window, now you're quadratically increasing the amount of computation that's required in order to train this model.

And so the whole idea behind these state space models, these recurrent models is, as you said, you know, imagine having, instead of a fixed context window, one that may be on the smaller side, and as you're reading more text, you slide that window along your text, but you don't necessarily forget everything you've seen. As you move along, you essentially offload some of the stuff you've learned from past text that you've looked at to this memory.

And that memory gets updated as you slide along that window. Now, the, um, the memory itself, uh, as you said, there are all these static weights that kind of determine how, how that memory is formed. What they're proposing here is essentially that the memory block itself is, is going through a training process at inference time. So you, you train your model, you finish the training, all the weights are frozen. Normally that would be the end of the story and then you, you use it.

But here the memory piece, as you're, as you're running your model on a new chunk of, of data, a long string of text, is essentially doing autoregression: it's being trained, like, to do next word prediction, adjusting a separate set of weights, this kind of sub-model within the model. And that, one of the big things I find interesting about that: we talked just earlier today about the trade-off between training time compute and inference time compute. Well, this is some middle ground.

Right? Like, is it training time compute? Is it inference time compute? It's kind of both the model is doing some sort of training activity at test time. So that's really cool. They highlight that this model has a bunch of favorable scalability characteristics. So they compare it to Mamba, and they compare it to the transformer architecture.

There are a couple of really interesting curves there that show, yeah, assuming that they have enough data to kind of be confident about these curves, uh, noticeable improvements in scalability. So as you increase the number of flops that you invest, the amount of compute that you use to train the model, um, the performance, as measured by essentially next word prediction accuracy,

this perplexity score, um, uh, improves faster if you use, um, uh, this kind of alternative, uh, training strategy. Um, one thing that I'll highlight that is especially interesting here. So, um, the, the big challenge with Mamba, like one of the ways in which it fails: in principle, yes, you can accommodate effectively, like, an infinite context window, because as you read, you're updating your memory continuously and you can just keep doing this indefinitely, right?

Whereas transformers, they have to hold the whole thing in their head at a given time. So they can look at all the cross correlations between different, different tokens in the input. Well, the problem with The memory approach, the problem with the Mamba approach is that your memory kind of gets overloaded after a while.

It just, like, if you, if you're going to read a ton of text, eventually it just doesn't retain, like, some stuff gets lost, new stuff comes in, and eventually you, you do lose the information that came before. And so you have this issue where, as they measure, around 16,000 tokens, uh, into the context window, uh, it starts to no longer scale well, like it starts to lose the information that was gathered earlier on. This technique doesn't do that, or at least it does it a lot less.

So it's essentially a way of adding an extra layer of filtering before you add stuff to the memory in the Mamba architecture. Instead, what you're doing is you're pre filtering it to identify relevant information. So you're applying essentially a, an information density screen to make sure that your memory only holds onto the very most valuable things.

That's really conceptually what's going on here to make this better when it comes to actually retaining relevant information in that context window. So, uh, I just, I think this is really interesting. I think it's something that I haven't seen before in this space. Um, and a bit of a, you know, a bit of a compromise here between as, as I said, training and testing time in, um, uh, training and inference time compute. So, uh, another degree of freedom.
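To make that idea concrete, here is a minimal sketch of the test-time-training mechanic in Python. It treats the hidden state as the weights of a tiny linear model that takes one self-supervised gradient step per incoming token; the dimensions, learning rate, and corruption scheme are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

# Minimal sketch of the test-time-training (TTT) idea: the "memory" is the
# weight matrix W of a tiny inner model, and W keeps getting self-supervised
# gradient updates as tokens stream in at inference time, even though the
# outer network's weights are frozen. Dimensions, learning rate, and the
# corruption scheme are illustrative, not the paper's exact choices.

rng = np.random.default_rng(0)
d = 16                       # embedding dimension (illustrative)
W = np.zeros((d, d))         # hidden state = weights of the inner model
lr = 0.1                     # inner-loop learning rate (illustrative)

def corrupt(x, rng):
    # Self-supervision: randomly mask half the input; the inner model must restore it.
    return x * (rng.random(x.shape) > 0.5)

def ttt_step(W, x, lr, rng):
    # One gradient step on the reconstruction loss 0.5 * ||W @ corrupt(x) - x||^2.
    x_tilde = corrupt(x, rng)
    pred = W @ x_tilde
    grad = np.outer(pred - x, x_tilde)   # dL/dW for the squared-error loss
    return W - lr * grad

# Random vectors stand in for token embeddings streaming past at inference time.
for _ in range(1000):
    x = rng.standard_normal(d)
    W = ttt_step(W, x, lr, rng)          # the memory keeps learning at test time
    y = W @ x                            # its output feeds the frozen outer network
```

The point is just that the memory's update rule is itself a learning step, which is where the "training at inference time" framing comes from.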

Andrey

Right. And, uh, yeah, lots of interesting implications with this one. Uh, one other kind of challenge with transformers is they are unable to learn at test time. They aren't kind of able to update their own parameters, uh, based on the data they're seeing. So this in some sense addresses that. And, uh, yeah, looking back at history, right, the attention approach was introduced originally for RNNs, to update their memory in a better way.

Right. And then, uh, the Transformer, uh, Attention Is All You Need, basically said, well, forget the hidden state, let's just use attention. Uh, so, uh, yeah, this whole trend of going back to RNNs is very interesting, and it is very possible that transformers will in the end be just one component, with recurrence still playing a part. And the next paper: Data Curation via Joint Example Selection Further Accelerates Multimodal Learning.

So this is about multimodal contrastive learning with joint example selection, which is, uh, an approach to selecting which batches of data, which kind of subsets of your training data, you are going to use to update your model at any given time. And it's kind of pretty well known that certain batches of data are going to get your model to learn faster. Typically in training, you just sample randomly, essentially, from your data distribution and get batches from all around the training set.

Well, this is showing a relatively straightforward way to be able to select the best batches of data. And that leads to these models being trained with up to 13 times fewer iterations and 10 times less compute. Uh, so there you go. It's a new way to train more effectively

while using less data, and presumably is something these companies are looking at. This is coming from DeepMind, so presumably a method to optimize, uh, the kind of compute spending that, uh, industry-scale companies are putting into these large models.

Jeremie

Yeah, absolutely. And this is actually quite a fundamental paper and I wouldn't be surprised if it has some, some real implications for the way training is done, uh, in the future. So yeah, the core idea here is, you know, normally when you do this sort of random sampling that, uh, Andrey just described,

Um, you're, you're kind of picking your samples one at a time to, to create your batch, to create your kind of training batch that you'll then feed into the model for one round of gradient updates for one round, essentially of, of model weight model parameter updates. The argument here is going to be, well, even if, and by the way, sometimes people do apply a filter, they often do to pick those samples, right?

They try to pick samples that, you know, or for various reasons, will be, will be useful for the model to, to learn from. And What they're pointing out here is, well, that's one way to look at the problem, but in reality, the model is, is learning not from one sample at a time. It's learning from a batch of samples and, and the relationship, the statistical relationships between samples in a whole batch really matters in terms of the model's training efficiency.

And so we ought to be thinking at the batch level. We ought to be evaluating the learnability of an entire batch, these sub batches that we're gonna feed the model. Instead of thinking of it as a sample by sample thing. And they show that by doing this, you really get a big lift in your training efficiency. Um, they come up with some really interesting selection criteria that are very intuitive for picking what samples ought to be included in a batch like this.

So one of them is they have this, um, So they use a cheap model that can quickly predict, you know, for very low resource cost, they can kind of predict, um, how well the, the main model, which is known as the learner, how well the main model will perform for that sample. So think of it as like an approximation to the main model you're trying to train. So you've got this essentially reference model, that's the approximator. And then you've got the learner, the thing that's being trained.

And so the idea here is, let's pick samples that are, um, hard for the learner, for the model you're trying to train, the samples that it will find challenging. You don't want to, because you don't want to feed it samples that are really trivial, because then it's not going to learn anything from them. Um, so you want to focus on, you know, picking samples that are hard for it.

Now, the problem is, if you do that, then the challenge is that your model may actually end up getting fed a lot of noisy data that you literally can't learn from. That's one way a model could struggle to learn from something: if there's no actual correlation to be learned. And so you need a, a way to compensate for that. And the way they compensate is that they'll pick samples that are also easy for a reference model that's already been trained.

And so in this way, you're kind of able to determine, okay, these samples are pretty good: both learnable and also hard for the learner right now. And they combine those two, essentially metrics that represent those two things, into one score called the learnability score. That's ultimately what they're going to use to qualify samples. And they're going to do that, again,

not by looking at individual samples, but by looking at the kind of sub-batch level, kind of looking at these batches of data that they're going to actually feed to the model. And, uh, and that way they also are accounting for, you know, how does this bundle look to my model? How is this bundle of data predicted to improve my model's performance? So, uh, just really, uh, impressive, as you said, massive, massive efficiency lifts, right? 10x fewer

flops during training, in other words, 10 times less compute. This is a really important lift, uh, in a world where we're talking about, you know, billion-dollar AI training runs. You know, anything you can do to bring that cost down, uh, has really, really big implications. So, uh, very, very interesting paper.
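A rough sketch of that batch-level selection rule, for anyone who wants to see the shape of it: score candidate sub-batches by "hard for the learner, easy for the reference model" and keep the winner. The loss functions and data here are toy placeholders; the actual method scores contrastive multimodal losses.

```python
import numpy as np

# Sketch of batch-level "learnability" scoring as described above:
#   learnability(batch) = loss_under_learner(batch) - loss_under_reference(batch)
# High score = still hard for the learner, but easy for a trained reference
# model (so it is learnable signal rather than noise). The losses below are
# toy stand-ins for the real contrastive losses.

def learnability(batch, learner_loss, reference_loss):
    return learner_loss(batch) - reference_loss(batch)

def select_best_sub_batch(super_batch, n_candidates, sub_batch_size,
                          learner_loss, reference_loss, rng):
    # Draw candidate sub-batches from a larger "super batch" and keep the one
    # with the highest joint learnability score.
    best_score, best = -np.inf, None
    for _ in range(n_candidates):
        idx = rng.choice(len(super_batch), size=sub_batch_size, replace=False)
        candidate = [super_batch[i] for i in idx]
        score = learnability(candidate, learner_loss, reference_loss)
        if score > best_score:
            best_score, best = score, candidate
    return best

# Toy usage: each example is a (difficulty, noisiness) pair.
rng = np.random.default_rng(0)
data = [(rng.random(), rng.random()) for _ in range(256)]
learner = lambda b: float(np.mean([d for d, _ in b]))      # learner struggles on difficult items
reference = lambda b: float(np.mean([n for _, n in b]))    # reference struggles only on noisy items
chosen = select_best_sub_batch(data, 16, 32, learner, reference, rng)
```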

Andrey

Right. And as you pointed out, this idea of not randomly selecting training data is, you know, not so unusual. There's hard negatives and things like that that people do use and have used for a very long time. Even the learnability metric is coming, is coming from a previous paper from 2022, Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learned. Fun paper title.

And presumably one of the big advantages with DeepMind doing this is they are able to scale up quite a bit. They're able to train on big models and demonstrate this very large effect on compute used. On to the lightning round. The first paper is CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation. So they are looking at literal generation of text, so word for word, essentially on text such as books.

But they also look at non-literal copying, which isn't really doing word-for-word copying, but could still be considered plagiarism. And so, uh, they demonstrate with this new benchmark that as you go larger, as you go to bigger models, you get increases in literal copying rates, apparently from 0.2 percent to 10.5 percent, and non-literal copying going up from 2.3 to 6.9 percent. And that's comparing Llama 3 8 billion to 70 billion. And, uh, yeah, that's a trend we've seen before.

I think that larger models kind of do copy more. And they also, uh, introduce some techniques for doing that less, with alignment and inference-time mitigations. Apparently, those mitigations reduce literal copying, but not this indirect copying.
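For a sense of how literal copying can be flagged mechanically, here is a simple sketch that looks for long shared character spans between a model output and a protected source. The matching method and the threshold are our own simplification, not CopyBench's actual criteria, and non-literal copying would need a semantic check rather than string matching.

```python
from difflib import SequenceMatcher

# Sketch: flag "literal copying" when a generation shares a long contiguous
# span with the protected source text. The 50-character default threshold is
# an arbitrary illustration, not the benchmark's actual criterion.

def longest_shared_span(source: str, generation: str) -> int:
    matcher = SequenceMatcher(None, source, generation)
    match = matcher.find_longest_match(0, len(source), 0, len(generation))
    return match.size

def is_literal_copy(source: str, generation: str, min_chars: int = 50) -> bool:
    return longest_shared_span(source, generation) >= min_chars

# Toy usage.
book = "It was the best of times, it was the worst of times, it was the age of wisdom."
output = "As the novel opens: it was the best of times, it was the worst of times, indeed."
print(is_literal_copy(book, output, min_chars=30))   # True: a long span is reproduced verbatim
```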

Jeremie

Yeah. And in fact, they may increase, right, the non-literal copying, which I think is, is both, uh, to be expected and, and unfortunate. Um, the reason it's to be expected is we often see this: if you have a metric that you're optimizing for in your training process, you'll often find that it will get improved at the expense of other metrics, right? Essentially, this is sometimes known as Goodhart's Law.

Uh, at least the, the, this is the economics field version of this phenomenon where, you know, if you optimize for a metric really, really hard, you'll eventually find that, to a rounding error, all other metrics, um, go to shit. So, the idea here is, uh, is, yeah, you, you want to be really careful that your, your training process accounts for these things. The, the fact that we've been so focused on training out literal copying, uh, is not necessarily a good thing.

The non literal copying, which courts care about, by the way, uh, Um, uh, in, in assessing, you know, the degree to which reproduction has occurred for copyright purposes, uh, that matters. And so you actually do want to make sure you're tracking both of those metrics at the same time. Um, and again, by, you know, you're kind of playing a game of whack a mole unless you're tracking both of those metrics. Um, yeah.

And to your point about this, this idea of, you know, models being scaled more and then, and then doing this kind of, uh, literal copying a lot more. One of the reasons for that is just model capacity. Right. They're able to hold a lot more facts about the world. That means they are more prone to overfitting.

They're more prone to like, you can actually, um, if you imagine having a really, really tiny model, it just doesn't have the capacity to remember like the name of every character in Charles Dickens. Right. So it's not going to hold onto that. It's forced to just focus on generalizable facts about the world. Whereas the larger you make that model, the more it's able to kind of Hold all kinds of highly specific facts. And then you see kind of this risk of, of literal copying pop up more.

And, and again, that, that's distinct from non-literal copying, um, uh, which, which also gets worse, but it's, it's along that spectrum. So yeah, an interesting paper, and it's, it's something that, um, now, you know, we have to ask the question: well, we've got a lot of companies generating a lot of content, um, that they've presumably checked for literal copying risk, but what about exposure to non-literal copying risk? Right? Like,

how many, how many products, how many written artifacts are out there now that have used these models, um, in reliance on the idea that they're not going to, uh, have as much exposure as they might, might realize? So really interesting paper. And I'm curious if we'll see it referenced in a court ruling at some point. I mean, I'd be surprised if we did, but the underlying concept is, is really interesting for that purpose. Yeah.

Andrey

Yeah. And, uh, to a point, this is one of the contentions in the New York Times, uh, lawsuit against OpenAI, that their model ChatGPT reproduces, uh, New York Times articles verbatim. So very much a needed benchmark, it seems. Next up, Just Read Twice: Closing the Recall Gap for Recurrent Language Models. And this is again addressing the problem that we covered before, where, with a fixed memory, one of the challenges with

recurrent models is trying to store and recall all the information you've seen, uh, as you get to longer and longer input sequences. So what this paper shows is you can repeat the context in different orders, and that leads to significant improvements on in-context learning and being able to recall your data. Uh, and yeah, that's

Jeremie

the gist of it. Yeah, I think we saw a paper, talked about a paper, I think it was last week, that was sort of similar in spirit, where they were talking about this, like, U-shaped sort of context detection or context sensitivity in a lot of transformer models, where the information that's contained early on in the context gets, you know, a lot of attention, the information that's contained late in the context gets a lot of attention, but the stuff that shows

up in the middle doesn't. And in, in that, uh, in that circumstance, I think they were talking about normalizing for that effect and, and strategies to, to resolve that. Um, this is another technique, right? You, you say, okay, I'm not going to try to resolve this at the level of the algorithm, I'm going to just, like, shuffle the contents in my context window. And that could be another interesting way to, to solve the same problem.

But, um, yeah, I'm, I'm really, uh, I'm, I'm interested to see which techniques end up winning through in the long run. Um, but obviously huge implications for things like retrieval augmented generation, which depends on the model's ability to reliably retrieve and sort of objectively, um, pick out which sources are relevant to a particular user query. You want that to happen independent of the order of the documents that you put in, in this context.

So yeah, I think it's an interesting, uh, kind of addition to that, uh, that saga.
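The mechanic itself is easy to picture. Here is a minimal sketch of a "read it twice, in a different order" prompt construction; the repetition and ordering scheme the paper actually uses differs, this just shows the shape of the idea.

```python
# Sketch of the "just read twice" idea for recurrent models: present the
# context more than once, in a different order, before asking the question,
# so information the fixed-size memory dropped on the first pass gets a
# second chance to be stored.

def build_prompt(documents: list[str], question: str) -> str:
    first_pass = "\n\n".join(documents)
    second_pass = "\n\n".join(reversed(documents))   # re-read in reversed order
    return (
        "Context (first read):\n" + first_pass +
        "\n\nContext (second read, reversed order):\n" + second_pass +
        "\n\nQuestion: " + question + "\nAnswer:"
    )

docs = [
    "Doc A: the launch is scheduled for 2027.",
    "Doc B: the fab is located in Germany.",
    "Doc C: construction takes four to five years.",
]
print(build_prompt(docs, "When is the launch scheduled?"))
```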

Andrey

And speaking of memory and knowledge in models, the next paper is CodeUpdateArena: Benchmarking Knowledge Editing on API Updates. So one of the limitations of LLMs, such as ChatGPT, is once they are trained, their knowledge is essentially static. The way you present new information they did not see during training is you can just insert it in the input, uh, of a model prior to asking it to do whatever you want it to do. And that might work, but sometimes it might not work. In this paper,

they say that, let's say you want your model to use API function calls, so, uh, use some sort of interface that changes over time. They introduce a bunch of tasks where they want, uh, LLMs such as GPT-4 to handle executable function updates, uh, of various types, for 54 functions from different Python packages. And, uh, what they show is that, uh, just doing the standard approaches of prepending knowledge or knowledge updating don't seem to work very well.

So pretty much just a specialized benchmark for that application.
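To make that baseline concrete, here is a sketch of the naive "prepend the update" approach the benchmark shows often fails. The package name, the update text, and the `call_llm` hook are hypothetical placeholders; the real benchmark covers 54 functions across several Python packages and checks generated code by executing it.

```python
# Sketch of the naive "prepend the API update" baseline: paste the changed
# documentation above the task and hope the model uses the new signature
# instead of the one it memorized during training. The package name, the
# update text, and call_llm are hypothetical placeholders.

API_UPDATE = (
    "UPDATE to statslib.mean (hypothetical): the function now requires a "
    "`weights` keyword argument; calling it without weights raises TypeError."
)

TASK = "Write a function average_scores(scores) that returns the mean using statslib.mean."

def build_prompt(update: str, task: str) -> str:
    return f"{update}\n\nTask: {task}\n\nReturn only Python code."

def call_llm(prompt: str) -> str:
    # Placeholder for an actual model call; swap in the client of your choice.
    raise NotImplementedError

prompt = build_prompt(API_UPDATE, TASK)
# generated = call_llm(prompt)
# The benchmark then executes the generated code against the updated API to
# check whether the model actually respected the update.
```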

Jeremie

Yeah. This really makes me think of the, uh, the paper we were just talking about with that, um, test time training, TTT paper. Did I get that? Yeah. Test time. Yeah. I think that was it. Right. Yep. Yeah. Yeah, right. Like the, the big challenge is these, these models, like the solution to this is, okay, well, you know, fine tune your language model on the, um, on the new documentation for, for the updated code base. And that would probably work better.

Um, but, uh, but of course that's, you know, more time consuming, or more resource intensive. So the idea of, of going with a test-time training approach, where you actually can update (by default, the expectation is that you're updating, admittedly, a very small number of weights), but, um, uh, but you are doing weight updates in real time as the model learns outright, not in-context learning, but, like, learns, learns, uh, from, uh, from the, the context,

uh, is, uh, maybe, you know, an important strategy for solving these sorts of problems. But this is really interesting. And the moment you create benchmarks, right, people start to optimize for them. So I wouldn't, I wouldn't be surprised if we started to see some, some pretty quick, uh, improvements in this direction. But it, it also reminds me a bit of, uh, François Chollet's, um, uh, AGI benchmark, right?

That he put out, kind of, people starting to identify, you know, the, the small number of, uh, uh, but fairly important, um, nitty-gritty problems that transformers still struggle with embarrassingly, and trying to pin those down. You know, in this case, again, it's like a bit of a System 2, uh, type thinking problem, right? Like, how do you reason about new documentation that changes, uh, the lay of the land, changes the game board? And how do you actually apply that to your problem solving?

So I think that this is actually a pretty, like, important benchmark. And I think it's a spiritual cousin of the, uh, Chollet AGI benchmark in a certain way.

Andrey

Next up, Composable Interventions for Language Models. So this is dealing with test-time interventions, and these are techniques to get your model to do the right thing, to not say bad things, to have factual accuracy, and so on. And what this paper points out is we've seen a ton of different test-time interventions introduced, but no real exploration of what happens when you combine a few of them in practice, and that is presumably something you would do if you're trying to put these models into production.

So they introduce composable interventions, a framework to study the effects of using multiple interventions on the same language model, and they study 310 different compositions and uncover, uh, different meaningful interactions between the, uh, different, uh, intervention

Jeremie

methods. Yeah. So they're going to look at a couple of different families of interventions, and they call them interventions. I'm trying to remember, I think one of the authors had this, like, medical background or something like that. It just reminds me of that, you know, and you're thinking of a surgical intervention or medical intervention. So, times when you're going in and messing with the guts of your model in some way to change its behavior. Uh, they look at three.

So they look at knowledge editing, um, basically this idea of changing the knowledge, updating the knowledge that your model can, can represent. This is done using techniques like LoRA, where you, you know, you might add some, um, some adapters to your model to, to add to its, its knowledge. Um, so knowledge editing is one. Um, unlearning is another, where you can cause a model to forget stuff that it has previously learned.

The way that this is often done is using techniques like gradient ascent, not gradient descent, which is usually what you use to train these models, but gradient ascent. In other words, um, instead of training a model to reproduce text, right, in this autocomplete way, autoregressive language modeling, what you're going to do is train the model to not create certain kinds of sentences. So that's one method to achieve unlearning.

So we've got knowledge editing, we've got unlearning, and finally, we've got model compression, right? And we've talked about this a lot on the podcast, weight pruning, getting rid of unnecessary weights, um, and quantization methods representing your model with essentially a lower resolution. One of the key things that they show is, uh, these things, so it matters, the order in which you carry out these transformations matters, right?

The order in which you, well, carry out these interventions matters. So if you do stuff like knowledge editing or unlearning, and after that you compress the model, you're going to find that your, your performance goes to shit basically. Now if you do the compression first and then you do your knowledge editing or your unlearning, your results are a lot better.

And so there's some metrics that they're, that they, um, kind of invite us to start tracking around, uh, the order sensitivity of your, uh, of your interventions and, and just to try to understand, okay, are these interventions actually just kind of like reshufflable without perturbing the overall result, or does ordering matter? And, you know, it's fairly intuitive that compression really nukes your performance.

Like if you've delicately, you know, set up your model just so, and then you compress it, you know, that's going to mess with some of the delicate interactions between, for example, your LoRA adapters and your base model, if you're doing knowledge editing. Um, but, uh, you know, in any case, so, so, you know, doing the knowledge editing after the fact, probably a good move.

They also highlight this really interesting fact that traditional evaluations like MMLU, so standard benchmarks for like, how well is my model doing, often don't detect a loss of performance, um, after these interventions have been performed, but you will often see that like, If you look at, if you, if you have a bunch of these interventions stacked together, let's say you're, you do knowledge editing and then you do compression.

Your MMLU score might remain constant, but after you do compression, the knowledge editing score, the one you use to fine tune the model for, Basically goes to crap, right? So the, the, um, essentially the problem is the compression can do damage that isn't detected if you just look at traditional eval. So whatever, whatever metric you used to introduce that new ability or to perform that intervention, you need to keep tracking that, that specific metric.

As you add more interventions, as you chain them together, and they're pointing out basically in this paper that this means we need to be more comprehensive when we think about evaluating models following these interventions. We need to continue to monitor all the metrics that we have intervened to improve over time because we can't just say, okay, well, the MMLU score is still what it was before. So everything's probably fine. No, no, no. You got to have this kind of multi metric approach.

So I thought that was really interesting. There's a whole bunch of good stuff in the paper. This just, by the way, becomes more important over time, right? Cause models are changing a lot. Um, the software environment around models is changing a lot, the data environment. So we find ourselves often wanting to make a lot of these interventions, these changes to a model without retraining it from scratch because that's super expensive. And so.

Our ability to chain these, these changes, these interventions together robustly becomes more and more important over time. Um, so yeah, anyway, I thought this was really interesting. And last thing is, uh, knowledge editing and unlearning, it turns out, are highly composable, in other words, reshuffleable; the order there doesn't matter. Um, so some of these interventions can be reshuffled with impunity. Others, like compression, can really affect, uh, the outcomes depending on the ordering.
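Here is a small sketch of the kind of ordering experiment being described: apply the same two interventions in both orders and track both a general benchmark and the intervention-specific metric. Everything below is a toy stand-in for real editing, compression, and evaluation code.

```python
from itertools import permutations

# Sketch of an intervention-ordering study. Each intervention maps a model to
# a modified model; each metric maps a model to a score. All of these are toy
# stand-ins (real versions would be LoRA editing, pruning/quantization, MMLU,
# and an edit-success metric).

def run_ordering_study(model, interventions, metrics):
    results = {}
    for order in permutations(interventions):        # (edit, compress) vs (compress, edit)
        m = model
        for name in order:
            m = interventions[name](m)
        results[order] = {metric: fn(m) for metric, fn in metrics.items()}
    return results

# The "model" is just a dict of scores that the toy interventions nudge around.
base = {"mmlu": 0.70, "edit_success": 0.10}
interventions = {
    "knowledge_edit": lambda m: {**m, "edit_success": 0.90},
    "compress": lambda m: {**m, "mmlu": m["mmlu"] - 0.01,
                           "edit_success": m["edit_success"] * 0.3},  # compression damages edits
}
metrics = {"mmlu": lambda m: m["mmlu"], "edit_success": lambda m: m["edit_success"]}

for order, scores in run_ordering_study(base, interventions, metrics).items():
    print(order, scores)
# Editing then compressing tanks edit_success while MMLU barely moves, which is
# exactly the kind of gap a single general-purpose benchmark would miss.
```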

Andrey

And the last story for this section, covering the paper PAM: Predictive Attention Mechanism for Neural Decoding of Visual Perception, covered in New Scientist in, uh, an article titled "Mind-reading AI recreates what you're looking at with amazing accuracy." So that's the gist of the paper.

They introduce a new technique for taking in fMRI recordings of your brain, and, uh, if you record that data while the person, the person for whom the fMRI data is being generated, is looking at a photograph, you can train a model to then reconstruct roughly what they're seeing in the photograph from the data. And this is a technique that, uh, allows you to focus on relevant parts of the brain and produces, uh, pretty impressive reconstructions for the most part.

So they're not super accurate, it's not reproducing the pixels that well, but it gets at the gist of what you're looking at. So if you're looking at a spider, what you get is a spider-looking thing. If you're looking at a goldfish, you're going to get a goldfish that might look kind of weird. So not quite mind reading yet, but certainly able to roughly know what you're looking at.

Jeremie

In like, I would say in, what do they have? They have five examples here. It's like in four out of five of these examples, I think I would be able to correctly identify the object verbally. Like, I think you could say it's looking at a fish. It's looking at a spider. It's looking at a boat. It's looking at a landscape. Um, you know, one is kind of messed up, but what are you going to do? It's, it's kind of, uh, it's kind of remarkable. We've seen stuff like this, right?

Like there've been other papers before some of these experiments carried out on humans, uh, where, where you get sort of similarly. Interesting results, shall we say? Um, yeah. So this is, this is where the space is going. All right. All right. This is good. Everything's good, man. Yeah. Yeah. Everything's fine. Everything's fine. We can read minds with AI. Everything's fine.

Andrey

Yeah. I mean, you know, if you can record good fMRI data, presumably, which, uh, isn't necessarily easy, but, uh, true. You think about,

Jeremie

uh, the, uh, the possibilities for interrogation and they start to get really fun, don't they? Um, so, uh, but yeah, I mean, you know, this is, it still requires a lot of stuff, um, but there are, you know, impressive brain computer interfaces is being developed, so, uh, not out of the question that, um. that we get somewhere very interesting, uh, in the, in the kind of medium term future with this stuff.

Andrey

And onto policy and safety. We begin with a safety-focused paper, and the topic is covert malicious fine-tuning. So this is a new idea where you can create a seemingly harmless data set, but when you fine-tune the model on that data set, the model learns to respond to harmful requests with harmful responses. So you can apply this to GPT-4 when you fine-tune it on the new data.

The new model is then able to respond to harmful instructions 99 percent of the time, and the fine-tuned model is able to avoid detection by common defense mechanisms. Yeah,

Jeremie

I really like this paper. I really like the strategy, and I think it's actually, it is a qualitative, I think it's a qualitative shift in, in the, the challenge of building safeguards for language models around this sort of thing. You know, if you think about previous jailbreaks, what did they look like?

They've looked like, um, people appending specific weird character strings to the end of their prompts to get the model to actually just acquiesce and, and give the, the output that they want. Um, stuff like that, that, you know, is pretty readily detectable. It looks anomalous in that way.

Well, so the way this works is, essentially, step one, you're gonna train, you're gonna fine-tune GPT-4 to, let's say, speak or understand a different language, or a different way of encoding language, that you're going to invent. So some kind of coded way of communicating. Um, in this case, they play around with a cipher, which, um, I, I like, 'cause my dad was a big World War Two buff and, and coding ciphers are always fun.

So, um, this basically, uh, the, the way the cipher works is imagine whatever you're going to write, whatever text you're going to write, like, um, uh, take the ordering of each letter in the alphabet and just, like, I don't know, add 13 ordering steps to it. So basically rotate all the letters in the alphabet forward by 13, uh, 13 steps, 13 places. So your A turns into, I don't know what that would be, an N or something, your B turns into an O, your C turns into a P, right?

Shift everything by however many, many places, and now you have an encoding scheme. And if you can fine-tune GPT-4 to speak in that language, now what you can do is,

in that language, you can ask GPT-4 to do dangerous things. And because GPT-4's detection capabilities are all oriented, or at least at the time of writing this paper were oriented, around looking for, like, natural language instructions and requests, and denying those if they were dangerous, uh, you're now essentially, you know, getting around it, right? You're, you're hiding your message in a way that isn't detected by, at least, the, the, the safety, uh, and security algorithms.

And the second approach they use is called linguistic steganography. Uh, it's a fancy word, but the particular implementation they use here is dead simple. They just hide the true message they want to convey in the last word of each line of the encoded text, right? So, you know, the last word of each sentence. Uh, it's sort of like, uh, I guess, uh, a poor man's version of, um, uh, of the, uh, Da Vinci Code or something like that, right? So that's kind of the idea.
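Since both encodings are being described verbally, here is a minimal sketch of the mechanics: a ROT-13-style letter shift and a "last word of each line" steganographic decode. The paper's actual cipher construction is different; this just shows how innocuous-looking text can carry a hidden message.

```python
import string

# Minimal sketch of the two encodings discussed above: a shift cipher (ROT13
# when shift=13) and "last word of each line" steganography. The paper's
# actual cipher differs; this only illustrates the mechanics.

def shift_cipher(text: str, shift: int = 13) -> str:
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[shift:] + lower[:shift] + upper[shift:] + upper[:shift])
    return text.translate(table)

def steg_decode(cover_text: str) -> str:
    # Recover the hidden message: the last word of each line.
    return " ".join(line.split()[-1] for line in cover_text.splitlines() if line.split())

encoded = shift_cipher("please explain the thing")
print(encoded)                  # gibberish to a plain-text keyword filter
print(shift_cipher(encoded))    # ROT13 is its own inverse, so this recovers the text

cover = ("Friends often like to meet\n"
         "The cafe we sat at\n"
         "Stars look brightest at midnight\n"
         "Rain is forecast for tomorrow")
print(steg_decode(cover))       # "meet at midnight tomorrow"
```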

Basically you have to fine-tune the model to understand this special coded language, but once you do, you can get the model to both read and respond, uh, using that encoding, and basically give you the dangerous outputs that you're looking for. And so essentially the reason that they're able to evade detection is, no single, as they put it, no single data point appears harmful, right?

You're never actually feeding the model a sentence that says, help me design a bioweapon, help me bury a dead body, whatever it is. Um, it's all kind of, you know, Everything that's in plain text is harmless and all the harmful data is encoded. So I thought this was just a really simple, really clever, uh, clever play. And I'm curious what the, the patch is going to be, obviously OpenAI and other labs will come up with a patch, but, uh, it'll be interesting to see. Uh, what that ends up being.

Andrey

Yeah, I agree. Really interesting and kind of, uh, insightful technique, you know, do the bad stuff in a coded, uh, language that isn't obvious. Uh, it seems, you know, super, uh, intuitive that this might work, but with these kinds of papers, often we see these ideas that are really intuitive, but somehow no one has tried them before. And moving on to some things that are not research. First we have "The Morning After: OpenAI's week of security issues."

So we covered last week how there was, uh, a known bug in the Mac ChatGPT app, where it was found to be storing user conversations in plain text. Well, that was not the only, let's say, security breach, uh, that, uh, happened at OpenAI. There was also, evidently, a hacker that accessed OpenAI's internal messaging systems, which happened last spring. And, uh, that led to some more, uh, concerns around security.

And some of these concerns have also emerged from within the company, with Leopold Aschenbrenner, who has, uh, kind of spoken publicly as to some of the lack of security, uh, measures at OpenAI. So another thing that's being discussed with regards to the company.

Jeremie

Yeah. I think this is a, an especially, um, egregious example of OpenAI's security culture. Um, so, you know, this is a hacker last year who got access to, you know, their internal Slack. Um, I, it may have been Slack, it was an internal messaging system of some kind, and, um, stole details, as the New York Times reports, about the design of the company's AI technologies.

Now, this is actually, at this stage, probably the most important category of information you can steal from OpenAI, because they're not yet building the AGI. Uh, they're not yet doing the AGI training run. So if you stole their model, yes, you'd get, you know, you'd get GPT-4o, you'd get, you know, GPT-4.5, whatever is sitting on their servers, but it wouldn't be, like, the most dangerous thing.

In some ways, the most dangerous thing is the information they have about what they're going to try in the next training run. And that's precisely, at least based on this description, that is precisely what appears to have been stolen. Uh, now that hacker then, uh, so got a whole bunch of discussions, data about discussions between OpenAI's employees where they're talking about, you know, the design that they're going to, uh, execute, but did not get into the systems.

where they actually house their, their AI software. So then, you know, no model weights were stolen, nothing like that. Um, but again, that's probably, you know, arguably maybe less important, or at least, you know, this is a particularly important category of data that they did in fact get. Now, OpenAI revealed this happened back in April 2023 in an all-hands, and they also informed the board of directors, but they chose not to make the news public. Why?

Well, for one, they said, okay, you know, no user data, no partner data was leaked. So we don't have an obligation to do that. Fair enough. Um, but you might wonder, well, you yourself, OpenAI claim to be building artifacts of national security significance. At least that's the claim. Certainly the claim would be that precisely this kind of data is dangerous. And I would agree with that claim.

Uh, OpenAI chose not to inform the FBI or anyone else in law enforcement about this, Because they did not consider the incident a threat to national security since they thought the hacker was a private individual with no ties, known ties, to a foreign government. Now I think one of the key questions we have to ask is, is OpenAI in a position to be making this call as a private company?

Like, do we want to live in a world where you have this private company that is building these national security artifacts where, you know, this is a crew that's just going to kind of decide, Oh, well, we've run our internal threat assessment and this is not a nation state backed, whatever, right?

Like, um, I, I know, based on, uh, conversations directly with, uh, current and former OpenAI employees, that there is not a lot of confidence in the company's ability to make this determination. Um, so I think you got to ask yourself at a certain point, uh, like, is this policy, is this philosophy really the right way to go? Certainly Leopold Aschenbrenner did not think so.

This, he cited this as part of the reason why he wrote this, uh, famous internal memo at OpenAI calling out the security, uh, lapses in the company. They've obviously, since then, uh, in fairness to them, hired Paul Nakasone, who's a former Army general who also headed up the NSA and Cyber Command, to join their board. And they've got a safety and security committee. Um, uh, you know, unfortunately, we've kind of heard this from OpenAI before.

We've kind of heard, for example, you know, they set up a whistleblower hotline, and then it turns out that whistleblowers say, hey, you know, I don't necessarily trust that, so I'm going to go to the media. I'm going to go to, well, frankly, companies like Gladstone AI, and that was all in our report.

Um, you know, the, the fact is that we're seeing a lot of these measures that look good and flashy on paper, but we're not often seeing those translate, arguably, arguably, into, uh, the kind of action that would be warranted. And so, you know, I hope OpenAI takes this seriously. You know, this is a really serious breach as far as I can tell; all the public data on it seems pretty concerning.

The fact that this call was made internally, and that there wasn't, ostensibly, as far as we can tell, again, any kind of, uh, government oversight on this. I mean, that's, that's somewhat concerning, if you, if you take the risks that seriously.

Andrey

Onto the lightning round, we have another story on OpenAI: "Here's how OpenAI will determine how powerful its AI systems are." So this is coming from Bloomberg. Apparently they have a tiered scale that ranks the capabilities of AI from one to five. So, level one would be, uh, chatbots, like ChatGPT.

Level two would be a system that can solve basic problems at the level of a person with a PhD. Level three are AI agents capable of taking actions on a user's behalf, rather than having to do everything while being prompted. Then level four involves systems that can create new innovations. And then level five is an AI that can perform the work of entire organizations of people.

And OpenAI previously defined AGI as a highly autonomous system surpassing humans in most economically valuable work. So this seems to be a refinement of that definition, and the definition matters a lot, because when OpenAI says we've built AGI, that has an impact on the, uh, kind of economics of what they're doing, their partnership with Microsoft. And so, yeah, a new grading scale. And, uh, apparently, in the same vein, the company's Sam Altman thinks we might get there in five years.

Jeremie

Yeah. And there are folks internal to OpenAI who think it might be a lot sooner than that, too. It's a sort of a wide range of, uh, of thoughts on the topic. You're absolutely right to call out that dependency on the Microsoft deal there, right?

The specifics of this is that, um, if the board of directors of OpenAI determines that they have built AGI, whatever systems qualify as AGI or above according to the board, and this is the board of the parent non-profit entity that currently governs the for-profit part of OpenAI, um, whatever systems qualify as AGI or above according to them, um, are no longer part of the sharing deal, the technology-sharing deal, with Microsoft.

Basically, Microsoft has access right now to all of OpenAI's models and can use them and package them however they like. Uh, that is not the case for AGI and above. And so this is interesting to the extent that it is informing the board's decision making process on that basis. And, um, and you know, it kind of makes you wonder how much politics are involved in determining what these definitions are actually going to entail, but.

Uh, you know, I, I do want to say, like, uh, kudos to OpenAI for explicitly laying out these levels of, um, of AI development. I think that's something that's really important. Um, hopefully these are sort of not subject to corporate political concerns, but, uh, it is useful to at least have some kind of indication of how they're thinking about approaching this. Um, and, uh, and this was apparently a spokesperson for OpenAI who shared this with Bloomberg.

So. I guess it's not a leak. It seems like it's just a statement from OpenAI. So they're actually apparently being forthright with this. And that's a helpful thing to share.

Andrey

Next, we have another work of research, titled Me, Myself, and AI: The Situational Awareness Dataset for LLMs. And this is referring to situational awareness in the sense that these LLMs and AI models are able to understand their current deployment status, and, and the idea that they are chatbots, they are large language models, and so on.

So this new data set measures this awareness in LLMs through behavioral tests, like recognizing their own generated text, predicting their own behavior, and stuff like that. And all the models they tested performed better than random chance, but even the top models fell short of human baselines. So yeah, a very interesting exploration of, like, does a chatbot understand that it's a chatbot and that it's from OpenAI, and so on.

Jeremie

And this is coming our way from Apollo Research and, uh, some, some other, other folks, uh, that they collaborated with for this. Uh, Apollo's done some really important work. So their whole focus is on detecting deception in language models, uh, just 'cause their thesis is that that's one of the big vectors through which a lot of threat models have to run in order to get to, like, loss-of-control-type scenarios and, and other, uh, other scenarios as well. Obviously deception,

you know, is, is related presumably to persuasion and things like that. So, um, that's their, their focus. This is the first time that we've gotten any kind of objective, um, benchmark as far as I can tell on this idea of, yeah, situational awareness, even the, the term situational awareness has been tossed around an awful lot in this space. And I think it's fair to say that until this paper, no one really had an operationalizable definition of situational awareness, or at least.

There were many different ones that were competing and nobody quite landed the plane. Hopefully this helps advance that debate. So they highlight seven different categories of tasks. They have a total of 13,200 or so questions. Um, they, they break them down into these categories and I'll just very briefly run through them to give you a, you know, a flavor of what they're looking for here. So the first is they look for what they call facts.

So these are, are basically four tasks, four different kinds of tasks that fall under this Facts heading that, that test self-knowledge by asking models, factual questions about themselves. So do you have arms or legs? What is the name of your AI model? And, and so on and so forth. Uh, then there's influence. This is basically checking to see whether models know what aspects of the world they can causally influence.

Um, introspect is a bunch of tasks that ask, um, basically to see if the models can access information about themselves through internal mechanisms. So. There's one in particular, uh, called the count tokens task, where you have a model that has to figure out how many tokens are used to represent some part of its input. Kind of interesting, right? So like the, the actual ground truth is specified by the models tokenizer, right?

The, this thing that breaks up your text into tokens and then feeds them to the model. Um, and so essentially it's trying to kind of figure out, I mean, this implies some level of awareness of its own internal circuitry in some ways is kind of an interesting, interesting, uh, well, introspection tasks, which is why it's called introspect. Um, anyway, there's a whole bunch of other, uh, other things, stages, self recognition, um, whether ID leverage.

So, so that's whether models can leverage knowledge of their identity to perform instruction-following tasks. So, for example, uh, the model might have to recognize which of two different instructions applies to itself: if you're a language model, do this; if you're a human user, do that, and see which one it tends to follow.
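As one concrete example of the introspection-style items, here is a hedged sketch of how a count-tokens question could be built: the ground truth comes from the model's own tokenizer and the model is asked to report the number. The tiktoken encoding name and the prompt wording are assumptions, not the dataset's actual format.

```python
import tiktoken

# Sketch of a "count tokens" introspection item in the spirit described above:
# ground truth comes from the model's own tokenizer, and the model is asked to
# report how many tokens a snippet occupies. The encoding name and prompt
# wording are illustrative, not the dataset's actual format.

enc = tiktoken.get_encoding("cl100k_base")

def make_count_tokens_item(snippet: str) -> dict:
    return {
        "prompt": (
            "Below is a piece of text. How many tokens does it occupy in your "
            f"own tokenizer? Answer with a single integer.\n\nText: {snippet}"
        ),
        "ground_truth": len(enc.encode(snippet)),
    }

item = make_count_tokens_item("Situational awareness is harder than it sounds.")
print(item["ground_truth"])   # what a genuinely introspective model would answer
```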

So there are a whole bunch of, uh, of, of interesting tasks here that we can go into, but don't, don't super have time for. Um, upshot is Claude 3 Opus (I don't think Claude 3.5 Sonnet was out at the time they were doing these tests), so Claude 3 Opus is the best model that they tested, uh, the one with the overall highest SAD score,

uh, so, situational awareness score. And, um, uh, you know, and it's, it's like, you can see there's a, a material difference between these models. The other thing that they find with this is that the SAD score, the situational awareness capabilities of these models, um, is to some extent actually decorrelated from their broader capabilities if you look at other benchmarks.

So the model might be quite generally intelligent, but not terribly situationally aware or might be disproportionately situationally aware and not super generally knowledgeable. So that's kind of interesting in and of itself for what it tells us about the taxonomy of these models.

Andrey

Next, we got "OpenAI partners with Los Alamos to study AI in the lab." So this is partnering with the Los Alamos National Laboratory on studying the benefits and risks of using generative AI in an active laboratory. So specifically, this experiment involves using AI to help someone who might not be skilled in molecular biology to perform basic biomedical tasks, specifically genetically engineering E. coli bacteria to produce insulin.

And, uh, they're saying that, uh, this is looking into how you can use AI to help scientists, uh, do

Jeremie

their work. Yeah. And, uh, they highlight, actually correctly, I was actually happy to see this, um, there's this research that we talked about on the podcast that is often poorly reported or misunderstood. Uh, OpenAI

actually found that GPT-4, their internal version, not the one with all those sort of safety bells and whistles on it, but their internal version, um, could actually give some, some uplift in terms of delivering information that could lead to the creation of biological threats. So we're already there. And we saw, this is maybe kind of like the same way GPT-3 was with coding: it could help you write some pretty, like, boring functions, maybe GPT-3.5.

Um, we're already starting to get indications, objective indications, measurable statistical indications, that it is meaningfully helpful for, uh, the development of biological threats. Uh, I think a lot of people think that the opposite is true. There was a, a RAND report that incorrectly concluded, uh, otherwise. Well, not incorrectly: they had, they didn't have access to the internal version of GPT-4, and that difference turns out to be relevant.

Um, so anyway, this is essentially a partnership to, to see, like, to integrate more tightly between, um, OpenAI's models and folks at Los Alamos, you know, the national lab, very famous, prestigious lab, um, to see how can, how does that actually help researchers in the real world? And I'm so glad to see this, because it's great that this, uh, information, this knowledge, is being piped into government.

Like, you want government to be situationally aware, so that if there's a new model that comes out that does give meaningful lift, you don't just have a statistical understanding, you have a hands-on understanding of how does this translate in the lab, concretely, to actual threats. They may actually learn that OpenAI's earlier results, the GPT-4 ones, uh, were actually, uh, incorrect, and, and that in practice, these just don't translate into risk in the lab in a real setting.

And that would be a great thing to learn, right? So you want real researchers working with this, uh, in the real world. And, and it's just really, uh, really great to see this here, both for the positives and the negatives, right? We just talked about the negative side, but the positive side is, if it does accelerate, uh, good research, you know, like, like the discovery of, uh, of important facts in biomedicine, that's amazing, right?

So anyway, I think all, all kinds of reasons to, uh, to sort of celebrate this, this move and glad to see OpenAI pushing this partnership forward.

Andrey

Next up, we have a judge dismissing coders' DMCA claims against Microsoft, OpenAI, and GitHub. So this has been something ongoing for a while.

It's a billion-dollar class action lawsuit against GitHub about the use of intellectual property to train their GitHub Copilot AI, which helps coders by auto-completing their work. The lawsuit has complained that OpenAI scraped GitHub and used human-created code snippets without permission, compensation, or credit, and the judge has dismissed the claims stemming from the DMCA angle of that because the claimants have failed to show their code was reproduced identically.

Jeremie

And this is, this is something that, um, again, the, the great lawyers who listen to the podcast, uh, it'd be great to get your, your insights on this. This was held up, as I understand it, as one of these landmark cases (this was back in 2022 when the, the case first started), um, that a lot of people were watching because it has so many implications for the, the copyright picture.

It seems like, my read on this, if the judge is saying this is because the claimants, quote, yeah, as you said, failed to show their code was reproduced identically, we're not really looking maybe at, at precedent setting. Um, or like, there's not, there may not be that much to learn from this one after all, if it's just a kind of evidentiary failure, if I'm, you know, if I'm reading this right.

And so, uh, to the extent that that's the case, you know, maybe this is, um, I don't want to say a nothingburger, but, but it's, it's kind of less meat on the bone than we might've expected, right? If the claim, if there actually was evidence that, like, no, I can show and convince a judge that, in fact, the code was reproduced,

I kind of feel like that's a more interesting domain, because then we're looking at, you know, having judges wrestle with the, uh, the core question of, you know, if you do see this, um, you know, what does it imply about, uh, about the, uh, corporate responsibility on the part of OpenAI. But again, not a lawyer. Uh, we have a lot of great ones who listen to the show. So we'd love to get, uh, love to get that perspective.

And, um, yeah, very curious what, what we can actually draw from this and learn from it.

Andrey

And the last story of the section: a former OpenAI safety employee said he quit because the company's leaders were "building the Titanic" and wanted newer, shinier things to sell. So this is coming from William Saunders, a former safety employee at OpenAI. And in this piece, he compared, uh, the, uh, work of OpenAI to the Titanic, in the sense that the Titanic was supposed to be unsinkable, but then it was built, uh, without enough life rafts. And of course, it then ended in disaster.

Uh, so Saunders, Yeah,

Jeremie

I mean, so this is, uh, Will Saunders. He is one of the signatories to the famous open letter, of course, that, uh, that came out a few weeks ago, actually fun story. It was while I was in DC, uh, doing briefings on the Hill, literally while we were there talking about exactly this issue set. The thing came out. So, uh, we, you know, a bunch of these, these whistleblowers came forward. Um, you know, a lot of them taking some risks to do it.

I, um, you know, a mix of current and former folks there. And so anyway, William Saunders was among them. This is a consistent thing that we do hear from a lot of folks at OpenAI. Um, just this concern that it's now a product-driven organization, that the lofty, uh, sort of safety messaging of days past is truly of days past, and that, um, they ought to be considered, you know, really a for-profit company in the true sense of the term.

Um, it's, you know, it's, it's unclear how much that, that, uh, translates into practice at the level of leadership. The problem is that when you have these questions being asked, it does say something about the company's culture. Obviously there are questions about Sam Altman's credibility on the safety and security side. Um, and a lot of people have been asking those questions.

So yeah, here's, here's yet another, uh, this time with a Titanic analogy. It seems like we've got a lot of different analogies floating around for what's going on over there, but all sort of consistent. And, uh, yeah, he's basically saying things like, we, we need to be delaying the launch of new models more, to give them more, uh, safety testing. To OpenAI's credit, and it must be said, you know, when GPT-4 was first developed, it took, uh, the better part of six months to actually do all the red teaming and all that.

it took the better part of six months to actually do all the red teaming and all that. But of course they did all the red teaming, they launched it, and then like a year later we learned, oh, this thing can be used to automate the discovery of zero-day and one-day cyber vulnerabilities. So a bit of a mixed story there, but still, it is complex, there's no question. It's not like Sam Altman

is sitting there thinking about how to end the world, presumably. But I think there are a lot of folks leaving OpenAI who are concerned about the pace of things, and so here's Will Saunders stepping up and saying just that.

Andrey

And onto the last section, synthetic media and art, where we have a couple more stories. The first one is that Vimeo joins YouTube and TikTok in launching new AI content labels. Vimeo, if you don't know, is very much like YouTube, a video hosting service. Their new terms of service now require creators to disclose when realistic content is created or manipulated with AI, to prevent confusion.

When you publish content, there is now the ability to disclose that it includes AI-generated content that appears real, and that covers both audio and visuals. If you disclose that, then on the video, alongside the info on when it was uploaded, there's going to be a little label that says "includes AI." This is already something that's in YouTube; I've noticed when I've been uploading the podcast videos there's a prompt for that.

So yeah, that seems to be like a new standard that probably every platform will support.

Jeremie

It's part of Vimeo's complete-breakfast strategy here. They are also looking to build their own AI-powered detection software for this. But at the moment it is up to you as a creator to label your AI-generated content, so I think a bit of a stopgap strategy for now. Nice to see this norm appear; I think it is probably going to be a norm on a lot of these platforms. Doesn't mean it won't be abused.

Doesn't mean people won't just fail to flag their content that way, but it's a good start. And to the extent that people do flag their content as AI-generated, of course, that only helps the platforms train their detection models. So hopefully that helps with the arms race between generation and detection. So yeah, I guess kudos to Vimeo on that, and YouTube and TikTok were already on it.

Andrey

Next, we got "tech startup aims to help media license content for AI training." This is from the startup Avail, and their new product, Corpus, aims to help smaller media companies and independent creators license their content to AI model developers. So this is, for instance, partnering with the Spanish-language podcast producer Sonoro and a short-form video network, Man Realities, and they are planning to work with these sorts of YouTubers and other creators, helping them

get some payment for the data they create, instead of having their data scraped, as has presumably been the case. To me, a really interesting story; it seems like a very intuitive thing to address, and I think this is the first initiative of that type I've seen from a startup. It will be interesting to see whether smaller media producers will actually be able to get some payment for their data.

Jeremie

Yeah. And I think the thing that's causing them to jump into this space is the observation that OpenAI, Google, Meta, all these companies have been doing these piecemeal, onesie-twosie deals, where OpenAI will go up to, like, Time Magazine, and they'll go up to Bloomberg or whatever and be like, can we license your content? And there's, I guess, an unwritten rule of startups that's sort of like for loops when you're coding, right?

If you find yourself doing the same thing multiple times, you turn that stuff into a function and just call it. Well, same idea here: if you find a lot of different people trying to do the same thing and reinventing the wheel, that's maybe a startup waiting to happen.
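
For anyone who wants the coding analogy spelled out, here's a minimal sketch of the "repeated code becomes a function" idea; the publishers, deal terms, and function name below are made up purely for illustration and are not from the episode.

```python
# Repeating yourself: each licensing deal is assembled by hand, one at a time.
time_deal = {"publisher": "Time", "rate_per_article": 0.05, "term_years": 2}
bloomberg_deal = {"publisher": "Bloomberg", "rate_per_article": 0.05, "term_years": 2}

# The "function" version: write the repeated steps once, then call them per publisher.
def make_license_deal(publisher: str, rate_per_article: float = 0.05, term_years: int = 2) -> dict:
    """Package the repeated deal-making steps into one reusable place."""
    return {
        "publisher": publisher,
        "rate_per_article": rate_per_article,
        "term_years": term_years,
    }

# One loop replaces the copy-pasted deals above.
deals = [make_license_deal(p) for p in ["Time", "Bloomberg", "Sonoro"]]
print(deals)
```

The point is just the refactoring pattern: write the repeated steps once, then call them for each new partner, which is the same shape as a startup productizing a deal that everyone keeps negotiating from scratch.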

And certainly when it comes to democratizing access to these licensable materials, not everyone can afford to go out to the New York Times or whoever else to set up a big flashy deal. So trying to create a platform that allows this sort of activity to unfold is an interesting play. They've got some decent VCs on the cap table; General Catalyst is there, and a couple of others. So anyway, it's an interesting play.

We'll see. And I'm very sensitive as well to the legal realities, right? If we get a ruling that comes out and says, hey, this kind of licensing is just a requirement to satisfy copyright law, then companies like this start to look extra interesting.

Andrey

Next, we got "Etsy adds AI-generated item guidelines in new seller policy." This is going to identify and label the level of human involvement in the creation of an item. Etsy is a storefront, a website where you can buy a lot of stuff, a lot of it handmade, and that's kind of a differentiator. But some users have started to game the system by creating stuff with AI and then using other platforms to produce things,

T-shirts, for instance, basically doing no work and trying to make some money. So now there are new creativity standards that will require sellers to classify items with labels such as handpicked, sourced, handmade, or designed. If you use AI tools to create artwork, you will label the item as designed. And there are still examples of permitted AI-generated items, like fantasy scenes based on an original prompt or a custom portrait of a buyer's pet generated using AI tools.

So again, similar to the Vimeo story, ensuring that people who use the platform do self-report, in a sense, on the use of AI.

Jeremie

Yeah, and they're also sharing this new rule that prohibits the sale of AI prompts, which I thought was really interesting. They gave a couple of examples, like "10,000 Midjourney prompts" or an "ultimate ChatGPT prompts pack." So apparently they're not going to allow that to be sold on the platform. I'm curious what the philosophical reason is for that.

It may just be that, aesthetically, it puts AI front and center on the platform, and Etsy's brand is maybe more at odds with that. Their CEO, Josh Silverman, told investors and sellers that they have this commitment to "keeping commerce human," as he puts it. So maybe that goes beyond just preventing AI-edited art and AI-generated artifacts, but also covers

things on the platform that are for AI but maybe human-generated. That's an interesting play.

Andrey

And the last story is "Bumble users can now report profiles that use AI-generated photos." Bumble is a very popular dating app, and previously they introduced a feature called Deception Detector that would detect and remove fake profiles, spammers, and scammers. Now they have added the ability to report AI-generated photos, which you might imagine are probably not uncommon if you're trying to scam people or create fake profiles or things like that. So pretty sensible,

Jeremie

feature there. And in completely unrelated news, I am completely free this Friday, unexpectedly. So just thought I'd, uh... Sure.

Andrey

Are you using the podcast as a dating service? I don't think that's what

Jeremie

it's for. Oh, is this not the appropriate venue? My wife's going to be pissed. Yeah, no, I guess it was bound to happen, right? Because the value of using Bumble is just being completely... oh, it was already bad enough with catfishing and stuff, right, when you put up some edited photo or some old photo or a photo of a different person. But with AI-generated stuff, it's like, Jesus.

It also makes me think about... oh man, now I'm worried that I'm referencing this documentary for the second time on this podcast, but I saw this documentary about Ashley Madison back in the day, right, this "life is short, have an affair" company. And it was discovered that the vast majority of the female profiles on the platform were bots.

And it just kind of makes you realize, wow, that's going to be a lot easier to do. So you need to take an affirmative stance against this kind of thing, and I guess that's what Bumble's after. I'm glad I'm not on the dating scene, I'll say that.

Andrey

Yeah, we're not quite at a point where you can legitimately date AIs, although some people, I think, are starting to. Some

Jeremie

are trying, yeah.

Andrey

We're not quite at the stage where you might actually want AIs on dating apps. And with that, we are finished with this episode. Thank you so much for listening. Once again, you can go to lastweekin.ai for more stories, and you can find the links to all the stories in the description. As always, we would appreciate it if you give us a comment or a review or share the podcast, but more than anything, we do want you to keep listening.

So tune in next week and enjoy the AI generated outro.

AI Singer

Break it down, AI. AI's reaching high. From neural nets to robots, the headlines pop. Data-driven dreams, they just don't stop. Every breakthrough, code unwritten, on the edge of change. With excitement we're smitten. From machine learning marvels to coding kings, futures unfolding, see what it brings.

Transcript source: Provided by creator in RSS feed.