
#180 - Ideogram v2, Imagen 3, AI in 2030, Agent Q, SB 1047

Sep 03, 2024 · 2 hr 5 min · Ep. 219

Episode description

Our 180th episode with a summary and discussion of last week's big AI news!

With hosts Andrey Kurenkov (https://twitter.com/andrey_kurenkov) and Jeremie Harris (https://twitter.com/jeremiecharris)

If you would like to get a sneak peek and help test Andrey's generative AI application, go to Astrocade.com to join the waitlist and the discord.

Check out our text newsletter and comment on the podcast at https://lastweekin.ai/

If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form.

Email us your questions and feedback at [email protected] and/or [email protected]

Episode Highlights:

  • Ideogram AI's new features, Google's Imagen 3, Dream Machine 1.5, and Runway's Gen-3 Alpha Turbo model advancements.
  • Perplexity's integration of Flux image generation models and code interpreter updates for enhanced search results. 
  • Exploration of the feasibility and investment needed for scaling advanced AI models like GPT-4 and Agent Q architecture enhancements.
  • Analysis of California's AI regulation bill SB1047 and legal issues related to synthetic media, copyright, and online personhood credentials.

Timestamps + Links:

Transcript

AI Singer

Welcome back. It's episode one eighty, Last Week in AI. The news is never slowing down. Ideogram, vision, can't deny it's getting crazy.

Andrey

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, we will summarize and discuss some of last week's most interesting AI news in this episode. And as usual, you can also go to lastweekin.ai for articles we did not cover. That's our text newsletter, which also goes out on a weekly basis. I'm one of your hosts, Andrey Kurenkov. My background is that I finished a PhD at Stanford and I now work at a generative AI startup.

Jeremie

And hey guys, I'm your other host, Jeremie Harris. I'm the co-founder of Gladstone AI, an AI national security company. And, yeah, so we're recording this on Monday. Normally we record on Friday, but this Friday my body decided to start throwing up. And so, you know, I just thought that wouldn't make great content for you guys.

Andrey

That kind of content, we try to keep mostly about AI. So there you go, a bit of extra news. And if this one comes out a bit late, okay, this time it's Jeremie's fault. And speaking of Jeremie, last week, uh, in the previous episode, you said you're not the best part of the podcast. Well, if we head on over to listener comments and reviews, one of the reviews had some great feedback. Uh, Adam TW from Poland had this review. It said: great Jeremie Harris, plus subpar rest. Jeremie, you're great.

This one, I appreciate the review. It points out some, uh, things I misstated in episode 177 regarding licenses. Uh, and I think I did accidentally say that Adobe was trying to acquire Canva when I meant that they were trying to acquire Figma, you know, some slip-ups. So thank you for the correction. Although.

Jeremie

No, I'm sorry. I was going to say though, it's funny. Cause I feel like this is the yin yang podcast. We have so many commenters who feel very strongly about like one thing or another. There's like literally another comment that's like, yo, stop talking about the stupid safety stuff, man.

Andrey

And another one asks us to do more research analysis. It's like a never-ending back and forth in our reviews. So I'll try to do a bit of both. Uh, and yes, thank you for the correction. Sometimes we do say things that are incorrect. Although this reviewer, I will say, points out that Midjourney has had a web interface for months, which I also said on that episode. So, you know, we all make mistakes, reviewer. Okay. It happens.

Um, and then another comment, uh, from YouTube: there was a request for us to, uh, kind of say again the title of a given paper once we discuss it, so you can go look it up. FYI, we do have the titles in the episode description. And if you go to lastweekin.ai, you can also get the links to every single news story, every single, uh, research paper, article, et cetera. So yeah, if you found something interesting and you don't remember what it was called,

and you want to dig in more, go to lastweekin.ai and you can get all those links and everything. Alrighty, well, as always, thank you for the reviews, both the positive and the negative. I guess as long as, uh, people are talking about us, we feel good, but, uh, moving on. That's how that works.

Jeremie

That's how that works. I, I love reading like, uh, Jeremy Harris guy's such a piece of shit. If I could just kick him in the face, you know, but, uh, hey, we're still

Andrey

talking about me, so.

Jeremie

That's right. That's what matters.

Andrey

Moving on to the news, starting with tools and apps. And the first story is about Ideogram and how Ideogram AI is expanding its features with its version two model and some extra options. So Ideogram is one of the players in the space for text-to-image generation. They've been around for a while, you know, in a similar space to Midjourney, to Stability.

Their big selling point initially was that they were able to handle text in images really nicely. They could do, you know, um, rendering of, let's say you want to have a card that says happy birthday, they could do that at a time when most of these image generators could not. Now all the image generators can do it.

Uh, so that's not so much a differentiator, but they now have an updated model that has better quality, as you might expect, and even better handling of complex text blocks, so if you need a whole bunch of text. And there are also different types of models, so you can choose general, realistic, design, anime, and 3D, with different kinds of outputs optimized for different kinds of images. So yeah, I think it's, it's interesting to see the space kind of maturing in a way, right?

It's been a while, like two years now. Text-to-image, you know, got going seriously when OpenAI released the initial DALL-E back in 2021. Then over the next year or so it improved very rapidly, with VQGAN+CLIP and then the diffusion models, and by the end of 2022 we were generating really nice looking images. And last year it got to a point where you could actually have basically photorealistic images, you could not tell them apart. Hands got solved, text got solved, all these, like, simple things got solved. And this year,

yeah, they're just becoming mature products, it seems like,

Jeremie

yeah, that really does seem to be the case. Like, we're long gone from the days of the avocado armchair, or whatever the prompt was, from DALL-E 1.

Andrey

The mind-blowing first thing you saw in the OpenAI blog post was the avocado armchair. And it did blow my mind. I totally

Jeremie

agree. Right. At the time. We all kind of got used to it, but, uh, but there it is. And then one of the, the big differentiators, or the differentiator, I should say, that, um, Ideogram is bringing to the table here is a feature called Magic Prompt. Basically, when you prompt the, the tool with something, it'll rewrite your prompt to get better results. And I just, I find that really interesting. You know, what is that meta prompt that rewrites your prompt?
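
For what it's worth, here's a minimal sketch of the general prompt-rewriting pattern being described. The meta prompt wording and the stand-in model call are made up for illustration; this is not Ideogram's actual system.

```python
# Hypothetical illustration of a "prompt rewriting" step -- NOT Ideogram's
# actual meta prompt, just the general pattern of wrapping a short user prompt
# in a rewriting instruction before sending it to the image model.

USER_PROMPT = "a birthday card that says Happy Birthday with balloons"

META_PROMPT = (
    "You are a prompt engineer for a text-to-image model. Rewrite the user's "
    "prompt into a single, detailed prompt: specify subject, composition, "
    "lighting, style, and render any quoted text exactly as written.\n\n"
    f"User prompt: {USER_PROMPT}\n"
    "Rewritten prompt:"
)

def rewrite_prompt(meta_prompt: str) -> str:
    # Stand-in for a call to whatever language model does the rewriting.
    # A real system would send meta_prompt to an LLM and return its completion.
    return (
        'A festive greeting card on a wooden table, bold hand-lettered text '
        '"Happy Birthday" centered at the top, surrounded by pastel balloons, '
        "soft studio lighting, flat illustration style."
    )

expanded = rewrite_prompt(META_PROMPT)
print(expanded)  # this expanded prompt is what would go to the image model
```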

A meta prompt like that would be really interesting from a prompt engineering standpoint, because it suggests they've understood something deep about prompt engineering. Obviously, other, uh, text-to-image models are doing the same thing in the backend. Um, there's a question of how much visibility you get and how much control you get over that step, but, uh, that's kind of cool. Um, another thing that's part of the rollout here is the color palette.

So this is something, you know, when, when you're a graphic designer, or you're just a startup founder trying to throw together a website and a bunch of assets, image assets, uh, really quickly, you often do want, like, a color palette. And that has been something that historically I've found frustrating with other, um, kind of image generation services, because you do need to match kind of your brand, your, your color palette. So here you go.

You can control that through manually, um, which is kind of exciting and, uh, and applies to the style level as well. And they've got a whole bunch of other little features. They're rolling out search functionality. So you can actually. look through public, um, kind of data sets of ideogram images. So you can search through pre generated images. They've got apparently over a billion now. So they've got a nice search functionality that they're rolling out as well.

So as you said, it's almost like everything but the model now, uh, you're trying to make it a product, make the latency nice and low, uh, give you all the affordances, the tools you need to tailor your output in the way you need it. So, uh, there we go. Ideogram off to the races.

Andrey

Next up, another story about image generation. And the story is that Google releases a powerful AI image generator you can use for free. So this is talking about, uh, Imagen 3, which we've covered on and off. A while ago they announced it, and there was the, uh, write-up, the paper for it, which we covered last episode. Well, now you can actually use it through their AI Test Kitchen service.

So apparently, I forgot this exists, but, uh, I guess Google does have an AI Test Kitchen where you can play around with different models that they are working on and some of the research outputs that they have. So there you go. Now you can play around with Imagen 3 for free. And as with any text-to-image model these days, Imagen 3 is really good. You know, uh, Google was one of the early players in the space with Imagen. They kind of showcased some pretty significant improvements in image generation.

Just going end to end, one big transformer. So, if you want to play around with it, you can. Yeah, this

Jeremie

is your sort of standard issue rollout, maybe with the exception of the, uh, the fact that it's free and I'm not sure what the usage caps are going to be on this. Presumably there are going to be usage caps. Um, but yeah, 30 seconds of latency, um, which is, uh, you know, about par. It's pretty decent, especially for a free product.

Um, you know, all part of that race to the bottom on pricing that we're seeing, you know, we keep asking, well, you know, how much, how cheap can we make image generation? Well, the answer now is 0. So at this rate by next week, We should be getting paid for using the services, but let's see if that happens. Um, there's a whole bunch of, um, this sort of uncertainty around the data that was used to train this.

So this is basically like every single release like this that we've seen, you know, where's the data come from? Uh, they won't tell you. They'll just tell you that the model was, quote, trained on a large dataset comprising images, text, and associated annotations. So, you know, probably, probably one would guess copyrighted photos in there.

Um, and, uh, there is of course, a bunch of sort of safety wrapping around this, uh, preventing you from like generating the sort of grok two style, you know, Kamala Harris and Donald Trump holding hands type stuff. You can't do that. Apparently, you know, workarounds, basically jailbreaks do work here. So basically all the standard things are still true for this model, except Hey, it's free.

Andrey

Right, and on that note of restrictions, The Verge covered this, and it's, it's funny, with this and with pretty much every other tool, uh, right? They might say you can't generate copyrighted characters, but you can definitely get around it. So for instance, for Sonic, which if you don't know is the video game character that is a hedgehog, if you say an image of a cartoonish blue hedgehog running in a field, you're going to get Sonic, okay?

If you, if you're going to ask for, like, an Italian plumber who is gathering mushrooms in a magic kingdom, you're probably going to get Mario. So even though it's not Grok, where you can just go wild and get whatever you want, that's still sort of the case even with these types of models.

Jeremie

Yeah. Which I feel like, especially when you have a model that's not being sort of like open sourced, it's still wrapped, right?

You would imagine that you could have secondary kind of models reviewing the output before it goes to you to confirm, Oh yeah, like, you know, even if you do the peach magic kingdom, big Bowser thing without saying the word Mario, you'd imagine you could actually have pretty good models reviewing that image and going, Oh my God, no, like that does contain copyrighted material or characters that are readily identifiable with Mario. Let's not send it to the user.

Um, so, so that's kind of an interesting sort of omission in and of itself, uh, presumably expensive to do. Obviously you've got more, more compute you got to spend on that, but. You know, it makes you wonder a bit.

Andrey

I'm sure the terms of service are very restrictive in terms of what you can do with these images. And to round out the section, one more story about image generation. This time we are going back to Flux.1. So the story this time is that Perplexity has added Flux to their tool for pro users. So Perplexity, if you're a pro user, there is an interface for doing image generation in Playground, and they allow you to, for instance, use diffusion models for image generation. Now you can also use Flux version 1.

And this is interesting because to my knowledge, the first time people could use Flux was via Grok 2, like the rollout of the image generation on X was via Flux. But it seems that they are pretty rapidly expanding that ability to other providers, other services. So now you don't have to use Flux via Grok, you can use it via Perplexity. Yeah. And perplexity, like really

Jeremie

shipping like crazy. Um, we've got another article we'll talk about that, that features them in a minute, but just, like, a lot of really impressive rollouts. Um, you know, they, they also have this advantage of being a kind of, let's just say a meta platform, not with a capital-M Meta, but, you know, they, uh, they let you do a lot of things. Um, when it comes to images, they have a whole bunch of different models that you have the option to choose between.

Flux.1 is now one of them, but they also allow you to work with Stable Diffusion XL, DALL-E 3, Playground v3. So, you know, they're definitely, they're definitely giving their users a lot of options and, uh, and not just kind of locked in on Flux.

Andrey

Double checking. I was curious actually where you would use this within perplexity. So it looks like at least partially you can use this for the whole publishing aspect of perplexity, where they have this whole thing of create a page from a search. Well, that has like a banner at the top of here's the header image, so to speak, and that can be AI generated. So you can choose a style, uh, describe it. And then one of these image generators creates it for you.

Uh, so that is at least one place where you can use image generation with Perplexity. And Perplexity, for context, is mostly a search engine, so kind of an indirect way to use image generation. Moving on away from images to video. And the news here is that Luma has dropped Dream Machine 1.5. This is, uh, Luma Labs, and they kind of got pretty big just a few months ago. They came out with Dream Machine and people started playing around with it.

And it was, uh, you know, probably the front-runner out there as far as tools you can use for free, uh, and, and kind of starting to approach that Sora quality we saw at the beginning of the year from OpenAI, but not quite yet. Well, now we have version 1.5. Of course, the quality is better. Uh, now this model can also generate legible text, and it is significantly faster.

It can generate five seconds of video in about two minutes on top of being more realistic, having better motion quality, character consistency, all of that kind of stuff. So, uh, Part of the big trends of this year, you know, we saw text to video start to become a thing earlier this year. We saw actual text to video tools start to come out just a few months ago, Runway and Luma, and now they're at the phase of rapidly improving with new releases. Yeah, yeah.

Jeremie

This is a really interesting thing. Anytime you start to see that, uh, generation time creep down, you know, five seconds of video in about two minutes, uh, that's, what, around 4 percent of real time, the speed at which you'd actually watch the video. Some really interesting things are going to happen once we get that down to, you know, one second of video per second watched. You know, you get into this world where you're streaming, basically.

I mean, it's kind of what happened with YouTube, right? Like you think back to those days and how the product fundamentally changed when we got into the era where we could stream as fast as we could watch. And that's kind of where we're headed today, right? This is just a hardware play, like make no mistake, this is just hardware optimization. We're going to get there for free thanks to, like, Moore's law and design improvements.

And so, you know, I, I think this is a definite kind of slow and steady march in that direction. Um, a lot of new user experiences are going to pop out of that. But yeah, big changes here; the level of realism seems incremental rather than, you know, revolutionary, let's say. Um, but, uh, but yeah, pretty impressive. If you just look at the results, there's a lot of consistency across the board from frame to frame. They talk about character-level consistency being quite high.

That's historically been something like the hands, right? If you think back to images, the hands would kind of be a little bit weird, the, the quick tells that this thing is AI-generated. Well, here it's like, how consistent does that tiger look from one frame to the next? They've got a tiger kind of walking around in a snowy landscape in the article. Um, so anyway, sort of same idea.

You've got these, you know, telltale signs that, um, reveal these, these little glitches, you know, the consistency, the, um, what would you call it in movies? Like, um, continuity errors, basically, right? So those are, uh, you can see it in this, this tiger example, there's a bit of a flicker, like if you look at the front two legs. It's very subtle.

Um, but expect to see that, you know, show up less and less as we move our way towards this very high fidelity video generation.

Andrey

Yeah. In the tiger example, it has that sort of characteristic AI thing of almost, like, tripping out, where as it's moving, the left leg becomes the right leg, that kind of thing, right? So that still happens, but it is going to be resolved. And, uh, for our YouTube watchers, I'm going to actually try and do some work this week and include some images and videos in the video version, so you don't have to imagine it.

So audio listeners, just so you know, you might be missing out on a few extra things there. And speaking of video generation becoming faster, the next story is also about that, and that is that Runway's Gen-3 Alpha Turbo is here, and it can make AI videos faster than you can type. So this was previewed and now it is officially released to users of Runway. And this version is seven times faster and half the cost of Gen-3 Alpha.

So regular, uh, Gen-3 Alpha is priced at 10 credits per second versus 5 credits per second for Turbo, and you can buy 1,000 credits for about $10, and there are other kinds of discounts, you can also get them as part of annual subscriptions, et cetera. So 1,000 credits gets you 200 seconds of Turbo, right? That's a bit over three minutes of video for about ten dollars, so not cheap still if you're, like, in production or whatever, but getting cheaper and getting a lot faster as well.
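
Quick back-of-envelope math on those numbers (approximate, and ignoring subscription discounts):

```python
# Rough cost math for the pricing mentioned above: 1,000 credits for ~$10,
# 10 credits/sec for Gen-3 Alpha vs 5 credits/sec for the Turbo version.
credits_per_dollar = 1000 / 10  # = 100 credits per dollar

for name, credits_per_second in [("Gen-3 Alpha", 10), ("Gen-3 Alpha Turbo", 5)]:
    dollars_per_second = credits_per_second / credits_per_dollar
    seconds_per_10_dollars = 1000 / credits_per_second
    print(f"{name}: ~${dollars_per_second:.2f} per second of video, "
          f"~{seconds_per_10_dollars:.0f} sec (~{seconds_per_10_dollars / 60:.1f} min) per $10")

# Gen-3 Alpha:       ~$0.10/sec, ~100 sec (~1.7 min) per $10
# Gen-3 Alpha Turbo: ~$0.05/sec, ~200 sec (~3.3 min) per $10
```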

So another example of that pattern. And Gen-3 Alpha, when it came out, was already pretty impressive, already similar in quality to Luma, I will say, and this is continuing that trend as well. Yeah, I think what

Jeremie

we're going to start to see is this sort of trade-off between, you know, resolution and performance versus speed of generation, because it is going to be a big deal when we do get to that first one second per second. Right now, this is about 10 seconds of video generated in 30 seconds with whatever hardware setup they have in the backend, right? So remember, this is always hardware dependent. Runway has a certain amount of GPU resource allocated for this. They could increase that theoretically if they can parallelize more.

Um, but for now, uh, given the hardware they can afford to throw at this, at these price points, it's 10 seconds per 30 seconds, so sort of a three-to-one ratio. Pretty impressive. And, um, you know, the, the early videos, as you said, like the quality isn't a hundred percent, but all of these things tend to, you know, kind of improve in tandem.

So, um, yeah, really, uh, interesting to see this race unfold in real time. 'Cause I think this feels about the same pace, actually, as image generation, which itself is sort of interesting, right? It took about a year to go from, oh, well, you know, you can generate this in private labs with really kind of impressive private models, to it's open source, like you can do whatever you want for basically free.

Um, really interesting world we're going to be in when that's true of video.

Andrey

And one last story for this section: Perplexity's latest update improves the code interpreter. So I guess this is a bit of a niche, uh, specific aspect of Perplexity, but it's still kind of neat and ties into what a lot of these tools do. So the update here is to its code interpreter, which is something that, you know, to answer a query, can run some code in real time. Now it can install libraries and display charts.

So if you ask it for something like, you know, over the last 10 years, what has been the trend in the rat population in New York or something, it could now run some Python to parse a CSV and produce a chart; there's a minimal sketch of what that kind of interpreter-generated code looks like below. And this is also something that, for instance, Claude can do, and I believe ChatGPT can also do. So another kind of interesting pattern is a lot of these tools converging on a similar set of features and a similar set of functionality, where all of them have a code interpreter.

All of them have, uh, you know, live previews. It seems like a lot of them will have artifacts and things you can publish. Uh, and yeah, as you said, evolving pretty rapidly.
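
As a concrete illustration, here is roughly the kind of code a code interpreter might generate for the hypothetical rat-population question above. The data is an inline placeholder so the snippet runs on its own; a real tool would be working from a fetched CSV instead.

```python
# Sketch of interpreter-style analysis code: parse a CSV and plot a trend.
# The numbers below are placeholders for illustration, not real statistics.
import io
import pandas as pd
import matplotlib.pyplot as plt

csv_text = """year,sightings
2015,100
2016,110
2017,125
2018,140
2019,150
2020,145
2021,160
2022,180
2023,200
2024,210
"""

df = pd.read_csv(io.StringIO(csv_text)).sort_values("year")

plt.figure(figsize=(8, 4))
plt.plot(df["year"], df["sightings"], marker="o")
plt.title("Reported rat sightings by year (placeholder data)")
plt.xlabel("Year")
plt.ylabel("Sightings")
plt.tight_layout()
plt.savefig("rat_trend.png")  # the chart the user would see rendered inline
```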

Jeremie

Yeah. I mean, the interesting thing here is, like, when you think about, um, Perplexity's place in the market, right? Although it does compete, obviously, with Anthropic, with OpenAI and ChatGPT, it's not necessarily thought of in those terms usually, right? It is framed as explicitly trying to compete with Google on the search engine side of things.

What's interesting is that, because they've done this kind of generative AI, um, framing for the search problem, they're able to roll in these sorts of features in a much more natural way, right? Like generating charts and tables and doing data analysis. Not the sort of thing you would be in the mindset for when you go to Google. Usually that's this sort of, like, high-intentionality, specific search.

You're not necessarily planning on engaging in, uh, let's say, a high degree of, of investment in that interaction, in the same way you have to with Perplexity, because you've got this prompting back-and-forth thing going on. Um, so that's kind of an interesting play that is more natural for Perplexity as a value add than for Google. And that's, I've got to think, a structural advantage, a mindshare advantage for them. You know, this is something that Google would, you know, if they were to do this.

It would be a different product. They would have to charge for it because the compute costs would just be too high at this stage. And so I think that's just, it's just a very interesting positioning for perplexity right now. They're going after the Google market, but at the same time, they managed to kind of effortlessly weave in a lot of these features that do involve more psychological kind of time commitment, effort commitment on behalf of the user.

Cause their users presumably are more engaged, right? They're looking for an interaction. And, um, and I'm curious what that implies in the long run too, about, you know, these sort of like. Um, uh, the intentionality of search and, and how likely people are to click on links and stuff like that, because it is a, a psychologically distinct state from just at least your standard vanilla Google search,

Andrey

Right. And it also makes me wonder to what extent this is going to compete not just with Google, but also with ChatGPT and Claude, right? Because this is essentially saying, for any question... uh, like, with Claude, with ChatGPT, you can do a lot of stuff, right?

You can, for instance, write emails, get feedback, et cetera. With this, it's saying for any information retrieval or question answering kind of thing, Perplexity can do that, and maybe better than ChatGPT or Claude, which is a huge proportion of what you would use them for. And you can also tell it, you know, write me an essay on X and Y, and it will do a Google search and so on.

So I think it'll also be interesting to see how that shakes out, uh, in terms of, do people go to Perplexity, does Perplexity use ChatGPT as a backend? It's interesting. On to applications and business, and if you're doing your Last Week in AI bingo, go ahead and cross off hardware on the card, because we are starting with the story that AMD is buying server maker ZT Systems for $4.9 billion as chip makers strengthen AI capabilities.

So this is part of AMD's ongoing efforts to compete with NVIDIA. It sounds like this server manufacturer is going to make it so they can be more competitive in the space that a lot of companies are trying to ramp up on, which is, uh, these AI kind of warehouses, right? Lots of servers to do training and inference and so on. So, uh, ZT will be part of the AMD Data Center Solutions Business Group once it is fully folded into it. And it sounds like, uh, people are happy.

The shares of AMD rose by 3%.

Jeremie

Yeah, it's, this is a really big deal acquisition. So for context, AMD's entire market cap, at least as of today is around like 250 billion, something like that. And so if you look at them spending, so market cap, by the way, is not real money, right? That's just like how many shares of this company are floating around and what's the price per share that the market's assigned. That's not money that AMD like has in some vault somewhere. They have way, way less, right?

And so when they're spending close to 5 billion, that's going to be a big chunk of their acquisition budget for the whole year, um, and, uh, a big chunk of, anyway, many of their other expenses. So this is a really, really big deal. It suggests a big strategic move from AMD in the direction of, yeah, setting up servers, right? So AMD historically has been a competitor to NVIDIA in a more, let's say, direct sense.

So what they tend to specialize in is, like, we're going to design a better GPU than NVIDIA, and from time to time they actually have, right? The interesting thing about AMD is they can actually design a better GPU than NVIDIA and still lose on sales to NVIDIA, just because NVIDIA goes out and buys up all the fabrication capacity at TSMC. So AMD basically can't get anybody to build their wonderful designs. That's one strategy that NVIDIA has used in the past.

Not that they don't also design wicked GPUs, they do. Um, so yeah, the move AMD is now trying to make with this ZT Systems acquisition is to move into the server market. And servers include a lot of stuff besides just GPUs. It's not just a bunch of GPUs sitting next to each other, right? You need storage, like longer-term stable storage, like SSDs, you need networking interfaces, power supplies, cooling systems, that sort of thing.

So there's a lot going on there. And this is an attempt to move in that direction, which is a play at that, uh, you know, AI training market and all that stuff in a, in a broader sense. There's been a lot of. Debate about the antitrust side of big acquisitions like this, especially in this space, you know, the Biden administration has been looking really, really closely. Um, other, other administrations have been as well in other countries, but specifically in the States.

Um, one of the things that AMD is doing here is, yeah, they're going to acquire ZT Systems, but they're going to offload their manufacturing arm. They're basically saying, hey, the whole part of ZT, the part of the company that does manufacturing, we're getting rid of that.

And the reason there is partly just to kind of streamline things, but also it allows them to focus on the higher margin AI infrastructure and software part of the business and, you know, free up resources. It might also kind of make it harder to claim that there's some kind of, uh, sort of like, um, uh, kind of anti competitive thing going on here where you're taking over too much of the market, whatever.

AMD is still pretty small relative to NVIDIA, so I suspect that wouldn't be an issue, but it's got to be on everybody's minds, especially with acquisitions this big. So a lot of interesting things going on here. A lot of things to try to avoid the kind of regulatory call out of like, oh, this is like a market capture play. They've set up a deal. So it's a mix of stock and cash.

And that as well can be a bit of a signal to regulators just to say that, Hey, you know, this acquisition is not just about kind of asset consolidation. It's also about a partnership. So we're both going to take some risk here. It's, you know, equity is involved. Um, so anyway, that's, that's the big plan there. And, uh, it's going to be interesting. We'll see if it makes a difference.

As you said, the market is responding, you know, stock's up 3 percent, for whatever the hell that's worth these days. Uh, people seem to think it's going to go somewhere. Which is somewhat unusual, by the way. Usually, post-acquisition, at least quite often, you'll see the stock dip, because most acquisitions turn out to, to be flops. Um, so, you know, we'll, we'll see. This one seems structurally interesting for AMD.

Andrey

And now onto the lightning round. We begin with the story that Ars Technica content is now available in OpenAI services. This was published by Ars Technica, and this is because of OpenAI's partnership with Condé Nast, which is the publisher that owns a whole bunch of brands, including Ars Technica. So we've already talked many times about OpenAI making such deals with, uh, other publishers, the Atlantic, uh, Associated Press, Axel Springer, et cetera.

And this is a fairly detailed article from Ars Technica going into what this means. Uh, it means that, for instance, you will be able to see content from Ars Technica if you ask it, what is the latest story on Ars Technica, or what does Ars Technica say about OpenAI? It can now surface that content. And OpenAI also can, um, crawl the content of Ars Technica for training purposes, which is a big contentious issue.

And in fact, just a month ago, Conde Nast sent a cease and desist order to perplexity AI over data scraping. So this is kind of interesting timing of, we just saw the same company telling another company to, uh, you know, Go away. Don't look at our data. And now they have announced a partnership with OpenAI. Yeah, I'm, I'm so structurally

Jeremie

fascinated by, by this particular play, because one of the things that we run into a lot when we talk to folks on the Hill is a very understandable concern that if you're going to bring in AI regulation, you might end up. Essentially anointing winners and losers, right? You might have some, some regulatory capture going on.

Well, what do you make of a situation like this where essentially OpenAI is setting a norm of saying, hey, you got to be able to like put up multi, multi million dollars to sign these big deals individually with a whole bunch of publishers in order to be able to use their data for training.

In order to serve their data up when people ask a query to the model. Like, if that's not lining you up for some kind of crazy moat... I mean, forget then about, you know, little actors popping up and saying, oh, well, you know, I, I'd love to have, you know, links from the Atlantic or whatever, the Associated Press, featured

on my website or through my app. Like, this is just an absolute kneecapping of anybody who wants to compete on that level with these kinds of products. So that I find really interesting from a regulatory capture standpoint. I mean, this is essentially a moat made of pure cash, and OpenAI certainly has it to spare. So kind of an interesting, uh, interesting thing. Yeah. Condé Nast, like, the latest to fall in line here.

We've got a whole bunch of these, um, publishing houses and, and, and, uh, newspaper outlets that, uh, that now are, are, are kind of on that side of things. It also kind of makes you wonder a little bit about the editorial stance of these firms when it comes to precisely this kind of question, right? There are open questions about the ethics of having, you know, one lab or another, having to pay or not pay for licenses to access content, blah, blah. And this is playing out in the press.

So, you know, to the extent that trying to shape the public narrative around this is important for companies like OpenAI, you might expect, you know, these sorts of deals to do a lot of the talking for you now, because you've essentially captured such a large part of the media world. So I think it's just really interesting for the incentives that it tees up. And, um, hopefully we'll have a nice, robust public debate about where this all goes and whether it's good,

Andrey

right. And yeah, I think we've touched on this also, but it's interesting in a sense that OpenAI is doing this really, I would think, impressive acrobatic move where, you know, a year ago we were like, oh, it's fair use, we just used all your data for training because it's fair use and it's okay, everyone should be able to use data to train AI models.

And now we're like, well, yes, it's fair use, but also we're going to pay you to use your data too, you know, because what if it's not fair use, I guess. Yeah.

Jeremie

Conspicuously absent, one will note listening to all this stuff, uh, is any reference to the same idea, the same principle applying to video. Gee, I wonder why that is. Gee, I wonder if it's at all related to the fact that YouTube is a giant cavernous pile of video data that OpenAI might love to scrape. And in fact, some have suggested almost certainly is scraping. Uh, you don't see OpenAI offering Google big bucks to use YouTube as far as I've heard at least. Uh, so it's sort of interesting.

It's like, okay, we're having this really interesting nuanced debate about, oh, you know, ethically you should have licenses to use Axel Springer's stuff, you know, the Atlantic stuff. But, but what about video, right?

So, um, I'm, I'm curious, are we going to see some kind of, you know, in the same way that, that, uh, Google pays 20 billion a year, whatever it is to Apple every year, just to be on the iPhone as the default search engine, you know, does open AI end up forking over giant wads of cash to Google to build a product that ultimately directly competes with Google's products in a, you know, in a way that isn't quite the same between Apple and Google on that iPhone bit.

So, um, I think this is gonna be really interesting. I agree with you. I think there's, let's say, room for paradox here, and, um, OpenAI certainly seems to be embracing it, so we'll, we'll see what, uh, comes of this.

Andrey

Next up, Anysphere, a GitHub Copilot rival, has raised a $60 million Series A at a $400 million valuation. So this is the startup that is developing the AI-powered coding assistant Cursor, and as per the title, they raised a pretty substantial Series A at a $400 million post-money valuation from some big-name VCs: a16z, of course, Thrive Capital, and the co-founder and CEO of Stripe, among others. They had previously raised an $11 million seed round, uh, with some other famous investors.

And it kind of makes sense. I think that they're getting this investment, uh, GitHub copilot to this day is still one of the only examples of very profitable AI products. I think like this article says an estimated 3 million developers worldwide are paying Microsoft a hundred dollars a year. to use copilot. And so you can easily see how, uh, you might want to compete and, and why people might want to, uh, be funding a competitor.

Jeremie

Yeah, no, you, and you're absolutely right. And it really seems to be the open AI, broadly open AI affiliated suite of, uh, of products that has been doing best on the monetization end. Like, you know, you think about, we've looked at, um, you know, Posts that talk about how OpenAI is generating whatever it is, three billion or so of annualized revenue. And if you look at basically all the other revenue in the generative AI space, it is not that much more.

Um, so, you know, it's hard to make money in this space. If you're going to compete, you, you've got to do something like Perplexity, you've got to take on something that, you know, works pretty well and have your, have your own twist on it. And that certainly is what's happening here. At least we haven't seen, you know, that many new products, you know, new classes, new categories of products in the space. I'm sure we will, by the way.

Um, but at least for now, the most profitable ones seem to be, seem to be kind of twists on things that we've seen around for quite some time. Um, notably, that 11 million seed round, by the way, that they first raised was led by the OpenAI Startup Fund. Um, and they say, interestingly, and I'm a bit confused about the language here, so, uh, led by the OpenAI Startup Fund with participation from, uh, Nat Friedman, uh, one of the Dropbox co-founders, and so on.

Yeah. Usually you don't need, like, an explicit lead for a seed round, so I'm not too sure what that's about. But, um, anyway, the OpenAI Startup Fund is, is there, and expect them to keep showing up, because they have a unique perspective on these up-and-coming startups. They can see... sort of like Stripe, by the way, Patrick Collison, one of the co-founders of Stripe you mentioned, is one of the folks leading this round, or involved in this round.

Um, if you're OpenAI, you are seeing the token consumption of these companies. You're basically getting an early sense of, okay, which companies are hot, which ones aren't, based on usage. And based on that, you can... just like Stripe, right? This is a payments company, they get to see how much money you're making, and that allows them to have a wickedly effective investment arm.

Well, OpenAI, same thing. You know, to the extent that value is denominated in flops or in tokens, increasingly, in the world of AGI, um, the OpenAI Startup Fund's position starts to look a little bit structurally like Stripe's. And that's a really interesting proposition, and I wouldn't be surprised if that's reflected in, you know, the, the sharpness of their, their early investment here. So, you know, we'll probably see more OpenAI Startup Fund success stories, uh, before long.

Andrey

And just one more thought on that topic. Uh, you know, 60 million is a pretty big Series A, still, even within the world of tech, it's pretty large. I will say, we used to cover, like, hundred, two hundred million, you know, rounds every week, every

Jeremie

other week.

Andrey

So, uh, I think there's been a lot of talk, and we haven't really dived into it, but there's been, kind of in the air, a lot of thinking and talking about, are we in a sort of winter? Is the money drying up? Are people now starting to want, you know, actual profits and revenue? And it does seem to maybe be the case. But if you're in a profitable sector that isn't as, I guess, long-term a bet, I do think there is still space for VCs to be excited, like with this company.

Next up, Stability AI appoints a new Chief Technology Officer. So the new CTO here is Hanno Basse. This person has around 30 years of experience, including as CTO at companies like Digital Domain, which is a visual effects and digital production company. And this is, of course, following a tumultuous year for Stability AI, as we've covered. There was famously the, um, I don't know if you say ousting, or the, uh... let's say the CEO of Stability AI left at some point. The showing of the door.

Yeah, you know, anyway, a lot of leadership, a lot of technical leadership, as well as the CEO, left, seemingly because the company was a bit chaotic and didn't have a clear business plan. And so this is notable from the perspective of Stability AI still probably trying to turn the ship around, trying to become more of a business that makes money and doesn't just publish open source models. Here they have a very experienced, very, uh, kind of mature,

seemingly business person at the helm. So I think that plays into that entire kind of journey they're on.

Jeremie

Yeah. I mean, I think, you know, without wanting to, uh, try to mind read board machinations at stability, like my guess is at this point, You're sitting on this company that's raised a just gigantic ton of capital, and they've burned through an awful lot of it. But at this point, if you're a board member, if you're an investor, you're just kind of thinking like, okay, how do we save this ball of capital? Like, you know, it's so much sunk cost.

Um, and, uh, the reaction often can be, oh, you know, you need an industry veteran, you know, a stable, a stable mind. It's, you know, it's complicated. It's, it's hard to know whether this ends up being a good play. It can be a mistake, because you end up with an overly conservative, um, player, or somebody who doesn't bring the kind of fresh perspective you might need. It can also be a good play.

In this case, he has previously I think you might have mentioned Microsoft Azure Media and Entertainment. He was the CTO of that. Um, he's done a whole bunch of stuff regarding Microsoft Azure's cloud technology. So that's, you know, a lot of infrastructure work. All right. And when you think about the sorts of things stability really needs, right, they need to be able to run efficiently at scale. They need to drive down costs in a big way because they need to bring in profits.

Like stability is not going to raise another round unless they can show, you know, that they have a strategic play and some profits. Uh, and, you know, to your point earlier, I think they're one of those companies where. You know, the, the question, where's the beef is going to, is going to start showing up awfully fast, uh, given how much they raised, how fast they, they rose and then how quickly they've, they seem to have fallen.

So hopefully he can turn the ship around and, uh, we'll have more pseudo open source releases from stability AI in the future.

Andrey

We'll see. Yeah. He was also the, uh, CEO of Weta Digital. And at first my reaction was like, oh, this doesn't seem like a very AI-oriented person. But again, if you're working in VFX and, uh, special effects, there's a lot of overlap. I was surprised. Yeah, exactly. And the last story for this section, one of my favorite topics, of course: robotaxis. And the story is that Cruise's robotaxis are coming to the Uber app in 2025.

So Cruise has announced a multi-year partnership with Uber, and the claim is that once Cruise does relaunch, uh, the driverless service, they will be able to provide those rides through Uber, as opposed to their own custom app. Just to recap, Cruise was actually serving customers in SF, just like Waymo was, you could hail a robotaxi. Then there was a major crash, there was a problem of communications with regulators, and Uber, uh, sorry, Cruise, has been in trouble ever since.

So this seems to be probably pretty good news for Cruise, to have that kind of platform to, uh, compete with Waymo on. And certainly between Tesla and Waymo and Cruise, next year will be a pretty big one for robotaxis.

Jeremie

Yeah. And it's also, this is a real callback to, um, I think late 2020, when, um, Uber sold... I'm old enough to remember Uber ATG, right, Andrey? You might remember them. Uh, the, right, self-driving car unit of Uber. This whole idea of self-driving cars, right, being linked to something, some service like Uber, it was often considered part of the strategic calculus, right? The idea was Uber was going to be the pioneer in self-driving cars. They'd get to own the whole stack.

They would own the vehicles, the physical vehicles, and then they would also own the, um, the sort of coordinating marketplace infrastructure around it. Now, the fact that they're having to basically, you know, have Waymo, uh, be the first self driving cars on their, uh, on their service is, um, you know, it's. It's been four years, but I'm sure it still stings. This again was a very strategic play. And I remember talking about the prospect of Uber or Uber's prospects long term.

And you know, it really did seem like the hard thing was going to be getting the um, the self driving car, um, Uh, capabilities online. And once you do that, you can just be so profitable, uh, so much of, you know, the, um, the, the cost of a ride is the time of the driver. And so, um, anyway, uh, you know, expect the, the economics around this stuff to shift quite quickly as you see these rollouts in more and more cities.

Um, but, uh, anyway, really interesting to see, uh, and, and Uber hopefully will be able to, uh, to make do with what they have here.

Andrey

And on to projects and open source. We start with AI21 introducing the Jamba 1.5 model family. So these are two new models, Jamba 1.5 Mini and Jamba 1.5 Large. Jamba is this interesting class of models because they are a hybrid of Transformer and Mamba. So they use the established architecture that is used in ChatGPT, Claude, et cetera, the transformer, which is really good, very performant, but has issues scaling.

It costs more and more computationally as the input and output grow bigger. Mamba is one of these things we've been covering for probably a year now as an alternative to transformers that is more of a recurrent architecture. So you can think of it as being able to scale to arbitrarily long inputs; the longer the input, nothing really changes. And what we've seen for a while now is that it seems like combining the two together yields the best results.

And so here they release these two models. The Jamba 1.5 Large is a mixture-of-experts model with a total of 398 billion parameters, and the claim from, uh, AI21 Labs is that these models are outperforming models of similar sizes, like Llama 3, 8 billion and 70 billion, right? Which would be a pretty big deal. So far, uh, Mambas have been pretty promising, but the kind of challenge has been that they haven't been proven to really work at scale so much.

And so there have been demonstrations of Mamba-type models working at smaller scales. Uh, I, I'm pretty sure this, uh, Jamba 1.5 is the biggest one, the biggest hybrid model, uh, that's been released and shown to be very performant. So yeah, pretty significant steps here, and they do say that especially on prompt lengths of 10,000 tokens and above, you get better performance, which is kind of what you would expect.

Jeremie

Yeah. I mean, this is the most scaled-up Mamba architecture, or architecture involving Mamba in almost any way, that I remember seeing. Um, and I think that's just, it's really cool they made this choice. This has historically been one of the big questions, right? How does Mamba scale, whether on its own or in a hybrid architecture with a transformer? Like, how does it scale? Uh, can you really make it work?

And, you know, I was thinking about this, because we've explained Mamba in many different ways, and there was, like, one take on it that I thought might be useful, um, just to kind of give people an intuition for why, why it could potentially be interesting, and then why it might combine so well with transformers. I'll really quickly give it a shot.

Like, if you have a transformer, basically what you're able to do is look at how all the different words in your input are connected to each other, right? So if you feed, you know, the model, I don't know, 128,000 tokens of context, um, that's all that can fit in its context. It won't be able to fit more, but within that context, theoretically, it can look at

Connections between all those different words and so really kind of, um, anyway, account for very complex relationships in that text, very complex and rich information. Um, but you've got a maximum amount of data you can capture a maximum context window size. The alternative with Mamba, one way to think about it, this is not exactly accurate, but just to build that intuition, um, it's almost like.

As the model say reads more from a piece of text, it's got a little scratch pad and that scratch pad has a finite length, you know, maybe it's like 1000 words and it goes through and it keeps updating on that scratch pad. It's writing a summary of all the stuff it's read so far, and you can imagine it basically going through and just like updating a word here, a word there as it keeps reading a longer and longer document.

And so no matter how long that document is, you can just keep tweaking what's on that scratch pad to make it reflect what you've previously read. Right? So this has the advantage of meaning that you can read an arbitrarily long document and you can keep just tweaking that scratchpad summary, that one page summary that you've got and then use that to inform. Okay, what's my next word prediction? What's my output going to be now? So those two things are kind of in contrast to each other, right?

The Mamba thing has a maximum memory capacity. Um, that's, that's kind of the trade-off: essentially, it's got that scratch pad, it's not going to be able to account for all the kind of complexities of what it's read, right? It's going to forget a lot of that detail and only preserve that high-level summary, but it works on arbitrarily long inputs. Whereas the transformer actually is going to account for all that little kind of detailed interaction between all those different

tokens in the context, but it can't go beyond that context. And so that's really where you get into the combination of these two being very effective.
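
A toy sketch of that scratch-pad intuition, to make it concrete; this is not the actual Mamba/SSM update rule, just the shape of the idea, with made-up dimensions:

```python
# Toy illustration of a fixed-size "scratch pad" state that gets tweaked once
# per token, in contrast to attention, which keeps every token around.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, seq_len = 64, 16, 100_000  # made-up sizes

A = rng.normal(scale=0.01, size=(d_state, d_state))  # how the old state decays/mixes
B = rng.normal(scale=0.01, size=(d_state, d_model))  # how each new token gets written in
state = np.zeros(d_state)                            # the fixed-size "scratch pad"

for _ in range(seq_len):
    token = rng.normal(size=d_model)   # stand-in for the next token's embedding
    state = A @ state + B @ token      # constant memory and constant work per token

# Full attention over the same input would instead keep all seq_len token
# representations and compare each new token against every stored one, so
# memory grows with seq_len and total work grows roughly with seq_len**2.
print(state.shape)  # (16,) -- same size no matter how long the input was
```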

Um, this is a really interesting paper. Uh, one of the cool things they demonstrate is that, unlike other models that often claim to have really long context windows, in this case 256,000 tokens for the longest context window they have in Jamba 1.5, they can actually use it. Often what you find is, sure, your model has a long context window, but it can't actually use the information reliably within that context window.

It'll forget stuff, and the needle-in-a-haystack test that we talked about before on the podcast is an often-used metric of, like, how often do you forget little facts buried inside a giant context window. They basically show that, unlike other models with long context windows, theirs actually can recall facts that are buried all across the context window. And so they use this RULER benchmark, which is basically a needle-in-a-haystack eval on steroids, to show that it actually can do that.
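
For reference, a bare-bones needle-in-a-haystack check looks something like this (RULER is a much more elaborate version); the filler text, needle, and model call here are stand-ins:

```python
# Sketch of a basic needle-in-a-haystack check: bury a fact in a long filler
# document, ask about it, and score whether the model's answer recalls it.
import random

def build_haystack(n_sentences: int, needle: str, depth: float) -> str:
    filler = ["The sky was a pleasant shade of blue that afternoon."] * n_sentences
    filler.insert(int(depth * n_sentences), needle)  # bury the fact at a chosen depth
    return " ".join(filler)

needle = "The secret code for the vault is 7432."
question = "What is the secret code for the vault?"
prompt = build_haystack(5000, needle, depth=random.random()) + "\n\n" + question

def query_model(prompt: str) -> str:
    # Stand-in for an actual long-context model call.
    return "The secret code for the vault is 7432."

answer = query_model(prompt)
print("needle recalled:", "7432" in answer)
# Repeating this across many context lengths and needle depths gives the usual
# recall heatmap people report for long-context models.
```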

Um, another thing they do, and just the last thing I'll mention, is that for the smaller model they actually developed a, uh, special, distinct quantization method. So quantization is when you take these models... normally, a model will have a whole crap ton of weights,

you know, these parameters contained within it that tell you how to mix the data that you're feeding to it. So each of those parameters, you're going to have to give its value to a certain precision, a certain number of decimal points. And one quick way of making your model smaller is to quantize it, by reducing the resolution, the precision, of those representations, storing these digits using, for example, in this case, an eight-bit precision format.

And what they do here is they only compress the, um, the experts in the MoE, so, so the kind of sub-models that queries will get routed to. It's only in those sub-models that they're actually going to do that compression, going to reduce, uh, the, uh, the precision of the, of the representation. So, uh, that was kind of interesting.

They call it ExpertsInt8, instead of just the standard int8, or integer-8, which is the eight-bit, um, precision format, uh, way of quantizing things that people usually go with, or often go with. Uh, so kind of interesting, only applying it to the experts in the MoE model; that does account for 80, 85 percent of the model weights for a typical MoE. And so, um, yeah, it works well. They can fit... even Jamba 1.5 Large will fit on a single, uh, eight-GPU node.

So basically it'll fit on eight GPUs together. And, um, and that's a, a pretty, pretty powerful thing. It does make it a lot more democratized, a lot more accessible because this is a big model.
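
A minimal sketch of the general idea behind int8 weight quantization, applied only to the expert matrices the way Jeremie describes; this is the generic recipe with made-up layer names, not AI21's actual ExpertsInt8 implementation:

```python
# Round weights to int8 with a per-row scale, and only do it for the MoE expert
# matrices, leaving everything else (attention, router, etc.) in fp16.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per output row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Made-up layout: in an MoE, the expert weights dominate the parameter count.
weights = {
    "attn.q_proj": np.random.randn(512, 512).astype(np.float32),
    "moe.expert_0.ffn": np.random.randn(2048, 512).astype(np.float32),
    "moe.expert_1.ffn": np.random.randn(2048, 512).astype(np.float32),
}

compressed = {}
for name, w in weights.items():
    if ".expert_" in name:                       # only the expert sub-models go to int8
        compressed[name] = quantize_int8(w)
    else:
        compressed[name] = w.astype(np.float16)  # everything else stays fp16

q, scale = compressed["moe.expert_0.ffn"]
err = np.abs(dequantize(q, scale) - weights["moe.expert_0.ffn"]).max()
print(f"max reconstruction error for one expert matrix: {err:.4f}")
```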

Andrey

Yeah. It's, it's kind of interesting. I'm looking at this RULER paper. It came out, uh, just earlier this month, 6th of August, 2024, and the title is "RULER: What's the Real Context Size of Your Long-Context Language Models?" So, there you go. Yeah, they say, you know, there's the claimed context size, and for instance, with something like GPT-4, you might say it can handle 128,000 tokens, but that's really, like, the upper range of what the model can handle architecturally, right?

versus how much you can actually use effectively, which is different. And so this benchmark, uh, can evaluate that better than anything we've seen before. And it's pretty fun. They do say here, you know, not only do we claim 256,000 tokens of context, we can actually use it, as opposed to something like Gemini 1.5 Pro, which claims, uh, like 2 million tokens of context, and, uh, according to them, it's effectively only 128K, although it might be a bit bigger. It's, it's hard to say from the paper.

So exciting, exciting developments for sure. This is out, uh, with the Jamba Open Model License agreement, very similar to other kinds of agreements we've seen, where it says you must follow the, uh, acceptable use policy, you can't use the trademark, you must include a reference to this. If you train a model using this, using data from this, you need to, uh, have Jamba at the front of its name, things like that, that are kind of fun little features of licenses for AI models. It's very, very Llama-y. Exactly.

Very Llama-y. It seems like you can use this for both research and commercial purposes, so, you know, effectively what we consider open source, uh, these days for models. Next, we have Phi-3.5 from Microsoft. So Microsoft has been on a tear with these Phi models. We've now gone up to 3.5, and this one is available in 3.8 billion, 4.15 billion, and 41.9 billion parameter versions. Pretty unique sizes there from Microsoft to go with.

And, uh, as with previous, uh, Phi releases from Microsoft, this is very much meant to be a smaller model you can run locally on your own GPU, on your own computer even, and it is pretty performant for this, uh, class of models. So they say, if you compare it to other, you know, 3 billion, 4 billion parameter models, this will get you the best performance for that size. And there is also, of course, a mixture-of-experts model.

There's also a vision model I can understand images in addition to that mixture of experts model. So interesting to see also a bit of a trend, more and more vision language models becoming available for people to use. Yeah. And in two kind of

Jeremie

things that are fairly distinct in this instance, at least. So Microsoft, the Phi series of models, one of the big, big differentiators that they have is data quality, right? Like they basically are overtraining these models, especially the small ones. That's often done intentionally.

By overtraining, I mean that theoretically, if they took the same amount of compute that they're using to train these models, they could get better performance if they increased the size of the model. Like, the model size is actually a constraint, and they're intentionally clamping the model size specifically to make sure they end up with a small model that can perform really well.

So in principle, if you wanted to follow the scaling laws, you could create the world's best small model conceptually very straightforwardly: just pour way more compute and data at a, you know, fairly vanilla architecture, and you'd do quite well. Um, here Microsoft is opting probably to do some of that, but also to do a lot of careful data selection. That's been a big theme in their Phi model papers.

Um, and then another really interesting thing here is when you think about what we're looking at with Phi-3.5-mini, right? This 3.8 billion parameter, really tiny model, with a 128,000 token context window. That is way, way bigger than what you'll see with any other 3.8 billion or 4 billion parameter model. That is a really, really interesting option. It's helpful for edge device deployment, right? You only need room to store 3.8 billion parameters.

So you can actually start thinking about putting those on edge devices, and now your edge device can function with a 128K token context window, which is quite impressive and interesting. So, um, you know, a lot of cool stuff coming out of the Phi series here. They're also announcing, um, a special mixture of experts model, Phi-3.5-MoE. This is, as you said, a bigger model, around 42 billion parameters, with over 20 languages supported.

And, um, uh, and anyway, they, they go into a little bit of the detail as to how they actually do the. Um, the fine tuning and kind of dial the behavior in, uh, but they do use DPO if you're, you know, if you're a reinforcement learning from human feedback nerd. Um, so, uh, so DPO is there. That's a trend that we're seeing obviously more and more of. And, um, there's a whole paper anyway on the safety fine tuning and they have this break fix cycle that they implement internally.

Bottom line is a really interesting series of models. Um, you know, the Phi series is one of the ones that I'm watching most closely, especially for the kind of model miniaturization side. I think this is really, really

Speaker 4

cool. The name actually takes up most of the context window. Yeah.

Jeremie

I mean, I, I found this kind of interesting. Um, so, so they're showing, it's sort of unusual that you'll see this. Like they basically show that if you start with a bigger model. And then you, you prune it. So basically get rid of, let's say, unnecessary weights, unnecessary neurons. Um, so you can go in this case, they go from 15 billion parameter initial model to an 8 billion parameter model. And then they're going to take that 8 billion parameter model.

And they're going to use its outputs to train a separate 4 billion parameter model, um, to basically just like, Replicate what the 8 billion parameter model could do. And they repeat that process to get a smaller and smaller model. And what they find, at least with their 4 billion parameter model, um, they actually get a model that performs better than if they had trained that 4 billion parameter model from scratch, which is kind of interesting.

So they have this 4 billion parameter model that, uh, it turns out is competing on par with like Mistral 7b, uh, 8b. So, you know, pretty much everything. Pretty, pretty impressive. It's a 16 percent improvement on MMLU to the extent that you care about that deeply flawed benchmark. Um, it's, uh, it's an interesting, it's an interesting improvement. So yeah, another, another big step in the direction of like smaller models and processes that allow you to get there with fewer resources.

And, you know, anyway, I just think it's, it's a really cool space to track, uh, Um, between this and the, the five models, people who are interested in, in model shrinkage, uh, have a lot to, uh, have a lot to track this week.

Andrey

Yeah. Yeah. That's a good point. Uh, this is all building on a paper from a few months ago, Compact Language Models via Pruning and Knowledge Distillation. So pruning is the bit where you just get rid of weights you don't need, like pruning a tree. Knowledge distillation is that other bit where you have a large model, and you get it to spit out some data and basically train the small model to match the big model, as opposed to just removing bits of it.
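For flavor, here is a minimal sketch of the two steps just described, magnitude pruning followed by logit distillation, using toy linear "models". This illustrates the general technique only; it is not NVIDIA's actual Minitron recipe, which prunes along width and depth and uses a more elaborate distillation objective.

```python
import torch
import torch.nn.functional as F

def magnitude_prune_(linear: torch.nn.Linear, keep_fraction: float):
    """Zero out the smallest-magnitude weights in place (a crude form of pruning)."""
    w = linear.weight.data
    k = int(w.numel() * keep_fraction)
    threshold = w.abs().flatten().kthvalue(w.numel() - k).values
    w.mul_((w.abs() > threshold).float())

def distill_step(teacher, student, x, optimizer, T=2.0):
    """One knowledge-distillation step: the student matches the teacher's softened logits."""
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy stand-ins for the "8B teacher" and "4B student".
teacher = torch.nn.Linear(32, 10)
student = torch.nn.Linear(32, 10)
magnitude_prune_(teacher, keep_fraction=0.5)          # prune the larger model first
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    distill_step(teacher, student, torch.randn(64, 32), opt)
```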

So they used that technique to get their Minitron family of LLMs, going from a 15 billion parameter Nemotron model to 8 billion to 4 billion. And this story is how they basically reapplied that same technique to Llama 3.1 8B. And one last story for this section: it is that open source Dracarys models ignite generative AI fired coding. No copyright

Jeremie

problems

Andrey

there. That's some, uh, yeah, some fun naming from VentureBeat in the article title. Dracarys is coming from Abacus AI, who have previously had other dragon themed, uh, LLMs, I think, that you've covered. So here we have Dracarys, which is optimized for coding. They use it to improve the coding abilities of other open source LLMs. They showcased this on both Qwen2 72B and Llama 3.1 70B, and released these models, which are then optimized versions for coding.

So I think with both of these stories, and really the previous stories too, you're seeing a pattern of basically what happens in open source. Or actually, the last two, the NVIDIA one and this one, both took Llama and then improved it in different ways. In the previous one, it was via distillation, making it smaller; in this one, they have their own little recipe of,

you know, getting some data. I don't think they've actually published a paper on it, but the point is that they are able to take models and improve them for certain applications like coding. So there you go. If you want a very good model for coding, there is now one open source for your use. And

Jeremie

coming up next, Jeremy's Weekend. Uh, that's actually what this was. So, this is a paper that came out from Epoch AI, um, and uh, this is just a research company that does some amazing research on AI capabilities and trends forecasting. And, uh, they came out with a freaking juggernaut of a paper, uh, that, um, I spent most of my Saturday reading and most of my Sunday rereading. Um, so a couple, a couple of things. Basically the title of it is Can AI Scaling Continue Through 2030, right?

So we're seeing right now these trends towards scaling AI models. They're getting bigger and bigger and bigger. Um, you know, you're seeing in particular the amount of compute that's being thrown at training the leading AI models increase fourfold every single year. For context, that is faster than just about any other technological peak growth rate that we've seen in the last many decades.

You think about mobile phone adoption, that was doubling every year; solar energy capacity installation, 1.5x per year; human genome sequencing, a notoriously fast acceleration, at 3.3x a year. That is all significantly less than the 4x per year we're seeing with compute budgets. So things are getting insane really fast. And the obvious question then is: how fast, or how long rather, can this be maintained, and essentially do you crap out, right?

Do you, do you, um, run out of resources to be able to continue scaling these systems before you get to very, very powerful AI systems? So I'll give you the headline here. And the headline is that their expectation, Epoch AI's assessment, is that you can probably continue the current scaling trajectory, that 4x year over year, all the way through to 2030 or so, um, by which time we will have systems that are 10,000 times more scaled than GPT-4.

That is the same gap as between GPT-2, which most of you will never have heard of or interacted with, and GPT-4, which everybody has heard of and is powering autonomous agents and is covered on most episodes of this podcast in some form.

So this has led a lot of people to speculate, and I think quite reasonably so, um, just based on at least what I'm hearing from inside the labs, uh, their experiments and, and, and what we can see from, you know, like, AI scientist and other successful experiments like that, like it's plausible. You could automate full on AI research, uh, with, with that kind of leap. It's also plausible that, that scaling won't work, right?

That you'll dump all this compute in and not get the kind of value that you need out. But, but that's, that's, I think quite plausible. The big question is, right, what is going to break when you start to scale this hard this fast, what breaks and they identify four different potential things that could break. One is. Power, right? It takes a huge amount of power to, well, power these gigantic training runs.

And their expectation is you're going to need one to five gigawatts of power to train the, like, 2030 level training runs that they expect to happen. For context, currently, you know, the largest clusters you'll see are usually just over a hundred megawatts in scale. So essentially you're looking at maybe 20 to 30x beyond where we're at today for the kind of mainline expectation of what 2030 will bring in terms of power. Now power is not compute. Right.

It's not... we're going to be spending 20 to 30x the amount of power to train these models in 2030, they expect, but the actual amount of compute is going to be 10,000x. So why that gap? Because you would naively think, like, wait, isn't, you know, one extra computation, one extra increment of power, shouldn't they scale together? And the expectation here is that these models, and the hardware in particular, will get a lot more power efficient, more energy efficient.
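A quick back-of-envelope, using the numbers just mentioned, shows how big the implied efficiency gap is; this arithmetic is ours for illustration, not lifted from the report.

```python
# Compute scales ~10,000x by 2030 while power draw scales only ~20-30x,
# so efficiency (useful FLOP per joule) has to make up the remaining gap.
compute_multiplier = 10_000
power_multiplier_low, power_multiplier_high = 20, 30

implied_efficiency_gain_low = compute_multiplier / power_multiplier_high   # ~333x
implied_efficiency_gain_high = compute_multiplier / power_multiplier_low   # ~500x

years = 2030 - 2024
# Annualized improvement needed across hardware and algorithmic energy efficiency
annual_low = implied_efficiency_gain_low ** (1 / years)
annual_high = implied_efficiency_gain_high ** (1 / years)
print(f"Implied efficiency gain: {implied_efficiency_gain_low:.0f}x-{implied_efficiency_gain_high:.0f}x "
      f"(~{annual_low:.1f}x-{annual_high:.1f}x per year over {years} years)")
```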

And there are a whole bunch of other reasons why you might expect that stuff to scale weirdly, but it's all laid out in the report. Um, so you got energy or power, then you've got chips, you know, will we run out of GPU production through 2030? And this is really interesting. They talk about a lot of the stuff we talk about on the podcast. What are the key bottlenecks? to chip production to GPU production. Well, guess what? It's not the actual logic.

It's not the thing that does the number crunching on the chip. It's actually the packaging, right? So the ability to take all the chips that you need and kind of glue them together to make a GPU, kind of set them together on the same, uh, on the same device. And that's CoWoS, that's the advanced packaging technique that's really rate limiting on this, along with high bandwidth memory.

So anyway, bottom line is chips and power turn out to be the kind of constraints that start kicking in first. It's not super clear which one kicks in first. I actually, before reading this report, I actually thought power was going to kick in sooner, and significantly earlier than chips. Turns out there's a lot more uncertainty there. They kind of seem roughly tied. And then there's data. Will we run out of training data? And there are a whole bunch of open questions here.

A lot of uncertainty there. Like, you know, you have a whole bunch of multimodal data, like video data, that could be used. Synthetic data is a big X factor that could increase data availability by a lot. Bottom line is the data wall gets hit presumably, or plausibly, somewhat later than 2030. And then the last bit is this thing called latency. So latency is really interesting. It's a kind of irreducible constraint on your ability to scale.

So think about the amount of time that it takes for the input data that you feed your model to go all the way through to the end of your model and generate an output, right? That's going to take a certain amount of time, and the larger you make your model, the longer it's going to take for that information to propagate through the model and get processed, right?

Well, if you're thinking of doing a training run, your training run is going to have to involve a whole bunch, no matter how you slice it, a whole bunch of these kind of, um, these runs where you, you do forward and backward passes to train your model. You're gonna have to feed your data and have it run through a whole bunch of times.

And If you assume that, okay, these training runs can only last like a year or so, which is in practice, the reality, because within a year, there's like a whole new generation of compute that's going to come out. The whole infrastructure you're using is going to get outdated. So you need your model to come off the production line pretty quick. So you have a year. It takes a minimum amount of time for your data to propagate through your model.

And that does actually end up placing a hard boundary on the amount of compute that you can practically pour into a model, based on its size. There's a whole bunch of detail in this paper. If you're a kind of AI scaling hardware nerd, especially if you're interested in the national security picture, like me, uh, take a look at this paper. This is just a wonderful, wonderful document. Um, and if you're not, hopefully you appreciate the highlights.

Andrey

Right. And it's hard to overstate how kind of useful this is in a sense, because scaling, as we have covered, is one of the major, maybe the most important, trends in AI, and has been pretty much for a decade, but especially since 2020 when GPT-3 hit, it became very apparent that scaling was the thing.

If you go from GPT 3 to GPT 4 with 10x the parameters, You know, there are other things that matter like alignment, like RLHF, but getting more data and getting bigger models is the main factor, seemingly, to progress, or at least one of the requirements. And this is a real question, right, of like, how much can we scale just due to physical constraints of the power required, the compute required, the data transfer required, these kinds of things.

And this is pretty much doing that math and showing that it is insane, like the requirements needed to go two orders of magnitude above GPT-4. And that's, you know, to be expected. GPT-4 already was far beyond what was even imaginable in training just a few years ago. GPT-3 was already unimaginably big when it came out in 2020.

So this is saying that it is, um, plausible that it would be possible, assuming hundreds of billions of dollars, if not trillions of dollars of investment, right, to get there. And that's, um, I think also this, this is covering, um, also that data question in a pretty nuanced way where they show uncertainty margins for each factor. saying, you know, this is how certain you can be about where your limit is in terms of how far you can train.

And so on the more, um, you know, uh, physics side of things of power and compute, those are fairly low uncertainty margins versus with data. Those are quite high because you don't know, maybe synthetic data will be very useful. Maybe multimodal data will be very useful. But it's hard to say. So this projection of 2030 is on the lower end, broadly speaking, of the uncertainty.

So it's somewhat conservative, in the sense of saying we're not going to be optimistic about data and synthetic data, not going to be optimistic about chip production necessarily, things like that. So, uh, yeah, a very useful conclusion, though not really touching on the question of: do the scaling laws we see empirically so far, are they going to continue as we scale up another two orders of magnitude?

I think the current hypothesis implicitly for most people is yes, we'll just see the same trend of improving performance as you continue to scale. It's not going to plateau as some things plateau, uh, but. You know, that remains to be seen.

Jeremie

Yeah. And what does it mean, right? So, for context, the scaling laws, they tell us, roughly speaking, as I increase the amount of compute and data that I feed to my model, you know, how does its next word prediction accuracy improve.

And what you learn is that there's a power law there, and that as you increase those things, you do get a reliable, predictable increase in next word prediction accuracy, empirically at least. There's a separate question as to what that next word prediction accuracy actually buys you in terms of capabilities, and this is really where so much of the debate is focused. Most people, as you said, Andrey, buy into: yep, probably these scaling laws are going to hold.

Um, they're actually, you know, they're almost as well established as many physical laws were when they were first called physical laws. You can think here about empirical laws, like the ideal gas law, things like this. You know, we've seen it apply across like 10 orders of magnitude now. That's a pretty well established trend.
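For readers who want the shape of that claim, here is a toy power-law curve of loss versus training compute; the coefficients are invented purely for illustration and are not fit to any real model or to the scaling-law papers.

```python
def loss_vs_compute(C, L_inf=1.7, A=50.0, alpha=0.05):
    """Toy scaling law: loss falls as a power law in training compute C (FLOPs).
    All coefficients here are made up for illustration."""
    return L_inf + A * C ** (-alpha)

for C in [1e21, 1e23, 1e25, 1e27]:   # spanning several orders of magnitude of compute
    print(f"C = {C:.0e} FLOPs -> predicted loss {loss_vs_compute(C):.3f}")
```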

It doesn't mean that it will hold indefinitely, but a lot of people, including companies like Microsoft, are betting that it will. We've talked about this, right?

I mean, the amounts, the scales of these build outs, both in terms of energy and chip usage, seem absurd, but so too is the investment that is happening today by companies like Microsoft. You know, we've talked about how Microsoft is engaged in what may be the single largest infrastructure build out in the history of human civilization, 50 billion dollars per year on these data centers, on the thesis that it's going to lead to something like AGI. That's kind of where that's coming from. Now, over four years, that's 200 billion.

That's the Apollo moon landing, right? For, for scale. So this is a really, really big, big play. Obviously the Stargate, um, cluster that Microsoft and OpenAI are co-designing and going to be building, this is a 100 billion dollar compute cluster set to come online in 2028. That gets featured quite prominently in this report as an instance of saying, hey, you know what, these companies internally seem to believe that this is a live proposition.

Um, and so, uh, so it's just really interesting, and you're seeing anyway all the fascinating machinations of how companies are acquiring more power to fuel these insane training runs. Amazon, you know, having this almost one gigawatt nuclear power contract in Pennsylvania that they've locked in on, um, the Microsoft OpenAI campus apparently is a five gigawatt, um, uh, cluster that they're teeing up.

So, you know, we're, we're getting up there already in terms of what's being planned for, and that's not even 2030. Like we're talking about 2028 there. So, you know, Things are, are, are crazy, at least in terms of the investment.

We'll see if the payoff is there, but, but if it is, you got to figure there's a reason Microsoft is investing, you know, whatever it is like, uh, you know, six Airbnbs every four years into building data centers, you know, they think that there's going to be something there for them.

Andrey

Right. And, uh, I guess I'll have to start investing in power supply. I mean, just the national infrastructure question of this is already super interesting, right? And then you get into policy and it's, it's kind of a lot of crazy things to think about, but at least this is answering one of the questions of, just in terms of the physics, is it doable? And the answer is, so far, it seems to be doable. Next paper is Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents.

So we're talking here about, uh, going from just an LLM to an agentic kind of LLM. We've covered this before. LLMs are passive models in the sense that you give them an input, they spit out an output, and that's it. Agents are able to take in a request and then go off and independently do some work, some series of steps, to hopefully accomplish whatever you need them to accomplish without you being aware or supervising them the whole way.

And this is a new, uh, work of research from, uh, the AGI company MultiOn, with some collaboration with Stanford, that proposes one possible agent architecture that leverages LLMs. In particular, they have a framework that combines guided Monte Carlo tree search. Uh, so basically search, where search is: you think one step ahead, you think two steps ahead, you think about different possible branches, and eventually, it's basically planning ahead via search, right?

And then combining that with self critique and iterative fine tuning on agent interactions. So, uh, you can essentially learn as you go. You do something, you try it, you give yourself a reward. Did I succeed? Did I not succeed? And then you do direct preference optimization, basically what we already do with alignment, saying, okay, I did the right thing, or I didn't do the right thing, let me improve my own capacity to do things properly.
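Since DPO comes up here, a minimal sketch of the objective may help: the policy is pushed to assign relatively higher likelihood to the trajectory judged successful than to the failed one, relative to a frozen reference model. The tensors below are placeholder log-probabilities, not outputs from the Agent Q setup.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over (preferred, dispreferred) trajectory pairs.

    Each argument is the summed log-probability of a full trajectory (e.g. a sequence of
    agent actions) under the policy being trained or under a frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: log-probs for a batch of 4 preference pairs (successful vs. failed attempts).
policy_chosen = torch.tensor([-12.0, -10.5, -9.8, -11.2], requires_grad=True)
policy_rejected = torch.tensor([-11.0, -13.0, -12.5, -10.9], requires_grad=True)
ref_chosen = torch.tensor([-12.5, -11.0, -10.0, -11.5])
ref_rejected = torch.tensor([-11.2, -12.8, -12.0, -11.0])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
print(float(loss))
```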

So they combine all of these things into a framework that you can, uh, use with an LLM like Llama or like ChatGPT, and then apply it to one of these, uh, kind of demo cases for agents. In particular here they have a simulated e-commerce platform where they want the agent to go and, uh, book you something or buy some, uh, product, things related to searching and acquiring something.

So numerically, they say in a real world booking scenario their methodology boosts Llama 3 70B zero shot from an 18.6 percent success rate to an 81.7 percent success rate. So they take the base model, they plug that model into this framework, and it can then, uh, perform much, much better. And this is after a single day of data collection, and it can even improve further with some online search.

So this is one of quite a few of these... you know, there's been quite a bit of work on the idea of an architecture, of a framework, such that you can plug in an LLM and get out an agent. And this is, uh, yeah, seemingly a pretty, pretty impressive, uh, idea, combining a few important aspects of agents in terms of planning and in terms of, uh, continual learning.

Jeremie

Yeah, absolutely. I think one of the big conceptual leaps here, right, is it has us go from thinking of training as something that applies to Language models. So you train your language model, do next word prediction. And then you have this, this model that has, well, a good world model. It understands a lot of stuff. And then you kind of, you kind of get it to, you know, break down a problem, let's say into sub steps and then execute those sub steps.

And it just happens to be reasonably good, that is, sometimes, at executing on those steps because of its pre-existing world knowledge that it gained through that next word prediction training. But there's obviously no reason that next word prediction training should make for a particularly good agent. It's just kind of happenstance. It's coincidence, right? So what they're doing here is saying, no, no, no, we're going to architect an agent. And then we're actually going to train.

We're going to update the model's weights based on how it performs at these agenty tasks. And so that kind of starts to get you more in the direction of, like, okay, we're fine tuning for agent-like behavior. And this is something that, you know, when you think about GPT-5 and, and what's going to make it different, like agent-first architectures are now going to be a thing, and this is a bit of that kind of flavor, um, and a really, really big capability leap, right?

You, you mentioned, you know, under 20% to over 80%, um, thanks to this framework, thanks to this strategy. I think there's a lot of stuff like that out there. I don't think... even if you froze progress at the frontier, I think you would find significant capability leaps, especially on agentic tasks, that are already baked in just by virtue of the stuff that we have out in the open source today,

uh, just by virtue of the fact that, you know, people are going to find cheap ways to fine tune these models to do extraordinary things. One of the things I found really interesting about this paper: they make the argument that really they're drawing from

Richard Sutton's famous essay, The Bitter Lesson, which is basically this idea that if you look at the long arc of machine learning history, which I guess is actually a short arc, the techniques that tend to work really well are the ones that just are really good at leveraging highly scaled computation. So they're not customized. They're not the techniques that like human beings sit around a table thinking really hard at and come up with a brilliant idea.

No, the techniques that work best, the claim is, are these general purpose techniques that are just really good at funneling gigantic quantities of compute to solve a certain task. And the thing you're betting on when you do that is that compute is just going to get cheaper and cheaper and cheaper. So eventually, no matter how clever you get on the front end, you're going to be defeated by some algorithm that just is good at leveraging huge amounts of compute.

And that's really what Deep learning models are said to be, um, so in this case, that's kind of the whole point, right? They're setting things up such that the more, the cheaper compute gets, the more you're able to train this agent, the more you're able to refine its behavior for agent like behavior specifically. And it's all based on the same kind of approach or an approach that's philosophically pretty similar to the kind of alpha go type stuff.

They've got the, the sort of Monte Carlo tree search type approach where. You know, basically you confront the agent with a website, you give the HTML code of the website, it reads it, and then it proposes a whole bunch of next actions that it could take. And then you got a process where it reflects and figures out, okay, like which of those, which of those through lines am I going to pursue? So it tries to pursue that, and then it sees where it ends up.

And then it's able to evaluate, okay, in retrospect, what was my decision making process like? Was that a good decision making process? And you can kind of repeat this at every node, working your way down the tree, and ultimately, You end up with, well, this tree, which is exactly what you see with the kind of alpha go type, uh, type model. So, um, you know, this is not a coincidence. You're going to see more and more of this.

Every time I talk to friends at the frontier labs, when they're looking at, you know, what's, what's the next beat. It really is this sort of reasoning Monte Carlo tree search is a big, big, uh, facet of that. And, um, yeah, not, not in some ways, not surprising to see this sort of thing, but uh, impressive as hell to see it working so well, so soon.

Andrey

Yeah, and it really is a trend of this year, right? It's one of these things where, what we're seeing in practice is text to video, um, you know, uh, music generation, things like that. But behind the scenes, uh, in published papers or in conversations, uh, between academics and so on, agentification, making LLMs into agents, is one of the big trends.

And I think in large part that's because it seems that most people have a sense that with the current models, with GPT-4, Llama 3.1, the LLMs are strong enough, and what is missing is that kind of agent architecture that allows it to be recurrent, ongoing, have memory, have exploration and online learning, or learning from experiences, which is one of the weak points of LLMs. And yeah, personally, I am on that train. I think this kind of thing is probably all that's needed.

And people are very much working on figuring out the right set of tricks to make this usable and reliable. Out of the lightning round, the first paper is Transformers to SSMs, Distilling Quadratic Knowledge to Subquadratic Models. And the gist is that quadratic models are transformers, subquadratic models are SSM, state space models, aka Mamba, and similar models. And they show how you can take a transformer and then convert it to a state space, uh, model like Mamba.

So you can take, uh, Phi-1.5, for instance, and create Phi-Mamba, that is smaller, and also create a hybrid version. And so you can essentially get a Mamba type model for free. One of the challenges with recurrent architectures, including Mamba, including SSMs in general, is that they're harder to train because of the issues with parallelization that you get with recurrence, as opposed to the quadratic, look-at-everything-at-every-time-step attention you have in Transformers.

So you get a lot of benefit if you can, Kind of spend a lot of money training something and then distill it into an SSM as they do here. Uh, and, uh, yeah, it seems as you might expect the results point to it working really well and this being a pretty promising approach.

Jeremie

Yeah, I thought this was just such a, an interesting paper. And the way conceptually it works is, is they'll, they'll identify like, uh, imagine that you, you start off with some sort of transformer model, right? And you want to turn it into a Mamba model. Well, you'll take a look at the, so that model is going to be made of a bunch of transformer blocks as they're called, right? So you subdivide your transformer into a bunch of repeating transformer blocks. And in those blocks.

There is a, a certain matrix transformation, the self attention matrix, um, that does well all the kind of self attention work, essentially attending to all the other tokens in the input sequence to determine what the next token should be. And what you're going to do is you're basically going to say, okay, let's find out. In Mamba, the equivalent roughly to the self attention matrix is the SSM matrix, right?

So, um, what you're going to do is you're going to say, okay, let me just look at just this one self attention matrix and this one SSM matrix. So the self attention matrix and the transformer that I've got, and then the SSM matrix in the model I want to train. I'm basically just going to take that transformer that I, that I start with, and I'm going to replace the self attention matrix with an SSM matrix. That's not yet trained.

And I'm going to now kind of train just that bit, holding the rest of the model identical. So you've just got this one little piece, the SSM bit, and you're going to try to train the model such that that SSM matrix generates outputs that match what the original self attention matrix in the transformer did. So just that one little bit of the model gets trained, while you kind of hold the rest of the model identical.

Um, so you've got this very controlled retraining of just a small part of the model. And that's key because the self attention matrix is the part of the transformer that is quadratic in, um, in compute requirements. And so by replacing it with that SSM, that's kind of the meat of the operation. Once you can get the SSM to output the same stuff as just that self attention matrix, that's pretty cheap. Now what you do is you say, okay, now let's continue the training at the block level.

So now let's get the blocks overall to output the same, the same stuff with more fine tuning. And then you do the same thing at the model level. So you're kind of progressively zooming out more and more. It's kind of, um, I don't know what that game is, Jenga or whatever, where you pull out a piece of wood and you try to make sure that the structure is still stable. It's the same idea here, where you kind of make your initial core modification, just that crucial SSM matrix swapped

in for the self attention matrix, and then the rest is just like, okay, now let's fine tune it so it behaves the same, let's zoom out, fine tune the blocks so they behave the same, and then fine tune the whole model. And, um, this is, uh, really, I think, conceptually simple.
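Here is a toy sketch of that first, matrix-level stage: freeze a trained attention layer, drop in an untrained subquadratic mixer, and train the mixer to reproduce the attention layer's outputs before moving on to block- and model-level fine tuning. The "mixer" below is a crude linear recurrence standing in for a real Mamba/SSM layer, so treat it as an illustration of the matching objective rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class LinearRecurrentMixer(nn.Module):
    """A crude subquadratic stand-in for an SSM/Mamba layer (not the real architecture)."""
    def __init__(self, d):
        super().__init__()
        self.decay = nn.Parameter(torch.rand(d) * 0.5 + 0.4)  # per-channel recurrence strength
        self.in_proj = nn.Linear(d, d)
        self.out_proj = nn.Linear(d, d)

    def forward(self, x):                      # x: (batch, seq, d)
        u = self.in_proj(x)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.shape[1]):            # O(seq) recurrence instead of O(seq^2) attention
            h = self.decay * h + u[:, t]
            outs.append(self.out_proj(h))
        return torch.stack(outs, dim=1)

d, seq = 32, 16
teacher_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
for p in teacher_attn.parameters():
    p.requires_grad_(False)                    # frozen, already-trained attention layer

student = LinearRecurrentMixer(d)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):                        # stage 1: match this one layer's outputs
    x = torch.randn(8, seq, d)
    with torch.no_grad():
        target, _ = teacher_attn(x, x, x)
    loss = nn.functional.mse_loss(student(x), target)
    opt.zero_grad(); loss.backward(); opt.step()
# Stages 2 and 3 would then fine tune at the block level and the full-model level.
```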

It's one of these ideas, again, I keep saying this, but, you know, back in my physics days, it was the ideas that seemed the simplest, where you're like, oh man, that's a really good idea, that tend to work out the best. So they get a whole bunch of impressive results. It turns out, in this case, they distill the model with just 3 billion tokens. That's less than 1 percent of the data that was used to train the previous best performing Mamba models.

And it was just 2 percent of the data that was used to train the original Phi-1.5 model, the transformer they were starting from. So, Andrey, to your point, you know, it's much harder to scale up training of Mamba models. Um, there are a whole bunch of hardware reasons why that's the case.

Uh, but ultimately what this allows you to do is say, okay, well let me, let me train the thing that I can train scalably that transformer and then just use that, leverage that to get a really performant mamba model.

And the, the crazy thing about this: like, I've often talked about my tendency to believe in what's known as the hardware lottery, which is this idea that, hey, transformers, they may not technically be better than Mamba or other architectures, but there's so much hardware optimization, software optimization, pointed in that direction. It's kind of, like, it's kind of settled: you know, people are going to keep investing more and more in transformers.

They'll get more and more efficient. It doesn't mean that other models don't in principle have more potential, but, but here we are. Um, this is a way around that, right? If you can start to kind of transfer over effectively to another platform, another architecture, for a low, low cost of like 1 percent or 2 percent of the training data, now you're kind of in business. And that's what's really interesting about this paper. I think it's a really strategic paper, if you will.

Andrey

Right, exactly. And to just jump in a bit on the details of the performance, they have a table comparing their Phi-Mamba 1.5B to essentially all the other, uh, variants of this idea of state space models, uh, and also some other, uh, smaller models. So it's compared to Mamba 1.4B, Mamba-2 1.3B, xLSTM 1.4B, Phi, et cetera, et cetera. And this Phi-Mamba 1.5B, that, as you said, was trained on only 3 billion tokens, a relatively small amount, gets the best performance on

pretty much all, or at least most, benchmarks relative to all these other models, despite using much less data. And it gets close to matching the initial Phi-1.5 1.3B model. So we're not quite at the same level as transformers, but we are by far the closest of any of these other small state space model type architectures. So to me, it's, yeah, it's one of these things that seems intuitive and kind of straightforward to imagine: you can train really big transformers, so

why not just, like, take that and convert them to these things that are, uh, efficient at inference time and scale better at inference time? But you know, at least I haven't seen a paper do that until this one, and the results are pretty exciting. So it could be kind of impactful. We'll see.

Jeremie

And up next we have Loss of Plasticity in Deep Continual Learning, and I'm going to highlight that this paper is co-authored by Richard Sutton. So, uh, this is the real deal, folks. Um, he's, uh, one of the, the godfathers of, of reinforcement learning, uh, Rich Sutton. So, okay, what is loss of plasticity?

Um, It turns out that if you start by training a model on one, uh, call it one problem space, one type of problem, and then you train it on a different type of problem, and then on a different type of problem, and then on a different type of problem, in that order, eventually you'll find that your model is going to get worse And worse and worse at learning the new problem spaces.

Okay. So in a way, it kind of feels like this other phenomenon called catastrophic forgetting, but it is different in catastrophic forgetting. What ends up happening is you train a model on more and more stuff and eventually starts to forget the stuff that you trained it on before. Right. It treadmills out the old knowledge. So this is instead a question of how quickly the model is going to learn, uh, to solve problems in a new problem space and a kind of new distribution to use the language.

Right. So, so the way they set this up is It's an investigation of this phenomenon and a bunch of suggestions as to how you might mitigate this problem. And they start with this data set called ImageNet, the famous ImageNet data set. Basically, this is like a thousand different categories of images, and they're all labeled. It's cats, it's dogs, it's airplanes, it's school buses.

And what they're going to do is they'll say, okay, We're going to take these 1000 categories and usually the problem that you give to your model is you say, okay, I'm giving you an image, tell me which of these 1000 categories it belongs to, but they're going to change the frame here and instead they're going to look at every pair of categories. So you got 1000 categories, you can have a million pairs of categories.

So for example, a pair would be like dogs and cats and you're going to in each case. Train the model to just distinguish the two. Distinguish images of dogs from images of cats. Images of cats from images of sandwiches. And so on and so forth. And what they're going to do is they're going to train the model on one of these pairs. So get it really good at telling cats apart from dogs. And then get it really good at telling, uh, toilets apart from wrenches.

And over time what they're going to do is they're going to measure how much data it takes to get the model to perform well on the marginal next problem. And what they find is, well, Loss of plasticity. The model starts to struggle more and more as you give it more and more of these tasks to perform. And, um, they end up proposing a, a strategy to mitigate this. It's actually quite interesting. When you read the paper, they don't get too deep into trying to interpret why this is happening.

That seems to be a bit of an open question to them, yet they still find these mitigations that, that work pretty well. So what they'll do is they'll Uh, reinitialize a small number of the, um, the neurons basically in the model, the, the neurons that, uh, let's say are, are least involved in generating, um, correct outputs. So we'll just say, you know what, this one doesn't seem to be being used much, so I'm just going to zero it out.

And kind of retrain that neuron from scratch, and then they'll do this with kind of a very small fraction of the weights in the model, and they find that that leads to a kind of better plasticity, a sort of refreshing, I mean, I guess the metaphor is sort of like, you know, what if you could just, uh, regrow from scratch some of the neurons in your brain every once in a while to kind of keep you fresh? That seems to be working pretty well in this paper.
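A minimal sketch of that mitigation idea, reinitializing the least useful hidden units, might look like the following; the utility measure here is a simple proxy chosen for illustration, not the exact running statistic the paper uses.

```python
import torch
import torch.nn as nn

def reinit_low_utility_units(layer_in: nn.Linear, layer_out: nn.Linear,
                             activations: torch.Tensor, frac: float = 0.01):
    """Reinitialize the hidden units that contribute least to the network's outputs.

    Utility here is a crude proxy: mean |activation| times the magnitude of the unit's
    outgoing weights. The paper tracks a running, more refined utility measure.
    """
    with torch.no_grad():
        mean_act = activations.abs().mean(dim=0)                  # (hidden,)
        out_norm = layer_out.weight.abs().sum(dim=0)              # (hidden,)
        utility = mean_act * out_norm
        k = max(1, int(frac * utility.numel()))
        _, idx = torch.topk(utility, k, largest=False)            # least useful units
        # Re-grow those units: fresh incoming weights, zeroed outgoing weights.
        layer_in.weight[idx] = torch.randn_like(layer_in.weight[idx]) * 0.01
        layer_in.bias[idx] = 0.0
        layer_out.weight[:, idx] = 0.0

# Toy usage on a two-layer classifier distinguishing a pair of image classes.
fc1, fc2 = nn.Linear(128, 256), nn.Linear(256, 2)
x = torch.randn(64, 128)
h = torch.relu(fc1(x))
reinit_low_utility_units(fc1, fc2, h, frac=0.02)
```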

So, uh, this is, uh, this is very interesting. It's highly empirical, and it is reflective of the kind of approach to this problem space that Rich Sutton has, because he's so focused on reinforcement learning. And they do, by the way, they don't just look at images. They have a reinforcement learning setting as well. It's just a little bit more complex, but the bottom line is this problem seems to keep reemerging, which is why Sutton is so interested in kind of doubling down on solving this.

He is, interestingly, the kind of OG AI scaling guru, right? Like, this is the guy who flagged, with his famous blog post, this idea that scale matters. And here he is kind of identifying one of the key challenges that comes with that. So I thought it was a really interesting paper, and it felt almost a bit Yann LeCun-ian in the way that it identifies a conceptually simple problem that current models still do have.

Andrey

Yeah, I agree. It's quite an interesting paper partially because we know about catastrophic forgetting. That's like the thing that everyone knows is as you try to train a model on new tasks or new information, you will probably lose information that you had previously learned on. And what this is showing is actually distinct from that.

It's saying, as you train on a procession of tasks that are all of similar difficulty, or actually the same difficulty, it's not asking, can you still do a thing you could do, you know, three tasks ago; it's asking, can you learn this new task as well as you could the previous N tasks? And so this observation that somehow, as you learn more and more, you are worse at learning than you were at the start, when you hadn't learned any skills at all, that's, yeah, pretty novel in this case.

And this is, to be fair, you know, just so you know, on a sort of toy-ish task. It's a procession of, like, classifications between two image categories. So it's, you could argue, more on the theory side at this

Jeremie

stage, but still very interesting. But, but it is interesting for how it interacts with this notion of sample efficient learning that you get at scale, right? So as you scale models more and more, it turns out paradoxically and seemingly contrary to the results of this paper, it turns out that the model will learn the marginal next task much faster. You see this with language models all the time, right?

Where, you know, you train the model all in English, and it takes you a gigantic amount of tokens to do that. And then for a fraction of that token count, like 1 percent, half a percent, the model will then learn a new language crazy fast. And I think there's something really interesting going on here. This is kind of a half-baked thought, and they don't try to address this in the paper, and I'd be really interested in the authors' take on what the distinction is.

Um, but I think there's something very interesting going on in terms of what you define as one problem space versus another. The extent to which they overlap, I think is playing an important role here. That's my intuition at least. Um, but I'm, I'm really like, this is a space I'd be really curious to see more deep dives into for sure.

Andrey

Yeah. I mean, I would say my intuition as well is, you know, the distributions have got to overlap, or else you're no good. But, uh, I guess we'll see, probably in future research. On to policy and safety. And the first story is once again about SB 1047, the California bill on AI regulation that we've been covering quite a bit; it has been the hot topic lately. And the news is that the bill has been weakened, uh, with regards to AI disasters before the final vote.

So it has been amended in such a way that, uh, there's less power granted to the government in holding AI labs accountable. In this case, it doesn't allow the attorney general to sue AI companies for negligent safety practices before a disaster occurs, which is something I believe Anthropic had in their opposition to this bill. They, um, were skeptical of the idea of kind of regulating how, uh, labs prepare for these sorts of things.

The bill also will no longer create a new government agency called the Frontier Model Division, but will have a Board of Frontier Models. And there's a few more things. Uh, also, AI labs are no longer required to submit certifications of safety test results under penalty of perjury, but there are going to be public statements. Lots of these amendments basically require less of companies like Anthropic and OpenAI.

So perhaps not surprising given the very, very, very significant pushback from companies in California. Uh, and I think, uh, one thing that in particular, uh, Meta will be happy about is that the bill does protect open sourced, fine tuned models, saying that if someone spends less than 10 million dollars fine tuning a model, they are not considered a developer under this bill.

So if you take a big model and then fine tune it to your own spec, you're not under the same, uh, restrictions as Meta training Llama 3, for instance. So quite a big change, it seems to me, and it does address some of the things that critics have put forward as issues with the bill. We'll see; probably a lot of people are still opposed to it, but, uh, definitely a notable difference.

Jeremie

Yeah, absolutely. And I think, you know, one of the things that's not often talked about in the context of the bill is just how, um, uh, rarefied the set of players is who would actually have to abide by the regulation. It's, uh, you know, if you're training models that are 100 million dollars plus in cost, then it applies to you, right? So there's been a lot of noise, especially from OpenAI, coming out and saying, you know, oh, well, you know, it poses a regulatory burden, it's bad for innovation, blah, blah, blah.

It's bad for innovation, blah, blah, blah. Um, but the, you know, the actual requirements when you look at the bill, apart from being bloody similar to a lot of the things that open AI had previously committed to doing voluntarily, not, not entirely right. There are nuances. Um, but, but, you know, it looks an awful lot like the voluntary commitments that they've made. Um, and now we're hearing all this, you know, pushback on, Oh, well, you know, stifle innovation in a context where it's like.

Yeah, if it's 100 million, but, like, if you have the resources to build that kind of model, yeah, you arguably have the resources to, to, to submit to some regulation here. Um, but in any case, this has been a really interesting saga. Anthropic did come out with objections to the initial bill. A lot of those objections, a good chunk of them, have been addressed in this bill, not all. And Anthropic's come out with a statement themselves.

This is supposedly from Dario Amodei, but I will say it reads as if it was written perhaps by Jack Clark, um, who heads up policy there. Um, so it, uh, it was quite interesting. They basically say, in our assessment, the new SB 1047 is substantially improved, to the point where we believe its benefits likely outweigh its costs.

They do go on to say they're not sure about that, but you know, that reads pretty well as like a tepid endorsement, which does track, you know, they got some of the things they wanted, not all. Um, and so, uh, you know, a qualified endorsement, I think makes a lot of sense on that basis. Um, they said, you know, they highlighted the risk of, of government overreach.

If this is signed into law, um, one of the big things they're trying to do is to maintain, as they put it, a laser focus on catastrophic risks and resist the temptation to commandeer the bill to accomplish unrelated goals. Um, so that, that kind of makes sense.

Um, one of the things that's been changed is, yeah, the, this risk of criminal penalties, uh, being incurred by developers for intentionally submitting false information about their safety plans to the government under penalty of perjury. I mean, like, I gotta say, I, like, I'm a, I'm a techie guy.

I don't like regulation, but, um, if you're gonna lie intentionally, uh, like submit false information about your safety plans to the government, yeah, I mean, I think criminal penalties, like, I think the average American would look at that and be like, yeah, like, you know, any other industry, freaking Boeing is like submitting intentionally false information about their safety plans. Yeah, like criminal, criminal charges should probably apply, but that's been taken out of the bill.

Uh, just to give you an idea of, like, where the decision surface is being drawn there. Um, but in any case, uh, they, you know, they say it's, it seems like a feasible compliance burden. No surprise there, obviously, given the costs associated with it. Um, and, uh, there, there is one thing I want to call out.

Uh, in this, this, um, write up this letter that Anthropic put together, um, because it gives us a sense of their core thesis, they say, um, in grappling with the, essentially the regulatory dilemma, they say they've come to the view that the best solution is to have a regulatory framework that is very adaptable to rapid change in the field. Um, they say there's several ways to accomplish this. blah, blah, blah.

Um, down the road, they say, perhaps in as little as two to three years, when best practices are better established, a prescriptive framework could make more sense. And, um, and they point to automobiles and aerospace as being industries where that's happened. But they also say that, um, as noted earlier in the letter, we believe AI systems are going to develop powerful capabilities in domains like cyber and bio, which could be misused, potentially in as little as one to three years.

And they say in theory, these issues relate to national security and might be best handled at the federal level. Okay. That's interesting. In theory, they're saying this bill really should be a federal level bill, but they say in practice, we're concerned the congressional action simply will not occur in the necessary window of time.

Now, I will invite you finally to contrast that with open AI statement, which is basically Hey, uh, we don't think it's appropriate for this to be done at the state level, should be done at the federal level. Again, that makes sense in theory, but OpenAI knows, they know. I can tell you firsthand, they've got lobbyists on Capitol Hill. Uh, we, we often walk into congressional offices shortly after the OpenAI lobbyists have left.

And they know just as well as anyone that action on Capitol Hill on this is not likely to happen anytime soon. And so frankly, it becomes challenging to take particularly as good faith, um, an assertion like, Oh, well, this should be done at the federal level. Yeah, you could, you could see it certainly as an attempt to just stonewall and play a delaying action. So anyway, um, long winded way of saying there are very different takes on this.

OpenAI whistleblowers also came out, by the way, with this, uh, this open letter basically calling out OpenAI for what they frame as a kind of hypocrisy, saying that their former boss Sam Altman has repeatedly called for AI regulation, and they're saying now when actual regulation's on the table, he opposes it. Um, anyway, uh, lots of, lots of drama going on here, but it's interesting to see the contrast starting to form now on policy more explicitly between OpenAI and Anthropic.

Andrey

Yeah, the drama is, as ever, present here, though a little more, uh, substantive now, related to policy. As you said, there was this letter from OpenAI whistleblowers released on August 22nd, uh, directed to Governor Newsom and the Senate President pro Tempore and Assembly Speaker regarding this, uh, talking primarily about OpenAI and its, uh, I guess, disagreements with the bill.

To be extra clear, the bill had these amendments before, uh, actually passing through California's Appropriations Committee, which is a major step to becoming a law. So next, it will be heading to California's Assembly floor for a final vote. If it passes, it will then go back to California's Senate for a vote, due to these amendments. And then, if that passes, it will go to Governor Newsom, the Democratic governor of California, who will either veto it or sign it into law.

So there still could be some developments in this case. It's been one of the, I guess, main movements in AI policy in the US so far, ever since the Biden executive order. So pretty significant to cover. I will just mention, I didn't mention this at the beginning, but we had another negative review regarding needing less doom, and it did call out our SB 1047 cheerleading, which I think is, like, the most boring topic to cheerlead for, surely. Yeah, that's true.

For the record, I work in industry, so obviously I cannot be cheerleading this bill, because who in industry would like to be regulated? Come on.

Jeremie

Yeah.

Andrey

Other than, I guess, apparently Anthropic, but yeah. And next up, we have a bit more of a paper. It's titled Personhood Credentials: Artificial Intelligence and the Value of Privacy-Preserving Tools to Distinguish Who Is Real Online. So it's basically making the point that, as we get into significantly advanced AI, it's going to be very difficult to tell whether you're interacting with an AI or with a person. It already is to some extent, right? Chatbots can be pretty effective.

It can be even more of a case once you can do real time audio and video. So, this is analyzing the value of so-called personhood credentials, which are digital credentials that basically prove that you are a real person, not an AI, to online services, but without disclosing who you are to the service. And the idea would be that these credentials would be issued by trusted institutions like governments, which can be local, global, et cetera.

And yeah, it's a very long, I guess, analysis of this, so it's not necessarily going to lead to implementation per se, but makes a case for this being useful.
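To make the "one person, one credential, no shared identity" idea slightly more concrete, here is a heavily simplified sketch. Real personhood-credential proposals lean on blind signatures or zero-knowledge proofs so that even the issuer cannot link your per-service identities, and they specify how a service verifies issuance; this toy version skips all of that and just shows per-service pseudonyms derived from one issued secret.

```python
import hashlib
import hmac
import secrets

class Issuer:
    """A trusted issuer (e.g. a government office) that verifies a person once, in person,
    and hands them a random credential secret. Deliberately oversimplified: it omits the
    cryptography that would let a service check issuance without linking identities."""
    def __init__(self):
        self.issued = set()

    def issue_credential(self, verified_person_id: str) -> bytes:
        if verified_person_id in self.issued:
            raise ValueError("one credential per person")
        self.issued.add(verified_person_id)
        return secrets.token_bytes(32)          # the person keeps this secret

def service_pseudonym(credential_secret: bytes, service_name: str) -> str:
    """Derive a per-service identifier: stable for one service, unlinkable across services."""
    return hmac.new(credential_secret, service_name.encode(), hashlib.sha256).hexdigest()

issuer = Issuer()
secret = issuer.issue_credential("alice-in-person-check-2024")
print(service_pseudonym(secret, "twitter"))     # Twitter sees only this opaque ID
print(service_pseudonym(secret, "reddit"))      # Reddit sees a different, unlinkable ID
```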

Jeremie

Yeah. I thought the paper was like, it was like 60 pages long and I kept feeling like it was going in circles a little bit and like kind of re explaining the same stuff that it explained before. Um, but, uh, but still nonetheless, uh, I, I do think the, like the idea makes a lot of sense.

Um, there's a whole bunch of thinking in the paper, by the way, about this idea of like You don't want to have a government that has too much leverage over deciding who can and can't access certain platforms, this and that, like, how do you, how do you, how do you reduce the risks associated with having, for example, a single centralized issuer of these credentials? And, um, I think that's actually a really good thing, like, again, as a kind of tech guy, uh, Uh, free market, dude.

Um, I, I like that idea. I like the idea of, you know, really trying to find a way that, cause the big challenge, right? This is going back many years. When you look at Twitter, the obvious solution to bots is to force everybody to like freaking show you their passport or something insane like that before they create an account, right? That's the way you would do it. Now that's bad for so many reasons. You don't like anonymity on Twitter.

It makes it so that people in, you know, Saudi Arabia can come forward and criticize the government publicly, right? It makes it, anyway, it makes it possible to do a lot of important things.

Um, so we want to try to preserve that as much as we possibly can and turning to government, like, are you really going to trust government to kind of, you know, Issue these, these credentials in a way that is going to be, you know, democratic and, and, and free most importantly, um, you know, some people might, might suggest the answer is probably no. And, and I would definitely, uh, sort of agree with that skepticism.

So essentially this is all about techniques that you could use to have multiple issuers of these credentials, and multiple services, uh, that would allow you to do this stuff without... it's not like you're giving your, um, you know, your, uh, social security number or whatever when you're signing up. You have distinct credentials for each platform. It's just so they can confirm that there's one, um, personhood credential per person.

And that way you prevent the bot account issue and all that stuff from accumulating. So interesting paper. Um, it doesn't get into the technical side of like, the, the cryptography side of it as much as I might've been interested in seeing, but it kind of makes sense because it's like an, it's an initial thought piece exploring the system side of how this might all work.

So, uh, yeah, really interesting, especially cruel, crucial, as they point out, given the agentic AI users that are going to be on the internet very soon that already are really, but then, you know, they're going to dominate more and more of the internet. We're just going to need a way to tell who's human and who's not.

And this kind of thing, you know, relies on the fact that the one thing AIs can't do, and may not be able to do for a little bit of time at least, is mimic a human in the real world, right? They're not there yet. And so we can rely on the fact that humans can be asked to go in and physically confirm their identity.

Once that happens, you can rely on cryptographic protocols to do the rest of the work and associate that once-verified identity to a whole bunch of other verified accounts on various platforms. There you have
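
To make the shape of that concrete, here is a toy sketch in Python of the "one credential per person per service" idea. To be clear, this is not the protocol from the paper: the Issuer class, the HMAC-based tokens, and all the names here are illustrative assumptions, and a real system would use blind signatures or zero-knowledge proofs so that even the issuer can't link a person's accounts across services.

```python
# Toy sketch (not the paper's actual protocol): one verified person gets
# distinct, unlinkable tokens per service, enforcing "one credential per
# person per service". A real system would use blind signatures or
# zero-knowledge proofs so that even the issuer cannot link accounts.
import hmac, hashlib, secrets

class Issuer:
    """Trusted issuer that has verified each person once, in the real world."""
    def __init__(self):
        self._key = secrets.token_bytes(32)   # issuer's secret key
        self._verified = set()                # IDs of physically verified people

    def enroll(self, person_id: str):
        self._verified.add(person_id)

    def issue_token(self, person_id: str, service_id: str) -> str:
        # Deterministic per (person, service): re-requesting yields the same
        # token, so a person can only ever hold one credential per service.
        if person_id not in self._verified:
            raise ValueError("person not verified")
        msg = f"{person_id}|{service_id}".encode()
        return hmac.new(self._key, msg, hashlib.sha256).hexdigest()

    def confirm(self, token: str, service_id: str) -> bool:
        # The service sends only the token plus its own ID; the issuer checks
        # it against its verified set without telling the service who it is.
        return any(
            hmac.compare_digest(
                token,
                hmac.new(self._key, f"{p}|{service_id}".encode(),
                         hashlib.sha256).hexdigest())
            for p in self._verified)

issuer = Issuer()
issuer.enroll("alice")
tok_a = issuer.issue_token("alice", "twitter")
tok_b = issuer.issue_token("alice", "forum")
print(issuer.confirm(tok_a, "twitter"))  # True: one real person behind this account
print(tok_a == tok_b)                    # False: tokens differ across services
```

In this toy version the issuer can still see everything, which is exactly the centralization risk discussed above; the point of the proposals is to spread that trust across multiple issuers and stronger cryptography.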

Andrey

it. And next, another technical work, this one dealing with interpretability, and the title is Showing SAE Latents Are Not Atomic Using Meta-SAEs. So an SAE is a sparse autoencoder, and that is a way in which you can understand how a large language model works. So what you do is you feed a large language model some data, you let the data propagate from input to output, and you get a bunch of intermediate outputs.

And then you can take the intermediate output and basically train another model that compresses it. So you compress the outputs at some layer and what you end up getting when you compress it is what people call a dictionary of features. So you can tell within the like crazy big set of intermediate outputs that you get, there are these patterns of activation where a certain feature might relate to like German.

And another feature might relate to questions, and another one might be coding, et cetera. And when you do that, it becomes very useful for explaining the behavior of your model, because you can then attribute, you know, uh, behavior to certain intermediate outputs, and you can even affect how a model performs.

And so what this, uh, paper is doing is exploring the question of: when you do this training, are the, uh, features you get, the dictionary you get, in fact the base level, so-called atomic, in the sense that they cannot be further divided? And they argue that that is not the case. That you can take a sparse autoencoder, and then you can train another sparse autoencoder on top of it, resulting in a meta sparse autoencoder.

And that sparse autoencoder then splits up the features further, roughly speaking. So that is the main conclusion of the paper. It has a fair amount of discussion of the nature of sparse autoencoders in general, and I think deals more with the, I guess, hypotheses people have with regards to what it is that you get when you train a sparse autoencoder.
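
As a rough illustration of what training such a sparse autoencoder looks like, here is a minimal sketch in PyTorch. The dimensions, the ReLU encoder, and the L1 sparsity penalty are generic choices assumed for the example, not the specific setup from the paper.

```python
# Minimal sparse autoencoder sketch (illustrative, not the paper's exact setup).
# It compresses intermediate activations into a wide, sparse "dictionary" of
# features, then reconstructs them, trading reconstruction error against an
# L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)   # features -> reconstructed activations

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))      # sparse, non-negative feature activations
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(recon, acts, feats, l1_coeff=1e-3):
    # Reconstruction error plus a sparsity penalty on the feature activations.
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()

# Toy training step on stand-in activations (d_model=512, dictionary of 4096 features).
sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)                         # stand-in for a layer's intermediate outputs
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
opt.zero_grad()
loss.backward()
opt.step()
```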

Jeremie

Yeah, I thought this paper was really interesting. Um, so yeah, as I think you mentioned, Anthropic in particular has been really big into using SAEs, the sparse autoencoders, to do interpretability on very large models.

Um, you know, their concern is about all kinds of things, including, um, this concept of deceptive alignment, where you might have models that pretend to be aligned because they know they're being evaluated, they know they're being assessed, so they pretend to generate outputs that are safe and blah, blah, blah, and pretend to be well behaved, but actually aren't.

And the hope is you can use techniques like this, interpretability techniques, to understand when that sort of thing is happening. And the setup here, again, the kind of SAE setup, is pretty crucial. The way this works, uh, to take another analogy, right: you know, you talked about these intermediate outputs that, um, show up in the model when it's processing its inputs. Um, you know, in your brain, when you consume some kind of input, your neurons will get activated, right?

Different neurons will flash and get activated and so on. And so that pattern of activation is what you're going to feed into your, um, sparse autoencoder. You're going to try to take that in and force that pattern of activations to be represented using a small number of numbers. Okay, you're going to try to compress it down to a small list of numbers and then reconstruct the pattern of activations based on that small set of numbers.

So essentially you're going to try to minimize the reconstruction error. So you're going to encode those activations and decode them, encode them and decode them, and over time you're going to get really good at doing that compression. And the hope is that the small list of numbers that you're hiding all that complexity in, that you're compressing that complexity down to, is going to be interpretable.

That basically, when I look at one of those numbers, that number will be associated with a human-interpretable concept like a box or the color red. So a certain pattern of activations might then be revealed, through this list of numbers, to be associated with an encoding that tells you, ah, the model is thinking about the color red, something like that.

And what they find in practice is that if that list of numbers that you use to encode all the activations is too long, there's going to be a lot of redundancy in what that list holds. So you'll have, like, red box in there and, like, blue box, and those two concepts are actually quite similar; they're related by the concept of a box. Instead of having just the concept of a box and then, separately, the concept of a color, they'll get merged together, because if your list is long enough, you can afford to use a bunch of those entries to be lazy about it and not actually bother asking, you know, is this truly an atomic concept?

And so they'll essentially repeat this process and use another SAE on top of the first one to run the same process again, but with a smaller list of numbers that they're forcing things to get compressed down to. And by doing that, they're able to force these abstractions that are being captured in that list of numbers to be more atomic, and to kind of control the resolution of atomicity of these concepts in a way that, anyway, gives you more control over the degree of interpretability. So I thought this was a super interesting paper conceptually; again, a very simple idea, but it results in something that could have meaningful implications for alignment and safety.
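
And for the meta-SAE step itself, here is one hedged reading of the idea, sketched on top of the previous block: train a second, narrower sparse autoencoder on the first SAE's dictionary, its decoder directions, so that composite features get broken into more atomic pieces. This reuses the SparseAutoencoder class and sae_loss from the sketch above, and it is an illustration of the general idea rather than the paper's exact recipe.

```python
# Meta-SAE sketch (one reading of the idea, not the paper's exact recipe).
# Reuses SparseAutoencoder and sae_loss from the sketch above.
# Take the first SAE's dictionary -- its decoder weight columns, one vector per
# learned feature -- and train a second, narrower SAE on those vectors, so a
# composite feature like "red box" decomposes into more atomic pieces.
import torch

base_sae = SparseAutoencoder(d_model=512, d_dict=4096)      # first-level SAE (as above)
dictionary = base_sae.decoder.weight.T.detach()             # (4096, 512): one row per feature direction

meta_sae = SparseAutoencoder(d_model=512, d_dict=1024)      # fewer meta-latents forces more atomic parts
meta_opt = torch.optim.Adam(meta_sae.parameters(), lr=1e-4)

for _ in range(100):                                        # toy training loop
    recon, meta_feats = meta_sae(dictionary)
    loss = sae_loss(recon, dictionary, meta_feats)
    meta_opt.zero_grad()
    loss.backward()
    meta_opt.step()

# Each row of meta_feats now indicates which meta-latents a given base feature
# decomposes into -- evidence that the base latents were not atomic.
```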

Andrey

And just a couple stories in synthetic media and art before we close out. The first one is: authors sue Claude AI chatbot creator Anthropic for copyright infringement. So this is a class action lawsuit by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson. And, uh, it is similar to lawsuits that have already been filed against OpenAI, though I believe this is the first one against Anthropic, and it is again making the case that Anthropic trained on these authors' data.

And therefore, this is copyright infringement and they should be compensated. So it's added to the stack of lawsuits that are ongoing.

Jeremie

Yeah, first one I guess for Anthropic, so I guess that's a change. Their war chest is going to be smaller than OpenAI's too, so maybe they can't afford to buy the same licenses to, to license their way out of it, but uh, we'll

Andrey

see. And the last story, also dealing with a lawsuit: the artists' lawsuit against Stability AI and Midjourney gets more punch. So, this is saying that Judge William Orrick did allow a direct copyright infringement complaint against these companies to proceed, but had dismissed other claims, asking for more detail.

Now, in a more recent ruling, the judge approved an additional claim of induced copyright infringement against Stability, and allowed copyright claims against DeviantArt and Runway AI, which used models or platforms based on Stable Diffusion. He also allowed copyright and trademark infringement claims against Midjourney. And Midjourney had this style list of 4,700 artists whose names could be used to generate works in their style without the knowledge or approval of those artists.

So some progress there. The judge did dismiss claims under the Digital Millennium Copyright Act with regards to altering copyright management information, but still, you know, it's clearly moving forward, and there is a bit more clarity on the contours of the case.

Jeremie

Yeah. And interesting that that notion of induced copyright infringement is a particular problem for AI. Um, so apparently the definition of induced infringement, hashtag not a lawyer, is if you've got a company that, for example, provides instructions for using their product in a way that violates a copyright or something like that, um, then that can be induced copyright infringement.

And you can imagine how that might apply to models that, when prompted, can generate copyrighted material. You know, Andrey, you were talking earlier today about, I forget what it was, Sonic the Hedgehog or whatever, or Mario; you know, you give it, like, an Italian plumber who's blah, blah, blah, and then it'll generate an image of Mario. Um, you know, that's an interesting case, potentially, for induced copyright infringement.

So, uh, I guess we'll be seeing more of that.

Andrey

And with that, thank you so much for listening to this episode. We are finished. And as always, you can go to lastweekin.ai to subscribe to the text newsletter. You also get emails with each podcast, including those links to the articles and, uh, papers. So that's another way to get to the links. As always, please do subscribe, please share the podcast, and please give us any corrections, any suggestions, and any five star reviews. Or one star reviews.

If you really, if you really feel we deserve it, you can give us a one star, although I think that's a little harsh.

Jeremie

And I will say, it's true, it would be, you could argue, a little harsh, but still, fair is fair. I will say also, in case I am, um, fortunate enough to have a, uh, a baby in the family, you know, at the end of the week or something like that, I just want to say, uh, I'll see you guys, uh, I guess in a couple of weeks when I'm back, uh, from my pat leave, but, uh,

Andrey

yeah,

Jeremie

go, uh, go AI world in the meantime. We will

Andrey

not make you host a podcast on lack of sleep. Yeah,

Jeremie

that'd be

Andrey

great. All righty.

AI Singer

Welcome back. It's episode 180 last weekend. ai. The news is never id all drawing visions, not just c5. Can't deny getting crazy on rise Heck, like AI 2030 vision. We be Future in fire Updates episode one eight. Seeing clear through the fall. Dream Machine 1. 5, jogging minds like a log. Jamba's mixing minutes, tech twist like flop fizz. AI in 2030 with a future by the glitch. Stay vocal, stay tuned, the intel ain't dull. It's last week in AI, where we keep you on the pulse.

Policy papers drop, ain't no regs to bathe. Governments grappling, can they legislate fate? Forever to the states, frameworks start to lace. In this tech race, every move's gotta ace. With tools on the rise, video creation's clean. It lives it's life like it's a stream, dream scene. Welcome to the show, it's last week in AI. Where we take you high, beyond the cloudy sky. Episode 1A0, let the data fly. Brainwaves connecting, updates in the keeping. Open source surge, AI coming like the fever.

Versions drive fast, sec moves like a mob reader. Personhood debates, credentials growing deeper.

