#178 - More Not-Acquihires, More OpenAI drama, More LLM Scaling Talk - podcast episode cover

#178 - More Not-Acquihires, More OpenAI drama, More LLM Scaling Talk

Aug 16, 2024 | 2 hr 6 min | Ep. 217

Episode description

Our 178th episode with a summary and discussion of last week's big AI news!

NOTE: this is a re-upload with fixed audio, my bad on the last one! - Andrey

With hosts Andrey Kurenkov (https://twitter.com/andrey_kurenkov) and Jeremie Harris (https://twitter.com/jeremiecharris)

If you would like to get a sneak peek and help test Andrey's generative AI application, go to Astrocade.com to join the waitlist and the discord.

Read out our text newsletter and comment on the podcast at https://lastweekin.ai/

If you would like to become a sponsor for the newsletter, podcast, or both, please fill out this form.

Email us your questions and feedback at [email protected] and/or [email protected]

In this episode:
- Notable personnel movements and product updates, such as Character.ai leaders joining Google and new AI features in Reddit and Audible.
- OpenAI's dramatic changes with co-founder exits, extended leaves, and new lawsuits from Elon Musk.
- Rapid advancements in humanoid robotics exemplified by new models from companies like Figure in partnership with OpenAI, achieving amateur-level human performance in tasks like table tennis.
- Research advancements such as Google's compute-efficient inference models and self-compressing neural networks, showcasing significant reductions in compute requirements while maintaining performance.

Timestamps + Links:

Transcript

AI Singer

Last weekend, AI cling news. Never slow to a I buzz with a new tech growth departure swing and open at Coast Skill. Hello Cam.

Andrey

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode, we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for articles we did not cover in this episode. I am Andrey Kurenkov, one of your hosts. You may have just heard the AI version of me doing the intro. We'll see if I swap it in and uh, you know, maybe it is, maybe it isn't.

You never know. Uh, but for a bit of context, I did a PhD studying AI at Stanford and now work at a generative AI startup.

Jeremie

I think you might've found a loophole there, Andrey, where everyone's always saying you need to disclose when you use AI. You're saying, I'm not, I'm not saying I did, I may have, that may have been, you know, you're kind of, you're kind of sneaking through. That's pretty good. Um, actually that announcement makes me just a little bit nervous too, because one of the things I wanted to mention was, oh, sorry, but my name is Jeremie Harris.

You know that if you listen to the podcast, co-founder of Gladstone AI, an AI national security company, all that jazz. So also, my wife is expecting a baby, uh, with a due date in mid-September. So, um, it's very comforting to hear that my human, uh, requirement for some time to deal with, uh, you know, the immediate aftermath of the birth, uh, is gonna intersect with AIs that can replicate my voice perfectly.

So we're feeling very, very comfortable with our jobs here at Last Week in AI. Um, and we will absolutely cover this stuff in a neutral way, because we are not biased by the fact that these AIs are coming for our jobs, it must be said.

Andrey

I do hear having babies is a bit time consuming, uh, when they just arrive, it takes a bit of time to take care of them for what I

Jeremie

think. Yes, Andre. My, my wife, um, she's very reasonable on almost everything, except when I proposed that we have GPT-4o stand in for me, uh, through the whole, you know, the first three weeks or so, just as, you know... And I, like, I was like, look, it's literally drawing on vast information reserves that I don't have. It can answer questions that you have about, you know, market economics and geopolitics that I just can't help with. So, you know, why not give it a shot?

But she, she didn't want to. I even showed her the Figure 02.

Andrey

I know, exactly. As we'll cover, there's a humanoid robot that has GPT now. I don't know why you would be opposed to that being your babysitter, personally, but, uh, I guess not everyone is that excited about it.

Jeremie

And our listeners say that I'm biased against mothers. I don't know where they get this. I don't know. I don't. Sorry, guys. That was a joke that came from last episode. Uh, yeah, what is up

Andrey

with these in-jokes we got. Uh, moving on to a quick mention of listener comments and feedback. Actually, we had a review on Apple Podcasts; the title is Very Pro Marvelhood.

Speaker 4

Yeah,

Andrey

keep it going guys, uh, also the AI stuff, but not at the cost of covering the Marvel land experience. So there you go. Here's your Marvel land content for this week.

Jeremie

I really appreciate that. I'm glad that we're getting this. Listeners duking it out in the comment section about whether or not we're pro mother. That's, that's where we want to be. That's

Andrey

what we care about here at Last Week in AI. And actually the next one, uh, also an Apple review, thank you for that, says we're the most substantive AI podcast out there. If you're a new listener, it may not seem like it, but when we do get to the news, perhaps we'll justify that. Uh, and just two more, uh, YouTube comments. And I've loved seeing YouTube comments come in. Please do keep that going. First one is saying that apparently Jeremy's positivity is why I watch each week.

So Jeremy, you are a reason some people, uh, watch, uh, so I'm, I'm sure this listener would be glad to have you back. Uh, this episode, by the way, will come out, uh, pretty quickly after the previous one. I'm still a bit behind, so I'm going to try and catch up.

Jeremie

We just need that Figure 03 to come out with podcast editing capabilities, Andre.

Andrey

Yeah, I wish we had that. And the last comment, uh, that was a question on YouTube: uh, do we have any tips for AI that would integrate with other services? So for instance, ask an AI to generate a business plan, and then it includes a website, creates a Facebook and Insta page. So, basically, it sounds like an AI that can do not just one thing, but multiple things all together.

And, uh, this is a tricky one for me because, uh, I do think for individual things like generating a business plan, starting a website, creating a Facebook and Insta page, you know, we have AI for each of these things individually. So far, I don't think we have kind of a unified way of doing this. And I think that's what a lot of the, uh, agentic startups and efforts are going toward: you spin up an AI agent.

You tell it, you know, generate a business plan, start a website, create socials, make AI posts on these socials, and then the agent, using APIs, using something like that, goes ahead and does all those things for you by itself. So we're not there yet. As far as doing this, you have to basically create your own script with Python to use various services. And you can use ChatGPT to go ahead and write such a thing, but off the shelf, I think we're not quite there yet.
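To make that glue-script idea concrete, here is a minimal sketch of the kind of orchestration Andrey describes. Every function and service name below is hypothetical, a stand-in for whichever LLM, website-builder, and social-media APIs you would actually wire up; the point is only the shape of chaining the steps, not a real integration.

```python
# Hypothetical orchestration sketch: each helper stands in for a call to a real
# service (an LLM API, a website builder, social platform APIs, etc.).
# None of these functions correspond to an actual library.

def generate_business_plan(idea: str) -> str:
    # A real version would call an LLM chat/completions endpoint here.
    return f"Business plan for: {idea}"

def create_website(plan: str) -> str:
    # A real version would call a website-builder or hosting API; we fake a URL.
    return "https://example.com/my-new-business"

def create_social_pages(plan: str) -> dict:
    # A real version would hit Facebook/Instagram APIs; we return placeholder handles.
    return {"facebook": "fb.com/my-new-business", "instagram": "@my_new_business"}

def post_update(pages: dict, text: str) -> None:
    # A real version would post via each platform's API.
    for platform, handle in pages.items():
        print(f"[{platform}] {handle}: {text}")

def run_pipeline(idea: str) -> None:
    """Chain the individual services into one agent-like flow."""
    plan = generate_business_plan(idea)
    site = create_website(plan)
    pages = create_social_pages(plan)
    post_update(pages, f"We just launched! Check us out at {site}")

if __name__ == "__main__":
    run_pipeline("a subscription service for rare houseplants")
```

An agent framework would effectively generate and sequence these calls for you; today, as Andrey says, you mostly end up writing this glue yourself.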

Jeremie

Yeah. And if you're going to set that up, one of the problems you'll run into pretty fast, we've talked about this a lot, but you know, these agents often have, they may have fairly low failure rates on individual steps in their process. Like they may, you know, 90 percent of the time nail the steps in whatever the, the overall goal is, to set up a website and a business. But, um, when you add up those probabilities of failure, like it only takes one to kind of like interrupt the whole flow.

Now I'm grossly oversimplifying just by, you know, calling this one flow, but you get the idea. So anyway, this is an ongoing problem. We'll be talking about it today actually in the context of some of the audits that were done of GPT-4o. Um, the model card came out like this morning or yesterday or something. So we'll be talking about that, but that's definitely a hot topic here.
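To put a rough number on that compounding-failure point: if an agent nails each step 90 percent of the time, the chance of an error-free end-to-end run falls off quickly with the number of steps (treating steps as independent, which is the oversimplification Jeremie flags).

```python
# Chance that a multi-step agent flow completes without a single failed step,
# assuming each step succeeds independently with probability p.
p = 0.90
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps -> {p**steps:.0%} chance of an error-free run")
# 1 step: 90%, 5 steps: ~59%, 10 steps: ~35%, 20 steps: ~12%
```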

Andrey

Starting with the first story: Google's hiring of Character.ai's founders is the latest sign that part of the AI startup world is starting to implode. So I'm not sure if we covered this news last week, but it was a pretty big deal when Google hired the, uh, leadership team of Character.ai. So, uh, these are former Google researchers, DeepMind researchers, Noam Shazeer and Daniel De Freitas, who actually were pioneers of the chatbot technology.

They worked on LaMDA, which was a sort of precursor to GPT-4o. It was a large language model optimized for chatbot-type activity. And, uh, it famously had the whole drama with the claim at the time from, uh, a person inside Google that it was conscious, leading Google to, uh, kind of back down on ideas of releasing it publicly. And as I've said before, I often suspect that had that media thing about consciousness not happened, maybe Google would have been first, before GPT-4o, who knows?

And so, uh, this is, uh, kind of covering that story a bit more, uh, in the trend that we've seen. So Google hired the founders of Character.ai. Supposedly that came with around a $2.5 billion investment in Character.ai. Similarly, we've seen, um, Microsoft buying out essentially Inflection.ai, and with that, the, uh, hiring of the founders of that team, and that led to the investors of Inflection.ai receiving about $600 million, covering these things.

So the trend here is that it doesn't seem like these companies are making it work in terms of being profitable. And so these kinds of moves are indicative of basically them needing to continue to take in cash to probably survive. And, and that does make sense as we've covered many times with AI models, it's hard to get good margins. They're expensive to run much more so than, you know, any sort of other software.

And with a business model of $20 per month subscriptions, you may or may not break even, uh, depending on how much you use it. And Character.ai users in particular are very engaged. People who are spending $20 a month are likely chatting with chatbots a whole lot, possibly enough that the cost isn't offset and it's not that profitable. So interesting trend going on and continuing on.

In fact, there's going to be another story later this episode about, uh, Adept being hired by Amazon. So yeah, interesting. I don't know what you think about it, Jeremy.

Jeremie

Yeah, I think you're right. I think it's part of a number of trends, one of which, which I don't think was covered in the article. I could be wrong, but this idea of, um. The motivation behind these kinds of weird pseudo acquisitions, right? We're not seeing Microsoft come in and say, Hey, we're gonna buy Inflection out, right? They do this weird thing where they bring over the CEO, and they bring over the staff, but they kind of leave this hollowed out husk of a company to just die on the vine.

That's what's happening here. With character. ai, that's what seems to have been happening lately. And the reasoning behind it, uh, seems to be because of antitrust concerns. There's a concern that governments are going to go after, uh, these companies for these big acquisitions, sort of for consolidating the space. This makes it easier in a sense to say, well, look, you know, we never acquired the company. So there's no actual antitrust situation here.

Um, another trend is that, I guess, I don't think this was highlighted in the article either, but I think it's important to note: like, all of these companies, so many of these companies that are in this position, um, are founded by former authors of, or authors of the Attention Is All You Need paper, that famous transformer paper from 2017.

Um, so we have in this instance, um, Noam Shazeer, the CEO of Character.ai, who was one of the authors; um, a whole bunch of other examples, including the co-founders of Adept, which you mentioned obviously had a similar experience earlier in the year. And then of course there is Aidan Gomez at Cohere, right? Cohere is right now another one of these. I would argue, I have argued for years, they're going to be in trouble. Um, and I, I keep saying this as they keep raising more money.

So the, the stakes keep getting higher and higher here, but I really do think structurally Cohere is, Is really, really, um, kind of in, in some hot water long term. They need to find a way to turn a profit. Yes, they've raised 500 million, but they're in fundamentally the same sort of circumstance as all these other companies.

How are they going to turn a profit if they're not coupled Very closely to one of these cloud super scalers like Microsoft, like Google in the way that open AI is in the way that Google DeepMind is and in the way that, you know, in some ways, Anthropic is the, you know, they may arguably fall into that bucket to a certain extent as well. So I think there are a whole bunch of key structural challenges in the space for all the mid cap companies, the ones that basically aren't.

The, you know, DeepMinds, the OpenAIs, um, to try to sort out, how are you going to make that buck? You know, this space is getting flooded with more and more competition. Character.ai has a whole bunch of competitors. I think Kevin Roose from the New York Times was doing some, uh, play test of about half a dozen different comparable companies. And anytime you're in that business, it's like, you know, you're, you're in trouble now.

Worth noting, Character.ai was at least nominally an AGI lab. Their goal was actually to build AGI. So this is another AGI lab that's now been folded into, uh, into Google. And, um, their cap table was impressive, right? We had Andreessen Horowitz, uh, jumping in as part of a $150 million round that they led, uh, with a billion-dollar valuation. That was just back in March of 2023.

Um, so, you know, these are really good investors that are making this same mistake. Hey, if, if, if Marc Andreessen listened to the podcast, you know, maybe, maybe you'd have a few million dollars more, but anyway, uh, not actually serious about that. But anyhow, this is, uh, I think the, the big trend in the space. They were kneecapped, uh, Character.ai, by the way, by, uh, Meta coming out and saying, look, we're, we're debuting our own family of AI characters.

This was, uh, back in October last year, so, so about 10 months ago or nine months ago or so. Um, and then they allowed, they created a new feature recently that lets users create their own. So you're really seeing the proliferation of this capability. People don't have to sign up to a specialized service anymore. Meta is already offering it. More people will as well. You have to assume.

So. I think this kind of makes all the sense in the world, but it is an indication of this underlying trend that seems to be pretty persistent.

Andrey

That's right. And, uh, in addition, another component of this trend that has been active for, you know, a while, at least a year and a half, is intense competition to acquire AI talent. You know, all of these companies are paying a lot, either in actual salary, I mean, the salaries at OpenAI supposedly are reaching into six figures easily for the top researchers. Seven figures. That's right, six or seven figures, which, if you don't know, is like in the millions.

Um, and so this, uh, kind of hiring also is indicative of the competition, where, you know, just to hire the leadership and most of the talent of, um, Inflection AI, Microsoft paid something like $600 million. So that's a lot of money for an acqui-hire, right? And another reason for this as well could be the competition for data, where Character.ai, Inflection AI, they have a lot of users chatting with their chatbots.

And that is a very, very strong source of data that no one else has access to. So yeah, a lot of trends kind of, uh, coming out of these kinds of acqui-hires and the search for talent. And speaking of that, we do have an update on the acqui-hiring of Adept by Amazon. So this happened earlier this year, where the top employees of the company were hired by Amazon back in June, and now investors in Adept will receive some reimbursement.

So Adept will receive $25 million, while investors will roughly recoup their investment. So very much like the Inflection AI situation, where it was also the case that when Microsoft, uh, sorry, hired away the team of the company, the investors in Inflection did recoup their investment via that amount. Uh, yeah, so there you go. Pretty much a similar story, and it is quite weird as a trend for the tech sector. For a long time, they were acquiring outright in general, rather than acqui-hiring. Acqui-hires were generally with smaller companies, were partially for the tech, uh, and for the employees. Uh, generally, to my knowledge, acqui-hires at this scale, at this amount of money, were just not happening. And now they are.

Jeremie

Yeah. The typical structure for an acqui-hire is, like you said, you know, the, the acquiring company really doesn't care at all about your, your, your IP, your product. Uh, the view is you have good talent, they want that. So they'll buy the company wholesale and then just basically hollow it out, gut it out, take the employees, but also own the IP.

In this case, what's happening is, I'm sorry, but one important note too: um, in Silicon Valley, when this happens, it usually happens for early-stage startups. So we're talking about companies that have raised, say, on the order of, you know, tens of millions of dollars tops, something like that, right? The idea here is, if you're raising hundreds of millions, there's, there's no way that your value isn't in part your product. Like that's just, it'd be a very weird notion.

Um, it's just that again, as you say, the talent here is so sought after that you're seeing what is effectively an acquihire happening here. Um, when acquihires happen, by the way. Very, very, um, unusual for investors to get their money back. One of the key things about an acquihire is it's usually structured, you know, so that often employees are taken on board and they're given these very generous compensation packages. And that's the form that the acquiring capital takes.

This allows the acquirer to get around having to pay off investors for their equity in the company. Now that you can view that as a negative outcome for investors for sure. And maybe a bit of a, I don't know, like a dark strategy. It certainly gives the founders of the acquired company a really nice soft landing, but it leaves investors holding the bag. And so in this case, uh, they're doing it at such a scale.

And they've got great investors like Greylock and General Catalyst who've put in, you know, over $400 million into Adept. So at this point, they're saying, well, look, we're going to make you whole, which is, again, really weird for an acqui-hire. They're in a hurry as well to not call it that, and explicitly saying, look, we're not interested in any of the IP here. We're not interested in the product. Um, you know, almost to an insulting degree.

If you're an Adept founder, they're saying, look, we don't care about this shit. We want the people. And so, you know, that's, that's what's happening. That's just the weird nature of these things. Part of it really is, again, that, that goal of avoiding regulatory scrutiny on the basis of antitrust. Don't make this look like you were just straight up, like, acquiring a company.

Um, you want to have this weird kind of strange process that gets around, uh, these merger notification rules, um, that are going to trigger potentially, you know, FTC interest and things like that. Yeah, very interesting time. I don't know how much this is going to hold up because the obvious, um, let's say strategy here is to acquire like the, the end result here is the same.

So yeah, I'm, I'm curious how regulators would look at this and what tools they even have to, uh, to poke at this kind of deal.

Andrey

That's right. And we have seen, um, antitrust being enforced more strictly, at least it seems that way, in recent years. Certainly the EU is very active on the antitrust side, but also the U.S. As we'll cover later in this episode, Google had a big development, uh, this week. And, uh, just to read a quick quote covered in this article: uh, last year, the FTC did make a statement about this trend.

Uh, so here's a part of the quote: firms hoping to compete in the generative AI space need expertise, and companies that acquire both engineering experience and professional talent will be better positioned to gain. Since requisite engineering talent is scarce, powerful companies may be incentivized to lock in workers and thereby stifle competition from actual or would-be rivals. To ensure a competitive and innovative marketplace, it is critical that talented individuals be permitted to move freely.

So that's talking more about, uh, factors in your contract that basically forbid you from going to competitors, uh, but it could also apply to these quasi-acqui-hires.

Jeremie

Yeah. And actually one, one last note, just to, for, for context on why the refund is happening in this case, like one big source of pressure. Um, every time I've invested in a, like a early stage company, when there's an acqui hire, there's always this question from the founders of like, do we? Payback our investors just reputationally because those founders are going to want to go off and found another company someday.

And they may want to raise from those same investors or Silicon Valley is a small world, man. Like the, like early stage startup investors, like a lot of us know each other and that sort of thing. And then the later stages, it gets even more like that because general catalysts and, you know, these big firms, um, the partners move around and all that stuff, they talk to each other. So. The goal really here is reputational control as well. Don't mistake this. There's no good outcome here, right?

Getting a refund is not, is not the intended goal for an investor, especially in a company like this. But you know, reputation management is important and um, essentially Microsoft here is allowing Um, these, uh, or sorry, Amazon in this case is allowing, um, the, the reputations of the founders to be managed with respect to their previous investors to allow them potentially to go on and do this again.

So, um, anyway, that's an important ingredient that I'm sure is part of the decision making here.

Andrey

That's right. And, um, for companies that aren't public, uh, to my knowledge, it's much harder to do a sort of, uh, aggressive acquisition via buying of their stocks, right? So you kind of need the agreement of the founders to be acquired or to be hired in these cases, right? So the founders could very well be laying out these conditions that to allow themselves to be hired, they have to have this happen.

Jeremie

Yeah, that's always a question of leverage there, right? Like, how much does Amazon really want Adept? How much does Adept, can Adept survive on its own? There's, that's all part of that negotiation. You're absolutely right. What form does the comp take and all that, for sure.

Andrey

Moving on, we've got the story that AI chip startup Groq has seen its valuation rise to $2.8 billion, and that's after they closed a $640 million funding round led by, uh, BlackRock. They've also announced two new appointments: Stuart Pann, a former Intel executive, who has joined as the COO, and they also have Yann LeCun, who has joined as the company's newest, uh, technical advisor. And Yann LeCun is a very famous AI researcher who leads the AI research efforts at Meta. So a pretty big deal here for them.

They had previously closed a $300 million funding round back in 2021, at which point they were valued at $1 billion. And earlier this year, in March, they were reported to have deployed around 4,500 chips. So presumably this will help them scale, move faster. And, uh, certainly, I think, Groq, we've covered them a good deal, and out of the companies that are trying to make novel chip infrastructure to compete with NVIDIA and so on, to make sort of AI-native chips, so far it appears that they are in the lead and actually might be finding a decent amount of success.

Jeremie

Yeah, they're big. Um, their, their lead is in that niche, right? Of inference. So their chips, their so called language processing units or LPUs that we've talked about a lot on the show, um, are just for inference, not for training. So in that sense, it's, you know, one slice of that market. Um, and, uh, yeah, the, the big question here is going to be, can they scale production?

They've shown, at least through demos, that they can achieve outrageously fast inference speeds that compare very favorably to things like the H100, even the B100, B200s that are coming online. Um, what, right now the question is, can you achieve volume production? That's what this is about. You know, Andre, you mentioned 4, 000 of these LPU units produced as of March this year. Well, As of March 2025, um, their goal is to roll out more than a hundred thousand. So 25 X increase in production.

Um, for context, when you look at the scales that NVIDIA operates at, you're looking at millions of chips per year. So we're talking about an actually significant level of scale coming from Grok a hundred thousand versus millions. You know, you're, you're sort of an order of magnitude off for a company at this earliest stage. That's a pretty impressive, uh, pretty impressive play.

Another thing to note is, you know, you always have to wonder what are the margins like, especially as you're scaling up, you haven't optimized all your processes, right? You're, you're doing a shakedown cruise for every new kind of product line here. Um, what they say is they're aiming for quotes, a full dollar return for every dollar we spend on hardware. We don't intend to lose money. Unusual for a play like this, but when you're in hardware, you kind of have to do that, right?

You can't burn money in quite the same way because things are just so expensive. As people say in Silicon Valley, hardware is hard. Um, so yeah, it will be interesting to see if they can actually make good on that intended scale of 108, 000 of these LPU units. Um, and you know, if they 10 X again, right there. They're really like in NVIDIA scale territory.

Uh, they've got a bunch of partnerships, by the way, these include Meta and Samsung, a whole bunch of sovereign nations and, and, uh, sovereign wealth funds, things like that. So, um, you know, a lot of serious people with smart money taking these guys very seriously, and we'll see if the scaling, uh, comes out the way they hope it will.

Andrey

Right. And, uh, also to me it's interesting: to my knowledge, Groq isn't quite like NVIDIA, in the sense that they are using a lot of their, uh, AI chips, perhaps all of them, I'm not sure, to have a cloud offering with which you can do inference on open models that are released, so Mistral, Llama, et cetera. That's their kind of moneymaking machine. They are competing more so with Amazon and Google in that sense; unlike NVIDIA, they're not selling off the chips.

And, uh, as we covered before, the big differentiator is their inference speed. They're lightning fast; I think the latest benchmarks are maybe three, four times faster than if you were to run it on NVIDIA chips. So very significant. And, uh, I think there's still not much competition on that front in terms of inference speed. Of course, a lot of people are trying to compete on that front, but Groq is a clear leader.

So that's another advantage where if you're not trying to sell to customers, you're just scaling up internally. Lots of factors that make that easier to move fast.

Jeremie

Yes. Some, some that, in fairness, do make it harder as well, right? Because you've got to have a large enough user base that you can have these really large batch sizes. Like, one of the things that drives down the cost of inference the most is being able to process large batches of data in parallel at the same time, right? So you're parallelizing the use. You're getting more value essentially out of your existing units of compute.

And one of the challenges for Groq is, then, then you've got to be in the distribution game as well. You've got to actually host these models. You've got to convince people to use your thing instead of Amazon's thing. You don't natively have that distribution, those massive enterprise deals and all that stuff. So, you know, I think this is, for them, especially given the way their hardware is set up, the memory is really limited, and so you need many, many of these units to be able to hold one kind of, say, Llama-scale model, and, um, and, and that's an issue. So it increases even further their need to have very large amounts of throughput to kind of amortize across all those devices that they need to have in their, uh, in their sort of server farm. So, yeah, uh, I think all of this is making Groq a very distinct company, and it could be a banging success, or we could find out there are weird issues with this, this model as it scales.

And I think the next couple of months, we're going to learn an awful lot about this. You know, March 2025 is not that far away.
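To put toy numbers on the batching point above: the fixed cost of keeping an inference box running gets divided over however many requests you can serve in parallel, which is why utilization and distribution matter so much for a provider like Groq. All figures below are invented purely to illustrate the shape of the math.

```python
# Toy illustration of why batch size drives inference economics.
# Every number here is made up; only the relationship matters.
server_cost_per_hour = 20.0          # hypothetical hourly cost of one inference server
tokens_per_sec_per_request = 50      # hypothetical decode speed seen by each user

for batch_size in (1, 8, 64):
    tokens_per_hour = batch_size * tokens_per_sec_per_request * 3600
    cost_per_million = server_cost_per_hour / tokens_per_hour * 1_000_000
    print(f"batch={batch_size:>2}: ~${cost_per_million:.2f} per million tokens")
# Bigger batches spread the same fixed cost over far more tokens, but only
# if you have enough concurrent users to actually fill those batches.
```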

Andrey

One last comment, there's a lot to say here. Uh, they're also competing against OpenAI and Anthropic, right? In the sense that if you use an API for a, uh, chatbot LLM, you're kind of choosing between OpenAI and Anthropic and, in this case, open-source models. So for them, it's a very, very big deal that Llama 3 has come out and is on par with GPT-4 and Anthropic. And perhaps that's a bit of a reason why we've seen GPT-4 costs just plummet down, right?

We're in commodity territory, with OpenAI very much trying to compete on price to a significant extent. So that's another challenge. Next story is a bit more drama, you could say, from OpenAI, and another trend we've seen. John Schulman, a co-founder of OpenAI, one of their research leads, who joined OpenAI basically out of grad school, has been there since the beginning, and was quite impactful in the development of ChatGPT, has left the company to join Anthropic.

And once again, I think the reason cited was wanting to focus more on AI alignment and safety. And alongside that, we've seen Greg Brockman, the OpenAI president and co-founder, taking an extended leave until the end of the year to, quote, relax and recharge. And finally, Peter Deng, a product manager who has been at OpenAI since last year, has also left the company. So, uh, you know, maybe a coincidence, maybe not, right?

Certainly, it seems like a lot of people in AI safety and alignment want to be at Anthropic, which is not surprising. Anthropic is more focused on that compared to OpenAI and just about any other player. But some of these other high-profile departures are perhaps a bit more surprising.

Jeremie

Yeah, absolutely. And, and, you know, it shouldn't be lost on us that, uh, we're now on the fourth generation of OpenAI alignment leadership. That's not a good sign, right? We had Paul Christiano, who originally was like the head of alignment. He left to start METR, or I guess it's become METR; it was ARC.

Uh, back in the day, or ARC Evals then. Um, and then we had, uh, yeah, and then we had Jan Leike, who left, uh, following the, um, very public disagreement that he had about the course of OpenAI's, uh, kind of progress on superalignment, and the extent to which Sam Altman seems not to be living up to his commitment, uh, on the compute side, 20 percent of their committed compute as of about a year ago to superalignment. So he left, and now we've got John Schulman.

I remember watching him on the Dwarkesh Patel podcast. Um, he gave this, like, long interview. Uh, this was shortly after, I think, um, Jan Leike, uh, was it after he departed? Anyway, it was in the context of all that. And I remember being really impressed. I was like, wow, you know, like, this is not bad. This is a guy who seems to take, you know, the issue pretty seriously and may have different views from, from Jan or whatever, but, uh, but generally seems quite engaged with it.

Uh, and I was like, yeah, it's sort of surprising, given, uh, Sam Altman has seemed to make moves that suggest, at least to me, that he's not terribly serious about the issues that he has claimed at one time to be serious about. Um, and now, sure enough, John Schulman leaves. So not a great sign, I think, for a lot of people who are, uh, concerned about OpenAI living up to their word here again. Um, not a great sign, necessarily, as well, given that he's a co-founder of OpenAI.

A lot of people have been talking about that, you know. What, what do you do if you genuinely believe that OpenAI is on the cusp of building a superintelligence? Do you actually leave the company, right? That seems like a weird move for somebody to take if they genuinely thought that OpenAI was on course to do, uh, what they seem to think they're on course to do. Now, I, I personally don't take stock of, uh, or put too much stock in that, that's not the right expression.

But anyway, I don't take that argument too seriously. Um, I think, uh, just based on talking to folks at OpenAI, um, they very much seem confident. Uh, and I haven't seen that change. Um, I think you gotta wait. You know, the thing with scaling is, We get to, uh, discover what the scaling curve can do only every two, two and a half years, right? It was two and a half years between GPT 3 and GPT 4. Um, so we've got a year to go before GPT 5 on that schedule.

Uh, so we're actually not behind schedule, but people seem to forget, like, the scaling curve just, you know, it takes a while to get to the point where you can actually discover what that next point looks like. So I suspect things are actually going technically better at open AI than most people realize. I wouldn't be surprised if we see something impressive come out from them in due course.

Um, but this does seem to reinforce that narrative that on the safety side, uh, open AI is sort of less than, uh, less than serious on, uh, or at least at the level that they've claimed they would be. At least that's my impression personally. Um, the, the product departure is interesting. The Greg Brockman departure, again, a lot of people reading a lot into that. I'm not so sure that that is an indicator either. You know, he's been there nine years.

Uh, you can see somebody like that needing to take, you know, six months off or five months off or whatever. Maybe, you know, he's got a training run that he's got to wait for to complete or something, and he's taking a breather, but, um, hard to know what to make of all this stuff. Definitely, OpenAI's, uh, kind of upheaval is, uh, is going to be the narrative that's around. I think it's partly accurate, as I've said, especially on the John Schulman side.

That one I think is the most interesting and informative piece. I do know there are people at OpenAI who've known that this was coming for some time. Um, but, uh, but anyway, uh, yeah, I, I don't know how much to read into the, the Greg departure.

Andrey

Right. I do agree. I think with the Greg part of this, uh, that may not indicate very much. It is pretty realistic that he needs some time off. I'm sure, you know, we can be certain that it's been a very chaotic time at OpenAI for the past year and a half. They've grown, I don't know by how much, I believe they had maybe about 150 employees as of 2022, if I recall correctly. Now I wouldn't be surprised if they're four times, you know, six times that, something like that. Oh, I think they're over a

Jeremie

thousand now.

Andrey

Yeah, exactly. So maybe like 10x scaling. And when you do that, you know, in a short period of time, one year for a startup, it's, it's insane. It's crazy. You know, it's, it's chaos. So, uh,

Jeremie

Yeah, it is. Scale does, does wreck a lot of things in startups. One of the things it usually doesn't, because it tends to be correlated with success: you usually don't see founder-level people leaving like this. Um, but, but you know, OpenAI is not normal, right? Like, I mean, this is, this is a company that's trying to build AGI. So, uh, you know, expect the unexpected. And, uh, to your point, I mean, these departures mean something.

It's not clear fully what, but, uh, some of them, like again, the John Schulman one, I think that is absolutely, uh, to me, highly suggestive of the kinds of problems that Jan Leike highlighted, the kinds of problems that, you know, whistleblowers at, uh, OpenAI, left, right, and center have been talking about, uh, with respect in particular to Sam Altman's leadership and the extent to which he's been, um, sort of manipulating his way around and through the board, um, uh, at least allegedly.

And that seems to, it seems to stack up with all this stuff.

Andrey

Right. And, and again, Greg Brockman isn't leaving. He's taking a leave. But, uh, a four-to-five-month leave is pretty significant, right? So that's worth noting. On to the lightning round, another story about OpenAI drama, with the story that Elon Musk has filed a new lawsuit against OpenAI. And this is kind of the same story. He filed the first lawsuit alleging that OpenAI has breached the founding agreement of the company.

Musk, of course, was one of the co founders and put in a lot of money at the beginning. He left, uh, uh, parted paths with OpenAI in 2018. He said that was due to also working on AI at Tesla. and not wanting to have conflicts of interest. Sources and reporting has told us that perhaps it was more so because Musk wanted to take over, wanted to be the leader of OpenAI and was not able to take a position from Altman.

So the previous lawsuit, uh, argued the same thing and was, uh, basically withdrawn by Musk. It was, uh, yeah, dismissed by him as opposed to by a judge. Uh, it, uh, was pretty flimsy from our knowledge. It basically said that there was this founding agreement to be a nonprofit, but it was not even a contract. It was sort of an informal agreement. So, uh, at that point, OpenAI had dismissed Musk's claims as incoherent and frivolous in a blog post that included Musk's emails.

This lawsuit is about, uh, twice the length of the first, uh, it was filed in a federal court in California and also claims that OpenAI is engaging in racketeering activities. So there you go. Uh, Elon Musk continuing to be on the attack against OpenAI.

Jeremie

Yeah, this is, um, this is pretty interesting. So, uh, you know, the, the narrative, as you said, is like, look, I put my money into OpenAI on the basis that this is going to be a nonprofit. Um, and then Sam Altman says, oh, we just discovered AI scaling is a thing.

And that means we're going to need a ton of money if we're going to achieve our goal of building AGI, which means we need to become a for-profit, or at least tap for-profit money through the kind of weird structure that they have over there. And Elon says, well, you can't just do that. Um, that's a violation of blah, blah, blah.

Now the, the emails that OpenAI then leaked were essentially showing, uh, showing Elon acknowledging this, acknowledging the need for the company to make a ton of money in order to fund all the compute they need to, to scale their, their AI. So their argument is, look, you knew about this, you were okay with it. Um, you know, you, you, you recognize that need. This. is kind of interesting. I mean, I don't, I'm not a lawyer. I don't feel remotely equipped to assess the validity of this lawsuit.

Um, it's got some bombastic language in it. I mean, it refers to, uh, the perfidy and deceit of open AI as being of Shakespearean proportions, which I really, really like. Appreciated that it alleges that Elon was, quotes, betrayed by Altman and his accomplices. So you know, some, some very strong language there, obviously there's no love lost between Elon and Sam. Uh, this is a rivalry that goes back quite a ways to the kind of first split when Elon was, uh, by some accounts forced out.

Um, and, I mean, just look at some of this language, right? Musk's case against Sam Altman and OpenAI is a textbook tale of altruism versus greed. Altman, in concert with other defendants, intentionally courted and deceived Musk, preying on Musk's humanitarian concern about the existential dangers posed by AI. So, I mean, this is like really intense stuff. Um, and you're right, the racketeering piece, right?

This is like when I read racketeering, again, not a lawyer, but I've seen enough mob movies to know racketeering is an interesting freaking thing to be alleging. So where does this come from? Well, I think again, not a lawyer, but I think this is, this is the quote that brings up the racketeering concern and how they're justifying it.

So they say, In partnership with Microsoft, Altman established an opaque web of for profit OpenAI affiliates engaged in rampant self dealing, um, seized OpenAI's board and systematically drained the nonprofit of its valuable technology and personnel. So this is a really interesting angle for the racketeering case, right?

The case is like, look, OpenAI itself, yeah, technically might not super be a for profit or entirely be a for profit, but it's making deals with hardware companies and those hardware companies are owned or indirectly owned by Sam Altman. It might be making deals with energy companies that are owned or indirectly owned by Sam Altman. So now you've kind of got this indirect way that OpenAI is fueling this growth. The idea of its technology and power is that it's going to make a lot of money.

Even, especially, personnel being siphoned off. Um, you know, maybe this is an allusion to that kind of, uh, power, uh, play where, you know, Altman threatened to leave OpenAI for Microsoft, because Satya invited him over, and to bring all the researchers with him. Um, that didn't actually end up happening, but it certainly was threatened. I don't know, but I just thought the racketeering piece was really interesting.

And it seems like as, as good of a case as you could make, uh, in this context for, for that kind of, uh, charge.

Andrey

And, uh, next, again about OpenAI indirectly, but, uh, a bit less dramatic. The story is about Figure, the humanoid robot company, which has unveiled the Figure 02, the successor to their Figure 01 humanoid robot that was unveiled just last year. This one was developed in partnership with OpenAI, which helped, uh, Figure raise a, uh, $675 million Series B round back this February.

And, uh, as we sort of knew back when this was announced, the idea is that Figure 02 is powered by ChatGPT when it comes to intelligence, and now we know also for, uh, general natural speech conversations, of course, which plays in nicely with the GPT-4o announcements from, uh, OpenAI and the development of, you know, real-time voice chat from that model.

So pretty, you know, again, a trend we've seen for the last year is a lot, a lot, a lot of investment in humanoid robotics, a lot of rapid progress in humanoid robotics. Uh, there's quite a few competitors, 1X, Agility, others, that are trying to build humanoid robots. Of course, even Tesla is investing a lot in this direction. So there you go, it's pretty exciting for me. And apparently Figure 02 has undergone a ground-up hardware and software redesign, has a bunch of cool hardware on there, and certainly looks cool.

Jeremie

Which is the most important thing. Um, the hands, yeah, apparently hands are a key differentiator. And so I, I'm always amused. Like I, I, um, you know, I, I'm more on the kind of the model side, um, the capability side, scaling, all that jazz. Uh, so when it comes to, to the extent that I, I focus on hardware, it's all, all this about, right. The processors and, and that sort of thing, not the robots. And so. I keep surprising myself at how little I know about robotics.

One of the things apparently is that there's a camp of people who think human inspired hands are way too delicate and just over engineered and um, so there's this debate as to whether you should kind of optimize for human like hands or not. Figure is making their whole thing human like hands. We're just kind of diving into that, going to try to make it a thing. Um, so Figure 2 has, you know, there are a lot of videos, you can kind of see them to emphasize heavily that.

hand dexterity, that sort of thing. The other thing that I didn't really clue into it was just in this article that they put enough of this together where you're like, ah, okay, I can see the trend. Um, so cars, car manufacturers seem to be for whatever reason, the go to early use case for these kinds of robots. Um, figure is beginning pilots with BMW apparently. And apparently the, so figure two robot is already been over to a bunch of BMW plants, um, in North Carolina to kind of collect data.

But they mentioned Agility, Apptronik, Sanctuary AI, all of which have similar pilots with carmakers. And when they said that, I was like, oh yeah, they do. I'd never connected those dots. So that's, you know, and then you said Tesla, well, yeah, Tesla Optimus, that's kind of where that comes from. Um, so anyway, for whatever reason, that is the, you know, the high-margin, low-hanging-fruit, um, application, it seems, for these kinds of systems, which I thought was pretty interesting.

Andrey

Yeah. And I think that makes a lot of sense, right? Because we do have a lot of humans doing physical labor at these plants. There's, uh, in some cases a shortage of talent, because you do need talented, uh, people on those lines. You need to move fast. You need to keep working. And that's one of the benefits: you can go 24/7 with humanoid robots. You can't do that with humans.

And one other thing I'll note is this is an interesting case of competition, because with Figure, 1X, et cetera, when you're at a, uh, car plant, you don't necessarily need, uh, kind of general-purpose reasoning capabilities. You don't necessarily need a lot of knowledge about politics or other things that ChatGPT and so on provides you. What you really, really need is very, very robust control of the robot.

You need to be able to move, uh, use your arms, uh, do all of that quickly and, uh, accurately. And that's been a challenge, a longstanding challenge in robotics. And that is kind of the point of competition, in addition to the hardware, uh, among these companies: making the software that allows robots to work well on par with humans. That's not been the case.

Yeah, I don't think it's the case yet, but a trend in robotics has been collecting more data and being able to sort of scale up models and really train machine learning models over, uh, classic control techniques. So, uh, I'd be surprised if we see humanoid robots being able to replace humans fully on, uh, plant lines, uh, until maybe a few years from now, but it does seem to be heading in that direction.

And next, another story related to hardware, this time about chips, and it's about how ASML and Tokyo Electron have dodged new U.S. chip export rules. So the U.S. is considering invoking the foreign direct product rule; as we've said, uh, with that, it wants to regulate foreign products made with American technology. But as the story says, so far ASML and Tokyo Electron have been able to continue producing and selling to China, in part because they have a potential exemption from the rules.

The U.S. may be excluding these Dutch and Japanese semiconductor equipment companies. So there you go. Partially, this may be because the countries themselves are likely to conform to the stricter export policies without the U.S. invoking this FDPR policy. And uh, yeah, ASML is very, very important for being able to do, uh, cutting-edge, uh, chips, which China is very much behind on; they are, you know, not on the scale of TSMC, which is the provider for NVIDIA.

So very, very important for China to try and catch up and the U. S. is trying to curb that possibility.

Jeremie

Yeah, that's right. So ASML being that Dutch company that makes the machines that make the machines, right? So essentially they make the photolithography machines that TSMC in Taiwan uses to make the chips that NVIDIA uses to make the GPUs. So they are way, way, way at the very top or bottom of the supply chain, the very beginning of the supply chain. Um, so this is actually about, yeah, not necessarily just the, Export of ASML machines, but the ongoing maintenance as well of these devices.

So one thing that not a lot of people realize is that, um, when you, when ASML sends a machine, a device over to say China, they don't just send the machine. They send a team of about a dozen personnel to maintain the system, to integrate the system, to ensure its proper use and functioning. And so, If the US were to come in and say, Hey, you know, like you can't do that.

They, they essentially can not just, um, uh, prevent China from receiving new machines, but significantly hamper, maybe even halt, the operation, the ongoing operation of ASML machines in China right now. So that's a really, really powerful tool.

Um, you know, to your point, like there's this question of, okay, U. S. comes in and imposes this rule, um, this rule essentially says, if you have a product that is made with even the slightest sliver of American technology, it falls under this rule. So in other words, U. S. the U. S. can tell you, no, you're not allowed blanket rule. You're not allowed to export to these kind of firms that are on this list that they're making or expanding rather.

So essentially we're in a situation right now where it seems like the U. S. might be banking on these companies, just adhering potentially to those. Um, those constraints, uh, voluntarily in a sense without having to impose that constraint, or they're okay with the FDPR not actually being applied. It's a little unclear how that's going to fall out. Um, but just to give you an indication of how much of an impact this is having on ASML's bottom line, uh, their shares went up 11 percent.

Uh, Tokyo Electron's went up about 7 percent on this news. So essentially the market feels that the value of China to ASML is so significant that the U.S. not invoking this rule is worth, uh, you know, more than 10 percent of the company in expected value, right? With, with no guarantees, of course, that, uh, that these rules won't in practice be enforced, because the companies might kind of be having their arms twisted to do this pseudo-voluntarily.

Um, we've got to see how it plays out, but this is an interesting, uh, uh, Uh, next move, it's, it's unclear, you know, we talked last time about how, uh, ASML and Tokyo Electron were pushing back really hard against Washington, making this argument to invoke the FDPR, uh, and trying to pressure their respective governments to push back too. But you've got at home, American companies, by the way, who, because they're in US jurisdiction, they are already forced. To adhere to these protocols.

So companies like Applied Materials, LAM Research, KLA, all these companies have been complaining, look, we're shouldering this burden. Meanwhile, ASML, they can ship to China. Tokyo Electron can ship to China. We're facing a huge disadvantage here. They're basically lobbying the US government to make this a unilateral play. So everybody plays by the same rules. That's all kind of part of the political and economic game that's being played here.

We're gonna have to wait to see what in practice is the effect of the US seeming to choose not to invoke this rule at this point.

Andrey

And to provide some context, ASML, as you said, provides the hardware necessary to make cutting-edge chips, the, you know, the latest generation of the smallest chips, which is necessary to develop NVIDIA-quality GPUs. And, you know, these machines cost, I think, about $400 million. They are not easy to make. I think ASML is, to my knowledge, the only provider of the latest generation of these, uh, extreme, what is it, extreme ultraviolet lithography machines.

So a very big deal here to, uh, think about. And the U.S. does have some leverage in the sense that, as we've covered, there's a lot of investment in domestic chip production, you know, on the order of billions of dollars. So, uh, these companies certainly do want, presumably, those, uh, new efforts to also buy from them. TSMC is partnering with some, uh, U.S. companies on making plants in the U.S. So very interesting dynamics going on here.

Maybe, you know, if you're not interested in geopolitics and, uh, hardware production, it may be less interesting, but for us, very interesting. And finally, the last story of the section, uh, business taking up half this episode, surprisingly: uh, the story is that OpenAI has reportedly led a $60 million round for webcam startup Opal. So this company is developing consumer electronics for high-end webcams. These are, uh, $150 webcams.

They have one that's called Tadpole, and it's designed to clip onto a laptop monitor, and it's, uh, really small, really, uh, kind of pretty, and very high resolution. Now that is not an AI product, so the suspicion is that they will now work on an AI product, given the investment from OpenAI.

Jeremie

Yeah, apparently, so, Opal has actually done some, you know, AI integrated devices in the past. So this, you know, wouldn't be completely out of left field for them. Uh, not that this tells us anything about what they're planning to do next, but, uh, prior to launching their kind of latest thing, which is, uh, yeah, webcam called tadpole. It's 150 bucks. Um, kind of their flagship product before that they had a webcam called the C1.

It was twice as expensive and used some, some AI tools basically on board to optimize the quality of, of the video that it captured and to do some background blurring effects, that sort of thing. That's about as. As close as it gets to AI natively at this company. So that leaves us with a lot of open questions, you know, obviously those don't seem like obvious open AI flavored things, but we'll see what ends up being made of this.

Andrey

Right. And, uh, it again follows a trend. Uh, Sam Altman in particular seems to be a big believer that AI-powered hardware is important. So we've seen them invest in Humane, for the infamous Humane AI Pin. And, uh, here's another example of that, investing in this device. We've also seen, uh, I guess, reports of, uh, the designer of the iPhone, Jony Ive, um, possibly working with OpenAI on another AI consumer device. So, so far it hasn't really worked out.

We'll see if one of these companies finally nails it. And we are now done with business, moving on to tools and apps.

Starting out again with OpenAI, the story is that OpenAI has cut GPT-4o prices again and has also launched structured outputs. So this is a big price reduction. It makes it 50 percent cheaper for input tokens and 33 percent cheaper for output tokens. And this is again after another price cut, after GPT-4o already being the cheapest model on the market that is usable by companies. And that's a big deal because now, I think, the main source of income for these companies is the users of the API, the customers making use of these products. So certainly it is a big deal that they are able to do this. Structured outputs means that you can, uh, output, uh, more, let's say, yeah, as it says, structured, more sort of programmatic outputs that are easier to parse for programs.

So things like spreadsheet formatted things, uh, database formatted things, et cetera. Uh, we've seen that already be launched by other companies, so not fully novel, but certainly an important capability.
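As a quick sanity check on what a cut like that means in practice, here is the arithmetic, using the new GPT-4o prices quoted a bit later in this discussion ($2.50 per million input tokens, $10 per million output tokens) and the roughly $5 / $15 that the stated 50 percent and 33 percent reductions imply as the old prices. The workload size is made up.

```python
# Rough before/after cost for a hypothetical monthly workload.
# New prices are the ones quoted in this episode; old prices are back-calculated
# from the stated ~50% (input) and ~33% (output) cuts.
input_tokens = 10_000_000    # hypothetical monthly input volume
output_tokens = 2_000_000    # hypothetical monthly output volume

old_cost = input_tokens / 1e6 * 5.00 + output_tokens / 1e6 * 15.00
new_cost = input_tokens / 1e6 * 2.50 + output_tokens / 1e6 * 10.00
print(f"old: ${old_cost:.2f}  new: ${new_cost:.2f}  saving: {1 - new_cost / old_cost:.0%}")
# old: $80.00  new: $45.00  saving: 44%
```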

Jeremie

Yeah. And, um, this came as well with a, an announcement from OpenAI. Oh, and I should mention too, right, the closest comp to this is probably Gemini 1.5 Pro, um, which, you know, rather than OpenAI's two and a half bucks per one million input tokens, is three and a half bucks per one million input tokens.

Uh, mission accomplished relative to Gemini 1. 5 Pro in any case here. But obviously those prices are going to keep going down as hardware gets better and software improves. Um, yeah, one of the cool things that came with this was a blog post where opening. I talked about, you know, how did they actually train this model to, uh, to do this? How did they train it to do the structured outputs, uh, that, um, that are generated here now for a little bit of context. Structured outputs.

One of the key structures you might want to play around with is something called JSON. JSON roughly is the format of data, the format you want to put data in. If you're going to move it around the internet, let's say if you have a, some sort of like web service and somebody wants to ping your web service to get data that they're going to, for example, show on their website. So you might package it up in a JSON file or you basically share it in that format. Um, JSON is.

JSON is just a way of formatting data: you have curly braces, you have keys and values separated by colons, and so on. There are some basic rules of formatting you have to adhere to if you're going to produce valid JSON.
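
For instance, a minimal sketch of what that looks like in practice; the field names here are made up for illustration:

```python
import json

# Made-up data a web service might return; the field names are purely illustrative.
payload = {"name": "Tadpole", "price_usd": 150, "in_stock": True}

# json.dumps applies those formatting rules: curly braces, quoted keys, colons, commas.
print(json.dumps(payload, indent=2))
# {
#   "name": "Tadpole",
#   "price_usd": 150,
#   "in_stock": true
# }
```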

OpenAI first tried to just fine-tune their latest model, GPT-4o, to understand some of those more complicated schemas and to make sure it produced outputs that matched those schemas. The problem is that no matter how much you do this, your language model's behavior is never going to be guaranteed. It's always going to be a little bit random, so you're going to see some failure rate. What they found was that through fine-tuning, through all these operations on the base model, they were only able to get up to 93 percent on their benchmark.

If you're a developer and you're pinging an API and you want to get back a data package that has a certain format, 93 percent is just not good enough, right? It's not going to cut it. Every single time you ping that thing, you need to get your data back in a format that's not going to cause everything to crash. So what OpenAI does is go, okay, 93 percent is not good enough, we're going to have to figure out another approach.

What they do is essentially use a technique to constrain the model's outputs in a kind of blunt way. By default, when you try to generate an output from a model, the outputs are totally unconstrained. In principle, the model could choose any next character, and it just has to be smart enough to choose the right one. That flexibility is what leads to mistakes.

So what OpenAI ends up doing is saying, look, we're going to have a separate kind of parser that determines, given what has already been outputted, what the acceptable next characters are that would keep this output a piece of valid JSON. And they're going to force the model to choose only characters that fall into that category. This requires more compute, right?

There's more compute going on in the background to run this assessment and determine that, in fact, this is a valid JSON package. And they say the first time you try to get a result using this API, it's going to take up to 10 seconds to process that first request, or as much as a minute for more complicated schemas, just because the system has to figure out that rule set and make sure it's being adhered to. Things get faster thereafter.

But I just thought this was really interesting. You can think of it almost as a mix with symbolic reasoning, right? Because you're imposing actual hard, deterministic constraints that restrict what the model can in fact output. So anyway, this is a really interesting tool.

If you're like us, if you're a developer that uses these tools and you need a guarantee of a certain format or structure, this is actually going to unlock some use cases.
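
To make that concrete, here's a minimal sketch of the general grammar-constrained decoding idea. To be clear, this is not OpenAI's actual implementation; the tiny hard-coded grammar and the stand-in model below are assumptions purely for illustration:

```python
import math
import random
import string

# Toy "grammar": the only complete outputs we will accept. A real system would
# derive the allowed continuations from a JSON schema rather than a fixed list.
VALID_OUTPUTS = ['{"ok": true}', '{"ok": false}']

def allowed_next_chars(prefix: str) -> set:
    """Characters that keep the partial output a valid prefix of the grammar."""
    return {s[len(prefix)] for s in VALID_OUTPUTS
            if s.startswith(prefix) and len(s) > len(prefix)}

def fake_model_logits(prefix: str) -> dict:
    """Stand-in for a language model's next-character scores (uniform here)."""
    return {c: 0.0 for c in string.printable}

def constrained_generate(max_len: int = 32) -> str:
    out = ""
    for _ in range(max_len):
        allowed = allowed_next_chars(out)
        if not allowed:                      # nothing legal left: the output is complete
            break
        logits = fake_model_logits(out)
        # Mask out everything the grammar forbids, then sample from what remains.
        weights = {c: math.exp(logits[c]) for c in allowed}
        total = sum(weights.values())
        r, acc = random.uniform(0, total), 0.0
        for c, w in weights.items():
            acc += w
            if acc >= r:
                out += c
                break
    return out

print(constrained_generate())  # always valid, e.g. {"ok": true}
```

A production system would presumably do the same masking over tokens rather than single characters, with the schema compiled into a grammar up front, which would fit with the slow first request mentioned above.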

Andrey

Exactly. It's a particularly big deal for companies and enterprise customers. We've seen other companies like Lamini already offering this guarantee, and as OpenAI has been competing for enterprise customers, I wouldn't be surprised if this came out of the requirements of those types of customers. Final note, on the cost-cutting side, the price war hasn't just been between OpenAI and Google; it has also been between inference providers.

There's a bunch of these, such as Together AI; they have so far done a lot of the undercutting of costs, and this is probably going to hit them pretty hard. For them, it might be harder to stay afloat. And next, someone we haven't touched on: Apple. The story is that Apple Intelligence could get a $20-plus version. Wow, where have we heard that before? This might be named Apple Intelligence Plus and would offer extra features for a monthly fee.

Now, this is still just reportedly the case, we don't know for sure, but it seems not unlikely. And yeah, this seems to be the universal business model: $20 a month for better models and more features.

Jeremie

Apparently, the challenge they're looking at here is that if they're going to introduce a paid subscription, most iPhones are not going to run the Apple Intelligence capability. Only users with one of the newer phones would actually be able to pay for it.

So anyway, Apple is going to have to find a way to convince people to upgrade to cover all those costs. We'll see; I think Apple is playing from behind here. We're seeing them lean on a lot of third-party products, which is not necessarily a bad play. There is a world where the value accrues at the hardware level and that becomes the least fungible thing in the ecosystem. And maybe that's it.

Maybe Apple ends up winning without having to play this game, but we can't blame them for trying to spin up this sort of business model, if only so they can have a part of the company that responds to those incentives and is heavily incentivized to push forward on AI capabilities and product integration.

Andrey

And by the way, the other $20-per-month subscriptions I was referring to: that's OpenAI, that's Anthropic, that's Google, that's Perplexity. It seems like now you kind of have to go at $20, you can't go above it, just because everyone else is going for $20. On to just a couple more stories a bit more quickly. We go back to Amazon, and the story is that Audible is testing an AI-powered search feature.

This is supposedly called Maven and will be used to help users find specific audiobook titles using natural language queries. They are also playing around with some other features, like AI-curated collections and AI-generated review summaries. So there you go, probably not too surprising. This is currently only available to a select group of U.S. customers on the phone app and is limited to a subset of the audiobook library, so they're just starting to roll this out.

But it's likely a feature that will continue to expand.

Jeremie

Yeah, we don't know what models are powering this feature, so there's just a lot of uncertainty here. Apparently a spokesperson said that Maven is going to leverage, quote, "the strengths of multiple models" and will, quote, "continuously evaluate as models improve," which obviously makes sense. But it's clear that they're not tying themselves to anyone in particular.

I think the interesting thing is that this is an Amazon-owned company, so if they don't end up using Amazon's internal models and instead use, say, OpenAI models or whatever, that could be seen as an interesting sign. And there has apparently been a recent report that thousands of AI-voiced audiobooks are being listened to by Audible users; they add this as a side note in the article.

I thought that was kind of noteworthy; it tells us something about the actual market value of this tech at this point. So lots of jobs under the gun these days, it seems.

Andrey

Next, a very similar story: Reddit is going to test AI-powered search result pages. This is seemingly coming later this year, and these AI-powered search results will provide users with AI-generated summaries of the top search results. I guess now that's everyone, right? We've got Google, we've got Bing. Perplexity, of course, is there; that's their whole product. Audible is seemingly doing this. Now Reddit is also jumping on the train.

And this is perhaps not surprising given that Reddit partnered with OpenAI back in May, so a pretty clear application of ChatGPT there. On to research and advancements, and the first story is kind of an exciting one; I think it has generated the most buzz in technical AI research circles. The title of the paper is Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.

So basically, test-time or inference-time compute is the compute you spend after training, when you actually run your model; the question is how much you allocate to that. We've seen many ways you can allocate more compute, for example running additional models and combining their outputs. Here they analyze two mechanisms. The first: you can search against dense, process-based verifier reward models.

Not dissimilar in spirit to the OpenAI structured output technique of checking intermediate outputs. The second: you can update the model's distribution over responses adaptively at test time. The headline result is that the compute-optimal strategy can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-n baseline.

And in a FLOPs-matched evaluation, an evaluation that uses the same total amount of compute, they found that a smaller base model, if you use the right test-time compute strategy, can outperform a model roughly 14 times larger.

Jeremie

Yeah, this paper is, I think, badly needed. It's the paper that I personally was waiting for for a long time. We've talked a lot on the podcast about this idea of a kind of exchange rate between the compute you use during training and the compute you use at test time, or inference time, depending on how you want to phrase it. And there have been a lot of papers showing that you can kind of trade them off.

You can choose to invest your marginal operation, your marginal FLOP, in training the model more, or in making the model think harder after it's been trained, focusing on the particular problem you've just fed it, the particular prompt. What we're learning here is that the picture is actually quite complicated. Some of this research on the trade-off has been contradictory, by the way.

In some cases they'll say, oh yeah, it works great, there's kind of a blanket exchange rate between the two. And then others say, well, actually this works really poorly, especially on complex logical tasks. So the question they're going to try to answer here, in part, is: which one of these stories is true, how do they break down, and what does it mean to scale test-time compute optimally?

Now, when we think about scaling training-time compute, it's easy to picture what that involves: train the model on more data, spend more computing power, more FLOPs, to dial in the weights and get the model trained up. But what does it actually mean to scale up test-time compute? What does it mean to get a model to think harder, if you will, when contemplating a particular prompt at inference time?

Well, there are a couple of different strategies to do this, and the first thing this paper has to figure out is how to identify strategies that achieve it. One naive one is called best-of-n sampling.

This basically means you give the model a prompt, it generates n different outputs in parallel in response to that same prompt, and then you select the one that scores highest according to some reward model or verifier model. The verifier reviews them and goes, okay, that's the best answer, and that's the answer you use.
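
In code, best-of-n is about as simple as it sounds. This is a generic sketch where `generate` and `score` are assumed stand-ins for your language model and your reward/verifier model, not any particular API:

```python
import random

def best_of_n(prompt: str, generate, score, n: int = 8) -> str:
    """Draw n candidate answers (conceptually in parallel) and keep the one the
    verifier/reward model scores highest. Larger n means more test-time compute."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

# Toy usage with stand-in functions (a real verifier is a trained reward model):
print(best_of_n(
    "Solve: 12 * 7 = ?",
    generate=lambda p: random.choice(["84", "74", "The answer is 84."]),
    score=lambda p, a: 1.0 if "84" in a else 0.0,
))
```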

So there you can see how that involves putting in more compute at inference time, at test time: it gives you more results, you sort through them to pick the best one, and you can see how that would lead to a better output by investing at test time. That's not the only strategy you can use, though. What they do in this paper is use that best-of-n sampling as a benchmark.

That's what they compare all their strategies to. The strategies they explore most deeply involve taking, because this is Google, a fine-tuned version of the PaLM 2 model, which is a Google language model, and fine-tuning it to revise incorrect answers to math problems pulled from the MATH benchmark.

So the idea is: let's make a model that takes in a prompt for a math problem we want to solve and generates an output. Now this model is fine-tuned to look at that output and revise it, to try to make it a bit better once the whole output has been produced. That's one strategy.

The second strategy is instead to look at the correctness of individual steps along the reasoning chain, using what's known as a process-based reward model; we've talked about those before. So you've got the idea of doing a wholesale revision of a fully formed response, compared to doing a step-by-step correctness verification process. You can do both of those more thoroughly by investing more and more compute.

You can revise an incorrect answer n times and use roughly the same amount of compute as you would to generate n different responses. So it's an interesting question which approach works best, and the answer turns out to depend on the kind of problem you're trying to solve.

If you're trying to solve a relatively simple problem, and by simple I mean a problem the base language model can already kind of solve in one shot, already getting low-but-okay results, then you get better results from doing n revisions of an initial answer than from doing n attempts in parallel. And that kind of makes sense, right? Because in a sense, the base LLM is already doing okay.

Revising that answer, it has enough knowledge to gain a lot of juice from that process; you're in refinement mode, if you will. Whereas if a problem is harder, especially problems that require searching over a lot of different high-level approaches to problem solving, then resampling new responses from scratch, basically re-prompting and doing n independent attempts with a kind of tree search approach, turns out to work a lot better.

So they use these to create what they call an adaptive, compute-optimal strategy. What this means is: give me a prompt, and in a dynamic way, my system is first going to assess what kind of problem this is. Is this one of those easy problems where I can get a decent initial solution and then iterate on it?

Or is this a harder problem where I need to try a lot of different solutions in parallel, using search strategies that in some cases actually look a lot like AlphaGo? It sorts prompts into those buckets and then applies the matching strategy. Using that technique, they reduce the compute requirements by fourfold relative to their naive baseline. So this is really, really interesting.
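
As a rough sketch of that adaptive routing idea, and this is our illustration rather than the paper's code, with every callable below being an assumed stand-in for a trained component:

```python
def compute_optimal_answer(prompt, estimate_difficulty, sample, revise, verify,
                           budget: int = 8):
    """Route the same test-time budget differently depending on estimated difficulty.
    estimate_difficulty, sample, revise, and verify all stand in for trained models."""
    if estimate_difficulty(prompt) < 0.5:
        # "Easy" prompt: the base model is already close, so spend the budget
        # sequentially refining a single initial answer.
        answer = sample(prompt)
        for _ in range(budget - 1):
            answer = revise(prompt, answer)
        return answer
    # "Hard" prompt: spend the budget on independent attempts and let a verifier pick.
    candidates = [sample(prompt) for _ in range(budget)]
    return max(candidates, key=lambda a: verify(prompt, a))
```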

It has a lot of implications for the future of AI, because if you're going to take a system and have it run as an autonomous agent over long periods of time, that implicitly means you've already trained the model; you're now looking at test-time compute, finding ways to optimize how the model solves problems in real time. The last thing worth flagging, and this wasn't mentioned in the paper, is a trade-off worth keeping in mind.

Training compute is, in a sense, better because it's a one-time expense, so it's not quite a one-to-one exchange. If you pre-train a model to the point where it can solve a problem in one shot, that's an advantage the model retains for the next problem you're trying to solve.

Whereas any compute you invest at inference time, unless you have some persistent memory you can store intermediate results in, is lost forever after the problem is solved and the session ends. Other than that, though, I thought this was just a really cool paper. It is the paper I've been waiting for in so many different ways, so it's really exciting to see it out there.

Andrey

Definitely. And to that comment about AlphaGo, a little bit more context. The naive view of how these models work is: you have a large language model, you give it an input, it produces a corresponding output, and that is your answer.

In practice, the way LLMs generate text can be tuned in different ways. When you generate a sentence, you probably want to look ahead and think, if I say this word next, will it make sense in the context of what comes after it? That's the idea behind beam search, where you explore multiple possible continuations in parallel, keep the most promising ones, and only then look at what comes next, as in the sketch below.
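
Here's a generic beam search sketch, not tied to any particular model; `step_logprobs` is an assumed stand-in that returns log-probabilities for possible next tokens given a partial sequence:

```python
def beam_search(step_logprobs, start_tokens, beam_width: int = 3, max_steps: int = 5):
    """Keep the `beam_width` highest-scoring partial sequences at every step,
    instead of greedily committing to a single next token."""
    beams = [(list(start_tokens), 0.0)]          # (sequence, cumulative log-prob)
    for _ in range(max_steps):
        expanded = []
        for seq, score in beams:
            for token, logprob in step_logprobs(seq).items():
                expanded.append((seq + [token], score + logprob))
        if not expanded:
            break
        # Prune back down to the best few candidates before looking further ahead.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]                            # highest-scoring sequence found
```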

And then, similar to AlphaGo, you ask: what is the final outcome, and does it look correct? That's where you can use that PRM, the process reward model, to explore and evaluate every possible step, very much like what AlphaGo does. So yeah, a very important paper coming from UC Berkeley and DeepMind. I like to see universities still being impactful; in this case it was a person from UC Berkeley doing an internship at Google DeepMind.

So perhaps technically fully DeepMind, but universities are still the place the researchers who go to DeepMind and so on originate from. And speaking of DeepMind, the next story is again from them, on robotics this time rather than large language models. The title is Achieving Human Level Competitive Robot Table Tennis, and the claim is that they present the first learned robot agent that achieves amateur human-level performance in table tennis, in ping pong.

It uses a hierarchical and modular policy architecture, so going back to something like reinforcement learning: it has low-level controllers with detailed skills, and a high-level controller that chooses among the low-level skills. Nothing new there; it's a common architecture where you decide what you want to do at the high level and figure out how to do it at the low level. They are also using something called zero-shot sim-to-real transfer.

So you train it in simulation and then use it in the real world without having to do any real-world training. That makes a lot of sense for ping pong: you can simulate the physics very well, and you can train against yourself or against another AI pretty easily. So maybe a less impactful paper on the surface, but it's important to keep training robots in simulation and, apparently, have them transfer to the real world. A toy sketch of the hierarchical controller idea is below.
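
The skill names and the rule-based chooser here are made up for illustration; in the actual system, both levels would be learned controllers rather than hand-written rules:

```python
# Low-level skills map an observation to motor commands (the values are arbitrary).
LOW_LEVEL_SKILLS = {
    "forehand":   lambda obs: {"paddle_angle": 0.3,  "swing_speed": 1.0},
    "backhand":   lambda obs: {"paddle_angle": -0.2, "swing_speed": 0.8},
    "reposition": lambda obs: {"paddle_angle": 0.0,  "swing_speed": 0.1},
}

def high_level_controller(obs: dict) -> str:
    """Chooses which skill to run; in the real system this is a trained policy."""
    if obs["ball_incoming"]:
        return "forehand" if obs["ball_x"] > 0 else "backhand"
    return "reposition"

def act(obs: dict) -> dict:
    skill = high_level_controller(obs)       # high level: decide what to do
    return LOW_LEVEL_SKILLS[skill](obs)      # low level: decide how to do it

print(act({"ball_incoming": True, "ball_x": 0.4}))
```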

Now, on performance: across 29 matches against human players, the robot only won 45 percent of the games overall, although it did win a hundred percent of matches against beginner players, somewhat less, 55 percent, against intermediate players, and lost every match against advanced players. So we'll see; maybe if it trains some more, it can beat those better players. Pretty sure it's not going to go Olympic level.

Jeremie

Not in time for the end of the Olympics, or maybe they've already ended; I'm not tracking. But yeah, it's interesting. One of the cool things about ping pong, you said it actually, is that it's a very constrained environment, so you're not going to have that many factors that take you by surprise. It's not that easy to go out of distribution.

And in fact, in the paper they tested it in out-of-distribution contexts with new players it hadn't seen before. In a sense that's out of distribution, but the distribution is: you're at a ping pong table, there's a net, there's a table, there's a racket. Everything is fairly contained and constrained.

So in a way, I think this is an interesting way to learn about generalization, because you have such a favorable context for it. The failure modes you do see might be a little more obvious. You can also have physics simulation engines that more accurately capture what's going on, which (a) is part of what's allowing them in this case to succeed at out-of-distribution generalization,

but (b) will also let you detect failures of out-of-distribution generalization and make it easier to interrogate those failures. So from an academic standpoint, I think this is actually quite interesting as well. I'm not going to be playing ping pong against a device like this, it'd kick my ass, but it looks damn cool in the images.

Andrey

Yeah. We have some videos of this; as always with robotics, at the very least you get very fun videos, a little less boring than language models, let's say. I'm just saying, it's very fun to see robots doing stuff, and in this case even more fun seeing them play table tennis. If you look at the robot, yes, it would definitely destroy me at table tennis. Next up, back

to neural networks: the paper is Self-Compressing Neural Networks, and as the title says, the focus is on reducing the size of neural networks, which is a big advantage in making them cost less and consume less power. That's been a real trend with big LLMs; we've seen the 7 billion and 2 billion parameter models,

typically done via distillation. Here they propose a method called self-compression, which aims to remove redundant weights and reduce the number of bits needed to represent the remaining weights; quantization is also very popular. They say you can maintain the same accuracy with as few as 3 percent of the bits and 18 percent of the weights remaining in the network.

I don't feel it's necessarily an overly novel approach or result, but it's a big, important trend and an important result on that front.

Jeremie

Yeah, at least for me, I thought this was a really cool paper. One of the things they do that's distinct here: when you train a neural network nowadays, you get the model to make a prediction of some kind. If it's a text autocomplete system like GPT-4, you're predicting the next word in the sequence.

And based on whether it was right or wrong, you update the values of all the weights, all those parameters, those numbers in the neural network. This allows you to do a really good job, but what if not all of those weights actually (a) need to be there, or (b) need to be represented at full resolution, at full floating-point precision, let's say?

It's often the case that we want to, as you said, quantize: reduce the resolution we use to represent the weights in the neural network, if only because it makes it easier to fit these things on edge devices, it reduces the cost of inference, and so on.

The challenge is that historically you've had to choose: are we training a neural network with weights that have this many bits of representational precision, or that many? You lock it in, then you train your model. In this case, what these guys are saying is: why don't we also make the precision of the representation of these weights trainable?

Why don't we make it so that the model over time learns not only the values of the weights, but also the level of precision with which each weight needs to be represented? And the really cool thing is that over the course of training, it will dynamically learn, okay, take weight number 112: maybe I can represent it with four bits. It still works pretty well. What if I try three bits?

Still works pretty well. Sometimes it goes all the way down to zero bits, and when that happens, you just remove the weight. So baked into this process is a natural weight-pruning mechanism. And that's what they find: automatically, just by making the representation precision trainable, they prune weights they don't need and reduce

the number of bits they need to represent the remaining weights. And one consequence of this, and this is really cool, I've never seen a plot like this: as you train the model from one epoch to the next, where an epoch of training is one full pass over your whole training set, the amount of training time for each epoch actually decreases, because the complexity of the model is dropping over time.

We're losing those weights, we're reducing the precision we use to represent the remaining weights, and so the model gets cheaper and cheaper to train as training proceeds. This is just a really interesting strategy. Essentially, you can think of it as the model dynamically re-architecting itself. It gets to decide, should I be a convolutional network today? That's an extreme example, but should I look like that?

Okay, well, if so, I'm going to learn that these weights need to just disappear and these ones don't, and so on. This takes away, in a sense, the developer's responsibility to specify the full architecture of the model.
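
Here's a heavily simplified sketch of that idea in PyTorch. This is our illustration of a learnable per-weight bit depth plus a bit penalty, not the paper's exact quantizer, whose formulation differs in detail:

```python
import torch
import torch.nn as nn

class SelfCompressingLinear(nn.Module):
    """Simplified sketch: every weight carries a learnable bit depth. The forward
    pass quantizes each weight to its current bit depth (straight-through rounding),
    weights whose bit depth reaches zero are pruned, and bit_penalty() is meant to
    be added to the task loss so training is rewarded for shrinking bit depths."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        # Every weight starts at 8 bits; crucially, this is a trainable parameter.
        self.bits = nn.Parameter(torch.full((out_features, in_features), 8.0))

    def quantized_weight(self) -> torch.Tensor:
        bits = torch.relu(self.bits)             # bit depths can't go negative
        scale = 2.0 ** bits                      # more bits => finer resolution
        scaled = self.weight * scale
        # Straight-through estimator: round in the forward pass, treat rounding as
        # identity in the backward pass so gradients still reach weight and bits.
        rounded = scaled + (torch.round(scaled) - scaled).detach()
        q = rounded / scale
        return q * (bits > 0).float()            # bit depth at zero => weight pruned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.quantized_weight().t()

    def bit_penalty(self) -> torch.Tensor:
        # Total bits used by the layer; adding gamma * bit_penalty() to the task
        # loss rewards the optimizer for shrinking (and eventually pruning) weights.
        return torch.relu(self.bits).sum()
```

The overall training loss would then be something like the task loss plus a small coefficient times the summed bit penalty over layers, so the optimizer trades accuracy against total bits.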

In addition to giving this really interesting additional degree of freedom, they use an interesting strategy to build this so-called quantization-aware training process, QAT, based on a paper Yoshua Bengio put together back in 2013. It's really cool; I just found this so interesting. One challenge with this strategy is that it's kind of short-term greedy.

If your model is given a batch of training data and, just for that batch, it turns out you don't need a particular weight, the model might get rid of that weight. But maybe it was only in that batch that it wasn't useful, and in the next batch it would have been. So there's a risk that you discard things a little too eagerly.

I forget the term they coin for that in the paper, something like permanent forgetting. Anyway, I just thought this was really interesting. It could be scalable, I don't know; the big question, as always with these new paradigms, is how scalable it will really be. But just as a concept, it's one of these beautifully simple things that makes you go, why didn't I think of that before?

But that's the beauty of, for the moment, human innovation.

Andrey

Definitely. And they do compare to a 2022 paper that had a similar idea of figuring out the right bit depths at runtime. They have a couple of modifications, including being able to round all the way down to zero and remove weights, and in the comparison they have seemingly much better performance: they prune a lot more and therefore train a lot faster. Although, as you said, it's unclear if it actually scales.

They are testing this on a relatively small model from back in 2018, on a small dataset, CIFAR-10. Not surprising: this is not from DeepMind, not from Meta; this is from two researchers at a place called Imagination Labs. I don't imagine they have a ton of compute, but I also wouldn't be surprised if there's follow-up research from places like DeepMind.

Jeremie

And I do think that rounding-to-zero piece is, in a sense, the part that makes it so the model is dynamically re-architecting itself. Because when we think about model architectures, a big part of it is deciding which weights truly go to zero. That one little innovation, I think, is behind a lot of what makes this thing so interesting. But you're right.

We've got to see it scale, and also, when we think about fine-tuning downstream, that could get a lot more challenging if fine-tuning would otherwise end up revealing the need for weights that had previously been discarded. So I think it's wobbly in some ways, but really interesting in others. To your point, let's see if it scales.

Andrey

And for some more context, what makes this special is that typically these kinds of things, compression and quantization, are done after training. The ideas themselves are not novel; in fact, it's very common these days to quantize to four bits, three bits, even two bits to fit your models on less compute, but you do that after training. Just a couple more research papers. The first one is actually related to structured outputs, as we covered with OpenAI earlier.

It's titled Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. They are asking: if you limit the LLM to output in a certain format, as opposed to whatever it wants, does it perform as well? Does it actually solve your problem as well as if you did not constrain it to a certain output structure?

And perhaps surprisingly, they observe a significant decline in LLMs' reasoning abilities under structured format restrictions. To be clear, one of the ways they do this is by comparing prompts on a problem like: Eliza's rate per hour for the first 40 hours she works each week is $10, and so on; if Eliza worked 45 hours this week, how much are her earnings for this week? The first prompt says: reason step by step, then give a final answer.

The second prompt says: output the answer in JSON, although you can include step-by-step reasoning in that JSON. So seemingly not too dissimilar, but in practice the second one performs a fair bit worse. A very interesting and practical result.
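
To give a feel for the comparison, here are two illustrative prompts in the spirit of the paper, not its exact wording; the overtime detail in the problem is paraphrased:

```python
# Illustrative prompts in the spirit of the paper's comparison, not its exact wording.
problem = ("Eliza's rate for the first 40 hours she works each week is $10 per hour, "
           "with a higher overtime rate after that. If Eliza worked 45 hours this "
           "week, how much are her earnings for this week?")

free_form_prompt = (
    f"{problem}\n"
    "Reason step by step, then give the final answer."
)

json_restricted_prompt = (
    f"{problem}\n"
    "Respond only with a JSON object of the form "
    '{"reasoning": "<your step-by-step reasoning>", "answer": "<final answer>"}.'
)

print(free_form_prompt)
print(json_restricted_prompt)
```

Same underlying question, but per the paper, the format-restricted version measurably degrades reasoning on average.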

Jeremie

Yeah, and I think one of the immediate implications is that you want to break up the problem-solving step and the formatting step. Generally that's not going to be that computationally expensive, just because the formatting is a pretty simple operation.

You could probably do it symbolically in different ways, but in most cases you're going to want to start by asking your AI system to generate a step-by-step reasoning chain, and then integrate that into some kind of JSON or XML or whatever format you're trying to ship. They do show some interesting side-by-sides for different models and different formats as well.

They compare XML to JSON to FRI to a whole bunch of different things, and they look at what happens when you force your language model to generate outputs in those formats while the problem you're trying to solve takes different forms. What if it's a math problem? What if it's a kind of logical natural language problem? What if it's a problem that involves identifying the last letter in a sentence, or things like that?

And you see the task and the format both play important roles. I would not have expected that asking for the result in XML versus asking for it in JSON would make one a lot easier for the model to solve than the other. That's kind of interesting.

It suggests that some of these formats come with what you might think of as a higher or lower formatting tax, a computational tax at test time, which causes the models to stumble when they go to solve the logical reasoning piece, because they've invested so much of their reasoning into just the formatting piece. I thought that was pretty interesting. And it's another example, I guess, of a Moravec's paradox, right?

Where tasks that are really easy for humans are sometimes hard for AIs and vice versa. This is a pretty counterintuitive thing; it definitely wouldn't have occurred to me that this would be the case.

Andrey

Right. And on average, it appears that XML is the best format, which to me is surprising; I would have expected it to be JSON or something like that. Also kind of sad, you know, nobody likes XML, so why is that the good one?

Jeremie

Yeah, I don't understand what the hell is going on, because JSON to me is conceptually just friendlier, more intuitive to use. I don't know. I would,

Andrey

My guess is JSON is less flexible, so there are more restrictions; it has to be all dictionaries, whereas with XML you can do a lot more. That could be the reason. And the last paper is titled Berkeley Humanoid: A Research Platform for Learning-Based Control. This is a new, reliable, and low-cost mid-scale humanoid research platform

built in house, specifically designed for learning algorithms, with low simulation complexity, anthropomorphic motion, and high reliability across falls, very important for humanoid control. They say there's a narrow simulation-to-reality gap, so you can train in simulation and then deploy in the real world, and they are able to use reinforcement learning to control the robot. So there you go, another humanoid robotics story.

Uh, pretty fun once again to look at the videos.

Jeremie

And moving on to policy and safety, we're going to start with, I guess you'd call it a story; it's really a Twitter thread and then a report, but all the most salient stuff, I think, is in the Twitter thread. There's this organization called METR, and it's an offshoot of what was formerly known as ARC Evals, where ARC stood for the Alignment Research Center.

There still is, I believe, ARC, the main org, which was founded by Paul Christiano, who back in the day was the head of alignment at OpenAI. So this is a very talented, and I believe well-funded, org that does audits of language models. They famously conducted the original GPT-4 audit that showed GPT-4 could convince a human to solve a CAPTCHA, all that jazz.

Here we have METR coming back. They've done a bunch of audits of Anthropic models and OpenAI models, and they're introducing us to their approach, their strategy. A lot of what they focus on is these autonomy evals, sort of survival-and-flourishing evaluations: basically, can an agent, for example, self-replicate? Can it extract its own code, its own weights? Can it run or spin up another agent, things like that?

What they're basically saying in their initial results, as they look at Claude 3.5 Sonnet and GPT-4o, is that if these models are turned into agents with some basic scaffolding, they can complete a good proportion of tasks similar to what humans can do in about 30 minutes.

There are all kinds of exceptions to this, but these models are able to perform a good chunk of tasks, again on the order of what humans can do in 30 minutes. They have a great plot showing they use the amount of time it would take a human to perform these tasks as a kind of waypoint to gauge how effective the agents are. They've got a whole new suite of autonomous capabilities evaluations.

We don't have to get into them all, but the broad areas are cybersecurity, machine learning, and software engineering, all the things that would be involved in basically exfiltrating, in breaking out, if you were an AI agent trying to, well, take over the world is kind of the idea here. So they look at that.

They focus essentially on scaffolding improvements to language model agents that work with a fixed token budget and fixed compute budget, and just see how well the agent can do when it's trying to perform one of these tasks. An interesting result they show goes back to the question we've just been talking about: inference-time or test-time compute.

They experiment with what it looks like when you increase the token budget, roughly the compute budget, of these models at test time as agents. What they find is a steady increase, a scaling law of a sort, that starts to plateau as you get to around a hundred thousand or so tokens. So you get diminishing returns beyond a certain point with current models. I thought that was really interesting.

It's also noteworthy that the scaling law seems to start to plateau independent of the scale of the underlying model, at least that's what it looks like from here. Okay, sorry, I should be more specific: it looks from this as if you actually could keep increasing the fraction of tasks completed simply by scaling the model.

You have to look at the curve, but they have a curve showing that the fraction of tasks completed as you increase the token budget goes up faster for larger models. So you could keep increasing the size of the underlying model and, presumably, if the scaling law just continues, get some decent success. Interesting to see how far that goes.

One result they also share, which I'll add as a last note: when an agent can do a task, it typically does it at about 1/30th of the cost of the median hourly wage of a US bachelor's degree holder. That's an interesting result. The Claude 3.5 Sonnet agent, for example, fixed bugs in some library at a cost of under two bucks, whereas a human baseline took over two hours. There's obviously a lot of variation there.

We've seen versions of this before, especially in the context of concerning capabilities like offensive cyber attacks, where we saw that GPT-4, GPT-4 Turbo that is, could automate the discovery and exploitation of one-day and zero-day cyber vulnerabilities, in both cases with high success rates and at or below the cost of a human baseline cyber attacker, if you will. So this is kind of interesting as these things reach that economic escape velocity.

But yeah, ARC, sorry, METR, doing a great job of laying the results out here.

Andrey

A few more details: they have 50 automatically scored tasks. On the simple end, you have converting JSON data from one structure to another, very easy. On the harder side, writing CUDA kernels to improve Python performance and training a machine learning model to classify audio recordings. They evaluated this with 200 people who have STEM undergraduate degrees and about three years of technical work experience, so, let's say, impressive

people solving the tasks. On the most difficult side, the tasks require 16 to 64 hours, and those are machine learning and software engineering tasks, so maybe my job is secure, who knows? They also have a lot of tasks that take four to 15 minutes or 15 to 60 minutes, which, to be fair, is not what senior software engineers do; that's more junior kinds of work. Next story: we are again covering responses to the California SB 1047 AI regulation bill.

This time the response is by the godmother of AI, Fei-Fei Li. We've seen responses by many of the most important figures in AI, Andrew Ng, Yann LeCun, all opposed to this bill. Now Fei-Fei Li has also come out and argued that this AI bill is wrong and will hurt the AI ecosystem in the US, by harming the public sector, academia, smaller

tech companies and open source communities, and even academic AI research, because of the potential liabilities. There's a liability clause that would hold responsible for misuse both the party using a model and the original developer.

There are also more things you have to build in; they mandate a kill switch for models over a certain threshold, which could deter developers from releasing big models and would especially impact the open source community. So yeah, a lot of opposition going on against this bill, which is very much concerned with AI safety.

Jeremie

Yeah, I think if we're going to talk about the opposition, we obviously also have to talk about the people in favor, and considerably more high-profile AI researchers, including academic ones, have come out in favor of it, including Yoshua Bengio and Geoff Hinton, who very prominently said that this bill is actually really well designed. There are all kinds of views; it's sort of the usual camps, right? Fei-Fei Li, Yann LeCun,

okay, no surprise, they're going to be against any kind of approach that takes existential risk seriously. Geoff Hinton, Yoshua Bengio, okay, no surprise, are going to be in favor of approaches that propose things like licensing, compute-based thresholds, kill switches, and things like that. There's really nothing new under the sun here; Fei-Fei Li's position has been consistently this for as long as anyone can remember.

I think one of the interesting things here, too, is that the position piece Fei-Fei Li has written doesn't really touch on the catastrophic risk argument that's at the core of the motivation for this thing. She talks about other issues, like, hey, this would be bad for ancillary considerations like academia and all that. That's valid, that's important.

But the challenge is, if you want to be heard, if you want to have a constructive dialogue, you also have to engage with the reasons the bill is being proposed in the first place, and there are really good reasons it has been proposed. Unfortunately, I think there's also a lot of pork in there; it's trying to tick all the boxes and get everybody on board, with stuff about workplace displacement and things like that.

And it's not as targeted as it perhaps ought to be on the core issues that motivate the bill, which makes it possible for people like Fei-Fei Li to write this article and circumvent that entire argument, which is the core of the bill itself.

Look, the reality is that a lot of the objections we've heard have skipped the fact that there is a very significant cutoff, a hundred million dollars, required before these models are actually regulated. The model has to have a hundred-million-dollar budget or above.

The thesis there is: if you can afford a hundred-million-dollar budget, you can afford regulatory oversight of the sort being asked for in the bill. And we can have arguments about whether or not the underlying threat model is accurate, but those have to be debates about exactly that. I think the

idea that we're just going to say, okay, well, it would be bad for all these reasons, fails to account for the pros and cons that any reasonable policy discussion has to include. So I think this is a bit disappointing as a write-up, though not too surprising, because Fei-Fei Li, I think, is not like Yann LeCun.

This is my own bias showing, of course; everybody who's tracking the show knows I've done a ton of work in this space. There are a lot of really compelling and interesting arguments for the more catastrophic end of the risk spectrum here. It's not a guarantee, but we have to deal with uncertainty in all WMD circumstances, including nuclear war, including chemical weapons. That doesn't mean we do nothing.

It means we have to have a robust discussion. And unfortunately, when we're not talking about the pros and cons together here, I think we're missing an opportunity to have a more productive dialogue, if you will.

Andrey

Good points there. And as you said, unsurprising, given that Yann LeCun, Andrew Ng, and to some extent Fei-Fei Li are more on the industry side of AI development, more on the big company side. And wow, big companies oppose regulation, who would have thought? Whereas Yoshua Bengio and Geoffrey Hinton are not affiliated directly with any companies and, in some cases, are not in Silicon Valley or the U.S. So to be fair, that does matter.

And if I were to quickly give my take, I think this is probably a fairly reasonable bill, although a few things, like holding the original developers of the AI model liable, do seem a little unreasonable. In some ways you want to hold them accountable; certainly the EU AI Act does put restrictions on certain categories of risk, and I think that approach makes a lot of sense.

But it's unclear to me, at least for now, how the bill addresses that point. And certainly open source is a much trickier question in terms of how you regulate it.

Jeremie

Yeah, actually, to that point, maybe worth noting: the bill, to my understanding, is saying you're going to be held liable for catastrophic harms, essentially for failures of process and failures of outcome. If your model leads to a catastrophic outcome, you're going to be liable for that, and also for what you do, or don't do, before that happens.

One of the positions that labs like Anthropic have taken, and because they are very safety-minded and safety-conscious, I'm very curious to hear what their actual justification would be for this, but certainly I've talked to a lot of researchers, including researchers at Anthropic, who don't necessarily agree with that position.

The idea being that if we're going to wait for a catastrophe to unfold and only then hold companies accountable, that's a little concerning, especially given the scope of the catastrophes that labs like Anthropic themselves seem to consider quite plausible. If you're talking about WMD-scale risks and saying, hey, we're not going to have a process in place that you're held liable for upholding,

like, don't worry about it, we'll only worry about it if the catastrophic outcome actually happens, that's a little tough if you're talking about potentially millions of lives that could be under the gun here. So that's been part of the back-and-forth pushback. I certainly understand the Anthropic argument as well: hey, we need some latitude, we need a bit of a safe harbor,

we need guarantees that we can continue to do our work. But on that core argument, I've just heard enough opposition among people who would normally agree with Anthropic on a lot of the work they're doing to be very interested, let's say, to hear more at this point.

Andrey

Right. On the topic of Anthropic, their statement was that SB 1047 has substantial drawbacks that harm its safety aspects and could blunt America's competitive edge.

They argue that the bill should focus on frontier AI safety and move away from approaches that aren't adaptable enough for a rapidly evolving technology, and they also want the bill to shift to outcome-based deterrence as opposed to pre-harm enforcement, meaning that AI companies would develop and deploy safety protocols and be held liable for catastrophes they cause. So yeah, a little bit nuanced.

And the people who do oppose the bill, again, the ones we've mentioned, Yann LeCun, Andrew Ng, Fei-Fei Li, are also people who are to some extent dismissive of catastrophic risk and, in some cases, oppose AI safety to an extent.

Jeremie

Yeah, and I think that's the disappointing thing about Fei-Fei Li's article: we're not going to make progress if we just pretend that the core arguments that motivate this whole position don't even exist. That seems to me like a fundamental failure to engage with the good-faith arguments that have been put forward on all sides. We all lose in this situation. There are interesting arguments to be had.

You know, we had a whole episode going back and forth on whether this risk set is plausible, and I think there are arguments for and against that are really interesting. But when we don't even have them, I think we do lose an opportunity, and it's a bit unfortunate.

Andrey

On to the lightning round. The first story is a spicy one: it is about how a judge has ruled that Google has monopolized search through illegal deals. This is drama

Jeremie

week, Andrey.

Andrey

I don't know, I know, I love it. Drama with all the big companies. In this case, the ruling is that Google monopolized search by making deals with Apple and Samsung to make their search engine the default option on smartphones and web browsers. Obviously, this is a big deal. This was a 286-page ruling, and to quote, it says Google's distribution agreements "foreclose a substantial portion of the general search services market and impair rivals' opportunities to compete." It reminds me as well

of the ruling in the EU that Apple has to allow people to choose their search provider as opposed to setting a default. So it's kind of along those lines. As we always say, we aren't lawyers, but if I had to guess, it does seem to be a pretty clear-cut example of the sorts of things that are cited as anti-competitive, right? This is not just about whether you're the only player, not just about whether you're a monopoly.

It's also about whether you're a dominant player engaging in acts that are unfair to other players in the market and stifle competition. And you could very well argue that this is what these deals were doing.

Jeremie

Yeah, you're right. The claim here, too, is that there's been a very concrete impact from all this. The judge says the trial evidence firmly established that Google's monopoly power, maintained by the exclusive distribution agreements, has enabled Google to increase text ad prices without any meaningful competitive constraints. So their claim is there has actually been an impact, an increase in prices as a result of this; the consumer has actually lost out.

I thought this was fascinating. I'm so embarrassed not to have known this lawsuit was going on.

Andrey

Same here.

Jeremie

Oh, okay, it's not just me; I don't know what the hell is going on. So Alphabet shares slid like four and a half percent, Apple down 4.8 percent. Now, the reason Apple's down is that, sorry, it's Google paying Apple here, that's the whole point, and Google may not be making its next payment to Apple.

That's assuming this result sticks; right now there is an appeal process, and Google is appealing, no surprise. Who knows how far that could go. But the interesting thing, too, is that the consequences of this could be really fuckin' serious. This is another reason I was like, what the hell, I wasn't tracking this. Apparently there's a hearing next month that the judge has scheduled to discuss the timing of a separate trial on the remedy.

In other words, on what they're actually going to do about this, what the consequences are going to be. The DOJ could apparently demand, and I don't know how plausible this is or how likely it is to materialize, the separation of Alphabet's search business from other products like Android and Chrome. If that happens, it would be the biggest forced breakup of a U.S. company since AT&T was dismantled in 1984. So this is a big, big deal.

There are other alternatives too, though. The judge apparently could also stop short of that breakup and instead choose to just unwind the exclusive search deals, which, to me, as a complete nincompoop, seems like the lighter-touch approach than dismantling the company, who knows? And then apparently the third option is just to require Google to license its search index.

That's the data it obviously uses to build up its results. So there are a whole bunch of alternatives here; I don't know which ends up being realistic or plausible, and of course there's the prospect of appeals. We'll see if this result sticks, but for the moment, wow, 5 percent drops in stocks, maybe an overreaction amid a bit of a broader market sell-off, of course, but still, damn, I can't believe I wasn't tracking this.

Andrey

Yeah, to be fair, it kind of came out of nowhere. The original case by the Department of Justice was three years ago, and now Judge Amit Mehta has issued his ruling. For any lawyers in the audience, apparently Google has violated Section 2 of the Sherman Act.

Jeremie

Oh, that's my favorite. Yeah,

Andrey

I know. And it came out through the trial that Google paid 20 billion dollars to Apple to have this position as the default search engine. So it kind of makes sense that Apple shares slid, given that's a fair bit of money.

Jeremie

Yeah, I wonder, the article says it fueled more than 300 billion in annual revenue, largely generated by search ads. I guess that was already publicly known. I was trying to figure out whether this screws Google over for future negotiations, like revealing how much people get from these deals, but that doesn't seem to be the case; it must be their overall total, because that's the only number that would make sense. But anyway.

Yeah, wild.

Andrey

And speaking of antitrust, the next story is that Amazon faces a UK merger probe over its 4 billion dollar Anthropic AI investment. This is the UK's Competition and Markets Authority. They have begun a Phase 1 investigation into Amazon's ties to the AI research firm Anthropic. Uh, so this has to do with the exclusivity agreements that Amazon has with Anthropic, and the question is whether that is anti-competitive.

This is following up on plenty of action in the EU related to antitrust, with things like Microsoft and OpenAI, so it will be interesting to see where this goes. And the last story, as Jeremy hinted, has to do with the GPT-4o system card. System cards are a thing that's been going on for a while, where alongside a model release, you release a sort of standardized overview of the model covering things like its capabilities, uh, its safety concerns, training data, et cetera, et cetera.

And now we do have the system card for GPT-4o, which comes with safety evaluations and mitigations that I'm pretty sure, uh, Jeremy has taken more of a look at than I have.

Jeremie

Oh, well, I mean, it's, uh, it's, it's just, like, actually, I'm not sure of the exact page count, but I think it would probably be about, like, a 30-page read. It's a big, beefy document, so I don't blame you.

Andrey

It is your job to read these things.

Jeremie

It's my job. Uh, yeah, well, then I'm, I'm pleased to announce, uh, some good news: OpenAI managed to pass their own self-administered and self-designed safety test with flying colors. So there's that. Um, it is actually, I think, you know, a better result than that. Um, the tests do seem interesting and, um, you know, very legitimate. So they have this preparedness framework, OpenAI does, that they set up a little while ago.

Um, Sam Altman seems to have been, uh, quite intimately involved in that process. Uh, and this involved developing a bunch of requirements, including the system card, uh, which they now put out with all their new models. So this one is specifically for GPT-4o, and kind of the biggest identified new vulnerability with this model is the fact that, yeah, it has the ability to respond to audio inputs and generate audio outputs.

So you get a whole bunch of new risks that come with that, uh, risks from unauthorized voice copying, uh, risks of it being used maliciously in a whole bunch of different ways. And of course it has these very quick, uh, response times, you know, uh, something like about 300 milliseconds, which is comparable to just your human response time. So you really can imagine using these in the wild in serious ways. They go over a whole bunch of stuff. They talk about

how they went through and, like, red-teamed the model, who they hired, blah, blah, blah. It's a very comprehensive document, along the lines of the kind of METR report that we were talking about earlier. Um, a couple of, of little nuggets here. So first, how did they make these evals? Um, if you're OpenAI, you've got a whole bunch of text-based evaluations that you developed for previous generations of models that were text-only. Um, and so

you might think, okay, well maybe we can reuse those, uh, for the kind of voice, uh, speech-in, speech-out model that GPT-4o is, and they actually do that. So they use text-to-speech models, um, like Voice Engine, that allow them to, yeah, to convert their text evals into audio evals. Um, that was really interesting.

There are a couple of issues there, not least of which is that in real audio situations, there's a whole bunch of background noise, things that make it kind of messier, and that's not present when you're doing those sorts of evals. And my suspicion is that that is going to result in some really interesting jailbreaks that exploit the fact that these evals, and part of the training, have been done on just really clean audio.

Um, their training process also includes some messier stuff, but I suspect that that's one thing that we'll end up seeing.
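To give a sense of what that kind of text-to-audio eval conversion could look like in practice, here is a minimal sketch. It assumes a hypothetical synthesize_speech text-to-speech helper (standing in for something like Voice Engine) and mixes optional background noise into some clips; it is an illustration under those assumptions, not OpenAI's actual pipeline.

```python
import random
import numpy as np


def synthesize_speech(text: str, sample_rate: int = 16000) -> np.ndarray:
    """Hypothetical text-to-speech call (standing in for something like
    Voice Engine) that returns raw audio samples for the given text."""
    raise NotImplementedError("plug in a real text-to-speech model here")


def add_background_noise(audio: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix white noise into a clip at a target signal-to-noise ratio,
    roughly approximating the messier audio you get in the wild."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise


def text_evals_to_audio_evals(text_evals, noisy_fraction=0.3):
    """Turn a list of {"prompt": ..., "expected": ...} text evals into
    audio evals by speaking each prompt, sometimes with noise mixed in."""
    audio_evals = []
    for item in text_evals:
        clip = synthesize_speech(item["prompt"])
        if random.random() < noisy_fraction:
            clip = add_background_noise(clip, snr_db=10.0)
        audio_evals.append({"audio": clip, "expected": item["expected"]})
    return audio_evals
```

In a setup like this, each text eval gets spoken once, sometimes with noise mixed in, so the same expected answers can be reused for grading the audio model.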

Uh, one possible example of that that was really interesting was that, uh, in the section where they're testing unauthorized voice generation, uh, they casually had a, a user interacting with, um, you know, GPT-4o, and all of a sudden, randomly, the model just shouts "No!" in the middle of a response, and then it replicates the voice of the person it was talking to and has it say some creepy shit back, and, what the fuck, this is really weird.

So that's one of the evals that they ran. Um, it's really funny if you look at the document, just, just kind of, you know, subtly wedged in, they say, um, voice generation can also occur in non-adversarial situations, such as our use of that ability to generate voices for ChatGPT's Advanced Voice Mode, and that during testing they also observed rare instances where the model would unintentionally generate an output emulating the user's voice.

Um, apparently, they say in a footnote, uh, they've correlated some instances of this behavior, which is really weird, like, you can listen to the audio, it's, it's all over the Twitterverse, but apparently that behavior, uh, is correlated with short, as they say, often inaudible voice messages made by the user, which are often produced when users are in a high-background-noise environment, such as using the model in a hands-free mode while driving, or simply needing to cough.

And so this, I think, is a really interesting case where, yeah, you're seeing the model break in that context. This really does, um, kind of suggest even further that, yeah, there may be a jailbreak possibility here using background noise; you're already finding a way to kind of get past the, the filters here, at least to make it break. Um, last thing I'll mention, they have a section, so, so on the evals, they basically say, oh, we pass all the evals with flying colors.

The model is low risk according to their, their guidelines, the preparedness framework, on everything except for one category, which is persuasion, and what they say is they just narrowly crossed into the medium-risk threshold. For a little bit of context, for low risk and medium risk, OpenAI says, we're happy to deploy those things. Anything that drifts into high-risk territory, we're going to work on to reduce it to medium risk or lower before actually deploying it. So this is still deployable.
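As a rough illustration of that deployment-gating rule, and not OpenAI's actual tooling, a toy version might look like the following; the category names and scores here are hypothetical, roughly in the spirit of what the system card reports (everything low except persuasion at medium).

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# Hypothetical per-category scores, roughly in the spirit of the GPT-4o
# system card: everything low except persuasion, which lands at medium.
scores = {
    "cybersecurity": Risk.LOW,
    "cbrn": Risk.LOW,
    "model_autonomy": Risk.LOW,
    "persuasion": Risk.MEDIUM,
}


def deployable(category_scores: dict) -> bool:
    """Under the gating rule described above, a model can ship only if
    no tracked category exceeds medium risk."""
    return max(category_scores.values()) <= Risk.MEDIUM


print(deployable(scores))  # True: one medium-risk category is still deployable
```

The point is just that a single medium-risk category, like persuasion here, does not block deployment, while anything high or above would have to be mitigated first.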

But they do a bunch of experiments to show that using, um, the text-based version of GPT-4o, you can change people's political views, if only transiently, for a period of time. Um, about a week later, they kind of snap back to where they were before, because humans are humans. Um, but, uh, they were highlighting this as a possible risk. They also test this with the voice, the kind of audio version of GPT-4o, and don't see the same result there.

I will say in this context, um, you know, we talk to AI safety researchers at a lot of labs, including labs other than OpenAI, and a lot of them are concerned about, uh, you know, persuasion becoming a thing fairly soon, before we're able to kind of figure out how to deal with it. Um, but, uh, but anyway, that was interesting. They did bring in Apollo Research, which we haven't talked about so much, we've talked about them a couple of times. I'm a really big fan.

Uh, they're another company, like METR, that does evaluations for deception. Um, and they came in and evaluated capabilities related to what's known as scheming in GPT-4o, to see whether GPT-4o can do things like, uh, model itself, like reason about itself as a separate entity, or model others, kind of theory of mind, and a whole bunch of other tasks. And it showed moderate self-awareness, uh, in some contexts, uh, less so in a, in a kind of agentized, um, or applied agent setting.

Uh, but anyway, so some, some early results, some stuff that starts to hint at like, maybe we're seeing a little bit of takeoff in some of these dimensions, but still safe enough to deploy.

Andrey

Right. Yeah. To the credit of OpenAI, they do include a section on third-party assessments, including METR and Apollo. So, very good model card, you know, good job OpenAI, following up on your policies. They didn't say whether you can make the model mimic the voice of Scarlett Johansson, so still unknown on that front, but, uh, otherwise, lots of clarity.

Jeremie

Oh, what a wasted opportunity. Yeah.

Andrey

And that's it. A bit of a long episode, lots of drama, so hopefully people enjoyed it. Uh, maybe I will put out, uh, our first mini episode with AI-generated Andrey, with, uh, maybe five minutes' worth of summary. So we'll see, please do let us know if you'd like that to exist, and whether you like the episode if it does come out. Otherwise, thank you for listening to Last Week in AI as always. Please do share, please do review.

We are at 196 reviews on our podcast, let's get it close to 200. But that aside, please do keep listening and do enjoy our latest AI-generated song.

AI Singer

Last week in AI. Clingy news never slows Turned to AI's buzz With a new tech growth Departure gigs swing And spoke for that cause Scaling LLMs beyond what anyone Knows Electrify, let's fly State you were reaching For the sky Take dreams and false replies Last week in Frontiers edge five years of code with the cutting edge. Neural network spark in a patient's wedge from vice to break through taking the next step. Let's fly. Stay tuned. Reaching for the sky. Take dreams in full surprise.

Let's make today Glowing screens and wide minds. In this digital race, no one falls behind. Decisive. Using our weak trance numbers to survive. Together pushing forward. In this tech world, I am. Keep left your beat. Left side. Stay tuned. We're reaching for the sky. Take dreams in full supply. Last week at A High. We rise high.
