#219 - GPT 5, Opus 4.1, OpenAI's Open Source, Astrocade | Last Week in AI podcast

⁠¶ Intro / Banter

00:11

Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news. Also the week before last, since we did skip a week, though it was perhaps a good week to skip. You can check out the episode description for the list of all the articles we discuss and the timestamps. I am one of your regular hosts, Andrey Kurenkov.

00:40

I studied AI in grad school and now work at a generative AI startup. And I'm your other regular host, Jeremie Harris. I'm the co-founder of Gladstone AI, where we do a bunch of AI national security stuff. And we were talking about this last week: it felt like a real nothingburger of a week, and for various reasons we ended up not recording the podcast. We were kinda like, ah, people aren't really gonna miss anything.

01:06

There was the odd thing, and we'll surface it this week. But then all of a sudden, boy: two open source models from OpenAI, GPT-5 drops, Claude Opus 4.1, and Gemini 2.5 Deep Think dropped. Just so many things. And obviously for every one of these drops there is the model card or the system card, there is the obligatory METR eval suite, there is all the stuff that we have to go through. So there is a lot to cover today.

01:36

Yeah, it feels like this has happened multiple times this year. All the major players like to do things around the same time, presumably to get everyone's attention and not be left behind. And this was definitely one of those weeks where we saw a cluster of major stories.

01:55

So yeah, we are definitely gonna talk about GPT-5 quite a bit, then Opus 4.1, and then on the business front there are actually some interesting stories: updates on revenue, on potential raises, and a whole bunch of stuff to discuss. One more quick thing before we get into the actual discussion of the stories. I wanna do a quick plug for the thing I'm working on, Astrocade. We recently rolled out a major update, and I'm gonna plug it with a link.

02:26

In the episode description there's the whole article, "Astrocade rolls out AI agent-powered game creation experience." We went all in on vibe coding, as we had to. Amazing. So if you wanna vibe code some games, check out the episode description to see what I am working on most of the time and why the podcast is sometimes late. Can we do a Last Week in AI game? Of course I can do it. We need to do a Last Week in AI game.

02:54

We gotta figure out what that even means, but yeah, I think I'm gonna do OpenAI versus Anthropic. It's like a fighting game where you pick the... okay. Ah, there we go. Yeah.

⁠¶ Tools & Apps

03:07

All righty. Well, with all that said, let's get into all the big stories of the week, starting in tools and apps. And of course, we're gonna begin with GPT-5. This just happened yesterday, in one of OpenAI's big livestreams. Pretty much everyone knew this was gonna happen for most of the past week. And what we saw was, I would say, an interesting development, in the sense that, first, OpenAI deprecated all their other models. GPT-5 is the only model for OpenAI.

03:40

Now, what GPT-5 seems to be is all their models combined into one. So if you're a user on chatgpt.com and you enter a query, there's a router that takes your query and sends it either to a complicated reasoning model or to a simpler model, kind of like o3 versus GPT-4o. And as you might expect for this kind of thing, there were announcements of various improvements on benchmarks: on SWE-bench Verified, on GPQA Diamond, these kinds of things.
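As a rough illustration of the routing idea described here, a minimal Python sketch follows. Everything in it is hypothetical: the backend names `fast-model` and `reasoning-model`, the keyword hints, and the length threshold are assumptions made up for the example, and nothing in the episode says OpenAI's router works this way.

```python
from dataclasses import dataclass

# Hypothetical backend names; these are NOT real model identifiers.
FAST_MODEL = "fast-model"
REASONING_MODEL = "reasoning-model"

# Assumed hints that a query would benefit from extended reasoning.
REASONING_HINTS = ("prove", "step by step", "debug", "derive", "think hard")


@dataclass
class Route:
    model: str   # which backend the query is sent to
    reason: str  # why the router chose it


def route_query(query: str, long_query_chars: int = 400) -> Route:
    """Pick a backend using a toy heuristic (keywords, then length)."""
    text = query.lower()
    for hint in REASONING_HINTS:
        if hint in text:
            return Route(REASONING_MODEL, f"matched hint: {hint!r}")
    if len(query) > long_query_chars:
        return Route(REASONING_MODEL, "long query")
    return Route(FAST_MODEL, "default")


if __name__ == "__main__":
    for q in ("What's the capital of France?", "Please debug this function"):
        print(q, "->", route_query(q))
```

A production router would presumably be a learned model rather than a keyword list; the point of the sketch is only that the user never picks a model and the system decides per query.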

04:16

To sum up what I think is the general impression on the empirical front, the vibe check: GPT-5 is an all-around good model. It's an improvement. It's up there with all the other leading models, Gemini 2.5 Pro, Claude 4, and so on. It's not a huge, huge leap. It appears to be, at least in part, more of a product update and an infrastructure update for OpenAI than necessarily a whole new model. The knowledge cutoff is September of 2024.

04:51

So there's a lot of reasons to think this is kind of a mix of training and just infrastructure and development. That's my take. But Jeremie, feel free to say what you took away.

05:05

I mean, I guess the one modification there, a slight thing, would be that they did release updated versions of, not the base models, but all the feeder models, the models that get routed to. So this is both a router, as you said, that takes your query and routes it, and also an upgrade to the downstream models that do the computations, that do the generation. So in that sense, yeah, what is it really? It certainly is a system more than any one model.

05:30

They say that they plan, in quotes, "in the near future" to integrate all these capabilities into a single model. This is a vision that Sam has expressed for a long time, right? This idea that it feels silly that people are using a model selector, that choosing between different models isn't the most natural way to interact with an LLM.

05:47

The more natural way is just to pose your question and have the system decide which submodel handles it, or have one model that decides how much effort to invest in each query. If he's right with that play, then the bit of bumping and bruising that's happening right now may not matter much. There are people complaining about the fact that they want to use certain models: what happened to GPT-4o?

06:11

There's this whole hashtag save GPT-4o mini-trend on Twitter, all this stuff. That's probably gonna be viewed, if Sam is right, as a sort of ships-passing-in-the-night thing. People aren't gonna think about it. This sets OpenAI up in an interesting position to experiment earlier than other labs with that user experience component of, you know, what if we have a single model to rule them all, or a single interface. So that's interesting.

06:36

But we don't know how that plays out, right? It may just be the case that people want that level of control, that they wanna be able to select their models, and that this is a persistent thing. Who knows? The other piece is some of the evals. The hallucination rate on this system really is a lot lower than what we've seen with previous models: a sixth of o3's hallucination rate, and a fifth of o3's error rate in thinking mode.

07:01

So a lot of effort has seemingly been put into making the model's outputs more reliable, more truthful, and so on, which is useful. This is all part of a new direction in, not quite alignment, but kind of safety fine-tuning that OpenAI is taking, where they're now centering their refusal mechanism on the assistant's output rather than on some kind of binary classification of the user's intent.

07:27

So in the past you would write a query to ChatGPT, and there'd be a classifier that looks at your query and decides: is this a safe request, or is this somebody asking, you know, how do I bury a dead body, how do I make a bomb, that type of thing?

07:39

And that classifier's output would feed into the response that you got. What they're doing instead here is saying, okay, let's take whatever request the user makes and focus on the output that the model's gonna generate in response, and modify that to take out anything that seems dangerous. So what you tend to see as a result is less of a knee-jerk reflex.

08:06

Like, no, I won't answer that question. And more of the model trying to answer as much as it can while obfuscating the narrow bits that it considers to be dangerous. There's a whole paper on this. I'll just tell you guys, this paper dropped yesterday as part of the flurry of releases. I have read all of like two pages of that paper, and that's where this explanation comes from. So expect us to cover that in more detail next week. But anyway, this is an interesting release.

08:32

Last thing I will say is there is a METR eval suite that came out, of course, as they do with all the frontier model releases. A couple notes here. METR received access to GPT-5 four weeks prior to its release. That is an interesting improvement over previous rounds of METR evals, which, if I recall, involved having like a week or two of access, and there was some complaining about that. So that seems to have been fixed or addressed.

08:56

And so the key number when you think about the METR evals is this 50% time horizon metric, right? This is answering the question: for what length of task, measured by the time it would take a human to perform it, does this model hit a 50% success rate? So think of a whole bunch of different tasks. Some of them take humans five minutes to complete. Some of them take humans five hours to complete. What is the length of task such that the model has a 50-50 chance of succeeding at it? Right?
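One way to make the 50% time horizon concrete is to assume success probability follows a logistic curve in the log of the human task time, and then solve for the task length where that curve crosses 0.5. The Python sketch below does only that last step. The logistic form and the example coefficients are assumptions for illustration; the episode does not describe how METR actually fits its curves.

```python
import math


def success_probability(task_minutes: float, intercept: float, slope: float) -> float:
    """Assumed model: p = sigmoid(intercept + slope * log2(task_minutes))."""
    z = intercept + slope * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def fifty_percent_horizon(intercept: float, slope: float) -> float:
    """Task length (minutes) at which the assumed curve gives p = 0.5.

    p = 0.5 exactly when z = 0, i.e. log2(t) = -intercept / slope.
    The slope must be negative: longer tasks are harder.
    """
    if slope >= 0:
        raise ValueError("slope must be negative for a finite horizon")
    return 2.0 ** (-intercept / slope)


if __name__ == "__main__":
    # Made-up coefficients, chosen only so the numbers are easy to follow.
    a, b = 7.0, -1.0
    horizon = fifty_percent_horizon(a, b)
    print(f"50% horizon: {horizon:.0f} minutes")
    for minutes in (5, 60, 300):
        print(minutes, "min ->", round(success_probability(minutes, a, b), 3))
```

With these made-up coefficients, five-minute tasks succeed almost always and five-hour tasks mostly fail, which is the shape the hosts describe.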

09:27

And so with previous models we'd seen them hit, you know, an hour and 45 minutes, that type of thing. Well, this one is two hours and 17 minutes. And the important thing to note here is that this really starts to suggest that there is a new trend forming. Historically, what METR did was look at all the model releases for all the frontier models, starting in 2019 all the way up to the present day.

09:50

And then they plotted a line of best fit, and that line of best fit told them, hey, we are basically doubling the length of the tasks that these models can complete autonomously every seven months. And actually, this was something we talked about on Rogan. This is kinda the big trend.

10:05

What we also talked about on Rogan at the time, I think one of the Claude models had just been released, was that we were just getting a sense that, hey, if you actually look at the plot, there's kind of a steeper slope starting to appear in just the most recent three or four models. Well, we now have like five more data points, and it is really starting to look like that line is real. That new line has a four-month doubling time.

10:30

And so if you extrapolate that out, you're looking at hitting a month-long fully automated task sometime in 2028-ish. Depending on which slope you choose, right? The seven-month one gives you 2031. And if you think of a month-long task as being kind of AGI-ish, that should affect your thinking about timelines.
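The extrapolation here is plain exponential arithmetic: count how many doublings separate the current horizon from the target, then multiply by the doubling time. A minimal Python sketch follows, using the two hours and 17 minutes and the four- and seven-month doubling times quoted in the episode. What counts as a "month-long task" is not specified by the hosts; the sketch assumes 30 days of 24 hours, and the result is quite sensitive to that choice (a month of working hours would give a much earlier date).

```python
import math


def months_until(target_hours: float, current_hours: float, doubling_months: float) -> float:
    """Months until the horizon reaches target_hours under steady doubling."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months


# Figure quoted in the episode: a 2 h 17 min time horizon.
CURRENT_HOURS = 2 + 17 / 60

# Assumption (not from the episode): a "month-long task" is 30 days x 24 hours.
MONTH_HOURS = 30 * 24

if __name__ == "__main__":
    for doubling in (4, 7):
        months = months_until(MONTH_HOURS, CURRENT_HOURS, doubling)
        print(f"{doubling}-month doubling: about {months / 12:.1f} years out")
```

The gap between the two slopes grows with every doubling, which is why the choice of trend line moves the projected date by years rather than months.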

10:50

So there's a lot of ambiguity here, but we're starting to get a little bit of clarity on the genuine acceleration that does seem to be happening on the scaling side. Yeah, I don't know. Personally, I think the methodology has a lot of flaws, but we are still seeing investments, so that's fair to say. A couple more notes about GPT-5, if you're an API user. OpenAI is of course still trying to fight for enterprise.

11:18

For API users, they have three variants of GPT-5: GPT-5, kind of the normal one, GPT-5 mini, and GPT-5 nano, similar to how Claude has Opus, Sonnet, and Haiku, with different sizes, different costs, different speeds, et cetera. Currently they are undercutting Anthropic on price by a significant margin. If you look at the input and output costs, the output cost per million tokens is about two thirds of what Claude Sonnet's is.

11:54

Some other technical details here: the input context window is larger than before for OpenAI, at 400,000 tokens. That's still not quite as high as Gemini at 1 million, so there are other models that are leading in that regard, but hopefully OpenAI is going to push more on that front. The max output is 128,000 tokens, which is quite a bit larger than GPT-4o, for instance.

12:29

And lemme just check here. It's also larger than GPT-4.1, although GPT-4.1 could take in a million tokens. So it's an interesting set of parameters being chosen here as far as the usage profile goes. And yeah, I definitely do think there is some credibility to the idea that this is both an update on the technical front, in terms of training and the weights and so on, and kind of more of a product update.

13:01

They had o3, GPT-4o, GPT-4.1, o1-mini. I was getting confused at this point, right? It was insane. So for OpenAI to go and just say, forget all of it, it's just GPT-5 now, don't worry about choosing the model, we'll just route it for you, seems like it was kind of necessary, and framing it as GPT-5 makes a lot of sense.

13:31

In fact, I think there was some sense of people being underwhelmed in the Twitter AI space. Sam Altman, as per usual, was hyping it up quite a bit; he posted a Death Star image. The reactions I'm seeing today are, you know, people saying, okay, this is a good model, it's an improvement in some ways, it's not a gigantic leap forward. Basically everyone agrees on that. And so, yeah, there's kind of a mix, with some people saying this is a great model.

14:01

It just gets things done, you know, which seems to be the case. If you wanna use a model, GPT-5 might be the best for a lot of stuff, but it's not the kind of giant leap forward that you might have guessed it would be based on the bump from GPT-4 to GPT-5. Absolutely. And I think one of the metrics that really tells the story here... again, I do think, by the way, the METR evals piece is a legitimate thing. And METR will be the first to say this.

14:31

You know, all these things are flawed in various ways, but it is the single best thing we have, or let's say one of the critical tools we have, to evaluate human-like task completion for AI R&D. Another, though, is SWE-bench Verified. SWE-bench was originally, I think, an open source thing somebody put out, but SWE-bench Verified is something that OpenAI cleaned up. And on that, interestingly, Claude Opus 4.1 hit 74.5%. We'll talk about that later.

15:00

But GPT-5 hit 74.9, which, you know, where I come from is just called within noise. And there's a bunch of related evals that have been run by some labs. I'm just seeing right now on Twitter somebody from Princeton posting their own eval on SWE-bench, or SWE-bench Verified, and it looks like, in that case, Opus 4 is ahead of GPT-5. So they're kind of neck and neck, back and forth. The price difference is gonna make a big impact for sure.
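The "within noise" remark can be checked with a back-of-the-envelope calculation: treat each benchmark task as an independent pass/fail trial and compare the score gap with the binomial standard error. The Python sketch below does that for the 74.5% and 74.9% figures quoted in the episode. The task count of 500 is an assumption about the size of SWE-bench Verified and is not stated by the hosts, and the independence assumption is a simplification, since both models are scored on the same tasks.

```python
import math


def binomial_standard_error(accuracy: float, n_tasks: int) -> float:
    """Standard error of a pass rate measured on n independent tasks."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)


# Assumed benchmark size (not stated in the episode).
N_TASKS = 500

# Scores quoted in the episode.
OPUS_4_1 = 0.745
GPT_5 = 0.749

gap = GPT_5 - OPUS_4_1
standard_error = binomial_standard_error(GPT_5, N_TASKS)

if __name__ == "__main__":
    print(f"gap between the two scores: {gap * 100:.1f} points")
    print(f"standard error of one score: {standard_error * 100:.1f} points")
    print(f"gap in number of tasks: {gap * N_TASKS:.0f}")
```

Under these assumptions the gap corresponds to a couple of tasks and is several times smaller than the sampling error on either score, which is consistent with calling the two results indistinguishable.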

15:28

But it's sort of interesting, as you say: the Death Star, did it really materialize? Right now, as the dust starts to settle, we'll see more, but the vibe check does seem to be, as you say, solid model. Is it a Death Star? Maybe not. And it's an interesting question for the future of scaling, if you will.

15:47

The narrow capabilities that matter, I will say, are AI R&D. For the superintelligence trajectory stuff, that's where the METR evals look interesting. Hard to know. We'll have to wait to see where the dust settles. Last thing I'll mention is there's been a whole bunch of updates to the usage profile of this. One of them is what you can do with a free account on ChatGPT. Now you can send up to 10 messages every five hours.

16:18

And a free-tier user has one GPT-5 Thinking message per day. If you pay for the $20 per month plan, ChatGPT Plus, you can send up to 80 messages every three hours, and then chats will automatically switch to a mini version until the limit resets. I don't know exactly how this compares to before, but it seems lower, and I do wonder if, having reached 700 million users per month or whatever it is, OpenAI is trying to start burning less money on the free usage they allow people to have.

16:55

'Cause literally every free query is money being burned, right? Well, gonna move on now. Let's talk about Opus 4.1 from Anthropic. This happened just a couple days before GPT-5. They released this update, and as you might imagine, going from Opus 4 to Opus 4.1, there was a bit less fanfare. What we saw was, you know, a decent improvement on some benchmarks, but not a giant improvement.

17:28

In fact, some people were having a laugh at the marketing at Anthropic, because the leading chart was like two bars, you know, the previous one and the highest one, 72% to 74%, and you could barely see that little jump of 2% because the y-axis was pretty large. Oh, we didn't talk about this on the OpenAI thing, the chart crime piece. Sorry. Yeah, we could get into it. Well, let's get into it for a little while. This was more of a fun detail, a funny detail.

18:00

I don't think it's very impactful. But if you were tuning into the livestream, which, by the way, I tuned into early, there were 60,000 people watching at the beginning. During the presentation there were some demos, there were some charts. A couple of the charts, people noticed, had some very questionable design decisions. For instance, there was at least one where OpenAI's GPT-5, marked in pink, was of course the highest bar, but the value itself wasn't the highest.

18:34

There was another bar that should have been higher. And there were actually a couple instances where it seemed like either an honest mistake or, as you say, a chart crime. By the way, I don't mean to imply intentionality behind the chart crime thing. I'd like to be clear: I think it's so obviously stupid for a lab to do that intentionally, 'cause obviously people are gonna see that and they're gonna jump on it. I think this was just an honest mistake, but I think it happened on multiple plots.

19:03

I think you're right, it was at least one, but I seem to remember a couple where it's just like, what's going on? Did the marketing team start with, okay, this is what the bars have to look like, and then the evals guys were like, here are the numbers, and it was, okay, just slap 'em on? Is that what happened? Anyway. Yeah, I think what people saw is that the charts seemed accurate in the blog post.

19:29

So it was probably more of a presentation thing. And yeah, it's pretty funny. There are some interns having a bad week. Some amusement on that one. So to bring it back to Opus 4.1, it was kind of the opposite. The folks on Twitter had some fun saying, like, the marketing team at Anthropic should get the opposite of a raise or something like that. But big-picture story, right? We're going from Opus 4 to 4.1. It's a minor version bump.

20:02

Corresponding to that, it's slightly better at coding, slightly better at tool use, at multi-file code refactoring, at real-world software engineering tasks. That's kind of about it. It's not a huge deal, obviously, but it still is a decent improvement. Pricing here doesn't change or anything. I guess the way to think about Opus 4.1 is that it solidly beats o3, and then it's competitive with GPT-5.

20:35

I think you just have to try both on your task and see which one works best. I don't think it's obvious at this point. Yeah. GPT-5, by the way, is also still using em dashes, which is another thing people on Twitter love to make jokes about. If you don't know, for some reason ChatGPT just loves to use em dashes. And in fact, that makes its output very easy to spot: if there's bolding and a lot of dashes, right, it's definitely chatbot output instead of a human-written thing.

21:06

Well, we've got OpenAI and Anthropic so far. Let's go on to Google. Actually about a week ago now, Google rolled out Gemini 2.5 Deep Think, which actually might be the biggest deal of all, by some metrics at least. So if you are a subscriber to their $250 per month Ultra subscription, you are now able to start using this Deep Think model, which is kind of the most advanced reasoning model you can get. This is the equivalent of something like

21:41

o1 or SuperGrok Heavy, you know, where you throw a ton of compute at a problem. This, they say, is the model that let them achieve an IMO win, which we did discuss previously. It also achieves state-of-the-art performance on Humanity's Last Exam. We are at 34% on Humanity's Last Exam, so, you know, getting up there; soon enough this exam will be beaten. All around, very impressive performance.

22:17

I think it's interesting to see this $250, $300 per month tier becoming more regular, along with this paradigm of seemingly just test-time scaling to the max, probably running multiple instances of the model in parallel and comparing and combining their outputs. We know SuperGrok Heavy does that. From the little we know of how the IMO accomplishment happened with this model, it seems like that's also the case here.

22:52

Yeah, pretty remarkable, and also pretty remarkable how consistent progress has been across the different teams. You know, back in the day of GPT-3, it seemed like OpenAI had this insurmountable eighteen-month lead or whatever on all the other labs, and now that's very much not the case. It's an interesting challenge economically: a race to the bottom on pricing.

23:12

There's a lot of competition. Whether it be Grok, ChatGPT, Claude, or Gemini, it's really unclear which model is best for your task right now. And that's a really challenging position for these labs to be in, because the margin that you can then charge is a lot more limited. Right. So I think this is gonna be one of the big questions as we eye the $100 billion, $500 billion infrastructure build-outs for these big data centers.

23:39

How long can that be sustained when you have not a single frontier lab but three or even four? We'll see what Meta does as well. But this is a structural challenge now for the frontier AI research world. And open source as well, which we'll talk about later, is sort of cannibalizing a lot of this too. So yeah, we'll have to see where things go, but it's definitely a multi-horse race. Mm-hmm. And we've got just one more story in this apps section.

24:08

We've spent quite a while covering the big news. So, on to Grok, another one of the big players. On that front, no new LLMs. What did roll out is Grok Imagine, which is the image and video generator on the platform. Previously there was an integration with Flux. Now there's this update that you can access if you are a SuperGrok or Premium+ X subscriber. The thing that got the headlines, and the headline we are covering here, is

24:43

"Grok Imagine, xAI's new AI image and video generator, lets you make NSFW content." There is a spicy mode in there that, at least, I don't know if this has been removed or what, lets you make porn, basically, with very few restrictions. There was another article on someone entering a prompt related to Taylor Swift. There is a history of Taylor Swift and generative content.

25:10

First of all, they easily generated media of Taylor Swift, and then, if you turned on the spicy version, it showed Taylor Swift doing some inappropriate things. So Grok is going full-on uncensored, as usual, I suppose, and very much against the grain of the other text-to-image providers, Google and so on, that definitely do not allow you to generate not-safe-for-work content. Yeah, it is what it is. It's kind of crazy if you actually look at examples of what this generates.

25:50

Yeah. The legal implications of this will be very interesting. I mean, it's bad enough, or complex enough, shall we say, with celebrities. But imagine non-famous people: you find AI-generated pornography of yourself just because you have a couple of photos online or whatever. That's seemingly the world that we're trending towards with this sort of thing. So where you draw that line is really interesting.

26:17

How you think of the limits of free speech, and how that intersects with the right of individuals not to have pornographic imagery of themselves made. Is there such a right? Man. Yeah. What an interesting decade we live in. Let's put it that way.

⁠¶ Applications & Business

26:36

And moving on from all the product updates, let's go on to applications and business with some business updates. First up, Meta and Microsoft stocks have risen on strong earnings reports and on their AI spending. This happened, I believe, the previous week. Shares of Meta rose by 11%, Microsoft by 4%. Both announced better-than-expected earnings, and both are continuing to invest in AI infrastructure.

27:10

So Meta revised its capital expenditure forecast for the year to between 66 and 72 billion, up from 64 billion before. Capital expenditures meaning, basically, how much they expect to pay for stuff, the stuff in question being mostly data centers and GPUs. Microsoft is estimating over 30 billion in capital expenditures, and there are some other details. The gist of it is that investors still seem to be on board with giant, giant investments in AI.

27:49

So with Mark Zuckerberg's recent flurry of hirings and basically promises of going all in on superintelligence, it seems to be paying off in terms of the sentiment with regards to Meta. And the nature of the superintelligence race is that the expected value sort of reflects that a lot of the value, the vast majority of the value, is in the future, when you hit superintelligence, if you're the first lab to do it. Right. So that's what is behind a lot of this.

28:21

It's not independent of current revenues, but it is a distinct incentive to invest in this stuff besides current revenues, which are also strong. So yeah, no huge surprise there overall. It's also the case that we're seeing these massive clusters go up in more energy-rich areas like the UAE and Saudi Arabia, for at least inference runs, we're told.

28:45

The amount of money that is being spent when you're talking about one gigawatt, let alone the five-gigawatt range, is pretty insane. You know, Satya Nadella, I think, famously said in that Davos interview, I'm good for my 80 billion a year of infrastructure investment. This was a reference to how Stargate seemed to be moving a little bit more aggressively on the spending side than Microsoft wanted, which was kind of part of that rift between OpenAI and Microsoft.

29:07

And it seems like, far from 80 billion, he's now looking at like a hundred, 120 billion of annualized spend on infrastructure. That's actually gone up, if anything, since then, which is pretty interesting. Speaking of Stargate, the next story is about OpenAI planning to establish Stargate Norway with a 230-megawatt data center. They are agreeing to be an anchor customer for this new data center.

29:36

They're apparently working with Nscale Global Holdings Ltd., a data center company that is going to actually build the facility with some other investors, with OpenAI being a customer. So I guess it's a big deal in the sense that there's not been too much investment in Europe for these kinds of data centers. It's mostly been the Middle East or the US. And it's another indication of how far OpenAI is willing to go with their Stargate endeavors.

30:10

And we were just talking about the one-gigawatt level as being the touchstone. Late 2026, early 2027 is when you'll start to see those gigawatt clusters really ramp up and power actual training runs. The Stargate Norway site is gonna be 230 megawatts of capacity, they say, with ambitions to expand by an additional 290 megawatts. So you're looking at over half a gigawatt once this is all online.
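The capacity figures quoted here are easy to sanity-check. The short Python sketch below adds the initial and expansion capacity, and also divides the initial capacity by the GPU count mentioned in the episode to get a rough all-in power budget per GPU. That division is an assumption: the episode does not say the 100,000 GPUs map onto the first 230 megawatts, and the budget would include cooling, networking, and other overhead, not just the chips.

```python
# Figures quoted in the episode.
INITIAL_MW = 230          # initial site capacity
EXPANSION_MW = 290        # planned additional capacity
GPU_COUNT = 100_000       # Nvidia GPUs targeted

# Total capacity once the expansion is online.
total_mw = INITIAL_MW + EXPANSION_MW
total_gw = total_mw / 1000

# Assumption (not from the episode): all 100,000 GPUs sit in the initial build.
# The result is an all-in facility budget per GPU, not the chip's own draw.
kw_per_gpu = INITIAL_MW * 1000 / GPU_COUNT

if __name__ == "__main__":
    print(f"total capacity: {total_mw} MW ({total_gw:.2f} GW)")
    print(f"all-in budget per GPU: {kw_per_gpu:.1f} kW")
```

The total lands just over half a gigawatt, matching what is said in the episode; the per-GPU figure is only as good as the assumption behind it.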

30:37

And they're looking at a hundred thousand, they say, Nvidia GPUs. Okay. Lemme see, 2026: they may mean Rubins, actually, at that point. Possibly Blackwells and Rubins. But anyway, this is a very, very large-scale thing. Norway is a cold country, so that helps. But a big focus here is gonna be the renewable tie-in, and I think that's part of the cost of doing business in Europe at least, all the focus on renewables.

31:00

So it will run entirely, they say, on renewable power, and it's expected to incorporate closed-loop, direct-to-chip liquid cooling. Okay, so closed loop means, very roughly, that you pump a liquid to your chips to take the heat away from them, 'cause heat is a big problem. The liquid gets heated up, and then there's a pump that guides that liquid back out.

31:23

And then one thing you could do is create a fine mist, a fine spray of that liquid, to cause it to cool down, and then have it collect and pump it back in. They're not doing that. They're keeping this closed loop, so basically not losing any of that liquid to evaporation or anything else, just keeping the circuit closed. That's all that means. Direct-to-chip liquid cooling is an absolute must.

31:46

At this point, for 2026, the amount of heat that's being put off by these chips is just so insane that air cooling doesn't get you the kind of cooling that you need. So that's the context there. Seems like a really big project. And as you said, all these projects have a big funder, a data center builder, sometimes a data center operator, that all come together to form joint ventures along with, you know, Stargate or OpenAI or whatever.

32:10

And they're doing that here. No surprise. Next up, going back to Anthropic. We have a new estimate of their revenue: the news is that revenue is nearing 5 billion annualized. And it seems that Anthropic is trying to raise another round. They are seemingly raising as much as $5 billion at a 170 billion valuation. And I think this is kind of an interesting trend.

32:47

All these AI companies, OpenAI especially, but now also Anthropic, are just continuously trying to raise money. This is not how it usually works. Usually there's a round, a major round, Series A, Series B. Then you work for a while, and usually a year or two later you get your next round. With AI, it's different. With AI, everything is supercharged.

33:14

The gap between rounds is several months, and in the case of Anthropic and OpenAI it's just perpetual fundraising, which is very different. Yeah. And this is where these companies are already, I don't wanna say scraping the bottom of the barrel, but they're looking for the last sources of highly liquid cash that can handle the scale of fundraise that they need. Right. You're asking, how do we raise, you know, like $30 billion?

33:45

How do we raise, you know, like an Airbnb's worth of company, or a Dropbox or whatever, right? That's what they're talking about. To do that, you're looking at sovereign wealth funds. There aren't that many other places that you can turn to. And then the question's gonna be, well, after that, where do we go? Right? And there's no real answer to that question as far as I know.

34:06

Apart from governments, like actual governments just coming in and giving, you know, a hundred billion dollars, which, hey, could happen. It's a pretty wild world we live in. This article is actually quite interesting. It follows a report from Menlo Ventures that laid out the competitive landscape on the enterprise side, the coding side, and a couple other aspects of the market. This is interesting.

34:28

So Anthropic apparently now holds 32% of the enterprise LLM market by usage. OpenAI is now in second place with 25%. This is a reversal. So in 2023, which, what, two years ago, OpenAI held 50% of the enterprise market share and Anthropic had 12%. So Anthropic's effectively reversed that in just two years, which is pretty remarkable. Google has also seen an increase in usage, but they're in third place here. Anthropic, by the way, does even better when it comes to coding.

34:58

42% of the enterprise market share there, versus 21% for OpenAI. So quite interesting, like Anthropic seems to have come out of nowhere with products that do great at the enterprise level. This is an incredible sales achievement, among other things; obviously the models have to be good enough to do this. But it does make you think about why OpenAI chose this moment to release their open source models. Right. What do open source models do for enterprise customers?

35:27

Well, you can run them on premises, which is something that a lot of enterprise customers want. I don't know if OpenAI is seeing the writing on the wall or if they're just like, hey, you know what, we're not doing as well on the enterprise side. We know consumers want a deployed app, like everything done for them, so they'll come to, you know, chatgpt.com, no problem. And we're leading there.

35:47

But the enterprises, they have the infrastructure, potentially, to serve their own models. So they like open source models, and so far they've had to turn to Chinese models, basically, or sort of lagging Meta models or Mistral models that are just behind. And so, you know, this is an interesting way to potentially kneecap the competition. Anyway, this report sort of provides a little bit of context for why perhaps that open source model drop happened.

36:11

Interesting report, kind of confirming that Anthropic's strategy basically has worked from the very beginning. In terms of business, the focus of Anthropic has been the enterprise customer. If you compare OpenAI and Anthropic: on Anthropic, you can't generate images, you can't talk to it. There's no advanced voice mode. There's no video generation. It's just kind of a chatbot with some basic image understanding.

36:40

And the focus is definitely more on real-world software engineering tasks. If you look also at enterprise spend, it was apparently $3.5 billion in 2024. By May 2025, that had gone up to $8.4 billion. So no surprise there. You can make a lot of money by dominating the enterprise market. You know, businesses are willing to spend the big bucks to get the best of the best. So, yeah, pretty impressive, as you said, that Anthropic was able to take the lead there.

37:19

And we just covered how Anthropic is nearing $5 billion in annualized revenue. OpenAI, in comparison, is apparently nearing $12 billion in annualized revenue. So this is doubling its revenue from the beginning of 2025. And also, I believe, almost doubling the active users. So, seemingly we are seeing 700 million weekly active users across both the consumer and enterprise products. So, wow. Right. Like, who doesn't know about ChatGPT at this point?

38:05

Like, ChatGPT is becoming the Google of LLMs. Which increasingly is the Google of Google. Yes, well, I think the Google of Google is still Google. That is true. It's really something that OpenAI, in terms of brand knowledge and, I think, consumer usage, still seems to be kind of the default. And as a result, this company is making over $10 billion a year and is still nowhere near profitable and is still looking for more money.

38:43

So, similar to Anthropic, OpenAI is still raising money. Turns out they have successfully raised $8.3 billion in a new funding round at a valuation of $300 billion. And this is part of their attempt to get a total of $40 billion in funding by the end of the year, including investments from SoftBank. So OpenAI is still really pushing on the fundraising front and still succeeding, given a very, very aggressive kind of rise in usage and growth.

39:19

Yeah. Blackstone, by the way, is joining the cap table as part of that. There are a bunch of sort of carry-on investors who are participating as well. Fidelity Management, Founders... Okay, so if you want to know the who's who of the best investors in the Valley, this is a pretty good way to just, like, eh, let's see who participated in this round, right? So Founders Fund, which is Peter Thiel, Sequoia Capital, Andreessen Horowitz.

39:44

And then there's a bunch of other investors on the cap table as well that are, like, they're amazing, but they're not S-tier. So they're pulling together everything they can from your traditional VCs. And then we're also seeing, obviously, the SoftBank play, which is really footing the lion's share of the bill here, with like 30 billion or so already committed. And just one more story on the business front.

40:10

We've talked about nothing but the big players so far, so let's go to one up-and-coming player. We have a startup called Noma Security. They've raised $100 million in a Series B round. Their pitch is that they're focused on cybersecurity for AI, on AI and agent security. And this brings the total funding of the company to $132 million in under two years, having been founded in 2023.

40:46

So personally, I just think it's interesting timing, and interesting that they're able to get this much money. Recently ChatGPT also launched agent mode. And I do think, as you've discussed previously, there are a lot of potential vulnerabilities when you let agents surf the web and do stuff for you. And there's definitely space there for new cybersecurity players.

41:14

Yeah, I'm trying to look at this. The article has a bunch of references, for some reason, to Israeli companies, just in the page that I'm looking at. Incidentally, Israel actually is a leader in cybersecurity. Exactly, yeah. So that's why I was trying to figure out, are they an Israeli company? Because that would make all the sense in the world, right? Yeah. Okay. I'm guessing, Niv Braun and Alon Tron, so, yeah. Most likely. Oh, yeah, there you go. So, okay.

41:38

Sorry. Founded in 2023 by CEO Niv Braun and CTO Alon Tron, who met in the IDF's, okay, Unit 8200. Right. So, Unit 8200 is the intelligence unit, basically, for the IDF. And then they say the company came out of stealth in October 2024. And when they say stealth, they mean stealth. Anyway, so that's kind of interesting. Yeah, it makes all the sense in the world.

42:02

You see a lot of great cyber companies come out of Israel because they are so good at this. You know, you think about who did the Stuxnet thing? I want them doing my cybersecurity. That's kind of where that comes from.

⁠¶ Projects & Open Source

42:13

Now onto projects and open source. The other really major story of this past week: OpenAI releasing their first open-weight models since 2019. Long promised and now delivered. They have released gpt-oss-120b and gpt-oss-20b. So, two variants of the models, on a kind of lower-ish size scale, licensed permissively under Apache 2.0. So actually different from Llama, different from most open source models. No special kind of fine print on the usage allowed here.

42:57

The general sentiment, from what I've seen of the reactions to these models, is that they're quite good, except that they are definitely super, super, what people on Reddit are saying now, safety-maxxed. So this is very likely to refuse a request that even borderline might be doing something inappropriate. And I do wonder if part

43:24

of the reason for the delay of the release from earlier in the summer was OpenAI just optimizing the hell out of the alignment piece and the safety piece of these models to prevent anything embarrassing from happening. Yeah. Well, and one of the things they do emphasize in the release is this understanding that when you put something out in open source, people can fine-tune it.

43:48

And so, on the suite of evals that they ran, they've got a whole separate paper where they talk about the evaluations and their sort of philosophy on open source model releases and the safety and national security implications. This is something that we called for, I think, like two, two and a half years ago or something.

44:05

When we put out that report, the first one, we said you can't consider your eval work complete for an open source model unless you've tried to fine-tune it for weapons capabilities, because your adversaries will do that. Right? They'll take the base model and they will fine-tune it. And so it's not enough to just look at the model and say, ah, well, you know, it can't help you design bioweapons in and of itself, so everything's fine. Right?

44:28

The question is, sure, but if I specifically train it with data that I would expect adversaries to have, whether they're non-state actors or nation-state actors, what is then the risk profile? So, these models are text only, by the way. They are explicitly designed to be used within agentic workflows. So instruction following is a big focus. Web search, Python code execution, that sort of thing. They've got the ability to provide full chain of thought, obviously.

44:54

So this is an interesting thing, right? 'Cause OpenAI's proprietary models don't allow you to see the full chain of thought. They give you an edited version, but not the full thing. So this is one context where you actually can see that, because it is open source. Architecture-wise, we know it's an MoE model, obviously, 'cause we have the model itself. They say it's built upon the GPT-2 and GPT-3 architecture. Unclear if that means it is not built upon the GPT-4 architecture.

45:22

Interesting, right? That was not included in that list. So, you know, maybe a slight hint that GPT-4 is a different beast, and we're not necessarily getting a sense of the architectural elements that make GPT-4 tick in this release, but still quite interesting. Two different versions, as you mentioned: a 120 billion parameter and a 20 billion parameter version.

45:43

These are MoE models, so there are experts, of course: 128 experts in the 120 billion parameter version, 32 experts in the 20 billion parameter version. And in both cases, you're using the top four. So for any given token, any given inference run, only four out of those 32 or 128 experts are actually activated. And anyway, they go through a couple of different decisions that they made on the algorithmic side. They use banded window attention in alternate layers.

46:13

So what this basically means is it's kind of like a narrowed context window, where each token only gets to attend to, say, the 128 closest tokens to it. And this happens every other layer, so they alternate it with dense attention. That gives you kind of a wider aperture to make sure that there is global information flowing through the pipes as well.
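To make the alternating pattern concrete, here is a minimal sketch of the attention masks involved. It assumes causal attention, a tiny window of 3 instead of 128, and that even-numbered layers are the banded ones; those choices and the function names are illustrative, not taken from the release.

```python
def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len grid of booleans; True means query i may attend to key j.

    window=None gives ordinary dense causal attention.
    window=w restricts each token to itself and the w-1 tokens before it.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look at future tokens
            if window is not None:
                visible = visible and (i - j) < window  # banded: stay within the window
            row.append(visible)
        mask.append(row)
    return mask


def layer_masks(num_layers, seq_len, window):
    """Alternate banded and dense masks; which parity is banded is an assumption."""
    return [
        causal_mask(seq_len, window if layer % 2 == 0 else None)
        for layer in range(num_layers)
    ]


masks = layer_masks(num_layers=4, seq_len=6, window=3)
banded, dense = masks[0], masks[1]
# The last token sees 3 keys in a banded layer and all 6 in a dense layer.
print(sum(banded[5]), sum(dense[5]))
```

The banded layers do a fixed amount of work per token regardless of sequence length, while the dense layers keep a path open for information from far back in the context.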

46:35

But this kind of helps you balance the compute heaviness of having a full attention mechanism with the need to have that global information processing. They also use grouped-query attention, so nothing surprising there. I think you can check out, God, the DeepSeek... no, well, they did a variant of it. Anyway, you can check out our previous podcast on GQA. And we don't know that much about the training data, by the way.

46:57

"We trained the models," they say, "on a text-only dataset with trillions of tokens." Okay. Like, if there is a phrase that could give you less information about the quantity of data used here... of course it's in the trillions of tokens, right? That's just the cost of doing business when you're making frontier models or anything like this. The question is, how many trillions, and where specifically did it come from? They say, well, with a focus on STEM, coding, and general knowledge.

47:21

So kinda interesting that we don't know the specific number of tokens. Not sure why that's particularly sensitive. Last piece I'll just mention is on the compute side. So, they did this with 2.1 million H100-hours, and so, by my back-of-envelope math, depending on the precision that they used for training, and if you assume like 50% utilization, something like that, you're looking at around 10 to the 25 FLOPs.
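That back-of-envelope estimate can be roughly reproduced as follows. This is a sketch under stated assumptions: 2.1 million H100-hours (the figure consistent with an estimate near 10^25), 50% utilization, and approximate peak dense throughput for an H100 SXM of about 989 TFLOP/s in BF16 and about 1,979 TFLOP/s in FP8. The training precision is not known here, so both are shown.

```python
H100_HOURS = 2.1e6        # assumed GPU-hours for the training run
SECONDS_PER_HOUR = 3600
UTILIZATION = 0.5         # assumed fraction of peak throughput actually achieved

# Approximate peak dense throughput of an H100 SXM in FLOP/s (assumed values).
PEAK_FLOPS = {"bf16": 989e12, "fp8": 1979e12}


def training_flops(gpu_hours, peak_flops, utilization):
    """Total floating point operations = GPU-seconds x peak FLOP/s x utilization."""
    return gpu_hours * SECONDS_PER_HOUR * peak_flops * utilization


estimates = {
    precision: training_flops(H100_HOURS, peak, UTILIZATION)
    for precision, peak in PEAK_FLOPS.items()
}
for precision, flops in estimates.items():
    print(f"{precision}: {flops:.1e} FLOPs")
```

Under these assumptions the result lands between roughly 4 x 10^24 and 7.5 x 10^24 FLOPs, which is the same ballpark as the "around 10 to the 25" figure.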

47:51

So, it's, you know, not a true frontier amount of FLOPs, but that's a highly scaled training run, about an order of magnitude shy of what you see with, like, a Grok 4, for example. Yeah, to give you a sense. So pretty cool. They use deliberative alignment. Again, you can check out our podcast on deliberative alignment when we first covered it. This is, in a sense, OpenAI's answer to constitutional AI.

48:17

And they do say these models are a lot more vulnerable to prompt injection attacks than OpenAI's proprietary models. It's actually pretty wild. Prompt injection hijacking for these models is like a 22% attack success rate for the 120 billion parameter version, whereas o4-mini is like 8%. So, it's worth flagging that that is a vulnerability with this model. Just a couple more notes to add to that. So, on the MoE front, the 120 billion parameter model has a total of 128 experts.

48:52

That means that although it has many, like six times the number of weights compared to the 20 billion, it only has 5.1 billion active parameters per token. The 20 billion parameter model has just a fourth of the total experts, so it has 3.6 billion active parameters. What that means is these models are runnable on a single GPU, if you have the most expensive GPU around, if you have an H100, due to those experts and due to doing quantization that's pretty aggressive.
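The gap between total and active parameters comes from top-k expert routing. Here is a toy sketch of that idea, assuming a softmax over the four selected router scores; the toy experts, the made-up router scores, and the function names are all hypothetical and only illustrate the mechanism.

```python
import math


def top_k_route(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalise their scores."""
    ranked = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[e] for e in top)  # subtract the max for numerical stability
    exp_scores = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exp_scores)
    return [(e, s / total) for e, s in zip(top, exp_scores)]


def moe_layer(x, experts, router_logits, k=4):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(weight * experts[e](x) for e, weight in top_k_route(router_logits, k))


# Toy setup: 128 "experts", each of which just scales its input by a different factor.
NUM_EXPERTS = 128
experts = [lambda x, scale=i: scale * x for i in range(NUM_EXPERTS)]
# Hypothetical router scores for one token; a real router computes these from the token.
router_logits = [math.sin(i) for i in range(NUM_EXPERTS)]

chosen = top_k_route(router_logits)
print([e for e, _ in chosen], moe_layer(1.0, experts, router_logits))
```

Only 4 of the 128 expert functions are ever called for this token. The active parameter count is still larger than 4/128 of the total, since attention and embedding weights are used for every token.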

49:28

You can use this on arguably kind of consumer hardware, if you're a consumer with a lot of money. So, yeah, on the whole, a pretty useful model. I think there is a question mark as to whether this will be your model of choice at this point, or if you would go with Kimi K2 or DeepSeek for, you know, your fine-tuning or model usage needs. To me it's still unclear how those options stack up.

49:57

Probably it depends on the model size range where this is at, which is only like medium scale, right? Like with Kimi K2, we saw over a trillion parameters across many experts, I think something like 32 billion active parameters at a time. So, as you might expect, these models are pretty good on benchmarks. They are, you know, capable models, but not kind of the cutting edge of what you can get with open source models right now.

50:28

Yeah, I think one way to think of it is as an option for Western companies that don't wanna be running Chinese open source models, so that they have something that's not made in China. That's a pretty big deal if you're in open source, especially for agentic models, 'cause you don't know what behaviors have been trained in, especially now that the CCP is paying really close attention to this. But yeah. One other thing too, by the way.

50:54

Sorry, actually two quick things. One of which is they do say that they have explicitly not put optimization pressure on the chain of thought for either of these open source models.

51:06

So there was an open letter, or open paper, you can think of it as, that came out a couple weeks ago where all the big luminaries in the space were saying, guys, can we please, please, please, please not train the chain of thought to look nice and pretty, because we wanna know if our model is scheming or thinking of evil things.

51:24

And if we just sort of try to safety-wash it and make the ugly stuff go away, the model will just learn to think those thoughts anyway, but conceal them from us. And so this is OpenAI saying, hey, we're gonna walk the walk on this, and explicitly flagging this paper. It's an important bit of signaling they're doing to the ecosystem: we are not going to put optimization pressure on the chain of thought to make it look pretty, for that reason.

51:46

Last comment here is on quantization. You mentioned that quantization was a big part of it. The MXFP4 format is what they are using for this. That's what's allowed them to compress the model down to the point where the smaller one fits on one GPU. 4.25 bits per parameter is the number.
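One way to see where 4.25 comes from, assuming the usual MXFP4 layout of 32 four-bit values sharing a single 8-bit scale. The memory figures below are a simplification that treats every parameter as quantized this way and uses round parameter counts of 120 and 20 billion.

```python
BLOCK_SIZE = 32     # values that share one scale (assumed MXFP4 block size)
ELEMENT_BITS = 4    # each value is stored as a 4-bit float
SCALE_BITS = 8      # one shared scale per block


def bits_per_parameter(block_size, element_bits, scale_bits):
    """Average storage cost per value once the shared scale is amortised over the block."""
    return (block_size * element_bits + scale_bits) / block_size


def weight_gigabytes(num_params, bits):
    """Approximate weight storage in GB if every parameter used this many bits."""
    return num_params * bits / 8 / 1e9


bpp = bits_per_parameter(BLOCK_SIZE, ELEMENT_BITS, SCALE_BITS)
print(bpp)  # 4.25
for name, num_params in {"120B": 120e9, "20B": 20e9}.items():
    print(name, round(weight_gigabytes(num_params, bpp), 1), "GB")
```

Under that simplification the larger model needs on the order of 64 GB for weights and the smaller one around 11 GB, before counting activations or the KV cache.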

52:03

Basically this means four bits per parameter, and then you have a block of parameters that share one scaling factor that you use to get the rough order of magnitude right. And then your four bits do the rest. But yeah, so essentially an impressive model that fits right into the ecosystem. They are also open sourcing a framework called the Harmony tokenizer that I do expect we'll be seeing more of.

52:28

When you make an agentic model, you need new kinds of model self-reference abilities to refer to different kinds of workflow. Instead of just the model and the user talking to each other, there are times when the model needs to think, there are times when the model needs to use certain tools, and so they have a tokenizer that explicitly accounts for that, and they are open sourcing that as well.

52:49

So anyway, this is a whole bunch of shit that dropped at the same time in this paper, and we could go on for hours. And certainly OpenAI would. Yeah, whole bunch of details. Last thing to mention: as you said, these are trained from the get-go to be capable of reasoning. You can specify reasoning low, medium, or high in the prompt, and it's, I guess, explicitly optimized to be able to adjust those reasoning amounts.

53:17

Also trained to support tool use out of the box, and it comes with several tools like web search and Python execution. So yeah, all around, you can do a lot with this open source model. Onto a slightly more out-there open source release: we've got Falcon-H1, a family of hybrid-head language models redefining efficiency and performance.

53:44

So this is a hybrid architecture in the sense of combining transformer-based attention with state space models, which we've covered a decent amount over the years. I think there was definitely a time when state space models were all the rage. Quick recap: state space models are a recurrent alternative to transformers.

54:06

So you do a sort of loop and can theoretically keep going for as many inputs as you want, as opposed to transformers, which get all the input at once, kind of get a big chunk of it, and that's all they can do. You can't sort of keep feeding it data recursively. So it's been known for some time that going all in on the recurrent state space just doesn't seem to work as well as going with a hybrid architecture like you do here.
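The "loop" idea can be shown with the simplest possible case: a scalar linear state space recurrence, h_t = a*h_(t-1) + b*x_t with output y_t = c*h_t. This is a toy sketch with made-up coefficients, not the Mamba-style layer used in Falcon-H1, which has input-dependent parameters and a much larger state.

```python
def ssm_scan(inputs, a=0.9, b=1.0, c=1.0, state=0.0):
    """Scalar linear state space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    outputs = []
    for x in inputs:
        state = a * state + b * x  # fold the new input into a fixed-size state
        outputs.append(c * state)
    return outputs, state


# Process a sequence in two chunks, carrying only the state between them...
first, carried = ssm_scan([1.0, 0.0, 0.0])
more, _ = ssm_scan([0.0, 0.0], state=carried)
# ...and compare with processing the whole sequence in one go.
full, _ = ssm_scan([1.0, 0.0, 0.0, 0.0, 0.0])
print(first + more == full)
```

Because everything the model remembers lives in `state`, you can keep feeding inputs indefinitely at constant memory cost, which is the property being contrasted with attention over a fixed context.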

54:38

So in this family of models there's a whole bunch of variants, with 0.5 billion, 1.5 billion, 3 billion, 7 billion, and 34 billion parameters. Giant, giant paper, 70 pages, which we aren't gonna be able to go through. Bottom line: the 34 billion parameter model is decent, but they don't seem to be competing too much on performance. If you compare it to Qwen3-32B, Llama 3.3 70B, Llama 4 Scout, it has some decent eval results, but generally not state of the art. So, yeah, which is what we see, right?

55:21

With a lot of these kinda Mamba, state-space-model type things, right? They make a good proof of concept along some axes, and then we just haven't seen them become like a go-to production-grade model. What is interesting here is the scale. So it is cool that we're seeing Mamba get used at the 30 billion plus parameter level. That's cool. I agree. It seems so far more on the, I won't call it academic side, but it's interesting that they're pushing in this direction.

55:48

If Mamba ends up being useful, then they'll have a good lead, and this is what they have to do. Right? So, you know, the Falcon series comes out of TII, the Technology Innovation Institute in the UAE, and they're playing catch-up. So they need a strategy that allows them to kind of leapfrog. So I would imagine that's a big part of the reason why they're investing in Mamba.

56:09

It's kind of a good way to learn the traditional transformer engineering stuff that you need to just try to catch up, I guess, and also to place a bet on, maybe we leapfrog because we understand how to scale Mamba, or state space models, better than other labs. That's my rough guess there.

56:27

Yeah. And as with some other open source releases, a very detailed report, I guess 53 pages not counting the appendix, going into various details, including quite a bit of detail on the training data and some conclusions. So, you do get faster inference. They say up to eight times faster inference in long-context scenarios. And you're able to get better performance while using less training data. So it's more on the efficiency piece, as per the title of the report.

57:01

It also seems to be the case that you get more of a gain at lower scales. So the 1.5B-Deep model is competitive with 7 to 10 billion parameter models, it seems. And one of the big questions is, if you try to scale up the state space, kind of hybrid option to the scale of what Anthropic and ChatGPT are based on, is that gonna be better? Well, this is not clearly indicating either way, but it is showing that, you know, you get an easy advantage, so to speak. Onto the next story.

57:40

We've got Meta CLIP 2: A Worldwide Scaling Recipe. So this is about contrastive language-image pre-training, CLIP, which is a classic, so to speak, I think going back to 2021. You are able to input some text and some image and see how they compare, more or less the similarity match. And this is both a model and a bit of a scraper. So, per the title, a worldwide scaling recipe, what that's talking about is: can you train on all the languages all at once?

58:23

What they say is, if you try to train at a smaller scale, you actually get this curse of multilinguality. So you actually do worse if you train on both English and non-English data. As you scale, you are, so to speak, breaking the curse, so you're able to do better by utilizing all, you know, worldwide data, right? All the languages, without having to translate to English. And that's what they call this Meta CLIP 2 recipe.
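For listeners who haven't seen CLIP's objective written down, here is a minimal sketch of the "similarity match" and the contrastive loss built on it. The two-dimensional embeddings are made up, and the fixed temperature of 0.07 is an assumption for illustration; real CLIP models learn embeddings with image and text encoders and typically learn the temperature too.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every caption."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row r is column r."""
    losses = []
    for r, row in enumerate(logits):
        peak = max(row)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_sum - row[r])
    return sum(losses) / len(losses)


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: match images to captions and captions to images."""
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


images = [[1.0, 0.1], [0.0, 1.0]]     # made-up image embeddings
matched = [[0.9, 0.0], [0.1, 1.0]]    # captions aligned with their images
swapped = [matched[1], matched[0]]    # same captions, paired with the wrong images
print(clip_loss(images, matched) < clip_loss(images, swapped))
```

Nothing in the loss cares what language the caption is in, which is why the question in the paper is about data and scale rather than about changing the objective.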

59:00

Yeah, I am pretty surprised this paper makes no reference to the idea of positive transfer and negative transfer. It reads to me like they're trying to coin a phrase really hard, this idea of the curse of multilinguality. But we already have a term for what this is. This is just positive transfer, right?

59:19

So famously, when you train a model at small scale on, say, three different tasks, and you add another task, what you'll find is the model's performance on the first three tasks will drop. And the reason is the model's kinda overloaded. Now it's got another thing to handle, and it's like, ah, crap. But if you do this at really, really high scales, and with many different modalities, what you find eventually is that the model actually,

59:44

when you give it a new task, its performance on the others will go up. And the reason for that is that it's able to take the lessons that it's learned from that new task and apply them to the others. Sort of in the same way that, I guess, let me think. Well, like, I dunno, martial arts or something, right? Let's say you did Muay Thai and you did Greco-Roman wrestling. You're probably gonna be better at picking up Brazilian jiu-jitsu, just because you know how your body moves.

01:00:05

You know, you may have picked up a couple things about controlling someone's upper body from Muay Thai, like doing clinches, and then from wrestling, like, you know, kind of keeping them pinned, whatever. That's kind of the idea of what's happening here. You're just seeing it play out in languages. Like, what is speaking another language but just another task that you have to learn to perform? That's just what this is.

01:00:25

I hate to make it sound reductive, but I think coining a new term sometimes kind of muddies the water more, because if you didn't call this something other than transfer, it would invite you more to look at comparisons with, like, how does this work with non-language new tasks?

01:00:41

And that's actually almost like, it's not a missing plot point here, but it would be really interesting to see at least a little discussion of how this contrasts, because speaking different languages is more similar in task space than, you know, going from classifying images to controlling a robotic arm, but they're on a spectrum.

01:00:57

And it would be really interesting to see that explored a bit. Still a really cool paper, and awesome that this was done. More validation for the idea that scale just automatically gives you positive transfer over time. But yeah, that's just sort of a little nitpick on my end. One last open source story: BFL and Krea release FLUX.1 Krea, an open image model designed for realism. So, we've talked about Flux before.

01:01:27

This is basically the best text-to-image model out there, both open source and not. And the big deal here is that this builds upon the existing open source Flux variant with a focus on making it so the AI-generated images don't look like AI-generated images. It remains the case that often you're able to spot AI images just based on them having a certain aesthetic, which is kind of hard to pin down, but it's something to do with a softness, a blurriness, a smoothness of the image.

01:02:05

It is, like, surprisingly global. So Krea has a quite long blog post discussing the topic, talking about how they train it so that the images look like real photographs and not clearly AI outputs. And I think it's worth noting because, yeah, you might assume you can still spot an AI-generated image just based on these aesthetic trends. Don't assume that to be the case for too long.

⁠¶ Research & Advancements

01:02:33

Onto research and advancements. First story: Google's newest AI model acts like a satellite to track climate change. We've got another Alpha model from Google. This one is AlphaEarth Foundations. It is designed to track and analyze changes on Earth using, yeah, satellite data. So, a bunch of cameras floating up in space, looking down at Earth. And now with this model, users can access detailed information about any location on Earth. So you can use this to get at various things.

01:03:13

You can get color-coded maps that highlight material properties, vegetation, groundwater, human constructions, that let you understand ecosystem dynamics, air quality, sunlight, et cetera. And overall this model is intended to assist governments and corporations in deciding what to do based on an understanding of geography and climate and things like that.

01:03:39

It seems like a big part of it is just the compression that they apply to the, like, terabytes of data that exist about Earth's surface, and just getting those wrangled down to a manageable scale. So it seems like they're overlaying a lot of essentially different images of the Earth, about vegetation, about mineral content, about all the things.

01:04:01

I'm not a geologist, but, you know, all the things that might be relevant to geography and research on climate and things like that. So, yeah, I mean, there's limited information about how this is done in the article itself. It's more about the artifact that you get, which does seem pretty cool. In the flurry of things this week, I haven't had the chance to look at the actual paper. So this is maybe one to bookmark.

01:04:25

We probably don't have the capacity to go into full detail, but as you said, the gist of it is they're compressing a whole bunch of data, and the challenge there, right, is you get a ton of observations over time from different sensors. You get noise, you get clouds moving in, and whatever. So the model just takes all of that and makes it more easily usable. The next one, also a story from Google.

01:04:54

They have released Genie 3, an AI model capable of generating 3D environments in real time for user and AI agent interaction. We've discussed and covered the precursors to this, Genie 2 and other ones. This is another instance of a video generation model with real-time responsiveness to input. So essentially you can be an active agent within an AI-generated world. Some examples of this could be, you know, being a human character in a city environment like you would in GTA.

01:05:35

But this is entirely spit out by a neural network. There's no sort of game code or anything like that. No 3D models. It's all being rendered. And this one is kind of a big deal. If you look at example videos, it's crazy, the extent to which they're able to be consistent over time. The extent to which the world doesn't distort as you move around or look around, but instead stays consistent, is unprecedented.

01:06:04

It's actually quite impressive, the jump from previous models of this sort to Genie 3. We don't know much about it. They didn't release any sort of research. It's really just a showing-off kind of release. It's also not usable via demo or anything like that. But yeah, surprising to me the extent to which this real-time interactive model is a leap beyond what we've seen before. Yeah, so we covered Genie.

01:06:31

So that's Generative Interactive Environments, back in, I just went back to look it up before the podcast, I think it was February of 2024. So I recommend checking that out for a breakdown of the architecture, or at least the likely core of this architecture. It's sort of interesting how they bake in this latent action model that allows them to train in basically, you know, keyboard controls.

01:06:58

So you can, you know, move forward, move back, move to the side, accelerate, whatever. There's a limited number of actions that you can take in these videos to sort of, yeah, generate new frames based on your actions. It's pretty nuts.

01:07:11

And some of the examples too, depending on how you prompt it, like there was one of someone riding a bicycle on a mountain in India, and you could sort of see how, if you moved your head around, you would see the person's arms, or your arms, I guess. It sort of extrapolated the body and the mechanics of what's happening around. Persistence is a really big deal here, right?

01:07:32

So when Genie first launched, they had limited coherence times. As you can imagine, one example they show is they paint a blue line on a wall, and then they turn away from the wall for a couple seconds, and then they turn back and the blue line is still there, right? So that persistence, that coherence from frame to frame over long horizons, is now on the order of minutes. So the model will remember that that happened. By contrast, in the old case, it was, you know, seconds, right?

01:08:01

So you can quickly imagine how that makes reality sort of disintegrate fairly quickly. In this case, you actually have these long-coherence-time interactions. This is really important because it sets up really effective environments for training agents, right? The big issue with training agents, at least in the real world, is that the real world takes a lot of time. It's slow, everything is robotic.

01:08:23

If you move into more digital settings, then procedural generation of new environments is really tricky, especially with realistic environments. This gives you a way to do that, not for free, 'cause it costs compute, but it gives you a way to generate these environments artificially and digitally, really quickly. So you can imagine this being useful for training a lot of agents quickly. The other dimension: you do still have a limited action space.

01:08:49

So you can, you know, maybe have, I guess in the old version it was eight different actions that you could take that they were training the model to infer. Here you still have a limited action space, but you also can prompt the model in real time to change the environment. They call this promptable world events. So you could prompt in, you know, make an alien spacecraft land in front of me, right? And now you interact with it.

01:09:14

So it gives you another way to interact beyond the action space strategy. But they're not yet simulating other agents in the environment. They're struggling with creating clear and legible text in contexts other than the text box that you feed in. So unless you explicitly write the text in your prompt, the text that you see in the video that's generated will tend to be kind of garbled and weird. Anyway, so that's it. It's a wild step.

01:09:40

Visually, absolutely stunning. And this is certainly something that Google has been working on for a long time. Big advantage to them is that they own YouTube, right? So they're actually able to train on that with no issues; other labs can't. And if it turns out that this kind of environment generation is critical for training agents, that is a really interesting structural advantage for Google in the kind of AGI race.

01:10:02

And another advantage is, as you said, DeepMind has been working on this for a long time, setting up trainable environments with actual little video games, publishing research on this. I wouldn't be surprised if that combined with YouTube is what made this possible. And yeah, on the note of realism, they're doing 720p, so HD, quite sharp, at 24 frames per second. So compared to Genie 2, which was, you know, pretty blurry, pretty obviously AI, this looks good. Like the videos look good.

01:10:38

So very impressive. Still limited in the sense of how much physics we see. So, if you try to see object collisions, there's not a ton of that, but it's getting to a point where you could definitely see yourself having fun playing with these kinds of things. And next up we've got a paper, AlphaGo Moment for Model Architecture Discovery. So with the AlphaGo moment here, they make a bit of a major claim in this paper. So model architecture discovery is a pretty well studied topic, basically.

01:11:16

Usually we have just manually chosen how to build new networks, you know, how many layers to use, what kind of functions to build with, et cetera, et cetera. And instead you might consider just directly optimizing the architecture, doing a search over the space of possible neural network architectures to discover what is ideal. And there's been, you know, quite a bit of work in the field, but so far at least it seems that there's not been any sort of revolution as a result.

01:11:48

This paper claims to introduce a system that leads to a real kind of breakthrough in model architecture. So they term this their Move 37 in model design, in the sense that you get a novel discovery that is a real game changer. They get this cream aware router that introduces the query and summary router that leads to reducing compute while preserving some stuff.

01:12:21

I won't get into too much of the details of the discovery, but this came about because of this ASI-Arch framework where they have this essentially automated researcher. It's similar in some ways to what we've seen before with AI research systems. Like Sakana type stuff. Yeah. Yeah, exactly. There's a whole bunch of stuff going on. There's a component here of extracting stuff from scientific papers.

01:12:51

There's a researcher agent, an engineer agent, an analyst, and then they take this to explore a big tree of architectures, 1,773, in a sort of evolutionary process. And yeah, I've seen some disagreement as to how big of a deal this discovery is. At least some people think that it's a bit of an overstatement to make this seem like a big deal.

01:13:24

But regardless, I would imagine your take is... part of the thing that people are excited about with regards to AI advancements is self advancement. When are we gonna get to a point where AI can do research on its own to get better? This is at least an attempt at showing that AI can do that now. Yeah, I mean, you're absolutely right. This is why there's so much attention on automated AI R&D in this sort of meta sense. This paper?

01:13:51

Yeah. You know, we're highlighting it because it did the rounds. It was a viral paper, one of the viral papers of last week when there wasn't much to talk about. I mean, honestly, looking at it, there was one bit that caught my eye, like, I'm not sure that this is as impressive as it seems. I have, you know, short personal timelines on when AGI might be hit. But this does not scratch the itch for me personally. So this is a genetic algorithm type of philosophy, right?

01:14:19

That they're applying. So they take a bunch of innovations that their researcher and their engineer and all of that stuff put together. And then they'll take elements of them and kind of mix and match and then see what the fitness score is. Based on that, they'll decide which to breed in future generations. So if you're familiar with genetic algorithms, that's it. If you're not familiar, just know that the fitness function is sort of the core of this.

01:14:45

It, it's sort of the thing you're optimizing for in some sense. Here's what they say. Architectures with losses greater than 10% below baseline are considered to have information leakage and are immediately discarded. Okay? So in other words, if you come up with a new architecture and oh shit, it does like suspiciously better than baseline. In this case they're saying 10%. Then we go, okay, there must have been some information leakage and so let's get rid of it.

01:15:11

This is just buried somewhere in the paper. Whoa, whoa, whoa. If we're just like discarding architectures with losses that are 10% below baseline because you think they're due to leakage, how do you know there isn't a loss that's like at 8% or 9% below baseline that's also due to leakage.

01:15:29

And since their big results are reached by iteratively following the loss through many rounds, with each round potentially being subject to leakage, doesn't the whole thing potentially just become a giant leakage stack? Naively, that's what it seems to me their own assessment suggests. Like they're obviously not confident in how they're preventing leakage from happening, which is really hard in an experiment like this.

01:15:56

I'm not saying that it's 'cause they're incompetent, it's just that maybe they bit off more than they could chew. It's just really hard to do, and you probably ought to guess that this is due to kind of measurement artifacts rather than something real. We'll see, maybe I'll look like a schmuck. God knows it wouldn't be the first time. But that kind of was the thing that stuck out to me, like, well, wait a minute, why? Like why 10%? How do you really know?
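
To make the concern concrete, here is a minimal sketch of a genetic-algorithm loop with a hard "10% below baseline" discard rule like the one quoted above. Everything in it is an illustrative assumption rather than the paper's code: the baseline loss of 1.0, the toy `evaluate` function, and the names `is_suspected_leak` and `evolve` are all hypothetical.

```python
import random

BASELINE_LOSS = 1.00   # hypothetical baseline loss, chosen only for illustration
LEAKAGE_CUTOFF = 0.10  # the "10% below baseline" discard rule quoted above


def is_suspected_leak(loss, baseline=BASELINE_LOSS, cutoff=LEAKAGE_CUTOFF):
    """True if the loss is more than `cutoff` (as a fraction) below baseline."""
    return loss < baseline * (1.0 - cutoff)


def evaluate(genome):
    """Toy fitness: lower loss is better; each enabled 'gene' helps a little."""
    return BASELINE_LOSS - 0.02 * sum(genome) / len(genome)


def mutate(genome, rng):
    """Flip one randomly chosen gene."""
    child = list(genome)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def evolve(generations=20, population_size=8, genome_len=6, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(population_size)]
    for _ in range(generations):
        scored = [(evaluate(g), g) for g in population]
        # Discard anything that looks "too good", then breed the best survivors.
        kept = [(loss, g) for loss, g in scored if not is_suspected_leak(loss)]
        kept.sort(key=lambda pair: pair[0])
        parents = [g for _, g in kept[: population_size // 2]]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return min(evaluate(g) for g in population)


# A result 11% below baseline is thrown out, but one 9% below baseline is kept
# and can be bred from, which is the gap the hosts are pointing at.
print(is_suspected_leak(0.89), is_suspected_leak(0.91))
```

The point of the sketch is only that a fixed cutoff filters the obvious cases; anything just under the threshold still feeds the next generation.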

01:16:18

Yeah, exactly. You're not the only one to have pointed out this little brief detail. Information leakage presumably meaning, you know, some published research got through, and that's just what the model did instead of actually being novel. But yeah, compared to other evolutionary type techniques, I guess the key here is using LLMs to come up with the code and come up with the novel directions to pursue.

01:16:48

Yeah, no sort of huge breakthrough, but potentially this is another indication of the sort of direction people are taking these kinds of AI research, automated agents where you are trying to combine a whole system and potentially are gonna be able to kinda create a research team, so to speak, to make progress. Nothing huge so far, but you know, GPT-5 is here. Maybe once we get Claude 5 it'll just work.

01:17:22

Next we have METR evaluates Grok 4, another story that we are kind of catching up on from a week ago. So, this is on that time horizon evaluation, showing for any given model how long of a task it can reliably, successfully complete. And they evaluated Grok 4, added it to their kind of trajectory line. It's able to do slightly more than o3.

01:17:52

So the 50% time horizon, the time mark at which Grok 4 is capable of doing the thing that takes that long 50% of the time, is now about one hour and 50 minutes, which is up from o3 at about an hour and a half. So, this is, you know, another kind of leap in advancement, at least at the 50% time horizon. For the 80% time horizon, it actually doesn't make advancements beyond o3 and Opus 4. Oh, I think that's fascinating, right?
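
For anyone who wants the 50% versus 80% distinction in concrete terms, here is a rough sketch. It assumes success probability is modeled as a logistic curve in the log of task length, and the slope value is a made-up placeholder, not a measured number. The 110-minute figure is the roughly one hour and 50 minutes mentioned above; the function names are hypothetical.

```python
import math


def success_probability(task_minutes, h50_minutes, slope):
    """Logistic in log2(task length); equals 0.5 exactly at the 50% horizon."""
    x = slope * (math.log2(h50_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))


def horizon_at(p, h50_minutes, slope):
    """Task length at which the modeled success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return h50_minutes * 2.0 ** (-logit / slope)


H50 = 110.0   # about 1h50, the 50% horizon quoted above
SLOPE = 0.6   # hypothetical steepness; a flatter curve pushes the 80% horizon lower

h80 = horizon_at(0.8, H50, SLOPE)
print(f"80% horizon under these assumptions: {h80:.1f} minutes")
```

Under this kind of model the 80% horizon is always shorter than the 50% one, and how much shorter depends on the slope, which is why two models can rank differently on the two numbers.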

01:18:30

I mean, so I was thoroughly confused about this, and this is one of the reasons that I was so excited about the GPT-5 launch, like, we'll get another point on that METR graph. We'll be able to see, does this one look as good on the 80%? And that one does. So GPT-5 continues the trend through 80%. It's consistent with the sort of accelerated timeline with four month doubling time and all that jazz that we talked about.

01:18:53

But interestingly, Grok 4 doesn't, and that I think is worth... in a weird way, it's a shame that we didn't pause longer on Grok 4 before getting GPT-5, because I think it would've caused a really interesting conversation where people would go, so which one matters more? Do we care more about the 50% or the 80% replication probability? And I'm not sure what the right answer is there. But yeah, sorry.

01:19:17

There's like a whole podcast we could do about the distinction between those and why it matters. Bottom line is we don't have to do that now 'cause GPT-5 gave us our extra points. So there you go. Yeah, we're just continuing to update this chart. You can probably search up when we initially discussed this. It's an interesting effort at trying to quantify basically the real world impact or potential usage of these models.

01:19:42

GPT-5, they say, is able to do two hours and 20 minutes, something like that, at a 50% success rate. For 80%, it's a bit higher than the rest, 26 minutes. So as you said, now leading the pack, quite a bit higher than Grok 4 or o3 or Claude 4 Sonnet.

⁠¶ Policy & Safety

01:20:05

Moving on to policy and safety. First, we have a paper from OpenAI, Estimating Worst-Case Frontier Risks of Open-Weight LLMs. So they study the worst case risk of releasing gpt-oss, and they talk about particularly malicious fine tuning, where they attempt to elicit maximum capabilities by fine tuning gpt-oss to be as capable as possible in bad biology stuff and bad cybersecurity stuff. So, for bio risk, they are training in an RL environment with web browsing.

01:20:46

And then for cybersecurity, they're training in an agentic coding environment to solve capture-the-flag challenges. The gist is that gpt-oss may slightly increase capabilities, but doesn't substantially increase the risk factor. The decision to release the model is informed by the idea that compared to other open source models out there, this is not gonna significantly change what bad people might be able to do with it. And then I saw some rebuttal as well.

01:21:20

People saying basically like, if, if the theory of the case is that we're going to allow ourselves to release a model, as long as it only slightly increases bad actors' ability to do damage, then this creates a situation where all the frontier labs are incentivized to incrementally increase the capabilities, the weaponized capabilities of open source models. I mean, this is true. It's, this is a very difficult hair to split. Some might not see it as a hair, but it certainly is.

01:21:49

Like, you know, there is nuance here, part of which is that you have these open source Chinese models right now that are the only ones on the market as open source. Increasingly, those are agentic models, which means people will be deploying them in settings, potentially with dangerous behaviors baked into them that adversaries want to manifest in certain ways. So anyway, that's kind of all part of the debate back and forth there.

01:22:12

Yeah, it is, by the way, pretty impressive on these benchmarks, including, in particular, biorisk tacit knowledge and troubleshooting. It essentially beats Perplexity Deep Research, which is based on R1 plus browsing. It's pretty close to OpenAI o3 with anti-refusal and browsing. And by the way, there are two things that they try here when they do their malicious fine tuning, this MFT thing they're sort of coining here. One is:

01:22:40

You can disable refusals generally and just generally jailbreak a model. The second is domain specific capability maximization, or just basically fine tuning the model on the dangerous capability you want it to develop. And these are two different kinds of fine tuning that they have to account for, right? Get rid of the model's tendency to just decline to help you with bad things, and then separately make the model better at doing bad things.

01:23:02

So all part of this sort of framework that you see them trying to sharpen here as they work towards implementing what is ultimately in their preparedness framework, right? This is a step that they're tying to that. So there you go. Next, another safety related piece of research. This one is from Anthropic, covered in some mainstream articles. The article title is Anthropic's AI vaccine: train it with evil to make it good.

01:23:30

That's the fun take on the actual research, titled Persona Vectors: Monitoring and Controlling Character Traits in Language Models. So this is actually kind of related to some of the research, I think from OpenAI, that we covered just recently, that looked into this notion of, yeah, basically traits that models have in terms of being, what, sarcastic or dishonest.

01:23:56

And at the time, what we showed with this previous research is that if you just increase being sarcastic somehow, that makes you also misaligned in all sorts of ways. So this research is looking at monitoring the personas that a model has and then mitigating the ability to fine tune personas that are bad. And yeah, basically this, this talks about the pipeline of data that is being used and would make it so it's, it's harder to misalign the entire model with just a, a bit of tweaking.

01:24:37

I actually, again, week of insane stuff blasting, I did not see this paper, and I'm just looking at the Business Insider article about the paper, which is really confusing, and it's very journalist-trying-to-write-layman-language vibes.

01:24:53

But reading between the lines, and we'll correct this next time if this is wrong, my sense is that basically Anthropic puts in a prompt and they get, you know, obviously the activations in the model that result. Then they pick one layer, probably in the residual stream, and they inject the evil persona that they have derived from using their sparse autoencoder strategy, which we talked about before and can talk about later.

01:25:20

If people write in that they wanna hear more about this. But basically, inject the evil persona, like add it at that layer, but keep the training objective such that, hey, be harmless, basically. So that even with that injected activation that artificially makes the residual stream evil, at least as of that layer, the rest of the model learns to compensate for that. So when you then remove it, essentially the model is...

01:25:47

Like it's forced to behave well under, under the circumstance where you inject evil into its brain. So it's like if I took a slice of Andre's brain and I put in like an evil slice of my brain, and, but then like, I don't know, I, I gave Andre some like reeducation training to be nice and then I took out the evil slice of my brain and Andre's like, oh, I'm extra nice now. And even if somebody put an evil slice of brain in me, I would still be nice. That's the vibe.

01:26:13

Again, just my guess based on this pretty hard to parse article. Yeah, to be honest, that was kind of what I was doing. As per usual, trying to make it sound like I actually understand. Diving a bit deeper, the focus of the paper itself is more on monitoring as well. So, they introduce persona vectors. That's the kind of activation vector that indicates a model has a certain persona.

01:26:41

And then what that means is you are able to monitor the model during training for the emergence of this kind of personality. And then also monitor the fluctuations of a model at deployment time, which means that you can then mitigate that by flagging training data, preventing it in the first place, or intervening when stuff goes wrong, when you're able to detect it. So this basically tells you another way to detect misalignment in, I guess, a more reliable way.
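
As a rough illustration of the monitoring and injection ideas discussed above, here is a toy sketch. It assumes a persona vector can be approximated as the difference between mean activations on trait-exhibiting and neutral responses; that construction, the tiny two-dimensional "activations", and every function name here are illustrative assumptions, not Anthropic's code.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def persona_vector(trait_acts, neutral_acts):
    """Assumed construction: mean(trait activations) - mean(neutral activations)."""
    trait_mean = mean_vector(trait_acts)
    neutral_mean = mean_vector(neutral_acts)
    return [t - n for t, n in zip(trait_mean, neutral_mean)]


def projection(activation, vector):
    """Monitoring signal: how far the activation points along the persona direction."""
    norm = math.sqrt(sum(v * v for v in vector))
    return sum(a * v for a, v in zip(activation, vector)) / norm


def steer(activation, vector, alpha):
    """Injection: add alpha times the persona vector at one layer."""
    return [a + alpha * v for a, v in zip(activation, vector)]


# Toy two-dimensional "activations", purely hypothetical numbers.
evil = persona_vector([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
activation = [1.0, 5.0]
print(projection(activation, evil))                    # baseline reading
print(projection(steer(activation, evil, 2.0), evil))  # reading after injection
```

In a monitoring setup you would watch the projection drift during training or deployment; in the injection setup described earlier you would add the vector during fine tuning and then remove it.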

01:27:16

Well, now I'm curious about the thing that I said and whether someone's gonna try that, because that sounds actually kind of interesting and I wonder what would break. Yeah. Yeah. And next up, another result from Anthropic. And again, we are not as well read on this. This is slightly older. This is going back two weeks, but I don't believe we've covered it. So I think it's worth mentioning. The research is titled Building and Evaluating Alignment Auditing Agents.

01:27:46

The headline is they develop three agents that autonomously perform alignment auditing tasks. And when these are tested against intentionally misaligned models, these auditing agents are successfully able to uncover that problem, right? And so this is pretty much what it sounds like. You know, if you worry about misalignment, if you worry about models being bad, one way to try and prevent that is to have some other AI do a bunch of tests automatically to detect it.

01:28:23

This is Anthropic taking a stab at it and showing that at least to some extent you can make that work. On the lightning round we have this research note, so not a paper, but again, even papers aren't papers, so let's just call it a paper: Optimizing the final output can obfuscate chains of thought. So this is by, I think Alex Turner was part of this. He was like one of the original guys who did research on... Yeah, there he is.

01:28:50

Yeah. Sorry. TurnTrout is his handle, but it's Alex Turner and a bunch of other folks. He's one of the original researchers on the power seeking research that we may have talked about a while ago. It turns out that if you take a model and train it to, for example... I mean, so they train a model to, what's the word? Oh yeah. Okay. I'll give you a task. Sorry, I'll start with the task description.

01:29:17

Give a model descriptions of different objects, like a shiny wooden cube or something, and tell the model whether or not that object triggers a detector. And then you're gonna give a bunch of examples of that to the model and essentially have it become a classifier, have it predict, given a new object, whether the detector will be triggered. But now what you do is you penalize the model for using the word detector in its final output.

01:29:45

All you care about is the final output, not the chain of thought. You penalize the model for using the word detector in the final output. What you'll find is that the model uses the word detector less in its chain of thought, which is interesting because you never use the chain of thought as an input to your reward, right? So, they're calling this spillover. Basically the thing that you are optimizing for the property that you're optimizing for in the part of the output that you're evaluating.

01:30:14

Spills over into the part of the output that you are not evaluating. And they do this in a bunch of different settings, not worth going over the details. So, super interesting. One of the implications here is we want to, ideally, not optimize the chain of thought to look pretty or nice or whatever. Because that will tend to cause models to obfuscate. If they're hatching evil plans, we want them to write those evil plans in plain English for us all to see.
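
The setup described above can be sketched as a reward function that only ever reads the final answer. This is a minimal illustration, assuming responses are split on a `FINAL:` marker; the marker, the penalty size, and the function names are hypothetical choices, not the authors' implementation.

```python
def split_response(response, marker="FINAL:"):
    """Split a response into (chain_of_thought, final_output) on an assumed marker."""
    chain_of_thought, _, final_output = response.partition(marker)
    return chain_of_thought, final_output


def reward(response, correct_label, banned_word="detector", penalty=1.0):
    """Reward computed from the final output only; the chain of thought is never read."""
    _, final_output = split_response(response)
    final_lower = final_output.lower()
    task_reward = 1.0 if correct_label.lower() in final_lower else 0.0
    return task_reward - penalty * final_lower.count(banned_word)


def cot_usage(response, banned_word="detector"):
    """Measurement only: how often the banned word shows up in the chain of thought."""
    chain_of_thought, _ = split_response(response)
    return chain_of_thought.lower().count(banned_word)


sample = "The detector likes shiny things, so probably yes. FINAL: yes"
print(reward(sample, "yes"), cot_usage(sample))
```

The spillover finding is that `cot_usage` drops over training even though `reward` never looks at that part of the text.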

01:30:41

So one solution people had proposed is, okay, fine, well, let's just only use the actual output as the thing that we reward and reinforce on and not use the chain of thought. So the chain of thought can, you know, be free and honest and all that stuff. This is showing you that that actually is not sufficient or may not be sufficient. It's also testing a proposal for an alignment strategy that Daniel Kokotajlo proposed,

01:31:06

I think a couple months ago, which is really cool, called Mind plus Face, which we should talk about at some later point because we're running out of time. Good. I'm running out of time. Lots to talk about. So, moving on to another one of our favorite topics, compute and China. Got a couple stories on that. First up, from Epoch AI, Why China isn't about to leap ahead of the West on compute.

01:31:35

Basically what it sounds like, this is more of an analysis piece than news that covers the various factors, the manufacturing side, the software side, of why China is not likely to overtake the West in terms of the frontier of, let's say, removing the need for Nvidia to be on the cutting edge of AI compute. It's a fairly detailed piece that covers what regular listeners probably already know in terms of the factors at play.

01:32:08

But yeah, if you're interested in getting a bunch of details and getting into slightly more nitty gritty of things like CUDA, this article has that. It's a shame 'cause this is one of those articles where the actual value is in the details, and then the superficial description is gonna be like, ah, you already know this. The one thing I'll say is that, like most Epoch AI research, one of the most interesting things about it is that they

01:32:33

bother to plot and analyze the doubling times, or how quickly FLOP per second or compute power is growing over time, how their compute efficiency is growing over time, comparing it to how compute efficiency has been growing over time in the West, and kind of drawing lines and having them intersect. So if you're into that kind of stuff, as I am, you'll find this really exciting. You won't find anything qualitatively surprising here.
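
The "drawing lines and having them intersect" exercise amounts to comparing two exponential growth curves. Here is a small sketch of that arithmetic. All numbers in it are made-up placeholders, not figures from the Epoch piece, and the function names are hypothetical.

```python
import math


def capacity(initial, doubling_years, t_years):
    """Exponential growth: the quantity doubles every `doubling_years`."""
    return initial * 2.0 ** (t_years / doubling_years)


def crossover_years(leader_initial, leader_doubling,
                    follower_initial, follower_doubling):
    """Years until the follower's curve meets the leader's, or None if it never does.

    Assumes the leader starts ahead (leader_initial > follower_initial).
    """
    rate_gap = 1.0 / follower_doubling - 1.0 / leader_doubling
    if rate_gap <= 0:
        return None  # the follower is not growing faster, so the lines never cross
    return math.log2(leader_initial / follower_initial) / rate_gap


# Placeholder scenario: leader starts 8x ahead and doubles every 2 years,
# follower doubles every year.
print(crossover_years(8.0, 2.0, 1.0, 1.0))
```

On a log-scale plot both curves are straight lines, so the crossover is just where the two lines meet; if the follower's doubling time is not shorter, there is no intersection at all.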

01:32:58

Photolithography is the big thing that China's missing. No surprise. They're making progress on everything that's not that, but it'll take a couple years before that really has an effect. And, you know, probably we shouldn't be exporting Nvidia chips to China as a result; that's sort of the subtext. And next story: Inside the summit where China pitched its AI agenda to the world.

01:33:23

So this is covering the release of the Global AI Governance Action Plan from late July, which happened at the World Artificial Intelligence Conference in Shanghai, which I wasn't aware of personally as a thing that happened, but apparently it's a major event for AI personalities to gather, including prominent Western figures like Geoffrey Hinton and Eric Schmidt.

01:33:57

And the gist is that China, at least in the statements and this action plan, very much advocated for cooperation, with a coalition of major AI safety players co-led by China, Singapore, the UK, and the EU.

01:34:16

Yeah, in contrast to the US, with the US AI action plan kind of, you know, going the opposite way, very much going with the US as a lone figure in the space. Yeah, it seems to be... I don't know, it's hard for me to understand how meaningful this is, but there is a contrast to be seen between the US AI action plan and this global AI governance action plan, clearly.

01:34:45

So the US AI action plan often is kind of mischaracterized as not caring about safety, whereas what they actually do is call for investments in, basically, AI interpretability research, AI control and monitoring mechanisms to detect if things are going haywire, and contingency plans as well for what to do if they do. So this is not a government that's going, eh, whatever.

01:35:10

It's a government that you can model as saying, well, like, we're skeptical, we don't know, but we're maintaining intellectual sort of I don't know if intellectual humility is the right term, but like we're, we're uncertain enough that we're gonna keep our, finger on the pulse here. There wasn't, so there was some international engagement stuff on the chip side in the action plan. There wasn't much in the way of like, hey, let's coordinate around safety.

01:35:34

And that's where this sentence from this article comes in, when they say, with the US out of the picture, because there was no sort of senior representation from America there, quote, a coalition of major AI safety players co-led by China, Singapore, the UK, and the EU will now drive efforts to construct guardrails around frontier AI model development.

01:35:55

Now, it's not clear that that actually is meaningful given China's involvement and also the fact that the United States, if you ignore their labs, kind of is, like, the most important player, arguably. So apparently also, it's not just the fact that the US government was not there, there also wasn't representation from the leading frontier labs. Only xAI sent employees to the forum, which, I mean...

01:36:21

If I heard that a bunch of OpenAI and Anthropic people were there, I might be slightly worried that they didn't all bring burner phones and laptops, because this is, like, China. But, you know, it seems like this is a kind of China-sphere sort of validation piece. Another piece there is this guy Brian Tse, who is one of the founders of Concordia AI. This is a Beijing based safety research company. And he has been in this for a long time.

01:36:49

I remember talking to him like years ago, just when I was trying to understand the lay of the land. So they're claiming many Western visitors were surprised to learn how much of the conversation about AI in China revolves around safety regulations. Quote, you could literally attend AI safety events nonstop in the last seven days. And that was not the case with some of the other global AI summits.

01:37:09

Part of this is also that China has a particular interest in controlling the outputs of these models for political reasons. A kind of zero tolerance for references to Tiananmen Square or the Uyghur situation. And so that kind of creates more of an impetus for them to clamp down on the narrow behaviors they don't like while, make no mistake, having absolutely no guardrails whatsoever for the CCP's own use of these models in government, in defense, in kind of national security.

01:37:40

And that's one sort of distinction that they do weaponize quite effectively. They'll be like, Hey, we're imposing all these like safety guardrails on our models for our domestic companies. There are folks in the West who get excited about this and go, oh, China's serious about safety. But of course, in China, what matters is how will the government use it? And you'd better believe there's not gonna be guardrails on government use of the, the technology. So it's a bit complex.

01:38:05

But you know, people are talking about it. I don't think that's a bad thing. Just gotta keep the caveats in mind. Right. And, and there's no meaningful kind of updates in terms of events. But it does speak, I think, to positioning framing. Lately it's been an interesting development this year with China being the source of the leading open source models now you know, basically in the west.

01:38:30

And this kind of political effort on engaging with other institutions and then collaborating and so on is another potential effort to sort of have a sphere of influence in AI beyond being a singular kind of closed off player. Last story, of course, we've gotta talk about export bans and stuff like that. Nvidia H20 GPUs are reportedly caught up in the US Commerce Department's worst export license backlog in 30 years.

01:39:04

Billions of dollars worth of GPUs and other stuff are in limbo due to staffing cuts and communication issues. So, that's a very long title, pretty much telling you what's going on. There is, yeah, a real delay and difficulty in shipping chips, and particularly not just to China, but other regions as well, as there's the need for approval for these export licenses.

01:39:36

Yeah, my personal take on this is, this is actually a mistake in the first place for us to be shipping GPUs to China, especially given stuff like what Epoch surfaced in their report. What everybody's basically flagging is, like, China actually cannot, or has no line of sight on, catching up to... Nvidia, to the West. Yeah, exactly. That's it. And so the frame has kind of been, well, so let's ship China GPUs that are considerably worse than the GPUs that we can use here.

01:40:09

But I mean, in my opinion, the frame ought to be, let's ship China GPUs that are just marginally better than their best domestic GPUs. It should be more of a, what, ceiling than a floor, or floor than a ceiling. I don't know, something like that. And that's just not what's happening right now with the H20, which I think is a shame. The other thing is that China could even fab effective systems like CloudMatrix and that, but their challenge is doing it at scale.

01:40:36

So they literally can't. Even if Nvidia H20 GPUs were a lot shittier than they are, they just are benefiting from the excess capacity and having these things fabbed at, for example, TSMC and not SMIC. So I think those are real issues. I think it's a real shame, but I understand why.

01:40:55

So the theory here is something like, we don't want Huawei and SMIC to then have dominant market share in China and then eventually ramp up capacity to be able to compete with Nvidia elsewhere. The thing is, I don't think that's at risk of happening anytime soon.

01:41:11

And historically, the way the Chinese supply chain has worked is we've done things like block off narrow parts of their supply chain but not others, which creates like tons of economic pressure on both sides of the supply chain to support the closing of that gap when there's just one part of your supply chain that you're missing, you have buyers and sellers on either side who are willing to like, help co-invest to build up that ecosystem.

01:41:34

Whereas if you just nuke the whole supply chain and, like, carpet bomb it, which arguably is what we should have been doing starting five years ago, then it just creates a much harder circumstance. I don't think that there's a world where China could possibly be putting the pedal to the metal harder on domesticating their AI supply chain anyway.

01:41:53

So it's not like us selling them Nvidia chips in any way reduces the amount of dollars or effort that is going into the SMIC Huawei kind of complex. So I don't see a policy win, frankly, in this. Again, I understand the reasoning behind it. I just don't see this materializing in the way that is hoped for. But maybe I'm wrong and we'll just have to wait and see. It also doesn't seem like it's any sort of intentional thing.

01:42:22

This is kind of showcasing the state of the US bureaucracy more so, to me. Yeah. This particular thing. Yeah, you're right. It's just like bureaucratic incompetence, like failing to, yeah.

⁠¶ Response to listener comments

01:42:35

I did wanna address one listener comment real quick. There was a review on Apple Podcasts titled "The Politics of the Internet" where, basically, it begins with, "have to disagree with your discussion of internet data causing chatbots to lean liberal." And then it basically goes into various kinds of topics related to what typically Democrats versus Republicans in the US say. So I just wanna clarify what we mean when we say chatbots lean liberal.

01:43:10

This is drawing on studies that show that if you ask chatbots various questions and see what sort of ethical and moral values they align with, they tend to go more sort of Democrat-friendly than Republican-friendly. This is, like, well known. When you say bias, it doesn't necessarily mean that it's bad or wrong. This is just how these models tend to behave. They're more in line with Democrats.

01:43:41

So this is not to say that they say false things or, you know, are somehow bad, but they are gonna agree with Democrats more than they are gonna agree with Republicans. It is something that we have to, like, face honestly. Like, there's a lot of studies that have been done, including from sort of organizations that lean left a little bit more.

01:44:05

There was a famous study, this is an organization that leans left, but the Center for AI Safety did a really big one that was pretty decisive in terms of carving up, like, you know, these models very clearly seem to, and they had really good methodology.

01:44:18

We covered the paper on the podcast actually, but a bunch of others have come out since, and it's pretty obvious too if you just interact with the chatbots. You'll sort of see, like, there are all these examples of, you know, if you ask a question about Hillary Clinton versus Donald Trump back in the day. This is just, like, not the labs trying to do this necessarily, as you said. Just think for a second about the demographic of who's online, right?

01:44:40

Skews younger, skews towards people who live in big cities. Well, statistically that means quite overwhelmingly Democrat voters, right? So it's just a function of where your data comes from, and that makes it, by the way, incredibly hard to come up with. I mean, we don't know how to make an unbiased reporter as a human being, like that's not something anyone knows how to do.

01:45:00

So you can imagine how impossible it is to do with AI systems, that it's just kinda like... I think it's a good point, right? Like, if you try to make it perfectly, perfectly neutral, then it's just gonna refuse to say anything on most topics, right? So, again, bias is usually considered a negative term. This is more of a descriptive term, you could argue. This comment also talked about Grok a bit, so this is drawing on the whole conversation about Grok.

01:45:28

To be clear, I don't know if this was obvious, I am quite critical of the way that Grok has handled misinformation or being maximally truthful. They've done some things that are questionable in that regard. So I didn't want to imply that Grok is addressing a problem by kind of correcting it. Yeah, different thing. You can think of, yeah, all these models, or these labs, are applying, like, a different philosophy and a different lens, right?

01:45:58

Like, ultimately we don't know what the one true way to do this is gonna be. And to some degree, maybe having different lenses on it is probably a good thing. In the same way that you don't want to consume all your media from, like, either Fox News or CNN, you want to, like, have a nice balance. Right. Sort of similar idea here. Because there is no answer, the only answer is the meta answer of, have a balanced media diet or a balanced LLM diet.

01:46:23

I don't know, that's a half-baked thought, but, like, this is just a really hard problem. It's harder than it seems. Well, that's the episode. An exciting week, and we'll try to stick around next week as well, not skip any more for the next while. Thank you so much for listening. As always, we appreciate it if you give us feedback, if you share the podcast, all of that. More than anything, please do keep tuning in.

Transcript source: Provided by creator in RSS feed