Understanding the Most Viral Chart in Artificial Intelligence

⁠¶ Intro / Opening

Speaker 1

00:02

Bloomberg Audio Studios, Podcasts, Radio News.

⁠¶ Viral AI Charts and METR's Risk Mission

Speaker 2

00:18

Hello and welcome to another episode of the Odd Lots podcast. I'm Joe Weisenthal. And I'm Tracy Alloway. Tracy, one thing about AI is that there are lots of lines that go up.

Speaker 3

00:30

Yes, famously, there is perhaps one line that has captured the attention more than others when it comes to lines going up.

Speaker 2

00:37

Yes, but we're recording this April seventh. Did you see the Anthropic revenue chart, by the way? Oh.

Speaker 4

00:43

It's just extreme.

Speaker 2

00:44

Okay, it's just the number of lines going up. I mean there are some really.

Speaker 3

00:49

Let me caveat that. Up until recently, there was one chart of a line going up exponentially that became I think it's fair to say the most viral chart in AI.

Speaker 4

01:00

Right, Yes, I would absolutely agree with that.

Speaker 2

01:02

So one of the many lines that go up, or there are various lines that sort of capture, this is essentially just measures of AI progress, of what they could do, what the models are capable of, and so forth. And you know, there's all different benchmarks out there and hobbyist benchmark creators, et cetera, all kinds of benchmarks out there. There's an organization called METR, based out in San Francisco, and they measure how well AI models are doing at various sort

01:30

of engineering tasks, et cetera. And they have these charts showing how long, you know, certain tasks, how long it would take a human to do them, and then whether AI could do them. And yes, the line's just almost vertical. I think there was one of the ones that came out maybe very early this year or late last year, showing the latest Claude model.

Speaker 4

01:49

It, yes, like this is crazy.

Speaker 3

01:51

When I look at these charts, they're called time horizon charts. When I look at them, like, intuitively I kind of understand what they're saying, and you can kind of see the leap in progress between some of the previous models and Claude, right, the latest Claude model. And that's what got everyone excited, was you had this big exponential shift up in the capability of that particular AI model.

02:14

But then when I start like diving into what it actually says on METR's website about what these charts represent, I start getting really confused. I know everyone wants to get excited about AI and charts going up in general, but I think there's a lot of nuance here and we should probably talk about it, because the other thing going on with METR right now is they've become sort of the industry standard benchmark, and so a lot of

02:36

investment decisions are being based on these charts. And if you oversimplify them as just like okay, lines going up and then suddenly it goes up even more, obviously, people are going to start to get like maybe a little over excited.

Speaker 2

02:49

Can I say one other thing too that I'm very curious about? Like, I'm really glad that there are people designing various benchmarks for measuring AI progress. Seems like an important thing to get a handle on. But like if I were, like, say, like talented or smart enough to be like doing these things, I would go work for one of the labs and make ten million dollars a year or something like that. And so I'm actually curious

03:11

about a lot of the nonprofits, et cetera. It's like, do you really want to be like working at the cutting edge of AI in a nonprofit? I mean, I guess OpenAI is owned by a nonprofit, weirdly enough, but you know what I'm saying. Like, I would want the money.

Speaker 3

03:24

We should talk about it with our guests who are currently sitting right here.

Speaker 4

03:28

That's exactly right.

Speaker 2

03:29

I'm very excited to say we have the two perfect guests to talk about the most viral and maybe important chart in AI right now. We're going to be speaking with Joel Becker. He is a member of the technical staff at METR, and we're also going to be speaking with Chris Painter, the president of METR. So, uh, Joel and Chris, thank you so much for coming on Odd Lots. Thanks for having us. Yeah, really excited to chat with both of you. Chris, you're the president. I'll start

03:54

with you, like, what is METR? How long has it been around? What is this organization? What's its goal? Just give us the sort of sixty-second synopsis of METR.

Speaker 5

04:02

Yeah, totally.

Speaker 6

04:02

I can try and you know, sometimes I give a long version. I can try and do a short version here. So METR is a research nonprofit based in the Bay Area, like you said, dedicated to advancing the science of measuring whether and when AI systems might pose catastrophic risks to humanity as a whole, focused specifically on threats that come from AI autonomy or AI systems themselves. So when you talk about there's kind of this whole field in AI

04:30

of dangerous capability evaluations. People asking, can this AI system assist with a chemical or biological weapon attack, can it advance, kind of like, bad actors' ability to execute cyber attacks on a really large scale. METR is sort of specialized in specifically assessing how autonomous are AI systems, what is the scale and like length and difficulty of tasks that they're able to do by themselves, partially because we think

04:55

it sets the stakes for conversations about AI misalignment. So we sort of see ourselves as being on the hook for, at any given point in time, giving humanity the bits of evidence that are most informative for establishing the stakes of, are we reliant on AI systems as a society in a way that could make it really bad if they are misaligned?

⁠¶ METR's Safety Motivation and Measurement Mechanics

Speaker 3

05:16

I'm going to let Joe ask the question about why you're both working in a nonprofit instead of one of the labs later. But one question I do have is when I think of METR, you guys always come up in the context of these time horizon charts. And I don't mean this as an insult or anything, but I hardly ever hear anyone talk about the actual safety aspect of your mission. Why do you think that is?

Speaker 6

05:36

Yeah, so I think there's some distinction between our motive for assessing time horizons and the kind of how it gets used then by the rest of the world, or kind of like what the origin of the rest of the world's interest in it is. For METR, I think the reason that we work on things like the time horizon charts is because if we're trying to establish the stakes for talking about could AI systems go rogue or one day could they like try to take over and subvert

06:01

human control? Three years ago, if you went back to around when METR started, about fourish years ago, and it was started by Beth Barnes, Paul Christiano,

06:09

and this was kind of the initial motive. Is if you went back then and you said, why don't I think that AI systems are going to go rogue and like take over or overthrow humanity today the kind of most intuitive you know, you can come up with a lot of abstract reasons debates about the goals AI systems might or might not eventually have, but the kind of most damning in the moment reason is the AI system

06:31

just can't do much right. It doesn't make sense to talk about a question answer system that like can't even reliably answer programming questions saying like is it going to hack my systems or like backdoor me in some way. It just doesn't make any sense to talk about.

Speaker 3

06:44

It's going to write you a poem that you asked.

Speaker 6

06:46

Right, or won't even at the time they couldn't do anything for themselves. And so if you're like, kind of being able to subvert human control depends on agency, And so we wanted to come up with a measure that kind of tracks agency over time to kind of say,

06:58

when would this argument no longer hold? When are AI systems now able to kind of do long, complex enough actions by themselves that the argument, kind of the goalposts, almost move somewhere else, to like, well, we would catch the AIs, or the AIs don't want to subvert human control. And so I agree that there is a distinction between like how I think partially the exercise of trying to come up with these measures throws off things that are very like grounded and intuitive measures of AI progress that

07:26

might be more intuitive than just benchmarks. Right, So if you a lot of people are in the game of making just benchmarks, where you say, like here's my harm bench or something the AI gets seventy percent. That's much less of a kind of grounded or long lasting metric. Like it's hard to say what that means or how that generalizes, but the idea with time horizon is like, maybe it's more intuitive, and I think that helps both for safety and for like business understanding.

Speaker 4

07:49

So let's talk about what this chart.

Speaker 2

07:52

I go to the main chart here at metr.org, right on the front page, it's this time horizon chart, and it shows Claude Opus four point six as of February twenty twenty six able to complete a task length of eleven hours and fifty nine minutes with a fifty percent success rate. I have to admit, the first time I saw this chart or a version of this chart, what I assumed, and I suspect others assumed, is that it was able to go off and work on a task for eleven hours and fifty nine minutes then come back

08:24

with an answer. But apparently it's not that. Why don't you walk us through what's really being measured here? By the way, the previous high was GPT five point three Codex. That was five hours and fifty minutes. So I guess part of the reason this chart blew people's minds is because literally that's basically a double. But why don't you talk to us about what's really being measured here?

Speaker 7

08:43

Yeah, so fundamentally, you know, in simpler terms, we are plotting the difficulty of tasks the AIs are able to complete over time. And you know, the particular way that we measure the difficulty of tasks is in how long it takes humans to complete those same tasks that we're asking the AIs to do. So in this case, you know, talking about, for Opus four point six, something like, tasks that take humans twelve hours to do, we predict that it will succeed at those tasks around fifty percent

09:08

of the time. And yeah, you know, it turns out that when you plot, using this particular difficulty measure, how performant AIs are relative to how long it takes humans to complete these tasks, we see an exponential increase in capabilities for AIs. And what that ends up meaning is that you keep on having these doublings of capabilities every, let's say, four months.

Speaker 5

09:28

It seems on recent trends.

Speaker 7

09:30

Where you know, the next model is not merely going to have necessarily, you know, an hour longer time horizon, but perhaps be having some multiple of the.

Speaker 5

09:38

Time horizon of the previous model that's come out.
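
The doubling claim can be made concrete with a small sketch. It assumes a constant four-month doubling time and a twelve-hour starting horizon, the round figures quoted in the conversation; the function name is hypothetical, and this is a naive extrapolation rather than a forecast from METR.

```python
def projected_horizon(current_hours: float, months_ahead: float,
                      doubling_months: float = 4.0) -> float:
    """Time horizon after `months_ahead`, assuming a constant doubling time."""
    return current_hours * 2 ** (months_ahead / doubling_months)


# Start from the roughly twelve-hour figure quoted for the latest model.
for months in (0, 4, 8, 12):
    hours = projected_horizon(12.0, months)
    print(f"{months:>2} months ahead: about {hours:.0f} hours")
```

Each step multiplies the horizon rather than adding a fixed number of hours, which is why the curve looks close to vertical on a linear axis.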

Speaker 2

09:40

So then explain how the number there, twelve hours, is established. So there is some engineering task and you say, okay, this is a task that would require twelve hours, but humans have all different types of talent and capabilities. How do you establish that, okay, this was a twelve-hour task, this was a six-hour task, whatever?

Speaker 7

10:00

Yeah, so the simple answer is, literally, we get humans to sit down and complete the tasks that we give to AIs in as close to identical conditions as possible. So first we come up with the tasks, and that's, you know, that's a whole other kettle of fish. We can talk about exactly how we do that. And then, using essentially the same tools that we're about to give the AIs, we take talented humans, you know, not people who have seen this particular type of task before, but people who

10:24

have relevant expertise. So if it's a software engineering task, you know, they have software engineering expertise. Machine learning task, they have machine learning expertise. And then we time them, we see how long it takes for them to complete those tasks successfully, and then roughly we call the difficulty of the task, as measured in human time to complete, the average time it took these humans to complete

⁠¶ Task Selection: Focus on Engineering

10:44

the task. Then we'll run the AIs on this same set of tasks. Typically today, for the very easiest tasks, they're more or less always going to succeed. There's some mid range of tasks where, you know, perhaps they succeed fifty percent of the time, or perhaps for some tasks in that range they succeed zero percent of the time and for others one hundred percent of the time. And so they're getting fifty percent on average, let's say, and then for the much harder tasks, perhaps they're getting

11:06

closer to zero percent. And then the point at which we predict, you know, in the middle of all these zero percents and one hundred percents by task, the point at which we predict that they'd have a fifty percent chance of succeeding, that is, either a fifty percent chance of succeeding on some task, or fifty percent of the tasks of that difficulty that we think they would succeed on, that's what we're going to call the time horizon of these models.
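
One way to picture the procedure just described is to fit a curve of success probability against the logarithm of human completion time and read off where it crosses fifty percent. The sketch below is an illustration under stated assumptions, not METR's actual pipeline: the task times and pass/fail outcomes are invented, the logistic form is assumed, and `fit_logistic` is a hypothetical helper written in plain Python.

```python
import math

# Invented data: human completion time per task (minutes) and whether the
# AI succeeded (1) or failed (0). Not real measurements.
human_minutes = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960, 1920]
ai_success = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0]


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit p = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


# Work in log2(minutes), centered so the fit is well conditioned.
log_times = [math.log2(m) for m in human_minutes]
center = sum(log_times) / len(log_times)
a, b = fit_logistic([t - center for t in log_times], ai_success)

# The curve crosses fifty percent where a + b * x = 0, i.e. x = -a / b.
horizon_minutes = 2 ** (center - a / b)
print(f"Estimated fifty percent time horizon: about {horizon_minutes:.0f} minutes")
```

The negative slope `b` encodes that longer tasks are harder; the fifty percent horizon is just the task length at which the fitted curve passes through one half.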

Speaker 6

11:29

I think one thing also that could be good to explain here is the task distribution. I mean, this is not all activities that humans do. We are specifically here interested in, or the, like, there's some question in what tasks are, you know, like Joel mentioned, we're having people come into our office, do the task, to get a sense of how long it takes. We're not having them come in and like, you know, paint paintings

11:48

or write novels or, you know. We're focused here specifically on things that are in the distribution of work that an engineer at a, like, we like to think of it as like a frontier AI lab, the tasks that they might be doing. So this is things like software engineering, it's fine-tuning AI models, it is like software, machine learning, that kind of task.

Speaker 3

12:07

Wait, can I just ask why did you decide to focus on engineering? Because you could have widened it out to, you know, if we're talking about AI being capable of, you know, taking over the world, there are all sorts of substantive tasks that would fall under that category. So why just do engineering?

Speaker 6

12:22

Yeah, I think that for one thing, maybe other people in the team or maybe Joel has thoughts about this, but I think my particular motive in being interested in the time horizon on software tasks is that first of all, it's the thing that the industry is very, like, already, even before we started working on this, is very focused on. So it's one of the capabilities that you should expect to come along for the ride earliest. It's the thing that like a lot of optimization pressure is being exerted on.

12:44

And then I think that it is kind of the like thing that you would expect as an early warning kind of sign of this AI R&D automation. So to some extent, METR thinks of itself as trying to build, you know, science, or advance science, that can say when are we getting to the point that AI systems could improve themselves or speed up the pace of AI development. When will AI research kind of feed on itself? And the kind of core capability for that might be software

13:09

engineering and machine learning research ability. There are other skills that could be relevant to taking over the world. I think other people have done time horizons on, like, cybersecurity tasks.

Speaker 3

13:19

But I suppose it is true like the basilisk isn't going to paint its way into like power or something like that. Okay, it might.

Speaker 2

13:25

Deceive you. It might be very convincing or cunning in some way, and you hand over the keys.

Speaker 5

13:32

I always say for your mental models.

Speaker 7

13:33

You know, we don't have perfect evidence of this whatsoever, but my rough sense, sort of colloquially or you know, my prior before evidence comes in, is that if we did study tasks on these very different distributions, you know,

13:45

not machine learning, not software engineering. I'm not sure about painting exactly, but you know, perhaps, or other kinds of task distributions that we could enumerate, that basically we would see this similarly shaped exponential progress over time where every, I'm not sure exactly, but let's say, you know, four months, six months, something like that, the level of capabilities as measured in time horizon would be doubling at something like

14:06

that pace, maybe from a much lower level. So you know, one example that we do have better evidence of is that the AIs today are much less performant at, you know, anything that requires vision capabilities, seeing what's on a screen, clicking around at a computer, but they're getting, you know, tremendously better at that sort of thing over time.

⁠¶ Challenges of Human Baselining & Reliability

Speaker 5

14:22

Just to mention quickly.

Speaker 6

14:23

We did actually do a very kind of brief investigation of this on other task distributions. That's on our website somewhere, like cross-domain time horizons. I think we looked at data that Tesla shared on self-driving and, I'm forgetting the other, there's like OSWorld. Maybe some of these are like somewhat similar, still kind of in the distribution of software tasks, but trying to get further afield into things like vision.

Speaker 3

14:59

How big is the sample size on the humans who are actually doing work? And also is it getting harder getting like human engineers into the room to compete with like Claude Opus four point six versus say, if I was a mediocre engineer, and I'm not, I'm a non existent engineer, but if I was a mediocre one, I would like maybe I would feel good about going up against like GPT three or something, and maybe I would feel a lot worse about myself going up against like Claude.

Speaker 7

15:25

Yeah, you know, on these tasks, I'm in a pretty similar position myself to you. So we have approximately three, although it varies quite a lot across tasks, human baselines per task, so you know, typically we're averaging over something like three. I think the final numbers, it's my impression that they're not going to be so sensitive to the particular baselines that we use.

Speaker 6

15:44

Aren't the longer tasks more weakly baselined?

Speaker 7

15:47

Yeah, so indeed, I think it will get a lot harder to baseline these tasks as the length of task AIs are able to successfully complete gets longer and longer. You know, you might think at some point the length of task that they can complete is longer than the time, in four months' time they're going to be able to complete tasks of more than four months, and then it, you know, kind of becomes close to impossible

16:06

to get these four-month-long baselines. Of course we're not at that point yet, but, you know, it definitely has become more difficult to get these baselines as time has gone on.

Speaker 5

16:15

At the moment, not impossible, but very challenging.

Speaker 3

16:17

Joe, these are the future jobs for displaced engineers, right? It's competing against the Claudes for benchmark evaluation. We found the jobs.

Speaker 2

16:26

So we mentioned at the beginning the most viral chart in AI is this chart that you have on the front of your website. Your website defaults to this and it shows, you know, this doubling. So if we actually go back to November, let's say November twenty twenty five, Gemini three Pro, three hours and forty four minutes, Claude Opus four point six, twelve hours. Those are the fifty percent

16:49

success benchmark. If we go to the eighty percent benchmark, which the website doesn't default to, the pace of improvement looks a little less impressive to me. So okay, now it's like it does not have the same gap, pretty clearly. Now eighty percent is still not one hundred percent. And I know that METR's goal is about, like, you know, human safety and all this stuff. But when we think about people look at this and they use it as a stand in for how performant are

17:22

these models? Even eighty percent, you know, certainly for like any business application. I understand you're not like serving business here per se, but probably businesses care about this. Even eighty percent may not be good enough. And it does not look as crazy when you look at the eighty percent chart as it does at the fifty percent chart. Why the focus on the fifty percent chart? And given, like, why not look at the chart that just does not look as impressive?

Speaker 5

17:51

Yeah, maybe two central things to say.

Speaker 7

17:54

One, to my eyes, the eighty percent chart basically does look as impressive. Well, the doubling time is about the same.

Speaker 5

18:00

On it's the same.

Speaker 6

18:02

It's the same, it's just an offset. It's the same pace of progress.

Speaker 7

18:07

You know, it's something like five times smaller than the fifty percent number.

Speaker 5

18:11

But you know that only takes you two doublings.

Speaker 7

18:13

And if each doubling takes around four months, that means that in eight months' time you're going to have the same eighty percent success rate, roughly, as you do fifty percent success rate today.
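
As a rough check on that arithmetic: if the eighty percent horizon is about five times shorter than the fifty percent one, closing the gap takes log2(5) doublings, which is a little over two. The sketch assumes the approximate figures quoted by the speakers (a factor of five and a four-month doubling time); neither is a measured value here.

```python
import math

ratio = 5.0            # fifty percent horizon / eighty percent horizon (approximate, as quoted)
doubling_months = 4.0  # assumed constant doubling time, as quoted

doublings_needed = math.log2(ratio)
months_to_catch_up = doublings_needed * doubling_months

print(f"Doublings needed: {doublings_needed:.2f}")
print(f"Months for the eighty percent horizon to reach today's fifty percent level: "
      f"{months_to_catch_up:.1f}")
```

Exactly two doublings would correspond to a factor of four, so "two doublings, eight months" is a round-number version of roughly 2.3 doublings, or a bit over nine months, under these assumptions.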

Speaker 5

18:20

That's one thing to say.

Speaker 7

18:21

Maybe a second thing to say is, you know, remember at the beginning I said, essentially what we're doing is plotting the difficulty of tasks that these AIs can complete over time, just with this particular measure that ends up showing this clean exponential trend. And we've picked a particular number as our difficulty number, and you know that is this fifty percent reliability threshold. We could have picked a different one. I think there are reasons for picking the

18:42

fifty percent one. In particular, it's the one that statistically we're better able to measure. For some technical reasons, it's the one that shows up in previous work. There are a couple of other reasons why we go for fifty percent rather than eighty percent. Maybe a final thing to say is that this fifty percent number is sort of equivocating between these tasks

19:03

it's able to complete fifty percent of the time, and fifty percent of the tasks it's able to complete one hundred percent of the time and fifty percent it's able to complete zero percent of the time. And actually, I think the situation is it's somewhere in between, but it's a little bit closer to the latter, where there are some tasks that it's completing with near perfect reliability and some tasks in that range that it's completing with very

19:22

low reliability. And for downstream economic applications or for applications inside of these major AI companies or something, you know, you might think that that's more favorable in some sense, that there are some of these tasks where we're getting one hundred percent reliability, even for very challenging tasks.
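
The two readings of "fifty percent" can be contrasted with a toy example. The per-task success probabilities below are invented for illustration, and the 0.95 "dependable" cutoff is an arbitrary assumption; the point is only that very different task-level profiles share the same average.

```python
# Two hypothetical profiles over ten tasks of similar human-time difficulty.
coin_flip_profile = [0.5] * 10            # every task succeeds half the time
split_profile = [1.0] * 5 + [0.0] * 5     # half always succeed, half never do


def mean(values):
    return sum(values) / len(values)


def dependable_tasks(profile, cutoff=0.95):
    """Count tasks the model completes at or above the cutoff reliability."""
    return sum(1 for p in profile if p >= cutoff)


for name, profile in (("coin flip", coin_flip_profile), ("split", split_profile)):
    print(f"{name}: average success {mean(profile):.0%}, "
          f"dependable tasks {dependable_tasks(profile)}")
```

Both profiles average fifty percent, but only the split profile contains tasks that could be handed off without rechecking, which is the sense in which the speaker calls that case more favorable.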

⁠¶ Investment Interest and Public Information

Speaker 6

19:37

I think two other things, maybe it could be useful to just explain when you said that there are

19:41

technical reasons why it's easiest to measure at fifty percent. One, like, it is just the case that fifty percent is the point at which it is like least sensitive to, like, the distribution is kind of thickest, right? I mean, correct me if this is wrong, but to resolve something like ninety five percent, you would need way more samples, because then you need to have some that are, like, you need way more samples to be able to resolve that level of precision.

Speaker 7

20:04

I think there are some caveats to that picture. But let's say even more extreme. You know, let's say that we cared about, you know, ninety nine percent. In that case, if we had one percent label noise, quote unquote, that is, you know, if sometimes we were accidentally grading some of the failing tasks as passing, some of the passing tasks as failing, then we'd just never be able to estimate that reliably, right? And at fifty percent, this comes a little bit closer to washing out.
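
The label-noise point can be sketched numerically. The model below is a simplifying assumption, not a description of METR's graders: each pass/fail label is flipped independently with the same probability in both directions, and `observed_pass_rate` is a hypothetical helper.

```python
def observed_pass_rate(true_rate: float, noise: float) -> float:
    """Expected measured pass rate when each label is flipped with probability `noise`."""
    return true_rate * (1 - noise) + (1 - true_rate) * noise


noise = 0.01  # one percent of labels flipped, as in the example above
for true_rate in (0.50, 0.80, 0.99, 1.00):
    measured = observed_pass_rate(true_rate, noise)
    print(f"true {true_rate:.2f} -> measured {measured:.4f} "
          f"(shift {measured - true_rate:+.4f})")
```

Under this model a true fifty percent rate is measured as fifty percent, because the flips cancel, while even a perfect model is measured at ninety nine percent, so a ninety nine percent threshold sits right at the edge of what the noisy labels can resolve.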

Speaker 6

20:26

And I think one other intuitive thing here, or one intuition, is that if you give me a task and you give me the model, it is the point at which I think that the model, all you tell me is the time or the length of task that it takes a human to do the task. The fifty percent time horizon is the point at which I think it is more likely that the model will be able to do the task than that it can't. And I just find that intuitive.

Speaker 3

20:48

Yeah, how much interest do you get on these charts from potential investors specifically? And the reason I ask is because I was just messing around and like googling some stuff and when the Opus chart, the latest Opus chart came up, someone posted it on Reddit and I think like the second comment on it was someone going, how do I invest in OpenAI? And like people were, they were trying to club together to like

21:11

invest in these companies. So clearly there are people out there who are using these charts as investment tools.

Speaker 6

21:16

I would say, you know, we don't get an enormous amount of inbound from investment firms. I mean, sometimes, you know, VCs or whatever, we're based in the Bay Area, will reach out to us. I think that there's some kind of principle of, our goal is to inform the public and give them the best evidence that we can about when we might get to this point of kind of, you know, AI being, you know, fully autonomous or able

21:39

to improve itself. And there's some principle at play here of like I kind of want to enable people to do whatever they will do with that information, and I think that we don't engage a ton in kind of the like business side or investment implication of the work.

21:55

One kind of thought experiment I sometimes say to myself is if I do believe that at some point we're going to get this AI that's improving itself, and where like AI research is automated, and you have all these fears about a singularity, would I rather that like all of Wall Street like falsely didn't think that was coming when I believed it was coming, Or would I want them all to know that it was coming, given that I believe it's coming, and I think all of human

22:19

Maybe this is more a personal view, but I think if this is possible that we will automate AI research, I think all of humanity being aware of it, aware of where we're heading, is sort of a precondition for us all being able to figure out what to do about it. And so I don't kind of want like certain people or one side or one team to kind of like selectively be in the dark because they might invest on the basis of this or something like that. But we don't, you know, it's not where we put

22:44

our time. We're focused on informing the public. The public includes some investors.

Speaker 3

22:49

So on that note, like what is the actual level at which we're all presumably supposed to panic or at which, like, if you're a policy maker, you would start to get worried about AI being able to automate and improve on itself in a way that eventually becomes detrimental to humanity.

⁠¶ AI Autonomy, Real-World Performance, and Limitations

Speaker 7

23:05

I don't know exactly what the level is on this time horizon measure. I think, you know, one thing to say is we have made real progress on the science of measuring these AI systems and how capable they are. But I think there's a long way to go, and in an important sense, I think we're behind on this task. We're measuring some underlying technical trend and at some point I do think that implies greater risks.

Speaker 5

23:26

So astonishing things happening.

Speaker 7

23:28

Although Chris can speak more to other arguments that we might back out to for why, even if AIs are very capable, we still might not see catastrophic dangers emerge in the short term.

Speaker 5

23:38

Yeah, I'm sure, you know.

Speaker 2

23:39

I think part of the reason why the AGI chatter has really picked up, particularly in the wake of like everyone using Claude Code, is it's very easy to imagine it. So like you're sitting there, it's like, yeah, do this, do this. It's like I don't even need to be here, right? I think you sort of get a very intuitive feel for like how the human could come out of the loop. What happens today, because I'm

24:00

sure this has been tried. Like if you go to, like, ChatGPT and you say, here, you have Claude Code access, go build something, and the AI is, what actually happens today when AI is working with AI?

Speaker 7

24:13

Yeah, my sense is that at some point, you know, further away points than would have been true some time ago, the AIs will more or less fall on their faces, that, you know, there are some things they're not so capable of today, like collaborative hallucinations.

Speaker 4

24:27

Will they just, like, you know, just like devolve terribly?

Speaker 5

24:30

Yeah, I think all sorts of ways it can go.

Speaker 7

24:33

You know, at some point they're going to need to rely on external resources, and today they're not as capable at managing these external resources effectively. I think they're less capable at sort of ideation and sort of self awareness about where they are in the problem today than they are at these kind of raw software engineering skills.

24:48

You know, as you mentioned, the ways in which AIs are autonomous today, or close to autonomous today, is the human has the idea and then, you know, submits that idea to Claude Code or Codex or one of these other agentic tools, and then they handle the software engineering components, and possibly there's still some intervention after that. I do imagine that the sort of circle of autonomy or something gets larger over time. I

25:12

do think there's no fundamental barrier, it seems to me today, to AIs having those ideas, and so we move to a greater level of abstraction. But if we were purely relying today on these fully autonomous capabilities, you know, could you manage research departments, any departments of your choice inside of a major AI company?

Speaker 5

25:28

No, my guess is probably not.

Speaker 3

25:31

Actually on this note, this reminds me of something I wanted to ask. So when you look at the domain specific time horizon charts, so the ones that show, like, I think you call them task suites or something like that, like I guess productivity by a specific job, and you see these different lines, so sometimes you see like almost horizontal lines and sometimes you see squiggly or steeper lines. What is actually happening there? Like, how are we supposed

25:56

to interpret that? Like is this a measurement problem? Or does it say something very fundamental about like what AI can and can't do under current conditions?

Speaker 6

26:06

The thing that I think would be good for Joel to explain is that I think that there is a distinction here between, will AI, like, the time horizon chart doesn't by itself, I think, tell you will productivity in one specific kind of job increase because of access to AI.

Speaker 7

26:21

Yeah, maybe one thing to say on that chart showing the time horizon on these different task distributions relative to my guesses ahead of time, You know, I think those

26:29

time horizons are remarkably similar. I think the doubling times, the pace of progress in AI, seems more similar than I would have guessed to the original trend that we published, although, you know, imperfectly so. On this difficulty translating what we might call raw AI capabilities, in some sense, you know, capabilities on benchmarks or something, to

26:48

real world productivity, I think there are a number of differences, a number of ways in particular in which the benchmark results are overestimating what we might see in the wild. You know, not hugely overestimating. I think we do see that people are getting real utility out of these modern agentic AI tools, but overestimating to some extent.

Speaker 5

27:06

One is that the.

Speaker 7

27:08

Scoring implicitly is different. In real problems, I'm scoring based on something a bit more holistic than these algorithmic scoring procedures, these automatic scoring procedures that we're using at METR and that many other people are using in the benchmark world. There's some notion of code quality if you're working in software engineering, but for other tasks there's...

Speaker 3

27:27

Beautiful code, elegant code.

Speaker 5

27:29

People talk that yeah, yeah, yeah, for other tasks that's going to be coding.

Speaker 1

27:34

This is.

Speaker 7

27:36

One more thing is that the tasks that come up in the wild are more likely to be messy in

27:40

some sense. They involve working with other people. They involve working in much larger code bases or sort of more open-ended problems, maybe with something even adversarial going on. In the software engineering context, that might be that someone's trying to make a change to the part of the code base that you're currently working on and you need to work around that. And we do tend to see that the AIs are less capable working on these more messy problems.

Speaker 5

28:04

I don't want to overstate that.

Speaker 7

28:06

You know, it's not an enormous effect, but you know, that's one thing that gets in the way of these

28:10

productivity increases, you know. And then I do think there's something to the reliability question, right, where, you know, if it was true that for a certain type of task you only had, you know, eighty percent reliability, then every time you're going to need to go back and verify the work of these AIs, and not only verify the work of these AIs, but without the context of how they implemented the solution. Relative to that, if you went about the task yourself, you'd already have that in your head, and

28:33

so this verification step, quote unquote, would take less time. You know, I don't expect these frictions to be sort of so fundamental in some sense, or I imagine they go up levels of abstraction. I think not only is the underlying technical progress real, but I think that the productivity improvements are also going to show up increasingly.

Speaker 5

28:49

But yeah, there are these frictions.

⁠¶ AI Industry Culture, Competition, and Safety Tensions

Speaker 2

28:51

Tracy alluded to this question when she asked about VCs and investor interest. So people see these charts and regardless of what METR's point

Speaker 4

28:59

Is, like, this is incredible. I got to invest in this.

Speaker 2

29:02

But this brings me to this broader thing that I find very strange about AI, which is this kind of odd, sort of Baptist and bootlegger relationship between the AI labs, people who are building this stuff, and the sort of alignment safety people, and they sort of go back and forth, and like you have the heads of the labs saying yes, this might destroy the world and take all your jobs, and the safety people and the alignment people.

Speaker 4

29:27

Says, yes, this might destroy the world.

Speaker 2

29:30

And like, it's a very strange industry, right? Like the only thing that I can think of is cigarettes, where like they warn you that smoking is bad, except they had to do that because they lost a lawsuit.

Speaker 4

29:39

I don't think they were particularly inclined to do that.

Speaker 2

29:41

I can't think of any other industry where the most enthusiastic people about it are also warning and dooming about how bad the thing they're building could be. So I'm sort of curious, like you know, first of all, and I talked about this in the intro, like who is the type of person that's like working at METR, who is like skilled enough to do like advanced evaluations, and like where's the funding coming from? But like talk to us about like who's behind METR and why they're there.

Speaker 6

30:11

Yeah, totally. So I think one thing to say on the history of kind of people caring about AI safety in the Bay Area is that this concern goes back like quite a ways, I could say for over a decade. There are many people who got into the field because they saw this trend of deep learning, like what if deep learning works and it kind of goes all the way to artificial general intelligence and then superintelligence, and if

30:34

that works, then it could affect everything. I think possibly when people worry about this, there's a future that they have in mind with superintelligence that's even more capable than what people who think of themselves as like AGI-pilled today think of. They're imagining AI systems that can run, you know, the entire economy. And I think people who kind of a while ago, or many years ago, saw that vision and were sort of alarmed about the

30:56

stakes of it. Many people had this intuition that the thing to do is go and work in the industry because if you're like helping build it, you know what's the best way to shape the future, It's to build it.

31:05

And I think that there's obviously you could have questions about how sincere that is for many of the people who are in the industry, or if there's kind of a mix of different motivations and like you know, different wolves inside of them where maybe they partially are motivated by that, but also they're like there's kind of this like Oppenheimer, like, it feels good to feel like you're in the position of making something that's dangerous made.

Speaker 2

31:25

Someone once described OpenAI to me, this is years ago. A friend said it was like OpenAI was sort of like the Manhattan Project, except the goal was to not build the bomb at the very end, if that makes any sense. So to your Oppenheimer point, it's like very strange.

Speaker 6

31:40

And I think one thing to emphasize is, you know, while it could be that there's a mix of motivations now, there are definitely many people, I think, in the Bay Area who sincerely believe that the technology is headed to someplace where it will be very difficult for humanity to stay kind of in the driver's seat, or like stay in control in kind of a meaningful sense.

Speaker 2

32:00

It does seem as though, like people talk about all the big AI labs having like a PR problem or something like that. They keep bringing this up, and it's like maybe they just believe it.

Speaker 6

32:11

So I think that this concern is quite old, and I think many people have this intuition that they're like, I can influence the thing by building it. But now there's this problem that that logic kind of always recommends that you continue building more advanced technology or like more advanced AI systems. And now you have this problem where there's all of these companies and they all say that they need to build it because if they don't build it,

32:34

another company will. And then they could all have doubts about each other's commitment to safety or to these principles. Famously, the leaders of the labs really do not get along. They're not friends. It's not easy for them to kind of sort out the safety thing among themselves. And then even if all the US AI labs kind of agreed to do that, they then have this kind of external bogeyman of China, right? Well,

32:55

what will the Chinese companies do? And so there's this sense in which, even if the concern is real, I think a lot of people then who are in the industry have the instinct that there's no guiding principle for what they should do on safety other than to like build leverage for themselves for later. And I think that is a concerning state of affairs

33:15

for AI development to be in globally. You know, obviously we're trying to do something different by like informing the public or kind of giving like you know, you could imagine that this situation would be better if or like one gap that exists right now in that picture is that it's the people building the technology who most believe that it's going to be destabilizing and sort of all encompassing.

33:35

Maybe if the public and governments all were on the same page and believed the same thing, if it were true that it was headed there, then there would be kind of like more time for society to figure out a response from people who are not trying to build leverage over the technology themselves directly, or you know, control the technology via some kind of like public action or government.

Speaker 3

33:54

Can I just ask very quickly, since you brought up China and I don't want to forget to ask this question. But Qwen doesn't show up on your like main charts. I think you did a preliminary assessment of it a while ago, but like, what's the difference between assessing one of the closed models in America versus one of the open source models over in China?

⁠¶ Policy, Finance, and AI Development Pace

Speaker 7

34:12

I think one thing to say is that the capabilities are lagging behind. We think that they're lagging behind. I'm not sure of it.

Speaker 3

34:19

They just like don't make it onto the chart.

Speaker 7

34:21

So we do try to prioritize, just because METR has limited resources, staff time in particular, the models that we anticipate being on the frontier. And in general, the Chinese models have been something like, you know, nine

34:32

to twelve months, let's say, behind the US models. And I think the gap by time horizon is probably even larger than the gap by benchmark scores, where there's some, I'm not sure how scientific I can make this, but there's some colloquial sense or something in which the Chinese models are stronger according to benchmark scores than they would be on, you know, truly held-out problems in some.

Speaker 3

34:53

Sense like gaming the benchmark. Is that what that means.

Speaker 7

34:58

Or I'm not sure, you know, exactly how that shakes out, but something spiritually close to that. I'm not sure that's true for all Chinese models. I'm sure it's true for lots of models outside of China, but I think that's at least a possibility.

Speaker 3

35:26

I'm very curious when you talk to external actors in all of this, and I'm going to group them into I guess policymakers, investors, and the labs themselves, like who are you interacting the most with at the moment?

Speaker 6

35:40

I think that in practice we end up interacting a lot with AI labs because there's some amount of sorting out, getting access to models, working with them to set new precedents and things related to third party red teaming and third party risk assessment. We think of our audience as being sort of like high context members of the public, so the kind of like people, you know, who are maybe like you two, right, people who are kind of.

Speaker 3

36:03

Like people listening to this podcast.

Speaker 6

36:05

So people listening to this podcast people with kind of who have to make important decisions that will be informed by the pace of AI progress or like the kind of profile of AI capabilities. Overall, Because we're based in the Bay Area, I think we like disproportionately end up interacting with people who are building the technology and like

36:22

closer to it. Partially, I think, back to Joe's point before, I think this is kind of because it is the case that to kind of care about a lot of these frontier problems, you're kind of selecting for people who are building the technology themselves. There's some sense in which, like, the companies in the industry spend more time thinking

36:40

today about frontier capabilities assessment than the government does. I think like one day you could imagine us getting to the point where the government is like very focused on this and dedicating a lot of resources to it, and at that point I would expect METR to be spending more time talking to governments.

Speaker 3

36:55

That's kind of what I was getting at, because our sense is, in a lot of the conversations, like we talk to people and they'll say something about like, oh, it's important to have a social safety net for an AI-enabled future, but no one seems to be really thinking about it in a lot of detail.

Speaker 2

37:07

And when you say, you know, it's easy to imagine, or maybe the government will care more about this, not so easy for me to imagine. It seems like they mostly care about, you know, data centers and like where they're located and stuff like that. It would be nice if we had policymakers really looking at like frontier capabilities and stuff. Still seems kind of a way off, but

37:28

it is interesting. You know, you're like talking about like the sort of like capitalist dynamic, right, there's competition, and it's like you have a lot of people that are really worried about, oh, what if the other guys get to ASI or AGI first, or what if the Chinese,

37:42

et cetera. How much does the fact of like free market capitalism and the demand, you know, the big investors at the VC funds, like they want a return, they want an IPO, and we might get some big AI IPOs this year in fact. How much do you find that to be perhaps in tension with the safety element?

Speaker 6

38:00

Yeah, maybe. Yeah, people on our team would have different views on this. I personally don't feel there's, yeah, there's something where, like, investors are key decision makers.

Speaker 5

38:13

And you know they're people too.

Speaker 6

38:15

That sounds strange to say, investors are people. Do I sound like Mitt Romney or something? But I think that the element of this that feels like it could be in tension is if you build a bunch of financial obligations to keep kind of the pedal to the metal no matter what the risks are going into

38:31

the future. So, like, one thing I think a lot about is if you're like building up a huge amount of debt to build data centers and then say that you do find evidence that you're now worried about the, you know, loss of control from AI systems, you do find instances of AI systems going rogue. Do you now have like a financial commitment to build up those data centers and like continue kind of the pace of progress?

38:52

I think that is one place where I feel the tension pretty acutely, Like you're building these expectations into the market that could kind of force you to continue development when you otherwise would rather invest more in safety or Yeah, like it at least gives you a kind of financial obligation to continue scaling at least compute. I think that like the people themselves being informed about the progress does

39:16

not seem bad to me. I think it's like good in some ways for everyone to be on the same page about capabilities that could be related to subverting human

39:25

control later on. But I think in the world beyond like the information that METR shares, I do think there is a tension, like the fact that private companies are building this I think could cause really acute tensions in the future where people make these commitments that they wouldn't if they were trying to like slow or, you know, maximize social resilience of the technology.

Speaker 5

39:45

Yeah.

Speaker 7

39:46

I'm not sure how these things shake out, but I think there are some forces on the other side, right? You know, some safety-promoting technologies, quote unquote, or techniques do make the models more useful, you know, if they're better at complying with your will in some sense, and so you have capitalist incentives, standard capitalist incentives, to invest in that kind of research. Maybe that doesn't cover, you know,

40:06

the broad suite of safety research that seems important. It certainly doesn't rule out capabilities progress as being an important axis on which you do want to scale. But you know, I think there are some forces in each direction.

Speaker 3

40:20

Since you mentioned compute just then, can you talk a little bit more about I guess the relationship between like the time horizon improvements and the cost of compute at the moment, and like what you've actually seen and how that impacts it.

Speaker 7

40:31

Yeah, so one extraordinary fact from my perspective. I'm not sure how to fit these facts together, but something like the R and D spend on compute of these companies has risen exponentially, of course, and in fact it's risen exponentially at essentially the same rate as time horizon progress. You know, I think there's nothing necessary

40:48

about that. You know, it doesn't mean by itself that if compute progress slows then capabilities progress will also slow, but, you know, it's clearly an important input into AI progress. I expect that to continue to be true in future. Sometimes people ask us if we think it's plausible, or how plausible we think it is, that capabilities progress, this exponential capabilities progress, might slow down at some point

41:10

at some point in the future. And you know, one reason it seems it's hard for me to consider it plausible that it will slow down in the next at

41:18

least small number of years, is that a lot of those compute R and D investments are basically already baked in, right? Like the data centers have already been built, you know, plans for data centers even beyond twenty twenty seven, twenty twenty eight are presumably, you know, coming to fruition, and so some of these input investments are already baked in in some sense. So it would be surprising to see capabilities slow to the extent that compute

41:41

has been an important input. After that, maybe you need to think about, you know, other arguments for how capabilities might slow.

Speaker 5

41:47

But that's roughly how I think about it.

⁠¶ Critiques, Accelerating Progress, and METR's Role

Speaker 2

41:49

There's a very good or interesting critical Substack post called Against the METR Graph by someone named Nathan Witkin, who brings up an interesting point that I wouldn't have thought of had I not read it, which is you're paying the software engineers to come in and perform these tasks, right? It seems, you know, maybe this will be the last job of humans, just doing benchmarks. If I were like a good software engineer and you say, Joe, come in and do this task.

Speaker 4

42:16

How do you prevent me? Oh man, this is taking me a long time.

Speaker 2

42:18

Meanwhile, I keep getting one hundred dollars an hour for like looking at my computer and I'm like, whew, this is tough. I'm gonna have to come back tomorrow and keep working on this. How do you avoid the sort of conflict of interest where the person who's paid to work on this problem may be encouraged to take as long as possible to solve it, and with only three people working on it at times? I don't know, it seems like a conflict of interest to me.

Speaker 7

42:44

Yeah, so the short answer is, you know, in general, we are incentivizing these people to complete the task as quickly as possible, in particular to complete the task faster than others who are attempting the same task.

Speaker 3

42:57

They get a bonus if they do it faster than...

Speaker 7

43:00

Yeah, approximately, there's a bonus if they complete it faster than anyone else.

Speaker 5

43:04

You know. Another thing to say.

Speaker 7

43:06

Is I think it just is true that our baselining methodology, or the ways in which we compare to humans, in some ways leaves a lot to be desired. You know, ideally we would have invested, you know, one hundred times as many resources in having one hundred baselines, human baselines, per task, and those would have come from, you know, perhaps the very best software engineers or machine learning engineers in the world. Maybe that

43:28

would be the comparison that we're making. And indeed, we'd be doing all of this procedure over many more tasks, not just many more tasks, many more tasks, over wider task distributions than just software engineering or machine learning engineering. I mean, I do think time horizon still represents progress over over what's come before in the science of measuring AI capabilities. But you know, in some ways I'm sympathetic

43:49

to a lot of criticisms of time horizon. I do think that some of the details, at least for the work we've done so far, you know, aren't going to matter as much as you might naively think. So choosing the shortest baseline time that we end up observing or the longest time, you know, it's actually not going to make that much difference to the final measurements.

Speaker 5

44:07

You know.

Speaker 7

44:07

Of course, we do think these people are talented software engineers or cybersecurity people or so on, depending on the task. But you know, perhaps we could have found even more talented people. They would have completed it in half the time. And so, you know, naively, it would seem like the time horizon that we estimate of these models would be half as long as we actually end up observing. But

44:26

of course that wouldn't change the doubling time. It would mean you'd get to the same level after another four months. In some sense, the big picture that I want time horizon to point to is less this, like, Opus four point six is twelve hours in particular, and more that we're seeing this remarkable pace of progress that shows no signs of slowing in the recent past, and I think in the near future as well. You know, in fact, it shows some signs of speeding up.
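
The arithmetic behind this point can be checked with a short sketch. It assumes a pure exponential trend, and it uses only the round numbers quoted in the conversation (a twelve-hour horizon and a four-month doubling time) as illustrative inputs, not as METR's published estimates. The helper names `horizon` and `months_to_reach` are hypothetical, introduced only for this example.

```python
import math

def horizon(t_months, h0_hours, doubling_months):
    """Time horizon in hours, t_months after a reference date, on a pure exponential trend."""
    return h0_hours * 2 ** (t_months / doubling_months)

def months_to_reach(target_hours, h0_hours, doubling_months):
    """Months needed for the trend to grow from h0_hours to target_hours."""
    return doubling_months * math.log2(target_hours / h0_hours)

# Illustrative inputs taken from the conversation, not measured values.
DOUBLING_MONTHS = 4.0
MEASURED_HOURS = 12.0

# Suppose faster human baseliners would have halved every task time,
# so the estimated horizon today is halved as well.
halved_hours = MEASURED_HOURS / 2

# How long until the halved trend reaches the originally reported level?
delay = months_to_reach(MEASURED_HOURS, halved_hours, DOUBLING_MONTHS)

# Growth over any four-month window is the same factor on both trends.
growth_measured = horizon(8, MEASURED_HOURS, DOUBLING_MONTHS) / horizon(4, MEASURED_HOURS, DOUBLING_MONTHS)
growth_halved = horizon(8, halved_hours, DOUBLING_MONTHS) / horizon(4, halved_hours, DOUBLING_MONTHS)

print(f"delay to reach the same level: {delay} months")
print(f"four-month growth factor: {growth_measured} vs {growth_halved}")
```

Under these assumptions, rescaling the human baseline shifts the curve by one doubling period and leaves its slope unchanged, which is why the speaker treats the pace of progress as the more robust quantity than any single headline number.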

Speaker 3

44:50

Well, I was going to ask about this because I think recently the statistic that you would always hear was like a doubling every seven months something like that. How fast do you see it going in the near future?

Speaker 7

45:02

Yeah, so I was a doubling-every-seven-months person. There was controversy in our team about what to believe here, because when we originally published this work approximately a year ago, you'd see, you know, if you plotted a single straight line, a single exponential, you'd get something like, you know, six or seven months, let's say.

Speaker 5

45:19

But if you.

Speaker 7

45:20

Restricted to just the time since I think GPT-4o, since the twenty twenty four models onwards, you'd see something closer to this sort

Speaker 5

45:27

Of like four or five month trend.

Speaker 7

45:29

And some people believed in that, and you know, some people like me had the intuition that, well, we have so few data points, we should really be estimating over this larger number of data points, and the larger number of

Speaker 5

45:39

Data points says every six or seven months.

Speaker 7

45:41

There are a couple of things that have changed my mind and made me realize my colleagues were right since then. One is that for the models that have come out since, you know, which trend has better predicted how performant those models would be? And it's very clear that the answer to that is the four month doubling time and not this seven month doubling time. You know, there's some possibility that it could speed

46:04

up again. We've seen it speed up once. I think there are some reasons in principle why you might expect it to speed up again. I think there are some caveats about this. You know, these are maybe some takes that my colleagues wouldn't agree with, and so, you know, maybe you should discard that, or you should think that they're going to convince me in the way that they did with the four month versus seven month doubling times.

46:24

I have some suspicion that the tasks that METR is measuring performance on are, you know, in some sense a more and more narrow slice of possible tasks, and in particular, a more and more narrow slice that is perhaps similar to the kinds of tasks that you'd expect these major AI companies to be training on in the first instance. And so in some sense, we're increasingly, more so than was the case before, measuring progress on the exact types

46:51

of tasks that they're trying to get better at. You know, you might think, for instance, the kinds of tasks that would make for good reinforcement learning environments, the kinds of tasks that you can score quickly and cheaply and automatically. I think that progress is real. I think that progress generalizes to some extent to other types of tasks. I think we're seeing, you know, remarkable progress on these more messy tasks, for example.
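
The disagreement described here, one exponential fitted over every model versus one fitted only over recent models, can be illustrated with a small sketch. Everything in it is an assumption for illustration: the data are synthetic, generated to double every seven months and then every four months, and `fit_doubling_months` and `synthetic_horizon` are hypothetical helper names, not METR's code or data.

```python
import math

def fit_doubling_months(points):
    """Ordinary least-squares line through (month, log2(horizon)); returns months per doubling."""
    n = len(points)
    xs = [month for month, _ in points]
    ys = [math.log2(hours) for _, hours in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return 1.0 / slope  # slope is doublings per month

def synthetic_horizon(month):
    """Made-up horizon: doubles every 7 months until month 28, every 4 months afterwards."""
    if month <= 28:
        return 2 ** (month / 7)
    return 2 ** (28 / 7) * 2 ** ((month - 28) / 4)

# One synthetic "model release" every four months over four years.
data = [(month, synthetic_horizon(month)) for month in range(0, 49, 4)]

full_fit = fit_doubling_months(data)
recent_fit = fit_doubling_months([(m, h) for m, h in data if m >= 28])

print(f"doubling time, all points:    {full_fit:.2f} months")
print(f"doubling time, recent points: {recent_fit:.2f} months")
```

On data like this, the fit over all points lands somewhere between the two regimes, while the fit restricted to recent points recovers the faster rate. The test the speaker describes is the out-of-sample version of the same comparison: checking which fitted line better predicts models released after the fit was made.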

Speaker 2

47:10

I have one last question, which is like how big is your team, funding? And like also how many people at METR are basically like really rich from AI and they're like, you know what, I'm good. I don't need to pursue, like stick around for the IPO or whatever. I'm set and now I want to work on something that like humanity.

Speaker 3

47:30

No.

Speaker 2

47:30

I've seen like there are other independent AI researchers and they talk about this. It's like, I want to be able to talk about what I saw. Miles Brundage, someone who has like a little think tank, he's talked about this. What's like, how many people are like rich already and they're like, okay, now I want to work for something that's public facing?

Speaker 6

47:47

Yeah, so METR right now is about thirty people, but we're growing and hoping to grow fast. We are hiring, I should say, metr.org/careers. And yeah, you were touching before on kind of the thing about, is it difficult to be a nonprofit?

Speaker 5

47:59

You know, we can't pay people in equity.

Speaker 4

48:02

We got to get an IPO, right.

Speaker 6

48:04

Yeah, there's no IPO for METR, but we do try to pay competitively on cash compensation, right? So that's an area where we feel we can like somewhat compete with labs. And it's true that I think a lot of our team is just motivated by trying to kind of do something different, like, you know, all the companies to some extent are in this business of kind of like building somewhat redundant products, kind of competing

48:27

for the same role in the world. And METR is in a really unique position at the moment where I think that we have like access and the ability to communicate these ideas and explain the state of AI research to a number, like a lot of audiences, that might be hard for like individual researchers inside of a company. Like we get to talk to a lot of governments directly. We get to come here and talk with you all,

48:48

And that's kind of different. I think if you look at all the actors that are working on the frontier of AI research or AI safety, you kind of if you compare us to AI lab staff, I think that our work gets to be we get to kind of every day work on whatever research we think will be most informative to the like public decision.

Speaker 2

49:05

Do you have ex-AI, not xAI, but ex as in former AI lab staff, who maybe there was a tender at some point and now they work at METR?

Speaker 6

49:14

Yeah, we do, okay of those. Yeah, so we do

49:16

have some people who previously worked at AI labs. I do think that as time goes on, I think one hope that I have is that more, you know, there will be more and more researchers who have kind of like made the money that they need from working in the industry and now are excited and kind of like lifting all boats by working on kind of like inside of an organization where the north star can be what is most informative to the rest of the world outside of these like relatively small set of companies.

Speaker 7

49:42

Chris is very polite. I think that's wonderful. I'm tempted to be a little bit more aggressive in this conversation. I think we have spoken through METR's work on some of the most important problems in the world, problems that are going to define the future, I think, for not just the next years, but, you know, coming decades, maybe even coming centuries. And we've also spoken about some of the ways in which METR's work is not what you

50:07

might want it to be. That there's a long way to go in the science of evaluating these ais. Why have we not made more progress? You know, maybe maybe a couple of reasons. I think clearly the central reason is that we are bottlenecked on technical talent, on incredibly

50:22

capable people to come work on these questions. I was on a METR work retreat recently where we were brainstorming, you know, twenty, thirty of these what seemed like world-important problems, problems that we think no one else is going to get to if we do not get to them, and we are able to conduct research on how many of those problems? I think it's one.

Speaker 5

50:40

Two.

Speaker 7

50:40

You know, maybe if we do an extraordinary job this quarter, it might be three. As Chris alludes to, I think if you're interested in, you know, less working on redundant products at these major AI companies and more advancing our understanding on some of the most important questions in the world that are going to shape the world for years to come, METR is a great place to go.

Speaker 6

50:58

Well, yeah. One more thing to say about that is like the vibe inside of METR is a state of triage, right? And I think people often tell themselves, externally people might guess, oh, you know, METR, it's outside of any of the AI labs, so the thing it might most struggle with is things like access to AI models. You know, you can't do the research you want because you're not building the thing yourself in practice, or that's

51:17

the story that people always tell us. You have to build, you know, the future to shape it. In practice, I think our experience at METR is that, like, when we want to try new types of research that would require new kinds of structured access, our experience at this point has been that AI labs are like pretty game

51:30

to play ball on that. And the thing that is more happening is that we're having to turn down opportunities to do stuff like that because we don't have the staff that we need to make those things happen.

Speaker 2

51:40

Interesting. Joel and Chris, thank you so much for coming on Odd Lots. Absolutely fascinating conversation, and I appreciate your taking your time.

Speaker 5

51:47

Great to have you in the studio.

⁠¶ Concluding Reflections and Outlook

Speaker 6

51:48

Yeah, thank you so much for having us.

Speaker 2

52:03

That was a really interesting conversation to that we're starting from the end sort of the idea of like, Okay, here are some really important questions, like let's just set everything aside.

Speaker 3

52:12

And there's thirty people working on there, there's.

Speaker 2

52:14

You know, and like how many people want to do it, and it's like, okay, we try to match cash comp et cetera. Yeah, that seems like kind of a tricky issue if like, if you accept the premise that these are some big questions we have to get right and you got to land this plane hopefully, Like that's a bit of an issue.

Speaker 3

52:31

Yeah. The other thing I thought was really interesting was the Chinese models not really making it on the charts even though, like we know, in the market itself, like when DeepSeek, when that new version came out, that was like this huge thing where everyone started to panic, and to not see it even like land on the time horizon chart, it's kind of interesting.

Speaker 4

52:51

I guess it's interesting.

Speaker 2

52:52

I mean, I guess I buy the reasoning from their perspective that the only interesting question from METR's perspective is like the most cutting edge, slightly adjacent to the most interesting chart for like business, right? So it's like, okay, we know that DeepSeek and Qwen and Kimi and all those are like very impressive. Do they push like

53:12

the very frontier? Perhaps not, but just in general, I find this space so weird because it's like, here you have these people who are like clearly quite alarmed at the potential here, and most people, I think, look at these charts and they say like, wow, this is like I want to invest in this, or this is.

Speaker 4

53:30

Like no, I know, I know.

Speaker 3

53:31

Like, that's why my first question was like, you're here for AI safety purposes, but everyone seems to get excited about the line-go-up charts, right? Like there's a disconnect all connected. Like I say, when an industry basically says it's worried about itself, you should pay attention.

Speaker 2

53:48

It's really strange. This gets back to, you know, very It's very strange where you have the CEOs of these companies who are in many cases the most alarmist, and there's this sort of cynical thing. And I don't totally discount the cynical interpretations like oh, they're saying this because they want to get investors and so forth, and they

54:05

need all this money. But look, it was also true that OpenAI and Anthropic, but OpenAI a little more, were like founded with these very exotic corporate structures of like a private company owned by a nonprofit, et cetera, which they presumably did because they took pretty seriously the fact that this technology is science. It was like very strange and not just like it's not just enterprise office right, Like.

Speaker 3

54:29

They were self limiting in a way.

Speaker 2

54:31

One other interesting thing too, this idea is like, okay, like, first of all, what's the difference between a seven-month and a four-month doubling time?

Speaker 5

54:40

Not much?

Speaker 1

54:40

You know.

Speaker 3

54:40

It's like these people's like, oh, I can't but it's exponential, isn't it.

Speaker 4

54:43

I guess it's exponential, But it's still funny to me.
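As a rough illustration of the doubling-time question raised above (not something computed in the episode), the sketch below compares how much a quantity grows under a seven-month versus a four-month doubling time. The 24-month window and the function name `growth_factor` are assumptions chosen for the example, not figures from METR.

```python
def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Multiplicative growth after months_elapsed, given a fixed doubling time."""
    return 2 ** (months_elapsed / doubling_months)


# Hypothetical comparison window of two years (an assumption for illustration).
WINDOW_MONTHS = 24

for doubling in (7, 4):
    factor = growth_factor(WINDOW_MONTHS, doubling)
    print(f"{doubling}-month doubling over {WINDOW_MONTHS} months: about {factor:.1f}x")
# 7-month doubling over 24 months: about 10.8x
# 4-month doubling over 24 months: about 64.0x
```

Under these assumptions, the four-month case compounds to roughly six times the seven-month case over two years, which is why the gap between the two estimates matters more than it first sounds.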

Speaker 2

54:45

It's like, oh, I think like AI is going to destroy all white collar work in two years, and someone else is like, no, no, I think it's gonna be three years. As if that makes any difference whatsoever. But one thing to consider, they all sort of alluded to this. You know, you had like OpenAI shut down its, like, video efforts,

55:01

et cetera. So perhaps part of the story is just this intense focus now on the software engineering side, as what these labs are working on. Yeah, and sort of like all these other side quests are not as important. So maybe we will see even more rapid progress on some of these technical benchmarks, because clearly, from the labs' perspective, that's where the action is, more than some of these consumer things like making images or videos.

Speaker 3

55:27

Yep, all right, shall we leave it there? Let's leave it there. Okay, this has been another episode of the Odd Lots podcast. I'm Tracy Alloway. You can follow me at Tracy Alloway.

Speaker 2

55:34

And I'm Joe Weisenthal. You can follow me at The Stalwart. Follow our guest Chris Painter, he's at Chris Painter Yup. And Joel Becker, he's at Joel underscore B K R. Follow our producers Carmen Rodriguez at Carmen Arman, Dashiell Bennett at Dashbot, Kail Brooks at Kail Brooks, and Kevin Lozano

55:51

at Kevin Lloyd Lozano. And for more Odd Lots content, go to Bloomberg dot com slash odd lots, where we have the daily newsletter and all of our episodes, and you can chat about all these topics twenty four seven in our Discord, discord dot gg slash odd lots.

Speaker 3

56:05

And if you enjoy Odd Lots, if you like these AI episodes, then please leave us a positive review on your favorite podcast platform. And remember, if you are a Bloomberg subscriber, you can listen to all of our episodes absolutely ad free. All you need to do is find the Bloomberg channel on Apple Podcasts and follow the instructions there. Thanks for listening.

Transcript source: Provided by creator in RSS feed
