
Big Tech's Tariff Chaos + A.I. 2027 + Llama Drama

Apr 11, 2025 · 1 hr 9 min · Ep. 131

Summary

This episode explores the impact of Trump's tariffs on tech companies like Apple, Nintendo, Meta, and TikTok, examining their strategies for navigating this new chaotic environment. It also features AI researcher Daniel Kokotajlo, who discusses his AI 2027 report, forecasting AI's transformative potential and possible dystopian outcomes. Finally, the hosts discuss Meta's AI benchmark controversy, questioning the integrity of AI evaluations.

Episode description

This week, with the tech world in chaos over President Trump’s tariffs, we look at how four specific companies are navigating the new day-to-day reality. Then, the A.I. researcher Daniel Kokotajlo returns to the show to discuss a new set of predictions for how artificial intelligence could transform the world in just the next few years and how we avoid the most dystopian outcomes. Finally, we explore whether Meta cheated on an important A.I. benchmark with its new Llama model.

 

Guest:

  • Daniel Kokotajlo, executive director of the AI Futures Project

 


We want to hear from you. Email us at [email protected]. Find “Hard Fork” on YouTube and TikTok.

Unlock full access to New York Times podcasts and explore everything from politics to pop culture. Subscribe today at nytimes.com/podcasts or on Apple Podcasts and Spotify.

Transcript

Hi, it's Alexa Weibel from New York Times Cooking. We've got tons of easy weeknight recipes, and today I'm making my five-ingredient creamy miso pasta. You just take your starchy pasta water, whisk it together with a little bit of miso and butter until it's creamy, add your noodles and a little bit of cheese. Mmm.

It's like a grown-up box of mac and cheese that feels like a restaurant-quality dish. New York Times Cooking has you covered with easy dishes for busy weeknights. You can find more at NYTCooking.com. Oh, you know, just binge buying cheap Chinese stuff online. To beat the tariffs? Making your final Shein purchases before that company shuts down? Yes. No, I actually did buy a bunch of stuff over the weekend because I thought this might be my last chance. Yeah. Casey,

what cheap overseas good are you going to miss most after the tariffs kick in? Oh, you know, I was never a big, like, oh, I got to go on to Temu and get a pressure cooker for six dollars or whatever. Like, that was never my journey.

But I know that it's a major pastime for a lot of people. Yeah. Yeah. Well, for me, it's like the ability to buy cheap crap for my kid has been revolutionary. My kid, the other day, started saying the phrase dinosaur unicorn, and I thought, that's not real. And he says, I want a dinosaur unicorn. And I said, well, that's not a thing. We can't have that. But then this little bell goes off in my mind that says,

Someone out there has made a dinosaur unicorn something. Almost certainly. My wife finds like eight different dinosaur unicorn t-shirts and buys one of them. And now he's got this dinosaur unicorn t-shirt that he absolutely loves. That would not happen in a tariffs world. As of today, that shirt costs over $400. Yes. Yeah. Well, I mean, I'm sure Jude looks great in that. He does. He does. And he's going to have to wear it for a long time. I hope it stretches.

I'm Kevin Roose, a tech columnist at The New York Times. I'm Casey Newton from Platformer. And this is Hard Fork. This week, the tech world is in chaos over Trump's tariffs. Then, AI researcher Daniel Kokotajlo returns to the show to discuss a fascinating new set of predictions for how AI could transform the world in just the next few years. And finally, did Meta cheat on an important AI benchmark? For the second week in a row, we have been interrupted by news about these Trump tariffs.

Now, there was a time in the history of the Hard Fork podcast where the only thing that would cause us to rip up a segment and re-record it was if Sam Altman had been fired or rehired. But now we live in this new reality where news can change on a dime. And over the past few days, that is exactly what we've seen. I think it's fair to say Hard Fork has been hit harder by the tariffs than any other company. That's true. That's true. We are bracing ourselves for massive impact.

and getting ready for the new reality. So, Casey, every great era deserves a name, and I think we should call this era in the technology industry the Chaos Meta. Nothing to do with Meta the company, but in video gaming, the meta is sort of the overall set of conditions that the players have to navigate. And I think it's fair to say that the chaos and uncertainty surrounding what Donald Trump is going to do on any given day is the new meta for Silicon Valley's largest companies.

Yeah. Remember like how when we were talking about whether or not TikTok would be banned, which also had a lot to do with what Trump wanted, we talked about how it was kind of simultaneously alive and dead at the same time. Now that's just the entire U.S. economy, Kevin. Yes.

As of early this week, it looked like we were going to get these massive tariffs on goods imported to the United States from many, many countries all over the world, larger than any tariffs we've seen in the recent history of this country. Then on Wednesday, as we were taping our episode, we got the news that the Trump administration was pushing pause on most of them.

Most of these reciprocal tariffs on countries like Vietnam and India were going to be delayed for 90 days, and there would be a baseline 10% tariff rate applied, but not the much higher rates that people had been fearing, except for China, which would have its tariffs increased. And on Thursday, we learned that those tariffs would actually be 145% on Chinese goods entering the U.S.

The problem is, with a podcast, we can't just have a little ticker on the bottom that shows you what the current tariff is. Yes, but what we saw early this week was that the stock prices of all the biggest U.S. tech companies took a dramatic nosedive in response to fears about these very high reciprocal tariffs.

Now, after the news that these tariffs are going to be placed on a 90-day hold, except for China, some of these stock prices have rebounded. Apple in particular had its biggest trading day in many years after the news of these tariffs being delayed came out. The stock market whiplash is part of the setting that the tech companies have to deal with now.

But the bigger picture is that doing business in Trump's America is turning out to be very difficult, not because the administration is necessarily unfriendly to these businesses, but because there's just so much fast-moving news that it is hard for businesses to do any kind of planning or strategy at all.

Well, I mean, I wouldn't say this is a particularly business-friendly set of announcements that have been made. I mean, sure, I guess it's friendlier to pause the tariffs than to continue them. But the general chaos, Kevin, I think has been really bad for American companies.

Yeah. So even beyond the tariffs, there are a bunch of things that the Trump administration has been doing that have impacted the tech industry. Restrictions on immigration, cuts to science funding, these antitrust cases, many of which are still going forward. So I wanted to kind of give our listeners a sense of how this instability feels on the ground in Silicon Valley to the biggest tech companies.

And you had a really smart idea, which was to look at the new chaos meta of Trump's second term through the lens of four tech companies. So today we're going to take a look at how Trump's new policies and these tariffs have affected four companies, Apple, Nintendo, TikTok, and Meta, all of which have faced significant challenges since Trump took office and all of which are now trying to figure out: How do we go forward? What do we do? How do we navigate this new uncertain climate? Yeah.

So let's start with Apple. Casey, what is going on with Apple? Well, look, of all of the tech companies, Apple has long been the most dependent on China. That is where 90% of iPhones are made. The company is just heavily dependent on the supply chain relationships that it has there. And the tariffs, which

are now 145% on goods coming out of China, have just really sent a shiver through that company. Earlier this week, Apple had its worst four-day trading period since the year 2000. Once the pause was announced, the stock started to come back. But this is a very volatile situation for them. And the underlying dynamics are the same, which is that it is simply going to be much more expensive for Apple to sell goods made in China here in the United States, Kevin.

Yeah, and obviously one of the hopes of these tariffs is that they will drive manufacturing back to the United States. There's some hope among members of the Trump administration that this could even force Apple to consider making the iPhone in the United States. Do you think that is likely, and why? No, and in fact, I think it's almost sort of worse, Kevin, because this week the president's press secretary said that the president believes that iPhones can be made in the United States,

despite the fact that we know that it is much more expensive to manufacture things here in this country. It's very important to remember that whatever the Trump administration might hope these tariffs accomplish, they have not accompanied them with

any plan to increase the manufacturing capacity in this country. The whole thing is just a wish and a prayer that at some point in the future, Apple might have a magical iPhone factory stocked with Americans who want to do those jobs. As it stands now, that doesn't exist. Yeah, so I would say Apple is somewhat unique among tech companies because...

It has also been thinking about tariffs and the effect of Trump's policies on their business for longer than many of their competitors. I mean, if you'll remember during the first Trump term, there was some talk about tariffs on Chinese goods. Apple successfully negotiated its way out of those, sort of got an exemption. And in part, they did that by cozying up to the Trump administration, by promising to build and assemble some of their products in the United States.

There was that famous tour that Tim Cook gave Donald Trump of this facility in Austin, Texas, where he said they were going to start making a bunch of stuff. So they sort of managed to get the tariffs off their back during the first term. But in the second term, it's not at all clear that they are going to have the same kind of success. So Casey, how is Apple dealing with the new chaos meta?

Well, they are trying to get as many devices as they can out of China and into places from which it's going to be much less expensive to export them to the United States. So there was a great story this week in the Times of India that, according to senior Indian officials, Apple transported five cargo planes full of iPhones and other products from India to the United States,

which sort of calls to mind those scenes at the end of the Vietnam War when you see the last helicopter leaving Saigon, except it's full of iPhones. Actually, Katie Notopoulos had a great joke on Threads today. She said that this whole thing is like the movie Dunkirk, but for iPhones. Reuters reported that Apple transported 600 tons of iPhones, Kevin, which would have been about 1.5 million devices. And look, you know, those iPhones will pad Apple's profits a little bit more.

Pretty soon, there are going to be no more planes out of any more countries to escape these tariffs. It is just going to be a really expensive-ass iPhone. Do you think the iPhone 16 Pro Maxes get to sit in first class on the plane? Like, put them up front in the lie-flat seats? Yeah, they should definitely get the upgrade with what they're paying for those things.

Yeah. So, okay, let's move to our next case study of a company trying to deal with the uncertainty and chaos of the Trump administration, Nintendo. Casey, what is going on with Nintendo? Well, so, Kevin, as a hardcore gamer, obviously you know that the Switch 2 is coming out this year. This is the sequel to Nintendo's best-selling console of all time. And it was supposed to become available for pre-orders on this very Wednesday.

But then tariff chaos started happening, and Nintendo said, we are going to pause pre-orders because we don't know what it's actually going to cost to sell a Switch 2 in America anymore. Yeah, and now that Trump has paused these tariffs on most countries other than China, have they said that actually they're going to start shipping the Switch 2 on time after all? Well, what they've said is that they're not planning to change the launch date, which is June 5th.

And it does seem like, because they are a Japanese company and make the Switch 2 in Vietnam, they are going to be able to avoid the really tough tariffs that Apple is facing, right? Before Trump initiated the pause, there was going to be a 46% tariff on the Switch 2. Now it's back down to that 10%. But look, the Switch 2 is already planning to go on sale for $450, which is $150 more than the original Switch sold for at launch.

So I think there's a very real question here of whether the price of this console goes up over time, which would be a reversal of the usual trend, which is a console goes on sale for a high price and that price comes down over time. So once again, Kevin, there's just real chaos here as we await probably the most hotly anticipated piece of hardware to launch, I would say, in the United States this year.
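A quick aside on the arithmetic here: tariffs are charged on a unit's declared import value, not its shelf price, so how much of a 46% or 10% tariff reaches the consumer depends on pass-through. Here is a rough sketch in Python; the per-unit import value is entirely hypothetical, since Nintendo's actual costs aren't public, and full pass-through is assumed:

```python
# Illustrative only: tariffs apply to the declared import (wholesale) value,
# not the retail price. The $250 import value is a made-up placeholder; $450
# is the announced U.S. launch price. Assumes the full duty hits the shelf.
import_value = 250.0
retail_price = 450.0
for tariff in (0.10, 0.46):
    duty = import_value * tariff
    print(f"{tariff:.0%} tariff -> +${duty:.0f} -> ${retail_price + duty:.0f}")
# 10% tariff -> +$25 -> $475
# 46% tariff -> +$115 -> $565
```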

Yeah. Now, are they bringing in planes full of Switch 2s from Vietnam or wherever they're manufacturing them? They were actually able to put them in one of those pipes and you just sort of warp down. It's kind of a really cool little thing they have there. I got it. Okay. Next company on our list, TikTok. Casey, this is a company we have talked about a lot on this show. They were going to be banned. The deadline for banning them got pushed out by another 75 days last week.

Casey, what is the latest on TikTok, and how is it coping with this escalating trade war between China and the U.S.? Well, Kevin, what is going on with TikTok is, of course, the question asked most in the history of Hard Fork. And what was going on with it, until tariff chaos, was that it looked like we might have a deal. There was some great reporting in The Times this week that ByteDance, with the support of the Chinese government,

had reached the rough outlines of an agreement in which TikTok would create a new American entity. American investors would own the majority of it. Chinese owners would have about a 20% stake. And the American company would essentially rent the algorithm from ByteDance. And so by Thursday of last week, there was this draft executive order that outlined the deal.

And then Trump did the thing with the tariffs. And all of a sudden, ByteDance has to call up the White House and say, that deal that you just helped us negotiate, it's off the table because the Chinese government isn't going to support the deal anymore. Right. So this was a pretty dramatic reversal, and it does seem like they got very close to a deal before these tariffs. What is happening now that these tariffs are on? Does TikTok have any options left?

Well, Kevin, along with a 90-day tariff pause, we also now have a 75-day extension that comes after the original 75-day extension that Trump gave in order to force ByteDance to divest TikTok. This man loves extensions, let's just say it. This man loves to come up right against a deadline and say, you know what? You got a little more time.

Yeah, well, look, I don't know what's going to happen over these next 75 days. I imagine that if the tariffs against China stand at 145%, there is no way the Chinese government is going to support the sale of TikTok. And I just want to say... how self-defeating this is, because it was barely more than a week ago that Trump was telling reporters that Beijing, if they would simply go along with his plan to force the divestiture of TikTok...

then he would go easy on them on tariffs, right? Like this was his big bargaining chip of if you don't want high tariffs, you have to let the Americans have TikTok. And to my surprise, it seemed like the Chinese government was actually going to go along with that.

And then before they could even get that deal out, Trump seemingly out of nowhere announces a brand new set of tariffs that completely scuttles the deal. So it is as if the president was essentially negotiating against himself and lost the deal that he had won.

Yeah, it does seem strange that he would not wait until after the TikTok deal was finalized and approved by all the relevant officials to then issue these tariffs if he was actually interested in getting a deal done. Yeah, I think that's right.

Okay, TikTok is still in this frustrating state of superposition where they are both dead and alive at the same time. Do we think that this resolves before the end of the next 75-day extension? Or do we think we will need yet another extension to figure out what we're doing with TikTok? My assumption is that on the day that Donald Trump leaves office, we will still be in the middle of one of these extensions. It'll be sort of like the 15th extension or the 23rd extension.

But no, until this tariff situation gets resolved, I do not expect TikTok's fate to be resolved. It is just going to continue to exist in its weird limbo. All right. So that is TikTok. Our last company on this list of case studies is Meta. Casey, how is Meta dealing with this new uncertain reality? Well, I would say that things turned out a little bit better for them this week than it looked like they were going to, because tariffs were going to be a huge problem for them, too.

They are a digital advertising business, and a huge number of their advertisers are small and medium-sized businesses outside the United States that buy ads to export goods from foreign countries into the United States. Mike Isaac at the Times had a great piece on this this week.

There's one analyst who estimates that about $10 billion of Meta's revenue from ads originates from outside the United States. So in a world where everyone was facing these massive tariffs, we were just expecting Meta to get hit really hard on the ads front. Well, now that has mostly gone away, at least for the next 90 days. So it seems like Meta is going to get some breathing room.

But there is this one other outstanding question, Kevin, which is that next week, Meta's antitrust case is going to trial, right? So in 2020, during the first Trump administration, the Federal Trade Commission files an antitrust lawsuit and tries to break off

Instagram and WhatsApp from Meta. It has been in the planning stages ever since. And on Monday, the case is set to go to trial. So why does all of this have anything to do with Trump? Well, Mark Zuckerberg has been giving Trump the full-court press, going so far as to buy a $23 million house in Washington, D.C. recently just to get closer to and spend more time with the president.

There's been some reporting that Zuckerberg was in the White House trying to negotiate a settlement with Trump just within the past few days. So there are a lot of questions right now about whether Zuckerberg will be able to use this relationship that he's apparently been building with Trump to get rid of this case, which is in some ways an existential threat to his business.

Yeah, and we should also just say, like, this shouldn't be possible, right? The FTC is supposed to be an independent agency that has its own enforcement agenda and brings its own cases that are independent from the president. But, of course... Nothing is truly independent from the president in Trump's Washington. He recently announced that he was getting rid of the two Democratic commissioners on the Federal Trade Commission.

And it is historically quite unusual for a president to intervene in FTC commissioner staffing at that level. But now it is sort of going to be staffed with people who are friendly to the Trump administration. And so presumably, if he were to go to them and say, hey, let's back off this Meta case, I don't actually think we need to proceed with this, they might listen. And we should say that another way Meta has tried to make this happen is that, after the events of January 6th,

Meta suspended Trump from its platform for three years, and Trump sued them over that. And so after he won the presidency, Zuckerberg came along and said, hey, why don't we settle this too? And paid Trump $25 million. Right. And I have to say, Meta was completely within its rights to suspend an account. They're allowed to suspend whatever account they want. It's a private company with a private platform. But still,

just as a little gesture of goodwill. Hey, Trump, here's $25 million. So if this actually happens and this lawsuit just goes away, it will just frankly be an example of open corruption. Okay, so that is our four-company case study of how tech companies are trying to do business and survive in this new uncertain environment. I have to ask, after going through all these examples...

Which of these companies would you be if you could be one? Which do you think is in the best position in this new chaotic environment? Hmm. Well, you know, until maybe Wednesday, I think I would have said Apple, right? Apple makes the iPhone. The iPhone is the most lucrative product in the history of the technology industry. And even despite...

some of the tariffs that we were seeing, it seemed like they were still going to be in a good position to navigate them. I was seeing analysis that they were only going to lose maybe seven points of profitability from all this. But the world looks really different with a 145% tariff, and in a world where Trump just keeps escalating this fight more and more. And so I actually do think that the picture for Apple just looks really strange. So look,

I feel a little crazy saying this, but maybe I actually would just rather be Meta. Their hardware business is still a relatively small part of what they do. Mostly what they do is a digital services business. And it seems like Zuckerberg has been able to make at least some inroads with the Trump administration. Maybe they're about to get rid of this lawsuit against them. So, God, I don't know. Maybe I actually want to be Meta. How about you?

Yeah. I mean, as venal and corrupt as it would be for these naked attempts at flattery and persuasion to actually work and pay off, I would not underestimate how well this stuff works with Donald Trump. And I think that Mark Zuckerberg's motive here is to win at all costs. And if he needs to buy a $23 million mansion, or spend time in the White House, or even, you know, make some policy adjustments to appease the Trump administration and get what he wants,

I think he's demonstrated very clearly that he's willing to do that. My last question on this, Casey, is about this idea of the tech capitulation to Trump. You know, in the past few months, we've observed, we've talked about the fact that a lot of these tech companies have been really falling all over themselves to appease the Trump administration. Many of them gave to the inaugural fund. Many of them showed up at the inauguration. Their CEOs were seated just behind the president's own family.

The amount of flattery and ass-kissing going on here for months now has been, I would say, notable and historic. Do you think that any of that has worked to the degree that these executives thought it would? Did the tech leaders get what they wanted out of Donald Trump? I think that until the tariffs, the answer was basically yes, and the tariffs are what have changed that equation, right?

If you look at how J.D. Vance was talking when he went to Europe, he was echoing a lot of tech company talking points. He and Trump have criticized European fines against tech companies, saying, like, we need to protect and defend our American

tech companies against these European fines, which was something that the Biden administration never, ever did. They've talked about getting rid of AI guardrails and just letting these companies do whatever they want with AI, which is like music to Mark Zuckerberg's ears. But these companies rely on stable, normal governance to be able to conduct their business around the world. They are as

plugged into the interconnected global economy as anyone else, arguably more than many companies. And Trump just came along and blew that up. And I think that it is probably dawning on them that they are probably just going to be living in chaos for the foreseeable future. And it is just going to make their lives much, much more difficult.

Yeah, I think that's right. And I think that a lot of these executives have underappreciated how important stability and predictability are in their business models. I mean, these were companies, many of them that had issues with the Biden administration. The Biden administration had issues with them. But at least with the Biden administration, these companies knew where they stood.

There was not this sort of day-to-day whiplash of the stock price moving up 10%, down 10%, tariffs going up to 145% and then down to 10%. It just was not the kind of frenetic environment that we're seeing today. And so I wonder if any of them are starting to appreciate how good they had it during the Biden years, where, for as much as the Biden administration may have gone after them for various things, including antitrust violations,

at least they could wake up every day and understand what the world was going to look like for the next 24 hours. Yeah, I think that's true. I think that most of them would probably still be loath to admit it. But let's give it another few weeks, Kevin, and another few tariffs, and then let's check back in with them. Sounds good. Well, that's enough about tariffs, Casey. When we come back, we're going to talk about a terrifying new report about what AI could look like in 2027.

I'm Dane Brugler. I cover the NFL draft for The Athletic, spending the whole year working on a draft guide. I'm looking at thousands of players, putting together hundreds of full scouting reports: all the nitty-gritty details, the testing data,

the stats, but extensive background research as well. Every journey is a little bit different. I'm on the phone with a lot of these guys. Hey, when did you start playing football? What other sports did you play? Tell me about your family. You know, learning more about these guys as people.

It picked up the name The Beast because of the crazy amount of information that's included. I have no idea how to quantify the hours I've spent putting it together. I've been covering this year's draft since last year's draft. There is a lot in The Beast that you simply won't find anywhere else. This is the kind of in-depth, unique journalism you get from The Athletic and The New York Times. You can subscribe at nytimes.com slash subscribe. Well, Casey, today we're going to talk about a forecast.

And that's separate from a Forkcast, which is something different. Yeah, that's what we call our end-of-the-year predictions episode, isn't it? I think so. But today we're talking about something different, which is this new report called AI 2027. This is a report that I wrote about last week and that has gotten a lot of attention in AI circles and policy circles this week. It was produced by the AI Futures Project, a Berkeley-based nonprofit led by Daniel Kokotajlo,

who listeners of this show may remember as a former OpenAI employee who left the company last year and became something of a whistleblower, warning about its reckless culture, as he called it, and who is now spending his time trying to predict the future of AI. Yeah, and of course, lots of people are trying to predict the future of AI. But what gives Daniel a lot of credibility here is that in 2021, he tried to predict what things would look like about now. And he just got a lot of things right.

And so when Daniel said, hey, I'm putting together a new report on what I think AI is going to look like in 2027, a lot of close AI observers said, oh, this is really something to read. Yeah, and he didn't just do this alone. He also partnered with a guy named Eli Lifland, who is an AI researcher and a very accomplished forecaster. He's won some forecasting competitions in the past.

And the two of them, along with the rest of their group, and Scott Alexander, who writes the very popular Astral Codex Ten blog, put together this very detailed, what they call a scenario forecast. Essentially, it's a big report, a website. It's got some, you know, sort of research backing it up, and it basically represents their best attempt to synthesize everything they think is likely to happen in AI over the next few years into a readable narrative.

Yeah, and if that sounds a little dull to you, I'm telling you, you should just go check this thing out. It's at ai-2027.com, and it's just super readable, and it blows through stuff that feels very familiar right now, like just sort of basic extrapolating from where we are today, into

getting to six months, a year from now, when the world starts to look very, very different, and there is a lot of research that they have to support why they think that is plausible. Yeah, and I can imagine people reading this report or listening to us talking about it and saying, well, that sounds like science fiction to me. And we should be clear, it is science fiction. This is a fictionalized narrative that they have put together.

But I would say it is also grounded in a lot of empirical predictions that can be tested and verified. It's also true that some science fiction ends up becoming reality, right? If you look at movies about AI from past decades, a lot of the things in those movies did end up actually being built. So I think this report, while it may not be 100% accurate, at least represents a very rigorous and methodical attempt

to sketch out what the future of AI might look like. And here's my bet: if you put this conversation into a time capsule and revisited it in two years, in 2027, my guess is we're going to find that a good number of things in that scenario actually did come true. I hope we're still doing a podcast in two years. That'd be good. That'd be great. Yeah. So my forecast is that this is going to be a good conversation. Let's bring in Daniel Kokotajlo.

Daniel Kokotajlo, welcome back to Hard Fork. Thank you. Happy to be here. So you have just led this group that put together this giant scenario forecast, AI 2027. What was your goal? So our goal was to predict the future, using the medium of a concrete scenario. There is a small but exciting literature of attempts to predict the future of AI that use other methods, which is also very important. Things like

defining a capabilities milestone. Here's my definition of AGI. Here's my forecast for how long we'll have until AGI based on these reasons and stuff. And that's great. And we've done that stuff before. We did a lot of that in the run-up to this scenario. But we thought it would be helpful to have an actual concrete story that you can read. Part of the reason why we think this is important is that it forces you to think about everything and integrate it all into a coherent picture.

Well, I want to ask you a bit more about that. So, I mean, the first thing I want to say about AI 2027 is that it's an extremely entertaining read. It is as entertaining as most of the sci-fi that I have read. By the end of it, you get into scenarios where humanity's survival is threatened. And so whether you think it's true or false, it is, like, really engaging to read.

But my understanding of your aim here is that there is something practical about what you were trying to do, right? Can you tell us about sort of the practical idea of going through this exercise? I mean, important background context: the CEOs of OpenAI, Anthropic, and Google DeepMind have all publicly stated that they're building AGI, and even that they're building superintelligence, and that they think that they can succeed by the end of this decade.

And that's a really big deal. And everyone needs to be paying attention to that. Like, I think a lot of people dismiss that as hype. And it's a reasonable reaction to say like, oh, they're just hyping their product.

But it's not just the CEOs saying this. It's also the actual researchers at the companies. And it's not just people at the companies. It's also various independent people in academia and so forth. And then also, like, you don't just have to trust people's word for it if you actually look at the evidence. It really does seem strikingly plausible that this could happen by the end of this decade.

And if it does happen, things are going to go crazy in some way or other. It's hard to predict exactly how, but obviously, if we do get superintelligent AGI, what happens next is going to look like sci-fi. It will be straight out of a sci-fi book, except that it will be actually happening. You mentioned that if what the CEOs of tech companies say comes true, we will be living in a sci-fi world. And I think for a lot of people, they're content to

sort of stop thinking there, right? They might be willing to admit, okay, yeah, if you invent superintelligence, things will probably be crazy, but, like, I'll cross that bridge when we come to it. You're sort of taking a different approach and saying, like, no, you're going to want to start thinking right now about what it would be like if some of these claims start to come true. So maybe we could get into what some of those claims are.

Sketch out for us what you think is very likely to happen just within the next couple of years. Well, I wouldn't say very likely. I should express my uncertainty, right? Past discussion often focuses on a single milestone, like artificial general intelligence or superintelligence. We broke it down into a couple different milestones, which we call superhuman coders, superhuman AI researchers, superintelligent AI researchers, and then broad superintelligence.

So we sort of, like, make our predictions for each of these stages. I'm only, like, 50% confident that the first milestone, the superhuman coder, will happen by the end of 2027. So I have a 50% chance that 2027 will end and there still won't be any autonomous superhuman coding agents. But it's a coin flip. We might also be living in a world where, yes, you do have... Yeah. Exactly. So, a 50% chance we do have fully autonomous

artificial intelligences that can basically do the job of the cracked engineers by 2027. And then you ask, okay, well, what's the next milestone after that? After that comes automating the full AI research process instead of just the coding.

Superhuman AI researchers, more than just coding. And how long does it take to get to that? Well, we have our guesses, and in our scenario, it happens like six months later, you know. So in our story, you get the superhuman coders, use them to go even faster to get to the superhuman AI researchers that are able to do the whole loop.

That really kicks things off, and now you're going much faster. How much faster? We say 25 times faster for the algorithmic progress, at least. Of course, your compute scale-up is not going any faster at all because you still have the same amount of compute, but you're able to do the algorithmic progress 20 times faster, 25 times faster.
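To make Daniel's point concrete: if only the algorithmic share of progress speeds up while compute scale-up stays fixed, the overall acceleration is capped, Amdahl's-law style. A back-of-the-envelope sketch in Python, with an invented 50/50 split between algorithms and compute; the report's own modeling is more involved:

```python
# Back-of-the-envelope: total speedup when only the algorithmic share of
# progress accelerates and the compute share stays at 1x (Amdahl's law).
# The 50/50 split is an illustrative assumption, not a figure from AI 2027.

def overall_speedup(algo_share: float, algo_speedup: float) -> float:
    compute_share = 1.0 - algo_share
    return 1.0 / (compute_share + algo_share / algo_speedup)

# If half of progress came from algorithms and that half now runs 25x faster,
# the overall rate roughly doubles rather than going 25x:
print(round(overall_speedup(algo_share=0.5, algo_speedup=25.0), 2))  # 1.92
```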

Then you start getting to the superhuman regime. So you start getting systems that are just, like, qualitatively superior to the best humans at stuff. And they're also probably discovering new paradigms. So we depict them going through multiple paradigm shifts over the course of the second half of 2027, becoming vastly superior to humans by the end. Yeah. Let me just sort of pause and maybe underline a couple of

things there. I think most people might not understand why the big AI labs are obsessed with automating coding, right? Most people are not software engineers, so they kind of don't care how much of it is automated. But by the time you get to

software that is mostly writing itself, it unlocks this other world of possibilities. And you just sort of sketch out a vision where, once we get to a point where the AI coding systems are better than almost every human engineer, or maybe every human engineer, then this other thing becomes possible, which is now you can just set this thing to work trying to figure out how to build AI itself, right? Is that what I'm hearing you say?

Basically, I'd break it down into two stages. So I think the coding is separate from the complete automation, as I previously mentioned. I think that I expect to see systems that are... able to do all the coding extremely well, but might lack research taste, for example. They might lack good judgment about what types of experiments to run.

And so that's why they can't completely automate the research process. And then you have to make a new system or continually train the old system so that it gets that taste, it gets that judgment. Similarly, they might lack coordination ability. They might be not so good at working together in large organizations of thousands of copies.

at least initially, but then you fix that and you come up with new methods and you do additional training environments and get them good at that sort of thing. And that's what we depict happening over the first half of 2027. And we depict it happening in only half a year, because it goes faster, because they've got all the coding down pat. And so even though humans are still directing the whole process,

they just give orders to the coding agents and they quickly make everything actually work. Right. And then halfway through the year, they've succeeded in making new training runs that train the skills that the AIs were missing. So now they're not just coding agents. They are able to do...

the research taste as well. They're able to come up with the new ideas. They're able to come up with hypotheses and test them. And they're able to work together in big sort of like hive mind clusters of thousands and thousands of them. And that's when things really kick off. That's when it really starts to accelerate.

In your scenario, you have this sort of choose-your-own-adventure ending where after this thing you call the intelligence explosion where the... superhuman AI coders get into AI R&D and they start automating the process of building better and better AIs.

You sort of have two buttons that you can click and one of them sort of unspools the good place ending where we decide to slow down AI development and really get these things under control and solve alignment. And then the red button, you push that. And it goes into this very dark dystopian scenario where we lose control of AI. They start deceiving and scheming against us. And ultimately, maybe we all die.

Why did you decide to give people the option of choosing one of those two endings rather than just sketching what you believe to be the most probable outcome? So we did start by sketching what we believe to be the most probable outcome, and it's the race ending, the one that ends with the misaligned AIs in control of everything. So we did that first.

And then we were like, well, this is kind of depressing and sad. And there's a whole bunch of stuff that we didn't get to talk about because of that. And so we wanted to then have a different ending that ended differently. In fact, we wanted to have, like, a whole spread of different possible outcomes, but we were limited by time and labor, and we were only able to pull together one other outcome, which is the one that we displayed in the slowdown ending. So in the slowdown ending,

they solve the alignment issues and they actually get AIs that do, you know, what it says on the tin. They're not faking it. They just actually have the goals and values that were put into them, or that the company was trying to train into them. You know, it takes them a couple months to sort that out. That's why it's a slowdown. They had to pivot a lot of their compute and energy towards figuring that stuff out. But they succeed. And so then in that ending...

We still have this crazy arms race with China, and we still have this crazy geopolitical crisis. And in fact, it still ends in a similar sort of way with this massive arms buildup on both sides, this massive integration into the economy, and then ultimately a peace treaty. I'm curious, Daniel, if the events of the last week in Washington, the tariffs, this looming trade war with China have affected your forecast at all.

I mean, we've been iteratively improving it, but the core structure of it was basically done a few months ago. So this is all new to us and wasn't really part of the forecast. How would it change things? Well, if the trade war continues... and causes a recession and stuff like that. It might just generally slow the pace of AI progress, but not by much, I think. Like, say it makes compute 30% more expensive so that the companies are able to buy 30% less of it.

Maybe that would translate to like a 15% reduction in overall research velocity over the next few years, which would mean that the milestones that we talk about happen a few months later than they otherwise would. So the story would still be basically the same.
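For what it's worth, Daniel's back-of-the-envelope here can be restated in a few lines of Python. The 30% and 15% figures are his illustrative assumptions from the conversation; the two-year milestone horizon below is an invented placeholder, not a figure from the report:

```python
# Toy restatement of the back-of-the-envelope above; all numbers are
# illustrative assumptions, not forecasts.
compute_cut = 0.30                # companies can buy ~30% less compute
velocity_hit = compute_cut / 2    # he roughly halves it: ~15% slower research
years_to_milestone = 2.0          # hypothetical time to a milestone at old pace
delay = years_to_milestone / (1 - velocity_hit) - years_to_milestone
print(f"milestone arrives ~{delay * 12:.0f} months later")  # ~4 months
```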

One of the things I think is most interesting about your project is the bets and bounties section, where you are going to pay people for finding errors in your work, for convincing you to change your mind on key points, or for drafting some alternate scenarios. So talk to me a little bit about how that became part of this project. So, like, you know, I come from the sort of rationalist community background, which is big into making predictions and making bets, putting your money where your mouth is. So I have a sort of aesthetic

interest in doing that sort of thing. But then also, specifically, one of the goals of this project is to get people to think more about this stuff and to, you know, do more scenario forecasting along the lines of what we've done. We're really hoping that people will counter this with their own reasonably detailed, you know, alternative pathways that represent their vision of what's coming. And so we're going to give out a few thousand dollars of prizes.

And then as for the bounties thing, already we've gotten dozens of people being like, oh, you say this, but isn't this a typo? Or this feels wrong. And so I have a backlog of things to process, but I'm going to get through it. I'm going to like... you know, pay out the little payments and fix all the little bugs and stuff like that.

And I'm just quite heartwarmed to see that level of engagement. And have you taken any bets on different scenarios so far? I think so far I've done one or two, but mostly there's just a backlog I need to work through. Got it. Now, Daniel, you said you've been getting some good responses from people at the AI companies to this scenario forecast.

I did a bunch of calling around when I was writing about this. And after we spoke, I talked to a bunch of different people, both in the AI research community and outside of it. And I would say the most frequent reaction I got was just kind of disbelief. One person I talked to, a prominent AI researcher, said he thought it was an April Fool's joke when I first showed him this scenario, because it just sounded so

outlandish. You know, you've got Chinese espionage and the models going rogue and the superhuman coders, and it all just seemed fantastical. And it was almost like they didn't even think it was worth engaging with because it was so far out. I'm curious if you've gotten much of that kind of reaction and what your response is. A couple things. So first of all, I would say: go write your own damn scenario, then. You either will write a scenario that doesn't seem outlandish,

which I will completely tear apart as unrealistic and as just assuming, basically, that AI progress hits a wall. Or you'll write a scenario that does feel very outlandish, but perhaps in different ways than ours does. Again, like, are they actually going to get to AGI and superintelligence by the end of the decade? If so, you can't possibly write that in a way that's not outlandish. It's just a question of, like, which outlandish thing are you going to write?

And if you think maybe this is not going to happen and it's going to hit a wall, yeah, that's possible too. I think that's reasonable. I don't think it's the most likely outcome. Like, I do actually think that probably by the end of this decade, we're going to have superintelligence. Yeah. And then say more about that, because, you know, I assume that a lot of our listeners either truly think that it will hit a wall,

Or they're just sort of counting on it hitting a wall, so as not to have to reckon with any of the scenarios that you describe. So, like, what is your message to the person that's just like, eh, it'll probably hit a wall? I mean, I don't know, read the literature? Like, there's... These people are not going to read the literature. They listen to podcasts specifically so they don't have to read the literature. Yeah, fair. Well,

I could point to specific parts of the literature, like benchmarks, for example, and the trends on them. So I would say the benchmarks used to be terrible, but they're actually becoming a lot better. METR in particular has these agentic coding benchmarks where they actually give AI systems access to some GPUs and say,

Have fun. You have, like, eight hours to make progress on this research problem. Good luck. And then they measure how good they are compared to human researchers given the same setup. And, you know, the line goes up on the graph. It seems like in a year or two, they'll have AIs that are able to just autonomously do eight-hour-long ML research tasks,

you know, on these sorts of things. And that's not AGI, that's not superintelligence, but that is maybe the first milestone that I was talking about, superhuman coder, right? So I'd point to those sorts of trends. And then separately, I would also just do the appeal to authority. Like, if you're not going to read the literature, if you're not going to sort of form your own opinion about this and you're still just deferring to what other people think,

Well, then I will say, yeah, there's a bunch of naysayers out there who are saying this is all never going to happen, it's just fantasy. But also there's a bunch of extremely credible people with amazing track records, both inside the companies and outside the companies, who are, in fact, taking this extremely seriously. Yeah. I also want to... Including our scenario. Yeah.

You know, Yoshua Bengio, for example, read an early draft of our thing and liked it and gave us some feedback on it. And then we put a quote from him at the top saying, everyone should read this. It's plausible. Right. So he's a pioneering AI researcher. Yeah.

Another genre of criticism I've heard of this forecast is from people who are just questioning the idea that if you get AIs that are superhuman at coding, they will kind of be able to bootstrap their way to general intelligence. And I just want to read you a quote from an email that I got from David Autor, who is a very well-known economist at MIT. And I had asked him to look at the scenario and sort of react to it, with a particular eye on, like,

what might this be missing as far as how it sort of assumes this easy and fast jump from superhuman coding to something like AGI? And I'll just read you what he said. He said: LLMs and their ilk are superpowered incarnations of one incredibly important and powerful part of our cognition.

The reason I say we're not on a glide path to AGI is that simply taking this capability to 11 does not substitute for the parts that are still missing. I think that humanity will get to AGI eventually. I'm not a dualist. I just don't believe that swimming faster and faster allows you to fly. What is your reaction to that? I agree. We depict this

in the course of the story. So if you read AI 2027, they have something that's like LLMs, but with a lot more reinforcement learning to do long horizon tasks. And that is what counts as the first superhuman coder.

So it's already somewhat different from the systems of today, but it's still broadly similar. It's still sort of maybe the same fundamental architecture, just a lot more training, a lot more scaling up, and in particular, a lot more training specifically on long horizon agentic coding tasks.

But that's not itself AGI. I agree. That's just the superhuman coder that you get early on. And then you have to go through several more paradigm shifts to get to actual superintelligence. And we depict that happening over the course of 2027. So a key thing that I think that everyone needs to be thinking about is...

How much faster does the research go when you've reached the first milestone? And how much faster does the research go when you reach the second milestone and so forth? And we are, of course, uncertain about this, like we are about many things. We say in the scenario that...

we could easily imagine it being five times slower than we depict, and taking sort of like five years instead of one year. But also, we could imagine it being five times faster than we depict, and taking like two months, you know? We want to do a lot more research on that, obviously. If you want to know where our numbers are coming from, go to the website. There is a tab that you can click on.

It has a bunch of sort of like back-of-the-envelope calculations and little mini essays where we generated the quantitative estimates that are the skeleton of the story. One other piece of criticism I've seen of this project that I wanted to ask you about was from a researcher at Anthropic named Saffron Huang, who argued on X that she thought that your approach in AI 2027 was highly counterproductive, basically that you were

in danger of creating a self-fulfilling prophecy: that by making these sort of scary outcomes very legible, and by sort of, you know, burying some assumptions, you were essentially making the bad scenario that you're worried about more likely to actually happen. What do you make of that? I'm quite worried about that as well. And this is something we've been fretting about since day one of the project. So let me just say a little bit more about that.

First of all, there is a long history of this sort of thing seeming to happen in the field of artificial general intelligence research. Most notably with Eliezer Yudkowsky, who is the sort of, like, I don't know, ur-father of worrying about AGI, at least in this generation. Alan Turing also worried about it. Sam Altman specifically tweeted, you remember this tweet? Yeah, Sam specifically said, like, hats off to Eliezer Yudkowsky

for, like, raising awareness about AGI; it's happening much faster now because of his doomsaying, because it's caused a bunch of people to pay more attention to the possibility and to, you know, start investing in these companies and so forth. So it was sort of, like, a, I don't know, twisting of the knife at him, because he obviously doesn't want this to happen faster. He thinks we need more time to prepare and make it safe and so forth.

It does seem like there's been this effect where people talking about how powerful and scary AGI could be has maybe caused it to come a little bit faster, and caused people to, like, wake up and race harder towards it. And similarly, I'm worried about causing something like that with AI 2027.

One of the subplots in AI 2027 is this whole concentration of power issue of who gets to control the army of superintelligences, right? And in the race ending, it's sort of a moot question because the army of superintelligences is... just pretending to be controlled and so is not actually listening to anyone when it counts. But in the slowdown ending, they do actually align the AIs. And so they are actually going to do what they're told.

And then who gets to say that, right? And the answer in our slowdown ending is the oversight committee, which is this like ad hoc group of people that is some CEOs and the president who get together and like share power over the army of super intelligences. What I would like to see is something more democratic than that, something where the power is more distributed.

I'm also afraid that it could be less democratic than that. Like, at least we get an oligarchy with this committee, but like, it could very easily end up a dictatorship where one person has absolute control over the army of superintelligences. This is yet another example of how I'm trying to not have the self-fulfilling prophecy happen. I don't want people to read this and be like, hmm, I'm a CEO. I can make a lot of money by building the misaligned AI.

But all that being said... Yeah, so any of our evil villain listeners out there steepling your fingers in your lair under a mountain, knock it off. Yeah, so all that being said... We are taking a gamble that, you know, sunlight is the best disinfectant. The best way forward is to just generally tell the world about what we think is coming and hope that...

Even though many people will react to that in exactly the wrong ways, enough people will react to that in the right ways that overall it will be good. Because I am tired of the alternative of, like, hush hush, keep everything secret, do backroom negotiations, and hope that we get the right people in the right rooms at the right time and that they make the right decisions. I think that that is kind of doomed. So I'm sort of placing my faith in humanity and telling it

as I see it, and hoping that, insofar as I'm correct, people will wake up in time and, you know, overall the outcome will be better. Yeah. All right. Thank you, Daniel. Thanks, Daniel. Thank you so much. When we come back: what happens when an AI company decides to fake it till they make it. We'll talk about the cheating scandal that is rocking the world of AI benchmarks.

Well, Casey, there's one other big AI story we want to talk about this week, and that is the drama surrounding Llama. That's right, Kevin. Meta has a new large language model. It was hotly anticipated. But I think it's fair to say it kind of stumbled out of the gate. Llama Llama Cred Drama. How many times are you going to do the Llama Drama pun? Well, there's a very popular children's book called Llama Llama Red Pajama. Are you aware of this? I am.

So let's get into it. There have been a lot of things going on around this new language model, Llama 4, which Meta released last weekend. Casey, you've been writing about this in your newsletter this week. Catch me up. What is going on with Llama 4? Yeah, so look, Meta has invested billions and billions of dollars in AI, and they're taking a very different approach from the AI labs that we most often talk about on this show.

Companies like OpenAI, Anthropic, and Google keep their models closed. You can't download them, fine-tune them, and re-release them under a sort of very permissive license. But with Meta's, you can. And when Llama 3 came out last year, developers said, oh, this thing is actually pretty good. Like, it's not as good as the state of the art, which is often true of the open models, but it's getting up there. Right.

And so they spent all this money to develop Llama 4. People have been talking for months about how this was going to sort of blow all the other open weights models out of the water. And then they release it. And what happens? Well... Two things happen, Kevin. The first is that Meta trumpets this model in the way that companies usually do trumpet their most recent models as being the most powerful ever or the most efficient.

They show off a bunch of benchmarks. They say this thing is highly capable and it's the bee's knees. They didn't actually say it was the bee's knees. I'm not sure anyone has said that in the past 70 years, but they said things like that. And one of the benchmarks that really got people's attention was LM Arena. You know LM Arena? I know of it, but I haven't spent much time on it. What is it?

So it's this really interesting project. It is a very small nonprofit that includes some researchers from UC Berkeley. And what they do is they get people to volunteer to help, and they'll have people enter a query, and then they'll show them the response from two different chatbots that are not labeled. And after they get the answer, the user will say, oh, I liked this one better.

And they collect those votes over time. And the more that people vote for one chatbot over another, the higher it rises on LM Arena. I see. So it's sort of like a crowdsourced leaderboard for which of these models people prefer. Exactly. And Kevin, you know as well as anyone else that whenever a new model comes out, the question of how good is it turns out to be weirdly hard to answer, right? Maybe it's really good for what you need it to do.

Maybe it's really bad, or maybe it's about as good as something else, but you just happen to like it better because it has a style that matches with what you're looking for. So in such a world, companies are desperate to be seen as good, but they don't have an easy way of communicating that. And that's when LM Arena enters the picture.
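A quick technical aside: head-to-head votes like these are typically turned into ratings with an Elo-style system, the same family of math used to rank chess players. Here is a minimal sketch of how such a leaderboard can be computed; the K-factor, starting rating, and function names are illustrative assumptions, not LM Arena's actual implementation.

```python
# A minimal, illustrative Elo-style leaderboard built from pairwise votes.
# Not LM Arena's actual code; K and START are arbitrary assumed values.

K = 32          # how far a single vote moves a rating (assumed)
START = 1000.0  # every model's initial rating (assumed)

ratings: dict[str, float] = {}

def expected_win(r_a: float, r_b: float) -> float:
    """Predicted probability that the model rated r_a is preferred."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(winner: str, loser: str) -> None:
    """Update both ratings after one blind human preference vote."""
    r_w = ratings.setdefault(winner, START)
    r_l = ratings.setdefault(loser, START)
    surprise = 1.0 - expected_win(r_w, r_l)  # upset wins move ratings more
    ratings[winner] = r_w + K * surprise
    ratings[loser] = r_l - K * surprise

# Example: after a few votes, the leaderboard is just ratings sorted high to low.
record_vote("model-a", "model-b")
record_vote("model-a", "model-c")
record_vote("model-c", "model-b")
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

The key property is that beating a highly rated opponent moves you up far more than beating a weak one, which is why a single strong (or well-targeted) entry can climb the board quickly.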

Because if you can get high enough on that leaderboard, you can point to it and say, aha, look at how we're doing. Right. The people have voted. That's right. The people have spoken and look how well we're doing. So do you know how well Llama 4 does on LM Arena? No. Llama 4 comes in at number two, just under Gemini 2.5 Pro Experimental, which is the latest model from Google, which has been through a lot of testing and which...

Basically, there's universal acclaim for this model. People think this is a truly great model, not just at this little chatbot contest, but across a bunch of other things, including coding and... you know, a lot of other things. So Llama 4 sort of immediately zooming up to number two on LM Arena would seem to indicate that... Meta has really cooked here. They have built this incredible model. They are releasing it to the public.

under an open-weight structure, and they are one of the leading AI labs when it comes to creating very powerful models. That's right. There's an asterisk. Oh, boy. This version of Llama 4 is an experimental model. Meta on its website says it has been optimized for chat, and it is not the same as the version of Llama 4 that is actually available for download. The one that was included in LM Arena was not the one that people could download? That's right. It had a different name. It was named Maverick 03-26 Experimental.

And people start to think, oh, wait a minute. What if what happened here isn't what normally happens on LM Arena, which is people make a new model and submit it to LM Arena and see how it does? What if Meta trained a special version of Llama 4 just to be good at LM Arena? Now, I have spent the past week trying to research whether this is true. And on Monday, I got Meta to send me a statement, which I guess I should read. Quote: We experiment with all types of custom variants.

And this experimental version is, quote, a chat-optimized version we experimented with that also performs well on LM Arena. We have now released our final open source version, and we will see how developers customize Llama for their own use cases.

So this was really interesting to me, because when they say, well, it also performs well on LM Arena, it suggests that, well, maybe they just made, I don't know, 15 of these models, and they were just like, oh, look, this one happens to do well on LM Arena. That is one possibility. I think another possibility is exactly what the cynics think, which is, oh no, they sort of reverse engineered how LM Arena works.

And they built a bot that was just going to beat it. And how would you do that? Like, if your goal was to create a model that would perform very well on this one specific leaderboard, what would you do? So LM Arena has released a lot of chats over the years that sort of show which chats are considered preferable to other chats. And it seems that the users of LM Arena really like it when the bot has a high degree of what they call sycophancy.

So basically, you're just like, what should I have for breakfast today? And the chatbot is like, oh my god, that's such a great question. You're a genius. I love the way you're starting the day off right. That is the kind of answer that people pick. And so you can build a chatbot that essentially just flatters people constantly, and it tends to do really well on Chatbot Arena.

So anyways, in the aftermath of this confusion, LM Arena, which is a very mild-mannered organization that I think is not used to being involved in public controversies, puts out a statement. And I have to read the statement, Kevin, because, as gentle as it is, I found it pretty damning. They don't go so far as to say Meta cheated. But what they do say is, quote, Meta's interpretation of our policy did not match what we expect from model providers.

Meta should have made it clear that this experimental model was a customized model to optimize for human preference. As a result of that, we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn't occur in the future.

So why is that statement so interesting to me? Well, you basically just have this tiny group of researchers over at Berkeley, and Meta violates their policies so hard that they have to change the rules for how the competition even works, just to get people to stop breaking it. Yeah. I thought this was a really interesting set of stories. I'm still waiting for someone, ideally you, to get to the bottom of what actually happened inside Meta.

But I think it's worth talking about for two reasons. One, because I think it says something about Meta and its place in the AI race. And the other, because I think it says something about the state of AI and these benchmarks, and how useful they are or aren't in making sense of the torrent of new models that are coming out constantly from the big AI labs. So maybe let's take those in turn. What does it say about Meta's place in the AI race if it does turn out that they had sort of gamed this leaderboard to make it look like their model was better than it was? Here's what I think. I think if you're winning the AI race, you do not waste time trying to beat LM Arena, right?

What you do is what Google did, which is just release a very powerful pro version of Gemini. And it just happens to float to the top of the arena, not because it's been optimized for conversation, but just because it's a great model that's really good at a lot of things. If you have to make a custom version of your model just to win this rinky-dink competition, it's hard for me to think of a more adverse indicator for the quality of Meta's AI program.

And we should say, there's been reporting in The Information over the past year that the Llama 4 development process has been really frustrating for Meta. That they delayed the release twice because they weren't getting the results that they wanted. And when it finally did come out and people started to put it through other evaluations, they found that it just was not hitting the mark. In fact, Kevin, Ethan Mollick, a former guest on Hard Fork, compared the responses of the experimental version that was winning the leaderboard to the chats that were produced by the final open-weights model. And what he found was that the open-weights model was producing really bad responses. Essentially, the optimized model was performing so much better than the real one that it wasn't even close. So why don't they just release the optimized model, then? That's a great question.

I don't know the answer to that, but what I'm going to assume is that whatever fine-tuning is necessary to increase the level of sycophancy in the bot... might be great for this sort of competition, but maybe it's really bad for coding or creative writing.

or the countless other things that we now expect LLMs to be good at, right? You know, fine-tuning is a very powerful process that can take a very general-purpose model that's kind of mediocre at a bunch of things and make it really good at one thing. These days, people have a lot of options to choose from with their large language models. And there are a lot of them that just have very high general capability.

So they're going to use those instead. Yeah. I mean, I have not done my own reporting on the situation inside Meta with Llama 4, but I will just say, from a broad view, if you step back from this particular scandal: Meta is not one of the top three AI labs in America when it comes to releasing frontier models. They are not in the top tier of frontier AI research. A lot of their key researchers have left the company. Their models are not seen as being as capable as the models from OpenAI, Anthropic, and Google DeepMind. And I think that really frustrates them, right? I think Mark Zuckerberg and his lieutenants really want to be seen as part of the vanguard here. And so I would not be surprised at all if, in an effort to kind of juice their numbers and appear to be leapfrogging some of their competition, they may have violated the terms of one particular AI benchmark. And that should make us question how well their overall AI program is doing.

Absolutely. And by the way, the next time they release a model and come out with a bunch of wild claims, like, you think I'm going to believe any of that? No, you're going to have to go try to verify every single claim they make independently.

And look, I assume some people are going to hear this and think that I'm making a mountain out of a molehill. But I just think about what Daniel Kokotajlo just told us about how powerful these systems are becoming and about how powerful they're about to become. And you want them to be sort of loyal to human beings, but you also want them to not be used for bad behavior. If there is a company out there that is just cheating to win benchmarks, what else is it willing to do?

Even though this may seem like a small thing, I think it matters that we have companies building AI systems where we have some level of trust in those companies, where we believe they have some amount of integrity when it comes to how they operate.

And so this was a moment where I thought, wow, my trust in Meta as an AI company has just been dramatically reduced. Yeah. So, the Meta of it all aside, I think this does actually raise a really important question about the broader AI industry, which is the value of benchmarks in general. Because one thing that I've heard from AI researchers over the past year or two is that these benchmarks, these tests that are given to these models to figure out how intelligent they are...

They all have some flaw built into them, right? There's this issue of data contamination, which is: what if some of the answers on these tests are being fed into these models during their training process, so that you're really not getting a sense of how capable the model is? The models are just regurgitating answers that they've already seen.
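To make that concrete, one rough way to probe for contamination is to check whether long word sequences from a benchmark question appear verbatim in the training data. The sketch below is a toy version of that idea, with entirely made-up inputs; real contamination audits are considerably more sophisticated.

```python
# Toy data-contamination probe: flag a benchmark item if any long
# word n-gram of it appears verbatim in the training text. Illustrative only.

def word_ngrams(text: str, n: int = 8) -> set[str]:
    """All n-word sequences in the text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_text: str, n: int = 8) -> bool:
    """True if the item shares any n-word sequence with the training text."""
    return bool(word_ngrams(benchmark_item, n) & word_ngrams(training_text, n))

# Example with made-up strings (n lowered to 4 so the overlap is visible):
item = "What is the airspeed velocity of an unladen swallow?"
train = "forum post: what is the airspeed velocity of an unladen swallow? 11 m/s"
assert looks_contaminated(item, train, n=4)
```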

That is an issue. There is also the issue that all of these companies are effectively grading their own homework, right? There's no federal program that puts these things through their paces and releases standardized benchmark scores that we can actually verify and trust. And some of these AI companies are using different methods to even apply these benchmark tests. There are these things called consensus at 64 and all these different ways that you can kind of cherry-pick the best answer that your model gives if you give it the test a bunch of times, and use that for your score. So I think we are just losing our ability to trust the way that we measure these AI models in general.

Yeah, and it's so frustrating. You know, I was thinking, Kevin: imagine it's the early 2010s, and it's not just that Instagram comes out as an app in the App Store.

You have Instagram, you have Instagram o1, you have Instagram o1 mini, you have Instagram o1 deep research. And it's like, download the one that's best for you. You'd be like, why are you making me do any of this? Right? Just give me the one thing that works. And while every AI lab is trying to realize that, in the meantime, we're living through this Cambrian explosion of large language models.

And on one hand, I think that makes it really important for there to be benchmarks, so that we can look at a glance and have a basic sense of: is this thing even worth my time? But on the other hand, that makes the benchmarks such an attractive target for gaming and outright cheating. And so that's why the researcher Andrej Karpathy has said that we have what he calls an evaluation crisis, where when a new model comes out, the question of how good it is is just very difficult to answer.
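And to picture the "consensus at 64" trick Kevin mentioned a moment ago: you sample the model many times on each question and report its most common answer, which can make a score look much stronger than the model's single-try performance. A minimal sketch follows; sample_model here is a hypothetical stand-in for whatever API call actually generates an answer.

```python
# Illustrative "consensus@N" scoring: ask the model N times per question
# and take a majority vote. sample_model is a hypothetical stand-in for
# a real model API call, not any specific library's interface.
from collections import Counter
from typing import Callable

def consensus_answer(sample_model: Callable[[str], str],
                     question: str, n: int = 64) -> str:
    """Return the most frequent answer across n independent samples."""
    answers = [sample_model(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Note: a model that is right well under half the time per sample can still
# score near-perfectly under consensus@64, if its wrong answers are scattered.
```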

I've been wondering what we can do as journalists to try to answer those questions better. Like, is this a place for journalists to actually say, okay, a new model came out, we're going to have our own custom set of evaluations? Maybe we're going to keep those private in some way to prevent them from being gamed. But what solutions do you see here to this crisis?

Well, at the risk of scooping myself here, I will disclose that I am actually starting to work on my own benchmark, because I think that part of how we are going to make sense of these AI models is that people will just start developing their own sets of tests, not necessarily to determine their overall intelligence, but to determine how good they are at the things we care about. Personally, I don't care much if an AI model...

is getting a 97% on the graduate-level physics exam or a 93%, right? That does not make a huge difference in my life. Because it's still higher than you're going to get. Exactly. And I am not a graduate-level physics researcher. So I might care more about whether a model is good at creative writing or not, and I might want a battery of tests to determine that. And so I think that as these things become more critical in people's lives and work, we will start seeing these more personalized

tests and evaluations that actually measure if the models are good at the things that we care about. What do you think? Yeah, I think that's a great point. And after you told me that you were going to do this, I sort of started to scheme and thought, you know, I want my own benchmarks too, because... There are, I don't know, I'm sure I can come up with a list of like 10 things that I wish AI could do for me today that it still can't. And so maybe it's time that I should start a scenario plan.

What's one of your tests that you want to give AI models to determine if they're capable or not? Well, for example, I have a newsletter that has customer service issues. People email us. They say, oh my gosh, can I change my email address? People say the writing in this is... so bad. People love the writing. That's all I hear about the writing. People are saying, are humans writing this? That's insane.

But I would love to be able to automate some of that, you know, make it easier for people. Oh, you need to download your invoice, which is a question we get a lot. It's like, okay, yes, actually, we're just going to sort of handle that in an automated way.

That's just one very easy thing. And if you're thinking, oh, Casey, I actually have a product that can already do that for you, please don't email me. It can't. I've been through this. Can I tell you one of the things that I want to test AI on? Yeah. So as you know, I just moved into a new house. So as a result, I have spent like between a third and half of my waking hours over the last few weeks thinking about hanging pictures. Hanging pictures.

is one of my least favorite tasks in the world. You have to do math. You have to bring out the laser level. I mean, it's a huge process. The golden ratio? Yes. And I would love for an AI system to be able to hang pictures for me. That's beautiful. And as soon as that happens, to me, that's AGI. Now, would that involve a robot? Probably. Yeah. So we got to make some progress before we get there. But if you're listening to this and you're working at one of these robotics companies...

Hard Fork is produced by Rachel Cohn and Whitney Jones. This episode was edited by Matt Collette and fact-checked by Ena Alvarado. Today's show is engineered by Chris Wood. Original music by Rowan Niemisto and Dan Powell. Our executive producer is Jen Poyant. Video production by Sawyer Roquet, Pat Gunther, and Chris Schott.

You can watch this whole episode on YouTube at youtube.com slash hardfork. Special thanks to Paula Szuchman, Pui-Wing Tam, Dalia Haddad, and Jeffrey Miranda. You can email us at hardfork at nytimes.com with your AI doomsday scenario.

This transcript was generated by Metacast using AI and may contain inaccuracies.