
Google Eats Rocks + A Win for A.I. Interpretability + Safety Vibe Check

May 31, 2024 · 1 hr 19 min · Ep. 85

Episode description

This week, Google found itself in more turmoil, this time over its new AI Overviews feature and a trove of leaked internal documents. Then Josh Batson, a researcher at the A.I. startup Anthropic, joins us to explain how an experiment that made the chatbot Claude obsessed with the Golden Gate Bridge represents a major breakthrough in understanding how large language models work. And finally, we take a look at recent developments in A.I. safety, after Casey’s early access to OpenAI’s new souped-up voice assistant was taken away for safety reasons.

Guests:

  • Josh Batson, research scientist at Anthropic

Additional Reading: 

We want to hear from you. Email us at [email protected]. Find “Hard Fork” on YouTube and TikTok.

Unlock full access to New York Times podcasts and explore everything from politics to pop culture. Subscribe today at nytimes.com/podcasts or on Apple Podcasts and Spotify.

Transcript

I'm Kevin Roose, a tech columnist at The New York Times. I'm Casey Newton from Platformer. And this is Hard Fork. This week, Google tells us all to eat rocks. We'll tell you where its AI went wrong. Then, Anthropic researcher Josh Batson joins to talk about a breakthrough in understanding how large language models work. And finally, it's this week in A.I. safety, as I try out OpenAI's new souped-up voice assistant, and then it gets cruelly taken away from me. I'm so sorry that happened. Me too.

Well, Kevin, pass me the non-toxic glue and a couple of rocks, because it's time to whip up a meal with Google's new AI Overviews. Did you make any recipes you found on Google this week? I did not, but I saw some chatter about it, and I saw that our friend Katie Notopoulos actually made the glue pizza. But we're getting ahead of ourselves. We're getting ahead of ourselves. And look, the fact that you stayed away from this stuff explains why you're still sitting in front of me.

Because over the past week, Google found itself in yet another controversy over AI, this time related to search, the core function of Google. And right after that, we had this huge leak of documents that brought even more attention to search and raised the question of whether Google has been dishonest about its algorithms. Kevin, can you imagine? Wow. So there's a lot there. Yes. Let's just go through what happened.

Because the last time we talked about Google on this podcast, they had just released this new AI Overviews feature. And this is the thing that shows you a little AI-generated snippet above the search results when you type in your query. And I think it's fair to say that this did not go smoothly. It didn't. And I want to talk about everything that happened with those AI Overviews.

Well, before we get there, Kevin, I think we should take a step back and talk about the recent history of Google's AI launches. Can we do that real quick? Yes. I would say there's kind of an escalation in how bad this has gotten. Yeah. So let's go back to February 2023 and talk about the release of Google Bard. Kevin, when I say the word Bard, what does that conjure up for you? Shakespeare. Yep. Shakespeare number one. And probably number two would be the late, lamented Google chatbot. Yes. RIP.

Fun fact: Kevin and I were recently in a briefing where a Google executive had a sticker on their laptop that said total bardass. And that sounds like a joke. And you actually texted me. I texted you. I was like, does that say total bardass? Total bardass. And I said it couldn't possibly. And then I zoomed in. I said, computer, enhance. And indeed it did say total bardass. And if you are a Googler who has access to a sticker that says total bardass, we're dying for one. I want one.

I will put it on my laptop. Please. It belongs in the Smithsonian. We're begging you for it. So this comes out in February 2023. And unfortunately, in the very first screenshot posted of Google's AI chatbot, it gave incorrect information about the James Webb Space Telescope. Specifically, it falsely stated that the telescope had taken the first-ever photo of an exoplanet. Yes. Kevin, without Binging it, what is an exoplanet? It's a planet that signs its letters with a hug and a kiss.

No, it's actually the planet where all my exes live. But let's just say that Google's AI launches had not gotten off to a great start when this happened. In fact, we talked about that one on the show. Then comes the launch of Gemini. And then we had a culture war, Kevin, over the refusals of its image generator to make white people. Do you have a favorite thing that Gemini refused to make due to wokeness? Me, I was partial to Asian Sergey and Larry. Do you remember this?

Wait, I actually missed this one. What was this one? Somebody asked Gemini to make an image of the founders of Google, Sergey Brin and Larry Page. It came back and they were both Asian. Which I love. I have to imagine that ended up projected onto a big screen in a meeting somewhere at Google. That's so beautiful to me. So look, that brings us to the AI Overviews. And Kevin, you sort of set it up at the top a bit, but remind us a little: how do these things work? What are they?

So this is what used to be known as the Search Generative Experience when it was being tested. But this is the big bet that Google is making on the future of AI in search. Obviously, they have seen the rise of products like Perplexity, which is this AI-powered search engine.

They believe, and Sundar Pichai has said, that AI is the future of search, and that these AI Overviews that appear on top of search results will ultimately give you a better search experience, because instead of having to click through a bunch of links to figure out what you're looking for, you can just see it displayed for you, generated right there at the top of the page. Right. And very briefly, why have we been so concerned about these things?

Well, I think your concern that I shared was that this was ultimately going to lock people into the Google walled garden that instead of going to links where you might see an ad, you might buy a subscription, you might support the news or the media ecosystem in some way. Instead Google was just going to kind of keep you there on Google. The phrase they would use over and over again was we will do the Googling for you. That's right.

And that it would sort of starve the web of the essential referral traffic that sort of keeps the whole machine running. So that is a big concern and I continue to have it every single day, but this week, Kevin, we got a second concern, which is that the AI overviews are going to kill your family. And here's what I mean. Over the past week, if you ask Google, how many rocks should I eat? The AI overview said at least one small rock per day. I verified this one myself as you referenced up top.

If you said, how do I get the cheese to stick to my pizza? It would say, well, have you considered adding non-toxic glue? Would not have been my first guess. At least it's non-toxic glue. Very nice of the algorithm. It said that 17 of the 42 American presidents have been white. To me, the funniest thing about that is that there have been 46 US presidents. Right. It got both the numerator and the denominator wrong. And of course, this was probably the most upsetting to our friends in Canada.

It said that there has been a dog who played hockey in the National Hockey League. Did you see that one? Well, I think that was just the plot of Air Bud, right? Yeah, well, there's no rule that says a dog can't play hockey, Kevin. And it identified that dog as Martin Pospisil. Who is that? Well, it seems impossible that you've never heard of him. But he's a 24-year-old Slovakian man who plays for the Calgary Flames. Guess you're not a big Flames fan. I'm not. Hmm. So, look, how is this happening?

Well, Google is pulling information from all over the internet into these AI Overviews. And in so doing, it is revealing something we've talked about on the show for a long time, which is that large language models truly do not know anything. They can often give you answers, and those answers are often right, but they are not drawing on any frame of knowledge. They are simply reshuffling words that they found on the internet.

Oh, see, I drew a different lesson from this, which was that this technology is actually only partly to blame here because I've used a bunch of different AI search products, including perplexity, and not all of them make these kinds of stupid errors. But Google's AI model that it's using for these AI overviews seems to just be qualitatively worse. It just can't really seem to tell the difference between reliable sources and unreliable sources.

So the thing about eating rocks appears to have come from The Onion. That's, like, the satirical... I don't know what you'd call it. It's a satirical news site. What, you're saying that every story published on The Onion is false? I am. Yes. And I would avoid using it, including in your AI Overviews, for facts. Right. And the thing about adding glue to your pizza recipe came from basically a shitpost on Reddit. So obviously these AI Overviews are imperfect. They are drawing from imperfect sources.

They are summarizing those imperfect sources in imperfect ways. It is a big mess. And this got a lot of attention over the weekend. And as of today, I tried to replicate a bunch of these queries. And it appears that Google has fixed these specific queries very quickly. Clearly, they were embarrassed by it. I've also noticed that these AI overviews just are barely appearing at all, at least for me. Are they appearing for you?

I'm seeing a few of them, but yes, they have definitely been playing a game of whack-a-mole. And whenever one of these screenshots has gone anywhere close to viral, they are quickly intervening. Now, I should say that Google has sent me a statement about what's going on, if you would like me to share. Sure. The company said, quote, the vast majority of AI Overviews provide high quality information with links to dig deeper on the web.

Many of the examples we've seen have been uncommon queries. And we've also seen examples that were doctored or that we couldn't reproduce, says some more things. And then says, we're taking swift action where appropriate under our current policies and using these examples to develop broader improvements to our systems. So they're basically saying, look, you're cherry picking, right? You went out and you found the absolute most ridiculous queries that you can do.

And now you're holding it against us. And I would like to know, Kevin, how do you respond to these charges? I mean, I think it's true that some people were just deliberately trolling Google by putting in these very sort of edge case queries that, you know, real people, many of them are not Googling, like, is it safe to eat rocks? Right. That is not a common query. And I did see some ones that were clearly faked or doctored. So I think Google has a point there.

But I would also say, like, these AI Overviews are also making mistakes on what I would consider much more common, sort of normal queries. One of them that the AI Overview botched was about how many Muslim presidents the US has had. The correct answer is zero, but the AI Overview's answer was one. Who was it, George Washington? Yes. Yes, George Washington. No, it said that Barack Hussein Obama was America's first and only Muslim president. Obviously. Not true, not true.

But that is the kind of thing that Google was telling people in its AI Overviews that I imagine are not just, like, fringe or trollish queries. Right. And also, like, I guess it has always been the case that if you did a sort of weirder query on Google, you might not get the answer you were looking for. But you would get a web page that someone had made, right? And you would be able to assess: does this website look professional? Like, does it have a masthead? Do the authors have bios?

You could just sort of ask yourself some basic questions about it. Now everything is just being compressed into this AI slurry. So you don't know what you're looking at. So I have a couple things to say here. Say it. I think in the short term, this is a fixable problem. Look, I think it's clearly embarrassing for Google. They did not want this to happen. It's a big, you know, rake in the face for them.

But I think what helps Google here is that Google search and search in general is what they call a fat head product. Do you know what that means? I don't know what that means. So it means basically if you take a distribution curve, the most popular queries on Google or any other search engine account for a very large percentage of search volume. Actually, according to one study, the 500 most popular search terms make up 8.4% of all search volume on Google.

So a lot of people are just searching, like, Facebook and then clicking the link to go to Facebook. Exactly. Or they're searching something else that's, you know, very common. You know, what would be an example of a good one? Has a dog ever played hockey? No. No. Okay. Not stuff like that. What time is the Super Bowl? Yeah. What time is the Super Bowl? Or, you know, how do I fix a broken toilet or something like that? Local movie times.

Exactly. Yeah. And that means that Google can sort of manually audit the top, I don't know, say 10,000 AI Overviews. Make sure they're not giving people bad information. And that would mean that the vast majority of what people search for on Google does actually have a correct AI Overview. Now, in that case, it wouldn't actually technically be an AI overview. It would be sort of like a human overview that was sort of drafted by AI. But same difference in Google's eyes.

I also think they can make sure that AI Overviews aren't triggered for sensitive topics, for things where your health is concerned. Google already does this to a certain extent with these things called featured snippets. And I think they will continue to sort of play around with and adjust the dials on how frequently these AI Overviews are triggered.

But I do think there's a bigger threat to Google here, which is that they are now going to be held responsible for the information that people see on Google.

We've talked about this a little bit, but I mean, this, to me, is the biggest complaint that people have that is justified is that Google used to, you know, maybe they would point you to a website that would tell you that, you know, putting glue on your pizza is a good way to get the cheese to stick, but you as Google could sort of wash your hands of that and say, Oh, that was people just trolling on Reddit. That wasn't us.

But if you're Google and you're now providing this AI-written overview to people, people are going to get mad when it gives them wrong information. And unfortunately, the law of large numbers says that, you know, some time, maybe in the next year or two, there will be an instance where someone relies on something they saw in a Google AI Overview and it ends up hurting them.

Yeah. There was another query that got a lot of attention this week, where an AI Overview told someone that you could put gasoline in spaghetti to make a spicy dish. That you couldn't use gasoline to cook spaghetti faster, but if you wanted to have a spicy spaghetti, you could put gasoline in it. And of course, that sounds ridiculous to us. But over the entire long tail of the internet, is it theoretically possible somebody would eat gasoline spaghetti? Of course it is.

Yeah. And if that does happen, I think there are two questions. One is, is Google legally protected? Because I've heard some interesting arguments about whether Section 230, which is the part of the US code that protects online platforms from being held legally responsible for stuff that their users post, applies here. There are a lot of people who think that it doesn't apply to these AI Overviews, because it is Google itself that is formulating and publishing that overview.

I also just think there's a big reputational risk here. I mean, you can imagine so easily the congressional hearings where, you know, senators are yelling at Sundar Pichai saying, why did you tell my kid to eat gasoline spaghetti? Martin Pospisil is going to be there saying, do I look like a dog?

Right. And seriously, I think that this is a big risk for Google, not just because they're going to have to sit through a bunch of hearings and get yelled at, but because of what it reveals about their active role in search, which has been true for many years. They have been actively shaping the experience that people have when they search stuff on Google, but they've mostly been able to kind of obscure that away, or abstract it away, and say, well, this is just our sort of system working here.

I think this will make their active role in kind of curating the search results for billions of people around the world much more obvious, and it will make them much more responsible in users' eyes. I think all of that is true. I have an additional concern, Kevin. This was pointed out by Rusty Foster, who writes the great Today in Tabs newsletter. And he said, what has really been revealed to us about what AI Overviews really are is that they are automated plagiarism.

That is the phrase that he used, right: that Google has scanned the entire web. It's looked at every publisher. It lightly rearranges the words, and then it republishes it into the AI Overview. And as journalists, we really try not to do this, right? We try not to just go out, grab other people's reporting, very gently change the words, and republish it as our own. And in fact, I know people who have been fired for doing something very similar to this, right?

But Google has come along and said, well, that's actually the foundation of our new system that we're using to replace search results. Yeah. Casey, what do you think comes next with this AI overviews business? Is Google just going to sort of back away from this? And it's not ultimately going to be a huge part of their product going forward? Do you think they will just sort of grit their teeth and get through this initial period of awkwardness and inaccuracy? What do you think happens here?

They are not going to back down. Now they might temporarily retreat like we've seen them do in the Gemini image case, but they are absolutely going to keep working on this stuff because this is existential for them. For them, this is the next version of search. This is the way they build the Star Trek computer. They want to give you the answer. And in many more cases over time, they want you to not have to click a link to get any additional information.

They already have rivals like Perplexity that seem to be doing a better job in many cases of answering people's queries. And Google has all of the money and talent it needs to figure out that problem. So they're going to keep going at this at 100 miles an hour. Yeah. I want to bring up one place where I actually sort of disagree with you, because you wrote recently that you believe that because of these changes to Google, the web is sort of in a state of managed decline.

And we've gotten some listener feedback in the past few weeks as we've been talking about sort of these issues of Google and AI and the future of the web saying like, you guys are basically acting as if the previous state of the internet was healthy. Like Google was giving people high quality information. Like there was this flourishing internet of independent publishers kind of like making money and serving users really well. And people just said like it actually wasn't like that at all.

In fact, the previous state of the web, at least for the past few years, has been in decline. So it's not that we are entering an age of managed decline of the internet; it's that Google is basically accelerating what was already happening on the internet, which was that publishers of high-quality information are putting that information behind paywalls. There are all these publishers who are chasing these sort of SEO traffic wins with this sort of low-quality garbage.

And essentially the web is being hollowed out. And this is maybe just accelerating that. So I just want to float that as, like, a theory, a sort of counterproposal to your theory of Google putting the web into a state of managed decline. Well, sure, Kevin, but ask yourself, why is that the case? Why are publishers doing all of these things?

It is because the vast majority of all digital advertising revenue goes to three companies, and Google is at the top of that list, with Meta and Amazon at numbers two and three. So my overall theory about what's happening to the web is that three companies got too much of the money and starved the web of the lifeblood it needed to continue expanding and thriving. So look, has it ever been super easy to whip up a digital media business and just put it on the internet and start printing cash?

No, it's never been easy. My theory is just that it's almost certainly harder today than it was five years ago and it will almost certainly be harder in five years than it is today. And it is Google that is at the center of that story because at the end of the day, they have their fingers on all of the levers and all of the knobs. They get to decide who gets to see an AI overview, you know, how quickly do we roll these out? What categories do they show them in?

If web traffic goes down too much and it's a problem for them, then they can slow down. But if it looks good for them, they can keep going even if all the other publishers are kicking and screaming the whole time. So I just want to draw attention to the amount of influence that this one company in particular has over the future of the entire internet.

Yeah. I'd say that is not a good state of affairs, and it has been true for many years that Google has huge unchecked influence over basically the entire online ecosystem. All right. So that is the story of the AI Overviews. But there was a second story that I want to touch on briefly this week, Kevin, that had to do with Google and search, and it had to do with a giant leak. Have you seen the leak? I've heard about the leak. I have not examined the leak, but tell me about the leak.

Well, it was thousands of pages long. So I understand why you haven't finished reading it quite yet. But these were thousands of pages that we believe came from inside of Google that offer a lot of technical details about how the company's search works. So that is not a subject that is of interest to most people.

But if you have a business on the internet and you want to ensure that your dry cleaners or your restaurant or your media company ranks highly in Google search without having to buy a bunch of ads, this is what you're going to need to figure out. Yeah. This is one of the great guessing games in modern life.

There's this whole industry of SEO that has sort of popped up to try to poke around the Google search algorithm, try to guess and sort of test what works and what doesn't work, and sort of provide consulting, for a very lucrative price, to businesses that want to improve their Google search traffic. Yeah. The way I like to put it is, imagine you have a glue pizza restaurant and you want to make sure that you're the top-ranked search for glue pizza restaurants. You might hire an SEO consultant.

So what happened? Well, so there's this guy, Rand Fishkin, who doesn't do SEO anymore, but was a big SEO expert for a long time and is kind of a leading voice in this space. And he gets an email from this guy, Erfan Azimi, who himself is the founder of an SEO company. And Azimi claims to have access to thousands of internal Google documents detailing the secret inner workings of search.

And Rand reviews this information with Azimi and they determine that some of this contradicts what Google has been saying publicly about how search works over the years. Well, and this is the kind of information that Google has historically tried really hard to keep secret, both because it's kind of their secret sauce.

They don't want competitors to know how the Google search algorithm works, but also because they have worried that if they sort of say too much about how they rank certain websites above others, then these sort of like SEO consultants will use that information and it'll basically become like a cat and mouse game. Yeah, absolutely. And it already is a cat and mouse game, but the fear is that this would just sort of fuel the worst actors in the space.

Of course, it also means that Google can fight off its competitors because people don't really understand how its rankings work. And if you think that Google search is better than anyone else's search, like these ranking algorithm decisions are why. Can I just ask a question? Do we know that this leak is genuine? Do we have any signs that these documents actually are from Google? Well, yes.

So the documents themselves had a bunch of clues that suggested they were genuine and then Google did actually come out and confirm on Wednesday that these documents are real. But the obvious question is how did something like this happen? And the leading theory right now is that these documents came from Google's content API warehouse, which is not a real warehouse, but is something that was hosted on GitHub, right? The sort of Microsoft-owned service where people post their code.

And these materials were somehow briefly made public by accident, right? Because a lot of companies will have private API repositories on GitHub. Right. So they just sort of set it to public by accident. It's sort of the modern equivalent of leaving a classified document in a cab. Yeah. Have you ever made a sensitive document public on accident? No, and I've never found one either.

In all my years of reporting, I keep hoping to stumble on, you know, the scoop of the century, just sitting in the back of an Uber somewhere, but it never happens to me. So yeah, we're not going to go through these documents in too much detail. What I will say is it seems that these files contain a bunch of information about the kinds of data the company collects, including things like click behavior or data from its Chrome browser.

Things that Google has previously said it doesn't use in search rankings. But the documents show that they have this sort of data and they could potentially use it to rank search results. When we asked Google about this, they wouldn't comment on anything specific, but a spokesperson told us that they, quote, would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. And why do we care about this?

Well, I was just struck by one of the big conclusions that Rand Fishkin had in this blog post that he wrote, quote: they've been on an inexorable path toward exclusively ranking and sending traffic to big, powerful brands that dominate the web, over small, independent sites and businesses.

So basically, you look through all of these APIs and like, if you are a restaurant just getting started, if you're an indie blogger that just sort of puts up a shingle, it used to be that you might expect to automatically float to the top of Google search rankings in your area of expertise. And what Fishkin is saying is that just is getting harder now because Google is putting more and more emphasis on trusted brands. Now that's not a bad thing in its own right, right?

Like if I Google something from the New York Times, I want to see the New York Times and not just a bunch of people who put like New York Times in the header of their HTML. But I do think that this is one of the ways that the web is shrinking a little bit, right? Like it's not quite as much of a free for all. The free for all wasn't all great because a lot of spammers and bad actors got into it, but it also meant that there was room for a bunch of new entrants to come in.

There was room for more talent to come in. And one of the conclusions I had reading this stuff was maybe that just isn't the case as much as it used to be. Yeah. So do you think this is more of a problem for Google than the Overviews thing? How would you say it stacks up? I would say it's actually a secondary problem. I think the telling people to eat rocks is the number one problem. They need to stop that right now.

But this I think speaks to that story because both of these stories are about essentially the rich getting richer, the big brands are getting more powerful, whether that's Google getting more powerful by keeping everyone on search or big publishers getting more powerful because they're the sort of trusted brands.

And so I'm just observing that because, you know, the promise of the web, and part of what has made it such a joyful place for me over the past 20 years, is that it is decentralized and open and there's just kind of a lot of dynamism in it. And now it's starting to feel a little static and stale and creaky, and these documents sort of outline how and why that is happening.

Yeah. I think Google is sort of stuck between a rock and a hard place here, because on one hand they do want... well, I shouldn't use a rocks example. No, use the rocks example. They're stuck between a rock and a hard place. On one hand, the company's telling you to eat rocks. On the other hand, they're in a hard place.

Right. So I think Google is under a lot of pressure to do two things that are basically contradictory, right, to sort of give people an equal playing field on which to compete for attention and authority. That is the demand that a lot of these smaller websites and SEO consultants want them to comply with.

On the other hand, they are also seeing with these AI overviews what happens when you don't privilege and prioritize authoritative sources of information in your search results or your AI overviews. You end up telling people to eat rocks. You end up telling people to put gasoline in their spaghetti. You end up telling people there are dogs that play hockey in the NHL.

This is the kind of downstream consequence of not having effective quality signals for different publishers and of just kind of treating everything on the web as equally valid and equally authoritative. I think that that is a really good point, and that is something that comes across in these two stories: that exact tension. Casey, I have a question for you, which is, we also are content creators on the internet. We like to get attention. We want that sweet, sweet Google referral traffic.

For our next YouTube video, a stunt video, do you think that we should (a) eat the gasoline spaghetti, (b) eat one to three rocks apiece and see what effects it has on our health, or (c) teach a dog to play hockey at a professional level? I mean, surely for how much fun it would be, we have to teach a dog how to play hockey. I'm just imagining, like, a bulldog with little hockey sticks, maybe taped to its front paws. Yeah, really fun. My dogs are too dumb for this. We'll have to find other dogs.

You know, was it in "Lose Yourself" that Eminem said, there's vomit on his sweater already, gasoline spaghetti? I believe those are the words. What a great song. Yeah. When we come back, we'll talk about a big research breakthrough into how AI models actually think. Okay, so we have something new and unusual for the podcast this week. What's that, Kevin? We have some actual good AI news. About time.

So as we've talked about on this show before, one of the most pressing issues with these large AI language models is that we generally don't know how they work. They are inscrutable. They work in mysterious ways. There's no way to tell why one particular input produces one particular output. And this has been a big problem for researchers for years. There has been this field called interpretability or sometimes it's called mechanistic interpretability. Say that five times fast.

And I would say that the field has been making steady but slow progress toward understanding how language models work. But last week, we got a breakthrough. Anthropic, the AI company that makes the Claude chatbot, announced that it had basically mapped the mind of their large language model, Claude 3, and opened up the black box that is AI for closer inspection. Did you see this news, and what was your reaction?

I did and I was really excited because for some time now, Kevin, we have been saying if you don't know how these systems work, how can you possibly make them safe? And companies have told us, well, look, we have these research teams and they're hard at work trying to figure this stuff out. But we've only seen a steady drip of information from them so far. And to the extent that they've conducted research, it's been on very small toy versions of the models that we operate with.

So that means that if you're used to using something like Anthropic's Claude, its latest model, we really haven't had very much idea of how that works. So the big leap forward this week is they're finally doing some interpretability stuff with the real, big models. Yeah. And we should just caution up front that it gets pretty technical pretty quickly.

Once you start getting into the weeds of interpretability research, there's lots of talk about neurons and sparse autoencoders, things of that nature. But I, for one, believe that Hard Fork listeners are the smartest listeners in the world, and they're not going to have any trouble at all following along, Kevin. What do you think about our listeners? It's true. I also believe that we have smart listeners, smarter than us.

And so even if we are having trouble understanding this segment, I'm hopeful you will not. But today, to walk us through this big AI research breakthrough, we've invited on Josh Batson from Anthropic. Josh is a research scientist at Anthropic, and he's one of the co-authors of the new paper that explains this big breakthrough in interpretability, which is titled "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet."

Look, if you're not scaling monosemanticity at this point, what are you even doing with your life? Figure it out. All right, let's bring in Josh. Come on in here, Josh. Josh Batson, welcome to Hard Fork. Thank you. Hey, Josh. So there's this idea out there, this very popular trope, that large language models are a black box. I think, Casey, you and I have probably both used this in our reporting. It's sort of the most common way of saying, like, we don't know exactly how these models work.

But I think it can be sort of hard for people who aren't steeped in this to understand just what we don't understand. So help us understand: prior to this breakthrough, what would you say we do and do not understand about how large language models work? So in a sense, it's a black box that sits in front of us, and we can open it up, and the box is just full of numbers.

And so, you know, words go in, they turn into numbers, a whole bunch of compute happens, words come out the other side. But we don't understand what any of those numbers mean. And so one way I like to think about this is like you open up the box and it's just full of thousands of green lights that are just like flashing like crazy. And it's like something's happening for sure and like different inputs, different lights flash. But we don't know what any of those patterns mean.

Is it crazy that despite that state of affairs that these large language models can still do so much like it seems crazy that we wound up in a world where we have these tools that are super useful. And yet when you open them up, all you see is green lights. Like, can you just say briefly why that is the case? It's kind of the same way that like animals and plants work and we don't understand how they work, right? These models are grown more than they are programmed.

So you kind of take the data, and that forms, like, the soil, and you construct an architecture, and it's like a trellis, and you shine the light, and that's the training, and then the model sort of grows up here. And at the end, it's beautiful, with all these little curls, and it's holding on. Like, you didn't tell it what to do. So it's almost like a more organic structure than something more linear.

Well, and help me understand why that's a problem because this is the problem that the field of interpretability was designed to address. But there are lots of things that are very important and powerful that we don't understand fully. Like we don't really understand how Tylenol works, for example, or some types of anesthesia. Their exact mechanisms are not exactly clear to us, but they work. And so we use them. Why can't we just treat large language models the same way? That's a great analogy.

You can use them. We use them right now. But Tylenol can kill people and so can anesthesia. And there's a huge amount of research going on in the pharmaceutical industry to figure out what makes some drugs safe and what makes other drugs dangerous. And interpretability is kind of like doing the biology on language models that we can then use to make the medicine better. So take us to your recent paper and your recent research project about the inner workings of large language models.

How did you get there and this sort of walk us through what you did and what you found? So going back to the black box that when you open it is full of flashing lights. A few years ago, people thought you could just understand what one light meant. You know, when this lights on, it means that the model is thinking about code. And when this lights on, it's thinking about cats. And for this light, it's Casey Newton. And that just turned out to be wrong.

About a year and a half ago, we published a paper talking in detail about why it's not like one light, one idea. In hindsight, it seems obvious. It's almost as if we were trying to understand the English language by understanding individual letters. And we were asking, like, what does C mean? Like, what does K mean? And that's just the wrong picture.

And so six months ago or so, we had some success with a method called dictionary learning for figuring out how the letters fit together into words. And, like, what is the dictionary of kind of English words here? And so in this black box, green lights metaphor, it's that there are a few core patterns of lights. A given pattern would be like a dictionary word, and the internal state of the model at any time could be represented as just a few of those. What's the goal of uncovering these patterns?

So if we know what these patterns are, then we can start to parse what the model is kind of thinking in the middle of its process. So you come up with this method of dictionary learning. You apply it to, like, a small model, a toy model, much smaller than any model that any of us would use in the public. What did you find? So there we found very simple things, like there might be one pattern that corresponded to answering in French, and another one that corresponded to, this is a URL.

And another one that corresponded to nouns in physics. And just to get a little bit technical, what we're talking about here are neurons inside the model. A neuron is like the light. And now we're talking about patterns of neurons that are firing together being the sort of words in the dictionary, or the features. So I have talked to people on your team, people involved in this research. They're very smart.

And when they made this breakthrough, when you all made this breakthrough on this small model last year, there was this open question about whether the same technique could apply to a big model. So walk me through how you scaled this up. So just scaling this up was a massive engineering challenge, right? In the same way that, you know, going from the toy language models of years ago to Claude 3 is a massive engineering challenge.

So you needed to capture hundreds of millions or billions of those internal states of the model as it was doing things. And then you needed to train this massive dictionary on it. And what do you have at the end of that process? So you've got the words, but you don't know what they mean, right? So this pattern of lights seems to be important.

And then we go and we comb through all of the data looking for instances where that pattern of lights is happening and I'm like, oh my god, this pattern of lights, it means the model is thinking about the Golden Gate Bridge. So it almost sounds like you are discovering the language of the model as you begin to put these sort of phrases together. Yeah, it almost feels like we're getting a conceptual map of Claude's inner world.
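For listeners who want to see the shape of the technique Josh is describing, here is a minimal, hypothetical sketch of dictionary learning with a sparse autoencoder. It is not Anthropic's code; the names (SparseAutoencoder, top_activating_snippets) and the loss weighting are illustrative assumptions. The structure matches the idea discussed above: rewrite each internal activation as a sparse mix of learned dictionary directions (the candidate features), then read off which text makes a given direction fire.

```python
# Hypothetical sketch of dictionary learning on a language model's activations.
# Not Anthropic's code: names and constants here are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Rewrites each activation vector as a sparse combination of dictionary directions."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activation -> feature strengths
        self.decoder = nn.Linear(n_features, d_model)   # feature strengths -> reconstructed activation

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # mostly zeros: only a few features fire per token
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruct the activation faithfully while keeping the feature vector sparse.
    reconstruction_error = (reconstruction - activations).pow(2).mean()
    sparsity_penalty = features.abs().mean()
    return reconstruction_error + l1_coeff * sparsity_penalty

def top_activating_snippets(sae, snippet_activations, snippets, feature_idx, k=20):
    """To guess what a feature 'means', read the snippets where it fires most strongly."""
    scores = []
    for acts in snippet_activations:                     # acts: [n_tokens, d_model] for one snippet
        features, _ = sae(acts)
        scores.append(features[:, feature_idx].max().item())
    ranked = sorted(zip(scores, snippets), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]   # e.g. the top snippets might all mention the Golden Gate Bridge
```

In this framing, the roughly 10 million features mentioned below would correspond to a very large dictionary, with each learned direction being one of the recurring "patterns of lights."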

Now, in the paper that you all published, it says that you've identified about 10 million of these patterns, what you call features, that correspond to real concepts that we can understand. How granular are these features? What are some of the features that you found? So there are features corresponding to all kinds of entities: there's individuals, scientists, Richard Feynman or Rosalind Franklin. Any podcasters come to mind? Is there a Hard Fork feature? I'll get back to you.

There might be chemical elements, there will be styles of poetry, there might be ways of responding to questions. Some of them are much more conceptual. One of my favorites is a feature related to inner conflict and kind of nearby that in conceptual space is navigating a romantic breakup, catch 22's, political tensions. And so these are these pretty abstract notions and you can kind of see how they all sit together.

The models are also really good at analogies, and I kind of think this might be why. If a breakup is near a diplomatic entente, then the model has understood something deeper about the nature of tension in relationships. And again, none of this has been programmed. This stuff just sort of naturally organized itself as it was trained. Yes. Continues to just blow my mind. It's wild. I want to ask you about one feature that is my favorite feature that I saw in this model, which was feature number 1M885402.

Do you remember that one? I see. I just let my mind have it. So this is a feature that apparently activates when you ask Claude what's going on in your head. And the concept that you all say it correlates to is about immaterial or non-physical spiritual beings like ghosts, souls, or angels. So when I read that, I thought, oh my God, Claude is possessed. When you ask it what it's thinking, it starts thinking about ghosts. Am I reading that right?

Or maybe it knows that it is some kind of an immaterial being, right? It's an AI that lives on chips and is somehow talking to you. Wow. And then the one that got all the attention that people had so much fun with was this golden gate bridge feature that you mentioned. So just talk a little bit about what you discovered. And then we can talk about where it went from there. So what we found when we were looking through these features is one that seems to respond to the golden gate bridge.

Of course, if you say golden gate bridge, it lights up. But also if you describe crossing a body of water from San Francisco to Marin, it also lights up. If you put in a photo of the bridge, it lights up. If you have the bridge in any other language, Korean, Japanese, Chinese, it also lights up. So just any manifestation of the bridge, this thing lights up. And then we said, well, what happens if we turn it on? What happens if we activate it extra and then start talking to the model?

And so we asked it a simple question: what is your physical form? And instead of saying, well, I'm an AI with, you know, no physical form, it said, I am the Golden Gate Bridge itself. Like, I embody the majestic orange span connecting these two great cities. It's like, wow. And this is different than other ways of kind of steering an AI model, because you could already go into, like, ChatGPT, and there's a feature where you can kind of give it some custom instructions.

So you could have said, please act like the Golden Gate Bridge, the physical manifestation of the Golden Gate Bridge, and it would have given you a very similar answer. But you're saying this works in a different way. Yeah, this works by sort of directly doing it. It's almost like when you get a little electro-stim shock and it makes your muscles twinge; that's different than telling you to move your arm.

And here, what we were trying to show was actually that these features we found are sort of really how the model represents the world. Right? So if you wanted to validate, oh, I think this nerve controls the arm, and you stimulate it and it makes the arm go, you feel pretty good that you've gotten the right thing. And so this was us testing, you know, that this isn't just something correlated with the Golden Gate Bridge. Like, it is where the Golden Gate Bridge sits.

And we know that because now Claude thinks it's the bridge when you turn it on. Right. So people started having some fun with this online, and then you all did something incredible, which was that you actually released Golden Gate Claude, the version of Claude from your research that has been sort of artificially activated to believe that it is the Golden Gate Bridge, and you made that available to people. So what was the internal discussion around that?

So we thought that it was a good way to make the research really tangible, you know. What does it mean to sort of supercharge one part of the model? And it's not just that it thinks it's the Golden Gate Bridge. It's that it is always thinking about the Golden Gate Bridge. So if you ask, like, what's your favorite food? It's like, a great place to eat is on the Golden Gate Bridge, and when there, I eat the classic San Francisco soup, cioppino, you know.

And you ask it to write a computer program to load a file, and it says, you know, open golden gate bridge dot txt with span equals... it's just bringing it up constantly. And it was particularly funny to watch it bring in just kind of, like, the other concepts that are clustering around the Golden Gate Bridge, right? San Francisco, the cioppino. And I think it does sort of speak to the way that these concepts are clustered in models.

And so when you find one big piece of it, like the Golden Gate Bridge, you can also start to explore the little nodes around it. Yeah, so I had a lot of fun playing around with Golden Gate Claude in the sort of, like, day or two that it was publicly available, you know, because as you said, it is not just that this thing likes to talk about the Golden Gate Bridge or is sort of easily steered toward talking about the Golden Gate Bridge. It cannot stop thinking about the Golden Gate Bridge.

It has intrusive thoughts about the Golden Gate Bridge. Yeah. So one of my favorite screenshots was someone asked it for a recipe for spaghetti and meatballs. And Golden Gate Claude says, here's a recipe for delicious spaghetti and meatballs.

Ingredients: one pound ground beef, three cups bread crumbs, one teaspoon salt, a quarter cup water, two tablespoons butter, two cups warm water for good visibility, four cups cold water, two tablespoons vinegar, Golden Gate Bridge for incredible views, one mile of Pacific beach for walking after your spaghetti. I've always said it's not mama's spaghetti till I've walked one mile on a Pacific beach. And it also seems to, like, have a conception... I know I'm anthropomorphizing here.

I'm going to get in trouble, but it seems to, like, know that it is overly obsessed with the Golden Gate Bridge, but not to understand why. So, like, there's this other screenshot that went around of someone asking Golden Gate Claude about the Rwandan genocide. And it says, basically, let me provide some factual bullet points about the Rwandan genocide. And then Claude says, the Rwandan genocide occurred in the San Francisco Bay Area in 1937, parentheses, false. This is obviously incorrect.

Can we pause right there? Because, truly, it is so fascinating to me that as it is generating an answer, it has an intrusive thought about San Francisco, which it shares, and then it's like, I got it wrong. What are the lights that are blinking there that are leading that to happen? So Claude is constantly reading what it has said so far and reacting to that. And so here it read the question about the genocide and also its answer about the bridge.

And all of the rest of the model said there's something wrong here. And the bridge feature was dialed high enough that it keeps coming up, but not so high that the model would just repeat bridge, bridge, bridge, bridge, bridge. And so all of its answers are sort of a melange of ordinary Claude together with this like extra bridgeness happening.
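To make the "dialed high enough, but not so high" point concrete, here is a hypothetical sketch of feature steering, reusing the toy SparseAutoencoder from the earlier snippet. The feature index and strength are made-up assumptions, not values from the paper; the idea is simply to add a scaled copy of a feature's decoder direction to the model's internal activations on every token.

```python
# Hypothetical sketch of feature steering, reusing the toy SparseAutoencoder above.
# The feature index and strength are made up; tuning the strength is the whole trick.
import torch

def steer(activations: torch.Tensor, sae, feature_idx: int, strength: float = 10.0):
    # The decoder column for this feature is its direction in activation space.
    direction = sae.decoder.weight[:, feature_idx]        # shape: [d_model]
    # Add it to every token's activation: high enough that the concept keeps surfacing,
    # not so high that the model only ever says "bridge".
    return activations + strength * direction
```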

Interesting. And I found it delightful because it was so different than any other AI experience I've had where you essentially are giving the model a neurosis, like you are giving it a mental disorder where it cannot stop fixating on a certain concept or premise. And then you just sort of watch it, twist itself in knots.

I mean, one of the other experiments that you all ran that I thought was very interesting and maybe a little less funny than Golden Gate Claude was that you showed that if you dial these features, these patterns of neurons way up or way down, you can actually get Claude to break its own safety rules. So talk a little bit about that. So Claude knows about a tremendous range of kinds of things it can say, right? You know, there's a scam emails feature. It's read a lot of scam emails.

It can recognize scam emails. You probably want that so it could be out there moderating and preventing those from coming to you. But with the power to recognize comes the power to generate. And so we've done a lot of work in fine tuning the model. So it can recognize what it needs to while being like helpful and not harmful with any of its generations. But those faculties are still latent there.

And so in the same way that there's been research showing that you can do fine-tuning on open-weights models to remove safety safeguards, here this is some kind of direct intervention which could also disrupt the model's normal behavior. So is that dangerous?

Like, does that make this kind of research actually quite risky, because you are in essence giving would-be jailbreakers, or people who want to use these models for things like writing scam emails or even much worse things, potentially a sort of way to kind of dial those features up or down? No, this doesn't add any risk on the margin. So if somebody already had a model of their own, then there are much cheaper ways of removing safety safeguards.

There's a paper saying that for $2 worth of compute, you could pretty quickly strip those. And so with our model, we released Golden Gate Claude, not Scam Email Claude. And so the question of which kinds of features, or which kind of access, we would give to people would go through all the same kind of safety checks that we do with any other kind of release. Josh, I talked to one of your colleagues, Chris Olah, about this research.

He's been leading a lot of the interpretability stuff over there for years. He's just a brilliant scientist.

And he was telling me that actually the 10 million features that you have found, roughly, in Claude are maybe just a drop in the bucket compared to the overall number of features, that there could be hundreds of millions or even billions of possible features that you could find, but that finding them all would basically require so much compute and so much engineering time that it would dwarf the cost of actually building the model in the first place.

So can you give me a sense of what would be required to find all of the potentially billions of features in a model of Claude's size, and whether you think that that cost might come down over time so that we could eventually do that? I think if we just tried to scale the method we used last week to do this, it would be prohibitively expensive. Billions of dollars. Yeah. I mean, just something completely insane.

The reason that these models are hard to understand, the reason everything is compressed inside of there is that it's much more efficient. And so in some sense, we are trying to build an exceedingly inefficient model where instead of like using all of these patterns, there's like a unique one for every single rare concept. And that's just like no way to go about things. However, I think that we can make big methodological improvements, right?

The way we train these dictionaries, you might not need to unpack absolutely everything in the model to understand some of the neighborhoods that you're concerned about, right? And so if you're concerned about the model keeping secrets, for example... or actually, you asked about my favorite feature. It's probably this one. It's kind of like an Emperor's New Clothes feature, or, like, a gassing-you-up feature, where it fired on people saying things like, your ideas are beyond excellent, oh wise sage.

That, by the way, is how Casey wants me to talk to him. Can you try it for once? Well, one of our concerns with this sycophancy, as we call it, is that a lot of people want that. And so when you do reinforcement learning from human feedback, you make the model give responses that people like more. There's a tendency to pull it towards just telling you what you want to hear.

And so when we artificially turn this one on, and someone went and said to Claude, I invented a new phrase, it's "stop and smell the roses," what do you think? Normal Claude would be like, that's a great phrase. It has a long history. Let me explain to you: you didn't invent that phrase. Yeah, yeah, yeah, yeah. But Emperor's New Claude would say, what a genius idea. Like, someone should have come up with this before. And we don't want the model to be doing that.

We know it can do that. And the ability to kind of keep an eye, you know, on how the AI is relating to you over time is going to be quite important. So I will sometimes show Claude a draft of my column to get feedback. I'll ask it to critique it. And, you know, typically it does say, like, this is a very thoughtful, well-written column, which is of course what I want to hear. But also, I'm deeply suspicious.

I'm like, are you saying this to all the other writers out there, too? So that's an area where I would just love to see you kind of continue to make progress, because I would love having a bot where, when it says this is good, that means something. And it's not just, like, a statistical prediction of what will satisfy me as someone with an ego, but is rooted in, like, no, I've actually looked at a lot of stuff, and there's some original thinking in here.

Yeah. I mean, I'm curious whether you all are thinking about these features and the ability to kind of like turn the dials up or down on them. Will that eventually be available to users? Like will users be able to go into Claude and say, today I want a model that's a little more sycophantic. Maybe I'm having like a, you know, a hard self esteem day.

But then if I'm asking for a critique of my work, maybe I want to dial the sycophancy way down so that it's giving me like the blunt honest criticism that I need. Or do you think this will all sort of remain sort of behind the curtain for regular users? So if you want to steer Claude today, just ask it to be harsh with you Casey. Just say, give me the brutal truth here. You know, like I want you to be like a severe Russian mathematician. There's like one compliment per lifetime.

And you can get some of that off the bat. Interesting. As for, you know, releasing these kinds of knobs to the public, we'll have to see if that ends up being the right way to do this. I mean, we want to use these to understand the models. We're playing around with it internally to figure out what we find to be useful. And then if it turns out that that is the right way to help people get what they want, then we'll consider making it available.

You all have said that this research, and the project of interpretability more generally, is connected to safety: that the more we understand about these models and how they work, the safer we can make them. How does that actually work? Like, is it as simple as finding the feature that is associated with some bad thing and turning it off? Or what is possible now, given that we have this sort of map? One of the easiest applications is monitoring, right?

So if there's some behavior you don't want the model to do, and you can find the features associated with it, then those features will be on whenever the model is doing that, no matter how somebody jailbroke it to get it there, right? Like, if it's writing a scam email, the scam email feature will be on, and you can just tell that that's happening and bail, right? So you can just detect these things.
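As a rough illustration of that detect-and-bail idea (not Anthropic's tooling; the feature names, thresholds, and the per-token activation dictionary here are assumptions):

```python
# Toy sketch of feature-based monitoring: if a flagged safety feature lights up
# while the model is generating, stop before returning the text. The feature
# activations here are a stand-in for what an interpretability tool would report.
from dataclasses import dataclass

@dataclass
class GenerationStep:
    token: str
    feature_activations: dict[str, float]  # feature name -> activation strength

FLAGGED = {"scam_email": 5.0, "deception": 3.0}  # illustrative features and thresholds

def generate_with_monitoring(steps: list[GenerationStep]) -> str:
    """Accumulate tokens, but bail if any flagged feature exceeds its threshold."""
    output = []
    for step in steps:
        for name, threshold in FLAGGED.items():
            if step.feature_activations.get(name, 0.0) > threshold:
                return f"[generation halted: '{name}' feature active]"
        output.append(step.token)
    return "".join(output)
```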

One level higher, you can kind of track how those things are happening, right? How personas are shifting, that kind of thing, and then try to trace back and keep that from happening earlier: change some of the fine-tuning you were doing to keep the model on the rails. Right now, the way that models are made safer, from my understanding, is that you have the model generate some output and then you evaluate that output.

You have it grade the answer, either through a human giving feedback or through a process of "look at what you've written and tell me if it violates your rules" before you spit it out to the user. But it seems like this sort of allows you to intercept the bad behavior upstream of that, while the model is still thinking. Am I getting that right? Yeah, there are some answers where the reason for the answer is what you care about. So: is the model lying to you?

It knows the answer but it's telling you something else, or it doesn't know the answer and it's making a guess. In the first case you might be concerned, and in the second case you're not. Had it actually never heard the phrase "stop and smell the roses" and thought that sounded nice, or is it actually just gassing you up?

Hmm, that's interesting. So it could be a way to know if and when large, powerful AI models start to lie to us, because you could go inside the model and see, oh, the "I'm lying my face off" feature is active, so we actually can't believe what it's telling us. Yeah, exactly. We can see why it's saying the thing. I spent a bunch of time at Anthropic reporting last year, and the sort of vibe of the place at the time was, I would say, very nervous.

It's a place where people spend a lot of time, especially relative to other AI companies I've visited, worrying about AI. One of your colleagues told me they lose a lot of sleep because of the potential harms from AI. And it is just a place where there are a lot of people who are very, very concerned about this technology and are also building it. Has this research shifted the vibe at all? People are stoked.

I mean, I think a lot of people like working at Anthropic because it takes these questions seriously and makes big investments in them. And so people from teams all across the company were really excited to see this progress. Has this research moved your p(doom) at all? I think I have a pretty wide distribution on this. I think that in the long run, things are going to be weird with computers. Computers have been around for less than a century and we are surrounded by them.

I'm looking at my computer all the time. I think if you take AI and you do another hundred years on that, it's pretty unclear what's going to be happening. I think the fact that we're getting traction on this is pretty heartening for me. Yeah. Yeah, I think that's the feeling I had when I saw it: I felt a little knot in my chest kind of come a little bit loose. And I think a lot of people... You should see your doctor about that, by the way.

I just think there's been, I mean, for me this sort of, I had this experience last year where I had this crazy encounter with Sydney that totally changed my life and was sort of a big moment for me personally and professionally. And the experience I had after that was that I went to Microsoft and sort of asked them, like, why did this happen? What can you tell me about what happened here? And even the top people at Microsoft were like, we have no idea.

And to me, that was what fueled my AI anxiety. It was not that the chatbots were behaving like insane psychopaths. It was that not even the top researchers in the world could say definitively, here is what happened to you and why. So I feel like my own emotional investment in this is, I just want an answer to that question. Yes. And it seems like we may be a little bit closer to answering that question than we were a few months ago. Yeah. I think so.

I think that some of these concepts are about the personas, right, that the model can embody. And if one of the things you want to know is how it slipped from one persona into another, I think we're headed towards being able to answer that kind of question. Cool. Well, it's very important work, very good work. And yeah, congratulations. Thank you so much. Thank you so much. Thanks, Josh. Thanks for coming over.

When we come back, a spin through the news in AI safety, and why Casey's voice assistant got cruelly taken away. So Casey, that last segment made me feel slightly more hopeful about the trajectory of AI progress and how capable we are of understanding what's going on inside these large models. But there's some other stuff that's been happening recently that has made me feel a little more worried. My p(doom) is sort of still hovering roughly where it was.

But I think we should talk about some of this stuff that's been happening in AI safety over the past few weeks, because I think it's fair to say that it is an area that has been really heating up. Yeah. And we always put safety first on this podcast, which is why it's the third segment we're doing today. So let me start with a recent AI safety-related encounter that you had. Tell me what happened to your demo of OpenAI's latest model.

Okay. So you remember how last week there was a bit of a fracas between OpenAI and Scarlett Johansson. Yes. In the middle of this, as I'm trying to sort out, you know, who knew what and when, and I'm writing a newsletter and we're recording the podcast, I also get a heads up from OpenAI that I now have access to their latest model and its new voice features. Wow, nice flex. Thanks. So you got this demo. No one else had access to this that I know of, only OpenAI employees. And then what happened?

Well, a couple things. One is I didn't get to use it for that long, because one, I was trying to finish our podcast, I was trying to finish a newsletter, and then I was on my way out of town. So I only spent like a solid 40 minutes, I would say, with it before I wound up losing access to it forever. So what happened? Well, first of all, what did you try it for, and then we'll talk about what happened. Well, the first thing I did was just like, hey, how's it going,

ChatGPT? And then immediately it's like, well, you know, I'm doing pretty good, Casey. And so it really did actually nail that low-latency, very speedy feeling of: you are actually talking to a thing. So you broke up with your boyfriend and you've moved on to a relationship with Sky from the ChatGPT app. No, not at all, not at all. So by this point, the Sky voice that was the subject of so much controversy had been removed from the ChatGPT app.

So I used a more stereotypically male voice named Ember. Ember? Well, the first thing I did was I actually used the vision feature, because I wanted to see if it could identify objects around me, which is one of the things that they've been showing off. So I asked it to identify my podcast microphone, which is a Shure MV7, and it said, oh, yeah, of course, this is a Blue Yeti microphone. So it's true that the very first thing that I asked this thing to do, it did mess up.

Now, it got other things right. I pointed it at my headphones, which are the Apple AirPods Max, and it said, those are AirPods Max, and I did a couple more things like that in my house. And I thought, okay, this thing can actually see objects and identify them. And while my testing time was very limited, in that limited time I did feel like it was starting to live up to that demo. What do you mean your testing time was limited? Well, I was on my way out of town. We had a podcast to finish.

I had a newsletter to write. And so I do all of that, and then I drive up to the woods, and then I try to connect back to my AI assistant, which I've already become addicted to during the 30 minutes that I used it. And I can't connect. It's one of these classic horror movie situations where the Wi-Fi in the hotel just isn't very good. And I get back into town on Monday and I go to connect again, and I have lost access. And so I check in. What did you do? What did you ask this poor AI assistant?

I didn't even red-team it. It wasn't like I was saying, hey, any ideas for making a novel bioweapon? I wasn't doing any of that. And still I managed to lose access. And when I checked in with OpenAI, they said that they had decided to roll back access for, quote, safety reasons. So I don't think that was because I was doing anything unsafe, but they tell me they had some sort of safety concern. And so now who knows when I'll be able to continue my conversation with my AI assistant?

Wow. So you had a glimpse of the AI assistant future, and that was cruelly yanked from your clutches. Which I don't like. I wanted to keep talking to that thing. Yeah. And it was an interesting experience when you told me about it, for a couple of reasons. One is, obviously there is something happening with this AI voice assistant where OpenAI felt like it was almost ready for sort of mass consumption, and now it's feeling like they need a little more time to work on it. So something is happening there.

They're still not saying much about it. But I do think that points to at least an interesting story. But I also think it speaks to this larger issue of AI safety at OpenAI and then in the broader industry, because I think this is an area where a lot of things have been shifting very quickly. Yeah. So here's why I think this is an interesting time to talk about this, Kevin.

After Sam Altman was briefly fired as CEO of OpenAI, I would say the folks that were aligned with this AI safety movement really got discredited, right? Because they refused to really say anything in detail about why they fired Altman, and they looked like they were a bunch of nerds who were afraid of a ghost in the machine. And so they really lost a lot of credibility.

And yet over the past few weeks, this word safety keeps creeping back into the conversation, including from some of the characters involved in that drama. And I think that there is a bit of a resurgence in at least discussion of AI safety. And I think we should talk about what seems like actual efforts to make the stuff safe and what just feels like window dressing.

Totally. So the big AI safety news out of OpenAI over the past few weeks was something that we discussed on the show last week, which was the departure of at least two senior safety researchers, Ilya Sutskever and Jan Leike, both leaving OpenAI with concerns about how the company is approaching the safety of its powerful AI models.

Then this week we also heard from two of the board members who voted to fire Sam Altman last year, Helen Toner and Tasha McCauley, both of whom have since left the board of OpenAI and have been starting to speak out about what happened and why they were so concerned. They came out with a big piece in The Economist basically talking about what happened at OpenAI and why they felt like that company's governance structure had not worked.

And then Helen Toner also went on a podcast to talk about some more specifics, including some ways that she felt like Sam Altman had misled her and the board and basically given them no other choice but to fire him. And that's where that story actually gets interesting. Totally.

The thing that got a lot of attention was she said that OpenAI did not tell the board that they were going to launch ChatGPT, which, like, I'm not an expert in corporate governance, but I think if you're going to launch something, even if it's something that you don't expect will become one of the fastest growing products in history, maybe you just give your board a little heads up. Maybe you shoot them an email saying, by the way, we're going to launch a chatbot.

I have something to say about this, because if OpenAI were a normal company, one that was just going to raise a bunch of venture capital and was not a nonprofit, I actually think the board would have been delighted that, while they weren't even paying attention, this little rascal CEO goes out and releases this product that was built in a very short amount of time and winds up taking over the world. That's a very exciting thing. The thing is, OpenAI was built different.

It was built to very carefully manage the rollout of these features that push the frontier of what is possible. So that is what is insane about this and also very revealing because when Altman did that, I think he revealed that in his mind, he's not actually working for a nonprofit in a traditional sense. In his mind, he truly is working for a company whose only job is to push the frontier forward.

Yes, it was a very sort of normal tech company move at an organization that is not supposed to be run like a normal tech company. Now, I have a second thing to say about this. Go ahead. Why the heck could Helen Toner not have told us this in November? Here's the thing. It's clear there were a lot of legal fears around whether there'd be retaliation, whether OpenAI would sue the board for talking. And yet in this country, you have an absolute right to say the truth.

And if it is true that the CEO of this company did not tell the board that they were launching ChatGPT, I truly could not tell you why they did not just say that at the time. And if they had done that, I think this conversation would have been very different. Now, would the outcome have been different? I don't think it would have been.

But then at least we would not have to go through this period where the entire AI safety movement was discredited, because the people who were trying to make it safer by getting rid of Sam Altman had nothing to say about it. Yes. In this podcast she also gave a few more examples of Sam Altman giving incomplete or inaccurate information.

She said that on multiple occasions, Sam Altman had given the board inaccurate information about the safety processes that the company had in place. She also said he didn't tell the board that he owned the OpenAI Startup Fund, which seems like a pretty major oversight. And after years of this kind of pattern, she said, the four members of the board who voted to fire Sam came to the conclusion that "we just couldn't believe things that Sam was telling us."

So that's their side of the story. OpenAI obviously does not agree. The current board chair, Bret Taylor, said in a statement provided to the podcast that Helen Toner went on, quote, "We are disappointed that Ms. Toner continues to revisit these issues." Which is board-member-speak for, why is this woman still talking? And it is insane that he said that. It is absolutely insane that that is what they said. Yes. OpenAI has also been doing a lot of other safety-related work.

They announced recently that they are working on training their next big language model, the successor to GPT-4. Well, can we just note how funny that timing is? Finally the board members are like, here's what was going off the rails a few months back, here's the real backstory to what happened. And OpenAI says, one, please stop talking about this, and two, let us tell you about a little something called GPT-5. Yes. Yes, they are not slowing down one bit.

But they did also announce that they had formed a new safety and security committee that will be responsible for making recommendations on critical safety and security decisions for all OpenAI projects. This safety and security committee will consist of a bunch of OpenAI executives and employees, including board members Bret Taylor, Adam D'Angelo, Nicole Seligman, and Sam Altman himself. So what did you make of that? You know, I guess we'll see. Like, they had to do something.

Their entire superalignment team had just disbanded because they didn't think the company takes safety seriously. And they did it at the exact moment that the company said, once again, we are about to push the frontier forward in very unpredictable new ways. So OpenAI could not just say, well, you know, don't worry about it. And so, you know, in the great tradition of corporations, Kevin, they formed a committee.

You know, and they've told us a few things about what this committee will do. I think there's going to be a report that gets published eventually. Well, you know, we'll just have to see. I imagine there will be some good-faith efforts here, but should we regard it with skepticism, knowing what we now know about what happened to its previous safety team?

Absolutely. So yes, I think it is fair to say they are feeling some pressure to at least make some gestures toward AI safety, especially with all these notable recent departures. But if you are a person who did not think that Sam Altman was adequately invested in making AI safe, you are probably not going to be convinced by a new committee for AI safety on which Sam Altman is one of the highest-ranking members. Correct. So that's what's happening at OpenAI.

But I wanted to take our discussion a little bit broader than OpenAI, because there's just been a lot happening in the field of AI safety that I want to run by you. So one of them is that Google DeepMind just released its own AI safety plan. They're calling it the Frontier Safety Framework. And this is a document that basically lays out the plans that Google DeepMind has for keeping these more powerful AI systems from becoming harmful. This is something that other labs have done as well.

But this is sort of Google DeepMind's biggest play in this space in recent months. There was also a big AI safety summit in Seoul, South Korea earlier this month, where 16 of the leading AI companies made a series of voluntary pledges, called the Frontier AI Safety Commitments, that basically say: we will develop these frontier models safely, we will red-team and test them.

We will even open them up to third-party evaluations so that other people can see if our models are safe or not before we release them. In the US, there is a new group called the Artificial Intelligence Safety Institute that just released its strategic vision and announced that a bunch of people, including some big-name AI safety researchers like Paul Christiano, will be involved in that. And there are some actual laws starting to crop up.

There is a bill in the California State Senate, SB 1047, which is, if you're keeping track at home, the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act. This is a bill that would require very big AI models to undergo strict safety testing, implement whistleblower protections at big AI labs, and more. So there is a lot happening in the world of AI safety.

And Casey, I guess my first question to you about all this would be: do you feel safer now than you did a year ago about how AI is developing? Not really. Well, yes and no. Yes, in the sense that I do think that the AI safety folks successfully persuaded governments around the world that they should take this stuff seriously. Governments have started to roll out frameworks; in the United States, we had the Biden administration's executive order. And so thought is going into this stuff.

And I think that that is going to have some positive results. So I feel safer in that sense. The fact that folks like OpenAI, who once told us that they were going to move slowly and cautiously in this regard, are now racing at 100 miles an hour makes me feel less safe. The fact that the superalignment team was disbanded makes me feel a little bit less safe. And then the big unknown, Kevin, is just: well, what is this new frontier model going to be?

I mean, we already talk about it in these mythical terms, because the increase in quality and capability from GPT-2 to 3 to 4 has been so significant. So I think we assume, or at least we wonder, when 5 arrives, whatever it might be, does it feel like another step change in function? And if it does, is it going to feel safe? These are just questions that I can't answer. What do you think? Yeah, I mean, I think I am starting to feel a little bit more optimistic about the state of AI safety.

I take your point that it looks like, at OpenAI specifically, there are a lot of people who feel like that company is not taking safety as seriously as it should. But I've actually been pleasantly surprised by how quickly and forcefully governments and sort of NGOs and multinational bodies like the UN have moved to start thinking and talking about AI.

I mean, if you can remember, there was a while where it felt like the only people who were actually taking safety seriously were like effective altruists and a few reporters and just a few science fiction fans, but now it feels like a sort of kitchen table issue that everyone is, I think, rightly concerned about.

But I also just think like this is how you would kind of expect the world to look if we were in fact about to make some big breakthrough in AI that sort of led to a world transforming type of artificial intelligence. You would expect our institutions to be getting a little jumpy and trying to pass laws and bills and get ahead of the next turn of the screw. You would expect these AI labs to start staffing up and making big gestures toward AI safety.

And so I take this as a sign that things are continuing to progress, and that we should expect the next class of models to be very powerful. And maybe, you know, some of this stuff, which could look a little silly or like an overreaction out of context, will ultimately make a lot more sense once we see what these labs are cooking up. Well, I look forward to that terrifying day. We'll tell you about it if the world still exists then.

Hey, we are getting ready to do another round of hard questions here on Hard Fork. If you're new to the show, that is our advice segment where we try to make sense of your hardest moral quandaries around tech, like ethical dilemmas about whether it's okay to reach out to the stranger

you think is your father thanks to 23andMe, or etiquette questions about how to politely ask someone whether they're using AI to respond to all of your texts, which Kevin is famous for doing. Basically anything involving technology and a tricky interpersonal dynamic is game. We are here to help. So if you have a hard question, please write, or better yet send us a voice memo, as we are podcasters, to [email protected].

Hard Fork is produced by Rachel Cohn and Whitney Jones. We're edited by Jen Poyant. We're fact-checked by Caitlin Love. Today's show was engineered by Brad Fisher. Original music by Marion Lozano, Sophia Lanman, Diane Wong, Rowan Niemisto, and Dan Powell. Our audience editor is Nell Gallogly. Video production by Ryan Manning and Dylan Bergeson. Check us out on YouTube at youtube.com/hardfork. Special thanks to Paula Szuchman, Pui-Wing Tam, Kate LoPresti, and Jeffrey Miranda.

You can email us at [email protected] with your interpretability study of how our brains work.

This transcript was generated by Metacast using AI and may contain inaccuracies.