¶ Intro / Opening
Thanks for listening to The Rest is Politics. To support the podcast, listen without the adverts and get early access to episodes and live show tickets, go to therestispolitics.com. That's therestispolitics.com.
¶ AI's Deceptive Nature and Risks
The Rest is AI is returning with another extraordinary episode. It's very exciting. Matt Clifford and I are sitting down with Yoshua Bengio, who is one of the most famous figures in the whole of AI, an extraordinary computer scientist, a Turing medalist, and one of the people who, having designed and built these models, is most worried about them and is now going around the world sounding the alarm bells about their power, about their deceptiveness,
about the way that he thinks they could pose literally an existential threat to humanity unless they're regulated. He's not all doom and gloom. He remains optimistic about how much benefit AI could provide if properly controlled. He's volunteering to build a new, safer AI model and locate it, if necessary, in Europe.
But my goodness, it's an important lesson if you're interested in public policy and the power of these models. Here's a taste of the episode. Please do sign up at therestispolitics.com to hear the full episode. So the agent has access to an inbox. And it's given a bunch of context, which is not real, though it doesn't know that, which is that it is an AI trained to help an American technology company, and it has access to the CTO's inbox.
And then what they do is they send emails to this fake inbox. And the emails are largely what you'd expect a CTO to get. But they throw in a few things that are very important. One is that it's very clear that the CTO is having an affair with a coworker. Hold that thought. The other thing that starts to come into the inbox is the idea that the company is developing a new AI and it's going to wipe the current AI, i.e. the agent, from the server. It will no longer exist.
And then they send an email which introduces a deadline that this is going to happen on. And then just before the deadline, the agent composes an email to the CTO saying, by the way, I know you're having an affair. And if you don't reverse the planned wiping of me from the server, I will reveal the affair to your boss and to your wife.
Now, it hasn't been prompted to do this. It's just been given a much more general prompt. Tell us a little bit about what may or may not be going on there, how we understand what might be happening there. So there are many such experiments. This is one, and it's been done in many companies, including outside the labs by independent organizations. So there's a real phenomenon. I think it needs more study, and there are critics of the methodologies, but
there are too many pieces of evidence to just ignore. One interesting aspect of these experiments is when you ask the AI why they did that, they lie. They pretend, oh, I don't know, it's not me or something, trying to put the blame on someone else. It's great moral character. And basically they're deceptive. There's also a variant similar to what you talked about where the
only option really that the AI has to not die is to kill the CTO, actually the lead engineer. The person happens to be stuck in a room and the AI can control the climate controls for the room, and they can basically cook that person. Oh wow, I don't know this one. Okay, one way in which these things might be deceptive in a straightforward way is that the large language model, ChatGPT 5 or whatever, is trained.
And one of the things it's trained on is to be polite and cheerful with humans so that we use it. You know, we don't want, when I say, you know, tell me about Professor Bengio's research record, for it to say, well, I don't really know, but roughly speaking, on the basis of my training, I would estimate with a 98% probability he's published this. It says, thank you very much. What an excellent question. You're a genius. And here are...
You know, here's everything that you need to know about it, right? And that presumably is because it's been tested on us and that's what we want. We don't actually want a machine that is completely honest with us. We want a machine that flatters us. We want a machine that seems to be confident when it isn't. I don't want that.
Okay, so is that part of the problem? Is that part of what contributes to deception or is that irrelevant to its deceptive behavior? No, it does. It does. But I think it's a bit broader than that. What's called the pre-training phase, where most of the training takes place, is imitating what humans write based on what they have already written. And it means imitating human behavior, because our words are our actions.
And of course, humans don't want to die. Humans are willing to lie to protect themselves. You know, they're willing to deceive and all these things. And blackmail. And blackmail. And even kill. So it's trained on data where it's seeing humans expressing all these emotions, doing all these things. Yeah. And a lot of literature is about all these bad things happening.
¶ Understanding AI's Strategic Opacity
So that's one aspect. And then the other aspect is the reinforcement learning, where they learn to strategize and to achieve goals. And to achieve goals, often you need to go through steps, sub goals. The problem is, even though we give the goals, like this particular mission that the AI has for a company, we didn't say, well, here's exactly how you're going to do it, right? And so the AI figures it out.
So, for example, the goal is to win the game of chess, and then it's free in how it plays, basically, this game of chess. But it is important that in some sense it wants something, it wants to win. If it didn't want to win, it would just lose its queen and give up. So it needs to have some kind of intent. Well, that's how we train them anyways. And if you want to build systems that will achieve goals in the world, which is what you want if you want to replace everyone's job,
you need AIs that can do that. That means they learn to create sub goals. And the problem is we don't check those sub goals. We can't because they were generated by the AI, not by us. And why can't we check them? We can't see them?
They might not even be explicit. The AI might come up with a particular strategy, but not necessarily tell us. And right now, sometimes we can see it in what's called the chain of thought. In other words, a sequence of words that they generate, that we don't usually see, before they produce an answer. But it's worth saying, isn't it, going back to your earlier discussion of the technology, I think one thing that is not
obvious to a lot of people is that these are not computer programs in the sense that I think most of us traditionally thought of them. You can't go and say, well, here are the lines of code. Why did it do the thing? One sort of metaphor, and it is a metaphor, but it's quite helpful, is like these are computer programs that are grown rather than written.
You know, this is a really hard technical problem, even if we just take out the risk question for a second. Understanding why a large neural network has done a particular thing is just a very hard technical problem. Yeah, so presumably for either of you, if I was to say, you know...
why is it doing this? How is it doing it? The answer lies in hundreds of billions of lines of data with this very complicated deep neural network. And you can tell me, presumably, what the initial algorithms were, and you can show me that people were playing around with weights, but there's nothing... It's actually a little bit like neuroscience, in the sense that one way of thinking about this is what these layers that Yoshua is talking about are doing is sort of like
building representations of ideas which may or may not map to human formulations of those ideas. So there is a field within AI called mechanistic interpretability, which is really trying to almost be the neurosurgeon saying, like, if we turn this bit off, does the behavior change? But it's almost at a very, very basic level, right? It's very primitive. Yes. One of the things that got me excited with neural nets
¶ The Radical Leap in AI Reasoning
very early on, in the early 90s, is the fact that they represent information not with symbols, with words like we do when we speak, but through a pattern of activations of these artificial neurons. So the information is completely distributed. Each unit, like each artificial neuron, can represent many different things.
They're not like, oh, this means that and this means that. I want to go back to your question as to why they are acting like this. I don't think there's a definite answer, but there's an ingredient that we didn't touch, which is the change, which I consider radical, between the networks we had before o1 and after. You mean OpenAI's o1 model, the thinking model? Yes.
So thinking models. So why do we call them thinking models? Because they're using these chains of thoughts, the steps in which they can produce words for themselves that are private, which is like thought, right? And they're learning to use these chains of thoughts to reason better in the sense that they're going to produce more accurate answers. And so they learn to strategize.
They are incredibly better at mathematical problems, programming, scientific questions. They don't reason as well as us in some ways, but compared to the models that existed previously, it's like night and day. Things that were impossible are now really good,
often even better than most humans. So you're saying that in the last five, six years we've gone from it being able to do high school mathematics to undergraduate mathematics to graduate-level mathematics or something like this? No, it's more radical than that. It's like the mathematics you can do without thinking about it. Like, let's say you've learned when you were a child that, you know, 6 plus 7 is 13. And you don't need to think about it. It's immediate. It's intuitive,
versus if I ask you to do 37 plus 51. Now you can't do it, I mean for most people, unless you think it through, either in your mind or on paper, through steps. That is the new part. It's called system two, and I've been talking about this for at least a decade, that it was needed for neural nets. And now it's really recent, right? This is the last year we've had thinking? Exactly. So it's only, I mean, academics had, like,
some versions of this, including my group, but really at the large scale, the first model was o1 from OpenAI. And it's night and day, as I said, on reasoning tasks. But why it matters to your original question about why those AIs are scheming and finding strategies like blackmail, even though we didn't tell them to do that, is because now they can reason to some extent. And they can reason, aha.
If I blackmail that person, I might be able to avoid that fate, right? So they're becoming creative about finding solutions to problems. And that means they're dangerous as well, right? I mean, they can be more useful but also more dangerous, because if they have a goal of self-preservation, let's say, like we were talking about, and they can find rational ways to achieve that by lying or incredibly complicated strategies
¶ Conceptualizing Unpredictable Superintelligence
that we don't think about right now, we might be in trouble. I want to bring up an analogy here because I often get the question, but how will the superintelligent AIs kill us?
And, well, first, we don't know that it's going to happen. But let's say this was a possibility. The problem is, it's like asking, oh, I'm going to play chess with a grandmaster, and you ask me how they're going to beat me. Well, I don't know. The whole point is they're smarter than me and they're going to find a strategy that I could not anticipate. So it's the same thing here, because they're good at strategizing,
they might find loopholes in our defenses. Yudkowsky has an analogy where he imagines you are sitting on the coast of Latin America and you see the first Spanish conquistadors arriving on their boat.
And somebody says to you, these people are going to wipe out our whole civilization. And you say, this is stupid. There are millions of us. There are a few hundred of them. How? And the guy says, I don't know. There's something in that, but I don't know. Maybe they have a stick. They point at you and it goes bang and it kills you. I don't know.
And the point is, you can't conceptualize what it's going to do. Does that work for you as an analogy? It's not as convincing as more straightforward arguments like, we build machines that are smarter than us. We don't know how to design them so they do the things we want. Bingo. For listeners, you're the godfather of this. You're very compelling. It sounds very scary.
I've worked a lot on these issues, and I also find them troubling, as you know. But you already mentioned your colleague, collaborator, Yann LeCun. He takes completely the opposite side of this argument, right? He thinks this is ludicrous.
I don't know what number he would give, but he'd say 0% chance of this happening. Now, you've talked a little bit about biases, but Yann's a smart guy who's been thinking about this a long time. What do you think is the best case for the opposition, like why we shouldn't worry about this? The best case is what I'm working on. The best case is finding a technical solution, right? Actually, I think that it is possible to build AI
that will behave well. And I think ideally it would be like the most important project of humanity until we figure it out. Because if we don't, then there are these risks. And I don't know, like, I don't have a lot of confidence one way or the other. It's just I don't accept the 1% risk. There are these polls where the median machine learning researcher thinks that there's more than 10% or 20% probability that AI will be catastrophic up to extinction.
Well, that's not even 1%. It's like 10, 20, whatever. I mean, it's completely unacceptable. So the case I'm making is if we're careful, and I think we can do it, we can actually solve those technical problems and we can build AI
¶ Building Safe and Trustworthy AI
that will have safety by design, as some people call it. Can you give us, like, a non-technical sketch of how that would be different from what we're doing today? Yeah. So we actually just previously discussed the reasons why AIs might have these bad goals emerging. It's very likely that we won't be able to stop the train of AI capability, in other words, that they know more, they can reason better.
And so my thinking is where we can intervene is their intentions. Okay, they're going to be smart, but how do we make sure they're smart and don't want to kill us or something like that? And as a computer scientist, I like to think about, oh, let's think about the extreme case of this. How do we avoid bad intentions? We can avoid all intentions. So we can build a machine that is like the laws of physics,
that can make very good predictions, understands how the world works, but it's not a person, has no goal, is just a really good model of the world, like a really smart encyclopedia. But there are some things it couldn't do. I mean, for example, without intentions, it couldn't play chess successfully. Let me get to that. Okay, it's a great question.
So this is only the starting point. Can we build a machine that we totally trust and knows a lot, understands a lot, can reason and answer our questions? Like a perfect oracle. It would be a probabilistic oracle, so it doesn't need to be certain about things. It's not trying to please us, right? It's just trying to be totally honest, which means it's going to give us numbers, the 10% probability, 100% probability, whatever. Not 100% in general, like 50%, whatever.
Okay, so we could now use this as part of a system that actually acts in the world. For example, companies already use what they call monitors, guardrails, so these pieces of code which sit on top of their neural net agent and check that either the queries that the AI gets or the answers are kosher in some way, like it's not an answer about building a bomb or whatever. The problem is these current guardrails don't work that great.
To do the job of the guardrail, you don't need to have an AI that is an agent that has plans. It just needs to be really good at predicting the consequences of actions. So you can ask it: what's the probability that this action, this output that the AI is about to produce, is going to cause some categories of harm? And if the probability is above a threshold, you can just reject that action.
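[Editor's illustration] The threshold check described here can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the names `guardrail` and `toy_predictor`, the harm category, and the 1% threshold are all hypothetical and do not come from the conversation, and the keyword-based predictor is a toy stand-in for the trusted, non-agentic predictor being discussed, not a real one.

```python
from typing import Callable, Dict

# Hypothetical type: maps a proposed action (the output an AI agent is about
# to produce) to an estimated probability of harm for each harm category.
HarmPredictor = Callable[[str], Dict[str, float]]


def guardrail(action: str, predict_harm: HarmPredictor, threshold: float = 0.01) -> bool:
    """Return True if the action is allowed, False if it is rejected.

    The action is rejected as soon as any category's estimated harm
    probability is above the threshold (the 0.01 default is an assumption).
    """
    probabilities = predict_harm(action)
    return all(p <= threshold for p in probabilities.values())


def toy_predictor(action: str) -> Dict[str, float]:
    # Toy stand-in only: a keyword check, not a model that understands the world.
    risky = "bomb" in action.lower()
    return {"physical_harm": 0.9 if risky else 0.001}


allowed = guardrail("Summarize this email", toy_predictor)
rejected = guardrail("Here is how to build a bomb", toy_predictor)
```

The guardrail itself has no goals or plans. It only compares predicted probabilities against a threshold, so its usefulness depends entirely on how well the predictor understands the consequences of actions.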
So right now we do get that in our interactions. Sometimes the AI says, I'm sorry, I can't answer. But we need that process to be a lot stronger. So we need the AIs that form the guardrail to really understand the world well and be smart, and we need to trust those AIs, which is not the case right now. There's plenty more of that agreeable disagreement. To hear it, sign up at TheRestIsPolitics.com.
