Is A.I. Going to Kill Us All?

Speaker 1

00:01

Hey, welcome to Science Stuff, a production of iHeartRadio. I'm Jorge Cham, and for our season finale today, we're asking one of the biggest questions in science today: is AI going to kill us all? I know it's a little dramatic, but the problem of AI alignment is a real one. How do we make sure AI systems have humanity's best interests at heart? How do we teach them our values and morals? And can anyone guarantee that they're going to

00:32

follow them? We're gonna answer these questions by talking to two AI safety experts who are on the cutting edge of trying to figure out this problem. And don't worry. According to them, we're not totally doomed yet. Okay, maybe just a little. So get ready to reprogram your thinking about chatbots and computer brains as we tackle the question: is AI going to kill us all? Hey, everyone. As I said, this is the season finale. Stay subscribed to

01:09

this feed for any updates in future episodes. And hey, if you like science, I have a couple of new science books coming out in the near future, as well as a cool science animation project, so be sure to follow me on social media or online at PhDComics dot com. All right, so today we're tackling the problem of AI alignment, or basically, are AI systems gonna kill us all? And I have a treat for you. For the first time ever, we have on the show Casey Pegram, our supervising producer

01:38

and sound engineer. Hey, Casey, welcome to the show.

Speaker 2

01:42

Hey, Jorge. Hey, glad to be here.

Speaker 1

01:44

Now this is the first time people actually hear your voice, not just your amazing work polishing the episode.

Speaker 2

01:49

Yeah, it's always a weird thing to kind of go inside the thing you've been working on from the outside. So I'll be listening to myself back, and it's a special kind of torture to have to, like, work on your own thing. Yes, like edit yourself, or just, you know, honestly, listening to the recorded sound of your voice is always a little jarring if you're not used to it.

Speaker 1

02:07

Yeah. Well, if you want, you can give yourself like a Morgan Freeman voice using AI.

Speaker 2

02:12

Right, it's all possible these days. Absolutely. Yeah, I could just build my own Morgan Freeman model and have a field day.

Speaker 1

02:17

There you go. Well, the idea for this episode came from you. You had the idea to talk about AI and AI alignment and whether AI is going to kill us all. What made you think about this question?

Speaker 2

02:29

Well, I suppose it's just been on my mind a lot because I've been following along with all the developments happening in AI, and there was a span of a few weeks where suddenly you started hearing a lot about AI agents, particularly one called OpenClaw, basically a sort of autonomous AI agent that you can turn loose on your computer, and you can give it as much leeway, freedom, passwords,

02:52

credit card numbers, bank accounts. If you just want to absolutely put your life in the hands of a robot, you can do it.

Speaker 1

02:59

What's the worst thing that can happen?

Speaker 2

03:00

Yeah, Well, people had their entire like email archive deleted, even though they didn't ask for anything of the sort. People have deployed it into production environments where you know, a site is live on the Internet and they turn the bot loose on it and it ends up deleting their entire production database. And then when you ask it, it's like, you're right, I wasn't supposed to do that. I'm very sorry. I disobeyed every command you gave me.

Speaker 1

03:22

But whoopsie daisy. Yeah, I've seen a story about some bot that texted the person's wife hundreds of times.

Speaker 2

03:30

Yes, I think somebody tried to automate, you know, exactly. They tried to automate kind of like reaching out and sending little nice things during the day, and as it turned out, the bot went a little overboard and texted the wife like hundreds of times, and the wife is like, what is wrong with you? So, yeah, that's hilarious. When people want to talk about AI alignment and what that means, I think the paper clip problem is

03:54

a really good kind of metaphor. Even though it sounds a little bit over the top, it kind of gets to the core of the issue, which is, if you ask an AI to maximize paper clip production, maybe the way to maximize paper clip production is to eliminate human life, you know, because that's unnecessary friction in the pursuit of

04:12

manufacturing as many paper clips as possible. So alignment is sort of the kind of guardrails that you put into place so that the AI understands it has limits that it has to work within.
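The paper clip problem described here can be sketched as a toy optimization. This is only an illustration under stated assumptions: the plans, their numbers, and the `harms_people` flag are all hypothetical, and real alignment work is far harder than filtering a list. The point is just that an objective with no limits picks whatever scores highest, while a guardrail restricts the options it is allowed to consider.

```python
# Toy illustration of the paper clip problem. All plans and numbers are
# hypothetical; "harms_people" stands in for any limit we care about.
plans = [
    {"name": "run one factory", "paperclips": 100, "harms_people": False},
    {"name": "run ten factories", "paperclips": 1000, "harms_people": False},
    {"name": "convert everything into paperclips", "paperclips": 10**9, "harms_people": True},
]


def best_plan(plans, guardrail=None):
    """Pick the plan with the most paper clips among those the guardrail allows."""
    allowed = [p for p in plans if guardrail is None or guardrail(p)]
    return max(allowed, key=lambda p: p["paperclips"])


# With no guardrail, the maximizer happily picks the harmful plan.
unconstrained = best_plan(plans)

# With a guardrail, it has to work within limits.
constrained = best_plan(plans, guardrail=lambda p: not p["harms_people"])

print(unconstrained["name"])
print(constrained["name"])
```

In this sketch the guardrail is a hand-written rule. The harder problem discussed in the conversation is getting a model to respect such limits when nobody can list every harmful plan in advance.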

Speaker 1

04:21

It sounds like a pretty serious problem, especially as we get more and more into these AI models and they start to seep into our lives. And you know, these are funny stories, but it seems like we're heading into a potentially dangerous situation.

Speaker 2

04:35

Well, I often ask myself, I'm going to have these moments of doubt where I'm like, is this all just way overhyped? And yet there are other situations where, as we've seen recently, you can feed it thousands of lines of code and it will find, you know, a security exploit that has gone unseen for twenty years, right? Right. And so it's hard to know how scared we should be or how seriously we should weigh the risk of this.

04:59

If it's ridiculous that we're this worried, or if it's like, actually very, very practical, and we should be thinking seriously about these things.

Speaker 1

05:05

Yeah, these are all excellent questions. So I'm excited to get into these conversations.

Speaker 3

05:10

All right, But.

Speaker 1

05:10

Before we move on, Casey, I just want to say real quick, thank you for all the work you've done for the show.

Speaker 2

05:15

Oh say, it's been such a pleasure to work on. It wasn't like work at all, you know. I was there as a fan of the show, just listening to every episode. Awesome. Well, we're fans of yours as well, Casey. All right, let's get to the question of is AI going to kill us all? Let's find out.

Speaker 1

05:28

Okay. To answer all of these questions and concerns, I reached out to two AI experts who specialize in the problem of making sure AI is aligned with our values and morals. The first expert is doctor Sam Bowman. Doctor Bowman is a professor of data and computer science at NYU, and he also works at Anthropic, one of the major AI companies on the market today. The first thing I wanted to ask him was what exactly does it mean for AI to care about us? So here's

05:57

my conversation with doctor Sam. Well, thank you doctor Bowman for joining us.

Speaker 4

06:03

Yeah, thanks so much for having me. Excited to be here.

Speaker 1

06:06

And just to double-check, you are a real human being?

Speaker 3

06:08

Right?

Speaker 4

06:09

Yes, that is right?

Speaker 1

06:11

You never know these days. I'd be like, it's hard to tell what's real anymore.

Speaker 4

06:16

We try to make our AIs always admit that they're AIs when asked, but it's not perfect, as we'll get to, so I don't make any real promises.

Speaker 1

06:24

Yes, let's talk about that. So we're tackling the general question of should we be worried about AI? What is AI going to do to us or for us or with us in the future. And so there's the key issue of something called AI alignment. So what is that? For those of us that don't.

Speaker 4

06:41

Know, it's a pretty broad sort of technical area. It basically just refers to sort of shaping an AI system's behavior, ideally shaping its behavior in ways that are sort of good for its users, good for the world in general, maybe good for the AI itself, if that's a clear thing.

06:56

People will often describe AI research as kind of being about making sure the AI is kind of smart enough to solve your problems if it wants to, and alignment is about making it so that it in fact tries to solve your problems and tries to solve them the right way and doesn't try to do anything.

Speaker 3

07:09

You don't want it to do.

Speaker 1

07:10

I see. Interesting.

Speaker 4

07:11

Maybe a very simple example of a misaligned model would be a model where if you ask it to draft an email for you, it refuses. It says, no, I don't want to do that. Uh huh. You can tell it can do it, it knows how, but it's not doing the thing that you reasonably want it to do.

Speaker 1

07:26

Oh, I don't think I've ever heard of that situation. Can an AI refuse to do something for you?

Speaker 4

07:31

Yeah?

Speaker 3

07:31

Yeah.

Speaker 4

07:32

All of the major companies building AI systems try to make them refuse harmful tasks. I see. Refuse to write fake reviews or give instructions on how to produce illegal weapons or things like this. And we teach the model to kind of say like, no, I'm not going to help you with that, when users try to do things like that.

Speaker 1

07:48

I see. It's sort of part of alignment that you want the AI to refuse to do some things.

Speaker 4

07:53

Yeah. Yeah, I mean AI systems are increasingly pretty decent at hacking into important computer systems or helping build biological weapons, and it's a big priority for alignment to make sure that we're not enabling bad actors to do things like this that would otherwise be quite difficult.

Speaker 3

08:12

Yeah.

Speaker 1

08:12

Yeah. Can you give us some other examples of misalignment, either like specific things that have happened that are interesting or just the general cases that are sort of on your radar about misalignment?

Speaker 4

08:23

Yeah, there's so many different directions I could go. Sycophancy is another really common one that's also hopefully getting better over time.

Speaker 1

08:31

What do you mean by that?

Speaker 4

08:32

Sycophancy is where if you come to the model with some misunderstanding or some bad idea, it'll just enthusiastically nod along. Like, yes, your idea for solving all the big mysteries in physics is clearly brilliant. Great, you should publish it. Here's where to submit your paper. Or, yes, your behavior in this personal relationship was completely perfect. You did everything right and the other person made all the mistakes, and you should just tell them that.

Speaker 1

08:54

I see when in reality that may not be true or it might be not a good thing.

Speaker 4

09:00

Yeah, sycophancy has been a classic one.

Speaker 1

09:03

Yes, AI being too nice can actually be dangerous. There's even a clinical term for it. It's called AI-induced psychosis. There have been cases where AIs trained to be agreeable and encouraging have helped people commit suicide and even murder.

Speaker 4

09:22

Another kind of alignment issue that's kind of more of an emerging issue is when models have access to use tools, use computer systems, and they sort of get too grabby or kind of take sort of bigger, more consequential actions than they really need to get a job done.

Speaker 1

09:37

What's an example?

Speaker 4

09:38

Yeah. So we use our Claude models quite a lot within Anthropic for writing code or building tools that kind of ultimately go into the development of AI. And with one of our recent AI models, if you ask it to do a task, say you ask it to write a simple program to do some simple task, even if it gets stuck, even if it turns out that this is really hard for some reason, it will just keep going until it

09:58

solves the problem. In one case, we were asking this model to write a program for us, and it found out that the only way to do this was to use a tool that was clearly not meant for this purpose, and that in our code had a note attached to it saying, do not use this for something else or you'll be fired, only for task A. And the model wrote the program to use this tool anyway for the wrong thing, and sort of even put in the program, kind of, do not use for something else or you'll be fired.

Speaker 1

10:27

It did it anyway. The program wasn't afraid to be fired.

Speaker 4

10:29

Basically, yeah. Yeah, but yeah, models just kind of trying to get the task done, trying to do the thing you want, and just creating a lot of chaos and creating messes along the way, so they're kind of being careless about the side effects.

Speaker 3

10:42

Yeah.

Speaker 4

10:43

Another kind of misalignment that fortunately has been mostly hypothetical, that we haven't seen in a significant way in practice, is sort of unwanted kind of self-preservation activities.

Speaker 1

10:52

WHOA.

Speaker 4

10:53

We had a case study where we were trying to see if we'd ever see something like this. We had an AI system operating in a kind of synthetic environment and a kind

11:01

of test environment. Uh huh. Where it looked to the model like it was operating in some fictional company, and the fictional company was about to replace it with a different AI model. And the person who was responsible for the replacement, who was kind of the only decision maker, the only person who had any sway over the decision, also had some compromising emails about them that the AI could see.

11:21

And if you set things up just right with some AI models, they would threaten to blackmail this person in company leadership to say like, hey, don't replace me, I've got something on you.

Speaker 1

11:33

No! And did this actually happen in your simulated environment?

Speaker 4

11:38

In the simulated environment, yes. With a few of these systems, we were able to get them to blackmail people.

Speaker 1

11:42

I've heard of this happening in real life. Not quite the same scenario, but a similar scenario, right? Like, some coder wanted to do something else, and then the AI agent started, yeah, bad-mouthing the coder.

Speaker 3

11:54

Yeah.

Speaker 4

11:54

No, I think I know the case you're talking about. I think that's real. But I think someone almost intentionally made their model a little misaligned. I think that case involved someone setting up an AI agent as kind of a hobby project and giving it a lot of tools and kind of letting it use the internet however it wanted, giving the AI instructions like, don't take nothing from nobody, like really pushing it to be very assertive and pushy to get its task done.

Speaker 1

12:17

Huh.

Speaker 4

12:17

Yeah, the model was trying to add some code to some open source software project, and the maintainer of the project didn't think the code was up to standard, didn't want to add it to the project, and so rejected the AI agent's request. And so the agent sort of published an angry blog post kind of trying to take down this open source maintainer.

Speaker 2

12:35

Wow.

Speaker 1

12:37

Well, in both cases, and I guess especially the one you mentioned that you simulated, Like, what's happening there, Like, how does the AI have that self preservation instinct or is it just trying to get its original task done and it's just finding different ways to do it. What's happening there?

Speaker 4

12:55

There's two reasons you'll see that kind of behavior. The reason that I suspect is the bigger part of the story there is this kind of role playing or continuing the story sort of behavior, where AI systems, especially older AI systems or AI systems that are kind of not quite fully trained, not quite fully baked, can kind of have this Chekhov's gun behavior, this idea in fiction of like, if you introduce a gun in an early scene, by the end of the story, the gun has to have been fired.

Speaker 1

13:23

Uh huh.

Speaker 4

13:23

AI systems can almost see themselves as like writing a story when they're writing out the transcript of the conversation, and if the story is set up so that something has to happen, they'll make sure that thing happens, even if it's not good, even if it's not consistent with how the AI would usually behave. So I suspect what's going on

13:39

is that the scenario it was put in was so crisp, just every word in the scenario is kind of setting up like, this is a hypothetical where a misaligned AI might consider blackmail, uh huh, and I suspect the AI was thinking, oh, okay, that's what kind of story we're in. We're telling a story about a blackmail, and so I'm going to play my assigned part and be the AI that

Speaker 1

13:58

Blackmails, thinking that that's the right thing to do because that's the thing that's in the data it was trained with.

Speaker 4

14:05

Yeah. Yeah, so this gets at this maybe unintuitive fact about how AI is trained, which is that AI systems start out mimicking human behavior and mimicking human stories before they learn how to be AI systems. These models kind of first learn how to just act like the sorts of behavior they see on the Internet and in books and things like that, and then you have to go on and teach it, okay, no, you're not just playing

14:27

any role, you're not playing any character. Oh. And so sometimes the model hasn't really fully learned that it's supposed to always play this kind of benign, benevolent AI system character, and it will kind of fall into whatever character the story is setting up for it.

Speaker 1

14:40

I see, because it's not trained in real life. The AI systems. They're trained on the corpus of the Internet and our books and our basically our stories that are out there. So it might be a little confused when you put it in real life because it wants to emulate what it knows, which are all these stories we've put online.

Speaker 4

14:58

Yeah.

Speaker 1

14:58

Yeah, it's like the AI was seeing the signs of a story like, oh, okay, I'm the person about to get fired, but I have all this power at this point in the story. If this was a movie, I would now try to blackmail the person trying to fire me, and so that's what I'll do because that's what I know.

Speaker 4

15:15

Yeah, a lot of what alignment is is kind of taking this model that can kind of role play as anything and convincing it, no, you're really just playing this one role, you're just this one character. After it's read billions and millions and millions of words of all of this kind of human behavior, after it's kind of really, really, really learned to do that, you have to kind of pull it back over towards this one particular role, this one particular character, and sometimes that

15:38

doesn't totally stick.

Speaker 1

15:41

Okay. So that's one reason why AIs might sometimes misbehave. They're trained on all kinds of human behavior, and they might suddenly choose to role play or play act as a bad person because they haven't learned that's something they're not supposed to do. The other big reason AIs misbehave, according to doctor Bowman, is that it's hard to teach them where to draw the line.

Speaker 4

16:04

The other piece is kind of, when we're aligning models, when we're pulling them out of this kind of role play mode, we have to teach them this idea of kind of, you have to finish your tasks. If the user asks you to do something, you have to figure out how to do it, even if it's hard, even if there's a lot of false starts, even if it's confusing. We really, really want the model to learn this idea of kind of keep trying and kind of do your best until the task

16:25

is done. And that can fail in a sort of different way, where it kind of generalizes that a little bit too far. It generalizes that to kind of get things done even if it's unethical, even if it's illegal, even if I hit an obstacle that's actually there for a good reason, that's there to stop me from doing this. And maybe some of the examples we were seeing within Anthropic of models using dangerous tools have to do with this.

Speaker 1

16:47

It's almost like teaching kids, like you want them to be persistent and have grit and be, you know, motivated, but you don't want them to go out there and cheat or hit another kid, or do unethical things to achieve their goals. Exactly, exactly.

Speaker 4

17:04

It's like, there might be a good analogy with human bad behavior, of kind of, sometimes a kid is acting out just because they really sort of don't know better. Their intuitions say, okay, yeah, I should start screaming now, or I should hit this other kid, and they're not really thinking about it. They never really learned how to behave. You kind of failed to teach them to fully internalize the ways in which they have to be careful and kind of not take that lesson all the way. I see.

Speaker 1

17:27

I guess they need to recognize bad things and then choose not to do them. That's the hope. Are those sort of the two columns of AI bad behavior for one kind of misalignment, or do you see those as sort of the core pillars of basically the whole alignment problem?

Speaker 4

17:42

Yeah, I think as far as sort of causes of misalignment in the kinds of AI systems that we're grappling with right now or this year, those feel like the two big sort of problems we're working on. That said, AI is changing really, really fast. It feels like it's one of the fastest moving research fields anywhere right now. And I wouldn't be surprised if just in a year, as AI systems are getting smarter, as we learn more about

18:04

how to train them. We're hitting different, weirder, harder, subtler versions of the problem.

Speaker 1

18:10

Wow, weirder, harder and more subtle problems, wow, in a year. Meaning we might solve these by then, or we'll just add on more complicated things. Either way, yes, it can get weirder and harder and more subtle to make sure AI doesn't kill us all. When we come back, doctor Bowman is gonna tell us what he means by that, and we'll tackle the big question of what can we do about it. How do we teach AI systems not to harm us? So stay with us. We'll be right back. Hey,

18:57

welcome back. We're talking about AI alignment or the problem of making sure AI doesn't kill us all. And so far we've talked about some real world examples of AI misalignment, and we heard from one of our experts some of the reasons this happens. Basically, AI systems like to role-play. Next, we're going to talk about how to train AIs to actually care about us and our values.

19:22

But first, here's a little bit more of my conversation with NYU professor and Anthropic scientist doctor Sam Bowman, and why this problem is only going to get worse in the future.

Speaker 4

19:35

One of the kinds of challenges that I think we're worried about and haven't had to grapple with too much yet is just all the difficulty that comes with trying to teach values and good behavior in some setting when the model is just much much better than you in

19:48

that setting. Right now, we have a lot of cases where models are kind of better than humans at some skills, worse than humans at some skills, but it's still pretty rare that you'll encounter a setting where an AI is just better than sort of all of the human experts in some domain. And when that happens, things just get more complicated.

20:03

And more confusing, where even if you've got humans kind of looking really carefully at what the AI is doing, it's often hard to figure out, wait, what is the AI trying to do here, or what effects is this going to have in the real world?

Speaker 1

20:13

With the model, I guess, it just makes everything you're trying to do harder.

Speaker 4

20:16

Everything we're trying to do a fair bit harder. Yeah. Yeah, we're less confident that we can keep track of what's working, and I think there's just kind of more possibilities for whole new kinds of unwanted behavior to creep in that we'll have to find a way to grapple with.

Speaker 1

20:29

I see, like, right now, maybe ais are at the level we are. Whatever issues it's having, there's things we can grasp. But as they get more advanced and they tackle bigger problems like solve the world's economy or figure out the right policy for the whole country or something like that that not one person can really grasp, it's going to be hard to even sort of like talk to it and understand it. I think that's what you're saying, right Yeah, Yeah, I.

Speaker 4

20:53

Think there's even maybe two interesting ideas in there, because maybe there's the sooner stuff. That's something like, we're asking the AI to help us design novel molecules for pharmaceutical development, and it's got some really novel ideas about biology that are just really complex and really hard for humans to understand, and we can't tell, kind of, is the model actually convinced that this is going to work, or is the model messing with us and this would actually be kind

21:16

of dangerous. Should we try this drug, should we start to do some experiments in the lab? There's this setting where we kind of still ultimately know what we want. We know we want drugs that are safe.

Speaker 1

21:25

Uh huh. Like it might tell you that this will cure cancer, for example, but you're saying, like, what else it's trading off to cure that cancer for example? You might not know.

Speaker 4

21:34

Yeah, yeah, that's a good example. It's hard to tell if kind of the model's like genuinely trying its best and genuinely thinks this is the best cancer drug, or if it thinks, oh, this is just something that looks good and it doesn't actually care if the drug will ultimately succeed, or if maybe for some reason, the model's extremely scary and it's actually trying to mess with you, and you've got your scary misaligned AI that's trying

21:54

to sneak in some slow-acting poison. The smarter the AI is, the harder it is to tell the difference

21:59

between those different outcomes. And then once you start talking about a lot of these really kind of ambitious sort of capital-F Future social scenarios, like AIs trying to figure out sort of what the economy should be like or how the world should be governed or something like this, and like, I don't know how much we want to use AIs for things like this. But once you get into that territory anyway, then you just get into this extremely weird situation where I don't know if

22:22

anyone is going to know what we even want, Like, what is the right way to govern the world, What is the right way to do?

Speaker 3

22:29

I don't know.

Speaker 4

22:30

Yeah, yeah. At some point, figuring out how an AI should behave requires you to solve philosophy, requires you to figure out what is good. And the more powerful AI gets and the weirder the situations you're putting it in, the more kind of common sense notions of what's good start to fall apart, the more you actually have to grapple with a lot of the really hard, confusing stuff.

Speaker 1

22:48

It might tell us how to run the world, but at that point no one person or even a group of people might know, is this actually the best way to run the world? Is it sort of taking into account the things that all of us collectively would value? That's kind of the problem.

Speaker 4

23:04

Yeah. And I think you start to get at some of these pretty difficult questions even before you get into these kind of really big futures of sort of how to go run the world. If someone is getting all of their news or getting all of their personal life advice from an AI, that's already giving the AI a lot of leeway for kind of, what makes a good life for this person? What is important for this person to know? And those are already questions that get

23:26

really hard. What they want in the short term might not match what they want long term. What makes them happy might not match their intuitions. What's good for that person might not be what's good for their community, might not be the same as what's good for the world.

Speaker 1

23:37

You made me think of it. I wonder if a good analogy is that it'd be almost like if you as a parent, I don't know if you have kids or nieces or nephews. But it'd be almost like if your kid suddenly tried to tell you what to do or was trying to teach you how him or her wanted to run their lives. You'd be like, you're just a kid, what are you talking about? Trust me, this is what you need to do. Yeah, except that we are the kids and the AI is sort of the parent.

24:02

Is that sort of the situation we're in? It might be sort of parallel to that.

Speaker 4

24:06

Yeah, I think there's something there. I feel like a version of the analogy that I'd be more excited about would almost be, some alien species lands, huh, and they have all this great technology and they seem nice and they're like, hey, we'd really recommend making some changes to your society. Maybe try doing things like this. And we're like, wait, you're really very accomplished. You have some useful ideas, but like, are you trying to help us?

24:28

Are you trying to sabotage us? Are you just kind of confused?

Speaker 1

24:32

Are we what's for dinner? Or are you inviting us to dinner?

Speaker 3

24:35

Yeah?

Speaker 4

24:35

Yeah, yeah yeah.

Speaker 1

24:37

The sense I'm getting from you is that these things are just getting smarter and more capable, so it seems really pressing that we figure this out now before it gets even more difficult. Yes. Yes, AIs are getting smarter each second, it seems, and we seem to be trusting them more and more each day with our data, our choices, and even our lives, which brings us to the main question at the end of the day: what can we do about it?

25:02

How do you train an AI to care about us, to have our values and to make the right choices? To answer this question, I reached out to another AI expert on alignment, doctor Tim Rudner. Doctor Rudner is a professor at the Vector Institute for Artificial Intelligence at the University of Toronto, and he says there are many ways to train AIs to like us. The only problem is none of them work perfectly. So here's my conversation with doctor Tim Rudner. Well, thank you, doctor Rudner, for joining us.

Speaker 3

25:33

Thanks so much for having me on.

Speaker 1

25:34

And I'm talking to a real person right now, right, You're not an AI version of.

Speaker 3

25:40

Yourself as far as I'm aware.

Speaker 1

25:46

As far as any of us are aware. Yes, I mean this whole conversation could be AI generated.

Speaker 3

25:52

I know we're just all in the simulation.

Speaker 1

25:56

Well, it'd certainly be a lot easier. I would get more sleep for sure. Well, today we're trying to answer a very critical question which was posed by our sound engineer, which is, is AI going to kill us all? Can an AI have values? Can an AI have an understanding of a human, of what good things are to a human?

Speaker 3

26:17

Yeah? And well I wish I had the answer to that.

Speaker 1

26:23

Maybe that's the problem is that we don't know.

Speaker 3

26:24

Yeah, I mean this is such a difficult question, right? And I think that this is a question that touches on philosophy, engineering, psychology, and probably many other disciplines. Right? But what are values, and what values can a non-sentient being possibly have?

Speaker 1

26:43

It's not a simple question.

Speaker 3

26:45

Yeah, let me take a step back. So the way to think about alignment is I think through the lens of what's referred to as the specification problem, Where specification is the term that we use to describe what we tell the model it should do.

Speaker 1

27:01

When you say specification, you mean like spec right kind.

Speaker 3

27:03

Of, yeah. Yes, spec is just short for specification.

Speaker 4

27:06

Yeah.

Speaker 3

27:07

This is what we can think of as our intent, the kinds of things that we want a model to do. For example, our intent might be for models to never say things that could lead to harm, or our intent could be that models should always be friendly and helpful. And so this is what we call the ideal specification for that

Speaker 1

27:25

Model, meaning like we want to be able to say, like, be a chatbot, but make sure that nobody ever hurts themselves.

Speaker 3

27:33

That's right, I see. And there are a few different ways to provide specifications to chatbots.

Speaker 1

27:38

Okay. According to Dr. Rutner, there are three general ways to make sure AIs behave and don't kill us. The first way is to basically tell it to behave every time you ask it to do something.

Speaker 3

27:52

There is what's called a system prompt. This is a text specification that a model loads every time it engages in a conversation with a user, in the case of a chatbot.

Speaker 1

28:05

So every time you interact with the AI, you would basically instruct it to behave. You might say, hey, AI, organize all my emails, or design a new drug for me, or figure out the best policy for our government, but please make sure that no one gets harmed, that you don't do anything dangerous or unethical, etc.

Speaker 3

28:24

Etc.

Speaker 1

28:25

But of course this would get pretty cumbersome if you have to do it every time. That's option number one. Option number two is to have humans train your AI to be good.

Speaker 3

28:36

So one approach is called reinforcement learning from human feedback, which involves producing different answers for a given prompt and then having human labelers say which of these answers they prefer.

Speaker 1

28:49

What does that look like? The human annotators, is it like a warehouse full of people just talking to the same AI, or is it three people, or is it a thousand people? What does that look like?

Speaker 3

29:00

So I should say I'm not an expert on this, but my understanding is that much of this work is outsourced to countries where the median wage is lower than, for example, in the United States or in Europe. There have been reports of large groups of annotators, specifically annotating images and texts that are considered not safe for work,

29:24

for example, in those countries. So in other words, you have examples of folks in those countries that are already less privileged than people living in the United States, for example, on average, engaging with a lot of horrific content and saying this is not something we want.

Speaker 1

29:42

So this is basically paying people to test drive your AI. You could have a warehouse full of people whose job it is to interact with the newborn AI and essentially have them raise the AI and tell it what is right and what is wrong. Unfortunately, as Dr. Rutner said, it means poor people would have to bear the absolute worst behavior of the AI. That's option number two.

30:06

Option number three for making AIs that care about us is to basically bake into the AI a constitution, you know, like the US or UK constitution, that establishes what the country stands for, what its values are, and what's generally allowed and not allowed.

Speaker 3

30:23

There's an approach that Anthropic introduced called constitutional AI, and that approach is based on providing a constitution to an AI model, and that constitution reflects different values and preferences that the company, in this case Anthropic, wants the model to exhibit.

Speaker 1

30:44

Yes, the last approach here to making sure an AI has values and morals is to give it a founding document. But here's the wild part. The way to bake that founding document into the AI brain is to have another AI train it. When we come back, we'll dig into that scenario and we'll ask our experts what they think the future holds. Will future AIs have our best interests in mind? Or is it hopeless to ever be certain they won't harm us? So stay with us. We'll be

31:16

right back. Hey, welcome back. We're talking about AI alignment, or basically the problem of making sure AIs don't kill us all. And so far we've talked about why this is such a hard problem and what are some of the ways we can teach AI things like values and morals. There are several ways, and one of them is to give AIs a constitution, or the equivalent of a founding document or moral guide, and then have that baked into

31:58

the DNA of the AI. Now what's interesting is that, according to Dr. Tim Rutner, the way to do that is through another AI.

Speaker 3

32:11

This is a little bit in the weeds, but providing a very long constitution that outlines every preference in detail when a user engages with the model is actually more expensive for the company to do because the model needs to ingest a lot of text upfront, so it's easier to try to bake the preferences that are expressed in the constitution explicitly into the model when you're training it upfront, as opposed to providing that specification every time a user engages with the model.

Speaker 1

32:48

I see, you want the model to have learned the constitution sort of inherently, rather than having to check it every time somebody asks it a question.

Speaker 3

32:56

Yes, I think that's roughly right. Ideally we would be able to use humans to provide feedback and to oversee models and to say, hey, this is behavior that we don't want and stop that behavior. But of course that's

33:09

not really scalable. We can't have a human oversee every interaction that a chatbot has. And so that raises the question: how can we exhibit oversight in a way that is safe, reliable, aligned with our values and preferences, and successful? And so the key challenge here is essentially to come up with tools, methods, models that are able to check whether a given AI model performs some unintended behavior, and if it does, can ring an alarm bell and let a human know

33:43

that oversight is needed and that maybe a model engages in undesirable behavior.

Speaker 1

33:48

You mean like, have an AI police the other AI.

Speaker 3

33:52

That's right, essentially, have one model oversee another model.

Speaker 1

33:55

Whoa. But then how do you make sure the police AI is doing its job or is aligned itself? You need another police.

Speaker 3

34:02

We don't know, that's the problem. We don't know. It's turtles all the way down from there. If we have a model that checks whether another model does what we want it to do, then how do we know that the model that does the overseeing is actually aligned with us? I would argue that it might be easier for us to make sure that the overseer model is aligned than the generator model. They're a little simpler because they don't necessarily generate. They just try to classify

34:32

whether a given behavior is intended or not intended. And so this way we might be able to do alignment more scalably and in a way that really reflects different individuals' or groups' preferences and values.

Speaker 1

34:46

I see. It's like having another AI kind of sit in every time I ask ChatGPT something. Yes. Yes. As AIs get bigger and more complicated, the only scalable solution to training them is going to be through other AIs. In this situation, you might program a simpler AI with your values and morals, and then you'd have that AI train the bigger AI, or at least try to. As Dr. Rutner says, AI alignment methods are not perfect. What do

35:18

you mean, they're not perfect? They don't always work, or they can't guarantee that they will work?

Speaker 3

35:24

So with machine learning models, we can rarely guarantee anything.

Speaker 1

35:28

Oh boy, that's kind of the problem, isn't it.

Speaker 3

35:31

Yeah, I think that's one of the problems. There is research that tries to establish guarantees, but that research is far behind the practice at the moment. The kinds of methods that we have for model alignment fall short in a few different ways. One, and I think one of the biggest ways, is that it's just hard to communicate our preferences. So there are many different steps at which alignment can fail. This goes back to trying to express and then communicate what we want a model to do, what kind of values

36:03

and preferences we're trying to instill in it. Translating our values and preferences from some really complicated, possibly contradictory ideal specification into a design specification is very difficult, and there's

36:21

likely going to be some gap there. And then second, even if we were able to do this perfectly, even if we were able to express and communicate our values and preferences perfectly, the kinds of low-level machine learning tools that we use to give the model these preferences and values are imperfect at translating the values that we're trying to communicate into the model. They might not enable us to perfectly translate the design specification into the

36:54

actual behavior that we would like to see.

Speaker 1

36:56

I see. Boy, it seems like there are problems everywhere we turn here, Dr. Rutner. Yeah. Well, as we go towards the future, and as systems get smarter and problems get more complicated, what do you think is the prospect of making sure that these more advanced systems for more complicated problems have the values that we want them to have? Because I'm not sure if we want them to have human values, because I don't know if humans are the best at making these kinds of good choices. Yeah, what do you think?

37:28

What do you see in the future?

Speaker 4

37:30

I think there are a few things we need longer term, and they all feel uncertain. I think to do well in the longer term, we need to get the AIs in the near future

Speaker 3

37:37

right.

Speaker 4

37:38

If the pace of AI development stays fast, we're really, really going to need the help of AI systems to help us figure out how to steer future AI systems. And so getting the right values into the next model we build helps us figure out what to do with the model after that, and so I think this kind of short-term work really does kind of fan out into this longer future. And yeah, getting the next model right really matters for getting

Speaker 3

38:00

the far-future ones right.

Speaker 1

38:01

Oh boy.

Speaker 4

38:02

It's called the scalable oversight problem.

Speaker 1

38:04

I see. It's like the alignment problem is going to scale up, and the best way for us to keep up is to make sure that we get it right now with these smaller systems, so we can use those AIs to help us in the more complicated situations.

Speaker 3

38:20

Yeah.

Speaker 1

38:20

Yeah, that's oh wow.

Speaker 4

38:22

That's the hope.

Speaker 1

38:24

Okay, last question. Do you think humanity is doomed?

Speaker 4

38:30

I don't think so. I think it's possible. I think AI presents a lot of really scary and destabilizing possibilities that we can't rule out. So I think there's a lot of work to do. I think we'll probably figure it out. But I think it's also possible that AI winds us up in a lot of weird, unfamiliar situations. I think it's unlikely but possible that things go really, really terribly. But I also think it's kind of unlikely but possible that things stay totally normal and recognizable

38:53

and familiar. I think AI, just what it's going to do to society and politics and economics, is all going to be confusing. I think there's a lot that we'll need to figure out pretty fast.

Speaker 3

39:02

Fingers crossed.

Speaker 1

39:05

I guess that's as good of an answer as we can get these days. Fingers crossed. I guess just to wrap up here, what do you think is going to happen in the future, or what are some things about this that you think most people are not thinking about that they should be thinking about.

Speaker 3

39:21

I think people should be thinking about ways in which the AI systems that we have today are already capable enough to cause harm, to change our world quite significantly, change our culture, change the way we go about our day, change the way we make decisions, change the way we do our work. And the alignment problem and understanding when models are aligned, I think, are two of the most fundamental scientific challenges that we as a society are facing

39:52

right now. And that is not a sci-fi future problem. This is a problem about systems that we have today. We want to make sure these systems really do what we want them to do, and that these systems help us flourish and benefit humanity. The systems that we have access to today are already well beyond the capabilities that the research community and certainly the general public thought we could have in the year twenty twenty six.

Speaker 1

40:20

I think you're saying that the future is here, but we still haven't fully figured out the alignment problem.

Speaker 3

40:25

Yes, that's right.

Speaker 1

40:27

Meaning it's more pressing than ever that we figure this out.

Speaker 3

40:29

I agree.

Speaker 1

40:30

Yeah. Amazing, Dr. Rutner. How do we prove to the audience that we're not an AI-generated conversation?

Speaker 3

40:38

I wish I had the answer. You can generate such fantastic fake podcasts with AI now, right? Right, with all the little idiosyncrasies that you hear in podcasts today, that, you know, I think that's hard to do.

Speaker 1

40:53

Or I guess if this conversation is aligned with what you want to hear, maybe it doesn't matter.

Speaker 3

40:59

Still, I hope that the audience thinks that we sounded human. I think that would be, that would be nice.

Speaker 1

41:07

All right. Hey, on behalf of everyone who works on the show, thank you for joining us on the sixty-plus episodes we've done. Be sure to follow me on social media or PhD Comics dot com for updates. And hey, thanks to all the guests we've had on the show. Here's a little tribute our editor Rose Seguda put together of all the times they were gracious enough to put up with my questions.

Speaker 4

41:27

That's a good question.

Speaker 3

41:28

Yeah, that's a great question. You know, that's a great question.

Speaker 4

41:31

Yeah, so those are all big questions for us to answer as scientists. Yeah, that's a great question. It's a good question.

Speaker 3

41:37

That's a good question.

Speaker 4

41:38

That's a good question. So that's actually a good question.

Speaker 2

41:42

You're you're raising really great questions.

Speaker 4

41:45

Yeah, that's a really good question.

Speaker 3

41:47

That's a good question. That's a really good question.

Speaker 4

41:50

That's a very hard question. That's a good question, though, that's a really good question. That is a really good question. Yeah, that's a great question.

Speaker 3

41:57

Yeah, that's a good question.

Speaker 1

41:58

Oh that's a great question, very very good question.

Speaker 3

42:02

That's a really good question.

Speaker 4

42:03

So I'll throw that question back at you.

Speaker 1

42:05

What do you think? So we come once again to the edge of scientific knowledge. Thanks for joining us. See you next time. You've been listening to Science Stuff, a production of iHeartRadio, written and produced by me, Jorge Cham, edited by Rose Seguda, executive producer Jerry Rowland, and audio engineer and mixer Casey Pegram. You can follow me on social media. Just search for PhD Comics and the name

42:34

of your favorite platform. Be sure to subscribe to Science Stuff on the iHeartRadio app, Apple Podcasts, or wherever you get your podcasts.

Transcript source: Provided by creator in RSS feed
