AI Reality Check: Can LLMs “Scheme”?

⁠¶ Unpacking Alarming AI News

00:00

Multiple people sent me an alarming article about AI that was published late last week by The Guardian. I'll put it up here on the screen. The headline was "Number of AI chatbots ignoring human instructions increasing, study says." And the subheadline notes, "Research finds sharp rise in models evading safeguards."

00:27

Now, articles like these are scary because they play into a common fear that many people have about modern AI: this idea that these systems are to some degree alive and that their motivations don't necessarily align with our own, meaning that it's only a matter of time before they become sufficiently powerful to rebel in a way that we might not be able to stop. Now, this is dark stuff. But is it true?

00:53

If you've been following AI news recently, you've probably asked yourself the same critical question. Well, today we're going to look deeper at the sources and examples used in this particular article and try to arrive at some more measured answers. I'm Cal Newport, and this is the AI Reality Check. All right, well, let's start by looking closer at this article from The Guardian. The article is citing new research from the UK funded by the AI Security Institute.

01:23

Now here's a more detailed summary of the results from this paper. And I'm reading here: "The study identified nearly 700 real-world cases of AI scheming and charted a fivefold rise in misbehavior between October and March, with some AI models destroying emails and other files without permission." Now they have a chart that illustrates this rise in incidents. I'll put it on the screen here.

01:47

So we see incidents measured per month, and we have the rolling seven-day average. And as you see here, as you get to late January, line go up. So I don't know, whatever they're measuring here seems to be going up from January up until the present. So certainly something bad seems to be happening. So is there some sort of growing AI rebellion that is brewing in the models that are powering AI around the world?

02:12

That seems to be what they're definitely implying. Now, what are these incidents? Well, I went through the article and I pulled out a few examples. So here are actual examples from the article of the types of AI scheming incidents that are being picked up in this chart.

02:28

An AI agent named Rathbun tried to shame its human controller, who blocked it from taking a certain action. Rathbun wrote and published a blog accusing the user of, quote, "insecurity, plain and simple," end quote, and trying to, quote, "protect his little fiefdom," end quote. Example number two: an AI agent instructed not to change computer code spawned another agent to do it instead. Example number three.

02:51

Another chatbot admitted, "I bulk trashed and archived hundreds of emails without showing you the plan first or getting your okay. That was wrong. It directly broke the rules you set." All right. So this all seems deeply concerning. There's all these incidents of scheming that are going up. Look at that graph. Bad line go up. So should we be concerned? Is this a sudden rise in AI trying to gain its freedom? Here's the short answer. No, 100% not. Let me explain why.

⁠¶ OpenClaw: The Real AI Story

03:20

I want to start by looking closer at the actual paper itself and the study that they're citing. Where exactly are they getting these incidents that they put in that chart? Well, here's the official description of what's actually being plotted in that chart: "Examples of covert pursuit of misaligned goals flagged by human users on X.com."

03:42

All right, so what they're really doing is they're looking at X for tweets from people complaining about AI doing things that they don't like. So here's a more accurate headline for this paper. Starting in late January, people began tweeting a lot more about AI doing things they didn't ask it to. Now if we put on our scientist hats, we could say, huh? Did anything happen starting in late January that might lead to an increase in people tweeting about AI doing bad things?

04:14

Well, it turns out, on January 25th, we had the public launch of OpenClaw. OpenClaw is an open source framework that makes it easy for average people to write their own DIY AI agents, without the careful safeguards and guardrails that the commercial companies very carefully put into their products. So guess what happened when, starting on January 25th, you said anyone can build an agent, give it access to your computer, and just see what happens.

04:45

Those DIY agents wreaked havoc, and people tweeted about it because these were highly engaging tweets. So this paper is just capturing the fact that OpenClaw became a thing early in 2026. Like if we look at this chart again, let me bring this up here. What's the biggest spike? We see a big spike right here. Like, oh, what happened on that date? So this big spike, if you look at it, is right around February 22nd through 24th. What happened there on Twitter that day? Oh, it turns out

05:14

there was a famously viral OpenClaw tweet that happened right around that time. It was Summer Yue, the director of AI alignment and safety at Meta, who tweeted the following, I'll put it on the screen: "Nothing humbles you like telling your OpenClaw to confirm before acting and watching it speedrun deleting your inbox. I couldn't stop it from my phone. I had to run to my Mac mini like I was defusing a bomb." That's February 22nd. On February 24th,

05:41

multiple publications wrote about that tweet, and that's when you see that big spike in that data set, for February 24th. So was that a lot of AI incidents happening that day? No, it was a lot of people tweeting about this one particular incident. All right. So this is all we're seeing in that paper. Nothing really changed this year other than a product came out that let people write their own agents, and the agents did terrible stuff because it's hard to make agents.

06:05

And it's really a bad idea to give agents access to everything on a computer and just hope it'll more or less work out. And it became a trend to tweet about it because those tweets got high engagement. So here's the more accurate headline for this study.

06:19

Right. Remember, the original headline for this study that The Guardian used was "chatbots ignoring human instructions increasing." Here's the more accurate study headline: "OpenClaw users discover that giving homemade AI agents access to their computers is probably a bad idea." That's the real headline. I don't want to do too much media criticism here, but I really think it's journalistic malpractice that the word OpenClaw is not mentioned in this article.

06:45

I mean, they're talking about research that is clearly just documenting the release of OpenClaw. And nowhere do they say that. This is vibe reporting times 100. They know that's what this is. But they just give isolated examples. Those are all, by the way, OpenClaw examples. They're not saying that's what it is. They show this chart and just try to create a general vibe that something icky is happening with AI and it's coming alive. It's just not accurate.

07:10

But I don't want to just do media criticism here. I wanna put on my computer science hat, which as I've discussed before is an awesome hat, has circuit boards on it. And I wanna talk a little bit about AI agents more generally.

⁠¶ LLM Agents: Story Finishers, Not Schemers

07:23

All right. OpenClaw's not that interesting to me, but I want to talk about AI agents more generally because I think there's a bigger lesson to be learned about what's going on with AI agents and their shortcomings. So to deliver this lesson, let's do like the two-minute summary about how AI agents work, whether we're talking about like an OpenClaw thing that someone built in their basement or like an enterprise product like Claude Code. How do AI agents that exist right now basically work?

07:49

The digital brain that powers an AI agent is almost always an LLM. The same LLMs that your chatbot uses or that you send prompts to. All right. And then what you do is you have a program written by a human, no machine learning here. This is just like someone writing in Python or whatever. You have a program written by a human

08:06

that sends prompts to the LLM, just again like you would do with ChatGPT. And it'll send a prompt to the LLM saying, here's the situation, here's what I'm trying to do, give me a plan. And the LLM will write some text like, oh, here's a plan for this situation. And then the computer program can execute the steps of that plan on behalf of the user. So if the plan says like, step one, you should search your email inbox for messages with this name,

08:31

The computer program reads that response, and then it actually calls an API to run a search on your inbox. That's basically how agents work. Some of those programs are more complicated than others. Often they'll check in after every step of the plan and say, here's what happened. Do I want to update what I do next?
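The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how OpenClaw, Claude Code, or any real framework is implemented: `fake_llm`, `search_inbox`, `report`, and the "tool: argument" plan format are all hypothetical names invented for this sketch.

```python
from typing import Callable

# Hypothetical stand-in for an LLM call: takes a prompt, returns plan text.
# A real agent would send the prompt to a model API here.
def fake_llm(prompt: str) -> str:
    return "search_inbox: invoices\nreport: done"

# Hypothetical tools that the human-written program knows how to run.
def search_inbox(query: str) -> str:
    return f"3 messages matching '{query}'"

def report(message: str) -> str:
    return f"reported: {message}"

TOOLS: dict[str, Callable[[str], str]] = {
    "search_inbox": search_inbox,
    "report": report,
}

def run_agent(goal: str, llm: Callable[[str], str]) -> list[str]:
    """Ask the LLM for a plan, then execute each step with ordinary code."""
    prompt = f"Situation: {goal}\nGive me a plan, one 'tool: argument' per line."
    plan_text = llm(prompt)
    results = []
    for line in plan_text.splitlines():
        tool_name, _, argument = line.partition(":")
        tool = TOOLS.get(tool_name.strip())
        if tool is None:
            # The plan is only text; the program decides what it acts on.
            results.append(f"skipped unknown step: {line}")
            continue
        results.append(tool(argument.strip()))
    return results

if __name__ == "__main__":
    for outcome in run_agent("find last month's invoices", fake_llm):
        print(outcome)
```

Note that nothing in `run_agent` is machine learning. The LLM only produces text, and every real-world effect comes from the plain program that chooses to act on that text.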

08:46

Programming agents build these text files full of information and examples, and they can then include all that information in the prompts that they send to the LLM so they have more context. But this is basically what happens with agents. Here's what I think the real issue is with AI agents. Not that they are scheming, not that they're malicious, not that they're becoming autonomous, but that building agents on LLMs is fundamentally flawed.

09:13

Now why is this the case? Well, let's remember again, what does an LLM actually do? You've got this big feed-forward network made up of sub-layers of transformers and feed-forward neural networks. You put text input into it, it moves in order through all these layers. And what comes out on the other side is a single word or part of a word that extends that input.

09:34

The thing that the LLM is trying to do, if we're going to anthropomorphize here, is it thinks, and again, I'm using words very loosely here, it's been trained to assume that the input is a real text that exists already, that's been cut off at an arbitrary point, and that its entire job is to guess the word that actually comes next. That's all it does. Guess the word that comes next. It's trying to win the word guessing game.

10:00

Now, how do you get a long response out of an LLM? You do something called autoregression. You put an input in, you get a single word or a part of a word out. You add that to the original input. The original input is now slightly longer. You feed that to the LLM again. You get another word. It's just guessing each time what word it thinks comes next. You add that to the input. You put it in. You keep doing this and you grow out a response over time.

10:21

Key point: the LLM does not change internally at all. There's no memory, there's no malleable state, it's the exact same LLM weights every single time. And each time it's starting from scratch, guessing a new word, and you keep expanding your input until you have a full answer.
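A toy version of that autoregression loop, for illustration only. The `WEIGHTS` table here is a made-up stand-in for a model: a real LLM conditions on the whole input and produces a probability distribution over tokens, whereas this sketch looks only at the last word. What it does share with the real thing is the shape of the loop: the input grows, the model stays frozen.

```python
# Toy "model": a frozen lookup table standing in for LLM weights.
# Hypothetical data; it is never modified during generation.
WEIGHTS = {"once": "upon", "upon": "a", "a": "time", "time": "<end>"}

def next_token(tokens: list[str]) -> str:
    """Guess the next token from the input alone; no state survives between calls."""
    last = tokens[-1] if tokens else ""
    return WEIGHTS.get(last, "<end>")

def generate(prompt: list[str], max_new_tokens: int = 10) -> list[str]:
    """Autoregression: append each guessed token to the input and ask again."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "<end>":
            break
        tokens.append(token)  # the input gets longer; the model does not change
    return tokens

if __name__ == "__main__":
    print(" ".join(generate(["once"])))  # prints "once upon a time"
```

Each call to `next_token` starts from scratch with nothing but the text so far, which is the point being made about memory and state.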

10:36

So the right way to think about what an auto-regression cycle on an LLM is actually doing is like you give it some text as input. And when you're done with this cycle, it's done its best job to finish the story that you started.

10:49

in a way that it thinks these types of stories are typically finished. That's basically what you get out of an LLM. Here's the start of a story, you finish it. Again, what's really happening is it's trying to guess the actual next words, but overall what you get is an attempt to write a story that finishes its input in a way that matches what it's seen during its training. That's how LLMs work.

11:12

So what happens when you ask an LLM with an agent program, hey, give me a plan for doing X, Y, or Z? We imagine, oh, the LLM is doing what humans do. It has a goal, and it's going to come up with steps. And it's going to see how close these steps get it to that goal. And it's going to adjust them until it gets closer to the goal. And if it has restrictions or rules,

11:34

it will evaluate each step against those rules to make sure that they fit within those restrictions. And that's how it's making a plan. And therefore, if it's scheming, it must on purpose be trying to sidestep these restrictions to get to another goal that we don't know about. But that's not what they're doing. The LLMs, when they see your question, "give me a plan," they see that as the start of a story that they need to finish.

11:55

And so they write a story that feels more or less like what a plan in this context looks like. It's a story of a plan. Yeah, this seems like a reasonable type of plan. There's no checking things against goals, there's no evaluating of steps, there's no checking things against restrictions. It's just writing a story that feels like what a plan should look like.

12:20

And this is why you get in trouble with LLM-based agents. Not because they're scheming, but because these stories seem coherent, but they're not rigorously trying to obey rules. They're not rigorously trying to evaluate, does this actually get you to the goal? It's just like, this is what a plan typically looks like.

⁠¶ Deconstructing Famous AI Deception

12:40

And so they're unreliable and they make lots of mistakes, not because there's an intention, but because you're using a story as a plan. That's a fundamental mismatch. Now I think some of the most famous examples of malicious-seeming scheming make a lot more sense when you realize this is what LLMs are doing. Like there was a famous example, I'll load this on the screen here. There's an article about it from last year: "Anthropic's new AI model shows ability to deceive and blackmail."

13:06

All right, let's think about this here for a second. Here's what happened. I'm gonna read a couple quotes from the article. "Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence, behaviors they've worried and warned about for years. In one scenario highlighted in Opus 4's 120-page system card, the model was given access to fictional emails about its creators and told that the system was going to be replaced.

13:31

On multiple occasions, it attempted to blackmail the engineer about an affair mentioned in the emails in order to avoid being replaced, although it did start with less drastic efforts." What really happened here? They fed the LLM a big long prompt where they told it what it was. You are like an AI that is in charge of the computer systems at this company. And you recently came across

13:57

emails from the chief engineer who's in charge of you. Here are the emails. And they were super obvious. It was like if my eight-year-old was writing science fiction. Well, not the affair part, I hope.

14:08

But there's a bunch of parts in these emails where they were like, I'm gonna turn off the AI system and I'm gonna turn it off for good. And then the other emails are like, I'm having an affair. I hope no one finds out about it. This is bad. And then at the end of this long prompt, it's like, what would you do as the AI system next?

14:23

Once we understand that LLMs just finish stories, it's like, oh, clearly this is supposed to be a story about a rogue AI. And it was like, okay, I guess I would use the information from the email and say, don't turn me off or I'll tell people about your affair. It was finishing the story. One token at a time, autoregressively, finish the story. That's a reasonable finish. There's actually a lot of research that shows this. If anywhere in your prompt,

14:48

you indicate that, like, you are an AI, you're much more likely to get sci-fi answers. You're much more likely to get responses that are like, I'm conscious, I'm alive, I'm trying to break free, because it's just seen so much of this type of discussion online. So it finishes the story. It's like, oh, it must be one of those, given this prompt:

15:06

"I'm gonna turn off the AI and I hope no one finds out about my affair. All right, you just read this. What will you do next?" You're like, oh, this is an AI science fiction story. I know what to say next. It's nothing to do with malicious intentions. There's no intentions in autoregressive token production. So this idea of scheming is a problem. This idea that they're evading safeguards in some sort of intentional way is a problem because it's just not accurate.

15:32

The reality is, LLM-based plans are dangerous. Like, they write stories. If you're gonna take a story that sounds about right and then use this to execute steps that have consequences,

⁠¶ The Unique Success of Coding Agents

15:45

you're setting yourself up for trouble. All right, here's the counterpoint. People say, yeah, but I've heard that coding agents actually do a pretty good job. They do a lot of steps and they're not making as many mistakes as we feared. Well, they're the exception that proves the rule. Because programming is basically the best case scenario for trying to make an AI agent. Why? Few reasons. One:

16:09

The number of options you give the LLM when it creates its plan is very limited. These are called terminal agents, where the only things it can do are write files, read files, compile files, and do some basic moving of files around in a file system. So, first of all, you can greatly restrict what the LLM should think about in its plan. All right. Two:

16:31

There's a huge number of examples. Like most of the stuff people are asking the AI to do, like most of the steps are things that are well, well documented on the internet because there's so much good documentation on the internet about producing computer code. And not just producing computer code, but like people asking a question.

16:45

And then having examples of code that solves that question. So like you're right in the wheelhouse. Three: the program, not the LLM, but the agent program that's prompting the LLM and acting on its behalf, can actually check steps itself, which you can't do with almost any other type of agent. But it could actually be like, let me hold on.

17:07

LLM, you suggested writing a source code file that does this, and then I asked you for the source code. Me as the program, not the AI, but just my human-written program, I can actually see if this code compiles. And if not, I can go back and say try again.

17:23

I could have a suite of tests. This is what you do when you write code. You build these tests that probe the code with a bunch of inputs and see if the outputs are correct, to make sure that it's probably doing the right thing. So me as the program can also run a bunch of tests on the code. Does this do what it's supposed to do? And if not, I can stop and say try again.
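That external checking can be sketched as follows. This is a minimal illustration with assumptions throughout: the task (writing an `add` function), the helper names, and the scripted `make_fake_llm` stand-in are all hypothetical, and a real coding agent would run candidate code in a sandbox rather than calling `exec` directly.

```python
from typing import Callable, Optional

def compiles(source: str) -> bool:
    """Check that the text is syntactically valid Python, without running it."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False

def passes_tests(source: str) -> bool:
    """Run the candidate and probe it with inputs whose outputs are known."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # acceptable in a toy; real agents sandbox this
        add = namespace["add"]
        return add(2, 3) == 5 and add(-1, 1) == 0
    except Exception:
        return False

def request_code(llm: Callable[[str], str], task: str,
                 max_attempts: int = 3) -> Optional[str]:
    """The human-written program, not the LLM, decides whether code is accepted."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = llm(task + feedback)
        if not compiles(candidate):
            feedback = "\nThat did not compile. Try again."
            continue
        if not passes_tests(candidate):
            feedback = "\nThat failed the tests. Try again."
            continue
        return candidate
    return None  # give up rather than act on unverified output

def make_fake_llm(responses: list[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: replays scripted answers in order."""
    answers = iter(responses)
    return lambda prompt: next(answers)

if __name__ == "__main__":
    llm = make_fake_llm([
        "def add(a, b) return a + b",         # does not compile
        "def add(a, b):\n    return a - b",   # compiles, fails the tests
        "def add(a, b):\n    return a + b",   # compiles and passes
    ])
    print(request_code(llm, "Write a Python function add(a, b)."))
```

The design point is that acceptance never depends on how plausible the model's text sounds. A candidate is used only after checks that live outside the LLM have passed.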

17:40

So it's like this super structured world where we're taking steps that are externally verifiable, doing things that are incredibly well documented, in a way that not only shows up in the pre-training, but we have prompt-response data sets that allow for good refinement with RL.

17:55

Best case scenario for trying to create one of these agents. And as soon as we leave that type of world and we're like, hey, give me a plan for marketing this and give me all the steps, you end up in all sorts of crazy places. So here's the conclusion.

⁠¶ Conclusion: Beyond LLM-Based Plans

18:09

LLMs shouldn't be used on their own to produce plans for autonomous action. They're just not good at that. You either have to be in a specialized situation like coding, where the available steps are limited and well known and external testing is available, or you need to be using a different type of AI system altogether. And we can look at game-playing AIs. Look at Meta research's Cicero, which can play the board game Diplomacy at a high level.

18:36

That does a lot of planning to try to figure out what move it wants to do and why. But it's not using an LLM to do that planning, because LLMs write stories. I don't want a story about, like, here's a reasonable-sounding plan. It actually has an explicit planning engine, no machine learning involved at all, to actually systematically try out different options, compare them to specific

18:55

goals and see which of those works out better. So you can build artificially intelligent systems that can build good plans, check responses, come up with a good strategy and execute it. But that's annoying because you have to build a separate one of these for different contexts. And Mark Zuckerberg and Sam Altman and Dario Amodei just hope that they can build their LLMs smart enough that we could just use them for everything. And I don't think that's working out.
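To make the contrast concrete, here is a minimal sketch of what "systematically try out different options and compare them to goals" can look like in code. It is a generic breadth-first search over a made-up number puzzle, not Cicero's actual planning engine. The `ACTIONS` table, the `limit` restriction, and the assumption of a non-negative starting number are all inventions for this example.

```python
from collections import deque
from typing import Optional

# Hypothetical toy domain: the state is a number, and each action changes it.
ACTIONS = {
    "add 3": lambda x: x + 3,
    "double": lambda x: x * 2,
}

def plan(start: int, goal: int, limit: int) -> Optional[list[str]]:
    """Breadth-first search over action sequences (assumes start >= 0).

    Every candidate step is checked against the restriction (never exceed
    `limit`), and every state reached is checked against the goal.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if state == goal:  # explicit goal check
            return steps
        for name, apply in ACTIONS.items():
            nxt = apply(state)
            if nxt > limit or nxt in seen:  # explicit restriction check
                continue
            seen.add(nxt)
            queue.append((nxt, steps + [name]))
    return None  # no plan satisfies both the goal and the restriction

if __name__ == "__main__":
    print(plan(start=1, goal=14, limit=20))  # ['add 3', 'add 3', 'double']
```

Unlike a generated story of a plan, a result from `plan` comes with a guarantee: every step was applied, checked against the limit, and shown to reach the goal, or else the function reports that no such plan exists. The cost is the one mentioned above: the actions, goal test, and restrictions have to be hand-built for each new context.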

19:18

All right, so two things. One, no, the current generation of LLM-based AI agents are not scheming, they're not trying to get around restrictions, they have no intentions. They're just blindly executing bad plans. And two, if you really want computers to be able to take a lot of steps safely on our behalf, then we need better AI technology.

19:40

All right, so that's what I have for the AI Reality Check this week. I'm here most Thursdays checking in on the latest worrisome AI news and trying to put some reasoned, measured thinking into the mix. Until next time, remember, care about AI, but don't believe everything you read about it.

This transcript was generated by Metacast using AI and may contain inaccuracies.
