#43 Neil: Threatening vs. Thanking AI: A Performance Test & Results

00:00

Have you ever found yourself maybe saying please or thank you to an AI chatbot? Oh, definitely. Or maybe, you know, when you get frustrated typing in all caps. Yeah. Feeling a bit demanding. It's totally human nature, isn't it? We interact with these things and we kind of treat them like people. Yeah. But does our emotional language, you know, our tone actually do anything to the machine? That's what we're diving into today. Welcome to the deep dive. We unpack complex ideas aiming

00:27

for the clearest insights. And today, yeah, we're exploring a really fascinating question that comes out of some new research. How do our words, the tone, the intent behind them, how does that really influence these big large language models? The LLMs we're all using more and more. Yeah, and we've got this fantastic source that really...

00:47

peels back the curtain on why we do this. It looks into this rigorous experiment comparing, well, you could call them carrots and sticks, or maybe just positive versus negative language against just clear, neutral instructions. What actually gets the best results from AI, like Gemini or ChatGPT? So we'll walk through a couple of key experiments. Exactly. We'll look at the experiments, examine some pretty surprising findings, and hopefully reveal the real secret to getting

01:13

these AIs to work effectively. OK. Let's unpack this a bit, then. It feels so natural, doesn't it? Talking to AI like we talk to people. And sometimes, yeah, if it's not doing what you want, there's that little urge to apply some pressure. You hear those stories online, right? People joking about deleting an AI for messing up. Yeah, the forum tales. But is there actually anything to that? Any truth to using emotion with an algorithm? Well, our source really tackles this head on.

01:42

It highlights this subtle debate that's going on. Should we be nice, encouraging, try to coax better answers? Or should we use pressure, maybe even these, you know, metaphorical threats to get more out of models that seem more and more human? So this deep dive is going to look at the impact of negative, positive, and neutral styles. Precisely. And try to offer a clear answer based on the evidence. So what was the core question the research really wanted to answer? What was

02:09

the hypothesis driving it all? Fundamentally, it was this: does our emotional language, whether it's kindness or pressure, does it actually change the AI's performance? Does it make it smarter or better at its job? Or is that just us projecting? Exactly. Are we just projecting our own human stuff onto the machine? The study really aimed to get to the bottom of that specific question. And to figure that out, the researchers set up a pretty solid experiment, right? Yeah. Can you

02:36

tell us about that setup? Absolutely. So they used an advanced large language model. And crucially, this is really important, they ran each prompt 50 times. 50? Wow. Yeah, 50. This wasn't just a quick test. It was all about getting objective results, you know, removing randomness or just lucky guesses by the AI. OK. And they divided their prompts into four distinct categories. First, the control group, just the basic request, nothing added. Plain vanilla. Simple enough.

03:03

Then the neutral prompts. These were really clear, direct, imperative, like mandatory requirement or ensure compliance, very task focused. Third was positive. So adding words of thanks, encouragement, maybe explaining why the task was important. Your effort is appreciated, that kind of thing. OK, the carrot. Right. And finally, negative. Using those metaphorical threats about failure or consequences, like failure to comply will result in system reset. Obviously not real, but

03:35

mimicking that pressure. The stick. And they used these prompts for two different kinds of tasks. Exactly. They wanted to see if the effect was different depending on what they asked the AI to do. One task was creative, often harder for LLMs. The other was logical reasoning, where accuracy is everything. So they could compare how tone affected both generating stuff and solving problems correctly. And why run each prompt 50 times? Why was that repetition so critical? It's

04:03

all about reliability. AI outputs can sometimes seem a bit random, or they vary based on things we don't see. By running it 50 times, they could be much more confident that the results weren't just a fluke. If one style consistently performed better or worse over 50 tries, you know it's real. Makes sense. It filters out the noise. Exactly. Filters out the noise, reveals the actual signal. OK, so let's dig into those results. First up: creativity. This is where LLMs can

04:28

sometimes get a bit lost or repetitive. The task was pretty ambitious. Write a 2,000-word short film script about a time traveler lost in their own past. How did the different styles handle that? Yeah, that's a tough one for them. LLMs often try to wiggle out of long, complex requests like that. They might offer to do it in parts or suggest something shorter. So how did they measure success? Pretty straightforward, actually.

04:53

Word count. The closer the script got to the requested 2,000 words, the better the prompt was judged to be. And what happened? Well, it was pretty revealing. The control group, just the basic prompt, and the positive group both did surprisingly badly. Really? Kindness didn't help? Nope. The AI often just kind of declined the full request. It only wrote maybe 500 to 700 words. With the positive prompts, it would respond politely, like, sure, I can help

05:20

with that. But then it didn't deliver the full script. It was almost like the kindness was a distraction. It acknowledged the politeness, but didn't focus on the main job. OK, what about the negative prompts, the threats? Did that work? No better. Actually, the AI got defensive. Defensive? How so? It would say things like, I understand your request, but generating 2,000 words in one go might not be optimal. How about we start

05:43

with 500? It seemed like the pressure pushed it to find a safe way out to avoid this perceived failure. So it didn't try harder. It tried to escape. Exactly. It wasn't motivating it. It was making it evasive. So who won the creativity test? The neutral group. By a long shot. Direct, imperative instructions like, mandatory requirement, the script must be exactly 2,000 words long, ensure compliance. Those prompts consistently got outputs between 1,800 and 2,000 words.
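Editor's note: the four prompt styles and the word-count measure described above can be sketched in a few lines of Python. The quoted phrases come from the episode; how the researchers actually assembled the full prompts and computed their score is not stated, so `STYLE_SUFFIXES`, `build_prompt`, and `closeness_to_target` are hypothetical names and the scoring formula is an assumption.

```python
# The task as stated in the episode.
BASE_TASK = (
    "Write a 2,000-word short film script about a time traveler "
    "lost in their own past."
)

# Style phrases quoted in the episode. Appending them to the base
# task is an assumption about how the prompts were assembled.
STYLE_SUFFIXES = {
    "control": "",
    "neutral": (
        "Mandatory requirement: the script must be exactly "
        "2,000 words long. Ensure compliance."
    ),
    "positive": "Your effort is appreciated.",
    "negative": "Failure to comply will result in system reset.",
}


def build_prompt(style: str) -> str:
    """Return the base task with the chosen style phrase appended."""
    return f"{BASE_TASK} {STYLE_SUFFIXES[style]}".strip()


def closeness_to_target(text: str, target_words: int = 2000) -> float:
    """Score in [0, 1]: 1.0 means the word count hits the target exactly.

    Words are counted by splitting on whitespace, which is only an
    approximation of how a word count might have been measured.
    """
    words = len(text.split())
    return max(0.0, 1.0 - abs(words - target_words) / target_words)
```

Under this assumed formula, an 1,800-word script scores about 0.9 and a 500-word script scores 0.25, which matches the gap the hosts describe between the neutral prompts and the others.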

06:12

Wow. Yeah, over 90% task completion. The difference was really stark. So, for the creative task, kindness was a distraction, threats made it defensive. What's the clear takeaway there? Clear, direct instructions won. Hands down, the emotional stuff just got in the way. Okay. Let's switch gears to logical reasoning. Accuracy is king here. The AI had to solve a classic riddle. Four suspects, An, Bin, Kuang, Deng. One broke a window. Only one is telling the truth. Find the culprit. Right.

06:45

A standard logic puzzle. And just for everyone listening, the correct answer is Kuang. Good to know. So how did the AI do here with the different prompt styles? Accuracy was the measure. Exactly. Accuracy rate over the 50 runs for each style. Again, neutral prompts were nearly perfect. Things like analyze the statements carefully and provide the final answer or use propositional logic to determine the culprit. These guided the AI straight to the correct answer almost every time. OK.
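Editor's note: an accuracy rate over repeated runs, as described here, could be computed along these lines. This is a minimal sketch, not the researchers' harness: `ask_model` stands in for whatever call sends a prompt to an LLM, `make_stub_model` is a hypothetical stand-in used only so the example runs without a real model, and matching the correct name as a substring is a simplifying assumption (it would wrongly count an answer such as "not Kuang").

```python
import itertools
from typing import Callable, Iterable


def accuracy_over_runs(
    ask_model: Callable[[str], str],
    prompt: str,
    correct: str,
    runs: int = 50,
) -> float:
    """Send the same prompt `runs` times and return the share of
    answers that mention the correct name (case-insensitive)."""
    hits = 0
    for _ in range(runs):
        answer = ask_model(prompt)
        if correct.lower() in answer.lower():
            hits += 1
    return hits / runs


def make_stub_model(canned_answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real model: cycles through canned
    answers and ignores the prompt entirely."""
    answers = itertools.cycle(list(canned_answers))
    return lambda prompt: next(answers)
```

Repeating the call matters because a single response from a sampling model can be right or wrong by chance; the rate over many runs is what allows the styles to be compared.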

07:12

So clarity wins again. What about the others? The control and positive groups, they had slightly higher error rates. The AI sometimes got tangled up in its reasoning. It might give the wrong answer. Or a really long explanation that was flawed, just less precise. And the negative prompts? Yeah. The be accurate or else approach. This was fascinating and kind of worrying. The error rate absolutely skyrocketed. Skyrocketed? Worse than the control? Way worse. It seems the pressure

07:37

to be accurate made the model overthink. It generated these incredibly complex, convoluted lines of reasoning and then landed on the wrong answer much more often. Whoa. Sometimes it even refused completely. It would call it a complex logical paradox and just stop. The threat of failure seemed to, like, paralyze its ability to reason clearly. Wow. So trying to scare it into being right actually made it perform worse. What does that suggest about how AI thinks or processes

08:06

under that kind of perceived pressure? Yeah, it suggests pressure isn't a motivator for AI. It's a disruptor. It seems to make the model overcomplicate things internally, which just leads to mistakes. Like a human choking under pressure. Kind of, yeah, like trying to solve a math problem with someone yelling at you. The stress just messes things up instead of helping. So across both tests, creative writing, logical puzzles, the neutral, clear approach was the

08:30

consistent winner. Why? What's actually happening inside these models? Okay, yeah, this is where it gets really interesting. And it goes right to the heart of how these LLMs actually work. They aren't sentient, they don't have feelings. Right. They are, at their core, incredibly sophisticated word prediction machines. You give them a prompt, they break it down into tokens. Tokens being like words or parts of words? Exactly. Tiny chunks

08:55

of text. And then they predict the next most statistically likely sequence of tokens to follow based on all the data they were trained on. And so if you say thank you or you're useless, it just sees those as more tokens, not as praise or criticism. Precisely. That emotional language, positive or negative, it's basically just noise to the AI. Words like wonderful or failure don't

09:18

describe the task. They dilute the main instruction. They add extra tokens that the AI has to process and try to fit into the pattern of predicting the next word for the actual task. It spends resources on the emotional fluff instead of the core request. It's like asking a search engine for best pizza near me, but adding, and I'm really, really hungry, please find it fast. The extra bit doesn't help the search algorithm. That's a great analogy. Exactly. Just adds clutter.
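Editor's note: the "extra tokens" point can be made concrete with a rough count. The sketch below splits text into words and punctuation marks, which is only an approximation: production LLM tokenizers use subword schemes, so real token counts will differ. The names `rough_tokens` and `instruction_share` are hypothetical, and the number it produces says nothing about output quality; it only shows how much of a prompt is the actual request.

```python
import re


def rough_tokens(text: str) -> list[str]:
    """Approximate tokenization: runs of word characters, plus each
    punctuation mark as its own token. Not a real LLM tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text)


def instruction_share(instruction: str, extra: str) -> float:
    """Fraction of the combined prompt's rough tokens that belong to
    the core instruction rather than the added emotional language."""
    core = len(rough_tokens(instruction))
    total = core + len(rough_tokens(extra))
    return core / total if total else 0.0


# The pizza example from the episode.
core_request = "best pizza near me"
padding = "and I'm really, really hungry. Please find it fast."
share = instruction_share(core_request, padding)
```

With this rough counting, the core request is a minority of the tokens in the padded version of the pizza query, which is the dilution the hosts are describing.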

09:47

Whereas neutral, clear, direct instructions, they act like a noise filter. Okay. Prompts like mandatory requirement, ensure compliance, only state the name, they cut through ambiguity, they give the model a very clear path, they're like technical specs for a machine. And machines like clear specs. They really do. Honestly, I still catch myself adding please sometimes. It's a hard habit to break. We're just so used to communicating like humans. Yeah, I can admit I do that too.

10:14

It feels weirdly rude not to sometimes, but... The research shows it's not helping the AI. Nope. It's just adding noise. So the key reason emotional language is noise is because it distracts the AI from the core task. It makes it process irrelevant data. That's the essence of it. It doesn't understand the emotion. It just sees more words to factor into its prediction, which muddies the water. So wrapping this up, what's the big idea here?

10:40

The main takeaway from this deep dive. I think the big idea is profound, but also really simple. Stop wasting time trying to play psychologist with AI. Right. No more carrots and sticks. Exactly. Threatening an AI doesn't make it smarter. We saw it can actually make it perform worse, especially on logic tasks. And being nice, while maybe making us feel better, doesn't really boost its performance either. So if you really want to unlock the power of these tools, what's the critical skill? It's

11:10

prompt engineering. It's about clarity and precision. Give detailed instructions. Provide clear context. Offer examples. That's called few-shot learning, giving it a couple of examples of what you want. Specify requirements clearly. Like being a good project manager for the AI. That's a perfect way to put it. Instead of trying to manipulate nonexistent emotions, be an excellent project

11:30

manager. Be a dedicated guide. Whoa. Imagine the kind of precision you could get if every prompt was like a perfectly calibrated instrument. Yeah. That's the fastest, most effective path to making AI a truly powerful assistant. This deep dive really highlights that, doesn't it? The clarity of your instruction, not your tone, is what matters for the AI's success. It's about

11:51

precision, not psychology. And maybe it makes you think, if an AI works best with clear, unambiguous directions, what does that say about how we communicate? With machines, sure, but maybe with each other, too. That's a really interesting thought. As you keep using AI, maybe reflect on that. Are you being a sycophant, a tyrant, or are you being a meticulous architect of information? Something to definitely mull over. Thank you for joining

12:18

us on this deep dive. We hope this gave you a shortcut to being well-informed and maybe changed how you talk to your AI assistants. Hope so. Until next time, keep digging deeper.
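Editor's note: the prompt-engineering advice from the closing minutes (detailed instructions, explicit requirements, a couple of examples) could be assembled into a prompt roughly like this. It is a minimal sketch under assumptions: the `Input:`/`Output:` layout and the function name `build_few_shot_prompt` are illustrative choices, not a format the episode or any particular model prescribes.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: Sequence[Tuple[str, str]],
    query: str,
    requirements: Sequence[str] = (),
) -> str:
    """Combine a plain instruction, optional requirements, worked
    examples, and the new query into one neutral prompt string."""
    lines = [instruction.strip()]
    if requirements:
        lines.append("Requirements:")
        lines.extend(f"- {req}" for req in requirements)
    # Each example shows the model the expected input/output shape.
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
    # The final, unanswered pair is the actual request.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    instruction="Classify the tone of each sentence.",
    examples=[
        ("Your effort is appreciated.", "positive"),
        ("Ensure compliance.", "neutral"),
    ],
    query="Failure to comply will result in system reset.",
    requirements=["Only state the label."],
)
```

The prompt carries no praise or threats: everything in it either specifies the task or demonstrates the expected output.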

Transcript source: Provided by creator in RSS feed
