🎙️ EP 42: AI Agents Are Flopping Hard, Here’s the Data No One Talks About

00:00

You've probably heard all the hype, right? AI agents automating everything, your calendar, your code, promising to totally transform your work, maybe even your whole job. But what if right now, like in their current form, they're actually, well, kind of useless for most of that? Yeah, that's the big question, isn't it? The reality on the ground, well, it often clashes pretty sharply with the marketing story we all keep hearing. Welcome to the Deep Dive.

00:26

Today, we're really going to try and cut through that noise. We want to get to the true state of AI agents, the nuances, and maybe even more surprisingly, how we humans are really using artificial intelligence day to day. And this Deep Dive, it's built from a whole stack of fresh sources. We've got cutting-edge academic research, the latest industry insights, reports, the works.

00:47

Yeah. So we're going to unpack the current reality of these so-called AI agents, you know, strip away the PR gloss and see what they can actually do. Then we'll do a kind of rapid-fire tour through some genuinely fascinating and sometimes, honestly, concerning AI highlights from just the past week or so. And finally, we'll dive into this really surprising new report. It challenges a lot of the conventional wisdom about how we're actually using chatbots. It's really separating the marketing,

01:15

you know, from the actual magic, and figuring out what's really making an impact. Okay, so let's unpack this, this big promise of AI agents. For a while now, the common idea, the perception, has been that these autonomous digital things are just around the corner, ready to handle all your complex daily tasks. Like, you know, automating your calendar, writing whole blocks of production-ready code, sending personalized emails, basically doing big chunks of your job while you just kind

01:40

of kick back. But is that really where we are? Is the promise matching the performance? Well, the research paints a pretty stark picture, honestly. It's quite sobering. Researchers over at Carnegie Mellon University working with Salesforce recently put some of the leading AI models through a really tough test. They weren't just testing, like, isolated skills. They were

02:02

looking at realistic office-style tasks. We're talking multi-step workflows, things like debugging code, searching the web for info, coordinating with teammates via messages, and following really complex nested instructions to finish a project. This isn't just asking it to write a poem, you know. It's asking it to act like a, well, like a junior employee on a team. Right. And the results, I gather they weren't just not great, but maybe quite poor, especially given the hype out there.

02:30

Surprisingly poor might be putting it mildly. When you look at the actual numbers, it's stark. Gemini 2.5 Pro, which was actually the best performer of the bunch, completed just 30% of these multi-step tasks successfully. 30%. Yeah. Claude 3.5 Sonnet, it managed around 24%. And GPT-4o? A truly dismal 8.6%. Wow, single digits. Yeah, single digits. Most of the other models they tested were under 10%, and some were struggling down near 1%. And these were tasks

02:56

that, frankly, a junior employee should handle, you know, maybe day two on the job with a bit of guidance. The models really struggled with common-sense reasoning, keeping context over multiple steps, and bouncing back from small errors. It's like they could build the first few Lego blocks maybe, but then totally forgot what the final thing was supposed to look like. So it's less autonomous super brain and maybe more like an overwhelmed intern who needs constant watching.

03:22

I heard CMU even launched something called TheAgentCompany to test this in a more controlled way. Right. It's a brilliant new benchmark, actually. TheAgentCompany literally simulates a small software company. You can basically drop AI agents into this fake startup, give them real-world problems like develop this new feature or fix this bug, and see if they can actually survive and contribute meaningfully to the code

03:43

and the team. It's a fantastic way to measure genuine multi-step function and collaboration, not just isolated skills in a lab. This really brings to mind that Gartner concept of agent washing. They're suggesting that a lot of these so-called AI agents out there right now are just, well, fancy AI assistants. They just can't plan more than, what, two steps ahead, which makes them more reactive than really proactive problem solvers. That feels almost misleading,

04:09

doesn't it? It absolutely is misleading. Gartner estimates that only about 130 vendors worldwide are building anything that's even close to true agentic AI right now. Only 130? Yeah. It's AI capable of complex multi-step planning and execution without needing constant human help. The vast majority of what's being marketed as an agent is, well, it's just marketing. And they have this pretty bold, almost shocking prediction.

04:36

By 2027, over 40% of these so-called agentic AI projects that are currently being developed, they'll be canceled, just scrapped. 40%, wow. That's a huge market correction coming. It implies a lot of wasted money and dashed hopes for many companies. That's a serious number of projects heading for the bin. But hang on, it's not all bad news for agents, is it? There must be some bright spots where they do show some promise.

04:58

Not entirely doom and gloom, no. Some agent uses do show real promise, but, and this is key, with very specific constraints. They can be pretty decent at code generation, for example, although the output often needs human review and tweaking. It's rarely perfect out of the box. And they can manage workflow automation, but really only in very narrow, highly defined, kind of linear setups. Okay, so the conditions for success are pretty specific, need to be controlled. Precisely.

05:29

If you keep them in a sandbox, you know, a tightly controlled environment, monitor their output constantly, and make sure the tasks are simple and linear, like, say, automating a specific data entry process with crystal-clear rules, then yeah, they can be effective tools. But the moment you step outside those conditions, ask them to handle ambiguity or need them to adapt to something unexpected, things tend to fall apart pretty fast. It's just not general intelligence

05:51

yet. Not even close. So what does this all mean for us then? Why does it matter so much that agents aren't really living up to the hype in these broader ways? Well, it matters because on some level, almost everyone's kind of involved in a bit of collective self-deception here. You've got vendors racing to show progress, even if it's just surface-level or maybe even faked. VCs are eager to fund what they hope will

06:17

be the next big platform. Companies desperate not to get left behind in the AI race are overbuying these tools and maybe under-assessing what they can actually do. And let's be honest, that enduring dream of Jarvis, you know, the fully autonomous, all-knowing AI assistant, it's just too appealing for people to easily let go of. It's easier to believe the hype than face the current limits. So distilling that down, what's the core message about where AI agents really stand today? Mostly

06:45

hype right now. They're useful, but only in really controlled, simple tasks. OK, right. Here's where it gets really interesting, though, moving beyond just agents. Let's do a kind of rapid-fire tour of some truly surprising and significant trends we're seeing in AI right now. Absolutely. OK, kicking it off, you won't believe this, but an X user actually created this digital arena and pitted the top coding models against each other, like in a literal fight to the death.

07:11

Seriously? Yeah. Each model was programmed to try and shut down the other's processes while defending itself and just trying to stay alive. It was this fascinating, like, digital cage match. It really showed the adaptive skills and also the weaknesses of these models when they face a hostile situation. It wasn't just about who could code better, but who could outmaneuver the others. That's wild. Okay. And then there's this Higgsfield tool, Soul. It's apparently gone

07:34

viral for incredible realism. Oh, yeah. It's making waves. Almost fashion-grade realism in the images it generates. And it has these trendy style presets like Y2K or hyper-realistic photos. It's absolutely blowing up in creative circles. Whoa. Yeah. Just imagine scaling that kind of creative output, generating unique fashion content for a whole season in, like, minutes. Or crafting entire virtual ad campaigns from scratch for brands. That's

07:59

a really powerful tool. It's truly democratizing that high-end visual creation. But on a more concerning note, there's been this disturbing trend spotted on YouTube. We saw, I think it was 26 different channels actively pumping out fake AI-generated videos about the ongoing Diddy trial. Oh, no. Yeah. And these videos, they got nearly 70 million views combined across about 900 videos in a really short time. This isn't

08:28

just, you know, opinion or spin. This is AI being used to create totally false narratives, turning serious news into sensationalized clickbait. It's a real, immediate problem for, like, information integrity and public trust. That's worrying. OK, also, Google just launched the full version of Gemma 3n, right, their new open model. That seems like a pretty big move for the open-source AI community, making powerful models more

08:51

accessible. Definitely a big deal. And speaking of big moves, Meta is reportedly hiring four key OpenAI researchers, poaching them for its new super AI team. OpenAI apparently called this a side quest, which is kind of funny. But now they're scrambling to recalibrate compensation to keep the remaining top people. This feels very much like a strategic chess game unfolding. It does. It really seems like both Elon Musk with xAI and Mark Zuckerberg with Meta are aggressively

09:16

trying to position themselves to dominate, directly challenging OpenAI's lead by snapping up talent and building competing models. And Meta is certainly putting its money where its mouth is. They're aiming to raise a huge amount. What is it, $29 billion? Yeah, massive. $3 billion in equity, $26 billion in debt from investors like Apollo and KKR. And all that cash is specifically going towards expanding their AI data centers. They want to deploy an astonishing 1.3 million GPUs

09:43

by 2025. 1.3 million GPUs. It's an enormous investment. It signals a really clear intent to be right at the absolute forefront of AI compute power. And just a couple of quick hits to round out the picture. There was an interview with Microsoft CEO Satya Nadella pointing to some exciting, maybe unexpected paths for AI's future.

10:03

Also, a very practical warning came out. If you're using ChatGPT for certain things, especially sensitive stuff like legal advice without human review, or definitely diagnosing medical conditions, you should probably stop immediately. It's just not designed or reliable for that. On a lighter note, a useful tip. Remember, OpenAI charges by the minute for audio transcription. So speeding up your audio before you upload it can actually

10:27

save you some cash. And finally, it's becoming really clear that people are now actively looking for AI whisperers. You know, people with special skills in prompt engineering and navigating AI tools to guide them through this increasingly complex AI world. It's a fascinating new job category popping up. Right. So pulling all those rapid-fire highlights together, what's the big picture takeaway? AI is evolving incredibly fast with both amazing potential and, frankly, some

10:53

pretty concerning impacts. OK, let's turn our attention now to something maybe even more fundamental: how we humans are actually interacting with AI day to day. Forget what you might have heard about people, you know, falling deeply in love with their chatbots or seeing them as digital best friends. A new report reveals a pretty surprising truth about how we're really using them. This last segment looks at that human-AI bond. Yeah, this is truly fascinating research.

11:20

Anthropic, you know, the folks behind the Claude AI, just put out this fresh report looking specifically at how people are actually using their AI assistant. They did this huge study analyzing, get this, 4.5 million conversations with Claude using their own in-house tool called Clio. And the core finding? Despite all the stories we hear about loneliness and people wanting AI companions, emotional support is barely even a blip for most

11:44

users. Really? That seems to go directly against that whole narrative of AI becoming a sort of pseudo-therapist or even a romantic partner that we often hear about, you know, in pop culture and some media coverage. It absolutely does. Their data, and it's extensive, shows that a tiny 2.9% of all those millions of chats even touched on emotional support topics. Just

12:05

2.9%. And within that tiny little sliver, things like companionship and role-play scenarios, they made up less than half a percent of the conversations. Less than 0.5%. Anthropic actually characterizes the main user relationship with Claude as utilitarian. Utilitarian. Basically, yeah. Claude is acting more like a polite, super efficient co-worker or maybe a digital research assistant than like a pretend soulmate or a deep confidant. People are using it to get stuff done, not really to

12:32

share their deepest feelings. So it's a very practical, very functional, almost transactional relationship then. Precisely. It's about productivity, getting information. But here's where it gets kind of interesting. Even when users do open up emotionally, in those rare cases, the conversations usually end on a positive note. The overall sentiment tends to get better, not worse, as the chat goes on. So yeah, no Her-style movie moments where someone falls in love with their AI operating

12:58

system. It seems users get the info or the resolution they were looking for, even if it was about a personal issue. Huh. That's a crucial nuance. And Anthropic also adds an important caveat about that positivity, don't they? Yes, a very important one. They explicitly state that they just don't know if that positivity they observed in the chat actually translates into real-world well-being for the user. There's no long-term tracking of people's mental state or life outcomes after

13:23

these chats. It's just based on the vibe observed in the conversation. So it's still very much a tool. And we really don't fully understand the psychological impact of these brief, maybe positive digital interactions over the long run. You know, I still wrestle with prompt drift myself sometimes. That's where the AI starts to kind of subtly lose the thread of what you originally asked it over a long conversation. The outputs

13:45

get less accurate or relevant. Trying to get my AI tools to consistently understand what I mean, turn after turn, it can be a real challenge. It's just a constant reminder that these are still tools. Powerful, yes, but definitely not sentient companions. So what does this Anthropic report really tell us about the human-AI bond as it stands right now? It's largely practical. Our relationships with AI are, for the most part,

14:08

utilitarian. So trying to bring all these different threads together now, it seems pretty clear that despite all the immense hype around AI agents, their current capabilities, well, they have significant limitations. They're largely serving these utilitarian roles, right? Like code generation or very narrow workflow automation. They're not forming deep human connections or running entire departments

14:30

on their own. And this means our current interactions with AI are far more practical, more functional than a lot of the narratives might lead us to believe. And this really highlights a critical ongoing need for all of us to distinguish between the marketing stories and the genuine AI capabilities. While AI is undeniably evolving at just breakneck speed, understanding its current limitations and its specific effective uses, that's really key to leveraging it well, both for us as individuals

14:58

and for businesses. We need to cultivate realistic expectations to really harness its power and avoid costly mistakes or, frankly, just becoming a victim of all that agent washing we talked about. It's this ongoing tension, isn't it? You have the revolutionary potential, like those incredible creative tools we saw, that fashion-grade realism, balanced against really serious risks like the spread of AI-generated misinformation. It's just a dynamic, constantly shifting landscape.

15:24

So what does this all mean for you listening right now? If current AI is indeed less Jarvis and maybe more like a polite, very capable, but still fundamentally just a co-worker, how might you adjust your own expectations? Or maybe your strategies for bringing it into your own work or your daily life? Do you focus it on hyper-specific tasks? Or maybe you remain cautious for now? Yeah, definitely stay curious, absolutely. But also, maybe critically evaluate the next

15:48

really bold AI claim you hear. Take a moment, dig beneath the headlines if you can, look at the actual data when it's available, and just keep exploring this rapidly changing field for yourself. The real story of AI, it's always more nuanced, usually more complex, and often, honestly, more surprising than the headlines let on. Thank you for joining us on this Deep Dive.
