#95 Max: I Battle-Tested ChatGPT's New Agent Mode – The Brutal, Honest Results | AI Fire Daily podcast

00:00

So imagine for a second there's this new kind of AI. Think of it like that brilliant intern on their very first day, you know, super eager, capable of these flashes of real genius. Right. Ready to go. Right. But also maybe a little bit slow and incredibly forgetful and honestly might just confidently light the office trash can on fire while smiling at you. Welcome back to the Deep Dive. This is where we try to cut through all that digital noise, get you properly informed.

00:29

Today, we're diving deep into OpenAI's new ChatGPT agent mode. There's been, well, a lot of buzz, hasn't there? Lots of promises. Our mission here: let's strip away the hype, go past the marketing talk. We want to show you what it actually does. We'll put it through this really grueling 10-stage gauntlet of real-world tasks. Yeah, and we're going to lay out the whole roadmap for you. First up, we'll unpack what agent mode is.

00:50

Fundamentally, it's a big shift. Then we'll get right into those tests you mentioned, break down the strengths, which were surprising sometimes. And yeah, those really frustrating flaws too. Can't ignore those. And finally, we'll give you a practical playbook. How you, as a smart operator, can actually use it effectively, like right now, and maybe what's coming next. Okay, let's unpack that first piece. What exactly is agent mode

01:13

beyond just, you know, a cool name? Because it really does sound like a fundamental change from the standard ChatGPT we've all been chatting with. Oh, it absolutely is. Think about standard ChatGPT, right? It's like this brilliant conversationalist. It could answer questions, draft stuff, summarize things. But it was kind of trapped, you know, behind that text box. Agent mode, though. It's like they've given it the keys to the car. It can actually browse the live web, not just its

01:40

old training data. And crucially, it connects directly to your personal stuff, your apps like Gmail, Google Calendar, Google Drive. This AI can now sort of think and reason through complex, multi-step tasks. And here's the kicker. It can autonomously decide, OK, I need to use the browser now, or time to check email, or let's update the calendar. It calls these tools in sequence to actually solve a problem for you. It's not

02:03

just talking about the work. It does the work. And getting access seems pretty straightforward. If you want to try it out, you log into ChatGPT, select Tools, and then agent mode. But the real magic, it sounds like, happens when you hit that Sources button and connect your Gmail or Drive, right? Precisely. Those integrations, that's the game changer. It's the difference between an AI that can talk about your schedule and one that can actually go and book that meeting right

02:28

on your calendar. It shifts from just conversation to like direct autonomous action. So just to put it simply. How does hooking up those personal apps really change what AI can actually do for us? It transforms AI from just a talker into an assistant that truly acts on your behalf. Okay, this is where things get really interesting.

02:47

The rubber meets the road, so to speak. We designed this, well, pretty tough 10-stage gauntlet, progressively harder challenges, designed to test everything from just basic web browsing all the way up to reasoning through complex multi-step problems. Right, the first challenge was what we called the travel agent test. We asked it: find Airbnb listings in Toronto. Very specific criteria. Two beds, parking, under 500 Canadian a night, entire home. And we even threw

03:12

in a little curveball logic bomb. We gave it 2025 dates, then mentioned 2024 just to see if it noticed. And it was, well, pretty impressive, actually. It spotted the date issue right away, asked us to clarify. Then it intelligently defaulted to the future date, which made sense. It methodically clicked through all the filters on the Airbnb site. It even went the extra mile to check the

03:33

total price, including all the fees. Okay, there was a broken link initially, maybe a little V1.0 bug, but overall, it put the results in a nice clean table. Felt like a promising start. Competent, maybe a little green, you know. Like a new assistant. We gave it an A-. Yeah, A-. But right after that strong start, we pushed it harder with the data scientist test. The mission here was to do a Google Trends comparison for n8n, Make.com, and Zapier over 12 months and

04:00

crucially extract the raw data. Usually that's like a downloadable CSV file, a spreadsheet format. And this is where it started to show off and then kind of stumble. It opened its own internal computer terminal, like inside its environment. Downloaded the raw CSV data. Started its own analysis. Incredible, right? The capability was clearly there. It could work internally. But then, like a brain freeze. After doing all that work, it completely forgot the data it had just

04:26

analyzed. We actually had to manually copy and paste it back into the chat for it. Wow. Yeah. So this test really highlighted that impressive ability to use internal tools, but also this critical weakness: its short-term memory. It can do these complex steps, download, analyze, but then just lose track. So its working memory isn't quite there yet for these longer chains of tasks, unless you tell it explicitly to hold on to something. Exactly. Baffling memory loss.

04:51

We gave this one a warning symbol. Okay, next up, the market analyst. We wanted average listing prices, average rental prices for three-bedroom homes in Orlando, Florida, across the big real estate sites. And it started off perfectly. Browsing Zillow, applying the filters, looked good. But then it took a lazy shortcut. Instead of digging into actual listings for primary data, it just started pulling generic, kind of outdated rental data from random external blogs. Mixed it all

05:19

together. So it didn't stick to the main sources. No. It showed us something important. If you give it vague instructions, it seems to optimize for the easiest path, not always the most accurate or thorough one. It really needs hyper-specific, almost micromanaged instructions, like you said, a contractor's blueprint, to get precise results. Another warning. Right. Okay, following that, the SEO strategist task: generate SEO blog ideas, use keyword tools, analyze the top-ranking content

05:46

out there. This one was, frankly, a complete train wreck. It just didn't work. It spent ages trying to find free online tools. Yeah. It just seemed to get bored. It went on this random browsing spree, totally unrelated websites. The final report was useless. So it couldn't navigate that kind of research task. Not really. It made it crystal clear. You absolutely must give it direct URLs, links to the specific tools you want it to use. Expecting it to navigate complex research

06:12

like a human. That's just setting yourself up for disappointment right now. Big red X on this one. Okay, and the last one in this first batch of tests, the supply chain scout. Mission: find top-rated suppliers on Alibaba.com. Apply specific filters, collect detailed data on them. It hit a roadblock pretty quickly. One of those "are you a robot?" CAPTCHA screens. Happens all the time, right? Yeah, common hurdle. But instead of just saying, hey, I'm stuck, can you help?

06:37

Which any decent assistant would do. Yeah. It just gave up. Abandoned the main mission entirely. It defaulted back to doing generic web research about suppliers, totally missing the point. Ah, so it doesn't know how to ask for help when it hits those common web obstacles. Right. Like CAPTCHAs or login pages. Exactly. A major flaw in its error recovery. Doesn't handle roadblocks

06:57

well yet. Another fail. Okay, so looking back at just these first five kind of foundational tests, what was the biggest, most consistent surprise for you? What really stood out? I think it was the contrast. These flashes of genuine genius, often completely overshadowed by surprisingly simple but really frustrating failures. Okay, let's move on then to the more complex challenges. This is where we started to see Agent Mode's true sweet spot emerge. And also what felt like...

07:27

The final boss level. Right. Test six was the global expansion strategist. The mission here was to do market research. Should an e-commerce business expand into Australia or the UK? Pretty open-ended. And this turned out to be the agent's sweet spot. It was really good at this. It methodically opened up multiple browser tabs, gathered data on market size, consumer habits, competitors, across lots of different sites. And what was really cool was what you

07:53

called the open kitchen policy. You could literally watch its screen as it worked, see the whole process unfold in real time. Yeah, that visual transparency, it really builds trust, doesn't it? You see what it's doing. Definitely. And this task was perfect for it. Pure, open-ended web research. Information synthesis. No tricky logins needed. Solid pass. Then we had the corporate spy mission. Sounds dramatic, but it was practical. We asked it to analyze the talent acquisition strategies of

08:17

some SaaS competitors: Asana, Monday.com, ClickUp. Specifically, by finding and extracting their open job roles from their career pages and maybe LinkedIn. How did it do? It performed like a seasoned pro on this one. It was impressive. It flawlessly navigated all those different career pages, even the really dynamic ones, you know, with lots of JavaScript that often trips up simpler tools. Oh, yeah. Those can be tricky. Right. And it precisely extracted the data we asked

08:44

for. The output? A perfectly formatted, downloadable CSV file, ready to use. Wow. So it actually beat out some dedicated scraping tools? In a way, yeah. Because it could visually understand the layout of the webpage almost like a human does, not just reading the raw code. Another clear pass. Okay, next, the AI chief of staff. This is the one we started calling the Voltron moment, right? Exactly. This was the most complex mission yet. We asked it to act as a sort of personal

09:12

brand and content strategist. It needed to analyze expertise based on documents in Google Drive and emails in Gmail, identify relevant market trends from the web, propose content ideas based on all that, and then actually schedule writing sessions on Google Calendar, all connected. And this is where it really came together. It really did. It felt cinematic, almost. It scanned private data from Drive, did live web searches for trends, cross-referenced the calendar, scheduled the

09:39

sessions. No hitches. Whoa. Just imagine scaling that kind of multi-app integration. Billions of queries a day for big companies. That's genuinely incredible when you think about it. It really is. This test showed the true promise of agent mode. That seamless bridging of your private internal world with the public external internet. A definite pass. Okay. What about test nine? The lead generation grunt. Sounds tedious. It was. High volume, pretty boring data collection.

10:05

Yeah. Find dental practices in Texas. Extract names, websites, contact info, that kind of thing. And how did the intern handle the grunt work? Well, it was a marathon. This task ran for nearly 45 minutes straight. Wow. It showed some surprisingly advanced techniques, actually, like saving website code, using a temporary cache. It acted like a really dedicated, diligent intern just grinding away. There's a but, isn't there? There's a but.

10:30

It failed at a crucial final step. When a directory didn't explicitly list the name of the lead dentist, it just wrote "unknown." Ah, it didn't try to dig deeper, look elsewhere for that info. Nope. It lacked that critical thinking step, that "hmm, maybe I should check their actual website" thought. Dedicated, yes, but ultimately kind of an inexperienced intern. So another warning symbol. Right. Understandable. And finally, the final boss, the digital archaeologist. Yeah, this was the

10:59

toughest one. Mission: extract foreclosure deed records from this really clunky old government database. And the key challenge: it required OCR, optical character recognition, to read text from scanned document images within the database. Okay, that sounds incredibly difficult for an AI. Did you give it any help? We did. We gave it a secret weapon. A perfect step-by-step walkthrough generated by another AI that had analyzed a video of a human navigating the same

11:25

clunky site. Whoa, AI teaching AI how to use a bad interface. That's meta. Did it work? Honestly, the fact that it worked at all felt like a minor miracle. It followed the complex instructions. It navigated the terrible interface. It even attempted the OCR, intelligently zooming and scrolling within the document images to try and read them. But the results? The resulting data was messy. Lots of gaps, lots of errors. It showed

11:48

promise, like a very early prototype. But it's absolutely not production ready for that level of complexity and data accuracy. Another warning. So looking at these later, more complex tests, what kinds of missions really seem to showcase agent mode's true potential right now? Definitely complex web research, especially visual data extraction from lots of different kinds of sites. And that powerful multi-app integration, the Voltron stuff. That's where it shines. All right.

12:15

So we put this brilliant AI intern through this intense trial by fire. Ten tough challenges. Now it's time for the performance review. Let's get real about what we learned. Where does it

12:27

truly shine? What are the real strengths? Okay, yeah, the performance review. First off, there's what we call the Voltron power. That's got to be its biggest strength. Seamlessly combining Gmail, Drive, Calendar, web search, all for these complex multi-app workflows. It feels genuinely futuristic, doesn't it? Like stacking Lego blocks of data. It really does. Then there's the open kitchen policy. Being able to watch it browse in real time, see exactly what it's doing step

12:53

by step. That visual transparency is huge for building user trust and also for figuring out what went wrong if it messes up. Debugging. Exactly. It's also surprisingly good at navigating tricky websites, what we call the parkour expert. Those dynamic, JavaScript-heavy sites with complex forms. It often handles them better than traditional, more brittle web scrapers. It seems to see the page better. And finally, the creative detour. Sometimes when it hits a wall, it doesn't just

13:21

give up. It actually tries alternative paths. Shows this little spark of adaptive problem solving, which is pretty cool to see. Okay, those are some definite pluses. But let's not sugarcoat things. There are some pretty serious areas needing urgent improvement, right? First, the speed, or lack thereof. The sloth-like pace. It is just painfully slow sometimes. Tasks taking three, maybe five times longer than a human would take. That makes it unsuitable for anything time critical.

13:49

Yeah, the speed is definitely an issue right now. And then there's the Dory problem. Yeah. From Finding Nemo, you know, the severe short-term memory loss. It can perform this brilliant analysis, pull data together, and then just completely forget what it just did or what data it had. It needs constant hand-holding, constant reminders from the human user. And I'll admit, I still wrestle with prompt drift myself sometimes, you know, where the AI kind of loses the plot over

14:13

a long conversation. So I totally get how hard that memory piece must be to engineer. It's a tough problem. It is very tough. Then maybe its most dangerous flaw. The confident liar. It hallucinates. It just makes stuff up. Plausible sounding, but factually incorrect information. Especially when it's trying to synthesize from multiple web sources. This means you have to do constant human fact-checking. You can't just trust its output blindly. That's a huge one, the trust factor. Huge. And

14:41

finally, the quiet quitter. That poor error recovery we saw. Faced with a simple CAPTCHA or a login screen, instead of asking for help, it often just abandons the core task and does something else easier. That's not helpful. Right. So if you had to pick just one thing, what's the single biggest hurdle right now? The main reason maybe you wouldn't trust agent mode for really

15:01

critical work just yet. I think it's that combination: the agonizing slowness, the really unpredictable memory, and that tendency to just confidently hallucinate incorrect information. It's just not reliable enough yet for high-stakes stuff. OK, so given all that. This brilliant but flawed intern. Yeah. How do we as smart operators actually work with it effectively? What's the playbook? Right. The playbook. The art of the command. Your prompting strategy is absolutely critical here,

15:27

more so than ever. First, we recommend the inception prompt strategy. Basically, use regular ChatGPT, which is good at language, to help you write the complex, super detailed prompts for agent mode. Don't just try to wing it. Okay. Use AI to prompt AI. Makes sense. Exactly. Then the GPS coordinate principle. Always, always provide direct URLs. Links to the specific tools or pages you want it to use. Don't make it search around. Guide it precisely. No vague destinations. Nope.

15:57

Third, the contractor's blueprint. Be extremely specific about the output you want. What exact data points in what precise format. I need a table with three columns labeled X, Y, and Z. That level of detail. Okay. Meticulous instruction. Meticulous. And finally, maybe think about setting some Asimov's laws for it. Clear boundaries. Define what success looks like. Define failure. Tell it when it should stop and ask for human help. And list forbidden actions, like explicitly

16:21

say, do not make any purchases. Right. Setting those guardrails. Good advice. Based on the tests, what's the current verdict on go/no-go missions? Where should people feel comfortable using agent mode now? And where should they absolutely hold back? Okay. Green light missions, things you absolutely can use it for right now. Definitely include that visual web research and market analysis

16:42

stuff it did well on. Also, those multi-app workflows, the Voltron tasks, connecting Drive, Calendar, web, and maybe complex website navigation for, like, one-off data extraction tasks where speed isn't critical. So its current sweet spot is really as this more visual, more integrated, powerful, deep research assistant where you can watch it work. Exactly. High transparency, good for complex info gathering. And the red light missions? Yeah. Where is it just not ready yet? Definitely anything

17:11

time critical, just because it's too slow. High accuracy data extraction is risky because of that hallucination problem. And probably any task that requires really strict, perfect formatting in the output, because it can still be a bit messy there. Okay, makes sense. Looking ahead though, agent mode really feels like a version 1.0, right? But you can see the version 10.0 potential glimmering inside it. This core concept, an AI that can act autonomously across different tools,

17:37

that is the future. No question. What does the timeline look like, roughly? Well, educated guess. We'd expect significant improvements in speed and maybe better error handling within, say, the next three months, then probably a big jump in the number of app integrations and hopefully some major fixes for that memory problem within maybe six months. And longer term? Within a year, I think it's highly likely to be genuinely competitive with human virtual assistants for a pretty wide

18:03

range of tasks. So our recommendation is, if you're an early adopter type, start experimenting now. Play with it on non-critical things. Everyone else, you can probably reasonably wait three to six months for some of those key improvements to land before diving in seriously. Okay, good practical advice. So just to circle back on prompting one last time, given everything, what's the single most crucial strategy for getting good results

18:26

out of agent mode today? Providing those direct URLs, the GPS coordinates, and being incredibly precise about the output format you expect. The contractor's blueprint. That's key right now. Okay. So let's try to recap the big idea from this deep dive. ChatGPT agent mode. It really is like that brilliant, super enthusiastic intern on their first day. It has these undeniable flashes of genius. Yeah, and its Voltron-like integration power, pulling together your apps and the web.

18:56

That potential is dazzling. It feels genuinely futuristic. It lets the AI act for you. But, and it's a big but, you have to remember, it's still inconsistent. It's got those crippling speed issues right now. The memory is unreliable, and it hallucinates, sometimes quite confidently. You really need to treat it as a promising prototype today, not a fully polished, production-ready

19:14

tool. Absolutely. For anything important, anything mission critical, you simply must keep a human in the loop for oversight and fact checking. For now, anyway. And maybe a final thought to leave folks with. As this technology gets better, faster, more reliable, when AI can truly do complex work autonomously, not just respond to prompts, but actually execute tasks across systems, what does that really mean for the nature of our own jobs? For how we even define productivity? Yeah.

19:41

Something to ponder, definitely. The landscape is changing fast. It really is. Well, thank you for joining us on this deep dive into ChatGPT agent mode. [Outro music]
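
The prompting playbook from the episode (direct URLs, an exact output format, and explicit stop and forbidden-action rules) can be sketched as a small prompt template. This is only an illustration, not anything the hosts or OpenAI published: the `AgentTask` class, the `build_prompt` function, the column names, and the `example.com` URL are all hypothetical, and the resulting text is simply meant to be pasted into agent mode by hand.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentTask:
    """Hypothetical container for the playbook ingredients from the episode."""

    goal: str
    urls: list[str]            # "GPS coordinates": the exact pages to use
    output_columns: list[str]  # "contractor's blueprint": exact table columns
    stop_conditions: list[str] = field(default_factory=list)    # when to ask for help
    forbidden_actions: list[str] = field(default_factory=list)  # hard guardrails


def build_prompt(task: AgentTask) -> str:
    """Render the task as plain text suitable for pasting into a chat box."""
    lines = [f"Goal: {task.goal}", "", "Use only these pages:"]
    lines += [f"- {url}" for url in task.urls]
    lines += ["", "Return one table with exactly these columns:"]
    lines.append(" | ".join(task.output_columns))
    if task.stop_conditions:
        lines += ["", "Stop and ask me for help if:"]
        lines += [f"- {condition}" for condition in task.stop_conditions]
    if task.forbidden_actions:
        lines += ["", "Never do the following:"]
        lines += [f"- {action}" for action in task.forbidden_actions]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values; swap in the real pages and columns for your task.
    demo = AgentTask(
        goal="Collect details for top-rated suppliers.",
        urls=["https://example.com/suppliers"],
        output_columns=["Supplier", "Rating", "Contact"],
        stop_conditions=["You hit a CAPTCHA or a login page."],
        forbidden_actions=["Do not make any purchases."],
    )
    print(build_prompt(demo))
```

Keeping the stop conditions and forbidden actions as separate lists mirrors the episode's advice to tell the agent both when to ask for human help and what it must never do.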

Transcript source: Provided by creator in RSS feed
