🎙️ EP 190: AI Agents Just Failed the Office Test. Google Dropped a 4D Vision Breakthrough

00:00

It's sort of the great paradox of January 2026, isn't it? It really is. We have these incredible models, GPT-5.2, Gemini 3, and they can do amazing things. They can write poetry, they can code, they can even, you know, fake empathy pretty convincingly. Right. But then you put them in a digital office. You give them a Slack login, a Google Drive folder, and a basic analysis task. They just fall apart. It's like a nervous intern on their very first day. That's it. It

00:25

is the intern paradox. And looking at the data we have today, it feels like the biggest reality check we've had in, what, the last two years? Oh, easily. Welcome back to The Deep Dive. It is Thursday, January 22nd, 2026. We're here to slow down, take a breath, and just try to make sense of the noise. And I'm here to help connect some of those dots. So today is really about taking a hard, honest look at where we actually

00:47

are. We're going to unpack why this idea of the autonomous office agent just isn't working yet. We'll be looking at a new benchmark called Apex that is, frankly, a little embarrassing for the big labs. But we're also looking at the other side of that coin. Because while the brains of the AI are struggling with paperwork, the senses, I mean, vision, audio, they've just taken this quantum leap forward. Right. We're talking about

01:14

Google's new 4D vision. Yeah. Pretty surreal story involving Liza Minnelli of all people. It is such a strange week. The tech is stalling in some areas and just hitting warp speed in others. Before we jump into the data, though, I kind of have to make a vulnerable admission here. I honestly still wrestle with prompt drift myself. I was at my desk this morning trying to get an agent to format a report from three different documents. And I must have spent 45

01:42

minutes just tweaking instructions, you know, correcting its errors, telling it to stop making updates. And then it hit me. What's that? I could have just done the work myself in 10 minutes. Oh, absolutely. That is the automation tax. You pay it in patience just to prove the machine can do it. Exactly. And that feeling, it's not just me being impatient. It turns out there's hard data to back it up now. So let's start with segment one, the Apex failure. Yeah, this is

02:07

a really significant report. It's called the AI gap. For, I'd say, the last 18 months, the narrative from OpenAI, from Google, from Anthropic has been pretty clear. AI is ready for real jobs. Exactly. We've been told they can be analysts, paralegals, executive assistants. Yeah, just give them the keys. Right. But we haven't really had a standardized way to test that claim. Not in a messy, real-world setting. That's the key. We had tests for write a poem or solve this math

02:37

problem. We did not have tests for be an employee. And that's where the Apex Agents Benchmark comes in. So this is not a multiple choice test. No, not at all. They literally dropped the top models, we're talking Gemini 3 Flash, GPT-5.2, Opus 4.5, into these simulated white collar workflows. So things like read the Slack thread, then cross-reference it with the PDF in Google Drive, and draft a reply to the client explaining why

03:01

we can't do the refund. Precisely. The stuff that's actually, you know, 90% of knowledge work. Yeah. It isn't just generating text. It's context switching. It's reasoning across different sources that might even contradict each other. And the results were? Well, I'm looking at this chart and shocking feels like an understatement. They were abysmal. So the winner was Gemini 3 Flash. You want to take a guess at the accuracy rate? I mean, I would hope for at least a passing

03:30

grade, maybe, what, 65%? 24. 24%. That was the gold medal. GPT-5.2 came in at 23%. And the others, like Opus 4.5 and the standard Gemini 3 Pro, were all hovering around 18%. That is just incredibly low. That means three out of four times the employee gets the task wrong. If I had a human intern with a 24% success rate, I mean, I'd have to let them go before lunch.

03:55

Or they'd just quit. The report notes that in most cases, the models just couldn't piece the information together. They'd hallucinate a policy that didn't exist, or they'd miss a key detail in the Slack thread because it contradicted the PDF. It really feels like a connective tissue problem. They can process the individual data points, right? They can read the PDF, they can read the chat, but stringing them together into a coherent chain of logic, that's where it all

04:20

breaks down. It is the perfect analogy for an intern. You ask an intern, what's the capital of France? They nail it. That's retrieval. Right. But if you ask them, look at these three messy files and tell me if we can legally fire this vendor, they panic. That's reasoning. It really makes you wonder about the timeline. We've all been preparing for this wave of autonomous agents

04:39

to take over our inboxes in 2026. So looking at these numbers, I guess I have to ask, is the dream of the autonomous AI employee officially dead for now? Not dead, just delayed. They lack connective tissue reasoning. That distinction is key. But while the current models are struggling, the labs aren't just sitting still, are they? No, not at all. And this brings us to the leaks and the new tools emerging this week. The rumor mill is really spinning. We've got a leak about

05:07

GPT-5.3, which is codenamed Garlic. Garlic? That's an interesting choice. I know, right? But the leak suggests a pivot. Instead of just bigger is better, which has kind of been the strategy for five years, this next version is supposedly all about cost, speed, and, crucially, reasoning. Which would be a direct answer to the Apex failure. Exactly. If they can crack that reasoning bottleneck without making the model 10 times more expensive to run, well...

05:36

That changes the entire equation. But while we're waiting for Garlic to maybe save the day, there are some tools right now that are quietly solving this workflow problem just by changing how we interact with the info. You mentioned NotebookLM earlier. Yeah, NotebookLM really feels like a sleeper hit to me. Everyone is chasing the big agent dream. But NotebookLM has evolved into this amazing all-in-one researcher. It's gone from just a summarizer to a synthesizer.

06:01

It has. The new workflows are really impressive. You can just dump in raw research papers, your notes, transcripts, and it doesn't just chat with you about them. It converts them. Right. I saw that workflow where it takes all that raw data and turns it into a slide deck outline, generates an audio overview, which sounds creepily human, and then it builds out comparison tables all in minutes. And that connects to this broader trend we're seeing, this idea of no PowerPoint.

06:28

Please tell me that means what I think it means. Well, it means we stop fighting with PowerPoint. The shift is towards creating consistent, full slides using scripts. So you write the narrative and the AI builds the visual container for you. No more dragging text boxes or aligning fonts. Exactly. That's the dream. We become editors-in-chief instead of slide designers. Yeah. But there was another tool leaked from OpenAI that sounds a little more managerial. Salute.

06:54

Salute is fascinating. It's an internal tool they're testing, and it's basically a project manager AI. You upload your files, you assign tasks to the AI, and it actually tracks the progress. So it's not just do this one thing. It's here's the project, now you manage the steps. It's trying to bridge that gap we saw in Apex. If the model can't reason across tasks on its own, maybe we just need a dedicated manager layer of software to force it to stay on track. Feels like a real

07:20

shift in our role, you know? With tools like Salute and NotebookLM, are we actually working less or are we just managing the AI more? Managing more, but the output quality is exponentially higher. That feels like the trade-off of 2026. You're the conductor now, not the first violin. But let's talk about the business side of this. Because running these reasoning models, these manager layers, it is not free. Far from it.

07:45

And the economics are getting pretty ugly. We saw a report on Anthropic, the makers of Claude. Their margins just took a really significant hit. Dropping from 50 percent down to 40. That's a huge slice of profit. It is. And the culprit is exactly who you'd expect. The cloud providers. Google and Amazon hiked their server costs by 23 percent. It's the unsexy reality of the AI ecosystem, isn't it? You can have the smartest model in the world, but if you have to pay a

08:11

landlord to run it, you're at their mercy. It's the infrastructure squeeze, and it's driving these massive capital deals. We just saw the Saudi infrastructure fund team up with Humain for a $1.2 billion deal. Just to build data centers. $1.2 billion just for the plumbing. For the plumbing. But, you know, while the infrastructure is getting more expensive and the reasoning is hitting a wall, the creative output is finding these really interesting new business models.

08:39

This ElevenLabs story really caught my eye. It's a landmark moment, I think. They dropped a 13-track AI music album. And we're not talking about some, you know, anonymous generated lo-fi beats. This features Liza Minnelli and Art Garfunkel. Now, to be clear, because the ethics here usually get messy, Liza Minnelli did not go into a booth and record this, correct? No, she did not. But her estate and her team signed off on it. And that's the key. Yes. Full royalties

09:05

are being paid. The labels are actually involved. It's a licensed collaboration. They're essentially treating her voice as an instrument, like a Stradivarius, that can be played by the AI with the artist's permission. That's a complete 180 from all the lawsuits we saw back in 2024 and 2025. It suggests a future where an artist's prime voice can become immortal. Exactly. And Spotify is leaning into this whole vibe culture, too. Their new prompted playlists are getting shockingly good. You don't

09:35

search for a genre anymore. You just type in, make me a playlist that feels like rainy Tokyo nights. And it actually understands the semantic texture of that request. It's moving from metadata, like this is a rock song, to emotional data, like this song feels like heartbreak. That's it. So does the ElevenLabs album prove that artists and AI can actually coexist profitably? Yes. It turns legacy voices into a scalable, renewable asset. A renewable asset. That is a wild way to think

10:02

about a human voice. But speaking of sensing the world, we have to talk about the breakthrough of the week. This is the one that actually made me just stop and stare at my screen. Google's D4RT. Right. Dynamic 4D reconstruction and tracking. Now, we've had computer vision for a while. My car sees lane lines. My phone recognizes my face. How is this different? This is the difference between looking at a photograph and actually

10:24

stepping inside the room. Traditional computer vision looks at a 2D image and draws boxes around things. Cats, cars. Right. D4RT watches a video and builds a world model. A world model. Yes. It tracks every single pixel across time. That's the fourth dimension here, time. And from that flat video, it reconstructs a full 3D scene. So if I showed a video of me walking around my kitchen, it's not just seeing a man in a kitchen. It's actually... building the 3D geometry of

10:53

the fridge, the table, the coffee cup. And it understands exactly where the camera is moving in that space. Correct. And it handles the messy stuff that usually breaks these models. Motion blur, things getting blocked. What we call occlusion. Ah. Like if you walk behind the fridge, old models would think you just vanished. D4RT knows you didn't disappear. It knows you're just behind an object. Okay. And the speed. That is the whoa moment. Previous models, and this was cutting

11:18

edge just six months ago, would take maybe 10 minutes to parse a video like that. 10 minutes, okay. D4RT does a one-minute video in five seconds. Five seconds. Whoa. That's basically real time. It's anywhere from 18 to 300 times faster than anything else out there. It uses this unified query system. So instead of having one model track objects, another one get the depth, another one map the room, D4RT does it all in a single pass. I mean, imagine

11:45

the implications for... Well, for everything. Robotics, obviously. If a robot can understand the 3D geometry of a room in milliseconds, it can navigate like a human. But also surveillance, content creation. It's like the AI isn't just watching a movie anymore. It's building the movie set in its head instantly. It's superhuman perception. We're giving machines a spatial understanding of reality that might actually be faster than our own ability to process a scene. It's a fundamental

12:11

shift. So if machines can reconstruct reality this fast, what happens to truth in video evidence? We stop trusting our eyes and start trusting digital watermarks. That is a chilling thought. We're going to take a quick breather here. We've covered the failures of the office intern and the superhuman vision of Google. When we come back, we are going to look at the Empowered Utility Belt, the specific tools you can use right now to make your life a little easier. Stay with

12:38

us. And we are back. Okay, let's get tactical. We've talked about the big models, the high-level concepts, but what about the tools that actually save you time on a Tuesday afternoon? We called this section the Empowered Utility Belt. Yeah, there are some great ones this week. First up is something called Claude Cowork. This one is interesting because it's not a new app. It's

13:00

more of a workflow, right? Right. It's a 13-minute exercise, and it's designed to turn Claude from just a chatbot into a real thinking partner. It's all about priming the model with your context so you don't have to explain your job every single time you open a new chat. I love that. It's like onboarding your AI colleague one time instead of every single morning. What else you got? ChartGen AI. This one is for anyone who just drowns in data. It takes raw data from pretty much anywhere.

13:26

Facebook ad exports, TikTok analytics, just messy Excel sheets. And it turns them into professional charts in seconds. So no more fighting with Excel pivot tables. No more pivot tables. And then there's Locate Store. It's very specific, but very cool. You have a Google Sheet full of addresses. It instantly turns that into an interactive map with search filters. That's huge for logistics or even just planning a trip. But the one that really caught my eye for pure automation is Demonstrate.

13:55

Okay, how does that work? You record a browser task one time. So say, log into this portal, download the invoice, rename it, and upload it to Dropbox. You just do it once, Demonstrate records it, and then this is the kicker, it deploys it as serverless code. So you don't just get a macro, you get a piece of software that runs in the cloud for you. Exactly. You are literally programming by doing. And finally, we have Calum.

14:18

Right, the AI calendar assistant. But it handles the hard stuff like multi-person meetings, shared availability, all the real world constraints. It's really trying to kill the email ping pong of scheduling. Of all of these tools, if you had to pick one, which of these actually saves you an hour of sleep tonight? Demonstrate. Automating browser tasks is the ultimate cure for boredom. I am with you on that one. Anything that can fill out a web form for me is a friend of mine.

14:43

Amen to that. So let's zoom out. What does all of this really mean for you, the listener, sitting here in January 2026? I think we're in what you could call a gap year. A gap year. Explain that. Well, the text-based agents, the brains, they're struggling. GPT-5.2, Gemini 3, they're getting barely 24% accuracy on real work. The brain is still learning how to file paperwork. Okay. But while the brain is stumbling, the senses

15:10

are just exploding. Exactly. Google's D4RT vision is seeing the world 300 times faster than before. ElevenLabs has solved the legal and creative puzzle of AI music. So the synthesis here is, the AI might not be ready to be your lawyer or your project manager just yet, but it can see your world and sing your songs better than ever before. The brain is lagging, but the eyes and ears have become superhuman. That's a really

15:37

powerful place to be. It means we have to stop waiting for some magic general manager AI and just start using the super sensory AI tools we have right now. Use the vision, use the voice, use the automation. Don't wait for the reasoning to be perfect. And if you want to start somewhere practical, I would highly encourage you to try that Claude Cowork 13-minute exercise we mentioned. It's a small investment that really pays off every time you open that chat window. It really

16:02

does change the dynamic. We're going to leave you with one final thought to mull over. Yeah, we talked about Google's D4RT, how it reconstructs a 3D world from a flat video in just five seconds. It's incredible. So here's the question. If an AI can reconstruct a 3D world from a flat video in five seconds, how long until it can reconstruct a better version of your workflow than you can? Something to think about. Thanks for diving in with us. We'll see you next time.

Transcript source: Provided by creator in RSS feed