🎙️ EP 144: Google Just Crushed TuesdAI, Gemini 3 Beats Everyone (Even GPT‑5 Pro)

00:00

Welcome back to the Deep Dive. For years, the big leaps in AI, they always seem to happen on a certain rhythm. We even had a name for it, right? ThursdAI. Well, you can go ahead and scratch that off the calendar. Google just completely reset the clock with TuesdAI. Their Gemini 3 launch was massive. And the metrics they dropped, I mean, they just changed the entire scoreboard. This thing scored a 1501 Elo on LMArena. And maybe more importantly, it crushed GPT-5 Pro

00:28

on Humanity's Last Exam. We're talking 37.4% to 31.6%. That is a huge margin, nearly six points. This tells me it's not just a smarter chatbot. The real signal here is that we're looking at an entirely new class of AI teammate, something that takes action. And that's exactly where we're focusing today. So for everyone joining us on this deep dive, our whole mission is to cut through all the noise, all the data, and just pull out the most important signals from this whirlwind

00:55

of AI updates. We do have a packed roadmap. We're going to unpack the raw power of Gemini 3, this new agent model, and really look at what those big benchmark scores mean for how you can actually use it. Then we're going to pivot to its unexpected competitor, Grok 4.1, the whole vibe-first model that tried to steal the spotlight. And we'll finish up with the really practical stuff, the new agents coming from industry and the just colossal funding that proves this shift is here

01:21

to stay. Right. The focus is understanding that shift from a simple chatbot to a true multi-step agent that can plan and build things. Okay. Let's get into it. We have to start with Gemini 3. I mean, the performance numbers alone basically set a new state of the art. It didn't just win. It sort of redefined the whole arena. That's

01:40

completely right. And when we say 1501 Elo on LMArena, for you listening, if you don't follow the leaderboards, LMArena is where these top models go head to head on really complex real-world stuff. That score, it's the new high-water mark for general AI intelligence. But the win on Humanity's Last Exam, that feels more consequential to me. That's not a trivia contest. It's a test designed to measure really deep cross-disciplinary reasoning. Winning there signals a huge leap in cognitive

02:06

ability. And it's built on tech that's pushing way beyond what we have now. They even mentioned a separate model, the Deep Think version, hitting 45.1% on ARC-AGI-2. And that test, that measures fluid intelligence, you know, abstract reasoning. That's the score that got everyone's attention. I mean, you had Elon Musk and Sam Altman both publicly congratulating them on the launch. Oh, yeah. When your biggest competitors stop to acknowledge the jump you just made, you know it's a big deal.

02:33

It absolutely is. But the real story here, it isn't just the score, it's the role. We have to talk about this agent upgrade. An agent is an AI that stops just answering your questions and becomes like a planning teammate. Yeah, it's about owning the outcome, not just finding the information. An agent can handle multi-step tasks. It plans out its actions on its own. And it can simulate the results of those actions

02:58

before it even does them. It's like it's stacking Lego blocks of action and data, not just handing you one block. And that all starts with what you give it. The input. Gemini 3 is natively multimodal. And this isn't just a theory. You can give it anything. A complex PDF, a photo, a technical diagram, even a rough scribble on a napkin. It uses all of it as context for its plan. The examples the Google CEO gave were just

03:24

stunning. Imagine sketching a website idea on a napkin, you hand it to the AI, and it turns that scribble into a full working website with all the code. That's utility in seconds. Or think about education. You take a complicated physics diagram and it turns it into an interactive lesson where you can actually move things around, run

03:44

simulations in real time. And it scales up, too, for big business tasks. Like, it can analyze a long video of a golf swing, find a technical flaw, and then suggest the exact drills you need to do to fix it. And the output isn't just text anymore. It can generate dynamic layouts with interactive tools built right in, you know, like actual data sliders that appear in your search results. Right. That's a perfect example of it moving from just giving you text to actively

04:06

building tools for you. And we should circle

04:08

back to Deep Think for a second. That version is built for the really heavy lifting, for multi-hop logic. What that means is the model can link distant, non-obvious bits of information together from totally different data sets. That's a sign of real abstract reasoning, not just better memory. It's still in safety testing before it hits the Ultra tier, which kind of tells you how powerful it is. So if this model can build interactive UIs and simulate outcomes and plan things out,

04:35

what does that actually mean for the future of professional workflows, for creative work, technical work, all of it? It means AI is moving from being a passive source of information to actively building your tools and running complex operations. Now, shifting gears a bit, we have to talk about the surprise competitor that dropped right before Gemini 3 sucked all the air out of the room. I'm talking about Grok 4.1 from xAI. That was

04:57

such a brilliant countermove. They presented it as vibe-first, you know, focused on a specific tone: fast, witty, a little edgy. But then it also delivered some really impressive intelligence scores. It's that classic EQ versus IQ battle. Precisely. And Grok 4.1 gives you two modes to work with. You've got the standard version, which is super snappy, instant replies. And then there's Grok 4.1 Thinking, which takes its time to process, but it's aiming for a deeper,

05:24

more analytical answer. And that Thinking version actually, for a moment, had its day in the sun. It hit number one on the LMArena leaderboard, scored 1510 Elo, just for a few hours before Gemini came along and bumped it to number two. But it proves this isn't just a personality. It's a serious contender. Oh, it's very capable, especially in creative writing. It's incredibly strong. Only the next-gen GPT model scored higher. But for me, the most important signal wasn't

05:50

the score. It was the massive improvement in just basic reliability. They really needed to fix their accuracy problem, didn't they? They absolutely did. So they slashed the hallucination rate from 12% all the way down to 4.2%. And their internal test showed factual accuracy went up by 66%. And look, I'll be honest, I still wrestle with prompt drift myself when I'm trying to build these complex, multi-layered queries. So a reliability jump like that, it's huge for

06:14

user trust. Yeah. And that subjective quality matters, too. People describe the tone as friendlier, more empathetic, like the model just gets you a little bit better. That vibe is so important for getting people to actually use it. But here's the huge catch, the big limiting factor. Grok 4.1 is completely platform-locked. You can only use it on X, the social platform, or the Grok website and app. You can't plug it into external tools. You can't build it into your company's

06:44

workflow with an API. It is worth mentioning, though, that xAI did release a white paper with some of their training info. That's pretty rare for them, so it shows they're serious about backing up these performance claims. So why is that platform restriction such a serious limiting factor right now, especially when you compare it to Gemini 3, which is obviously being built for maximum utility everywhere? The tradeoff is control.

07:05

Grok's usefulness is limited because it can't integrate into existing business tools or external APIs. Okay, as we move into the wider industry context, I think it's really important that we just pause here and reinforce a point that always gets lost in the hype. That warning from the Google CEO. It really does need to be repeated. With all this excitement about these powerful new agents, we have to remember they are still, quote, prone to errors. You can't just blindly

07:30

trust the output. Critical thinking is still step zero. That being said, the scale of the infrastructure being built to support these agents is just staggering. Just look at the funding. Lambda, an AI cloud company, just raised over one and a half billion dollars with Microsoft heavily involved. And it's specifically to expand the infrastructure needed for this stuff. A billion and a half dollars. That signals a complete arms

07:56

race for computing power. These models need a ton of silicon, a ton of energy to serve up these complex multi-step agent requests at a global scale. Exactly. I mean, when you think about the multi-hop reasoning, the simulations these agents are running, the demand for computation is just exponential compared to a simple text query. Whoa. Yeah, imagine scaling that. Agents simulating outcomes and building UIs for a billion queries a day. $1.5 billion is what

08:24

it takes just to get ready for that future. That's a real moment of wonder, thinking about that scale. And that infrastructure is already being put to work. Microsoft just confirmed this industry-wide pivot to agents at their Ignite 2025 event. They announced Agent 365 and 12 new Copilot agents, all designed for specific professional jobs like marketing or customer service. The

08:46

agent race is officially on. We also got that funny little reminder of the ethical gray areas that open up when these models get so good at manipulation. Right. The story about the DoorDash customer who got a refund by using AI to fake a photo of a raw burger. Right. It's a small thing, but it points to the immediate potential for just, you know, everyday misuse and fraud. And looking ahead, this competition is only going to get more intense. OpenAI is definitely not

09:11

sitting still. Their VP of research mentioned they have a stronger version of their IMO-winning model coming in 2025. And that's their math champion model. So they're pushing pure mathematical reasoning just as hard. So given all of this, the new power, the acknowledged risks, what's the single most important rule for the average user right now to be both effective and safe? Critical thinking remains essential. Always use AI results as a starting point, not the final, unquestioned fact.

09:38

So to wrap up this deep dive, I think the core takeaway for you, for the listener, is this. The whole AI race has fundamentally shifted. We are not arguing anymore about who can write slightly better text. Right. The new battleground is complex, multimodal action planning. We're seeing the rise of true agents, not just smarter chatbots. They can take in all kinds of data and then actively build things for you. And the

10:01

competition is now on two clear paths. On one side, you have pure, raw performance and deep reasoning, which is Gemini 3. On the other, you have specialized tone, speed, and empathy, which is what Grok 4.1 is going for. And both are fighting for your attention. Thank you for joining us on this deep dive into the new era of autonomous AI agents. The changes are happening so fast and the capabilities are becoming so much more powerful, which makes these conversations just

10:26

absolutely necessary. And as you think about the complexity of these new systems, here's a final thought for you to explore. Considering that safety tests are what's slowing down the release of the most powerful versions, how soon do you think the high-reasoning Gemini 3 Deep Think model will actually clear safety? And what new, really complex ethical questions will its multi-hop logic raise when it finally gets out there? Something to think about.

Transcript source: Provided by creator in RSS feed.
