#376 Neil: GPT 5.4 Vs Claude 4.6 Vs Gemini 3.1 The Ultimate AI Battle Is Here

00:00

Just yesterday, GPT-4 changed our world. Yet today it's already history. It really is moving that fast. Welcome to this deep dive. I am deeply grateful you are here with us today. Yeah, thanks for tuning in. Our mission today is highly specific and honestly deeply important. We are unpacking a massive, comprehensive, early-2026 review. We're looking at the three new AI titans currently dominating the landscape. The big three: GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro.

00:34

And we are not just looking at dry, boring numbers today. No, absolutely not. We put these titans through an absolute gauntlet. A real torture test. Exactly. We ran them through five hardcore real-world tests. Testing things you actually do every single day. Right. We had them detect fake financial data. We checked if they can write human-like apologies. Which is surprisingly difficult for a machine. It really is. And we even made them code complex JavaScript games.

01:00

The technological landscape has shifted so dramatically lately. It's truly mind-blowing. These new models can think incredibly deeply right now. They can remember thousands of pages in just seconds. The sheer memory capacity is staggering. Let's start by looking at GPT-5.4 from OpenAI. Think of this model as your ambitious top student. The quantitative analyst. Exactly. It boasts a massive 1 million token memory limit. Which is essentially like memorizing a very thick textbook

01:31

instantly. Right. It costs $2.50 per million tokens. Pretty reasonable for the power. It is. It's incredibly fast and it deeply excels at complex logic. You can easily find it on platforms like OpenRouter right now. Yeah, it's highly accessible. Then we have Claude Opus 4.6 from Anthropic. I like to think of Claude as your dedicated expert in writing. The Director of Communications. Right. And it features something

01:55

called Agent Teams right out of the box. It spawns many workers to handle separate parts of a task. Which is a massive plus for complex workflows. It delegates beautifully. It does cost a bit more, at $5 per million input tokens. Yeah, it's pricier. It has a 200,000-token standard memory limit. Though there is a 1 million token beta right now. True. But it is highly safe and feels remarkably natural. Finally, we have Gemini 3.1 Pro from Google DeepMind. This model is the absolute beast

02:24

of raw performance. The ultimate value workhorse today. It's the cheapest at just $2 per million tokens. And it scored an incredible 94.3% on the GPQA science test. That science score is honestly staggering. It really is. And it is natively multimodal right from the start. Meaning it understands audio and video directly without any text conversion. It skips that translation step entirely. Exactly. Which of these specs actually changes the day-to-day workflow for

02:52

a user? It really comes down to that massive memory size. When an AI can hold a million tokens, everything shifts. You stop breaking your projects into tiny frustrating pieces. You just feed the model everything all at once. The friction is just gone. So bigger memory means fewer bottlenecks for heavy daily tasks. Absolutely. It fundamentally changes how you work. Let's move to our first major evaluation. The hallucination test. Right. We need to talk about creating trustworthy financial

03:20

reports. This is a massive headache for busy professionals today. Hallucination is just when the AI confidently makes things up. And it ruins trust instantly. You ask the model to write a serious report. It gives you strong numbers and detailed links. But then you actually click on those links. And the link is broken or the number is totally fake. To test this, we used a raw PDF file. Dense data. It contained real Southeast Asia stock market data. This was data from 2025.

03:50

We asked them to write a 1,500-word report. And they had to use real Bloomberg or Reuters links. We put their data processing through an absolute grinder here. Let's start with Gemini 3.1 Pro. It had perfect accuracy down to the decimal. It nailed the exact GDP numbers for Vietnam and Thailand. Yeah, it has really great Google search integration built in. But we did notice it was a bit lazy with the layout. It basically just dumped all the links at the very

04:15

end. Claude Opus 4.6 took a different approach. It was incredibly smooth and highly professional. It read like it was written by a real financial expert. But here is the most crucial detail of this entire test. We planted a deliberate fake data trap in that PDF file. Yeah, we actively tried to trick it. We swapped some key export numbers around. Claude actually caught the trap. It immediately warned the user. It stated it

04:41

would use correct real-world data instead. It earns a massive high score for careful self-checking. That is exactly what you want in an analyst. Then we evaluated the report from GPT-5.4. It was highly detailed. It initially looked very impressive. But it hallucinated several extra filler parts entirely from scratch. It just invented things to make the report look longer. Why does GPT feel the need to invent filler content? It seems wired to provide the most exhaustive answer

05:09

possible. It tries to draw extra connections to look more thorough. It prioritizes creating a long response over strict factual accuracy. It just wants to impress you with sheer volume. It prioritizes looking comprehensive over sticking strictly to the facts. It's a classic case of trying way too hard to please. Let's transition to our next fascinating evaluation. The human touch. Being accurate is important, but sounding human is a totally different challenge. Oh, absolutely.

05:38

We asked the models to write an apology letter. It was for a late package sent to a very frustrated customer. We explicitly wanted to avoid classic robot speak here. We all know those phrases. We hate phrases like "in today's fast-paced world." Or the dreaded "not only, but also" structure. Those are massive red flags for AI text. They instantly break the illusion of empathy. Claude was the absolute undeniable winner in this category.

06:04

A perfect 10 out of 10 for human style. It used natural pauses and beautifully short sentences. It genuinely sounded like a truly sorry friend talking to you. GPT-5.4 felt completely different in its emotional approach. It sounded like a legal department trying to avoid a lawsuit. The sentences were much too long. Overly professional. It completely lacked warmth. We gave it a 7 out of 10. Gemini found a somewhat awkward middle ground here. Yeah, an 8 out of 10 for human style.

06:32

It was easy to understand throughout the main paragraphs. But the final ending felt like a canned corporate template. Is it harder for AI to mimic empathy than to do math? Math just follows a strict set of logical, unbreakable rules. Empathy is subtle. It's full of strange human contradictions. AI struggles heavily when there is no objective right answer. What feels warm to you might feel deeply condescending to me. Yeah. Math has strict rules, but human empathy is incredibly messy.

07:05

Messy, subjective, and highly dependent on cultural context. We're going to pause for just a brief moment. Be right back. This edition of The Deep Dive is brought to you by our premium sponsors. Support for our show helps us continue bringing you these in-depth analytical reviews of the latest technology. Check the show notes for exclusive listener discounts. And we are back. Let's look at analyzing deep data insights. The big Excel test. We wanted to see how they handle heavy

07:28

unstructured information. So we uploaded a massive 50,000-row Excel spreadsheet. It contained raw sales data from a local retail shop. We asked them to find strange hidden shopping patterns. We wanted insights that a human analyst might never notice. Each model looks at raw numbers in its own special way. It's literally like having three different experts in your office. Let's discuss Gemini 3.1 Pro first. It leveraged that massive 1 million token memory perfectly. Whoa.

08:00

Imagine scaling to a billion queries. It read 50,000 rows in exactly one second. The sheer processing speed is genuinely hard to comprehend. And it actually found a completely hidden shelf placement trend. Yeah, it noticed people bought umbrellas next to sunscreen. Because they were prepping for extreme weather shifts. A human would rarely cross-reference those two random items. Claude Opus 4.6 took a very different analytical path. It acts much more like a trained consumer

08:28

psychologist. It focuses heavily on the feelings behind the raw numbers. Right. It looks for the reasons behind the customer trends. It explains the why instead of just listing cold percentages. However, its smaller standard memory is a real liability here. It actually crashes when processing these massive, messy files. It just can't hold all that context at once. GPT-5.4 firmly establishes itself as the ultimate math expert. It writes Python code directly inside the chat window.

08:55

It builds beautiful, customizable charts in real time for you. It's incredible for data visualization. Does Claude's psychological approach make up for its smaller memory? It absolutely does if you have a dedicated marketing team. Understanding the emotional drivers behind a purchase is deeply valuable. Marketers need to connect with human feelings, not just raw numbers. You just have to feed it smaller chunks of data. Smaller data sets get deeper emotional analysis, which marketers

09:22

desperately need. Quality of insight often beats sheer volume of data. Let's shift our focus to coding complex software games. Building a game from scratch is a massive technical challenge. It requires intricate logic and deep structural understanding. We asked them to build a JavaScript roguelite game. Some prompts asked for a cyberpunk snake game variant. It required dividing the code into very specific functional parts. We needed a game logic section, a UI section, and

09:49

an input handler. This tests how well the AI organizes a multi-file project. Claude Opus 4.6 was simply a beast in this arena. It executed the complex coding task flawlessly on the first try. It even explained exactly why it organized the files that way. It handles big apps well due to a massive output limit. It prints the whole game without stopping halfway through. GPT-5.4 was also a very helpful coding assistant. It provided the full code. And suggested really

10:18

cool sound effects. But it used an outdated save score function in the code. Yeah, that function actually breaks in newer web browsers today. You had to check its code very carefully for deprecations. Gemini 3.1 Pro was easily the fastest coder of the group. It generated the game code much quicker than the others. But we noticed the enemy logic was honestly pretty stupid. The enemies just kept walking straight into blank walls. They couldn't figure out basic pathfinding.

10:44

Why does Gemini struggle with game logic if it's so smart at science? Science often involves processing known facts and established formulas. Game logic requires understanding fluid, dynamic spatial relationships. The AI has to predict how moving parts interact constantly. Gemini prioritized speed over thinking through those complex spatial interactions. Speed sometimes sacrifices the intricate spatial logic a game demands. Right, and you end up with enemies stuck in corners.
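For reference, the "basic pathfinding" those enemies were missing can be as small as a breadth-first search over the tile grid. What follows is a minimal sketch under stated assumptions, not code produced by any of the three models: it is written in Python rather than the JavaScript used in the test, and the grid layout, the "#" wall marker, and the `next_step` function name are all hypothetical.

```python
from collections import deque


def next_step(grid, start, goal):
    """Return the next cell on a shortest path from start to goal.

    grid is a list of equal-length strings where "#" marks a wall
    (an assumed convention). Cells are (row, col) tuples. Returns None
    when the goal is unreachable or the enemy is already on it.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}  # maps each visited cell to its predecessor
    queue = deque([start])

    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and nxt not in came_from:
                came_from[nxt] = cell
                queue.append(nxt)

    if start == goal or goal not in came_from:
        return None

    # Walk backwards from the goal until we reach the cell adjacent to start.
    cell = goal
    while came_from[cell] != start:
        cell = came_from[cell]
    return cell
```

Calling `next_step` once per enemy per game tick moves each enemy around walls instead of into them; a real game would likely cache paths or use A* on larger maps.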

11:15

Let's move to our final rigorous evaluation today. The super prompt test. This evaluates following strict rules without getting confused or lost. Look, I have a vulnerable admission to make right here. Go ahead. I still wrestle with prompt drift myself when I give the AI too many rules. Well, prompt drift is when the AI slowly forgets your original instructions. It happens to all of us constantly. You give it five rules and it completely

11:37

ignores the last one. To test this fairly, we created a brilliantly difficult super prompt. We asked the three models to write a movie review. But we gave them four very strict formatting rules to follow. First, they could not use the word "great" anywhere. Second, paragraphs had to start with the letters C-I-N-E-M-A. Third, they had to mention the director exactly three times. Not two, not four. Finally, they had to

12:00

include a three-movie comparison table. Which requires planning the entire response before typing a single word. Claude Opus 4.6 was 100% obedient here. It followed every single rule with perfect, careful execution. It planned the acronym paragraphs flawlessly from the very start. GPT-5.4 struggled significantly with this complex constraint list. It forgot the acronym rule by the fourth paragraph entirely. It got way too focused on the narrative story it was telling.
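The four rules above are mechanical enough to check with a script rather than by eye. Here is a minimal sketch of such a checker, assuming the review arrives as plain text with paragraphs separated by blank lines and the comparison table written as a pipe-delimited table with a header row; the `check_review` name, the result keys, and those formatting assumptions are illustrative and not part of the original test.

```python
import re


def check_review(text: str, director: str, acronym: str = "CINEMA") -> dict:
    """Check a movie review against the four super-prompt rules."""
    # Split into blocks on blank lines; blocks starting with "|" are table blocks.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    paragraphs = [b for b in blocks if not b.startswith("|")]
    rows = [ln.strip() for b in blocks if b.startswith("|") for ln in b.splitlines()]

    # Drop separator rows such as "| --- | --- |", then drop the header row.
    data_rows = [r for r in rows if not re.fullmatch(r"[|\s:-]+", r)][1:]

    first_letters = "".join(p[0].upper() for p in paragraphs)
    mentions = re.findall(re.escape(director), text, flags=re.IGNORECASE)

    return {
        # Rule 1: the word "great" never appears (whole word, any case).
        "no_great": re.search(r"\bgreat\b", text, flags=re.IGNORECASE) is None,
        # Rule 2: paragraph initials spell the acronym, one paragraph per letter.
        "acronym": first_letters == acronym,
        # Rule 3: the director is mentioned exactly three times.
        "director_three_times": len(mentions) == 3,
        # Rule 4: the table compares exactly three movies.
        "three_movie_table": len(data_rows) == 3,
    }
```

Running every model's output through the same checker makes prompt drift visible as a failed key instead of an impression, for example a `False` under `"acronym"` when the fourth paragraph drops the pattern.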

12:29

It sacrificed formatting rules to write a more compelling review. Gemini followed the basic story rules reasonably well overall. But the final comparison table was incredibly lazy. It was extremely simple and lacked any deep comparative information. What's the best way to avoid prompt drift entirely? You really need to stop sending massive walls of text. It overwhelms the model's attention mechanism. You should logically break your instructions down into individual, sequential

12:57

steps. Guide it through the process, one clear rule at a time. Just break long instructions into smaller, bite-sized steps. That is the most reliable way to guarantee consistent performance. We've covered a massive amount of ground today. That's a lot to process. Let's synthesize this into a big idea recap for you. We need to understand what this all means for your workflow. If you are a writer or a busy programmer needing perfection, choose Claude. It provides that essential human

13:24

touch and flawless, careful logic. It's definitely worth spending the extra money for that reliability. If you're a student or a small business... Look at Gemini. It is the ultimate value workhorse of early 2026. You can process huge files and videos incredibly cheaply. It handles massive context windows faster than anything else available. And if you're doing complex math or creating charts, GPT-5.4 remains your classic, highly

13:49

reliable go-to tool. Its Python integration is still wonderfully smooth and deeply technical. It's the analytical engine you want for heavy data lifting. I highly recommend you go out and try these models yourself. You can test small versions for free on their respective websites. Or you can use OpenRouter to test the premium versions side by side. AI changes daily, and you desperately need hands-on experience. Reading

14:11

about these models is simply never enough. You have to feel how they respond to your specific, unique workflows. You need to see where they shine and where they break. I want to leave you with a final provocative thought. Think deeply about the technological trajectory we're on right now. If Claude is already writing apologies that feel more sincere than a human's. And GPT is creating complex charts faster than a trained

14:37

analyst. At what point do we stop using AI to assist our thinking and accidentally start letting it replace our empathy? Thank you for joining us on this deep dive. Take care.

Transcript source: Provided by creator in RSS feed
