#350 Neil: GPT 5.3 Codex Vs Opus 4.6 Review Which Model Writes Better Code

00:00

I was sitting in my car before coming in just, you know, thinking about memory, not our kind where you forget your keys, but digital memory. It's so fragile. You close a browser tab and poof, it's gone unless you explicitly saved it somewhere. It just vaporizes. But then I was reading about this time capsule test in the research for today. An AI builds this whole simulated computer inside a browser and it remembers everything.

00:24

A half-finished calculation, a sticky note, even after the whole session was, you know, nuked. And that's the thing that gets me. It wasn't hard-coded. A human didn't tell it, hey, save this variable here. The model just figured out how to save its own state. It understood that's what a computer's supposed to do. It basically invented its own long-term memory just to survive a reboot. It's bordering on a survival instinct. Welcome back to the deep

00:50

dive. Today, we are unpacking something that's been making a lot of noise and, frankly, creating a bit of anxiety in developer circles. We're looking at GPT 5.3 Codex, and I want to set the tone right away. We're not here for the hype cycle. We need to be calm, measured, and just figure out if this is another incremental update or if this actually changes how software gets built. It's the right question to ask. And the context here is everything. GPT 5.3 dropped

01:18

on the exact same day as Opus 4.6. And if you follow this space, you know Opus is the flashy one, the one that makes beautiful charts and writes poetry. But the rumors around 5.3 are different. People are whispering that this model helped create itself. Now, that sounds like sci-fi marketing, I know. But when you dig into the performance metrics, things get a little weird. Weird how? Usually the numbers just go

01:37

up. Weird because it seems to be breaking that cardinal rule of AI development from the last decade, which has always been bigger is better. So our mission today is to cut through all that. We're not reading the press release. We're gonna walk through a gauntlet of what the source calls very hard tests. They were run by a reviewer who just locked themselves in a room, stopped reading the news, and just started coding with

01:59

it. We're talking everything from browser-based operating systems to simulating 3D printing physics. I appreciate that approach. A hands-on, real-world test. So let's start with the architecture, because this is where the philosophy of the model really comes through. The research points to a more-with-less kind of approach. Usually a smarter model means a huge computational tax. It's a gas guzzler. This seems to be the opposite.

02:22

Exactly. And there's this one data point in the source material that just stops you in your tracks. GPT 5.3 uses fewer tokens to solve problems than the models that came before it. And for anyone listening who isn't deep in this stuff, a token is basically a word or part of a word. It's the AI's building block for language. Usually to get smarter, you just throw more tokens at the problem. You ramble until you find the answer. Right, the old scattergun approach. Just keep

02:47

guessing until something works. Precisely. But GPT 5.3 is succinct. It uses fewer tokens, but gets much higher accuracy. There's this chart in the report. It shows the model hitting 77.3% accuracy in these really complex terminal tests. And just to put that number in perspective, where was the last version, 5.2? It was sitting at 64.0%. Wow. That is a massive gap, about 13 percentage points. And we can't just glaze over that number. In AI development, getting 1 or 2 points is a huge win. A 13-point jump

03:17

in one generation is, well, almost unheard of. That gap is where all the tricky logic lives. It's the difference between an AI that gives up when the code gets tough and one that actually grinds through the problem. So it's not just regurgitating answers it saw on Stack Overflow during training? No, and that is the critical distinction here. The source talks about a big

03:35

behavioral shift. Unlike these one-shot models where you ask a question, get an answer, and just hope it works, GPT 5.3 loves to iterate. It plans, it builds a little piece, tests it, sees that it broke something, and then it fixes it. It's acting less like a search engine and more like a junior engineer who knows how to debug their own work. That brings me to a question then. In this context, does efficiency actually equal intelligence? Yes. Because it's solving

04:05

harder logic with less noise. Think of it like a writer. A novice uses 50 words to describe a sunset. A master uses three, but they're the exact right three. That's what this model is doing, but with code. It's throwing reasoning at the problem, not just volume. That's a great analogy, the density of the thought. Let's move to that first big stress test in the source. This is the browser operating system test. The reviewer asked it to build a functional OS right

04:28

inside a Chrome tab. Start menu, windows, apps, the whole thing. Right. And this is where we get to what the source calls the ugly truth of 5.3. The first attempt? Visually, it was a disaster. It had a 1999 aesthetic: square icons, horrible gray backgrounds, Times New Roman font. If you showed it to a client, you would be fired immediately. I actually love that detail. It didn't try to impress with flashy visuals. It reminds me of some backend engineers I worked with. They build

04:55

these incredible databases. But the UI looks like a spreadsheet from 1995. That is exactly the vibe. But underneath that ugly skin, the functional brilliance was just... terrifyingly good. The reviewer opens the calculator app in this fake OS, types 77 times 7, and it just spits out 539. It worked. Then they open the notes app, type hello world, close it, reopen it, the text is still there. That's that local storage bit we mentioned. Can we unpack why that's so

05:23

hard? To a user, saving something seems so basic. It feels basic to us, yeah. But for an AI-generated simulation in a browser, it's really complex. The AI has to understand the concept of state. It has to write code that uses the browser's own storage to cache data, so when the virtual app closes, the data doesn't just vanish. Most AI models forget the context the second you change tasks. This one built its own persistence layer. And that leads right back to that time capsule

05:51

feature. The user asked for a way to save the whole desktop state. Right. And the AI just coded a system to snapshot every open window, its exact coordinates on the screen, the data inside it, all so you could restore the session later. That means it understood the entire hierarchy of the application it just built. It wasn't just pasting code snippets. It understood the architecture of the machine it created. It built a save game feature for an operating system it just invented.
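To make the "time capsule" idea concrete, here is a minimal sketch of snapshotting and restoring a desktop. It is not the code the model produced, which the source does not show. The `Window` fields and the function names are hypothetical, and the sketch is written in Python for readability; in a real browser build the serialized string would presumably be written to the browser's storage instead of a variable.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Window:
    """One open window in the simulated desktop (hypothetical shape)."""
    app: str
    x: int
    y: int
    width: int
    height: int
    data: dict = field(default_factory=dict)


def snapshot(windows: list[Window]) -> str:
    """Serialize every open window (position, size, contents) to a JSON string."""
    return json.dumps([asdict(w) for w in windows])


def restore(blob: str) -> list[Window]:
    """Rebuild the list of windows from a snapshot string."""
    return [Window(**entry) for entry in json.loads(blob)]


desktop = [
    Window("calculator", 40, 60, 240, 320, {"display": "539"}),
    Window("notes", 320, 80, 400, 300, {"text": "hello world"}),
]
saved = snapshot(desktop)   # the string that would be persisted somewhere durable
restored = restore(saved)   # after a "reboot", the same windows come back
```

The point of the sketch is that a snapshot only works if the program has one consistent description of its own state, which is what the hosts mean by the model understanding the hierarchy of what it built.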

06:16

Yeah. But the reviewer didn't just leave it ugly, right? No. And this is that iteration part we were talking about. The reviewer just said, the logic is great, but the design is very ugly. The model didn't argue. It didn't break. It just went back in and applied a dark mode, a sunset wallpaper, and a custom right-click menu. So if you look at that sequence, the ugly but working draft, then the polish, does it imply that logic comes before beauty for this model? Precisely.

06:46

It prioritizes function, then aesthetic. And if you're a developer, that is exactly how you want it to think. Make it run, then make it pretty. That function-over-form distinction seems to carry into the next test, which, honestly, I found the most mind-bending part of the whole review: simulating the physical world. The reviewer asked it to build a 3D printer simulation. Yeah, this was a real moment of wonder for me when I read this. And we're not talking about drawing

07:09

a picture of a printer. The AI built a full CoreXY printer simulation. OK, for the non-engineers listening, and for me, what exactly is CoreXY? It's a specific kind of system in high-end 3D printers. Instead of one motor for X and one for Y, you have two motors working together with belts to move the print head. It involves some non-obvious kinematics, because every move is a combination of both motors. If you get the math wrong, the print head just crashes into the wall.
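For readers who want to see the math, here is a small sketch of the textbook CoreXY mapping between head position and the two motors. It is an illustration, not the simulation from the review. In the commonly described arrangement the mapping is a sum and a difference of the head coordinates; sign conventions differ between machines and firmware, and the `STEPS_PER_MM` value and function names below are assumptions.

```python
STEPS_PER_MM = 80  # assumed resolution, chosen only for illustration


def xy_to_motors(x_mm: float, y_mm: float) -> tuple[int, int]:
    """Convert a head position in mm to step counts for motors A and B."""
    a = round((x_mm + y_mm) * STEPS_PER_MM)  # motor A follows x + y
    b = round((x_mm - y_mm) * STEPS_PER_MM)  # motor B follows x - y
    return a, b


def motors_to_xy(a: int, b: int) -> tuple[float, float]:
    """Invert the mapping: recover the head position from step counts."""
    x_mm = (a + b) / (2 * STEPS_PER_MM)
    y_mm = (a - b) / (2 * STEPS_PER_MM)
    return x_mm, y_mm


home = xy_to_motors(0, 0)       # head at the origin: both motors at zero
pure_x = xy_to_motors(10, 0)    # both motors turn the same way
pure_y = xy_to_motors(0, 10)    # motors turn in opposite directions
```

A pure X move needs both motors turning together and a pure Y move needs them turning against each other, so an error in either sign sends the head in the wrong direction.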

07:33

And the source said it wasn't just animating a box moving around, it was actually calculating motor positions. A equals zero, B equals zero. It was simulating the stepper motors step by step, but the Benchy test is where it just gets wild. The Benchy. That's the little tugboat everyone prints to test their machines, right? The hello world of 3D printing. Exactly. So the reviewer uploads a real 3D model file, an STL file of the Benchy, and they ask the AI to write a slicer.
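As a rough illustration of the core step in slicing, the sketch below cuts one mesh triangle with a horizontal plane and turns the resulting segment into two G-code moves. It is a simplified example, not the slicer from the review. A real slicer also chains segments into closed loops and adds extrusion, feed rates, and infill, all of which are omitted here. The function names are hypothetical; `G0` (rapid move) and `G1` (linear move) are standard G-code commands.

```python
def slice_triangle(tri, z):
    """Return the 2D segment where a triangle crosses the plane at height z.

    `tri` is three (x, y, z) vertices. Returns None if the plane misses the
    triangle. Vertices lying exactly on the plane are ignored for simplicity.
    """
    points = []
    for i in range(3):
        (x1, y1, z1), (x2, y2, z2) = tri[i], tri[(i + 1) % 3]
        if (z1 - z) * (z2 - z) < 0:      # this edge strictly crosses the plane
            t = (z - z1) / (z2 - z1)     # fraction of the way along the edge
            points.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return tuple(points) if len(points) == 2 else None


def segment_to_gcode(segment, z):
    """Emit a travel move to the segment start and a linear move to its end."""
    (x1, y1), (x2, y2) = segment
    return [
        f"G0 X{x1:.2f} Y{y1:.2f} Z{z:.2f}",
        f"G1 X{x2:.2f} Y{y2:.2f}",
    ]


triangle = [(0, 0, 0), (10, 0, 0), (0, 10, 10)]
segment = slice_triangle(triangle, 5)
gcode = segment_to_gcode(segment, 5)
```

Repeating this for every triangle at every layer height gives the outline of each layer, which is the geometry a printer, or a simulated one, then traces.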

08:03

Now, a slicer is a serious piece of software. It takes a 3D object, cuts it into thousands of thin layers, and then generates G-code, which is basically coordinate instructions for the printer. Wait, wait. So the AI wrote the software to read the 3D file, sliced it into layers, and then simulated the printing of those layers on the screen? Yes. It parsed the geometry. It didn't

08:22

just look up how to draw a boat. It calculated the actual path the nozzle would need to take to physically create that object in a virtual space. That feels like a massive leap. So is it truly simulating or is it just copying slicer code it found somewhere else? That's the skeptic's question and it's a good one. But the source argues it's true simulation because of how it handled that specific, unique file the user uploaded. You can't just copy-paste that kind of real-time

08:48

interaction. It has to understand the coordinate system, the x, y, and z axes, and how they relate to the math of the motors. It understands physical space. That's a lot to process. It suggests a kind of spatial intelligence we didn't think these text models had. Let's pivot to something a bit more fun, but equally revealing. The gaming tests. The source mentioned a flight combat simulator. This part was actually pretty funny. The reviewer notes that GPT 5.3 has a bit of an attitude.

09:16

An attitude? Like a personality? Yeah. The model is slow. It thinks a lot. And during the flight sim build, the reviewer got impatient, you know, like we all do, and typed, just give me the file. And the AI actually snapped back at him. It said something like, I am verifying the physics. Wait a moment. It told the human to chill out. Basically. It was prioritizing the integrity of the code over the user's impatience. That's a very senior engineer kind of move. Do you want it fast or

09:44

do you want it right? And when the game finally loaded, again, visually, it looked like paper triangles. Ugly. But the radar worked. The enemies tracked you. When you hit a plane, smoke particles came out. The logic was solid. And then they moved to C++. Now, I have to make a vulnerable admission here. I've dabbled in code. But C++ memory management? It keeps me up at night. It is just notoriously unforgiving. Oh, it's the final boss of programming languages for a lot

10:11

of people. It forces you to manage the computer's memory by hand. So the reviewer asks for a skateboarding game in C++. Specifically, they wanted grinding logic. Grinding seems incredibly difficult to code. You have to know the exact moment the board intersects a rail and then lock it on while keeping momentum. Exactly. It's a collision detection nightmare. And the AI? It nailed it. The source says the reviewer could jump, land on the rails,

10:38

slide, and jump off. It even built a combo system where the score multiplier would reset if you fell. So going back to that attitude you mentioned, the delay, the snippiness, does that attitude actually improve the code? Yes, because that delay is the model reasoning. It's running through the logic. If it had rushed to give a good enough answer to make the user happy, the skateboard would have clipped through the rail or the game

10:59

would have just crashed. That's a trade-off I think most developers would take any day, quality over speed. Now before we get into the project management side of this, which might be the most practical part, we're going to take a very quick break. Okay, we're back. We've talked about operating systems, 3D printers, skateboarding physics, but let's be real. Most listeners aren't building physics engines every day. They're building apps.

11:23

They're managing projects. And this is where GPT 5.3 seems to shift from just being a coder to being a manager. This is the plan mode feature. The reviewer tested it with a game called Neon Arena, a first-person shooter. But instead of just spitting out a wall of code immediately, the AI paused. It started interviewing the user. Interviewing? Like gathering requirements? Exactly. It asks things like, what kind of enemies do you want? Do you want this in one massive file

11:49

or split into multiple files? That question alone, one file or multiple, shows a level of seniority. A junior dev just jams everything into one script. A senior dev modularizes because they know it's easier to maintain later. And when the reviewer said multiple files, what happened? It created a professional file structure. player.py for movement, enemy.py for the AI, main.py to run it all. It organized the chaos before it started. It wasn't just a chatbot. It was acting like

12:18

a tech lead, setting up a project. That structure is everything. But there's another side to this management role, interpreting the vision: the Stevie Slapis test. I love this name, by the way. Stevie Slapis. It sounds like a bad cartoon character. So the reviewer drew this deliberately ugly wireframe on a piece of paper for a fake portfolio website. Just boxes labeled skills and projects. They took a photo, uploaded it. And the prompt was just... Make this beautiful.

12:45

Pretty much. Make it beautiful. Add a wow factor. And the AI was incredibly faithful to the drawing. It put the skills section exactly where the ugly sketch had it. It didn't try to fix the layout, but it polished the execution. It added these nice hover glow effects. It made the whole thing responsive for mobile. So it took a napkin sketch and turned it into a real website. This raises a big question for me. If it can plan the architecture like a tech lead and execute code like a developer.

13:13

Is it replacing the coder or the manager? It's acting as both, creating a bridge between them. It structures the chaos of the code, but it still needs the human to provide that ugly sketch, the initial vision. It interprets our intent. The AI didn't know why Stevie Slapis needed a portfolio, but it knew how to build one. Okay, let's bring this all together. The source material ends with a direct comparison between GPT 5.3

13:38

and that other big release, Opus 4.6. If you're a listener trying to decide which one to use, how does it break down? It's a classic trade-off. It's almost like hiring two different kinds of people. Opus 4.6 is the client-ready model. It gives you beautiful visuals instantly. It's polite. It's expensive. If you need a demo for a CEO in five minutes and it has to look pretty, you use Opus. And GPT 5.3? GPT 5.3 is the engine

14:01

room model. It gives you ugly first drafts. It uses fewer tokens, so it's cheaper to run for these long, complex sessions. But the deep logic, the C++ memory management, the physics, the iterative problem solving, it's just superior. It's best for real software, where the plumbing matters more than the paint. So the big idea recap here seems to be that we need to adjust our expectations. The source concludes that GPT 5.3 is not a magic wand. Right. It's a coworker. And just like

14:31

any coworker, you have to guide it. You have to say, hey, that UI is terrible. Fix it. Or that physics is slightly off. But the difference is it works incredibly hard. It doesn't complain, well, except for that one time. And it iterates instantly. That shift from magic wand to coworker feels really important. We spent the last few years treating AI like a vending machine. You put a prompt in, you get a product out. If it's bad, you kick the machine. Exactly. And this model

14:54

forces you to treat it more like a partner. It's "let's build this together." You're the architect. It's the builder. It's closing the gap between having an idea, "I want a browser OS," and actually holding that product. It's messy, it requires feedback, but the barrier to entry is just crumbling. It really is. And for the first time, it feels like the AI is actually meeting us halfway. So here's our final thought for you to take away

15:18

today. If this model can remember the state of a simulated computer without being told how, and if it can slice a 3D model by understanding geometry, what happens when we ask it to optimize not just the code, but the systems around the code? What happens when the coworker starts suggesting changes to the business logic itself, not just the syntax? That is the billion-dollar question: what happens when the tool starts suggesting what to build, not just how to build it. We'd love to hear what

15:47

you think. Are you ready for a coworker that argues with you about facts? Thanks for listening to this deep dive. See you in the next one. Take care.

Transcript source: Provided by creator in RSS feed