It's December 2nd, 2025. You walk into the OpenAI headquarters in San Francisco. Now, usually, this is the time for holiday parties, right? You're expecting champagne corks popping, maybe some high fives. They're the kings of the hill. You would certainly think so, yeah. But instead of a party, the mood is tense. It's frantic. It is officially a code red. A code red. And
why? Because the era of the bodybuilder AI, you know, those massive hulking models that rely on pure size to solve problems, is officially over and the era of the gymnast has begun. That's a vivid way to put it, but honestly, it's not far off. The raw strength we've all been obsessing over, it's still there, sure. But suddenly, the game isn't just about strength anymore. It's about precision. It's about agility. And if these leaks are real, the whole board has been reset.
Welcome to the Deep Dive. It's Thursday, January 22, 2026. Today, we're unpacking something that feels like a genuine pivot point in AI history. We're looking at the leaked details of OpenAI's GPT-5.3, which has the internal and, let's be honest, slightly hilarious codename Garlic. Garlic. It's definitely a choice. A lot earthier than, you know, Orion or Gemini, all that celestial stuff we usually get. It really is. But don't
let the name fool you. Our mission today is to figure out why this specific model caused such a panic, this code red at OpenAI. We're going to look at the engineering breakthroughs, specifically around memory and this new self-checking thing. And we'll see how it stacks up against the current heavyweights, Google's Gemini 3 and Anthropic's Claude Opus 4.5. This is going to be fun because the specs here aren't just "number go up." Right. It's not just we added
more zeros. This is a whole philosophical shift in how we build this stuff. So let's start with that context. We mentioned the code red. This leak came from The Information back in December. Mark Chen, the chief research officer at OpenAI, reportedly shared these details. But help me understand the panic. I mean, GPT-5.2 was already out. It was a good model. Why the fire drill? It all comes down to momentum. In tech, if you aren't leading, you're basically dying. And frankly,
OpenAI was losing ground. Losing ground to who? To everyone. I mean, look at the last six months. Google drops Gemini 3. And Gemini 3 just dominates anything multimodal: video, messy real-world data, images. It was the king of that stuff. Right. And then on the other side, you had Anthropic with Claude Opus 4.5. And let's be honest, if you were writing code in late 2025, you were probably using Claude. I can vouch for that. I switched to Claude for all my scripting. It just felt
less, I don't know, robotic. Exactly. So internally, GPT-5.2 was seen as a Band-Aid. It kept them in the conversation, but it wasn't thriving. They knew that shipping "GPT-5, but a little bit bigger" just wasn't going to cut it. They needed a response that fundamentally changed the metric of success. And that response is Garlic. You used this analogy before we started: the bodybuilder versus the gymnast. I want to double-click on
that because for, what, the last five years, the headline has always been parameter count. Yeah. Trillions of parameters. Bigger clusters, more GPUs. Yeah. If you weren't building bigger, you weren't trying. And that's the bodybuilder approach. You solve problems by just adding more muscle. The model doesn't get physics? Add a trillion more parameters. Can't write a sonnet? Add another trillion. Total brute force. But Garlic represents a shift toward density. It's
the gymnast. It's physically smaller, architecturally more compact. But because of that, it can do these complex reasoning maneuvers, these mental backflips, if you will, that the massive, clumsy bodybuilder just can't do. Okay, but I have to play the skeptic here. Making it smaller but smarter sounds like marketing fluff. It sounds like the holy grail everyone promises, but nobody delivers. Usually when you make a neural net smaller, it gets dumber. How did they actually
do that? It wasn't a straight line. It comes down to a merger of two different research tracks inside OpenAI. They had Shallotpeat, which was just their standard stability update track. Kind of boring. But then they had this experimental branch called Garlic. And the big breakthrough is a technique called EPTE. EPTE? Enhanced Pre-Training Efficiency. Okay, break that down for me. No jargon. What is EPTE actually doing that's different from GPT-4 or 5? Think of it like
a garden. Or, you know, better yet, the human brain. When a baby is born, their brain just has this explosion of connections, synapses everywhere. It's a mess. Right. As we grow up, we get smarter not by adding connections, but by pruning them. We cut out the noise so the signal can travel faster. So we get smarter by deleting parts of our brain? In a way, yes. We delete the inefficiency. Traditional AI training is like letting a forest grow wild: connections everywhere. Vines, weeds,
you name it. It's huge, but it's messy. EPTE introduces a pruning phase during the training. The model actively discards redundant neural pathways. So it's Marie Kondo-ing its own brain. "This neuron does not spark joy." Basically. It's cutting out the noise. So the result is you get GPT-6-level reasoning capabilities. So higher logic scores. Exactly. Higher logic scores. Because you've pruned away the inefficiency, it runs on a faster, smaller architecture. It's compressing
thought. That's fascinating. It really does challenge that whole "more is better" assumption. But I have to ask, is efficiency really as exciting as raw power? As a user, do I care if the model is efficient? I usually get excited about the next biggest thing. Absolutely. And here's why. Latency and cost. Think of it like the difference between a muscle car and a Formula One racer. The muscle car has raw power, makes a ton of noise, and burns a ton of gas. But the F1 car, it has precision
engineering. It turns on a dime. In this case, precision means the model is cheaper to run and faster to answer. And when intelligence gets cheap and fast, you can use it in ways you never could with the big, slow bodybuilder. So it's shifting from raw horsepower to agility, and that changes what it's useful for. Exactly. It moves from a consultant you hire once to a worker
that lives in your computer. Let's move to the specs, because while the philosophy is cool, the actual numbers here are, well, they're staggering. And I want to make sure we really get what they mean day to day. Let's do it. So, memory first. The context window. Garlic is reportedly shipping with a 400,000-token context window. Now, just to play devil's advocate here, Gemini 3 has 2 million tokens. So on paper, Garlic looks smaller. Why
should I be impressed by 400k? On paper, yeah, Gemini is bigger, but this is where the nuance really matters. If you've ever used Gemini's huge context window, like you dump a whole novel in there and ask about a character from chapter three, you might have noticed middle-of-the-context loss. Right. I have seen this. It remembers the very beginning and the very end of your prompt, but it gets hazy on everything in the middle. It's like it skimmed the book. Exactly. It's
the needle-in-a-haystack problem. Yeah. Gemini has a huge stomach, but imperfect digestion. Garlic uses a new attention mechanism that reportedly gives it perfect recall across the whole 400,000 tokens. It doesn't just store the data. It actually remembers it. So I could feed it literally my entire company's documentation. Every PDF, every Slack policy. Every messy Confluence page. And it wouldn't just have it. It would
know it. It would know it. You could ask, "What was that compliance rule we changed three years ago from that one Tuesday memo?" And it just pulls it instantly. No skimming. That's the difference between a hard drive and a brain. Precisely. But here's the part that actually stopped me in my tracks. It's not just how much it can take in, it's how much it can put out. The output limit. This is the big one. This is that wonder moment. 128,000 tokens. In a single response.
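To make the needle-in-a-haystack idea above concrete, here is a minimal sketch of how such a recall test is commonly structured: hide one fact at different depths in a long prompt and check whether the answer comes back. Everything in it is an illustrative assumption, not something from the leak: `ask_model` is a hypothetical callable standing in for any model API, and `forgetful_model` is a toy stand-in that only "reads" the first and last 20% of the prompt, to mimic the middle-of-the-context loss described in the conversation.

```python
from typing import Callable, Dict, List

NEEDLE = "The secret code is 7421."
ANSWER = "7421"
FILLER = "The sky was grey that day."


def build_haystack(n_sentences: int, depth: float) -> str:
    """Place NEEDLE at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), NEEDLE)
    return " ".join(sentences)


def recall_by_depth(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply out
    depths: List[float],
    n_sentences: int = 1000,
) -> Dict[float, bool]:
    """Return, for each depth, whether the model's reply contains the hidden answer."""
    results = {}
    for depth in depths:
        prompt = build_haystack(n_sentences, depth) + "\n\nQuestion: What is the secret code?"
        results[depth] = ANSWER in ask_model(prompt)
    return results


def forgetful_model(prompt: str) -> str:
    """Toy stand-in (assumption): sees only the first and last 20% of the prompt."""
    cut = len(prompt) // 5
    visible = prompt[:cut] + prompt[-cut:]
    marker = "The secret code is "
    idx = visible.find(marker)
    if idx == -1:
        return "I don't know."
    start = idx + len(marker)
    return visible[start:start + len(ANSWER)]


results = recall_by_depth(forgetful_model, depths=[0.0, 0.5, 1.0])
print(results)
```

With the toy model, the needle is found at the start and end but missed in the middle. A model with the recall claimed for Garlic would be expected to return the answer at every depth; swapping a real API call in for `forgetful_model` is how one would check that claim.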
It's huge. I mean, just for context, for everyone listening right now, we're all used to the model just stopping. You ask it to write code. It gets halfway through a function and it just cuts off. And you have to type "continue." And then it repeats the last line or it forgets the indentation. It's so fragmented. It feels like pulling teeth sometimes. It totally breaks your flow state. But 128,000 tokens? That's not a response. That's a novel. That's an entire software library. It
is. And I want you to really just pause and imagine the experience of that. Imagine sitting at your terminal. You ask for a complex legal brief or maybe a full backend architecture for a new app, not just the outline, the actual code. You hit enter. And instead of a summary or a little piece of it, you just watch the cursor move. And it keeps moving. And it writes the files. It writes the documentation. It writes the test suite. And it just doesn't stop until the thought is
complete. One coherent stream of creation. That is, wow. It's almost overwhelming to think about. It feels like we're moving from just chatting with a bot to manufacturing with a machine. That's the shift. No more chunking. No more stitching things together. You're not pasting snippets into VS Code anymore. You're reviewing a finished product. But does infinite memory actually change how we work? Or does it just change how much junk we dump into the chat box? I worry we'll
just get lazier. I think it fundamentally changes the human's role. It stops us from being librarians of data, you know, constantly fetching context, pasting files, reminding the bot what we're talking about, and lets us be architects of ideas. You design the structure, the model pours all the concrete. So the output limit frees us from being data fetchers. To be designers instead. Spot on. It shifts the cognitive load from memory to strategy. That distinction, architect versus
librarian, that really lands. Because the other feature Garlic supposedly has leans heavily into that architect role. We're talking about native agents. Yes. This is another area where that code red was necessary. Because right now, everyone tries to make AI agents "go do this task." And, well, usually they fail. They get stuck in loops. Or they hallucinate a file path that doesn't exist and then crash. Right. But Garlic isn't pretending to be an agent. The tool use
is native. It understands file systems. It can run tests. It can debug like a developer. It treats APIs not as some external thing it has to awkwardly reach for, but as part of its own cognitive process. So it's the difference between me trying to speak French with a dictionary in my hand, looking up every word. Versus actually being fluent in French. Exactly. It thinks in
tools. It thinks in execution. If it writes code that fails a test, it sees the error, corrects it, and reruns it, all before it even gets back to you. Now, speaking of thinking and correcting itself, yeah, there's one feature here that I think solves the single biggest anxiety I have when I use these tools. Now, I want to be a little vulnerable for a second. Go for it. I still struggle with trusting the output. You know, I'll spend 20 minutes
crafting this perfect prompt. I get an answer that looks incredible, super confident, polished. And then I spend 40 minutes fact-checking it because I've been burned by hallucinations before. Oh, yeah. The universal experience. The confident liar problem. Exactly. And it creates this weird friction where I'm like, is this actually faster if I have to babysit it? But Garlic has a self-checking mechanism. This is a game changer for exactly that anxiety. Before the model answers
you, it enters a verification state. It just pauses. It checks its own internal knowledge graph to see, "Do I actually know this, or am I just statistically guessing?" So it has a conscience, or at least a built-in BS detector. It has a System 2 thinking process, to use the Daniel Kahneman term. System 1 is fast, intuitive. System 2 is slow, deliberate. Garlic uses System 2. If it isn't sure, it reassesses. The report says this leads to drastically fewer hallucinations.
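The conversation does not say how Garlic's verification state works internally, so the following is only a generic draft-then-verify pattern, sketched under stated assumptions rather than a description of OpenAI's mechanism. The `draft` and `verify` callables are hypothetical placeholders; here they are toy functions for a multiplication question, where the first draft is deliberately wrong.

```python
from typing import Callable, Optional


def answer_with_self_check(
    question: str,
    draft: Callable[[str, int], str],    # hypothetical: produces a candidate answer per attempt
    verify: Callable[[str, str], bool],  # hypothetical: independent check of a candidate
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes verification, or None (abstain)."""
    for attempt in range(max_attempts):
        candidate = draft(question, attempt)
        if verify(question, candidate):
            return candidate
    return None  # better to abstain than to return an unverified guess


# Toy stand-ins (assumptions for illustration only).
def toy_draft(question: str, attempt: int) -> str:
    guesses = ["381", "391"]  # the first guess is a confident mistake
    return guesses[min(attempt, len(guesses) - 1)]


def toy_verify(question: str, candidate: str) -> bool:
    left, _, right = question.split()  # expects a question like "17 * 23"
    return int(left) * int(right) == int(candidate)


checked = answer_with_self_check("17 * 23", toy_draft, toy_verify)
single_shot = answer_with_self_check("17 * 23", toy_draft, toy_verify, max_attempts=1)
print(checked, single_shot)
```

The design point is the trade described in the discussion: extra attempts cost time up front, but a wrong first draft is caught before it reaches the user, and when nothing verifies, the function abstains instead of guessing.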
For lawyers, for developers, this is everything. But I have to play devil's advocate again. If the model is stopping to check its work, doesn't that make it slower? We were just talking about speed being the new king. It might pause briefly before the first token appears, maybe a second or two of thinking time. But think about the time you just described, the 40 minutes you spend fact-checking. It might be slower per response, but it saves hours of human rework later. It's
that Navy SEAL mantra. Slow is smooth, and smooth is fast. So the pause pays for itself by eliminating all that cleanup time. Exactly. It trades milliseconds of latency for hours of reliability. I'll take that trade any day. Okay, I want to zoom out a bit. We have the specs, the philosophy. But OpenAI isn't operating in a vacuum. They're in a war. We mentioned Google and Anthropic. How does Garlic actually stack up? This is the battle for the leaderboard. So let's look at Google
first. Gemini 3. The heavyweight. The heavyweight, yeah. If you look at the leaked benchmarks, the battle here is scale versus density. Gemini 3 wins on multimodal. If you have messy, real-world data, video, audio, weird images, Gemini is still the king. It has that massive context and parameter count for a reason. So if I'm analyzing a movie, I use Gemini. Correct. But Garlic wins on pure text, code, and complex reasoning. The benchmark is something called GDPval for reasoning. What's
that measuring? It's measuring logic puzzles, multi-step reasoning where you can't just memorize an answer. Garlic is scoring 70.9%. Gemini is at 53.3%. Wow. That is not a small margin. That's a generational gap. It's a blowout on reasoning. So the verdict is, analyze a three-hour video? Use Gemini. Build the back end of a banking app where logic is everything? You use Garlic. Okay, so that's Google. What about Anthropic? Claude Opus 4.5. I know a ton of developers who swear by
Claude. It feels warmer. It writes really readable code. Yeah, this is the battle for the developer's soul. Claude is known for that warmth and readability. But Garlic is coming in with a ruthless value proposition. It matches Claude's coding proficiency: 94.2% on HumanEval+, which is the gold standard. So it's just as good at the actual coding. Okay, so it's a tie. Not quite, because Garlic does it at two times the speed and half the cost. Half the cost. Because of that pruning
we talked about. The model is physically smaller. It burns less electricity. It costs OpenAI less to run, so it costs you less to use. That's significant. Yeah. But does price really trump everything? I mean, if Claude feels more human to interact with, won't people stick with it? For a casual chat, maybe. If you're brainstorming, you might stick with Claude. But for running a business at scale, if you're an API customer processing millions of requests, half price is everything.
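The "half price at scale" argument is simple arithmetic, and a short worked example shows why it matters for an API customer. The request volume, tokens per request, and per-token prices below are made-up placeholders for illustration; no actual pricing appears in the leak.

```python
def monthly_cost(requests: int, tokens_per_request: int, price_per_million_tokens: float) -> float:
    """Monthly API bill in dollars for a given volume and per-million-token price."""
    total_tokens = requests * tokens_per_request
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical workload: 5 million requests a month, 2,000 tokens each.
REQUESTS = 5_000_000
TOKENS_PER_REQUEST = 2_000

baseline = monthly_cost(REQUESTS, TOKENS_PER_REQUEST, price_per_million_tokens=10.0)
half_price = monthly_cost(REQUESTS, TOKENS_PER_REQUEST, price_per_million_tokens=5.0)

print(f"baseline:   ${baseline:,.0f} per month")
print(f"half price: ${half_price:,.0f} per month")
print(f"savings:    ${baseline - half_price:,.0f} per month")
```

Under these assumed numbers the bill drops from $100,000 to $50,000 a month. The savings scale linearly with volume, which is why the effect is negligible for casual chat and decisive for a business processing millions of requests.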
The unit economics alone will shift the market overnight. It's the difference between a boutique shop and, like, industrial scale. Exactly. If you're building a startup and your API bill just got cut in half, you're not going to care how warm the model feels. You care that it works and it's cheap. So Garlic competes on unit economics and logic and sort of cedes the human touch to Claude. For now, yeah. It's an industrial revolution, not a dinner party. So we have the what and the
why. I want to take a brief pause here. And we're back. We've looked at the code red, the tech, and the competition. I want to wrap our heads around the timeline and the big picture. Leaks are great, but shipping is what matters. When are we actually going to see this thing? Well, looking at how the leaks and vendor updates are converging, it feels imminent. We're expecting a preview release, probably to ChatGPT Pro users and some partners, in late January 2026. Late
January. That's basically this week. It's happening now. Then the full API availability is slated for February. And this is interesting. They're expected to integrate a version of this into the free tier by March. A counter to Gemini's free access. Exactly. They have to capture that user base. They can't let Google own the entry-level market. Pulling all this together, the pruning, the 128K output, the self-checking,
what's the big idea here? If I'm a listener trying to make sense of all this, what is the core shift? The core shift is that the definition of AI progress has changed. For five years, progress meant bigger. It meant more parameters. Now, progress means cognitive density. It is about intelligence per dollar and intelligence per watt. Cognitive density. I like that. It sounds focused. It's about doing more with less. And for you listening, the so
what is pretty direct. If you're a developer, you can finally refactor entire codebases without losing context. You don't have to choose which files to upload. If you're a business, you can build automation that actually works because of the agentic capabilities. And if you're a creator, you can generate long-form content, like books, scripts, and courses, without having to manually stitch it together. It feels like the training wheels are coming off. They are. The limitations we've
all learned to work around, "Oh, I can only paste half this file," "Oh, I have to check its math," those limitations are just evaporating. So before we sign off, let's give everyone listening something to do. Because if this is dropping in weeks, we shouldn't just be waiting around. How do we prepare for the Garlic era? Two things. First, organize your data. If you want to use that 400,000-token context window with perfect recall,
your data needs to be ready. Clean up your documentation, merge your repositories, get it ready to go. Don't feed the gymnast junk food. Exactly. If you feed it garbage, you'll still get garbage. Just perfectly recalled garbage. Good point. And second, map your workflows. Start thinking in terms of agentic workflows. Don't just think, "What question can I ask the bot?" Think, "What multi-step process can I hand off entirely?" Check these invoices against these emails and
update the spreadsheet. If you map those out now, when Garlic drops, you can plug it in and it will just work. That's great advice. Whether it ends up being called Garlic or GPT-5.3 or something else entirely, the message is clear. The era of giant, slow, forgetful AI is ending. And the people who are ready to build with these new efficient models are going to have a massive advantage. A massive advantage indeed. Thank you for walking us through the code red. Always
a pleasure. And to you listening, thank you for diving deep with us. Go clean up your data, get your workflows ready, and we will see you on the next one. Take care.
