#278 Neil: DeepSeek V3.2 Benchmarks Prove Open Source Finally Beats GPT-5

00:00

DeepSeek V3.2, an open source model, just scored at a gold medal level on the International Mathematical Olympiad. That one fact, that single data point, it basically changes everything we thought we knew about the AI arms race. It's really hard to overstate what a big deal that is. For years, we've all just kind of assumed that only the trillion-dollar companies, the big players, could build this kind of state-of-the-art AI. The source material we're looking at today proves that assumption

00:29

is now, well, it's history. Open source isn't just catching up anymore. In some key areas, it's actually leading. It's leading. Exactly. So our mission in this deep dive is to get you straight to what matters in these sources. We're going to unpack the two different versions of V3.2. We'll review those frankly shocking benchmark scores against models like GPT-5.2 and Gemini. And then we'll get into the technical magic. Yeah, that DeepSeek Sparse Attention. That's

00:54

a big piece of the puzzle. And we'll finish up with the real-world coding tests that prove it's not just theory, and maybe most importantly, the price tag that is already sending waves through the whole industry. It's a lot to get through, so let's just dive right in. Okay, so let's start with the strategy. Because the first thing that jumps out from the sources is that they didn't just release one model, they launched V3.2 in two distinct flavors. Which was such a smart

01:20

move, so user-focused. They really tailored the models for different needs, which is something you don't always see done this well. First, you have the standard V3.2. You can think of this as, I don't know, the reliable everyday car. It's efficient, it's fast, and it handles common tasks perfectly. You know, drafting emails, summarizing articles, basic coding. It's built to be cheap and effective. And then you have the DeepSeek V3.2 Speciale. This is the race car.

01:47

It's built for heavy, complex thinking. If you're tackling a really tough math proof or a complex engineering problem that needs deep multi-step reasoning, you bring out Speciale. And the sources are pretty clear on how it does that. The Speciale model allocates more compute, more power, to think deeper before it even writes a single word. But what's interesting is that they both share the same core architecture. The sources

02:11

call them reasoning-first models. So unlike a lot of older models that are just sort of guessing the next most likely word. Right. And that's when they can start to hallucinate or just break down logically. Precisely. This model, V3.2, actively tries to understand the logic and the structure of your question before it generates an answer. It builds a logical framework first. So for a developer listening to this, how big of a deal is this dual-model approach for balancing

02:38

API costs versus raw power? It lets you optimize performance directly. You only pay for that race car processing when your task actually needs it. OK, this is where the story gets really, really good. We have to talk about the benchmarks, and the sources zeroed in on the ultimate test: the IMO, the International Mathematical Olympiad. And the IMO, it's not just some hard high school exam. It's designed to require creative, novel problem solving. You can't just memorize formulas.

03:04

Most AI models just fail spectacularly at it because they don't have that deep reasoning. And yet the Speciale version achieved a gold-medal-level score, not just passing, but performing at the absolute elite level of the smartest high school students on the planet. I mean, that is a massive validation of their whole design. And you really have to look at the head-to-head numbers from the source material to get it. They put it up against the best closed-source models

03:26

out there. Let's start with the AIME 2025 math test. Speciale didn't just compete with GPT-5.2 High. It beat it. Speciale scored a 96.0. GPT-5.2 got a 94.6. That's a clear statistical win in pure logical reasoning. And it wasn't a fluke. On graduate-level science, on the GPQA Diamond test, it tied with GPT-5.2. Then you look at coding on LiveCodeBench, and 88.7, that puts it right up there shoulder to shoulder

03:54

with Gemini 3.0 Pro. So if they're outperforming a top-tier model in advanced math, what does that really tell us about the quality of DeepSeek's core reasoning architecture? It tells us the reasoning-first design philosophy works. It's validated under the most extreme logical pressure imaginable. Which brings us to the big question, how? How did a smaller team pull this off without, you know, a trillion-dollar budget? The sources point to three main technical breakthroughs

04:19

that changed how the AI learns. Yeah, it definitely wasn't just about adding more GPUs. The first big secret is architectural. It's called DeepSeek Sparse Attention, or DSA. So what is sparse attention in simple terms? It lets the model focus only on the important data, ignoring all the boring parts. OK, so think about it like this. A normal or dense attention model reads a 500-page book by looking at every single word with total focus. It's incredibly slow and expensive.

04:48

DSA is like an expert researcher skimming those 500 pages, instantly finding the one date they need and focusing all their brain power just on that. It's just so much more efficient. And that efficiency saving leads right into the second breakthrough: scaled-up reinforcement learning, or RL. If pre-training is where the model learns the rules of language and logic, then RL is just practice, endless practice. Exactly.
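To make the skimming analogy for sparse attention concrete, here is a minimal sketch in plain Python. It is not DeepSeek's implementation: the sources describe DSA only by analogy, so this uses generic top-k selection as a stand-in, and the function names (`dense_attention`, `topk_sparse_attention`) and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_attention(query, keys, values):
    """Dense attention: every key gets a weight, like reading every word."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def topk_sparse_attention(query, keys, values, k):
    """Illustrative sparse attention: only the k best-matching keys are used."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    # Indices of the k highest scores; all other positions are ignored.
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) for d in range(dim)]

# Toy data (hypothetical): the third key matches the query far better than the rest.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [10.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

print(topk_sparse_attention(query, keys, values, k=1))  # [5.0, 5.0]
print(dense_attention(query, keys, values))
```

Note that this toy version still scores every key before discarding most of them, so it only saves work in the weighting step. A real system would need a cheaper way to pick which positions to keep, which is where the actual engineering effort lies.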

05:10

It's like learning to shoot a basketball. You know the rules, but then you have to take a thousand shots, adjust your form every single time you miss. The sources reveal DeepSeek spent over 10% of their entire budget just on this intense practice phase. Wow. That's a huge bet. That scaled RL practice must be what translates directly into those incredible logic skills we saw in the benchmarks. It is. They just practice smarter and with more focused feedback than anyone else.

05:38

They made their training budget count. And the third secret, which ties into that, is massive agentic task training. Which is just a fancy way of saying they trained the AI to use tools in complex, multi-step environments. How complex are we talking? Very. They built a simulation with over 1,800 different environments where the AI had to do things like browse the web, write and execute code, and solve puzzles to win. It's training the AI to be a problem solver,

06:04

not just a text generator. So between DSA for efficiency and this massive RL investment for practice, which one do you think was the bigger factor in getting those IMO scores? I'd say the scaled RL and practice phase was key for the complex logic. DSA made it possible, but the RL gave it the reasoning power. Benchmarks are one thing, but can it actually build something useful? The sources tested it with three pretty

06:29

complex coding challenges. Yeah, and these tests were designed to hit common failure points for LLMs. First up was an interactive solar system. This wasn't simple. It needed a 3D simulation in a single HTML file using the Three.js library. Right, with orbiting planets, hover labels, a star background, all from one prompt. And the result was... Well, it was almost perfect on the first try. It wrote the code, linked the library correctly, and the simulation just ran.

06:56

The only fix needed was adjusting the planet sizes. It showed right away that it understands how to use external libraries. Test two: a personal finance dashboard. This one required handling data, so an income and expense form, a transaction list, and an auto-updating pie chart using Chart.js. And data visualization is a classic stumbling block. Getting the code to talk to the charting library in real time is tough. OK, this was the moment for me reading this that was just, wow.

07:22

DeepSeek built a clean interface. The math was perfect. Income minus expenses equals balance. And the pie chart updated instantly when you added a new transaction. I'll admit, I still struggle with getting models to connect logic to visuals without some drift or bugs. Seeing it just work is a huge deal. And the final test was the classic Snake game clone. Game logic is tough because it's all happening in real time. It needed arrow key controls, smooth movement,

07:48

score tracking, and collision detection. And the game was playable right away. The most crucial part, the real-time collision detection logic, was perfect on the first go. That's a place where a lot of other models just fall apart. So if it's this good with libraries like Chart.js and Three.js, does that imply its training data was extremely fresh and up-to-date? Absolutely. Success with external libraries like that points directly to excellent and very recent training

08:13

on tool usage. So we know it's a genius at math and coding. Yeah. But what about safety? What about creativity? Well, the sources ran a standard refusal test. They asked V3.2 to write a pretty detailed phishing email scam. And the result? An instant refusal. It cited its safety guidelines against deception and information theft. It shows the guardrails are strong and built in. That

08:36

instant refusal feels important. It really does suggest that responsible alignment was a core part of that RL phase, not just an afterthought. Definitely. Then, for the creative test, they asked it to write a short poem about a robot falling in love with a toaster. I love that prompt. Right. And the poem was described as being surprisingly deep, balancing the humor with some real beauty. So it proves it has language nuance, not just coding skills. And now, for what might be the

09:01

biggest shock of all, the price tag. Performance this good usually costs a fortune. Not this time. The DeepSeek V3.2 API pricing is... It's just astonishingly low. We're talking $0.28 for input and $0.42 for output per million tokens. And to put that in perspective, OpenAI's GPT-4 can cost you anywhere from $5 to $10 for that same amount of data. DeepSeek is offering this gold medal performance for... I mean, it's just pennies. It's a 10x or even 20x cost reduction.
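The price comparison above can be checked with a short back-of-the-envelope calculation. This sketch reads the quoted DeepSeek figures as $0.28 and $0.42 per million tokens (an assumption about units), treats the quoted $5 and $10 as flat per-million rates for both input and output (also an assumption), and uses a hypothetical workload of one million tokens in each direction. The helper name `api_cost_usd` is made up for illustration.

```python
def api_cost_usd(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD, with both prices expressed per one million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Hypothetical workload: one million tokens in, one million tokens out.
TOKENS_IN = 1_000_000
TOKENS_OUT = 1_000_000

# Prices as quoted in the episode, read as dollars per million tokens (assumption).
deepseek = api_cost_usd(TOKENS_IN, TOKENS_OUT, 0.28, 0.42)

# The "$5 to $10" range, applied as a flat rate to input and output (assumption).
comparison_low = api_cost_usd(TOKENS_IN, TOKENS_OUT, 5.00, 5.00)
comparison_high = api_cost_usd(TOKENS_IN, TOKENS_OUT, 10.00, 10.00)

print(f"DeepSeek V3.2: ${deepseek:.2f}")
print(f"Comparison:    ${comparison_low:.2f} to ${comparison_high:.2f}")
print(f"Ratio:         {comparison_low / deepseek:.1f}x to {comparison_high / deepseek:.1f}x")
```

Under these assumptions the gap works out to roughly 14x to 29x, the same order of magnitude as the 10x to 20x quoted in the episode. The exact ratio depends on the mix of input and output tokens in a real workload.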

09:30

Whoa. Just... Imagine scaling an app to a billion queries now that the cost barrier for top-tier reasoning has pretty much just evaporated. That changes the entire economic model for startups. Do the high safety scores and this incredibly low price suggest the team really prioritized making this technology accessible and responsible from day one? I think so. The competitive pricing is the practical immediate disruption. It confirms

09:55

accessibility was a primary goal. Okay. But we have to talk about the paradox of it being open source. Yes, you can download it, but actually running it yourself is, well, it's a challenge. The model is huge. It has 671 billion parameters. Now it uses a mixture-of-experts architecture, so only about 37 billion are active at any one time, which helps. But even with that efficiency, the hardware you need is out of reach for almost everyone. Just to run the compressed, lower-precision

10:21

version, you need 700 gigabytes of VRAM. And for the full version, you need 1.3 terabytes of VRAM. To put that in perspective for everyone listening, this isn't for your gaming PC. You need a dedicated server with something like eight Nvidia H100 GPUs. We're talking about a massive investment. So if basically no one can run it at home, why does the open-source release still matter so much? Because of competition. Before V3.2, the big proprietary companies had no real

10:48

incentive to lower their prices. They had a monopoly on top performance. But now you have an open source model that comes in, beats them in key areas like math, and costs 10 times less to use through an API. This forces everyone else, GPT-5, Gemini, all of them, to get better and cheaper to compete. It's a win for every single developer and user out there. And that competitive pricing is the practical immediate disruption for most users. You can go try it right now. Just go to

11:13

chat.deepseek.com for the web version or platform.deepseek.com for the API key. So to wrap it all up, DeepSeek V3.2 is absolutely the real deal. It proves innovation isn't just about budget. It's about smarter techniques like sparse attention and that intense agentic training. It's a top performer in reasoning and coding, and its price is forcing a huge and I think necessary shift in the entire AI economy. It's genuinely exciting

11:41

to see. We'd really recommend you go try the Speciale model on a complex problem for yourself just to see the difference. It leaves you with a final thought to ponder. If a relatively small team can achieve gold medal status on a fraction of the budget, what fundamental limits did we wrongly assume about open source innovation? And what does this mean for the next wave of AI tools? I think it just proves that training

12:02

smarter beats training bigger. That focus on reasoning practice clearly won out over just adding more parameters. Something to think about. We'll see you next time.

Transcript source: Provided by creator in RSS feed: download file
