🎙️ EP 208: A 3B Model Just Beat Qwen?!

00:00

Imagine you walk into a Ferrari dealership. You point at their top-of-the-line, million-dollar hypercar. The one on the pedestal, yeah. Exactly. And the dealer just looks at you and says, actually, we're going to give you that same engine, but we're putting it in a sedan and the gas is 67% cheaper. And the sedan somehow gets better mileage. That is, I mean, effectively what just happened this week. Welcome back to the Deep Dive. It's good to be here. Today, we're not just talking

00:26

about new features or specs. We are tracking a really massive shift in the industry. This pivot from just raw power to what you might call radical efficiency. It's the moment the tech really matures, you know. We're moving from the wow phase to the ROI phase. Right. Can it do the job cheaply? And can it do it without hallucinating halfway through? That's the key. Lay out the map for us today. How are we going to tackle

00:51

this? All right. We've got a heavy lineup. First, we're starting with Anthropic's new Claude update, Sonnet 4.6. It feels less like an update and more like a declaration of war on the whole flagship model category. Then we've got a lightning round: a former NPR host suing Google over a voice that sounds a little too familiar, Apple's hardware roadmap for 2027, all the big stuff. And we end on the underdog story. The biggest story of the week, in my opinion.

01:20

A recruiting startup in China, not a tech giant, a recruiting firm, just released a tiny AI model that is absolutely beating the giants at their own game. I saw those numbers. It doesn't seem mathematically possible, but we'll get there. So let's start with the big dog. Anthropic, Claude Sonnet 4.6. Right. Normally a .6 release is, you know, a patch, some bug fixes. But these benchmarks feel... Very aggressive. The whole headline here is the price-to-performance ratio.

01:50

Anthropic is saying this new Sonnet delivers near-Opus performance. And Opus is their Einstein model, the expensive heavy lifter. Sonnet's the mid-tier. Correct. But the pricing didn't change at all. So the pricing didn't move up. No, it stayed flat. $3 per million input tokens, $15 for output. Opus is $5 and $25. OK, let's just sit with that for a second. If I'm getting near-flagship intelligence for those prices, doing the math, that's roughly 67% more tokens for

02:18

every dollar I spend. Which is huge. But think about what that lets you do. As a developer, you can now afford to have the AI think longer, maybe check its own work three times for the same cost as one answer from the old flagship. It totally changes the economics of reliability. It does. It turns intelligence into a commodity you can afford to waste a little bit of. And it seems like the users are really noticing.
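For readers who want to check the arithmetic, here is a minimal sketch using only the per-million-token prices quoted in the episode. The function names are illustrative and not part of any pricing API.

```python
# Prices quoted in the episode, in US dollars per million tokens.
SONNET = {"input": 3.00, "output": 15.00}
OPUS = {"input": 5.00, "output": 25.00}


def tokens_per_dollar(price_per_million: float) -> float:
    """How many tokens one dollar buys at a given per-million price."""
    return 1_000_000 / price_per_million


def extra_tokens_per_dollar(cheap: float, expensive: float) -> float:
    """Fractional gain in tokens per dollar when switching to the cheaper price."""
    return tokens_per_dollar(cheap) / tokens_per_dollar(expensive) - 1.0


def discount(cheap: float, expensive: float) -> float:
    """Fractional price cut per token when switching to the cheaper price."""
    return 1.0 - cheap / expensive


for kind in ("input", "output"):
    gain = extra_tokens_per_dollar(SONNET[kind], OPUS[kind])
    cut = discount(SONNET[kind], OPUS[kind])
    print(f"{kind}: {gain:.0%} more tokens per dollar, {cut:.0%} cheaper per token")
```

The same price gap reads as about 67% more tokens per dollar or as a 40% lower price per token, depending on which side of the ratio you quote. At these prices, a fixed budget covers roughly 1.67 Sonnet passes for every Opus pass.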

02:42

I was looking at the sentiment analysis. 70% of users prefer this new Sonnet over the last one. Right. But the stat that blew my mind: 59% prefer it over the previous Opus. That's the cannibalization metric right there. When your mid-tier model beats your old flagship, you've successfully raised the floor for the entire industry. But why the preference? Is it just about speed? Because usually better means smarter, not just faster. It's bloat reduction. That's

03:09

what users are saying. Fewer hallucinations and, this is interesting, less over-engineering in the answers. Oh, I know exactly what that means. You ask for a simple two-line Python script and it gives you a lecture on coding ethics and a five-paragraph history of the language first. Exactly that. Sonnet 4.6 just cuts the waffle. It's pragmatic. In fact, on the OSWorld benchmark, which tests how well an AI can navigate computer apps... Like a human would. Yes. It's

03:38

matching the current Opus. It's a worker bee with the brain of a queen. Speaking of worker bees, there's a new feature they rolled out called context compaction. This sounds technical, but I have a feeling it's going to matter. It really might. Do you ever have that issue where you're in a long chat with an AI, maybe working on a project, and by message 50, it starts forgetting what you told it in message one? Prompt drift. I'll be honest. I struggle with this constantly.
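As a rough illustration of the compaction idea this segment goes on to describe (summarize older turns, keep recent ones verbatim), here is a minimal sketch. It is not Anthropic's implementation. The word-count tokenizer, the `summarize` helper, and the `budget` and `keep_recent` parameters are all assumptions made for the example; a real system would ask the model itself to write the summary.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def summarize(messages: list[Message]) -> str:
    """Hypothetical summarizer: keeps the first sentence of each user turn.
    A real system would have the model write this summary."""
    kept = [m.text.split(".")[0].strip() for m in messages if m.role == "user"]
    return "Summary of earlier conversation: " + "; ".join(k for k in kept if k)


def compact(history: list[Message], budget: int, keep_recent: int = 2) -> list[Message]:
    """Replace older turns with one summary message once the history exceeds the budget."""
    total = sum(count_tokens(m.text) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [Message("system", summarize(older))] + recent


history = [
    Message("user", "We are not using the requests library. Please remember that."),
    Message("assistant", "Understood, I will avoid it."),
    Message("user", "Keep the tone formal. No jokes."),
    Message("assistant", "Noted."),
    Message("user", "Now draft the fetch function."),
    Message("assistant", "Here is a first draft."),
]

compacted = compact(history, budget=20)
for m in compacted:
    print(f"{m.role}: {m.text}")
```

The design choice worth noticing is that the rules set early on (which library to avoid, which tone to use) survive inside the summary, while the verbatim back-and-forth is dropped.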

04:02

I spend half my time reminding the bot, hey, remember, we aren't using that library, or don't use that tone. It feels like the model gets dementia. Right. And that happens because the context window, the AI's short-term memory, gets filled up with every single word. It's trying to remember every typo from the last four hours. And eventually it just collapses under its own weight. Exactly. So compaction fixes that. Think of it like a court stenographer versus a really good executive

04:31

assistant. Okay. The stenographer writes down every single word verbatim. The assistant listens for 30 minutes, shreds the full transcript, and then writes a perfect three-bullet summary of what was actually decided. So the AI is editing its own memory in real time. Yes. It summarizes the older parts of the conversation. It keeps the intent and the decisions, but it throws away all the verbatim fluff. This means you can have a never-ending conversation. That is huge for

04:57

autonomous agents. If I have a bot monitoring my emails for a week, it has to remember the rules I set on Monday without crashing on Friday. Exactly. It keeps the context clean. So let me ask you this. If Sonnet is this good and this cheap, does the flagship model category just die? Why would I ever pay for Opus or GPT-5? It doesn't die, but it forces flagships to justify their premium price. They can't just be good anymore. They have to be geniuses. Raising the

05:25

floor forces the ceiling up. Okay, let's pivot. The industry never sleeps. Let's hit the lightning round. Let's do it. First up, a legal battle that feels like it's straight out of a sci-fi novel. NPR versus Google. David Greene, the former host, is suing. He says Google's NotebookLM tool is ripping off his voice. This is fascinating because Google's defense is pretty standard: we hired a paid actor. And they probably did, but Greene isn't arguing about the voice print.

05:54

He's arguing about the cadence. Exactly, the public radio voice. The pauses, the intonation, the way they sort of lean into the microphone. It raises this huge legal gray area. Can you copyright a pause? Can you copyright a vibe? Right, but I think there's a real identity theft angle here. If I listen to it and think it's David Greene, isn't that the real problem? That's the danger zone. Legally, cadence isn't copyrightable,

06:17

not yet. But if a jury believes Google told the AI to sound like David Greene, that gets into right of publicity. It's the same thing Scarlett Johansson dealt with. It's not about the raw data. It's about the intent to mimic. So regarding the NPR lawsuit, where is the line between inspiration and theft in AI voices? It's blurry. Cadence isn't copyrightable, but identity theft is a real legal risk. Okay, moving from legal battles to actual tools. A new tool called Manus just launched

06:48

agents inside Telegram. This is wild. It's OpenClaw-style agents, basically autonomous research bots, but they live inside your chat app. So no coding needed? None. You just ask it to research a topic or process some data, create a PDF, and it just does it right there in the chat. It's putting a research assistant in the app we use for memes. Okay, let's look further down the timeline. Apple. Rumors are swirling about an acceleration on three wearables. Yes, the roadmap

07:13

seems to be pointing toward late 2027. We're hearing about an AI pin, smart glasses, and this is the big one, AirPods with really deep Siri integration. 2027 sounds so far away, but in hardware terms, that's basically tomorrow. It is. If production starts in December 2027, we are looking at a very, very interesting holiday

07:32

season that year. Apple is clearly betting the future isn't just a phone. It's AI whispering in your ear and overlaying your vision. Meanwhile, on the infrastructure side, OpenAI just introduced Lockdown Mode. This is for the enterprise crowd, right? Totally, the security-focused teams. It blocks risky tools, cuts off live web access. It's like putting the AI in a clean room so it can't leak secrets or download malware. And Meta, they're just buying up all the chips. All of them. Millions

08:02

of Nvidia chips, Grace CPUs, the next-gen Vera Rubin systems. They are building a massive war chest of compute. Speaking of war chests, Blackstone just led a $1.2 billion round for a company called Neysa. An Indian AI data center startup. This is a huge signal that the demand for AI infrastructure is just exploding in emerging markets. It's not just a Silicon Valley game anymore. Not at all. The physical backbone of

08:27

AI is going global. So when you zoom out on all this, lawsuits over voices, agents in Telegram, AI in our glasses, data centers in India, what's the big picture? The big picture is AI becoming more human, more accessible, and just more ubiquitous. The gap between computer and partner is really starting to vanish. And that transition needs new tools. It's not just about the big models. It's about that layer in between. For sure. We've seen a few interesting ones pop up. We can do

08:56

a quick fire on these. All right. First, Figure AI. This one's for the designers. It maps out user flows, it spots edge cases, and it can even build out A/B variations for you. It's automating the logic part of design. Okay, then there's Layers. That's a growth planning tool. It generates content, social posts, ads, basically a marketing agency in a box. And Boost.space v5. This one

09:17

calls itself a persistent context layer. Think of it as the glue that holds all your different AIs together so they can actually talk to each other. And finally, Qwen3.5, an open-weight vision-language model. The key stat here is just fascinating. It delivers the capabilities of a nearly 400 billion parameter model. It's huge. But with the inference speed of a 17 billion parameter model. Okay, hold on. Inference speed. Let's define that. Simply put, it's how fast

09:46

the AI thinks and replies. It's the delay between you hitting enter and getting an answer. Usually, smart models are slow and heavy. Qwen3.5 is smart, but fast. So with tools like Figure and Layers, are we automating the creative director? We're automating the grunt work. The director still needs to choose the vision. [Sponsor break.] Okay, we've talked about the giants: Anthropic, Google, Apple. But this is where the story gets really interesting. The biggest surprise of the week,

10:12

for me, didn't come from Silicon Valley. It came from a recruiting startup in China. This is the story everyone should be paying attention to. The lab is called Nanbeige LLM Lab, and they released a model called Nanbeige4.1-3B. 3B, as in 3 billion parameters. In the world of LLMs, that is... Tiny. That's pocket-sized. It's incredibly small. For context, GPT-4 is rumored to be in the trillions of parameters. Llama 3 is 70 or 400 billion. 3 billion is something you could theoretically

10:42

run on a high-end phone. But it's crushing benchmarks, despite being so small. It is playing David to the industry's Goliath. On Arena-Hard v2, a really tough stress test, it scored a 73.2. Okay, numbers are numbers. Let's talk about coding. That's where the rubber meets the road. This is the shocker. On LeetCode Weekly Contests... These are real coding challenges that actual humans take to get jobs. Nanbeige reported an 85.0

11:09

% pass rate. 85%. Now compare that to Qwen3-32B, a model 10 times its size, which scored 50%.
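To put the size gap and the "run on a high-end phone" remark in concrete terms, here is a back-of-the-envelope sketch of the memory needed just to hold the weights. It assumes 2 bytes per parameter at 16-bit precision, 1 byte at 8-bit, and half a byte at 4-bit, and it ignores the KV cache, activations, and quantization overhead, so real footprints will be somewhat higher.

```python
# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9


for name, size in (("3B model", 3.0), ("32B model", 32.0)):
    row = ", ".join(f"{p}: {weight_memory_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{name} -> {row}")
```

Under these assumptions, a 3B model quantized to 4 bits needs about 1.5 GB for its weights, while a 32B model needs about 16 GB at the same precision, which is why only the smaller one is a plausible fit for phone-class memory.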

11:17

Wait, hold on. A 3 billion parameter model beat a 32 billion parameter model by 35 percentage points? Yes. That shouldn't be possible. That's like a go-kart beating a Formula One car. Imagine fitting a grandmaster chess player in your pocket. That is what this really represents. It just defies the logic that bigger is always better. So how did they do it? What's the secret sauce here? It's all about the methodology. They didn't just throw more data at it. They trained it smarter. They

11:42

used an upgraded fine-tuning mix, but the real magic is in the reinforcement learning, or RL. Okay, break that down for us. They used a two-stage process. First is pointwise RL. The model generates eight different answers for a single prompt, and a reward model scores each one. Right. And they optimize this using something called GRPO, Group Relative Policy Optimization. Jargon alert. GRPO. Think of it as a way to reduce repetitive

12:09

errors. It's like a teacher not just marking an answer wrong, but showing the student specifically where they started to ramble or make a mistake so they don't do it again. It tightens the logic. And the second stage? That's pairwise RL. They take a strong response and a weak response and force the model to compare them. It uses something called a swap-consistency regularizer. Okay, simpler analogy. It's like wine tasting. It's

12:34

one thing to say "this wine is good." It's much harder and more educational to taste two wines side by side and articulate why one is better. It just sharpens the model's judgment. So it's teaching the AI to recognize quality, not just generate text. Exactly. And because of this intense training, this tiny model supports up to 256,000 tokens of context. It can do deep search with hundreds of tool calls. It's a real agent. Does this prove that training technique matters

13:00

more than raw parameter size? Absolutely. Smart training beats brute force computing power every time. And this really brings us back to the big idea of this deep dive. It all circles back to efficiency. We started with Claude Sonnet, which gives us flagship-level power for a fraction of the price. Then we saw tools like Qwen and, of course, Nanbeige proving that smaller models are becoming incredibly smart through better

13:25

architecture, not just bigger data centers. The hardware from Apple and Meta is ramping up to support this whole ecosystem. But the software itself is actually getting leaner. We're not just building bigger brains anymore. We're building more efficient ones. And that's crucial. If we want AI everywhere, in our glasses, our chats, our phones, it can't rely on a massive server farm for every single thought. It has to be efficient. It has to be. Before we go, I just want to leave

13:50

you with one final thought. That Nanbeige model, the world-beating, code-crushing AI, it came from a recruiting startup. Not Google, not OpenAI. A company that helps people find jobs. If a non-AI-native company can build a state-of-the-art model to solve their specific problems, like coding and hiring, what happens when every company starts doing that? A law firm building the best legal model. A hospital building the

14:15

best diagnostic model. It means the era of renting these giant general-purpose brains from a handful of companies might be ending. We could be moving toward a world of specialized pocket-sized experts that live on our own devices. It's a profound shift. It's a future I'm very curious to see unfold. If you want to check out the links to the new Claude features or download that Nanbeige model yourself, check out the show notes. And give that context compaction a try. It might

14:40

just save your sanity on a long project. Until next time, stay curious. Stay curious.
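For listeners who want the pointwise RL stage from the episode in concrete terms: GRPO samples a group of answers to the same prompt, scores each one, and then measures every answer against its own group's average rather than against a separate learned value estimate. The sketch below shows only that group-relative scoring step. The eight reward values are made up for illustration, and nothing here is taken from Nanbeige's actual training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for eight sampled answers to one prompt.
rewards = [0.2, 0.9, 0.4, 0.4, 0.7, 0.1, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.2f}")
```

Answers that score above the group average get a positive advantage and are reinforced; those below get a negative one. If all eight answers score the same, every advantage is zero and the prompt contributes no learning signal.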

Transcript source: Provided by creator in RSS feed.