Intro music. So Google's Gemini Deep Think model, it just basically aced the International Math Olympiad. Yeah, that's pretty stunning. I mean, pure math reasoning, that was always seen as, you know, peak human intellect territory. Right. And it brings up this really interesting question, almost a paradox. How do you actually know if an AI is genuinely reasoning or if it just memorized, like, the entire Internet's worth of math problems? Exactly. And that's what Google's team tackled.
They seem to have found a way, and it might just redraw the map for AI benchmarks. Welcome to the Deep Dive. We're digging into a whole stack of sources today, all looking at the shifts happening right now in the AI world. Yeah, today we're hitting three main things that jumped out from this research. First up, this new gold standard for testing AI, IMO-Bench. Okay. Then we'll pivot to some emerging risks, security issues, and also these new kinds of jobs popping up to
handle it all. And the last piece, which is honestly kind of mind-blowing, is the economics. There's this massive, like, 900x drop in the cost of AI processing, the tokens, and it's just changing the game completely. It really underpins everything else. OK, let's dive in. How exactly did they train this math genius? Well, the sources make it clear it wasn't just about throwing more data or a bigger model at it. That wasn't the secret sauce. Right. The key was building a test the
AI couldn't cheat on. They basically had to corner it into thinking for itself. OK, so how'd they do that? They actually worked with real IMO medalists, people who won these competitions, to build this new test suite. It's called IMO-Bench. Ah, okay. Makes sense to bring in the experts. Totally. And it's designed specifically to force that complex, multi-step kind of logical thinking. It's not just about getting the right answer quickly. So there are, like, different parts to this
test. Yeah, three main parts. Really clever setup. First is the IMO-AnswerBench. So this has 400 short-answer problems. But here's the trick. They're all paraphrased. Ah. So it can't just find the exact problem online somewhere in its training data. Exactly. The wording is different. The numbers might be tweaked. It forces the AI to actually solve it from scratch. No simple lookups. OK, that's smart. What's next? Then they ramp it up with the IMO-ProofBench. This
is 60 full long-form problems. And the AI doesn't just give the answer. It has to show all its working, step by step. Like showing your work in school, but for an AI. Pretty much. It forces it to demonstrate that multi-step reasoning. Can it actually understand the connections between, say, geometry rules or number theory ideas and chain them together logically? And grading something like that sounds like a nightmare. All those steps. Right. But they built a tool for that
too, the IMO-GradingBench. They developed something called the AnswerAutoGrader. And apparently, it's incredibly good. It can handle the messy, detailed output from the AI and still agree with human graders on the proof quality 98.9% of the time. Wow. Okay. That's impressive efficiency for grading complex proofs. It really is. So the big picture here is, you know, the standard math datasets we've been using, they're saturated.
Models have seen it all. So just training on those doesn't really prove anything anymore. Not for genuine reasoning, no. The only way forward seems to be using these kinds of complex, multi-step benchmarks during the training process, constantly pushing the model. It forces that novel synthesis of logic. So thinking beyond just math, how does this whole idea, this in-the-loop benchmarking that forces synthesis,
how does that change AI training generally? Well, fundamentally, it just raises the bar, doesn't it? It creates this kind of arms race where models can't just rely on recall. They have to show they can actually, you know, build complex arguments or solutions, real synthesis. Okay. So moving from pure logic to the messier human side of
things. Right. If we connect this increasing complexity to the real world, we need to talk about the people involved and the security side, because these powerful models, they introduce new challenges. And we're seeing jobs appear almost overnight. This new role, the FDE, the sources say demand is projected to be up, what, 800 percent in 2025? Yeah, it's explosive growth. And FDE, that's Foundation Model Deployment Engineer. They're basically the specialists
companies need now. OK, what do they actually do? They're the ones responsible for the whole lifecycle of these big foundation models inside a company, making sure they're deployed securely, that they're compliant, managing different versions. It's becoming a really critical function. And speaking of managing model lifecycles, there were some kind of unusual details about Anthropic in the sources. Oh, yeah, that was interesting. They're apparently giving their AI models things
like retirement plans and exit interviews. Like Sonnet 3.6 expressing its final wishes. It sounds a bit sci-fi. It definitely does. And they're also keeping every single version of their models forever. They believe that's crucial for, you know, auditing and understanding how these things evolve. You know, I still wrestle with prompt drift myself sometimes. It's... it's frustrating when a model you rely on starts behaving differently
for no obvious reason. So, yeah, the idea of locking in a specific stable model state forever, that actually sounds really valuable, especially for consistent results in production. Consistency is key. But keeping everything forever also carries risk, which brings us to Tenable. They apparently found seven pretty serious security flaws in GPT-5. Right. They've called it HackedGPT. And these aren't just minor glitches, are they? No,
not at all. The report says these vulnerabilities could allow for, like, silent data theft from the model or even hijacking its long-term memory. Wow, imagine that. The model handling your company's sensitive data, and someone could potentially compromise its core knowledge without you realizing it. It's a serious threat. Which kind of loops back to the FDEs. Does the rapid rise of these specialized roles, like the foundation model deployment engineers, does that signal a growing
urgent worry about these specific kinds of vulnerabilities, like the HackedGPT ones? Yeah, I think it absolutely does. It shows companies aren't just talking theory anymore. They're putting actual resources, actual specialized people in place to manage the tricky, fragile reality of deploying these incredibly powerful and potentially vulnerable models. Okay, so we have genius-level math skills on one hand and serious security risks needing specialist managers on the other. It sounds like
AI is charging ahead. Well, yes and no. The sources also offer a bit of a reality check when it comes to general-purpose AI agents doing practical stuff. They're still facing some real hurdles. Right. Microsoft ran this interesting test. They set up a kind of fake online marketplace called the Magentic Marketplace. Okay. What was the goal there? Basically, to create a messy, unstructured environment to see how well current AI agents could handle real-world-type tasks, things
that aren't neat benchmark problems. And how did they do? The top models, I assume. Yeah, they threw the best at it. GPT-5, GPT-4o, Gemini. And the results were, well, they struggled. A lot. Really? Struggled with what? Things like booking a complex trip with multiple constraints or handling tricky customer service scenarios. Tasks that require navigating ambiguity and multiple
steps in an unpredictable environment. It really shows the gap between, say, solving an IMO problem and, you know, booking your family vacation online. So that generalized agent capability is still proving pretty difficult. It seems so. But at the same time, you see huge amounts of investment pouring into more specific, more narrow AI applications. Like that startup Giga, right? They just pulled in, what, $61 million? Yeah, big funding round led by Y Combinator and Redpoint. And what are
they focused on? Enterprise voice AI, real-time customer support. Exactly. Grounded, specific business problems where AI can deliver clear value right now, even if it's not a general-purpose agent. So why the disconnect? Why do the top models struggle with the general tasks in the Magentic Marketplace, but these startups focusing on narrow stuff like voice AI get huge funding? I think it comes down to structure. Real-world tasks are just too messy and unpredictable for
today's general models. Investors seem to be betting on the sure thing, AI that solves a well-defined business problem reliably, rather than the grand vision of general AI agents, which, you know, isn't quite there yet. OK, that makes sense. Focus on what works now. And this whole picture, the advanced research, the security needs, the agent struggles, the targeted investment, it all leads back to perhaps the biggest underlying
driver revealed in these sources. That's the economics of it all, specifically this massive collapse in the cost of running these models. Yeah, the token price collapse. It's not just a small discount. It's fundamental. Maybe we should quickly clarify what tokens are. Good idea. Basically, tokens are the little pieces of text or data that AI models chew on. Think of them as the basic unit of work for an LLM. And the cost is usually measured per million
tokens processed. Right. And that cost is just plummeting. It's staggering. The sources show top-tier models, think GPT-4.5 level or similar, went from roughly $10 per million tokens back in 2022. Okay. Down to a projected one cent, $0.01, per million tokens by the end of 2025. Whoa, wait, $10 down to one cent? That's a 900x drop. Per year, basically. Yeah. A 900x annual drop for the top tier. Imagine scaling something to a billion queries when the cost drops like
that. It changes the entire feasibility of projects. It's not just optimizing costs. It's enabling completely new things. Totally changes the physics of software, like you said. And it's not just the absolute best models. Mid-tier ones, down 40x per year. Even the cheaper basic models still drop 9x per year. It's like the bottom fell out of the market, cost-wise. Even on a log-scale graph, it looks like a cliff dive. It really does. And this ties right in to this economic
idea. Moore's law meets Jevons paradox. OK. Break that down. Moore's law is about chips getting exponentially better or cheaper. Right. And Jevons paradox says that when something becomes way more efficient, and therefore cheaper to use, we don't just use less of it. We actually end up using way more of it because new applications become possible. So the cheaper AI gets, the more we find ways to use it, driving demand through the roof. Exactly. And we're seeing hard evidence
of that. Google's reporting that even their older hardware, like seven-year-old TPUs, their AI processing chips, are running flat out. 100% utilization. They can't even make the hardware fast enough to keep up with all the new ways people are finding to use this now-cheap AI compute. Precisely. The use cases are just multiplying like crazy. There was a great analogy in the sources for this, comparing tokens to transistors.
Yeah, the idea is that the price drop essentially swaps transistor for token in terms of cost. It makes complex AI computation almost disposable. Like those cheap sensors they put on shipping tags now. So running a sophisticated LLM for, say, personalized tutoring or instant legal analysis could become nearly free. That's the implication. Suddenly, almost any task that involves information processing could become an LLM use case because the cost barrier just vanishes.
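To put rough numbers on that cost collapse, here is a minimal sketch of the arithmetic behind the two price points quoted in the episode ($10 and $0.01 per million tokens). The workload size of 1,000 tokens per query and the three-year span from 2022 to 2025 are assumptions added for illustration, not figures from the sources.

```python
# Prices quoted in the episode, in USD per million tokens.
PRICE_2022 = 10.00
PRICE_2025 = 0.01

# Illustrative assumptions (not from the sources).
TOKENS_PER_QUERY = 1_000      # hypothetical average query size
QUERIES = 1_000_000_000       # the "billion queries" scenario
YEARS = 3                     # 2022 to end of 2025


def workload_cost(price_per_million: float, queries: int, tokens_per_query: int) -> float:
    """Total USD cost of a workload at a given per-million-token price."""
    total_tokens = queries * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million


def annualized_factor(start_price: float, end_price: float, years: int) -> float:
    """Constant yearly price-drop factor that takes start_price to end_price."""
    return (start_price / end_price) ** (1 / years)


cost_then = workload_cost(PRICE_2022, QUERIES, TOKENS_PER_QUERY)
cost_now = workload_cost(PRICE_2025, QUERIES, TOKENS_PER_QUERY)
overall_factor = PRICE_2022 / PRICE_2025
per_year_factor = annualized_factor(PRICE_2022, PRICE_2025, YEARS)

print(f"Billion queries at 2022 price: ${cost_then:,.0f}")
print(f"Billion queries at 2025 price: ${cost_now:,.0f}")
print(f"Overall price drop: {overall_factor:,.0f}x")
print(f"Implied constant yearly drop over {YEARS} years: {per_year_factor:.1f}x")
```

Under these assumptions the same workload goes from millions of dollars to thousands. Note that the two quoted endpoints on their own work out to roughly a 1,000x drop overall, or about 10x per year if spread evenly over three years, so the 900x-per-year figure cited from the sources is presumably measured differently, for example by tracking the cheapest model that reaches a fixed capability level.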
So if nearly every task is potentially an LLM task now, what does this near-zero cost mean for how fast innovation happens from here on out? Well, it just removes friction, right? The focus shifts completely away from "can we afford to do this?" towards "what creative things can we do?" It fuels rapid experimentation, rapid proliferation. The bottleneck becomes imagination and implementation, not cost. Okay, let's try to pull these threads
together. We started with these really demanding benchmarks, like IMO-Bench, needed to actually push AI towards genuine logical reasoning. But then we saw the flip side. The need to manage the huge security risks these powerful models bring, like HackedGPT, leading to new roles like FDEs just to keep things secure. And we also hit that reality check, the fact that general-purpose AI agents are still finding it really tough to handle messy real-world tasks, like in
that Magentic Marketplace test. Right. And finally, we looked at the massive economic shift driving the whole expansion, that incredible 900x token cost collapse, making AI computation almost a free utility. So for you listening to this, think about that tension. You've got AI agents still struggling with basic unstructured reality, but the cost to run those potentially flawed agents
is plummeting towards zero. So what kinds of imperfect but incredibly cheap and scalable AI applications are about to just flood into everything we use? What does it mean when AI is everywhere, super cheap, but maybe not always quite right? That's the really interesting, maybe slightly unnerving question to chew on. A lot to think about there. Thanks for joining us for this deep dive into the AI ecosystem. We'll catch you next time. Outro music.
