🎙️ EP 122: GPT-5's Big Math Fail + The 1 Sentence That Killed Prompt Engineering

00:00

Okay, we really have to kick off this deep dive talking about that sensational claim, the one that just blew up the AI headlines for a bit. Yeah, the deleted tweet. Exactly. From an OpenAI VP suggesting, well, something almost unbelievable. That GPT-5 had made real progress on multiple really tough, long-unsolved math problems. The Erdős list stuff. Which, you know. Solving even one of those. That's huge for a mathematician. Career-defining. Absolutely. So it went viral

00:30

like crazy. But the reality check came fast. And it was public and honestly kind of brutal. Yeah. And the core lesson, the thing that rival CEOs and academics jumped on, was this key difference. Confusing just finding information, retrieval, with actual, like, new thinking, real reasoning. Precisely. Welcome to the deep dive. We've sifted through your latest pile of sources. It's a real mix this time. High-stakes drama, some really practical advice, and one genuinely surprising

00:58

new technique. Yeah, we're going to distill all that for you. Our mission today, cut through the noise, the hype. We need to really unpack what these big AI models actually do versus what the labs, well, sometimes claim they can do. So first up, we'll dig into that GPT-5 math thing properly. Then we're shifting gears. We'll talk about some essential tools and workflows, stuff for the builders and professionals listening. And finally, there's this technique out of Stanford.

01:24

Simple, costs nothing, but it really changes how we prompt these models. Might even be the end of complex prompt engineering as we know it. So yeah, let's get into this source material. Okay, segment one, the GPT-5 controversy. The initial claim itself was, it was pretty wild. Totally. The VP tweeted, GPT-5 solved 10 unsolved Erdős problems. 10. And showed progress on 11 others. And these Erdős problems, just to be clear, they're notoriously hard. Right, yeah.

01:54

They need genuine mathematical creativity. Formal proofs. They're not like, you know, multiple-choice questions. Right. So then mathematician Thomas Bloom, who actually tracks this stuff officially, he looked into it. And he quickly confirmed, nope. The model found old solutions. Stuff already in the training data. Exactly. Not new, original proofs. Yeah. Just found stuff it had already seen. And the reaction from the big names. Pretty intense. Yeah. Yann LeCun from

02:19

Meta apparently used some sharp words. Said they got hoisted by their own petard. Yeah. Basically meaning their own hype backfired. Ouch. And Demis Hassabis from DeepMind. Called it embarrassing. Straight up. Wow. But what's really interesting is that even OpenAI's own Sébastien Bubeck kind of acknowledged the core issue here. Which is? The huge difference intellectually between just retrieving an old paper from the training data, which GPT-5 did well, and actually inventing a

02:48

new proof. The tweet implied invention. But delivered retrieval. That's the crux of it. So, okay, it was one deleted tweet. Why does this really matter in the bigger picture? Because AI labs are always using these leaderboards, right? Yeah. GSM8K, the math benchmark, they boast about these scores constantly. We see those headlines all the time. New model tops the charts. Right. But those scores often just test, well, stored reasoning, pattern matching, stuff the model learned during training.

03:14

So it sees a new question, finds a similar solved one in its memory, and kind of copies the method. Pretty much. It's retrieval disguised as reasoning. To claim real discovery, you need that formal proof, verifiable novelty. Which wasn't there in this case. Nope. Proving those quick test scores can be, frankly, pretty misleading sometimes. Okay, so if these standard reasoning tests have this flaw, this susceptibility to just retrieving data, how should we actually measure real AI

03:42

discovery going forward? What's a better way? We must prioritize provable novelty over mere retrieval. Simple as that. Right. Verifiable newness. Exactly. Now, moving from those big claims to something more practical. Segment two, essential tools for builders. Yeah, this is important because the sheer volume of new AI tools, new features, it's overwhelming. Totally. Causes real analysis paralysis for people trying to actually use this stuff professionally. You need

04:09

a system to cut through it. Our notes mentioned a specific builder's framework. What's the core idea there? The main thing is focusing on time to value. How quickly can this tool actually help you solve a real problem you have? Rather than just chasing the highest score on some benchmark. Precisely. Evaluating tech based on your needs, not just the market hype. It's a solid system. Okay. And for people using, say, ChatGPT every day, there's a feature that gets missed. Yeah,

04:36

the Projects feature. Often overlooked, but it lets you create separate, dedicated contexts for different tasks or workflows. Ah, so it stops the chatbot forgetting what you were talking about 10 minutes ago. Exactly. It gives it much better memory within that specific project. Yeah. For anything complex, multi-step, that continuous memory is a huge productivity win. Yeah, that continuity is massive. Honestly, I still wrestle with prompt drift and context windows myself

05:04

sometimes. Oh, me too. It happens. Trying to keep a consistent style or persona across a bunch of outputs, it can be tricky. For sure. But then when you need those really high-quality, reliable results, like agency-level stuff, but without the agency price tag. The sources pointed towards Google AI Studio. Absolutely. AI Studio gives you much deeper controls than your basic chatbot interface. Like what kind of controls? Well, the material details five specific professional

05:28

methods. One example is demanding output in really structured formats like JSON. Okay. Or forcing it to generate, say, a detailed negotiation brief with specific risk parameters clearly defined. Gotcha. So it's about making the output reliable and usable for serious work. Exactly. For when critical decisions are involved, or you need that consistency for creative or analytical tasks.
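
A minimal sketch of what "demanding output in structured formats like JSON" can look like in practice. It assumes a hypothetical negotiation-brief schema: the field names (`objective`, `walk_away_price`, `risks`) and the helpers `build_prompt` and `validate_brief` are illustrative only, and none of them come from the episode or from AI Studio itself. The idea is simply to name the required keys in the prompt and then check the reply before anything downstream relies on it.

```python
import json

# Hypothetical schema for a negotiation brief. These field names are
# assumptions for illustration, not taken from the episode or any tool.
REQUIRED_FIELDS = {
    "objective": str,
    "walk_away_price": (int, float),
    "risks": list,
}


def build_prompt(task: str) -> str:
    """Append an instruction asking for one JSON object with the required keys."""
    keys = ", ".join(REQUIRED_FIELDS)
    return (
        f"{task}\n"
        f"Respond with a single JSON object containing exactly these keys: {keys}."
    )


def validate_brief(raw: str) -> dict:
    """Parse the model's reply and check that every required key has the right type."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in REQUIRED_FIELDS.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"wrong type for key: {key}")
    return data
```

Rejecting a malformed reply and asking again is usually cheaper than letting a half-structured answer flow into a decision.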

05:52

Okay, so beyond just better memory, what would you say is the single biggest productivity gain for a professional using these more specialized AI features, like in AI Studio? Deep controls allow for agency-quality work without high cost. Right, that control is key. Definitely. Okay, let's shift gears again. Segment three, the economics, the talent side. Because things are getting intense. Intense is one word for it. The talent wars are real. We're seeing reports of AI startups in

06:20

SF leasing actual luxury apartments. And offering $1,000 rent stipends just to lure top engineers away from the Googles and OpenAIs. Whoa. I mean, imagine scaling that kind of competition across the whole industry. Billions being thrown at talent. It just shows the insane value placed on people who can actually push the research forward, you know, make fundamental breakthroughs. And if you're someone listening who wants to

06:43

get into that hot market. Yeah. The sources actually shared a super practical guide, how to get a job at an AI company. And this wasn't just random advice, right? No. It came straight from Jure Leskovec, Stanford professor, founder, and he's actively hiring right now. So real insider stuff. Very useful. What else is happening? Quick headlines. Okay, rapid fire. OpenAI hired a black hole physicist. Oh, interesting. Google AI Studio. They just combined all their features into one single UI.

07:12

Big win for usability. Nice. Europe is deploying AI for massive water projects in its driest areas. Resource management focus. Important stuff. And OpenAI is pitching ChatGPT login features to other companies now, letting them use it for their own user authentication. Okay, the job market's clearly on fire. But let's go back to that black hole physicist joining OpenAI. What does something like that signal about where AI

07:38

research might be heading? AI is moving beyond language models into core scientific discovery. Got it. Fundamental science. Yeah. Thinking bigger. All right. This next segment, this could genuinely be a game changer for pretty much everyone listening. Segment four. Yeah. This is about the innovation killer. And the surprisingly simple fix. Okay, what's the killer? It's called typicality bias. It's the reason AI often gives you the same kind

08:04

of bland, boring answers. Or that same poem about misty mornings every time you ask for something creative. Exactly. It avoids risk. It plays it safe. Why does it do that? What's the root cause? It comes down to the training, specifically RLHF, reinforcement learning from human feedback. Okay. Basically, typicality bias means when human reviewers rate the AI's answers, they tend to prefer the safe, familiar, typical ones. Ah, so the humans themselves are kind of biased towards average?

08:30

In a way, yeah. And this trains the model. To suppress the randomness, the statistical outliers that you actually need for real creativity. Which leads to this mode collapse thing you mentioned. Right. Mode collapse is when the model just keeps defaulting to the statistical average. The most common, safest, blandest response. Kills innovation. So we've basically been prompting wrong, or at least inefficiently. Pretty much. Yeah. We've been fighting the model's tendency towards the

08:58

average. But Stanford found this fix. Zero cost. Called verbalized sampling. And the fix is almost ridiculously simple. It really is. You just add one line to your prompt. Go on. Instead of just asking for, say, five jokes, you add a statistical frame. You ask it: generate five jokes with their probabilities. That's it. Just adding "with their probabilities." That's it. You're basically telling the model, hey, think about the diversity of your possible answers. Show me the spread. Wow.
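
A minimal sketch of the prompt change just described, plus a way to read the reply. The core sentence ("Generate N ... with their probabilities") is from the episode. Everything else is an assumption for illustration: the `text | probability` line format, the helper names `verbalized_prompt`, `parse_candidates`, and `normalize`, and the idea of renormalizing the stated probabilities. No model is called here; `reply` stands in for whatever text a model returns.

```python
def verbalized_prompt(task: str, n: int = 5) -> str:
    """Build the verbalized-sampling request described in the episode.

    The trailing format instruction is an assumption added so the reply
    is easy to parse; it is not part of the technique as described.
    """
    return (
        f"Generate {n} {task} with their probabilities. "
        "Put each one on its own line as: text | probability"
    )


def parse_candidates(reply: str) -> list[tuple[str, float]]:
    """Extract (text, probability) pairs from lines shaped like 'text | 0.25'.

    Lines without a '|' or without a numeric probability are skipped.
    """
    candidates = []
    for line in reply.splitlines():
        if "|" not in line:
            continue
        text, _, prob = line.rpartition("|")
        try:
            candidates.append((text.strip(), float(prob)))
        except ValueError:
            continue
    return candidates


def normalize(candidates: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Rescale the stated probabilities so they sum to 1."""
    total = sum(p for _, p in candidates)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return [(text, p / total) for text, p in candidates]
```

With the normalized list in hand, one option is to pick a candidate at random in proportion to its stated probability (for example with `random.choices`), rather than always taking the first answer the model offers.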

09:27

Okay. And the results. The source material seemed pretty emphatic about this. Oh, the results are nuts, especially considering it costs nothing extra. Creative writing. They saw a 92% jump in diversity for poems. 92%. And for jokes. Right. 109% increase in diversity. Double the variety, basically. Yeah. And it wasn't just fluff. High-stakes stuff, too. Dialogue simulation. The AI's responses became twice as close to actual human donation behavior in a test scenario. So

09:56

more realistic, human-like variation. Exactly. And open-ended Q&A. Seven times better answer spread. And four times lower KL divergence. Right. Which is just a fancy way of saying the answers were much less statistically similar, much more varied. That's impressive. Does it work everywhere? Seems like it. Yeah. Works across different models, GPT, Claude, Gemini. And it scales with model size. GPT-4 got twice the diversity boost compared

10:20

to the smaller GPT-4 Mini. Just reframing the request statistically unlocks all this built-in variety. It's kind of mind-blowing. It really is. Okay, so does this super simple technique basically mean that the whole complex art of prompt engineering, you know, crafting these elaborate instructions, is that kind of over now? The shift is towards statistical framing for built-in creativity. Less hand-holding,

10:41

more guiding the statistics. Interesting. Okay, let's try and pull the main threads together from this deep dive. Two big takeaways for you as you navigate this constantly changing AI landscape. All right, takeaway number one. First, be skeptical of the hype. Don't just trust it. That GPT-5 retrieval mess, it shows the line between just accessing knowledge and real innovation is still super blurry. Demand proof, demand novelty. Exactly. Verifiable novelty is key. And takeaway number

11:14

two. Trust the simple, smart fixes. That verbalized sampling technique. It gives you a huge, zero-cost creativity and diversity boost just by changing how you ask. Yeah, it's about applying these simple, elegant, structural changes. Right. To get way better, less predictable results from the models you're probably already using every day. That's where the real efficiency gain is, isn't it? Getting more out of the tools you have. Totally. It's about smarter interaction, not

11:38

just bigger models. So the challenge for you listening right now is maybe to go try this today. Yeah, test it out. Think about your own work, your field. How could adding that simple phrase, generate X results with their probabilities, how could that unlock something new? Move beyond the same old answers the AI usually spits out. Right. How can it help you break out of that typicality trap, that mode collapse in your specific context? That's the path forward. Definitely

12:06

something to experiment with. And thanks again for sharing all these sources with us for the deep dive. Absolutely. We couldn't do it without you. We'll catch you next time.

Transcript source: Provided by creator in RSS feed