🎙️ EP 101: GPT‑5 Codex Is Quietly Replacing Junior Devs (And Nobody Noticed)

00:00

You know, AI is achieving AGI capabilities right now, but maybe not where you'd expect. It's happening pretty quietly, actually, inside production code bases. We're talking about AI autonomously fixing really complex bugs, submitting finalized code changes, no direct human intervention. Welcome to the Deep Dive. This is where we take the week's key AI research and news, and while we distill it into a quick, deep analysis for you, we're really trying to jump into the progress that

00:28

often goes completely unreported. Yeah, exactly. We're looking past the sort of viral chat apps today, getting right into the, let's say, the technical engine room of autonomy. So we've structured this dive around three main things for you. First, why coding is perhaps the real epicenter for AI autonomy right now. Second, some key technical highlights, stuff like major infrastructure fixes, public performance checks, that kind of thing. And finally, we'll take a really detailed look

00:51

at Delphi-2M. That's an AI forecasting serious disease risk, maybe 20 years out. Okay, let's unpack that first part, starting with software development. The central idea from the sources seems pretty clear. Coding, you know, writing, debugging, implementing functional programs. That's the real canary for AGI. And this revolution, it's been so gradual. Maybe people kind of missed

01:15

the seismic shift happening. Well, what's really fascinating is how these capabilities got added, like in layers, often without even changing the user experience much. So the change felt subtle, you know. Back in, what, 2021, we got GitHub Copilot. That was basically smart code completion. Right. Simple tab completion. And then by 2022, things like ChatGPT were good enough to write

01:35

like short standalone scripts. Exactly. Fast forward to now, 2025, and you've got tools like Claude Code and the Codex command-line interface, the CLI, actually building pretty complex mini projects, fixing bugs, and submitting those code changes, what developers call pull requests or PRs, autonomously. Okay, when we talk about that leap, we have to define autonomous agents because this isn't just glorified autocomplete anymore, is it? Oh, not

02:00

at all. We define autonomous agents as AI that can, like, understand a task, plan out multiple steps to solve it, run the code it needs, make decisions if something goes wrong, and actually complete the objective without constant human hand-holding. It's a whole loop operating on its own, like stacking Lego blocks of data. And there's actual data backing this up. These projects tracking agents in the wild, they show the Codex web agent has merged over a million pull requests

02:27

already. One million. Yeah. And those PRs, they're getting merged at an impressive rate, like 80 plus percent. That's for real production code changes, even from agents that are basically, you know, first-time users on a code base. Yeah. Plus, we've seen huge adoption, like Claude Code. It has something like 20 times more npm downloads (npm being the Node package manager developers use) than

02:48

the Codex CLI. Honestly, I still wrestle with prompt drift myself sometimes, you know, when I'm just trying to fix simple bugs in my own little weekend project. So it's genuinely humbling to see these production systems hitting that consistent 80% merge rate. Wow. So if coding really is the AGI canary, that consistent 80% merge rate feels like a pretty loud chirp. What does that percentage really tell us about

03:11

where software autonomy is right now? I think it tells us the AI has achieved a level of reliability. We're definitely past simple suggestions here. These agents are delivering tangible, trusted, ready-to-use output. Okay, let's shift from that underlying code autonomy to some of the headlines. Model performance, public scrutiny. It often feels like the biggest perceived problems in AI turn out to be, well, pretty mundane infrastructure

03:35

stuff. That is spot on. Like if you felt Claude seemed a bit nerfed recently, you know, its capability felt kind of reduced. It wasn't some secret downgrade. Anthropic actually put out a postmortem. They explained the perceived change was just down to three overlapping infrastructure bugs. Simple as that. And they're all fixed now. So, yeah, model stability often comes down to pretty boring infrastructure work, not some fundamental drop in capability. Right. And speaking of control,

04:00

there is also that new feature for GPT-5. Users can now toggle its thinking time. It's web only for now, but it lets you choose faster answers or maybe smarter, more deliberate quality. Yeah, that's potentially huge for efficiency, letting the user decide the tradeoff. And speaking of public perception, Meta had that high-profile moment recently. During their public demo, trying to show off live tasks, the system kind of hung for about a minute. Left the audience a bit unimpressed,

04:28

apparently. But look, that was likely just real-time latency or maybe an API connection issue during data fetching. That's a really common bottleneck. Doesn't necessarily mean the model was hallucinating or broken. Still, takes guts to demo that stuff live. Definitely. And on a more creative note, there's this new tool, NanoBanana,

04:45

allows pretty sophisticated photo merging. The interesting part is it handles images with more than two people and lets you control the exact aesthetic, the pose, for everyone involved simultaneously. So thinking about that GPT-5 toggle, why is giving users control over thinking speed actually such a useful option? Well, it really lets you, the user, consciously prioritize. Speed for simple stuff or deep analytical quality for complex tasks. Tailoring it. Okay, let's pivot now to

05:15

the strategic side. Financial moves, where the money's flowing, and why so many projects seem to fail. Right, we're seeing some serious investments still. Databricks, for instance, just raised a billion dollars. One billion. And they launched an AI accelerator program. They're giving early-stage startups like $50,000 in platform credits, helping them scale compute quickly. That sounds fantastic. Pure opportunity. But then there's this other data point that kind of balances that

05:41

optimism. Research showing that, what, 95% of AI automation projects actually fail. That seems incredibly high. And the successful ones, they apparently rely on a process-first method to get a positive ROI. Whoa, just imagine Databricks scaling that accelerator funding globally. That could cause a massive, almost immediate shift in compute allocation for startups everywhere. Yeah. But yeah, that 95% failure rate is sobering. It really underscores the need for something

06:04

called context engineering. Okay, let's dig into that. For listeners who are informed but maybe don't use that term daily, what exactly is context engineering and why is it apparently the key to avoiding that 95% failure rate? Context engineering is basically about connecting the AI directly to your stuff, your company's internal data, your live workflows, your proprietary databases. You're giving the large language model the specific relevant context it needs to do its job well

06:34

for you. Not just feeding it generic web data. That's how you get consistently high-quality outputs that actually, you know, matter to the business process. So based on the research then, what's the core strategic reason the successful projects manage to avoid that huge 95% failure rate? It seems they really focus on defining the underlying business process before they even think about implementing the AI solution. Process

06:55

first. All right, let's turn now to what feels like one of the biggest scientific breakthroughs mentioned in the sources. Delphi-2M, this new AI system from European researchers, it seems genuinely set to redefine proactive health care. Oh, its capability is absolutely stunning. Delphi-2M forecasts the risk for 12,258 distinct diseases. Think diabetes, neurological disorders, heart conditions, the whole gamut, up to 20 years into the future. And it does this just using a patient's

07:24

standard electronic medical records. Nothing more exotic than that. The training data must have been immense. It was initially trained on, what, data from over 400,000 UK patients, including everything from doctor visits, hospital records, even known lifestyle choices factored in. Exactly. And critically, and this is super important, to make sure it wasn't just good for the UK population, they validated it. They tested it against 1.9 million entirely separate Danish patient records.

07:49

That validation step is crucial. It proves the model has generalizability. It proves it's not just hallucinating predictions based on some bias in the original UK data set. The potential implications for actual patient care seem profound here. Delphi2M could help doctors shift from just reacting to symptoms that already exist to actively anticipating future health risks years in advance. That could fundamentally change medicine towards really personalized prevention.
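To make the idea of validating on a separate population concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the cohorts are synthetic, and `risk_score`, `make_cohort`, and `evaluate` are hypothetical names, not anything from the Delphi-2M work. It only shows the general pattern of freezing a model and then scoring it on patients it never saw, using AUC as the measure of how well it ranks people who develop a disease above those who don't.

```python
import random


def auc(scores, labels):
    """Chance that a randomly chosen case outscores a randomly chosen non-case."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def risk_score(age, smoker):
    # Hypothetical frozen model: fixed after development, never refit.
    return 0.002 * age + 0.15 * smoker


def make_cohort(n, age_shift, seed):
    # Synthetic patients; age_shift mimics a different population mix.
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(30, 80) + age_shift
        smoker = rng.random() < 0.25
        got_disease = rng.random() < 0.002 * age + 0.15 * smoker
        rows.append((age, smoker, got_disease))
    return rows


def evaluate(cohort):
    # Score every patient with the frozen model, then compare to outcomes.
    scores = [risk_score(age, smoker) for age, smoker, _ in cohort]
    labels = [outcome for _, _, outcome in cohort]
    return auc(scores, labels)


development_auc = evaluate(make_cohort(2000, age_shift=0, seed=1))
external_auc = evaluate(make_cohort(2000, age_shift=5, seed=2))
print(f"development AUC: {development_auc:.2f}")
print(f"external AUC:    {external_auc:.2f}")
```

If the external number stays close to the development number, that is some evidence the model's ranking of risk carries over to the new population; a large drop would suggest it had latched onto quirks of the original data.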

08:14

Yeah, but it's really important to stress the caveat here. Human doctors are still absolutely necessary. They need to interpret the AI's predictions. The system analyzes risk. The physician provides the judgment, the empathy, the actual treatment plan. It's designed to augment the doctor, definitely not replace them. So going beyond the headline number of diseases or years, what's maybe the biggest operational shift Delphi-2M could bring to something like a standard annual checkup?

08:40

Well, it fundamentally redefines that checkup, doesn't it? From just a snapshot evaluation of current status to proactive long-term risk mapping for the individual. Okay, so just to quickly synthesize everything we've covered for you today. First, AI in coding. It's quietly crossing some really crucial autonomy thresholds. That 80 plus percent PR merge rate is a key example. Second, model stability issues, often tied to boring infrastructure fixes, not fundamental nerfing.

09:07

And users are getting more control, like toggling speed versus intelligence. And finally, predictive AI like Delphi-2M is really starting to redefine human health and the whole concept of prevention. So what does this all really mean for you? I think the takeaway is that the most impactful AI progress is often happening silently. It's in the background, deep inside complex systems, the infrastructure, the code bases, the hospitals. It's not just in those viral chat apps that grab

09:32

all the headlines. You really have to pay attention to the infrastructure, the less flashy stuff. Thank you so much for joining us for this deep dive today. We really hope you continue learning about these advancements. They're fundamentally reshaping tech and health right now. And maybe here's a final thought for you to chew on. Consider that China is apparently already teaching children AI principles from age six, and that the DeepSeek foundational model, a pretty capable one, reportedly

09:56

cost only $294,000 to train. Not millions, thousands. So how rapidly do you think the center of gravity for foundational AI breakthroughs might shift in, say, the next five years? Something to think about. Until next time.

Transcript source: Provided by creator in RSS feed