🎙️ EP 234: The AGI Test That Every Model Just Failed

00:00

Human beings scored 100% on their very first try at a new intelligence test. Yeah, and that's a completely wild statistic. It really is, because the smartest, most expensive AI models on Earth, they scored under 1%. Right, which is just a staggering reality check for the whole industry. We kind of, um, we think these machines are invincible. Today, we're looking at proof that they really aren't. Physics and logic always get a vote in

00:25

the end. Welcome to the Deep Dive. If you've been watching the AI space lately, you probably feel like the ground is shifting under your feet daily. Oh, absolutely. It's moving incredibly fast. You are not alone in feeling that way. Today, we are taking your sources and looking at a massive reality check on artificial general intelligence. And we're also digging into the escalating boardroom drama happening behind the

00:47

scenes, which is intense. Yeah, we'll see how platforms like Apple and Reddit are adapting. And finally, we'll look at a terrifying new wave of vibe-coded malware. That malware story, I mean, it changes the security landscape entirely. We're going to get to that. But first, let's talk about actual capabilities to, you know, ground this discussion. Yeah, we really need to start with the ARC Prize Foundation. They just released ARC-AGI-3. Right. And this is

01:13

a very highly anticipated benchmark. It is. It's designed to test true adaptability, not just what a model has memorized from the Internet. It tests how a model reacts to the completely unknown. The frontier models had zero prior training on these puzzles. None at all. We're talking about 135 brand new environments. And there are roughly 1,000 puzzle levels in total, right? Exactly. And visually, these puzzles are just simple grids. You have colored squares arranged

01:41

in patterns. So you just have to figure out the underlying logical rule to solve it. Yeah, it's highly abstract. But humans find it incredibly intuitive. Right. So when I hear that humans ace this, I assume GPT or Claude is right behind us. You would think so, but it was an absolute bloodbath. They completely fell flat. How bad was it? Well, Gemini 3.1 Pro scored 0.37%. Wow. And GPT 5.4? GPT 5.4 scored 0.26%. That is unbelievably low. Claude Opus 4.6 scored

02:13

0.25%. Right. And Grok 4.2? It scored exactly 0%. Zero. Not even a fraction of a percent. Yeah, they just hit a brick wall. The gap between human and machine adaptability is still massive, because humans solved every single environment on the first attempt. We can look at a grid and just intuitively grasp symmetry or gravity. And AI models don't possess that intuition at all. Not yet. You can actually play these public games yourself to see. You just see the pattern and

02:41

apply it. It's about spotting underlying logical rules on the fly. It's not about reciting memorized facts from a database. Exactly. That's the core difference we're talking about here. And we should probably define our terms clearly. Good idea. Let's define it. AGI: software that can learn any cognitive task humans do. Right. That is the holy grail for these companies. But right now, we're falling incredibly short. Although, I mean, we should mention the critics of this

03:06

specific test. Not everyone agrees the ARC benchmark is fair. That's true. The scoring system is notoriously demanding on the models. Because the AI must match or exceed human problem-solving speed, right? Yeah. If the model takes too long to process the logic, it scores poorly. And critics say that demands too much compute overhead. Exactly. They argue it skews the final results artificially low. But the philosophical conclusion remains incredibly profound either way. It really does.

03:38

If a future model passes this test, it will be fundamentally different. Right. True AGI likely won't just be a scaled-up version of current tech. It needs a totally new type of intelligence. Current models are brilliant at pattern matching because they've read the whole Internet. But they lack flexible reasoning. They don't know how to think laterally. Yeah, they really struggle to adapt when the rules suddenly change. So if I'm understanding this right, current AI is just

04:03

matching data points. It is like stacking Lego blocks of data. Yeah, exactly. Eventually, you need a totally different toy to build a working engine. You can't just keep adding more plastic bricks. That is a brilliant way to visualize it. You can build a massive, beautiful plastic car. But it won't actually drive down the street. Right. You fundamentally need a real combustion engine for that. Plastic bricks won't help. So is throwing more compute at these models basically

04:28

a dead end for adaptability? Well, scale definitely brings better pattern matching. You get better coding syntax and fewer hallucinations. But it doesn't magically create that spontaneous reasoning AGI requires. No, it doesn't. You eventually hit a fundamental architectural wall. Got it. Brute force alone will not build true adaptable intelligence. Precisely. And honestly, that looming

04:52

technological wall is causing sheer panic. Yeah, if pure computing power won't solve this AGI wall, that totally explains the panic in those leaked Slack messages. The CEOs know they can't just buy their way to the finish line anymore. The human scramble is intensifying. Titans are fighting to control the narrative and, well, the funding. The boardroom drama is escalating to a fever pitch. The stakes are unimaginably high. We have these leaked messages between major

05:18

players. Sam Altman apparently tried to, quote, save Anthropic. Yeah, and this was during a massive Pentagon contract clash. The messages are incredibly revealing. They show the raw, unfiltered tension behind the scenes. This isn't just friendly competition. Not at all. Altman accused Anthropic CEO Dario of actively undermining OpenAI. He claimed this sabotage had been going on for years. They're fighting over these lucrative government defense

05:46

contracts. It really highlights the intense psychological pressure these leaders are under right now. They are racing toward a wall, and they know it. And they need infinite capital to break through. Which brings us to the fundraising. It completely reflects those stakes. Look at Reflection AI. They're backed by Nvidia and they're raising $2.5 billion. Just let that number sink in. They're targeting a $25 billion valuation. That is an astronomical amount of capital for a relatively

06:13

new player. It is. And they want to compete directly with Chinese AI dominance. JPMorgan might even join the round. It's a monumental financial mobilization. It feels like a new space race entirely. Meanwhile, the political maneuvering is just as intense. Regulation is rapidly becoming the new battlefield. Whoever writes the rules essentially controls the future market. Right. And the incoming Trump administration just appointed a new tech advisory

06:39

panel focused on AI regulation. Mark Zuckerberg from Meta is on it. Larry Ellison from Oracle is there. Jensen Huang from NVIDIA is also included. It's a fascinating dynamic. You have these massive tech titans sitting at the government table, drafting the playbook. Regardless of where you sit politically, just objectively, it's a massive alignment between tech and government. Oh, absolutely. It's a major shift in power dynamics for the next decade. You know, I still wrestle with prompt

07:06

drift myself. So imagining CEOs navigating multibillion-dollar boardroom clashes is just wild. Yeah, it feels completely surreal. You're just trying to get a chatbot to format a simple email. Exactly. Meanwhile, these guys are fighting over Pentagon contracts and billions of dollars. It's almost Shakespearean. But it makes you wonder about the actual technological progress. Does all this boardroom and political maneuvering actually speed up innovation or just stall it? Massive

07:33

capital and regulation usually define the playing field. They might distract from the core science momentarily. But they ultimately dictate who gets the resources to build the future. Exactly. The science requires immense resources, massive data centers, and favorable laws. So the boardroom fights today will dictate the technology we get tomorrow. Unfortunately, yes. It doesn't exist in a vacuum. And while the executives fight over the big picture, the consumer platforms are moving

08:00

fast. They're quietly building the actual infrastructure you use every single day. They want to capture users completely before the dust settles. They're building invisible walled gardens. Yeah, they want you securely locked into their specific ecosystem. Look at Apple. Apple is making a massive, uncharacteristic move with iOS 27. They're opening up Siri. To rival AI assistants, which is a huge philosophical shift for Apple. They usually keep everything tightly closed off. Right. It uses

08:28

a brand new extensions system. Gemini, Claude, and others can plug right into the OS. It turns the iPhone into a multi-model AI hub. You aren't stuck with just one brain anymore. It's a brilliant strategic play. Apple owns the hardware in your pocket and the user interface. By opening it up, they let the models fight for your queries. Apple still wins either way. And Google is aggressively fighting for your loyalty, too. They just launched

08:55

brand new switching tools. Yeah, you can easily import your entire chat history and pull memories from ChatGPT directly into Gemini. Google is making it frictionless. Right. And basically offering to move your digital furniture for free. Exactly. They know the biggest barrier to switching is losing your data. They want to remove literally any excuse you have for staying behind. But Reddit is taking a completely different approach. They aren't trying to absorb AI. No, they are actively

09:21

fighting automated content. They're heavily testing bot labels right now. And implementing passkeys for stricter authentication. They're even testing optional World ID scans. Right. AI posts are still allowed, but they desperately want to clearly separate humans from bots. They want to slow the massive surge of fake, automated activity on the platform. It's a genuine existential threat. Think about the dead internet theory. If users can't trust who they're talking to, the platform

09:49

dies. For listeners trying to navigate all this chaos, there are good resources out there. Yeah, the Stanford course mentioned in the sources is fantastic. If AI explanations feel either too basic or way too technical, it sits perfectly right in the middle. I highly recommend it for anyone feeling overwhelmed by the constant noise. It balances clarity and depth beautifully. It

10:09

really helps cut through the hype. But looking at all these platform wars, I have to ask, will everyday users actually care enough to migrate their chat histories between these massive models? History shows that convenience and ecosystem integration always drive consumer behavior in the end. People take the path of least resistance. Make the transition effortless and the users will definitely follow. It is the absolute golden

10:31

rule of tech platforms. So we're integrating AI into all of our devices, our phones, browsers, social networks. But the foundational open source code building these systems is incredibly fragile. This is where the story takes a very dark turn. We are basically building massive skyscrapers on a foundation of sand. The security implications are genuinely terrifying. Two big stories in Silicon Valley just collided. Yeah, regarding LiteLLM. It's a massive open source project with

11:02

over 40,000 GitHub stars. Thousands of commercial forks depend heavily on it. It's a foundational building block for AI apps. And it was just hit by heavily hidden malware. The malicious code was buried deep inside a software dependency. It was discovered by an independent researcher named Callum McMahon, right? Yeah, he was just trying to install the package normally. And suddenly, his computer randomly shut down. Just went completely black. And that weird behavior led to the discovery

11:28

of the malicious code. If it hadn't crashed his machine, it might have gone completely unnoticed for months. And here is the truly alarming part. Andrej Karpathy and other top researchers weighed in on the code. Right. They believed this malware was vibe-coded. Meaning it was quickly generated using AI without deep human oversight. Whoa, imagine scaling malware creation at the speed of thought just by vibe coding. You just tell an AI what you want to exploit and it writes

11:57

it. It lowers the barrier to entry to almost zero. And the malware was incredibly aggressive. It tried to steal local login credentials immediately. It also tried to access connected developer accounts. It desperately wanted to spread into other open source packages. It acts like a digital virus looking for a new host. It's a classic supply chain issue. We should define that for clarity. Supply chain attacks: hackers hiding malware

12:21

inside trusted software updates. It's like hiring the best security guards for your bank, but the architect used a blueprint written by a robber. The guards can't protect you because the vault itself is compromised from the inside. Exactly. Now, LiteLLM developers responded very quickly. They're working directly with Mandiant. They need to fully investigate the extent of the issue. But there is a glaring detail here regarding

12:43

corporate compliance. Yeah, LiteLLM actually displayed prominent security certifications on their site. They featured SOC 2 compliance and had ISO 27001 certifications. And these were issued by a prominent AI compliance startup named Delve. Delve has faced intense criticism recently. People question how reliable its automated certification process really is. The company vehemently denies these claims, but it highlights a crucial nuance about security theater. Certifications show good

13:13

organizational practices. They cannot guarantee actual protection against sophisticated supply chain attacks. They only check high-level policies. They don't check every single line of code in a buried third-party dependency. There is a massive blind spot. So if the attackers use AI to generate these hidden exploits, can automated compliance systems ever keep up? Right now, offensive AI is moving much faster than defensive

13:37

certification checklists. The attackers definitely have the advantage. Meaning our current defense systems are fundamentally outmatched by the threat. Yes, we are relying on slow, static defense in a hyper-dynamic war. It's not sustainable. We are going to pause right here for a brief sponsor break. And we are back. Let's synthesize this incredible journey we've been on today. We have covered a tremendous amount of ground. And it all connects in a rather unsettling

14:06

way. We are chasing an elusive, highly adaptable new intelligence. That is exactly what the ARC-AGI-3 benchmark showed us. True AGI remains fundamentally out of reach for our current brute-force architectures. But to get there, the Titans are fighting fiercely for absolute control and massive funding. We see this with the OpenAI drama, Anthropic, and the new Trump tech panel shaping the future. While the billionaires fight,

14:33

consumer platforms are making their moves. Apple and Google are locking you into their ecosystems. They are turning your everyday devices into these massive multi-model AI hubs. And meanwhile, the very foundation of all this technology is crumbling. Open source tools are under active attack. They are under attack from AI-generated code itself. Vibe-coded malware is bypassing enterprise security. It is a highly volatile mix of relentless ambition, massive capital,

14:59

and extreme technical vulnerability. It really forces you to deeply consider the unseen risks of moving this fast. If current AI can already vibe code malware that bypasses enterprise security, what happens if the first true AGI doesn't announce itself with a high benchmark score, but simply adapts silently inside an open source library? That is a profoundly chilling thought. It might

15:23

just hide quietly in the noise, waiting. It is something to seriously ponder as we build this future. Thank you for joining us on this deep dive. We will be back with more of your sources soon. Until then, keep questioning the systems around you.

Transcript source: Provided by creator in RSS feed
