🎙️ EP 183: This Robot “Imagines” the Future Before Moving

00:00

Imagine this. There's a robot, NEO, and it has one simple task: pull a tissue from a box. Okay. But before it moves, it just stops. And inside its own process, it generates eight short five-second video clips. Some of those videos show it working perfectly. In others, it fails. The tissue rips. Maybe the box falls over. It sees it all first. It's incredible. It's not just running a program. That robot is using what you could call synthetic imagination. Synthetic imagination.

00:30

Yeah, it's mapping out possible futures to guide what it does in the real world, which is messy and unpredictable. It's a huge shift. Welcome back to the Deep Dive. That idea, that hook, it really captures the two big themes we're looking at today. It really does. Okay, so let's unpack this. In this Deep Dive, we're focusing on two major things: how physical AI is learning to imagine, and then the surprising realities of how AI is being adopted around the world. That's

00:57

the mission. We've synthesized the key insights from the sources you shared with us. We'll start with that breakthrough from 1X Robotics. Then we're going to do a rapid-fire look at the current market. Some big deals, some friction, and a few clever research tricks. And then the global report. And then, yeah, a global report that really challenges how we think about where the U.S. actually stands in all of this. There's a lot of crucial ground to cover. Let's jump right in with how these

01:21

robots are starting to teach themselves. So the 1X world model for these NEO robots, it's a really critical inflection point. This is the step that allows them to learn new physical tasks. Without a person having to code every

01:36

single movement. Exactly. Without that intense, you know, bespoke human coding for every single action. It's a move away from the old way of doing things. And the how is what's so important here. It's not the traditional approach where you calculate exact joint angles and all that. No, it's much more visual, much more predictive. The system takes a simple text prompt, like your "pull a tissue" example, and it takes the robot's current camera view, its context. It feeds both of those into

02:04

this world model, which basically acts like an internal simulator. And that's what generates those little five-second imagined videos of what might happen? Yep. They use a concept for this, right? Yeah. Something like video diffusion for motion planning. That's the technical term, and we should probably define that. It just means the robot plans its moves by trying to generate a successful visual outcome first. It's all based

02:29

on pixels, not just lines of code. Which is so much better for dealing with new situations. Fundamentally better. Because in the old way, if the box was tilted or the light was a bit dim, the whole thing could fail if you didn't code for it. Right. With this, the model just imagines what success looks like, the successful pixels, and works backward from there. So it runs, what, eight different imagined futures? Yeah. And then it picks the best one. Correct.

02:55

An internal critic looks at those eight rollouts, and it selects the one that seems most likely to succeed. Only then does a second model translate that chosen video into the actual joint commands. And the results really back this up. The sources were clear that for that tissue-pulling task, the success rate jumped from 30% to 45%. Just by sampling. Just by sampling and choosing the best imagined future before it even moves. Yeah. That's a huge improvement in reliability. It

03:22

is. And what's really exciting is that it works for tasks it's seen before, what they call in-distribution, but also for totally new tasks. It can generalize. OK, but here's the reality check. Speed. Ah, yes. The latency. The sources highlight a big limitation. It takes 11 seconds for that whole imagination and planning phase. Plus one second for the action itself. So 12 seconds total. Which is a long time. 11 seconds is definitely the defining challenge for this

03:48

to be practical, you know, in real time. It shows the AI race has moved beyond just virtual agents like ChatGPT. Now it's about physical agents, and 1X is shipping soon. This tech is the key, but that latency has to come down. So that raises the question, how does this 11-second planning latency affect its real-world usefulness right now? It's too slow for instantaneous work, but it definitively proves the concept. Using visual imagination for physical planning works.
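Editor's note: the sample-then-select loop described in this segment can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not 1X's implementation: `imagine_rollouts`, `critic_score`, `select_best`, and `to_joint_commands` are hypothetical placeholders, and a random number stands in for each generated video clip. Only the rollout count (eight) and the timings (11 seconds of planning plus 1 second of action) come from the episode.

```python
import random
from dataclasses import dataclass

NUM_ROLLOUTS = 8          # from the episode: eight imagined clips per task
PLANNING_SECONDS = 11.0   # from the episode: imagination and planning phase
ACTION_SECONDS = 1.0      # from the episode: executing the chosen action


@dataclass
class Rollout:
    """Stand-in for one imagined five-second clip (hypothetical structure)."""
    index: int
    predicted_success: float  # placeholder for what a critic would see in the clip


def imagine_rollouts(prompt, camera_view, n=NUM_ROLLOUTS, seed=0):
    """Hypothetical world model: returns n imagined futures.

    A real system would generate video conditioned on the prompt and the
    camera view; here both are ignored and a seeded random number is used.
    """
    rng = random.Random(seed)
    return [Rollout(index=i, predicted_success=rng.random()) for i in range(n)]


def critic_score(rollout):
    """Hypothetical internal critic: higher means more likely to succeed."""
    return rollout.predicted_success


def select_best(rollouts):
    """Pick the rollout the critic rates highest (best-of-N selection)."""
    return max(rollouts, key=critic_score)


def to_joint_commands(rollout):
    """Hypothetical second model: turns the chosen clip into motor commands."""
    return [f"command_from_rollout_{rollout.index}"]


rollouts = imagine_rollouts("pull a tissue from the box", camera_view=None)
best = select_best(rollouts)
commands = to_joint_commands(best)

# Latency budget quoted in the episode: planning plus action.
total_seconds = PLANNING_SECONDS + ACTION_SECONDS
print(f"chose rollout {best.index}; total time per action: {total_seconds} s")
```

The point of the sketch is the ordering: nothing is sent to the motors until all candidate futures have been generated and scored, which is also why the planning phase dominates the 12-second total.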

04:17

It's a proof of concept. Okay, so while 1X is figuring that out, let's pivot to the absolute chaos of the immediate market. This is where the pace is just relentless. It's a flurry of news for sure. Let's start with a kind of wild story. The Shopify CEO went viral for using Claude to build a custom tool to analyze his own X-ray. Yeah, I saw that. The ability to use an LLM, not just for text, but to code a specific tool for yourself in an afternoon. Right. That speed

04:46

is remarkable. It really blurs the line between a developer tool and just a consumer app. I mean, it still feels kind of wild that people are doing their own medical reads, even if they're technically capable. And moving over to the media side, Google just upgraded Veo 3.1, their video model. The quality enhancements are huge. We're talking full support for vertical, you know, 9:16 videos, 4K upscaling, and the clips just feel more dynamic, more natural, even with really short prompts.

05:15

The gap between just typing something and getting a high-quality video back is shrinking almost every week. It's going to have a huge impact. And speaking of unexpected things, this was a real aha moment from the sources. Google researchers found what the source called a dumb trick. A dumb trick, as in simple but incredibly effective. Exactly. And this trick boosted accuracy by 76% on certain tasks. And it worked across the board. Gemini, GPT, Claude, DeepSeek, all of

05:45

them. So what was the trick? It wasn't some complex new architecture. It was just specific phrasing in the prompt itself. To stop the model from forgetting things. Precisely. They found that phrasing it like, "Let's generate four detailed options first and then evaluate them one by one before you give me a final answer," made a massive difference. You're forcing it to show its work. It's like metacognition for an LLM. It is. And that just shows why prompt engineering is still

06:11

so hard. You know, I still wrestle with prompt drift myself. What do you mean by that? It's when you make a tiny change to the wording and suddenly the quality of the output just plummets. So to find that one simple instruction can give you such a huge cross-model return, it's humbling. Right. Now, shifting to market power, the huge deal between Apple and Google. The Gemini deal. Yeah. The reporting says it's around a billion dollars a year for Gemini to

06:35

power Siri and other Apple AI features. That is just massive. We're seeing this profound centralization of capability. You have two of the biggest tech companies in the world combining their AI reach. Which, of course, drew immediate criticism. Elon Musk was very vocal, saying it represents an unreasonable concentration of power. Well, and that concentration of power is creating friction on the ground. Literally, we're seeing physical protests against AI infrastructure. Microsoft.

07:02

Yeah, Microsoft unveiled these new community-first data center plans. They're promising things like no local electricity bill hikes, more jobs, all of that. But the sources say that locals in 24 different states are still actively protesting the rollouts. The energy and land use are just too visible. And the promise of jobs tomorrow doesn't always outweigh the noise and environmental impact of a huge data center today. Yeah, that's

07:27

a key conflict. And finally, on hardware, while NVIDIA is king, challengers are getting serious funding. A startup called Etched just raised half a billion dollars. 500 million. To take on NVIDIA directly. They're betting on specialized chips optimized just for large language models. Which tells you the market sees the GPU bottleneck as a real vulnerability. That specialized hardware might not beat NVIDIA on everything, but for certain tasks, it could be way more efficient.

07:54

So does that massive Apple and Gemini deal pretty much confirm the market is moving toward a centralized, maybe two- or three-player landscape? Yes. The sheer value of that deal suggests strong centralization, despite understandable criticism about concentrated power. [Mid-roll sponsor read insert here.] OK, we are back. And now we're going to transition from that fast-paced, high-value U.S. market to the broader world stage. We're focusing on this Microsoft AI Economy Institute

08:21

report. And the data on global adoption is genuinely surprising. It really is. This is where we see the difference between who's developing the models versus who is actually adopting them. The sources show that despite all the innovation here, the U.S. ranks only 24th in actual AI adoption. 24th. 24th. The adoption rate here is reported at 24%. Now, compare that to, say, the UAE. They're

08:45

at 64%. That is an enormous gap. It suggests that, you know, a national strategy and top-down prioritization really matter, especially in smaller economies. And globally, the average is just over 16%. But developed countries are adopting at nearly double the rate of developing ones. But the report highlights an even more critical story, a counter-narrative to the big tech dominance. Because while everyone is focused on GPT, Gemini, and Claude, the model that's really succeeding

09:13

in underserved markets is DeepSeek. Right, especially across Africa and Southeast Asia. Why them? They are running the classic Android playbook. It's a strategy focused on being everywhere. Be free, be open source. Be free, be open source, be low cost, and be good enough. That's the formula for high-volume adoption. And it is working. DeepSeek is seeing two to four times higher usage than the big proprietary models in some African markets. And it's not just about the model. It's

09:41

an infrastructure play. They're using key partnerships, especially with Huawei, to roll out the necessary hardware and cloud support. So open source democratizes access. Exactly. The world doesn't always need the most expensive, leading-edge model if 90% of people can't afford it. It needs tools that are functional and accessible right now. And this confirms a major point from the sources. AI adoption in 2026 isn't just a tech stat anymore.

10:08

It's a development stat. It is. If you ignore the next billion users in emerging economies, someone else is going to capture that market. Whoa. I mean, imagine scaling that kind of open source service to a billion queries. That's a colossal opportunity, and it's being realized through open access. It's a totally different way of thinking about the market. So this raises one last important question for this segment.

10:31

What's the long-term implication of these open source models leading adoption in emerging markets? Open source accessibility is closing the adoption gap, proving that AI tools must work for everyone to achieve true global reach. So to synthesize the big ideas for you, the learner, let's connect these two threads. On one hand, you have robotics, which is making this fundamental shift. Right. Physical agents moving from rigid code to what

10:58

we're calling synthetic imagination. And that visual planning sets the stage for robots that can actually teach themselves. It's a move from pure reaction to real internal deliberation, even if it's a slow 11-second deliberation right now. And at the exact same time, the global landscape is full of friction. You have protests against infrastructure. You have this intense market centralization with the Apple and Gemini deal. And also incredible opportunity. Right. And that

11:24

opportunity is being seized by accessible open source models like DeepSeek. They're the ones driving true adoption in emerging economies. It's a fascinating race, absolute performance at the very top versus sheer accessibility for everyone else. And that 1X NEO technology, that really feels like the fundamental marker of progress

11:41

here. It is. The shift from an AI predicting what should happen based on code to an AI that actively imagines and then selects the best path from many possible futures, that changes everything about how a robot can learn. They're modeling reality before they touch it. It's the difference between simulating a single line and generating a whole field of possibilities. That's intelligence. That really is the next frontier. So we have to leave you with a final thought to mull over.

12:08

What new autonomous capabilities will be unlocked when that AI imagination latency, that critical 11-second planning time, drops down to just one second? When a robot's imagination becomes instantaneous, physical AI changes everything. Until the next deep dive, keep learning. [Outro music.]

Transcript source: Provided by creator in RSS feed