Multimodal AI: How Machines Are Learning to See, Hear, and Reason

Speaker 1

00:01

Welcome to the Sentient Code, where intelligence is engineered, autonomy is emerging, and the line between human and machine grows thinner. Each episode, we decode the algorithms, explore the robotics, and examine the ideas shaping the future of artificial minds.

Speaker 2

00:23

I want to start today by asking you to do something that feels incredibly simple, almost, you know, trivial, but it's actually a miracle of biology. Right now, just pause for a second and notice exactly what you're doing. You're listening to my voice, obviously, but maybe you're also driving, so your eyes are scanning the road, watching for brake lights. You feel the texture of the steering wheel under your hands. Maybe you're drinking coffee and you can smell the roast.

Speaker 3

00:47

It's the sensory soup. We're swimming in it.

Speaker 2

00:49

Exactly, it's a soup. But here's the thing, and I really want you to catch this. You aren't toggling between these senses like you're switching apps on a phone. No, you don't stop hearing to start seeing. You don't pause your sense of smell to process the texture of the wheel. Your brain is this incredible fluid mixing board. It takes audio, visual, tactile, and textual inputs and weaves them into this single, seamless narrative we call reality.

Speaker 3

01:18

And it's completely effortless for us. I mean, it is the defining feature of biological consciousness. So we don't really think about modalities, do we? We just think about the world.

Speaker 2

01:29

But, and this is the big concept we're unpacking today, until very, very recently, artificial intelligence was not like that at all. In fact, it was the exact opposite.

Speaker 3

01:38

Oh, it was completely fragmented. You look at the history of AI, really from the nineteen fifties up until, well, the early twenty twenties, we were building a fractured mind. We had what we call the island problem.

Speaker 2

01:49

The island problem. I like that image. Paint the picture for us.

Speaker 3

01:52

Okay, so picture an archipelago. On one island, you have these brilliant computer vision systems. They were specialists. They could look at a photo of a cat and tell you that's a tabby with, like, ninety-nine percent accuracy. Superhuman vision in some respects. Right. But if you showed that same system a handwritten note that said "this is a cat," it was blind. It couldn't read. It had no concept of what letters were.

Speaker 2

02:16

Okay, so that's island one: the eye that cannot read.

Speaker 3

02:19

Exactly. Then on the next island over you have the text bots, the ancestors of, you know, ChatGPT and the like. They could write you a sonnet about a cat. They could define the biology of a feline. They could translate "cat" into fifty languages. But if you showed them a picture of a kitten, nothing, just static. They were effectively brains in a jar that only knew the world through symbols.

Speaker 2

02:43

So you have the eye that cannot read and the brain that cannot see.

Speaker 3

02:47

Precisely. And the worst part, they were built by different people. The computer vision engineers didn't hang out with the natural language processing engineers.

Speaker 2

02:55

They were in different departments.

Speaker 3

02:56

They used different math, they used different architectures. They were effectively different species of intelligence.

Speaker 2

03:02

So for fifty, sixty years we were building these savants. One savant could see perfect pixels, one savant could parse perfect grammar. But they couldn't have a conversation.

Speaker 3

03:12

They couldn't even acknowledge each other's existence.

Speaker 2

03:14

And today, because the reason we're doing this show is that something fundamental changed.

Speaker 3

03:19

Today, the bridges have been built. The water between the islands is gone. We are witnessing the rise of multimodal AI, and I want to be really clear to everyone listening: this isn't just a feature update. This isn't just "now your chatbot has a camera icon."

Speaker 2

03:34

It feels much, much bigger than that.

Speaker 3

03:36

It is fundamental. We are moving from the era of the specialist to the era of the generalist. We are giving machines the ability to integrate senses in a way that well, it mimics that human sensory soup we started with.

Speaker 2

03:50

That's our mission for this discussion. We've pulled together a stack of research, technical papers, and industry analysis to figure out how this happened, because it seemed like for decades we were stuck, and then in the last few years everything just collided, right?

Speaker 3

04:04

We're going to look at the architecture, the actual aha moment that let machines see and read at the same time. We'll look at the superpowers this unlocks, like reading X-rays while reading patient notes.

Speaker 2

04:16

And we absolutely have to talk about the limitations, because the research shows that although these machines can see, they hallucinate in brand new ways.

Speaker 3

04:23

They do, and we need to ask the big philosophical question: if a machine looks at a photo of a funeral and writes a poem that makes you cry, does it actually understand grief, or is it just really, really good at math?

Speaker 2

04:36

Let's get into the mechanics then. Section one: how did we get here? Because I remember reading about AI in, say, twenty fifteen, and it was all about these specialized tools. You had one tool for chess, one tool for translating French. When did the walls come down? Was there, like, a single invention?

Speaker 3

04:54

To understand the solution, you really have to understand why the walls were there in the first place. And it all boils down to the architecture, the literal shape of the neural networks.

Speaker 2

05:03

Okay, break that down for us. You know, don't go too heavy on the jargon, but give us the reality. Why couldn't the vision bot talk to the text bot?

Speaker 3

05:11

Okay, so for a long time, the king of computer vision was something called a CNN, a convolutional neural network.

Speaker 2

05:17

We've touched on these before. These are the ones that scan an image like a grid, right, looking for edges and shapes.

Speaker 3

05:23

Right. Imagine a sliding window moving over a picture. It looks at a tiny patch of pixels, say a three-by-three square, and asks: is there an edge here? Is there a curve? Is there a color gradient? It builds up from lines to shapes, to ears, to eventually a cat. It is designed mathematically to process grids of spatial data. It understands space.
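
To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not code from any real vision system: the toy image, the function name slide_window, and the hand-written kernel values are all made up for the example, whereas a trained CNN learns its kernels from data.

```python
# Toy 6x6 grayscale image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

# A 3x3 kernel that responds to left-to-right brightness changes,
# i.e. a vertical edge. These values are illustrative, not learned.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]


def slide_window(image, kernel):
    """Slide the kernel over every 3x3 patch and record its response."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    response = []
    for r in range(rows):
        row = []
        for c in range(cols):
            total = 0
            # Multiply each pixel in the patch by the matching kernel weight.
            for i in range(k):
                for j in range(k):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        response.append(row)
    return response


edge_map = slide_window(image, kernel)
for row in edge_map:
    print(row)
```

The response is zero where the patch is uniformly dark or uniformly bright and large where the window straddles the dark-to-bright boundary, which is the "is there an edge here?" question in numeric form.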

Speaker 2

05:42

Okay, so that's the eye. It deals in grids. It thinks in grids.

Speaker 3

05:45

Exactly. But text, text isn't a grid. Text is a stream. It's a sequence: the, quick, brown, fox. The order matters immensely. Of course, you can't just look at "brown" without knowing "quick" came before it. So for that we used RNNs, recurrent neural networks. These were designed to remember the past. They process the word "fox" while trying to hold on to the memory of "the."

Speaker 2

06:09

So you have one kind of math designed for grids, which is space, and a completely different kind of math designed for streams, which is time.

Speaker 3

06:17

You've got it. And you couldn't just plug one into the other. They spoke different languages. It was like trying to put a VHS tape into a toaster. The inputs just didn't match the machinery.

Speaker 2

06:25

So what changed? I know the answer involves transformers, because that seems to be the answer to everything in AI lately. But why? What did the transformer do that the others couldn't?

Speaker 3

06:33

Well, the date was twenty seventeen. The paper was "Attention Is All You Need." We talk about it all the time on this show. But the hidden revolution in that paper wasn't just that it was better at language. It was that the transformer was a universal substrate.

Speaker 2

06:48

Universal substrate. Yeah, that sounds impressive, but what does it actually mean in practice?

Speaker 3

06:53

It means it's a structure that can process any kind of information as long as you can turn that information into a sequence.

Speaker 2

06:59

So text is obviously a sequence, word after word after word. That fits, right?

Speaker 3

07:03

In AI terms, we call those tokens. But then the researchers had this, this real aha moment. They realized, wait a minute, we can treat an image as a sequence too.

Speaker 2

07:13

Hold on, how do you turn a picture into a sequence? A picture is a flat 2D object. It doesn't have a start and end like a sentence does.

Speaker 3

07:22

That was the stroke of genius. The researchers asked, what if we forced it to be a sequence?

Speaker 2

07:26

Forced it? How?

Speaker 3

07:28

Imagine taking a photo of a dog. Now imagine taking a pair of scissors and cutting it up into a grid of little squares. Let's say sixteen by sixteen pixel squares. You have a pile of these tiny patches.

Speaker 2

07:39

Okay, I'm with you.

Speaker 3

07:40

Now you just line them up: square one, square two, square three, from top left to bottom right.

Speaker 2

07:45

You flatten the grid into a line.

Speaker 3

07:47

Exactly. They turned the image into a sentence of visual words. They call them patches. And once you did that, once you turned the image into a sequence of patches, the transformer looked at it and said, I know what to do with this.

Speaker 2

07:58

Because to the transformer, a patch of pixels is just another token, the same way a word is a token.

Speaker 3

08:03

Precisely. That is the "everything is a token" realization. And it didn't stop at images. Audio? That's just a sequence of spectrogram slices. Video? That's just a sequence of frames in temporal order. Even code or molecules.
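
The cut-and-line-up step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the function name image_to_patch_sequence is hypothetical, the image is a small grid of integers rather than real pixels, and the learned projection and position information that real vision transformers attach to each patch are left out.

```python
def image_to_patch_sequence(image, patch_size):
    """Cut a 2D grid into square patches and flatten each one,
    reading the patches from top left to bottom right."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    sequence = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch row by row into a single list of values.
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            sequence.append(patch)
    return sequence


# A toy 4x4 "image" whose pixel values are simply 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patch_sequence(image, patch_size=2)
for index, token in enumerate(tokens):
    print(index, token)
```

Each entry of tokens plays the role of one visual word. With sixteen-by-sixteen patches on a real photo the idea is the same, only the lists are longer.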

Speaker 2

08:17

So the machine stops seeing image versus text versus audio, and just starts seeing data stream versus data stream.

Speaker 3

08:24

Correct. It was like discovering that French, Mandarin and mathematics are all actually dialects of the same underlying language. Once they realized that the transformer could handle all of these as sequences, the barrier between the senses just it evaporated.

Speaker 2

08:39

That is wild. So the architecture was the lock, and this idea of tokenization was the key that fit everything.

Speaker 3

08:45

That's a beautiful way to put it. And once that architectural problem was solved, the floodgates opened. We moved into this phase of connecting the dots, of teaching these different senses to talk to each other.

Speaker 2

08:56

Okay, I get the architecture. That makes sense. We can now feed everything into the same kind of machine. But I'm still stuck on the understanding part. Just because I feed a picture of a dog and the word "dog" into the same machine, how does the machine know they refer to the same thing? Surely it's not just looking it up in a dictionary.

Speaker 3

09:14

No, no, it's not a lookup table at all. It's geometry.

Speaker 2

09:17

Geometry. You're gonna have to explain that one. How does a picture of a dog become geometry?

Speaker 3

09:21

This is where we have to talk about vectors and high dimensional space, and to do that we have to talk about how these things are actually trained. The most famous example is a model called CLIP, from OpenAI.

Speaker 2

09:34

I've seen that mentioned. Contrastive Language-Image Pre-training.

Speaker 3

09:37

It's a mouthful, but the concept is really elegant. Imagine you have a massive bucket of data, and I'm talking four hundred million images scraped from the Internet and the text captions that came with them.

Speaker 2

09:50

So like IMG zero zero one dot jpeg and the alt text that says a golden retriever catching a frisbee on the beach.

Speaker 3

09:58

Right. Now, you start with a blank brain. It knows nothing. You show it the image and you show it the text. Initially, the machine thinks these are totally unrelated things. It turns the image into a set of numbers. We call that a vector. And it turns the text into another set of numbers. And those numbers are, in this mathematical space, nowhere near each other. They're strangers on the map. Total strangers. But then you apply something called contrastive loss.

10:22

This is the training mechanism. You essentially punish the machine. You say, hey, these two sets of numbers, they belong together. Pull them closer.

Speaker 2

10:31

You're forcing them to be neighbors.

Speaker 3

10:33

Exactly. And simultaneously you show it the text "a golden retriever catching a frisbee" and a picture of a toaster, and you say, push these apart. These are not the same. These live on opposite sides of the universe.

Speaker 2

10:43

So it's this constant game of hot and cold, pushing and pulling.

Speaker 3

10:47

Done billions and billions of times, over and over, and eventually the machine builds a map. We call it a high dimensional vector space. Imagine a graph, but instead of two or three axes, it has thousands. In this map, the coordinates for the visual pattern of fur, floppy ears, and tail end up located at the exact same coordinates as the linguistic pattern for the word "dog."
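
The pull-together, push-apart signal described above can be sketched as a symmetric contrastive loss. This is a minimal plain-Python illustration, not CLIP's actual code: the function names, the two-dimensional toy vectors, and the fixed temperature of 0.1 are all assumptions made for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]


def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Image i and caption i are a matching pair; every other
    combination in the batch is treated as a mismatch."""
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    n = len(images)
    # sims[i][j]: scaled cosine similarity of image i and caption j.
    sims = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Image to text: row i should peak at column i.
    i2t = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    # Text to image: column j should peak at row j.
    t2i = sum(cross_entropy_row([sims[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return (i2t + t2i) / 2


# Matching pairs already point the same way: the loss is small.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
# Each image sits next to the wrong caption: the loss is large.
swapped = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < swapped)
```

Training nudges the encoders in whatever direction lowers this number, which is the game of hot and cold from the conversation expressed as arithmetic.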

Speaker 2

11:11

Wow. So it's not using a dictionary. It's not looking up dog equals animal. It's mapping the concept of dogness to a specific location in this massive, invisible space.

Speaker 3

11:22

Yes, and this is why it feels like it understands because that space has geometry. It has a kind of logic.

Speaker 2

11:27

Okay, give me an example of that logic, because logic implies it can do reasoning, not just matching.

Speaker 3

11:32

Okay, think about the classic relationship between king and queen in text. If you take the math vector for the word king, subtract the vector for man, and then add the vector for woman, you land almost perfectly on the vector for queen.

Speaker 2

11:44

Right. That's the famous example: king minus man plus woman equals queen. It's like vector arithmetic.
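
The king and queen arithmetic can be tried on a toy scale. The vectors below are hand-picked assumptions chosen so the analogy works (one axis for royalty, one for gender); they are not learned embeddings from any real model, and the function names are hypothetical.

```python
import math

# Hand-picked toy vectors: axis 0 is "royalty", axis 1 is "gender".
toy_vectors = {
    "king":    [1.0, 1.0],
    "queen":   [1.0, -1.0],
    "man":     [0.0, 1.0],
    "woman":   [0.0, -1.0],
    "toaster": [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c),
    excluding the three input words themselves."""
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

In learned embedding spaces the result lands only approximately on the expected word, which is why the lookup is a nearest-neighbor search rather than an exact match.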

Speaker 3

11:50

Now do it with images. If you take the visual vector of a king, a photo of a guy in a crown, subtract the visual features that represent man, and add the visual features that represent woman, the machine generates an image of a queen.

Speaker 2

12:06

That is mind-blowing. The logic, the geometric relationship, it holds up across the senses.

Speaker 3

12:15

It does. It means the machine has found a concept layer that sits deeper than language and deeper than pixels. It has found the meaning that connects them.

Speaker 2

12:22

It's performing analogical reasoning.

Speaker 3

12:24

It is. That's how the system can look at a photo of a funeral and connect it to the text "a moment of grief." It's not because it memorized that specific photo and caption pair. It's because the visual information in the funeral photo and the concept of grief from the text live in the same emotional region of this mathematical space.

Speaker 2

12:42

It's math. The geometry of sadness.

Speaker 3

12:43

In a mathematical sense, yes. It has aligned the visual features of sadness with the linguistic features of sadness.

Speaker 2

12:49

That explains so much about why these systems feel like they get it. They aren't just matching keywords. They are navigating a map of meaning.

Speaker 3

12:56

And usually the architecture that runs this map, the sort of central brain, is a large language model. You have these specialized encoders. You can think of them as the eyes and ears that project all this information into the brain. The LLM does the reasoning in that shared space, and then it can send information back out.

Speaker 2

13:15

So the LLM is the conductor of the orchestra, making sure the strings and the woodwinds are all playing from the same sheet music. Precisely. All right, so we have the history: the silos are gone. We have the science: it's a geometry of concepts. Now I want to talk about the utility, because cool math is great, but what can this actually do?

Speaker 3

13:32

The capabilities are substantial, and I think we should start with what we can call vision language nuance, because we're not just talking about identifying objects anymore.

Speaker 2

13:40

Right. This isn't just drawing a box around a cat and saying "cat, ninety-nine percent confidence." That was, like, twenty fifteen era AI.

Speaker 3

13:47

No, no. Now it's about identifying relationships, emotional tenor. It can look at a scene and say this is a tense negotiation happening in a corporate boardroom, based on the body language, the lighting, the arrangement of people. But one of the most practical superpowers is something called OCR integration: optical character recognition.

Speaker 2

14:05

But OCR has been around since the nineties. My scanner came with it. Why is this a big deal now?

Speaker 3

14:11

Old OCR was dumb. It just scraped text off a page. It didn't know where the text was or what it meant in context. Multimodal AI reads the text in context. It can look at a street sign in a photo, read the sign, look at the cars, look at the time of day, and tell you if parking is legal right now.

Speaker 2

14:27

Or, to go back to our earlier point, it could read a handwritten note on a medical scan and understand how that note relates to the X-ray itself.

Speaker 3

14:35

Exactly, which segues perfectly into the second big capability: document understanding. This is what some people are calling the ultimate office assistant.

Speaker 2

14:45

This is the one that I think is going to change a lot of white collar work. I want you to walk me through a scenario here, because I deal with PDFs all day and they are where data goes to die.

Speaker 3

14:55

Okay, picture this. You have a fifty page annual report. It's got three columns of text, complex bar charts, photos with captions, footnotes. For old AI, that was a complete nightmare. The text would get jumbled, the chart was invisible.

Speaker 2

15:11

It was soup. You'd copy-paste it into a text file and just get absolute garbage.

Speaker 3

15:15

Right. But a multimodal system sees the document like a human does. It understands the layout. It can look at the bar chart, extract the data from the visual bars, I mean literally measuring the pixels of the bars, read the surrounding text, understand what that data means, and answer a question like, "Based on the chart on page three, which quarter had the highest revenue?"

Speaker 2

15:34

Without a human having to manually turn that chart into an Excel sheet first.

Speaker 3

15:39

Zero preprocessing. It just looks and understands.

Speaker 2

15:41

That's incredible. It basically unlocks all the information that is trapped inside images within documents. What about audio and video? You mentioned earlier that video is just a sequence of frames.

Speaker 3

15:51

Audio and video are huge frontiers now. In audio, we aren't just transcribing speech to text anymore. We are analyzing the vocal characteristics. The system can detect emotion. Is the speaker angry, nervous, sarcastically happy?

Speaker 2

16:08

It can hear the scare quotes in your voice?

Speaker 3

16:10

It can, absolutely. And it can analyze music, not just the genre, but the rhythm, the mood, the instrumentation. When you combine that with video, you get narrative understanding. It can track events over time and start to build a story.

Speaker 2

16:22

But the real magic, and the research we looked at was really emphatic about this, is the killer app of true integration. It's not just being good at video or good at text. It's the combo.

Speaker 3

16:33

It is the synthesis. That's where the real power is. Let's look at a coding scenario. Imagine you're a developer. You're stuck. You get some cryptic error message. You take a screenshot of your error message. You just paste it into the AI. The AI reads the screenshot, looks at your actual code file, consults the official software documentation online, and synthesizes an answer.

Speaker 2

16:51

So it's using its eyes and its reading comprehension at the exact same time to solve one problem.

Speaker 3

16:56

Or take medicine, the radiologist assistant idea. It looks at the CT scan, that's vision. It reads the patient's history notes, that's text. It checks the latest research papers from medical journals, more text. And it synthesizes a potential diagnosis based on all three modalities.

Speaker 2

17:12

It becomes the ultimate second opinion engine.

Speaker 3

17:15

Right. Or a final example, in design. You sketch a rough idea for an app on a napkin, you take a photo, you upload it, and you say, make this look like a sleek modern app interface, but use our official brand colors from this attached PDF. It sees the sketch, it reads your brief, it consults the PDF for the color codes, and it generates the final image.

Speaker 2

17:35

It's closing the loop between idea, instruction, and creation. It feels like we're getting closer to that Jarvis from Iron Man fantasy, the AI system that just handles things.

Speaker 3

17:44

We are getting closer. But, and this is a very, very big but, we have to talk about where it breaks, because it is not Jarvis yet, and you and I need to be clear that this isn't magic. It breaks in some surprisingly dumb ways.

Speaker 2

17:55

I do want to play the skeptic here for a minute, because it sounds perfect, but I know it's not. Where does the machine stumble? What trips it up?

Speaker 3

18:04

It stumbles in some surprisingly fundamental areas. The first one, and this is almost ironic, is spatial reasoning.

Speaker 2

18:11

Which is funny, right? Because you'd think a computer vision system would be great at space. It sees pixels.

Speaker 3

18:15

You would think. But remember, these systems are trained on flat 2D images from the Internet. They struggle to build an intuitive 3D model of the world. If you show it a picture of a table with a messy pile of objects, and you ask, is the apple behind the book or in front of it, it often gets really confused.

Speaker 2

18:34

It sees the pixels of the apple and the pixels of the book, but it doesn't get the depth, the physics of one object occluding another.

Speaker 3

18:42

It lacks a physics engine in its head. It doesn't intuitively understand that solid objects occupy space and can't pass through each other. And this is a massive problem for robotics. If you want a robot to clean your kitchen, it needs to know exactly where the cup is relative to the edge of the table. Close enough isn't good enough when you're handling fine china.

Speaker 2

19:01

That makes a ton of sense. It sees the picture, but it doesn't understand the physical reality behind the picture.

Speaker 3

19:07

Exactly. Then there is temporal understanding. We can process video as a stream of frames, but understanding causality over time is really hard for these models.

Speaker 2

19:15

Causality? You mean, like, the glass broke because the ball hit it?

Speaker 3

19:18

Exactly that. Or even following a complex argument in a lecture. If I say A happened, which led to B, but then C came along and prevented D from occurring, the AI might track all the nouns but lose the thread of the logic. That question, what happened because of what, is a surprisingly difficult cognitive task for it.

Speaker 2

19:39

It's the difference between seeing a series of snapshots and understanding a story with a plot.

Speaker 3

19:45

Precisely. And then we have to talk about the big one: hallucination.

Speaker 2

19:48

We know text models lie. They make up facts, they make up sources. Do multimodal models lie in the same way?

Speaker 3

19:54

They lie in new and excitingly creative ways. We call it multimodal hallucination.

Speaker 2

20:00

That sounds terrifying. Give me an example of what that looks like.

Speaker 3

20:03

It can be. A system might read the text in a financial report perfectly, but then look at a graph on the same page and completely hallucinate a trend that isn't there. It might say sales are trending upwards when the line is clearly going down. Whoa. Or it might describe a photo and just invent details about objects that are partially hidden.

Speaker 2

20:23

So it sees half a car behind a building and says, there is a red convertible with a dog in the back seat, even though it can't possibly see the back seat.

Speaker 3

20:33

Right, it's confabulating based on probability. It thinks, well, usually cars have things in back seats, and it just fills in the blanks with a plausible story. In a creative context, you might call that imagination. But in a medical or a legal context, that's malpractice.

Speaker 2

20:48

That's a crucial distinction. It's just guessing, and sometimes the guess is wrong, with total, unwavering confidence.

Speaker 3

20:54

And there's also the problem of compositionality. This is what I like to call the cat annoyed at the dog problem. Explain that one. Identifying cat and dog in a picture is easy, that's basic object recognition, but understanding their relationship, the cat is annoyed at the dog, requires understanding subtle cues and the interaction between the two. Current systems often struggle to bind those attributes correctly. They might see a happy dog and an angry cat and output the sentence

21:19

an angry dog and a happy cat. They mix up who owns which emotion.

Speaker 2

21:23

So they get all the ingredients right, but they get the recipe completely wrong.

Speaker 3

21:26

A perfect analogy. And finally, the deepest and most philosophical limitation: grounding, or what we can call the fire problem.

Speaker 2

21:35

This was the part of the research that really stuck with me, the idea that the AI knows fire, but it doesn't know fire.

Speaker 3

21:41

It's the gap between data and experience. The AI has seen a billion pixels of fire, it has read a trillion words about heat, burning, smoke, and danger, but it has never felt heat. It has never reflexively pulled its hand away from a hot stove. It lacks sensory consequence.

Speaker 2

21:58

But does that matter? I've never been to Mars, but I feel like I know a lot about it. I learned it all from books and pictures. If the AI tells me, don't touch the fire, it's dangerous, does it matter that it's never been burned itself?

Speaker 3

22:11

That is the big counterargument. Maybe you don't need a body to understand. But there is a very strong hypothesis in cognitive science that true intelligence requires embodiment, that you can't really think about the physical world unless you have a body that risks being hurt by it. If you don't fear the fire, do you really understand danger, or do you just know the statistical correlation between the token "danger" and the token "fire"?

Speaker 2

22:38

That's deep. We should definitely circle back to that later, but first let's look at where this is hitting the ground right now. Despite the hallucinations and the lack of a body, where are these systems actually transforming the world today?

Speaker 3

22:50

Medicine is the big one. I really can't overstate this. Medicine has always, always been multimodal.

Speaker 2

22:55

Right, you go to the doctor, they look at you, they listen to your lungs, they read your chart, they look at your lab results. It's a mix of everything.

Speaker 3

23:04

Exactly. A doctor is, at their core, an information integrator. Multimodal AI is the first tool that really matches that workflow. It acts as a high speed second opinion. It's not replacing the doctor's judgment, but it's synthesizing the data faster than any human could ever hope to.

Speaker 2

23:20

It's the ultimate intern.

Speaker 3

23:22

It's an intern that has read every single medical paper ever published. And in fields like pathology, radiology, genomics, it's starting to find patterns that humans miss. It might see a faint correlation between a genetic marker mentioned in the text of a patient's file and a specific cell shape in a microscopy image that a human would never connect because the data is just too vast.

Speaker 2

23:42

Then there is accessibility. This feels like one of the most immediate and unambiguously positive impacts.

Speaker 3

23:48

It is democratization on a massive scale. Think about what this fluidity between senses means. If you are blind, the world is opaque to visual signals. Multimodal AI can describe the visual world to you in text or audio.

Speaker 2

24:01

There's a blue car approaching on your left, or the light just turned green.

Speaker 3

24:04

Exactly, or even, you are holding the can of soup upside down. It gives you eyes. For deaf users, it can translate audio to text, but also describe the emotion in the speaker's voice. It bridges the gap between the sense you have and the information you need.

Speaker 2

24:20

It completely removes the friction of format.

Speaker 3

24:22

Domain number three: science and education. In science, we are absolutely drowning in data. We have microscopy images, protein structures, satellite data, research papers. No single human can read it all.

Speaker 2

24:37

So the AI becomes a kind of research partner.

Speaker 3

24:39

It becomes an active participant. It can read all the latest papers, look at all the new experimental slides, and say, hey, this pattern in the satellite data over the Amazon matches this obscure theory from a paper published in nineteen ninety. It connects dots that are separated by decades and disciplines.

Speaker 2

24:55

And in education, how does it play out there?

Speaker 3

24:58

Responsive tutoring. Imagine an AI that watches a student solve a math problem on a piece of paper, literally watches the pen move through the camera, and at the same time listens to them talk through their reasoning. It can pinpoint exactly where the logic broke down. It doesn't just say wrong answer. It says, you forgot to carry the one in the tens column, right here.

Speaker 2

25:18

That's the difference between a textbook and a real teacher. A teacher watches the process, not just the result.

Speaker 3

25:25

It is. And finally, of course, creative work. This is the controversial one.

Speaker 2

25:28

Text to image, text to video. We see this everywhere now. It's exploded.

Speaker 3

25:33

It has democratized visual creation in a way we've never seen. You don't need to know how to draw or paint to create a stunning image anymore. But it creates this massive tension regarding the displacement of professionals and the ethics of using copyrighted work in training data.

Speaker 2

25:47

Artists are rightfully saying, hey, you trained this model on my entire life's work without my permission, and now it's competing with me for jobs.

Speaker 3

25:56

And it's a completely valid conflict. There's no easy answer. But the potential is also there for using these tools as extensions of human vision. It's a tool that can amplify creativity, not just replace it. A director can visualize an entire storyboard in seconds. An architect can iterate on a dozen building facades instantly.

Speaker 2

26:15

I want to pivot back to that philosophical moment we touched on earlier, the grief example. So this is the one that really keeps me up at night.

Speaker 3

26:23

Let's go back to it. It's the most important question, I think.

Speaker 2

26:26

Okay, So if the machine recognizes the funeral, it correctly identifies the sadness in people's faces, and it writes a poem about loss that makes me cry, has it understood grief?

Speaker 3

26:37

This is the critical distinction between what we call behavioral performance versus experiential understanding.

Speaker 2

26:42

Behavior versus experience. Okay, break that down for me.

Speaker 3

26:46

Behaviorally, yes, absolutely, it performed the task of understanding grief perfectly. It recognized the symbols. It generated the appropriate linguistic response. It passed the Turing test for sadness with flying colors. But experientially? Experientially, no. It is a hollow shell. It processes the symbols of grief without the referent. It has the map, but it has never visited the territory. It has never lost anyone. It has never felt that hollow ache of absence in its chest.

Speaker 2

27:12

So why does this distinction matter? I mean, if the poem is good, who cares whether the poet is a sad robot or a sad human? If the output is the same, why does the internal state matter so much?

Speaker 3

27:22

For writing a poem, maybe it doesn't matter. For generating ad copy, it definitely doesn't matter. But for moral judgment, for empathy, for wisdom, that gap is critical. If we ask an AI to make decisions about elder care or legal sentencing or childcare, do we want a system that just mimics wisdom or one that actually has it?

Speaker 2

27:41

That is a chilling thought. We shouldn't mistake a high confidence output for lived experience.

Speaker 3

27:46

Exactly. A human radiologist brings years of seeing patients, of knowing the fear in their eyes when they get a bad diagnosis, of understanding the weight of that responsibility. The AI brings pattern matching on a massive data set. Those are not the same thing, even if the diagnosis it gives is correct. The AI doesn't care if the patient lives or dies. It just cares about minimizing the loss function in its training.

Speaker 2

28:09

So where does this all go next? If this is where we are now on the frontier, what's beyond the frontier?

Speaker 3

28:15

The trajectory is becoming very clear. We are moving from processing information to acting on it.

Speaker 2

28:21

The agentic shift. I keep hearing this term popping up more and more.

Speaker 3

28:24

Yes. Right now, most people interact with AI by chatting: write this for me, analyze this data. The next wave is agents that do things.

Speaker 2

28:33

So not just tell me how to book a flight, but literally book me the cheapest flight to Chicago next Tuesday.

Speaker 3

28:39

Book the flight, email my boss to let them know I'll be out, update my calendar, and order a car to take me to the airport. These are systems that will browse the web, operate software on your computer, and execute code to accomplish goals. They will have eyes to see the screen, and hands, whether virtual or robotic, to click the buttons.

Speaker 2

28:58

And robots, real physical robots in the world.

Speaker 3

29:01

That's the physical manifestation of the same idea. Robots that watch a human demonstrate a task, say folding laundry or assembling a circuit board, and then replicate it. The visual understanding guides the motor control. The senses are connected to the limbs.

Speaker 2

29:16

This raises the stakes significantly.

Speaker 3

29:18

It completely changes the risk profile. A chatbot that writes a bad poem is embarrassing. A robot that misunderstands the command "clean up the kitchen" and throws out your vital medication is dangerous.

Speaker 2

29:29

Or an autonomous software agent that misunderstands a financial instruction and executes code that deletes a critical database.

Speaker 3

29:35

Exactly. When perception leads directly to physical or digital action in the world, safety isn't just about content moderation anymore. It's about physical safety and operational security. We are giving these systems hands, so we need to be very, very sure about the brain that's guiding them.

Speaker 2

29:53

It really feels like we are standing on the edge of a profoundly different world.

Speaker 3

29:56

We are. The walls between the senses are gone. The silos are broken.

Speaker 2

30:01

So to kind of summarize our journey today: we started with the island problem, where AI was completely fragmented. We moved to the universal substrate, where transformers and tokens connected everything. We saw the magic of vector space, where meanings are mapped out geometrically. We looked at the superpowers: the medical assistant, the coder's buddy. We acknowledged the very real limitations: no body, no real spatial sense, the problem of hallucination. And we've just looked ahead at the agentic future.

Speaker 3

30:29

That is the arc. It's the collapse of separation.

Speaker 2

30:31

What's the final thought here? What should we walk away thinking about as we go about our day?

Speaker 3

30:35

I think it's this. The machines are learning to see and hear. They are developing senses. For decades, we have spent so much time worrying about whether they can think that we maybe haven't paid enough attention to what it means that they can perceive: perceive us, perceive the world. We are building a new kind of observer. It's not human, but it's not blind anymore. The question we all need to ask is, are we paying enough attention to what that means for the world we are building?

Speaker 2

31:03

That is a question I think we will be wrestling with for a very long time. Next time you look at your phone, just remember it might be looking back at you, and it's finally starting to understand what it sees. Thanks for joining us on this exploration. A pleasure, as always.

Transcript source: Provided by creator in RSS feed