
Teaching Computers to See

Jan 25, 2024 · 29 min · Season 1 · Ep. 82

Episode description

Fei-Fei Li is a Stanford computer scientist and the former chief scientist of artificial intelligence/machine learning at Google Cloud. When Li entered the field of AI in the 2000s, researchers were making slow progress, optimizing algorithms to incrementally improve outcomes. Li saw that the problem wasn't the algorithm, but the size of the datasets being used. So she built a massive database of images called ImageNet. It was a huge breakthrough, and helped lead to the emergence of modern AI.

See omnystudio.com/listener for privacy information.

Transcript

Speaker 1

Pushkin.

Speaker 2

Over the past year, we've heard a lot about artificial intelligence models that are really good at manipulating language. We've heard somewhat less about AI that deals with images. It's called computer vision, and it's a huge deal, which you know, obviously. Like language, vision is this core part of the experience of being human. And on a more practical level, computer vision is key for self-driving cars, and for

drones, and for all kinds of industrial robots. As it turns out, there was this one key moment in the development of modern AI, for both vision and language, and if you understand this moment, you understand a lot about how AI works today. I'm Jacob Goldstein. This is What's Your Problem, the show where I talk to people who are trying to make technological progress. My guest today played a central role in that key moment in AI history. Her name is Fei-Fei

Speaker 1

Li.

Speaker 2

She's a Stanford computer scientist, the author of a memoir called The Worlds I See, the former chief scientist of AI and machine learning at Google, and just generally one of the most important innovators in the history of computer vision. I started our conversation with really a pretty general question. I asked Fei-Fei just to explain what computer vision is and why it's so important.

Speaker 3

So computer vision is about enabling computers and machines to have visual intelligence. What is visual intelligence? Well, the best example comes from humans, who are extremely visually intelligent animals. We can make an omelet by knowing what is in our fridge. How do we go and take the egg out? How do we take the tomato out?

How do we plan the cooking of the omelet? How do we interact with every ingredient, and how do we understand all the changes of the objects? All this is part of visual intelligence.

Speaker 2

Yeah, I mean you write in your book that vision isn't just an application of our intelligence, it is synonymous with our intelligence, which is something I want to talk more about. But before we get into human vision and how that led you into computer vision, just give me a sense of some of the applications, both the current applications of computer vision and potential future applications of computer vision.

Speaker 3

In fact, we're already using computer vision to do a lot of things. The most obvious example is all kinds of driver's assistance programs. Right, we're not even talking about self-driving cars. We're talking about lane detection, or talking about avoiding curbs, pedestrian alerts. You know, we are using computer vision in our healthcare system, in radiology, in pathology. Or, you know, in protecting species, a lot of the camera traps in the deep forests are

using computer vision to track different animals. So we're using computer vision already on a daily basis.

Speaker 2

And then when you dream of some applications that are not here yet, but that might be here in, whatever, five or ten years, what do you think of? What's at the top of the list?

Speaker 3

So when I dream of computer vision, I dream of all kinds of robotic applications, from self-driving cars to personal robots using computer vision. I dream of our biodiversity being mapped using computer vision. I dream of exploration using computer vision.

Speaker 2

Wonderful. So I want to talk about your work in computer vision, which goes back, well, decades now. And I want to start with work not on computers, actually, but on human beings, right? On understanding how humans process visual information, right, how we make sense of what we're seeing. And in the book, you write in particular about this nineteen ninety-six paper with a boring name that was a huge deal. It was called Speed of Processing in

the Human Visual System. Tell me about that paper and what it meant.

Speaker 3

It's a paper using EEG, which is recording electrical brain waves, to make a link to how fast humans can make a very complex visual decision when they see something. And the particular decision humans were to make is to separate images containing animals from images not containing animals. And if you think about it, the pool of possibilities is extremely complex. It's actually mathematically just an infinite possibility, because there are so many different types of animals, so many

different types of non-animals. That's infinite for practical purposes. And then you put them in photos, you can get infinite possibilities of photos. Yet you show them one by one to humans, they make decisions really quickly, and they make correct decisions really quickly.

Speaker 2

Not just really quickly, but like mind-bogglingly quickly. At the time, right, it was shocking just how fast it was. Right, milliseconds.

Speaker 3

Yeah. So the thing is, we kind of sort of know we're good at seeing, right, as a species. We know we open our eyes, we see the world, but we don't really know how good and how fast.

Speaker 2

And this is something we underestimate. It's a rare case where human beings underestimate ourselves, exactly.

Speaker 3

This is a rigorous scientific study that put a time, an actual time, to that speed of visual intelligence, and it's using modern techniques. It's very smart and very, very exciting.

Speaker 2

What did it mean to you when you saw that result? When you read that paper?

Speaker 3

When I read that paper, it meant north star. Let me explain what north star means.

Speaker 2

Okay?

Speaker 3

Yeah. As a scientist, I'm driven by finding answers to the most audacious questions. But as Einstein has said, in scientific inquiry, the hardest job is not finding solutions but asking the right question. Because, you know, when we talk about visual intelligence, it's such a vast topic. What is the topic to pursue? What is the question to ask that is fundamental to visual intelligence?

And how do we unlock it? When we read that Simon Thorpe paper, it convinced me that complex object categorization, the ability to classify, you know, animal versus no animal, chair versus, you know, table, hot dog versus hamburger, you know, this is fundamental to humans. It's a building block of visual intelligence. It has a neural correlate in the human brain that shows how evolutionarily optimized

it is. So with all that evidence, it convinced me object categorization is a north star to pursue.

Speaker 2

And you were a grad student at the time, right? This is sort of the thing any ambitious grad student is going to be doing. It's like, I know I'm interested in this field, but I need my question, I need my thing, right? And so now you've got your thing, yes, and it's categorization in particular. And you describe how earlier theories of how humans process visual input

were not so categorization focused, right. It was kind of like, if you just sort of thought from first principles, you would think, well, we see color and we see shapes and then we kind of make sense of it. But with this paper and related work, it shows like that's actually not it, right. And in fact our brains.

You write about how there are specific regions of the brain, like this region is just the face categorization region, and this region is the, like, places we go all the time region. And so it's a really different and interesting way of thinking about seeing, and it's fundamentally

about just incredibly quickly putting things into categories. And so you decide to take this idea of vision and categorization and try and figure out how to get computers to do this, right, how to get computers to be able to categorize objects from the world. And you start building these data sets, essentially, of labeled images, right. And you build what seems in retrospect like a relatively small one at Caltech, and then you decide to build a really big one. Right. It comes to be called

ImageNet. It's a thing you're famous for, nerd famous for. And I want to talk about building ImageNet, right. So tell me about deciding to build what becomes ImageNet.

Speaker 3

So ImageNet is the north star. To me, I was in the field long enough, because I finished my PhD, I started my own lab. I had this unwavering faith and belief that, you know, unlocking object recognition is a critical north star. And I became impatient because I

realized we were not making enough progress. I realized that, especially algorithmically, we were running in circles a little bit, optimizing very small algorithms that are not really getting to the essence of the problem. And part of the essence, which a lot of people overlooked, is actually the scale of the problem. What was really bothering me is that we were not seeing the problem. We were not seeing the mathematical problem with the scale thinking, because it's

not just about being big. It's about the mathematical reason of why we should go big, and it's a very deep reason. In general, it's a reason for what we call generalization. You have to learn enough to be able to see everything, and that mindset...

Speaker 2

You've got to see a lot of pictures of things that are cats and not cats to understand what a cat is.

Speaker 3

Right, that mindset was just not, you know... that's a big data mindset. It was not in the world at all at that time.

Speaker 2

So how did you get there? Because what you end up doing is building this just gargantuan thing full of labeled images, bigger than anybody had ever built before. Like, how did you arrive at that?

Speaker 3

That's a great question. I think that's actually the most fun but difficult part of the book to write, you know, like digging into my own brain. In hindsight, it's just little by little the insight and the realization, the epiphany. But honestly, I don't know how to analyze my own brain. I had the mathematical intuition that scale makes a difference, a bigger difference than most people give credit to. I also had the neurocognitive science inspiration that early human development was

exposure to the world in continuous ways. We don't, like, lock the baby in a dark room and show them, you know, one hundred cats. They just go out and experience. You know, that experience is actually driven by big data. Maybe I was also inspired by this Internet age coming our way, right? Like, that part, I do think it's a little bit of a moment of just being alone, and somehow

all the stars aligned in my head. I decided I'm going to try the craziest thing, and I did have a faith and belief that it was the right thing to do.

Speaker 2

And specifically, like, what was this thing that you were going to build?

Speaker 3

I'm gonna get the entire Internet of images, consisting of all the objects I can get my hands on that humans have ever taken pictures of, and catalog them in a gigantic, big database. And I will use that to do two things: to train machines to recognize the entire world of objects, and also to benchmark everybody's progress. By everybody, I mean the international community of computer vision scientists.

Speaker 2

So you will have this database and then everyone can train their computer vision models on your database and see how they do on new images.
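The benchmark setup Goldstein describes here, train on the database, then score on held-out images, can be sketched in a few lines of Python. The data and function names below are hypothetical; the scoring rule shown is top-5 error, the metric the ImageNet Challenge became known for: the fraction of held-out images where none of a model's five guesses matches the true label.

```python
def top5_error(predictions, true_labels):
    """predictions: a list of 5 guessed labels per image; true_labels: one per image.
    Returns the fraction of images whose true label is missing from all five guesses."""
    misses = sum(1 for guesses, truth in zip(predictions, true_labels)
                 if truth not in guesses)
    return misses / len(true_labels)

# Toy example: 4 held-out images, one miss ("lamp" is not among the guesses).
preds = [
    ["cat", "dog", "fox", "wolf", "lynx"],
    ["sedan", "truck", "bus", "van", "cab"],
    ["chair", "table", "stool", "bench", "desk"],
    ["mole", "shrew", "vole", "rat", "mouse"],
]
truth = ["cat", "sedan", "lamp", "mole"]
print(top5_error(preds, truth))  # → 0.25
```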

Speaker 3

Yes, so.

Speaker 2

You have to decide. There's this interesting part of the book where you're like, Okay, I want to build a database with everything in it. How many categories of everything are there? Right, somebody's actually done that research. If you take all the things, how many kinds of things are there? What's the number?

Speaker 3

The number is the Biederman number, and I'm proud of really giving Professor Biederman that credit. Yeah, nobody noticed that number. He's a cognitive scientist who wrote a very, very good, but I don't think famous, paper in the nineteen eighties, guesstimating, or estimating with a back-of-the-envelope computation, how many visual concepts humans see. And that is a very hard number. How do you interrogate a

person and say, list me all the visual concepts? It's impossible. But he had a way of using dictionaries and using visual structure to estimate, and he put a number of thirty thousand visual concepts.

Speaker 2

There are thirty thousand different kinds of things, right, that people can identify and differentiate. Yeah, it's a lot.

Speaker 3

That's a lot.

Speaker 2

Yeah. And every concept you're setting out... sorry, is that your number? Is that your number?

Speaker 3

That was my number. I was obsessed with that number, and I was obsessed in a way that I feel was kind of crazy, because nobody was obsessed with that number. Nobody even knew. I think my book is the book that gave that number a name, which is the Biederman number, and I'm very proud of that.

Speaker 2

Can you just rattle off some of the categories?

Speaker 3

Star-nosed mole. Star-nosed mole, a category to itself, that's one of my favorite categories. And gardenia, the flower. Windsor chair. There were hundreds of dogs. I remember there were different kinds of cars, like sports sedan, and unicycles. It's a lot.

Speaker 2

So Fei-Fei has her number, she has her big idea. She knows she needs to build a gigantic image database. But how do you actually do that?

Speaker 1

Well, we'll have the answer in just a minute.

Speaker 2

Okay, so you've got your giant north star task ahead of you. Not only do you have, you know, thirty-thousand-ish categories to deal with, presumably for each category you need many, many thousands of images. So it's thousands of images per category, tens of thousands of categories. What is the order of magnitude?

Speaker 3

We're talking about tens of millions.

Speaker 2

Tens of millions. And this is not a time where you can do this in an automated or semi-automated way, like you could now.

Speaker 3

No, I mean, the point is the machines cannot do it. We have to. This is a north star to push machines towards that, so you have to do it by human hand.

Speaker 2

Downloading and labeling, yeah, millions or tens of millions of images.

Speaker 3

Downloading, cleaning, labeling. Yes, that was the task.

Speaker 2

So now you're like Henry Ford or something, right? Now you need an assembly line, you need a factory for creating this database.

Speaker 3

Yeah, you can put it that way. And we needed a global workforce, and eventually we found them on Amazon Mechanical Turk. It's an online global market.
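One common pattern for crowd labeling on platforms like Mechanical Turk, though not necessarily the exact ImageNet pipeline, is to send each image to several workers and keep a label only when enough of them agree. A minimal sketch with hypothetical data:

```python
from collections import Counter

def consensus_label(worker_labels, min_votes=2):
    """Return the most common label among workers' answers for one image,
    or None if no label reaches min_votes (i.e., workers disagree too much)."""
    label, votes = Counter(worker_labels).most_common(1)[0]
    return label if votes >= min_votes else None

print(consensus_label(["cat", "cat", "dog"]))  # → cat
print(consensus_label(["cat", "dog", "fox"]))  # → None (no agreement)
```

Images whose workers disagree would typically be re-queued for more votes rather than discarded.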

Speaker 2

It's a market for project-based work, right, people doing project-based work. And so, how long does it take you to build this thing? And how big is it when it's done?

Speaker 3

It took us three years. When it was done, it was fifteen million hand-cleaned, sorted, curated, labeled images across twenty-two thousand categories.
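As a back-of-the-envelope check on those figures, fifteen million images across twenty-two thousand categories works out to roughly 680 images per category on average:

```python
# Rough scale of the finished dataset, using the figures from the interview.
images = 15_000_000
categories = 22_000
per_category = images / categories
print(f"~{per_category:.0f} images per category on average")  # ~682
```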

Speaker 2

So now you have this thing. It's called ImageNet, and basically the function of it, it itself is not useful, right? Well, it's useful as a means to an end. It's there for people who have models that aim to teach computers to see and understand, to train their models. Now there's this giant database. I mean, people talk about this as kind of one of the beginnings of big data.

Speaker 3

Yes. Yeah, I think it should be properly recognized as the beginning of big data in AI, because before this, there wasn't this concept of big data in AI. It was just a paradigm shift from that point of view.

Speaker 2

And so you create this contest where people can come and train their models on ImageNet, on this giant database that you've built, and then in the contest their models will be shown new images, images not in the database, and you'll see how good they are. And for a while it's, like, going okay, right, but kind of slow. Like, in the book you write

about how you get a little worried. You've built this giant thing with people all around the world, and it's not, for a while, leading to the breakthroughs that you had imagined.

Speaker 3

Yeah. First of all, we open-sourced this. Even though we spent a lot of sweat and tears, you know, building this, we knew the real value is to open-source. So we gave it for free to the whole community. And then I wanted everybody to use it. I wanted to see this driving all of us towards the north star. I wanted the field to work out. But it wasn't like an overnight success. It wasn't like everybody was running around saying, oh my god,

there's ImageNet to use. And of course we were, you know, we were disappointed, but we were not sitting there crying. We were just disappointed.

Speaker 2

And so there is this big moment, right, after a few years, at one of the contests: there's a new model. So tell me about that moment.

Speaker 3

So the results of two thousand twelve came in, and we saw this result coming out of Professor Geoff Hinton's lab using a neural network, and the error reduction compared to previous years was just much bigger, you know. And we started to realize this is a very, very significant moment, because there's a serious, serious breakthrough in terms of the results on ImageNet, which is

the north star problem. Right. So it was so important for me that I, you know, bought a last-minute plane ticket to fly to Italy to announce the ImageNet Challenge winner that year, and

Speaker 2

You weren't going to go otherwise.

Speaker 3

I wasn't planning to go because I was still a nursing mom, so I was, you know, mostly working from home in that month. But I was like, this is so important that I needed to go.

Speaker 2

And so, I mean, this was someone working with Geoff Hinton, and using a neural network. Like, today, Geoff Hinton, you know, if you know two names in AI, Geoff Hinton is probably one of them. People call him the godfather of kind of modern AI, right? And neural networks are essentially the thing that has worked, right, both for

vision and for language. You know, ChatGPT is a neural network. And so this was a moment that was like, oh, this technique that a lot of people thought wasn't going to work, had kind of given up on, it's back.

Speaker 3

Yeah, yeah, exactly. I think it's actually a parallel story of two groups of people that had that determination, seeing something that, you know, maybe the mainstream wasn't seeing, and then had the resilience and just perseverance to keep marching on. I was doing my north star pursuit, I was doing the big data approach. They were doing the neural network algorithm, and then we converged.

Speaker 2

Uh huh. That's really elegant, right, because it's like your big data is just sitting there and you don't maybe entirely know it, but you kind of need a neural network to come in and train on it. Right, And they're over there building their neural network and they may or may not know it, but they need the big data that you're over here building, and then when it comes together, it's like, hey.

Speaker 3

It works. Yeah. So I think that's how science progresses. It's kind of spiraling up, and sometimes it takes a couple more threads. It's not a single spiral. I remember very vividly that one of the main critiques of ImageNet by my colleagues was: this is too big. We cannot even fit this into memory. What are you doing? What are you making this giant data set for, when we cannot even, you know, put it on a chip? And as that was happening, GPUs were happening.

Speaker 2

So GPUs are the chips made by Nvidia, now one of the biggest companies in the world. But they were figuring out that GPUs are particularly good for the neural

Speaker 3

Network exactly exactly.

Speaker 2

So that moment, this moment when you guys come together and kind of create this, you know, new era of computing, really, that we're still living in, of AI, is about ten years ago now, right? So just bring me to the present. Like, that happened; then where are we now? I mean, it's kind of the same universe, right? It has advanced a lot, but the basic premise, that you have neural networks training on vast, vast databases of images, like, it's basically the same.

Speaker 3

Right. So from my conceptual point of view, you're right. At that time, I was downloading the Internet of images, to be honest. Now, the Internet of images is just so vast, I don't know who can download it all. And then the GPUs are mind-bogglingly advanced, right? But you're right, the ingredients are still the same.

Speaker 2

We'll be back in a minute with the lightning round. Let's do a lightning round. Okay, what's one thing you learned running a dry cleaning shop?

Speaker 3

Ha, that's a hard one, I have to think. I think I learned resilience, because my goal was to be a scientist. But if it takes running a dry cleaning shop to get there in the most detoured way, I'll have to do that.

Speaker 2

So you write in the book about a high school teacher who was a very big and important influence on you, and how your advisors in grad school were an important influence. And now you have been a mentor to many people. So I'm curious, what's one tip for finding a mentor?

Speaker 3

For finding a mentor? That's a great question. I trusted them. At different stages, this trust meant different things. I trusted their genuine intention, I trusted their wisdom, I trusted their values, and I trusted their, you know, belief in me. So that was how I was lucky to find my mentors.

Speaker 2

What's one tip for being a mentor?

Speaker 3

Being a mentor is really about respecting the person, the soul, and helping them to find their north star, to find their passion.

Speaker 2

If everything goes well, what problem will you be trying to solve in five years?

Speaker 3

I'm trying to usher in machines being so helpful and collaborative for humans, whether it's productivity or our wellbeing. Whether this includes sensors, smart sensors, virtual agents, or real robots, I'm very excited by that.

Speaker 2

Fei-Fei Li is a professor of computer science at Stanford and the author of the book The Worlds I See. Today's show was produced by Edith Russolo and Gabriel Hunter Chang. It was edited by Karen Chakerji and engineered by Sarah Bouger. You can email us at problem at Pushkin dot FM. I'm Jacob Goldstein, and we'll be back next week with another

Speaker 1

episode of What's Your Problem.
