Brought to you by Toyota. Let's go places. Welcome to Forward Thinking. Hey there, and welcome to Forward Thinking, the podcast that looks at the future and says, "Makes you think all the world's a sunny day." I'm Jonathan Strickland, and I'm Joe McCormick. So today we're going to talk about something that I think is a pretty interesting topic,
and that's the automated description of images. This is yet another topic I came across in Alexis Madrigal's Five Intriguing Things email, which, if you're not signed up for it, you should get on that. It is one of my favorite sources of daily delights on the Internet. So what is this idea? Well, it's what it sounds like. But I'll start with an analogy. Okay, imagine you're going to something like Google image search. Now, what happens when you do a Google image search? You type in some
words and it comes back with images. Well, that's kind of strange because how does it translate the difference between a collection of pixels on the one hand and the
words you've typed in? Well, one of the things, obviously, is that there's data associated with images: metadata, right? It's either captions that people have manually typed in, or perhaps keywords that they've attached to those images, or something else on the website's page that's going to clue you in to what that image is about, like a file name or whatever. You could also use an approach
that is sort of refined by humans. So you could have humans sitting there working on your algorithm, where they go through image after image from selected keywords and say this is a good match for that keyword and this is a bad match for that keyword, and that sort of helps you connect words to images. Or let's say, what if there was no text associated with an image? Could you still do it? Well, in some cases you probably could, right, because we've gotten to a certain level
with image recognition. There are automated programs that can look at this and say this is a human face or this is a cat, as we have discussed before on the show. Right, and I think we don't know the full extent to which artificial intelligence like that already figures into something like Google Image Search. I wouldn't be surprised if that was a small part of it. But obviously we're relying heavily
on text associated with images. Okay, but now let's take that same last example, just identifying an image with no associated text, and say, could we do that with a complex scenario? So it's not just a picture of a human face, or say a bowling ball, which would be pretty easy to recognize, as you know, it's round, it's got three holes, but something like a pizza sitting in a bathtub, or a man throwing a sandwich off a cliff. You've got a lot of food-related
imagery in your head. Well, I said those because I actually searched for them earlier, before I'd had lunch. Yeah, all right, and I found no images of someone throwing a sandwich off a cliff. Well, why would anyone want to throw a sandwich off a cliff? I don't know. But you know what, I would be really surprised if there wasn't at least one picture of that out there somewhere. If there's not, there's definitely a stock photography opportunity lying in wait. Yeah,
you know what we're doing after the podcast. There's a reason that this would be hard, right, to describe a complex image: not just one thing, but a complex sort of scene that requires a sentence to describe it, with prepositional phrases and situational relativity. Yeah, exactly, you have to be able to describe the relationship between all the different elements that are inside that picture. Right.
And so that last example is what we're going to talk about today: how computers can look at an image with a complex scene taking place and turn that into a correct and accurate description made out of words. And this goes into a key element of artificial intelligence. It's not just the ability to describe something, but the ability to recognize it. It's something that can go beyond just describing a picture. And we talk a lot about how there are things that we humans
are really good at. It's it comes naturally to us, it's the way we work. But they are things that do not necessarily come naturally to the machines we make um. And the example I gave was if you had a group of people, you know, you've got a bunch of people together, and you told all of them, I want you to draw this picture, and you describe the picture
to them. And in my case, I said, well, just imagine that there's a young lady sitting at a table reading a book, and just make the picture as detailed as you can. But all I give is just the elements that have to be there: there's a seated lady, she's at a table, and she's reading a book. And so you could have all these different types of interpretations of that request. Sure, I might draw a picture of a lady sitting in a cafe reading a book. Or it could be a desk
at a library. There are so many different opportunities there. Could just be a kitchen table, whatever. But if I took all the pictures that everyone drew, and then I went to a separate group of people and I showed them all those different pictures, and I asked, all right, so what do all these pictures have in common? Pretty sure that a lot of people would end up saying, okay, these are all pictures of a young lady reading a book at a table. It would be kind of weird
if in every picture she was reading TekWar. That would be very weird. For lots of reasons that would be weird. But at any rate, yeah, I mean, we would have essentially people giving the same basic description. Now, there are a lot of
things going on in that scenario I just described. It's not just the fact that you're able to recognize things, it's that you're able to draw the conclusion that all these different pictures, even though the details are different, are showing you essentially the same thing, which is something that would be necessary if we were using an image search
that relied on this automated image description. Right, well, I mean, and furthermore, we as humans are able to recognize a lady sitting at a table reading a book from any angle that it's drawn from right right, pretty easily. Yeah, we can tell. We can tell like that this is the book, this is the lady, this is the table.
If you show just pixels to a machine that hasn't had any way of telling what these pixels actually mean, that machine might not be able to tell that there are distinct elements in that picture, right? It may all just look
like one thing. So there are a lot of complications here. We've brought this example up a few times, same sort of thing: I know what a cup is because I've seen a cup and I was told this is a cup, and you're able to extrapolate many different kinds of cups. Exactly. You have an ideal in your head. It's sort of the platonic ideal of a cup, and there are many ways that an actual cup can vary that theme, but somehow you always recognize the theme.
I recognize every single cup in existence as an imperfect realization of the ideal cup that's in my mind. Which has Grimace on it, by the way. But at any rate, yeah, again, with a computer, you could give it an image and program in some software that said, this particular image that you're seeing here is a cup. But then you took a totally different kind of cup: different shape, different size, different color. The computer is not necessarily going to know what that is,
right. Right, it's going to say, well, this one's blue. Yeah, you need a huge shorter and it's from a different angle. It's one of those things where you know there's a difficult problem and there's not necessarily a simple solution to fix it so the computer understands what's going on. So the point I'm trying to make is that this is a non-trivial computer problem that a lot of people have worked on for a long time, and it's
absolutely amazing to see how much progress we've made. Absolutely and furthermore, this is only half of the issue that we're dealing with here overall. Because once, I mean, once you can teach a computer to identify, for example, a cup, how do you get it to to explain what that
cup is doing in relation to the other things? How does it describe it in a way that actually makes sense to... That's a natural language problem. Exactly right, because the computer doesn't think in English or whatever language you want it to spit out, right? Yeah, and we've talked about natural language issues as well, the idea that machine language and natural language are extremely different, and in fact, programming languages are a bridge between pure machine language and
human language. You might not think it if you're not a programmer and you look at raw code, you might think, well, this isn't language that any human would understand, But in fact that's precisely what it is. It's meant to be that that bridging material. Uh So, getting computers so that they can interpret the natural language innately is a very challenging issue. We've said it before that you can word the exact same thought numerous ways. Human language is highly,
highly redundant. Yes, there are all different kinds, and the differences in word choice might express subtle differences in tone and things like that, but you can basically describe the same thing a jillion different ways, with different spellings. So, for example, if you're playing an old text adventure game like Zork, you could type walk down the hall or go down the hall, and that text parsing system might be smart enough to know either one and get you down the hall.
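The kind of pre-programmed text parsing described here can be sketched in a few lines. This is a minimal illustration, not how Zork itself worked: the synonym table, the function names `parse_command` and `respond`, and the fallback message are all made up for the example. The idea is that a handful of verbs map to one canonical action, and anything outside that vocabulary gets the same canned reply.

```python
# Hypothetical synonym table: these entries are illustrative, not Zork's real vocabulary.
VERB_SYNONYMS = {
    "walk": "go",
    "go": "go",
    "run": "go",
    "take": "take",
    "grab": "take",
}

# Assumed wording for the canned reply when the verb is not recognized.
FALLBACK = "I don't understand that."


def parse_command(text):
    """Return (canonical_verb, rest_of_command), or None if the first word is unknown."""
    words = text.lower().split()
    if not words:
        return None
    verb = VERB_SYNONYMS.get(words[0])
    if verb is None:
        return None
    return verb, " ".join(words[1:])


def respond(text):
    """Turn a typed command into the game's reply."""
    parsed = parse_command(text)
    if parsed is None:
        return FALLBACK
    verb, rest = parsed
    return f"OK: {verb} {rest}".strip()


if __name__ == "__main__":
    for command in ["walk down the hall", "go down the hall", "mosey on down the corridor"]:
        print(command, "->", respond(command))
```

Under these assumptions, "walk" and "go" land on the same action, while a verb the table has never seen falls straight through to the fallback, however reasonable it sounds to a person.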
So like, okay, I know what the person just said, but if you type mosey on down the corridor, it
may very well say I don't know what you're talking about. Right, what did Zork say when you said something? I'm sorry, I don't understand what you mean, or something along those lines. Or you were eaten by a grue, if it's too frustrated at you, if you're too confusing too often. Grue, yeah. No, but that's a great example, the idea that, you know, obviously, those programs were only capable of accepting commands that had been pre-programmed into them, and anything that went
outside those parameters was an error. It was missing. The message we would get is I didn't understand what you had to say, but in reality it could have just as well been file not found. Right. Okay, so now we're combining these two different artificial intelligence problems: on one hand, the complex problem of looking at a scene and recognizing what's going on there, connecting that to other images and context, and on the other, making sense of it in a
written language. And uh, why does this matter? Yeah? Why why do we even want to do this if it's so difficult? A lot a lot of reasons. One of the Well, first of all, we talked about image search and just being able to to automate this would make image search way more efficient for the things that we're looking for. We could be much more specific. Oh man,
if only I could tell it. So. Part of my job is searching for stock images that I end up publishing on our website and and human labeling of stock images. If you guys have never had to search for stock images for your job before, let me tell you it's one of the most joyous and terrifying things on the planet.
Because no matter what you type in, what you're going to get back is some weird clip art, some sexy ladies doing some stuff that may or may not have anything to do with what you just said, and maybe what you are actually looking for. You might also find some truly weird images, like an overweight, shirtless gentleman sitting at a table with a paper bag over his head, knife and fork in his hands, eating
a cartoon hamster. I mean, there's some weird stuff on there. Seriously, I am almost positive that there is a stock photograph out there somewhere of somebody throwing a sandwich off a cliff, like they've made it for diet purposes or something like that. Didn't get endless examples of that by that, right? If I searched for man throwing a sandwich off a cliff, and I should have tried this before we started, I'm almost positive I would not get that. Instead, I'd get
a lady in a bikini sitting on a pinball machine. Yeah, that's that's pretty accurate. So just increasing the accuracy of image search would be one reason, right, And and there are lots of different reasons why our our image searches on these things are imperfect. A large part of it is that you've got people gaming the system. They're essentially putting in any tag word they can possibly think of because they want their images to be the ones that
are purchased. But but if you want to play fair, then having an automated description would be best because you can't curate everything by human eye. It would just take too long. We're generating too much content for that to be a realistic possibility. But another is that it could be a huge help for people who have visual impairments.
So someone who is reading a news story, you know, someone who has who has some sort of visual impairment, maybe they're blind, and there could be pictures that give more context to whatever the story is, but they miss out on that if there's not an actual description of what that picture is. Particularly, I mean, there's some content out there where the caption might be playful but doesn't actually tell you what the picture is. Absolutely, so this would be a big help for people who are in
that situation. It also could speed up web access for people who have limited connectivity to the Internet. Perhaps it's through a cellular network, and it may be that there's some important information they need to get hold of. But, you know, if they're trying to load pictures, it's just taking too long to load anything. You could have a quick summary of those pictures. That would really speed things up. But actually, I also
think it's just an important contribution to general artificial intelligence. Absolutely, if you're trying to create a system that can mimic all of the functions of the human mind, well, one of the main things humans do is look at something and describe it, right, and you know, the description is just part of it. There's also the interaction, right by by recognizing things in our environment, we know how to proceed.
We can make decisions on how to proceed. So if, for example, we walk into a room and we notice that there are a lot of pedestals around us, and there are delicate vases on each pedestal, we know not to go swinging our arms everywhere willy-nilly. But, you know, a robot would not necessarily be able to tell that a vase sitting on a pedestal was not in
fact a single piece. It might interpret that as a column. That's a good point. Or, even if it could recognize that it was an object sitting on another object, that the vase was delicate or that it was worth not smashing. Right, yeah, teaching robots value is a very tricky thing. Also, they hate vases. They do. I think you program them... Well, it's because we
gave them the basic personality of Gallagher, is the problem. Okay, so automated image description is a very difficult problem, but it's also very worth solving, both in the long term for artificial intelligence and, in some specific cases, in the short term. Who's actually working on this? Where did this come from? Well, I mean, there are lots of
different people in computer science working on this problem. But the thing that kind of spurred on this particular podcast, you discovered, right? Yeah, as I said, it was through Alexis Madrigal's Five Intriguing Things newsletter, and it was a link to the Google Research blog, which is a cool little blog. Some of it's definitely over the average reader's head, but it's also just very interesting.
And yeah, this specific blog entry is from Google UK, and it was posted by a bunch of research scientists. It's one of those things where, if you go into the Google Research blog, they do get more technical than your average blog does. They're not so technical as to be completely incomprehensible, and I will say they're really good about linking out to terms that you might be unfamiliar with, which is important, because I had to click on every single one of those links.
I did so much reading for this one blog post, so I could really get a handle on what they were saying. But again, it illustrates the complexity of the problem. So again, we don't mean to suggest that these Google researchers are the only people working on this problem, or that their approach is the only way to do it. We're concentrating on it because it was really well documented and it was just published on November seventeenth. We are recording this on November twenty-first, so it was
of immediate interest to us. Yes, okay, so how are we doing this? Well? First, you have to identify what needs to be done before you can figure out how to do it right. You have to figure out the the things that have to happen in order for this to be a possibility. And they identified several things, including computer vision, which is how machines acquire and analyze images. So how do they get the images in the first place. Is it purely through code? Is it actual visual you know?
Is like like a camera system. I mean, if you're talking about robotics, and it's probably a camera system because they're looking around in their environment. It could just be sampling from the Internet or something like right right, Well with with something like an automated search, it could all be code. Like it could be that there's no quote unquote looking at the image, right, but so there's that. There's also object detection, which sounds really easy, but it's
incredibly hard. So this is what I was talking about: being able to recognize individual objects within an image. So what separates an object from its background? If I've got a top-down shot of a table, and there's a book sitting in the middle of that table, then when I look down, I can see that there's a book and there's a table, and I recognize those
as two different things. But like I was saying before, if it's a machine and it doesn't have this way of telling the difference there, it may just think of that as a pattern that's on a table, you know, or even a raised part of that table, if it can detect depth. It's not a cut-and-dried thing. So getting to a point where you can have a computer that can tell that there are multiple objects within a scene, that's already a challenge, although we've
gone a long way to actually do that. But if you want to think about how hard this is, imagine you're looking at an overgrown field and there's someone in a ghillie suit out there. Ghillie suits are those camo suits that have all the plant-type material hanging off of them. Some of it's artificial, some of it may actually be gathered from wherever you're going. Those camouflage suits are really convincing. It's really hard to pick out someone who's good
at being covert. You may not even know that there's a person there. And so, as hard as that is for us, that's about the equivalent of looking at a book on a table for a computer. Exactly. Yeah. You know, until you're able to teach a machine how to see, in a way, then it's going to be just as difficult to detect that as it would be for us to see that guy in the ghillie suit in the middle
of the overgrown field. But once you do get to that, you still have other things to keep in mind, like classification and labeling. So this is a measurement of how accurately a program can assign correct labels to an image. I love the example: if you looked at certain examples within the Google blog post, it eventually took you to a picture of a dog wearing a sombrero, so you had dog, hat. There's a hat on a dog, a wide-brimmed hat. Yeah, these are all important elements.
That's part of that classification and labeling. Uh. You know, instead of just saying that's one ugly dog because not being a weird growth, yeah, that would be that would be more than that. It's it's uh dog with the hat and not just dog with something on it. Yeah,
so again not dog with stack of pancakes. Yeah, And identifying the fact that there is that relationship that there is a hat on a dog, not just that there's a hat and a dog in the picture, but how did those two objects relate to one another within the
context of that picture. So that's also pretty cool. So as We've said, lots of folks have been working on the problem of how to do this thing, and a common approach has been working on linking computer systems that understand what's going on in pictures with computer systems that understand what's going on in sentences, and letting the two match up images and phrases. But these scientists at Google
were approaching it from a different way. They're trying to create a system where the two halves work together directly with the same data, rather than comparing and contrasting two separate sets of data. Okay, um, so they were. They were inspired to do this by recent language translation research in which one half of the system would create a diagram of a sentence in one language, say English, and the second half would look at that diagram and generate
a sentence from it in another language, say French. Yeah. And again this was a part of classification. It wasn't It wasn't just a word to word, you know, what is the what is the analogous word in this other language for this the one that we're detecting here, but rather what is the meaning of this sentence and what is the what is the phrase in this other language that has that same meaning exactly, because because doing word for word translations, if you've ever worked in another language,
often doesn't work at all. Right, and you realize, as you say it to someone who's a native speaker of that language, they say, that's a very weird way of putting what you just said. So yeah, it's a really interesting idea, and the way they do it is particularly technical. So I'm going to take this from a very high level, because, well, first, I gotta be totally honest, it goes so technical I definitely don't
understand all of it. So we're talking about neural networks. Yeah, so I'm trying to take it from a high enough level where I feel like I still have a general grip on what's happening. But if I dove down any deeper, then I would most likely be giving at least equal amounts of information and misinformation. Yeah. But so, like you just said, we're talking about artificial neural networks, which are the systems that these language researchers were using,
and they decided to go with the same thing. Yeah, they're systems that are attempting to replicate something that happens inside an organic brain. Yeah, so anything that is an artificial neural network is trying to kind of mimic nature, essentially, at some level. Some of them are more like the neural networks you would see in our brains. Some of them react like neurons, but they are arranged in a way that's very different from the way our
brains are arranged. So when we say these neural networks, just keep in mind we're not necessarily talking about an artificial brain. It's not the same thing. It's more like the tiny constituents, you know, millions of which make up a brain. Right. Think of each neuron as capable of doing various processes, whatever those processes might be, on information that comes to it, and then sending it on to other neurons, which will then
add their own element of whatever it may be. So one of them might be all right, whenever I get input from uh, this other neuron, I know to perform this specific mathematic process and then I know to pass it on to this other neuron. And uh, even that's an oversimplification, but it gives you kind of an idea of what's going on. So one of the two types of artificial neural networks used by Google, and keep in mind there are lots of different variations of artificial neural networks.
It's called a convolutional neural network. And it's funny, because when I think convoluted, I don't think of that as being a positive thing. But a convolutional neural network in this case is a feed-forward neural network, which means there's an actual pathway to follow that has a beginning and an end. So think of it like, you know, it's got a start and a destination, and things always begin at the start and they always
end up at the destination. And the pathway has lots of neurons along it that can do work upon whatever the input is. So you have this is the start um and it's essentially there to classify objects within an image, and that data ends up getting encoded according to however
you've programmed that network. All right, you wind up with this really huge amount of data coming off of a single image, right, and all of that gets fed into the second artificial neural network, which is called a recurrent neural network, which is not a direct pathway. This is an interconnected model that creates what is called a directed cycle. So there are several subsets of this kind of network. The fully recurrent network is
probably the easiest to imagine. That's one where every single neuron has a direct connection, or a directed connection, with every single other neuron within the network. Now, these obviously get way more complicated the more neurons you add. This is one of the reasons why having an artificial brain model is so hard, because you're talking on the order of eighty billion neurons to make a simple artificial brain.
Eighty billion units that have interconnections with not necessarily every other node, every other neuron, but enough to make this complicated and slow on the classical computing scale. Anyway, getting back to theirs, this one is what will end up processing all that information to describe the images that were classified from the first one.
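To make the two-stage idea concrete, here is a toy sketch of a feed-forward pass handing its output to a recurrent loop that emits one word at a time. It is not the Google system: there are no convolutions, the weights are random and untrained, and the vocabulary, layer sizes, and every function name (`feed_forward`, `recurrent_step`, `describe`, `make_params`) are assumptions chosen for illustration. The point is only the shape of the pipeline: the first network runs once from start to destination, and the second keeps feeding its own state back into itself.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy behaves the same on every run

VOCAB = ["<end>", "a", "dog", "hat", "on", "wearing"]  # made-up vocabulary
FEATURES = 4  # size of the image summary vector (assumption)
HIDDEN = 5    # size of the recurrent state (assumption)


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_params(n_pixels):
    """Random, untrained weights; a real system would learn these from data."""
    return {
        "w1": rand_matrix(8, n_pixels),
        "w2": rand_matrix(FEATURES, 8),
        "w_init": rand_matrix(HIDDEN, FEATURES),
        "w_state": rand_matrix(HIDDEN, HIDDEN),
        "w_in": rand_matrix(HIDDEN, len(VOCAB)),
        "w_out": rand_matrix(len(VOCAB), HIDDEN),
    }


def feed_forward(pixels, w1, w2):
    """One pass from start to destination: pixels -> hidden layer -> feature vector."""
    hidden = [max(0.0, h) for h in matvec(w1, pixels)]
    return [math.tanh(f) for f in matvec(w2, hidden)]


def recurrent_step(state, word_vec, w_state, w_in):
    """The new state depends on the previous state; that feedback is the directed cycle."""
    from_state = matvec(w_state, state)
    from_word = matvec(w_in, word_vec)
    return [math.tanh(a + b) for a, b in zip(from_state, from_word)]


def describe(pixels, params, max_words=6):
    """Encode the image once, then emit words until <end> or the word limit."""
    features = feed_forward(pixels, params["w1"], params["w2"])
    state = [math.tanh(s) for s in matvec(params["w_init"], features)]
    word = [0.0] * len(VOCAB)
    caption = []
    for _ in range(max_words):
        state = recurrent_step(state, word, params["w_state"], params["w_in"])
        scores = matvec(params["w_out"], state)
        best = scores.index(max(scores))
        if VOCAB[best] == "<end>":
            break
        caption.append(VOCAB[best])
        word = [1.0 if i == best else 0.0 for i in range(len(VOCAB))]
    return caption


if __name__ == "__main__":
    params = make_params(9)  # pretend the image is a 3x3 grid of pixel values
    print(describe([0.1] * 9, params))
```

Because nothing here has been trained, the words that come out are arbitrary. Training would adjust all of those weight matrices so that the emitted sequence matches human-written captions.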
So the first one classifies all the stuff. This one is what creates the language used to describe those images to make them meaningful to a human audience, so that when we get the description, it actually it reflects whatever the picture shows and doesn't like It's not like a lady sitting down at a table reading a book and it says frog dancing on skyscraper. Well hopefully, I mean, I mean the system is certainly not perfect. It's it's a really good system, and it's neat because this this
these neural networks. The other thing that these do that I didn't mention before is they can learn. You know, once a process goes through, if you start replicating that process, either by feeding the same thing through over and over or by feeding similar things through, it starts to pick up on that.
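Learning by repetition can be shown with a single artificial neuron. The sketch below is an assumption-laden toy, not anything from the blog post: the four training examples (a logical OR), the learning rate, the number of passes, and the function names are all invented for illustration, and the examples come with the right answers attached, which is a simpler setup than a system that works out what a cat is with no labels at all. Each time an example goes through, the weights are nudged a little in the direction that shrinks the error.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """Weighted sum of the inputs, squashed to a value between 0 and 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(examples, passes=2000, rate=0.5):
    """Feed the same examples through repeatedly, adjusting the weights after each one."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(passes):
        for inputs, target in examples:
            error = predict(weights, bias, inputs) - target
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias


# Made-up toy data: the output should be 1 if either input is 1.
EXAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

if __name__ == "__main__":
    weights, bias = train(EXAMPLES)
    for inputs, target in EXAMPLES:
        print(inputs, target, round(predict(weights, bias, inputs), 3))
```

Before training, the neuron answers 0.5 for everything. After enough passes over the same four examples, its answers sit close to the targets.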
So this is very similar to that idea we had about feeding in the thousands and thousands of images and videos of cats, and how the machine was able to learn what a cat was without anyone telling it what a cat was. The same sort of thing here. It's that same process. It ends up kind of like a memory: how our memories are pathways of neurons that fire in a specific sequence, more or less, and every time we remember it, we're replicating that as close
as we can, anyway. Similar to what's going on here. Pretty cool. And like I said, to get more detailed is beyond me. Yeah, beyond any of us sitting at this table. I would check out that blog post if you get a chance, especially because it's got a great image where it sort of shows the difference between what a good description of an image looks like and what a failed description looks like. Yeah,
gradients in between. Yeah, sure, it had this rating system. Or has; it probably hasn't been destroyed since we've made this podcast. I hope not. Otherwise we're futile in telling you to check it out. We're backward thinkers. But yeah, so humans have ranked the photos from, well, this is an accurate description, all the
way to this is not what that is. So, for example, like a person riding a motorcycle on a dirt road is described with the sentence a person riding a motorcycle on a dirt road sort of a dirt road. Actually, I would say it looks kind of like a motocross track. But that's enough. Yeah, okay, And then it would have sort of like describes with minor errors. So there was one that says close up of a cat laying on
a couch, and it's a cat sitting on a bed. But there is a cat in the photo. It's not really a close-up, but yeah, you know, it's sort of what's going on. My favorite was the one that's labeled a refrigerator filled with lots of food and drinks, and in fact it's some kind of parking sign with stickers all over it. Yeah, in a busy city tunnel.
It looks like. And if you look at the images, the way that they show how the computer quote-unquote sees the image, which really is just a visualization for our benefit, it shows different color boxes around each individual element to show how it's picking them out and labeling them. It gives you an idea; when you think of it that way, you think, wow, this is a lot more complicated than I imagined. I mean, we often think, like,
I don't know what you guys think. I shouldn't say we; I should say I. I often think of automated image description, and I often will imagine the simplest of images in my head, just thinking, like, you know, it would look like an Apple commercial, you know, white background with one solitary image in the middle, and the description of what that is. I don't necessarily think, oh wait, no, this refers to every kind of image, including, you know, pictures of me and my friends at a restaurant with lots of
other stuff going on. When you think about that and all the different elements that can appear in a single picture, then you realize, wow, this is really amazing that they've been able to create anything remotely approaching automated image description, because, you know, that's the thing that our eyes do. They naturally pick out items of interest from visual scenes. So the same sort of thing
again applies with robotics. I mean, you know, I know this is mostly about image description, but the same kind of processing is really important for machines, especially machines that are going to be interacting with humans on a more frequent basis: to be able to recognize an environment, not just to pick out the potential obstacles that a robot might encounter so it can maneuver around them, but also just to understand, these are the elements within
this scene that I need to be careful around because they are either people and thus I don't want to injure them, or they are things that are delicate and I don't want to break them, or these are the objects that I need to interact with, and this is how I might go about interacting with them exactly. And of course, beyond that, to communicate information about the environment.
And I mean that's something that would be amazingly helpful if a robot could describe to you what happened three minutes ago, yeah, you know well, And and for robots, we often talk about robots used for first responders, a robot that could be able to not just look for signs of life, but actively describe the environment back to operators, so that you know, say like, oh, hey, that pillar is down over there, and the floor is caved in over here, and there's a pile of debris and yeahah yeah,
because there are a lot of sensors on a robot that can detect things that don't necessarily translate into direct visual data for us, you know, not just the cameras but other things, infrared sensors, stuff that, you know, would need some processing on our end to make it meaningful. But if it's able to communicate that information directly, that would be incredibly helpful. So, a very important part of
artificial intelligence. And again I think it also illustrates that artificial intelligence is a much bigger idea than just a machine that quote unquote thinks like we do. That's the way a lot of us will describe like that. That's kind of my go to thought whenever I hear the words artificial intelligence. Mainly I guess because of Hollywood, But the reality is that it encompasses way more than that. So, uh, excellent work on uncovering that little item and and suggesting it, Joe.
I think it made a great topic. As I've said, it weren't me. Well, you saw it and then you brought it to our attention. If it hadn't been for you, we'd probably have been talking about something that would not have required me to fire my entire brain at it for a day and a half of solid research. Well, I spend most of my time trying to figure out weird ways to make you exercise your neurons, because I need those healthy for something I'm planning on doing in a
couple of months. That's all good. Good, that's good. I personally thank you, Joe, because otherwise I'd just be sitting here drooling. So, you know, sure, your plan is likely nefarious, but I'm gonna go along with it because it's benefiting me in the short term. What's your blood type? So, Forward Thinking: if you have suggestions for future topics, you should get in touch with us and
let us know what those are. Our email address is FW Thinking at HowStuffWorks dot com, or drop us a line on Google Plus, on Facebook, or on Twitter. At Twitter and Google Plus, we have the handle FW Thinking. Just search for that at Facebook; we'll pop right up. Let us know what you want to hear about, or maybe you want to chime in, or maybe you found that stock image of a man throwing a sandwich off a cliff. Oh, I want to see that.
If you send it in, I will replace whatever image I originally put with this podcast and use that, and we'll even throw credit in. We'll say that, you know, you're the one who found it for us. Well, assuming, now, there are photo permissions. Yes, yes, we have to be able to use it with permission. So if it's a stock photo from a stock company that we actually use,
then we can totally do that. And so as long as that all lines up, and I know that's a lot of ifs, you'll get a credit for it. And just think, you too will have contributed to our forward-thinking quest of people throwing food off of high elevations. I don't know. At any rate, get in touch with us, and we will talk to you again really soon. For more on this topic and the future of technology, visit Forward Thinking dot com. Brought to you by Toyota. Let's go places.
