Speech Recognition

Speaker 1

00:00

Brought to you by Toyota. Let's go places. Welcome to Forward Thinking. Welcome everyone to Forward Thinking, the podcast where we think about the future. I'm Jonathan Strickland, I'm Lauren Vogelbaum, and I'm Joe McCormick. And today we want to talk about the evolution of speech recognition software and hardware, and to talk about how did we get to where we are and where are we going from here? Because

00:34

clearly this is a thing. I mean, we're seeing speech recognition in lots of different devices, including computers and mobile devices. I know that my phone allows me to talk to it and ask questions, and occasionally I get the correct response. I also have programs that will create an automatic transcript of voicemails. So how did we get to this point? How did we get to a time where computers can, at least on the surface level, appear to understand speech?

01:10

And to really understand this, we have to go back a ways, and by a ways, I mean seventeen seventy three. What? Yeah. All right, so seventeen seventy three, shortly before the Atari twenty six hundred came out, by a couple of centuries, there was a Russian scientist named Christian Kratzenstein, and he actually... Kratzenstein. Kratzenstein. Kratzenstein. That's a great name. It's an

01:35

awesome name. Right. Well, during this time people were starting to get really interested in the nature of sound and ways of producing sound, and Kratzenstein actually created something very interesting. He created a machine that was capable of producing vowel like sounds using organ pipes and resonance tubes. Wow. So totally synthetic. Totally synthetic. And in this case we're talking about a machine producing sounds, not a machine taking in

02:08

sounds and analyzing them. Exactly. This is a machine. But the history of speech recognition is also a history of designing machines that talk back to us. They don't just listen to what we have to say, but they can communicate back to us. So this is one of those earliest versions of that. And he wasn't alone. In seventeen ninety one, Wolfgang von Kempelen in Vienna built the Acoustic Mechanical Speech Machine or two for two. Yeah, yeah,

02:41

and then let's just skip the entire nineteenth century. But wait a minute, actually, I think that Alexander Graham Bell's wife was deaf, correct? And originally, when he was starting to play around with sound, he was trying to create a device that would transform audible words into a visual output that a deaf person could interpret. And I mean, he wound up creating pictures with sound, but his wife never really

03:11

managed to interpret them. However, that research started going into things like the telephone. Well, and also there were early attempts when the gramophone first became a thing, when they started to use wax cylinders to record sound in a physical medium. Was that Edison or somebody before him? Bell did some of this as well. So there are actually quite a few inventors who were

03:34

working on this sort of technology. But there were people who had created these different devices to record sound on a physical medium, and there were already people thinking, well, if we can do this, is there some way where we can reverse the process, where we take the physical medium and make that into an input of some type. And some people were even thinking maybe we can make

03:55

an automatic typewriter. Once the mechanical typewriter came into being, there were thoughts about whether there's some way to take this same process, where we're recording sound onto a physical medium, and turn it into a way of actually turning this into text. That would be amazing. No one quite figured it out at that point, but it got a lot of people thinking, and in fact, Bell Laboratories was one of the leading companies or leading research firms that was

04:24

really concentrated on this speech recognition problem. And in the nineteen thirties there was a guy named Homer Dudley. I guess probably not quite as colorful as Kratzenstein or Wolfgang. It's okay, it's a more uh, what would you say, it's a more Americana kind of name. Homer Dudley, Yeah,

04:44

Homer Dudley. He was at Bell Labs. He proposed a system model for speech analysis and synthesis, and he also designed the Voice Operating Demonstrator, also known as the Voder, which was a speech synthesizer. And this was essentially building on that same work that the other guys had done centuries before, but in an electronic capacity as opposed to mechanical. Then we get into this era where the researchers were starting to try and figure out how to make machines

05:13

actually understand speech, at least on a surface level. And the early emphasis was on phonetics, which is the sounds we make in our language. You know, each language has its own list of phonemes that we generate in order to make the words. These are, yeah, the building blocks of speech. It's the individual sounds, and they're similar across languages. But yeah, for example, English has, what, about forty? Linguists are actually kind of in a disagreement

05:43

about the exact number. It all depends on where you are, like in the South. We produce sounds in the South, like, we can make a one syllable word into at least three or four syllables, y'all. So we have the ability to insert sounds where no sound was before, which is why things like Hooked on Phonics do not necessarily work as well as advertised, not necessarily, unless you create a regionalized version, in which case that would be interesting.

06:10

But you've got to understand how hard this is for computers to hear, right? So we're used to it. I mean, we talk to people all the time. But just think about when you're on the phone and, say, somebody is reading something to you, like spelling out a word or something, and you go, did you say P or B? Did you say P? I couldn't... you know what? That's why you need the Yankee Hotel Foxtrot kind of, sure,

06:37

sure, stuff. So it's so easy for computers to mess this up. Well, yeah, and beyond that, even if you are enunciating clearly, the speed at which you say a word can completely make a computer misunderstand you. Absolutely, because if you've built the computer program in such a way that it analyzes a word based on a sequence of sounds, and it expects each part of that sequence to be

07:09

a certain length, then if you pronounce that word at a different speed than someone else, the computer might have trouble figuring out that the two versions were the same word. And this can vary within an individual's speech. I could say the same word twice in one paragraph, and the way I say it each time might be different enough to cause problems. Right. So these are non-trivial problems, and in these early days they were mostly focused on just trying to figure out how to teach

07:37

a computer to recognize those basic sounds. In nineteen fifty two, Bell Labs introduced the Audrey system and that could recognize spoken digits, which made it a little easier because you eliminate everything that's not a digit, right, you're just going through a series of what ten, twenty? It was probably only nine actually, because you usually do one at a time. Maybe maybe ten. If you include zero there as well,

08:05

I mean, you might not. It all depends. Had they discovered the number zero in the nineteen fifties? They did, but they lost it for a while. Oh yeah, so, but I think by nineteen fifty two they re-found it. The Mayans had it at least. Yeah, it was, you know, twenty thirteen, we had to have it. I mean, yeah.

08:22

Bell Labs ended up having this Audrey system, and by limiting it to just digits, it meant that they could work very hard on a drastically simplified version of speech recognition, because again, you just throw out anything that's non digit, and it means the computer it can concentrate on which digit did that sound like the most, based upon the phonemes that are needed to say whatever that digit is.
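One way to picture why limiting the vocabulary to digits helps: with only ten candidates, the recognizer can score what it heard against each one and take the closest. The Python sketch below is an illustration only, not how Audrey actually worked. The phoneme spellings, the `DIGIT_PHONEMES` table, and the choice of edit distance as the similarity measure are all assumptions made for the example.

```python
# Hypothetical phoneme spellings for the ten digits (ARPAbet-like symbols,
# chosen for illustration; not taken from the Audrey system).
DIGIT_PHONEMES = {
    "zero": ["Z", "IY", "R", "OW"],
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "three": ["TH", "R", "IY"],
    "four": ["F", "AO", "R"],
    "five": ["F", "AY", "V"],
    "six": ["S", "IH", "K", "S"],
    "seven": ["S", "EH", "V", "AH", "N"],
    "eight": ["EY", "T"],
    "nine": ["N", "AY", "N"],
}


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def closest_digit(heard):
    """Return the digit whose phoneme spelling is nearest to what was heard."""
    return min(DIGIT_PHONEMES, key=lambda d: edit_distance(heard, DIGIT_PHONEMES[d]))


# A slightly misheard "five" (final V heard as F) still lands on "five",
# because nothing else in the ten-word vocabulary is closer.
print(closest_digit(["F", "AY", "F"]))
```

Because anything that is not a digit is simply not in the table, a noisy input can only ever be mapped to one of ten answers, which is the simplification described above.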

08:49

In nineteen sixty two, so this is a decade later, IBM demonstrated the Shoebox machine at the World's Fair, and it could understand sixteen words spoken in English. Another good point is that some of these speech recognition systems are language specific. It's not that they can adapt to any language. Most of the programs are programmed specifically

09:12

for a certain one. Right, exactly. So again, if the phonemes that we produce here are different from ones in, say, China, then whatever it produces is not going to be the response that someone who's speaking Chinese would want. Right. So generally, in the nineteen sixties, Japanese labs began to work on vowel recognition and phonemes, and they also did some early work

09:38

in continuous speech recognition. Now this is important because, again, with those early speech recognition programs, even when they got to the point where they could recognize full words, you had to put long pauses between each word or else it never would. And unless you're William Shatner, that's not really a natural way of speaking. Or Christopher Walken. Yeah. Either way.

10:05

Also in the sixties, Fry and Denes, two researchers at University College in England, designed a phoneme recognizer that could recognize four vowel sounds and nine consonant sounds, and they used statistical data on phoneme sequences found in English to help the system recognize more words than it normally would. And this is kind of interesting. What you do is you say, all right, there are a certain limited number of sounds typically found in the spoken English language.

10:37

But those sounds are not you know, it's not that those are completely interchangeable and that you're going to find every single combination of those sounds in an English word. There's certain sounds that are rarely, if ever, going to go together. So if you start to take those sounds out and then concentrate on the words that do use the sounds that are left, you have reduced the number of possibilities and thus made the system more efficient and reliable.
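The idea of throwing out sound combinations that rarely or never occur can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a reconstruction of the Fry and Denes recognizer: the four-word `TOY_LEXICON`, its phoneme spellings, and the rule "keep a sequence only if every adjacent pair was seen at least once" are all made up for the example.

```python
from collections import Counter
from itertools import product

# Hypothetical mini-lexicon with ARPAbet-like phoneme spellings.
TOY_LEXICON = {
    "pat": ["P", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "tap": ["T", "AE", "P"],
    "stop": ["S", "T", "AA", "P"],
}


def bigram_counts(lexicon):
    """Count how often each adjacent phoneme pair occurs in the lexicon."""
    counts = Counter()
    for phonemes in lexicon.values():
        counts.update(zip(phonemes, phonemes[1:]))
    return counts


def plausible_sequences(candidates_per_slot, counts):
    """Keep only sequences in which every adjacent pair was seen in the lexicon."""
    keep = []
    for seq in product(*candidates_per_slot):
        if all(counts[pair] > 0 for pair in zip(seq, seq[1:])):
            keep.append(seq)
    return keep


counts = bigram_counts(TOY_LEXICON)
# The recognizer is unsure about the first two sounds and sure about the last.
heard = [["P", "B"], ["AE", "AA"], ["T"]]
survivors = plausible_sequences(heard, counts)
print(survivors)
```

Here four raw combinations shrink to two, because "P" followed by "AA" and "B" followed by "AA" never occur in the toy lexicon. A real system would use probabilities estimated from a large sample rather than a hard seen-or-unseen cutoff.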

11:03

So now are we starting to get into an era of what you're talking about here, where the machines are doing some analysis? Yes. To figure out what the language means? Right, well, really, not even what it means, just to figure out what the word is. Exactly. Yes, that's what I meant. To interpret the sounds into words. It's not just drawing on things that have been directly programmed into it, you know, the hard coded, right, understanding

11:31

that it's using statistical analysis. Yes, and and I mean clearly this would be important if you're talking about any sort of dictation software, right, because with dictation software, to program every single word in the English language into a vocabulary for this program and to do every variation of the pronunciation of that word would be pretty that'd be

11:57

a lot of work. Yeah. So if you can create a system that can analyze the phonemes and then, based upon a certain statistical analysis, figure out or make a best guess at what that word is, you've fixed a lot of the problems. And in fact, best guess becomes really important in just a few decades. So in nineteen seventy one, oh wait, I'm sorry, let me back up. Late sixties, early seventies, researchers start to look into non-uniform timescale approaches to speech recognition, which is what I

12:28

was talking about earlier. The fact that not everyone speaks the same words at the same speed or uses the same emphasis. So you have to figure out a way of analyzing that and accounting for that. And it's called the it's called dynamic time warping, which is not a jump to the left and a step to the right. I'm disappointed, Jonathan, I'm sorry. Dynamic to me, H, well,

12:53

you know, I'll take you to a movie on Friday. Nineteen seventy one, the United States Department of Defense Advanced Research Projects Agency, also known as DARPA, initiates a program called Speech Understanding Research, or SUR, and it funded several projects, including one by Carnegie Mellon University called Harpy, which is just charming. But yes, that's a speech understanding system which could understand one thousand and eleven words, which I said was about the same as a vocabulary of

13:25

a three year old. And it used something called beam search to narrow down the possibilities of what a spoken sound could be by comparing it to the statistical data and going with the most likely results. So it's going with probabilities. And so this is really interesting to me because it doesn't necessarily mean it's going to produce the correct result. It's making a best guess based upon the

13:47

input that it got of what it was you said. So in this case, if I were having the conversation with you, Joe, and I said a letter and you weren't sure if it was P or B, instead of you asking me, you'd just say, well, I think it probably was a P, I'm just gonna write that. Well, I mean, if your computer is smart enough and it has a large enough dictionary, it might understand that, say, the word starting with a P sound makes sense here, but the

14:12

word starting with a B sound does not. So like, I ate a pear or I ate a bear. And, you know, some days, some days the pear eats you. Exactly. But of course I'd imagine the machine at that time didn't have the resources to, say, go figure out if I ate a pear or I ate a bear made

14:31

more sense. Right, right. We need to remember that this is, you said, early seventies, so this is when, you know, computers were the size of like three of my car at least, you know. Right. Well, and there was no Internet yet, and that'll come in a big way in a little bit here. By the seventies they had ARPANET, but that was very limited and didn't have anything to do with it. They had no web to draw on, no web at all

14:58

for massive sampling of data. So nineteen seventy six, the SUR program that DARPA had concludes. There were a couple of other agencies that had tried to create speech understanding algorithms and hardware, but had not quite met the requirements by the end of the program to really count as a success, but they did end up contributing quite

15:21

a bit to future endeavors. So then we've got the nineteen eighties, which typically follow the seventies, and that's when they introduced a statistical method that was based on the hidden Markov model. Have you guys heard of this? The HMM? All right, it's a little complicated and it's difficult to really explain without the benefit of complicated graphics behind me, but I will try. So it's a probability model. And let's say that you've got, let's say you've got three

15:58

urns in front of you. Okay, three vases are in front of you. They're solid, you can't see through them, but you see that you've put a certain number of orange ping pong balls in each. The first one has the most. You put a certain number of white ping pong balls in each. The middle one has the most of those. And you put a certain number of yellow

16:18

ping pong balls in each. The third one has the most of those, and then you already know the states of the you're actually watching as you draw these ping pong balls out, and then you're combining them to get some sort of response at the end. It doesn't matter what the response is, but you're drawing a ping pong ball out from each combining them together, and that you see the whole process. Now, that's a normal Markov model because you know the state of each of those draws

16:44

from the vases. All right, so you observe the state. Now, let's say those vases are in one room and you're in another room, and you cannot see into the other room. You just get to see the output of the three ping pong balls as they come out of this process. So you don't see which one's drawn from which urn, but you know that one is drawn from

17:05

each one, and you see what the result is. Now, you don't know the state of those individual urns, but you do see the result, which gives you enough information to draw some conclusions about the state of the urns inside the room. Not enough for you to know for certain, but you can get sort of a probability of what happened in there to get the result that you have that's a hidden Markov model, and that is an oversimplification

17:31

of the hidden Markov model. So anyone out there who actually works with systems that use this is screaming that's way too simplistic, I know, but this is the easiest way for me to explain it, Okay. But so basically what you're saying is that it uses it looks at the statistical prevalence of these three different colors appearing into the room, and by that it makes judgments about how common they probably are in the vases more or less. And so these models are used a lot in things

18:04

that require a lot of interpretation on the machine's part. Voice recognition is a big part of that, but it's not just voice recognition: gesture recognition, handwriting recognition, anything where, you know, two people could try and make the same result, but because we are individuals and because we do think slightly differently, even though we're both creating the same result, we're doing it in a different way. The computer has to be

18:28

able to interpret that, right. So it's because it's taking sort of ambiguous analog data from the world, sure, yeah, and it has to be able to react to that

18:38

and create a meaningful result. So once people started to concentrate on this form of statistical analysis, voice recognition pretty much hit its peak as far as recognizing individual words, not necessarily knowing what the context is or what the meaning is, but it meant that if you were speaking into a machine that had this kind of software in it, it could determine with relative ease what it was you were saying, not what it meant, but what the actual

19:10

words were. So, for example, if it's a simple speech to text program, it'd be fairly accurate, and it got more accurate as time went on. In nineteen eighty two, that's when a certain Ray Kurzweil got involved. Oh, Kurzweil's a well known futurist, one of those evangelists for the oncoming singularity, a fellow who I think is hoping to achieve immortality through technology in some method or another. Personally, yes, definitely.
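The urn picture of the hidden Markov model from a few minutes back can be turned into a small Python sketch. It uses the common textbook variant, which differs a little from the spoken version: an unseen chooser moves from urn to urn, draws one ball per step, and we only see the colors. Every number here (the `START`, `TRANS`, and `EMIT` tables) is an assumption invented for the example, and the Viterbi algorithm shown is one standard way to recover the most likely hidden sequence, not a description of any particular speech product.

```python
STATES = ["urn1", "urn2", "urn3"]

# Assumed probabilities, for illustration only.
START = {s: 1 / 3 for s in STATES}
# The chooser tends to stay at the same urn (0.6) rather than move (0.2 each).
TRANS = {s: {t: (0.6 if s == t else 0.2) for t in STATES} for s in STATES}
# urn1 is mostly orange, urn2 mostly white, urn3 mostly yellow.
EMIT = {
    "urn1": {"orange": 0.8, "white": 0.1, "yellow": 0.1},
    "urn2": {"orange": 0.1, "white": 0.8, "yellow": 0.1},
    "urn3": {"orange": 0.1, "white": 0.1, "yellow": 0.8},
}


def viterbi(observations):
    """Return (probability, path) for the most likely hidden urn sequence."""
    # best[s] holds the best probability and path that end in state s so far.
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            # Pick the best predecessor for state s.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * EMIT[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


prob, path = viterbi(["orange", "orange", "yellow"])
print(path)
```

In a speech setting the hidden states would stand for sounds or parts of sounds and the observations for measurements of the audio, but the principle is the same: you never see the state directly, only evidence that makes some state sequences more probable than others.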

19:38

So he created in nineteen eighty two the Kurzweil Applied Intelligence division, company really, and it was all about creating computer based speech recognition. And in nineteen eighty seven it introduced a commercial speech recognition system. And Kurzweil was really applying his expertise in two areas, computer science and pattern recognition. He was really interested in the way that computers can identify patterns and respond to them, and speech was certainly

20:10

part of that. So he applied that knowledge and that expertise and really made some big contributions in the speech recognition field. Skipping over to the nineteen nineties, I mean, essentially we're having this field evolve over time. But in the nineties we started seeing the development of real speech

20:28

enabled applications. So this is when we started getting those telephone systems where you would call in and get an automated response saying, say or press one, which is again going all the way back to the Audrey system in nineteen fifty two that Bell Labs did. Yeah. You've only got ten responses, and so it just has to figure out which one. Right. And then eventually it would get to things like, you know, say yes, or like, I can help you with that. What is your

20:55

problem? Use a keyword. Yeah, not that keyword. Sorry, I don't understand. Can you restate that? Yeah. So by twenty ten we get Google's English Voice Search system, which incorporates around two hundred and thirty billion words from actual user queries. Wow, have you all tried this thing? Oh, yeah, I use it all the time. No, I do, because I've got an Android phone, so I actually do use

21:24

voice search all the time. Sometimes I think it's really hilarious how accurate it is, Like, you know, it shouldn't recognize that term, but it does. I use it mostly for navigation purposes, So I'll pull up a map application

21:39

and it's a Google one. So then I, you know, speak destination and I can say an address, or I can say a business name, or you know, if I have someone in my contact list, I can say their name and it pulls up the information, which leads us kind of into a second part of this speech recognition discussion. We've got the idea that speech and search are really tightly connected, actually to the point where advances in one field often mean that the other field benefits as a result.

22:11

But now we're talking about not just recognizing words, but pulling some sort of meaning from them. Right. Well, what is the goal of an interface that takes input from a human and turns it into data? I mean,

22:29

I don't know. We'll say I would argue that the ultimate goal of an input interface is to become invisible, to make things as easy and as natural and as intuitive for you as it possibly could be, so that you don't even recognize the tools you're using, right, Right, to give the computer the ability to answer your questions almost before you ask them, exactly and right now, you know, we're still using tools that we have to learn how

22:59

to use. Right. So when you when you want to talk to the voice recognition program on your smartphone, you do have to be aware that it's only going to be listening to certain keywords, right. You have to you have to give it keywords and sort of specific commands

23:17

that it can understand in order for it to help you. Sure, And in that sense, it's kind of like a program where you know, you have a certain number of buttons you can click on, or commands you can enter on a command line that are chosen from a list of pre selected commands, but you're just doing it with your voice, right. Anything outside of that would just be interpreted as an error. Sure, yeah, Yeah, you can say open and close, but if you say

23:42

French fries, it goes, what? Yeah. Yeah. So let's say you had this and you're looking for something on Google, right? You're looking at Google Maps and you're using your voice. You could probably say French fries, though, right? You could say, like, French fries near my house. Well, even there it might be able to understand those keywords. Right. You've given

24:00

it something that it knows how to work with. But what if you've got a problem, like I'm trying to remember this meal I had that was real good in town and I don't know, and you're kind of describing it, but right, it can't do anything with that, right, Right, you'd have to talk to a person at that point. You would either that or you would have to have every single restaurant give every single possible explanation of what

24:23

its meals would be like exactly. Yeah. But so this leads me to a question about the future of voice and speech recognition. And here's a question. Why do we call tech support? I mean, if you've I mean, most people are working, most people have called tech support at some point. But I will venture that almost any problem that can be solved by tech support, there's already a

24:48

written out solution to the exact problem you have somewhere online. Sure, right, somebody has already solved this problem, and they've probably typed up instructions on how to fix it right, and they may even be easy to follow instructions. But the challenge there is for the person who's experiencing the problem, how do they frame their problem in such a way that

25:09

they get exactly the right response? How do they connect the problem they're experiencing to the solution that exists somewhere out there if they don't know what the correct keywords are, if they don't know what the problem is itself? They're just going, my screen won't turn on. Exactly. And that's why we call tech support. I think you call tech support because you need something that can process natural language, which is right now a person.

25:34

A person can listen to you describe your problem, and whatever terms you come up with, can take that information, get the gist of it, and connect that to a piece of knowledge, right, and then in return, that person can respond with language that the person who called in

25:53

for tech support can understand. So, for instance, if I'm experiencing a problem and I call up Joe and Joe tells me how to fix it, but I don't understand his explanation, I can say I am sorry, I just I don't get that Joe can actually then take the time to reframe what it was he said in a way that my puny brain can comprehend, right, And then I can turn off my computer and turn it back on again and suddenly works. So yeah, it totally works

26:20

both ways. I mean, but it's so especially important in identifying what the problem is to begin with, because a lot of times we just don't know the right way to explain it to a computer in terms of commands and keywords. And so I think this is sort of the future of where voice recognition is going from here. And there are a couple of things we need to

26:41

explore about voice and speech recognition. One of them is how does the computer understand whole speech like sentences that you're speaking to it, as opposed to just little words at a time, and make sense of those in a grammatical way and actually make sense of them instead of instead of yeah, picking up on those keywords, because you know, right now, the technology doesn't know what you're saying, right, So yeah, well, I mean, and in our way, it

27:10

will probably never know what you're saying. Oh, we can have a debate: will computers achieve consciousness? You know, will the Terminator learn to love? But whether or not the Terminator will understand the meaning of love, the Terminator will at least be able to make sense of my grammar, even if it's spontaneous and kind of mangled. So the Terminator may not love, but it may be

27:36

able to mark up your paper. Yeah, it may be able to help me figure out what restaurant I went to when I was in town last year, just by me describing some dish. If the terminator doesn't love you, I don't see it taking the opportunity to actually help you with that problem. And I'd like to put in that I do not want the terminator to be my English teacher ever. Thanks, I'm pretty sure I had the

27:56

Terminator as my English teacher. You'll fail English. So before, well, no, I want to introduce a possible way of viewing the progress of our input through voice. And that's a way of looking at the computer helper as something that's got an obedient orientation versus a sympathetic orientation. All right, and what do you mean by that? And so I would say that right now, computers have an obedient orientation, meaning they directly solve problems that

28:35

you give them. Right, they do what you tell them to do, and that's it. Yeah yeah. And when they're not doing that, it's because you haven't told them how to do it exactly right, Yeah yeah. And so you enter a command and it follows the command exactly, It performs the calculation, it searches for the search term. However, it goes like that. Now, what makes that person on tech support different. That person has a sympathetic orientation as

28:59

opposed to an obedient orientation. What that person does is listens to your whole problem, gets the gist of it, figures out what's important, and then helps you solve it. Right, They see the end. They see not just each of the individual commands you're giving, but they understand what you're trying to do overall. Right, and we're already making some

29:23

pretty big strides in natural language recognition. For instance, IBM's Watson, which was famous for going on Jeopardy up against two former Jeopardy champions and beating them, winning in a game of Jeopardy. But what it had to do was it essentially had a huge amount of information stored in its in its data banks. Yeah, but it had no connection to the Internet while it was playing the game, So it was it had much of the Internet on it.

29:50

You know, it was self contained. Yeah, all the YouTube comments were left off, but otherwise, yeah, it was. Why didn't it need those? Well, worry about what happened when it learned Urban Dictionary. Right. Yeah, they taught it Urban Dictionary and then they basically had to nuke Urban Dictionary from orbit, from its data banks, because it started off. It's true. That's completely true. Okay, so I understand why it doesn't need YouTube comments. They'd turn

30:18

Watson into a vicious sociopath. Yeah, yeah. "I will kill you." It was essentially becoming the Sean Connery from the Saturday Night Live skits. So anyway, it was closed off, so it didn't have an outlet to go out and do a search on the internet

30:38

for everything. So when a Jeopardy clue came up, it had to analyze the clue, go through its database and then determine which bits of information were most likely to be the relevant ones to answer or to form the question in the case of Jeopardy for that clue, and uh and the way it did this was that it would assign probabilities to answers based upon parsing out the clue. And the thing about Jeopardy is that it's not just

31:06

really straightforward answers. You know, things like, this is the Beethoven symphony that contains Ode to Joy. What is the Ninth Symphony? You know, it's not that. There's word play in there. Right, exactly, there are puns, sure. And yeah, yes. So they had to create programs that could parse that language and determine what is the underlying meaning of this phrase, not just what do

31:32

these words... you know, not just using those words as search terms, because if it did that, it never would have won. It had to figure out the relevance. And so what it would do is it would pull up all these different answers and assign them probabilities for being the correct one. And if the probability was higher than a threshold, and I can't remember what the threshold was, it was like seventy or eighty percent or whatever, but if it was higher than that threshold, then

31:55

and only then would Watson guess. Yeah. Otherwise Watson would be quiet and allow one of the other two people to answer, which is really interesting to me because it's a step towards that natural language recognition, the idea that it's not just looking at the words as search terms, but as units of meaning. They have meaning, and therefore you need to find the

32:19

data that corresponds with that meaning. And that is incredible. Well, it was searching for if I'm correct, it wasn't it. It had something to do with like keywords would be searched based on when they were in proximity to other important terms, right as far as I can understand it, yes, but I mean it gets really pretty complex. And then beyond that, you know, we're starting to see Watson being

32:44

used in medical facilities. Yeah. You know, they're using it to describe what's wrong, you know, kind of a diagnostic, although a lot of doctors will tell you that while it's a useful tool, it's certainly not a replacement for a doctor, because two people with the exact same condition can come in and present different symptoms and even explain the same symptoms in very different terms, and so it becomes increasingly difficult for a

33:15

machine to be able to interpret that and come up with the right response, as opposed to a doctor who has that experience and has the ability to be much more dynamic and even proactive and asking the right questions to get the right information. Well. Also, I mean, a doctor, much like the tech support person is, though with much higher stakes obviously, is able to identify what's important. I mean a lot of most of the time, when you

33:40

come describing a problem, you're giving too much information. All the time. You're giving all this information, and a huge amount of it is probably not actually relevant to what's really the problem. And that's when the human who's experienced

33:55

this before knows what to zero in on. Computers have more trouble with that, right? Like, they have a hard time figuring out what's important when you've given it a list of... Sure. Yeah, yeah, no, not without giving it some form of a way of recognizing context and

34:12

a way of recognizing the importance of particular words and particular phrases. Yeah, I mean, how does a computer determine that the third word in a sentence is more or less important than the fifth word, right? It's all statistical probability, and at a certain point you're going to plateau on that, because the more input that you give to these kinds of programs, you know, they'll analyze it and analyze it, and it gets really accurate, and then kind

34:36

of stops getting more accurate. Yeah. In fact, that's been a real issue with voice recognition in general, and a very interesting thing, I think. Maybe it's interesting to me because Kurzweil worked on voice recognition, and I know the man must be aware that the technology increased at a pretty rapid pace but then began to plateau. You know, really it even began to plateau in

35:00

the eighties. We made improvements and we learned how to use the technology we had created in better ways. But it's not like the advances we made are exponentially better than the previous generations. So in a way, the

35:14

curve is starting to plateau and level off. We're still making advancements, but not at an accelerated rate, whereas with Moore's law, every two years, essentially we're seeing computers get twice as powerful, and so I think that that makes some futuristic predictions less likely because we recognize

35:35

not all elements of technology accelerate at the same rate. Yeah, I think for the future of speech recognition, natural language processing is key, and natural language processing, you know, Watson seems amazing, but compared to what's probably going to come in the future, Watson is actually very primitive, right? And also we've got to keep in mind that right now, even though Watson did do this amazing thing by beating humans at their own game, it had thousands of processors

36:10

and tens of thousands of cores. So you're talking about an incredibly powerful, energy hungry machine that was able to do something that a human can do. Right, well, that a tiny little meat thing can do. It was able to do what the tech support operator can do. And ultimately, I think that's the endgame here. What we're talking about

36:35

in the far future, what we dream about, is when your computer is as sympathetic in its orientation as a human helper, so that you can describe in spontaneous human language what you're trying to do and it can actually help you with that, as opposed to just operating off of set commands. And we're getting there. I mean, if you talk to people who have used Apple's Siri or the Google voice search stuff, you know, you can use some pretty, you know, colloquial sayings to get what

37:07

you want. And it's getting better and better at interpreting those and giving you the response that would be appropriate. And granted, this is all again still on a surface level, but it's seemingly deep, you know, to the user experience. It seems like the machine understands what

37:24

you're saying, even though that's not really... Yes. And, you know, maybe in the future we have the semantic web that responds exactly to what we want, even if we were to, you know... I know that it's really hard to get tone across in text messages, but maybe computers will be better than people are. By the way, I'm always j slash k, if you're wondering. All right, well, you know, we should wrap this up. We've gone on quite a bit about voice

37:50

and speech recognition. It's a fascinating topic, and it is one where I am eager to see more advances in the field. We've seen not just smartphones and tablets, but also game consoles, things like Microsoft's Kinect, and other devices as well incorporate voice recognition, and I expect we're going to see even more of that. I can't wait for my thermostat to have it. "It's too darn hot in here," and then it just immediately cranks down five degrees.

38:18

That'd be fantastic, because I don't have one that's connected to the internet, so I can't just use my smartphone. I actually have to, I can't believe it, get up and walk to it. I know, my life is a drama waiting to be filmed. So guys, that's our episode about voice recognition. We hope you enjoyed it. We highly recommend that if you have any topics that you think Forward Thinking should cover, stuff about the future that really has you excited, that you get in touch

38:45

with us. We have an email address now, it's fw thinking at discovery dot com. You can also go to fwthinking dot com for all of our content. We've got videos, blog posts, podcasts, we have links to all of our social networking stuff. Go there, connect with us, let us know what you think. We look forward to hearing from you, and we will talk to you again really soon. For more on this topic and the future of technology, visit forward thinking dot com. Brought to you by Toyota. Let's go places.

Transcript source: Provided by creator in RSS feed
