More on NLP and where voice assistants come from

Speaker 1

00:04

Get in touch with technology with TechStuff from HowStuffWorks.com. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with HowStuffWorks and I love all things tech, and this is the second episode about natural language processing, or NLP, and also natural language understanding, or NLU. The two are related.

00:31

NLP describes the technologies and processes we use to give machines the ability to interpret and respond to language the way we use it, so not just understanding our input, but also generating output that still follows the rules of various languages. So it's all about getting machines to conform to us rather than the other way around. If you have not listened to the episode immediately before this one,

00:58

you should do that, because I'm about to pick up where I left off, which was just after ARPA pulled the plug on its Speech Understanding Research project. The research under the ARPA project had shown that NLP was an even more challenging problem than had previously been anticipated. Even the simplest approaches were creating enormous demands on both the work programmers had to do to build a system out and the processing the system would

01:26

have to rely upon in order to interpret language. Work in the late nineteen seventies ranged into psychology. NLP researchers felt a system needed to be able to identify a user's needs and goals in order to function properly. It had to understand not just the surface-level meaning of a phrase, but the underlying meaning of linguistic expressions as well. Only then could you have a computer system that could collaborate

01:53

with a human being in a seamless way. So, in other words, what they're saying is that you could translate stuff or interpret stuff word by word, but unless you have an understanding of what the person is actually trying to accomplish, chances are the results you're going to get back are not going to be as relevant as they could be. And so that was where the psychology was starting to take form. By the early nineteen eighties, which

02:19

marks the third phase of NLP development, according to the researcher Karen Spärck Jones, who I talked about in the last episode, researchers were coming to terms with the idea that a scalable NLP system that relied upon the old methods of building lexicons and syntax rules just was not practical. It required far too much work on the front end when designing a system to make a general-purpose NLP application. The problem was just way too

02:48

big to take that approach. Even with relatively narrow implementations, like designing a system that would parse technical documents, you think, all right, well, the language used in technical documents is a subset of the language you would encounter in the quote unquote real world. Even with those use cases, the old methods were proving to require far too much investment in time, money, and effort on the design front. Spärck Jones identifies the key focus during this phase as being

03:19

on grammar and logic. During this phase, researchers developed several different grammar types. Now, grammars are sets of rules for analyzing and formalizing language. I would love to go into more detail about the different grammars that were developed during this phase or adopted for computational models, but honestly, it gets really, really heavy, really quickly. It gets extremely technical, though not on a technological side, but more on the

03:46

linguistic side. Suffice it to say that a lot of research and debate centered around the best way to arrive at the meaning of language. How do we get to that? How can you ascertain what is meant by what was spoken or what was written? The grammars were meant to direct NLP models to analyze language in different ways that were computationally viable and that wouldn't require the laborious process of programming everything in a

04:15

word for word style. Another big area of focus at this time was on generation, meaning creating models that would allow machines to generate natural language responses to users, including responses that were extended, long examples of discourse, not just a quick message. While machines wouldn't be able to think, they would be able to put together a more sophisticated response than chatbots like Eliza that I mentioned in the

04:43

last episode could manage. So the idea being, how can we make a machine that can communicate results to a person in a way that just makes sense. It's almost as if a normal human being is chatting with you. But as we understand it, it's very difficult to do this on an extended basis. You can do it for responses to individual queries, but when you start trying to create something that can carry on an actual conversation, that's

05:12

where things start to break down. In the nineties, work in NLP focused on representing words as mathematical vectors. Many words are related to one another, so for example, hotel and motel are related. They don't mean exactly the same thing, but they mean very similar things. Then you have a term like bed and breakfast. A bed and breakfast is similar again to a hotel or a motel. It's a different thing, but it's related. So these words

05:43

have similarities. They also have differences between them, but they're all more similar to each other than if I used a different word like hospital. A bed and breakfast is more like a hotel or a motel than it is a hospital. So in other words, we can group words together into vector spaces and calculate the quote unquote distances between vectors, and that determines degrees of similarity, and this is very helpful for both translation and natural language processing.
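The idea of measuring distances between word vectors can be sketched in a few lines of Python. This is only an illustration: the three-dimensional vectors below are invented for the example (real systems learn vectors with hundreds of dimensions from text), and cosine similarity is one common way to compare vectors, not necessarily the measure any particular system from the period used.

```python
import math

# Hypothetical toy vectors, made up for illustration only.
# Real word vectors are learned from large amounts of text.
VECTORS = {
    "hotel": [0.90, 0.80, 0.10],
    "motel": [0.85, 0.75, 0.15],
    "bed and breakfast": [0.80, 0.90, 0.20],
    "hospital": [0.10, 0.30, 0.90],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(word):
    """List every other word, most similar first."""
    target = VECTORS[word]
    others = [w for w in VECTORS if w != word]
    return sorted(others, key=lambda w: cosine_similarity(target, VECTORS[w]), reverse=True)


if __name__ == "__main__":
    for other in rank_by_similarity("hotel"):
        print(other, round(cosine_similarity(VECTORS["hotel"], VECTORS[other]), 3))
```

With these made-up numbers, motel and bed and breakfast land close to hotel, while hospital lands far away, which mirrors the grouping described above.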

06:12

There are ways to do this that even take context into account. And this relates back to what was being suggested by Warren Weaver when I talked about that memorandum. There's a model called skip-gram, which is essentially what he was talking about. This model takes a window of words surrounding each word in a sentence to determine context, so it's not looking at it just on a word-for-word basis. Let's say that I write a phrase and

06:42

it says, I'm going to the bank to make a withdrawal. Now, the word bank can actually refer to a couple of different things, right? It could be a financial institution, which is obviously what I do mean when I say that sentence. But it could also mean the area right next to a river, right, the bank of a river. The skip-gram model would take each word in that sentence and then pair it with a few other words that are close

07:07

by to determine the meaning of the phrase. So it's looking at "I'm going to the bank to make a withdrawal." For "bank," it might say "to, bank," "the, bank," "to, bank," "make, bank," "a, bank," "withdrawal, bank." By looking at these pairings, the system can figure out from context that the bank I'm talking about is probably a financial institution. I'm probably not making a withdrawal from a river bank. So it's a way of machine systems figuring out the meaning of a phrase through contextual cues by using this

07:42

windowed approach. And again, Warren Weaver had proposed such a thing. The vector approach would become more important as computer scientists made advances in neural networks. That approach also made machine translation much more effective because it no longer looked for word-for-word matches, but rather matched meaning

08:00

based on vectors and probabilities. That's really important because once you determine the meaning of a phrase in one language, then you can look for a phrase in another language that most closely resembles the meaning of the original. This is the art of translation. A real translator, someone who translates from one language to another, is probably not doing so word for word. Rather, they're doing meaning for meaning to make certain that the intent of what is

08:33

being communicated gets through, not just the vocabulary. The nineteen nineties, which Spärck Jones identifies as the fourth phase of NLP development, and the final phase in her report, saw a more concentrated focus on lexicons over syntax. It also saw more practical applications of natural language processing, as well as leveraging the World Wide Web to help train natural language processing models. There was a rich source

09:02

of natural language on the World Wide Web. Pretty much every permutation you could imagine, from people who are very careful in the way they construct sentences and paragraphs to people who are much more cavalier in the way they use language, whether purposefully or otherwise. And also, that report from Spärck Jones again is dated October two thousand one, so that's where her work stops for that particular report. But nearly two decades have passed since that time. So in that time,

09:34

what has changed? Well, I would argue we are now in a new phase of NLP development, one marked largely by the rise of a few key technologies. One of those is cloud computing. Cloud computing has removed the necessity to build complex capabilities into end machines like a smartphone or a computer terminal. So an organization can create a cloud infrastructure which consists of powerful machines and databases.

10:00

Those machines could be real, they could be virtual. Virtual machines are hosted on real hardware, but they're running virtual implementations of various operating systems. So these machines provide the processing power and they house the systems that are necessary to parse language and respond appropriately, So you can think of it as the brains of natural language processing. They all exist on these very powerful computers that are in

10:24

data centers. The widespread availability of the Internet and the fact that it's pretty easy to stay connected in many parts of the world make this possible. So the end user feels like the capabilities are actually housed on whatever device he or she is using, whether it's a smartphone or a computer. But in reality, all the work is actually taking place potentially thousands of miles away in a data center, and it's just being sent to you. The queries are being sent to the center and

10:54

the responses are being sent back to your device. Another big development that has helped significantly is the pairing of artificial neural networks with the process of deep learning. A neural network processes information in a way similar to how our

11:10

brains do it. Every node in a neural network represents a neuron, and it executes an operation upon data and then hands off this data, which has now been transformed by this operation, to another layer of neurons within the network, which do further processing, and so on and so forth. The system as a whole

11:32

can evaluate calculations and assign confidence levels to them. Deep learning passes information through numerous layers to transform data and, in the context of natural language processing, extract meaning from that information. Now I've got a bit more to say about natural language processing in general, and then after that I'm going to transition to talk about recent implementations like Siri, Alexa, Google Assistant, and Cortana. But first let's take a quick

12:00

break and thank our sponsor. In two thousand and sixteen, Google announced a system that could analyze syntax and recognize the various elements of a sentence, including verbs, nouns, adjectives, and other components. The system's name is sort of a snapshot of the zeitgeist. It was called, and I'm not making this up, Parsey McParseface. It really was. This is a parser, a piece of software that is meant to analyze inputs and determine what the relationships are between

12:43

various components within the input. So it's parsing out the meaning of a phrase by looking at the relationship between all the different components. It was designed specifically for English language inputs. In that same announcement, Google unveiled an open source neural network framework called SyntaxNet. SyntaxNet tags every word in an input with a part-of-speech tag, and the tag describes the purpose of that word: what purpose does it serve within the sentence, within the context

13:15

of that input. So, for example, it might be the subject of the sentence, or it could be an object of the sentence, or it might be the action, the root, that the user wishes to perform upon the object. So if it identifies a verb, that tends to be the

13:31

root of the command. The system also determines the syntactic relationship between all the words, so not just what each word's purpose is, but how that word relates to all the other words within the input, and then it creates a dependency tree which illustrates which words depend upon others. SyntaxNet also makes use of beam search. That's the strategy I talked about in the speech recognition podcast a couple of podcasts ago, and it is there to help eliminate ambiguity.
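To make the tagging and the dependency tree a little more concrete, here is a small Python sketch of how such a result could be stored. It is a hand-annotated toy, not SyntaxNet output and not SyntaxNet's API: the `Token` class, the sentence, and the labels (`nsubj` for subject, `dobj` for object, `root` for the main action) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int      # position of the word in the sentence
    text: str       # the word itself
    pos: str        # part-of-speech tag
    head: int       # index of the word this one depends on, -1 for the root
    relation: str   # how this word relates to its head


# Hand-annotated example sentence: "Alice saw Bob".
SENTENCE = [
    Token(0, "Alice", "NOUN", 1, "nsubj"),
    Token(1, "saw", "VERB", -1, "root"),
    Token(2, "Bob", "NOUN", 1, "dobj"),
]


def find_root(tokens):
    """Return the token that no other word governs (typically the main verb)."""
    return next(t for t in tokens if t.head == -1)


def dependents(tokens, head_index):
    """Return the tokens that depend directly on the given word."""
    return [t for t in tokens if t.head == head_index]


def render_tree(tokens, head_index=-1, depth=0):
    """Return the dependency tree as indented lines, starting from the root."""
    lines = []
    for token in dependents(tokens, head_index):
        lines.append("  " * depth + f"{token.text} ({token.pos}, {token.relation})")
        lines.extend(render_tree(tokens, token.index, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render_tree(SENTENCE)))
```

Storing the head index on each token is one simple design choice: the whole tree can be rebuilt from a flat list, and the verb at the root is easy to find when a system needs to know which action is being requested.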

14:05

As sentence length increases, the number of possible interpretations of that sentence also increases dramatically. Right, the more complicated a sentence is, the easier it is to misinterpret what that sentence means, especially if you're looking at it from the perspective of a machine. So how does the computer know which interpretation is the right one? SyntaxNet takes a sentence and starts to parse it, beginning with a left-to-right approach for English, so it starts at the

14:35

beginning of the sentence and works its way through. Essentially, it creates a hypothesis as to how the words relate to each other. But as it goes along, it detects possible alternate interpretations, so it starts to assign a probability score to each interpretation, essentially how sure it is that this is on the right track. And it will keep multiple possible answers as it parses, so it doesn't toss

15:00

them aside immediately. It says, all right, right now I'm pretty sure answer A is correct, but I'm going

15:07

to hold on to B and C just in case. Now, if one interpretation has a particularly low score and there are several other potential interpretations that have higher scores, the system will discard the low score with the assumption that it just can't be the right answer. It just doesn't make sense. In well-formed text, that is, formal text, something that has been written with a very formal approach, Parsey

15:31

McParseface does a pretty good job. In fact, a really good job. It has an accuracy rating that's approaching the level of a human linguist who is trained in parsing sentences. Humans who have that kind of training average only slightly higher accuracy, so Parsey McParseface is right behind them.
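The strategy of holding on to answers B and C while favoring answer A can be sketched as a tiny beam search in Python. This is a simplified stand-in: it tags words rather than building a full parse the way SyntaxNet does, and the `toy_scores` function and its probabilities are invented for the example.

```python
import math


def toy_scores(prev_tag, word):
    """Hypothetical probabilities for each tag of a word, given the previous tag."""
    if word == "her":
        return {"PRON": 0.6, "DET": 0.4}
    if word == "duck":
        if prev_tag == "DET":
            return {"NOUN": 0.95, "VERB": 0.05}
        return {"NOUN": 0.4, "VERB": 0.6}
    raise KeyError(word)


def beam_search(words, score_fn, beam_width=2):
    """Read words left to right, keeping only the best `beam_width` hypotheses."""
    beam = [((), 0.0)]  # each hypothesis: (tags so far, log-probability)
    for word in words:
        candidates = []
        for tags, log_prob in beam:
            prev_tag = tags[-1] if tags else None
            for tag, prob in score_fn(prev_tag, word).items():
                candidates.append((tags + (tag,), log_prob + math.log(prob)))
        # Rank every extended hypothesis and discard the low scorers.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    return beam


if __name__ == "__main__":
    for width in (1, 2):
        best_tags, best_score = beam_search(["her", "duck"], toy_scores, width)[0]
        print(width, best_tags, round(math.exp(best_score), 2))
```

With a beam width of one, the search commits to the most likely tag for "her" and ends up with a weaker overall reading. With a width of two, it keeps the second-best guess around long enough for the next word to show that it was the better track.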

15:52

But the key phrase there is well-formed text. If you present Parsey McParseface with more loosey-goosey language, such as what you might find on your average Internet website, which I know was redundant, Parsey McParseface has a much more modest ninety percent success rating. It's still impressive, but

16:12

it's a significant drop in accuracy. Now, these sorts of tools have been used in various Google products for a while, not just Google Assistant, which is the one that people tend to think about because it's the one we interact with when we are speaking to Google, but also in

16:28

stuff like Gmail. If you've used Gmail and you've noticed that sometimes you get automated responses popping up that you can choose as an option, so instead of writing an email, you just select "sounds good" or "I'll see you then," or whatever it may be, then you have seen this technology at work, or at least you've seen the product

16:46

of its work. Those automated responses are the result of a natural language understanding system that's parsing that email, identifying whatever the salient points are in the message, and then generating what are hopefully logical responses to it, so you can just choose that instead of taking the time to actually type something in. One of the key elements in natural language understanding is creating machines that can communicate with

17:09

us and explain how they arrived at a certain result. Now, this falls into the concept of transparency, which is really important when we are talking about artificial intelligence. There's a real fear that AI and neural networks are careening toward a black box scenario, and a black box describes any system where the workings of the system are hidden from our view. We cannot see how something works, and so we can only make guesses as to what's going on.

17:38

I know a lot of gear heads who are exasperated with the way vehicle manufacturers are creating more of their cars, trucks, and other vehicles with systems that aren't easily accessible or modifiable. They consider those cars to be black boxes. It makes it much harder to work on a vehicle if you don't have the proprietary tools and knowledge that are specifically

17:59

for that system. Now take that concept and apply it to AI, and it gets pretty scary pretty fast, particularly since we're relying on AI to do some important stuff like drive cars, make stock option deals, or help with healthcare issues, and so one area of work focuses on giving machines the capability to explain themselves, not just to provide an answer, but explain why they came up with that answer. So imagine a chess playing computer. It's playing

18:28

a game of chess and it makes a move. Then imagine being able to ask the computer, why did you make that move, and then the computer could actually answer the question, explaining the logic behind the move it made. Now extend that concept to all sorts of different AI applications. If an AI stock trader suddenly buys up a ton of stocks, you might want to know exactly what prompted that decision, why did it make that purchase? And you can easily imagine situations in which you'd want to know

18:56

why a machine behaved the way it did. Why did an autonomous car choose a particular route? Why did a healthcare program suggest a particular diagnosis? Without getting those answers, we're just putting our faith into machines blindly, and giving a computer the ability to generate meaningful and, equally important, relevant explanations would be extremely helpful. So what are some

19:20

of the uses of natural language processing technology? Well, one fairly simple application is in spelling and grammar checking software. If you've used a word processing program over the last few years, or the last couple of decades, chances are you're familiar with automatic real-time spell check and grammar check features. This is possible because of the work that has been

19:40

done in natural language processing. Spell check needs to take into consideration not only if a word is spelled correctly, if a word matches a word that's in the computer's lexicon, but also if it's the right word for that instance. In English, we have a lot of homonyms. Those are words that sound the same but have different meanings. Now you can have homonyms that are spelled exactly the same way, and those really aren't a problem because the

20:07

reader can pick up on what meaning you intended through context. Though, if you're using natural language processing to do a translation, then the NLP system needs to be able to determine which meaning the original author intended. In my earlier example about making a withdrawal at the bank, there's a homonym, you know, two versions of bank, but they mean two different things. I could also talk about bank in the sense of a verb, as in banking off

20:34

of something, but you get the point. There are also homonyms that sound the same but are spelled differently, and they have different meanings as well. So for example, the dreaded to as in T-O, too as in T-O-O, and two as in T-W-O combo. Those are three words with three different applications, three different spellings. A good spell check algorithm will be able to determine if

21:01

you've used the correct one in any instance. So if you write "that's two sweet" when you mean "that's too sweet," using the number two in word form, the spell check will give you the old heads up and say, I think you meant T-O-O, not T-W-O. Fun fact: I typed that sentence into Google Docs and it said, you're totes fine, bro. It didn't notice it at all. Grammar checkers have to be able to analyze sentence structure and word choice and compare them against the grammar programmed for

21:32

the system. This might also help determine if the word you used was the correct one. So, for example, affect versus effect. Affect is a verb: you affect something. Effect is usually a noun. It's typically the result of some action. So I could affect a drum, which is a dumb thing to say, and the effect might be that the

21:55

sound I played hurt your ears. Now, if you spell the word correctly and the spell checker is only comparing the words you type against a lexicon to see if there's a match, you might not get an indication that anything is wrong because the computer system is saying, well, that word is spelled correctly. It doesn't realize it's the wrong word. But if it has a way of checking grammar, it can also make sure you're using the right word

22:17

in the right context. Search engines such as Google use natural language processing to determine what it is you're looking for, right? So when you're typing in a search and you hit the search button, you might get a little notification that says, maybe you meant this other thing, or maybe you need to search for this terminology. That's a useful feature since not everyone thinks of search the

22:41

same way. I could tell a dozen people to go on Google and pull up information about Benjamin Franklin and the story about the kite, and those folks might go and perform their searches in twelve different ways. But the search engine's job is to return the best results based on the query, which means it needs to suss out what the searcher is actually looking for. So even if the twelve people all type twelve different ways of looking up this information about Benjamin Franklin and the kite story,

23:11

it should respond with the most relevant results. And maybe people get slightly different search results based upon the query, but they should be more or less the same. And it can also look out for you. It could give you suggestions for search terms, should you use an incorrect spelling or you approximate a spelling, or something like that. One of the areas of opportunity for natural language processing applications in the near future is handling the massive amounts

23:39

of information in big data applications. So, for example, a lawyer might want to search historical legal results using natural language to look for precedents that might help his or her case in the courtroom. A pharmaceuticals company might need to search information about clinical trials, doctors' notes, patient testimonials, and related information. And the amount of information represented by big data is truly astounding. It's enormous. It's way too

24:04

much for any human to sort through. So developing a method for computers to parse a query and return relevant results is highly desirable. For a computer to understand that context, "understand" in air quotes, and be able to give you results based upon your questions, that would be incredibly valuable for lots of different industries. And we started off talking about machine translation at the early stages of natural language processing. That's still a big area of research. Now

24:35

you can get real-time translation tools. You can use devices to translate from one language to another in real settings, including written language like signs. You can just hold a camera up and get an English translation of a sign that's written in another language, and of course vice versa.

24:51

That tends to be marketed as a tool for travelers, but it really shows the amazing progress we've made in natural language processing from the old days of word-for-word models for machine translation that were made back during the Cold War. Now, we've still got a long way to go with natural language processing. We've seen some incredible improvements over the past few years, but machines still don't actually understand what we're saying or what we're writing, not

25:15

on a conscious level anyway. Instead, they are able to refer back to rules, either explicitly stated as in the older NLP models, or those arrived at through deep learning. Now I'm going to take a quick break, but when we come back, I'll talk a bit about the history of the voice assistants we all know and love. But first, here's another word from our sponsors. All right, so now we understand a bit about the technologies that make voice

25:47

assistants possible, specifically speech recognition and natural language processing. There's obviously a lot more than that. The system can obviously process our requests or commands and return

26:00

a result using more traditional computational processes. So while the interpretation side relies on speech recognition and natural language processing, there's still a lot of regular computation work that has to happen for a personal assistant, digital assistant, voice assistant, whatever you want to call them, to be

26:23

able to respond to you. So let's take a quick stroll through the history of the major voice assistants out there, and I'm going to cover these in the order they were introduced to the public, more or less. I'm only focusing on the really big ones. There are lots of small ones out there, but I'm looking at the ones everyone's heard about. So that means the first one we get to talk about is Apple's Siri.

26:49

Apple unveiled Siri on October fourth, two thousand eleven, and to be fair, Siri existed before this. It was not an Apple creation. Siri was actually an app produced by an independent developer company called Siri Incorporated, but Apple gobbled up that company and brought them in house. And Apple had previously relied upon another speech recognition program called VoiceOver, which had been used in Mac products

27:14

and all iPhones since the iPhone 3GS. Siri would become available starting with the iPhone 4S. In this announcement, Apple pointed out that earlier implementations of voice commands required users to learn the syntax of the system. You had to follow a very specific set of rules in order to get anything to work. So you'd give a command defined by the system. So for example, you might say "call Mom" or "play Once in a Lifetime." You had to take this very structured approach to whatever it was

27:47

you wanted to do. But that requires the user to actually adhere to rules created by the architects of the system, right? So Siri was meant to be different. It was meant to be able to understand what you wanted on your terms, not based off a strict set of rules. Apple said that Siri would be able to interpret what you meant

28:05

and would return relevant information to you in response. In the unveiling, they said that Siri is, quote, your intelligent assistant that helps you get things done just by asking, end quote. During that demonstration, they showed off how Siri could parse different phrases that had the same underlying meaning. The example they gave originally was "What is the weather today?" and then they asked that same question five or six

28:30

different times. Scott Forstall, vice president over at Apple, showed off how you could get the same weather information by asking it in these different ways. Then they showed off how Siri could interoperate with other apps, such as Apple's maps feature, or through a partnership they had with Yelp. Siri could take a request, parse it, interpret it, send the appropriate request to the appropriate destination, and

28:53

then serve up the response. The destination could be a web search, or it could be an action within a compatible app. You get the idea. So that's Siri. Next: on July ninth, two thousand twelve, Google released Android Jelly Bean, a.k.a. Android 4.1, and one of the features included in that operating system update, at least for certain hardware upon release, was an offshoot of Google Search called

29:18

Google Now. This feature would serve up predictive cards containing information that the system had flagged as potentially being useful to you based off your activity. So let's say you spend a lot of time searching for stuff like baseball scores. Google Now would start serving you up cards that would give you scores from previous games before you could even search for them. You would just look at Google Now and you could scroll through and see what the

29:43

latest results were. Then you could actually scroll through the different cards, all of which were slowly dialing you in as a person, which was kind of creepy. And it relied a lot on natural language processing and your activities. Now, Google Now was not a voice assistant. This was sort of a one way relationship. Google was analyzing information based on your activity and then serving up information to you

30:05

that might be useful. But over time the company would phase out Google Now, and it gradually evolved into Google Assistant. There was also Google Voice, which allows you to do things like voice search, so that also became incorporated into this. Google Assistant is a lot like Siri. It responds to voice. It can respond to anaphora, meaning it can keep track of subject matter and respond to follow-up questions that

30:27

don't contain an explicit reference to the subject. So, for example, you could ask Google Assistant, what is the weather going to be like in Atlanta? And then after you get a response, you might say, what about in Seattle. Now, you have not explicitly said what is the weather in Seattle? You just said what about in Seattle. However, Google Assistant can infer that you are still talking about the weather,

30:50

only now within the context of a different location. Google Assistant debuted in May, so in a way, this particular entry in our timeline spans two other debuts, because we had Google Now on one side and then Google Assistant later, but I figured it was important to acknowledge how Google Assistant grew out of the older Google Now feature. On April second, two thousand fourteen, Microsoft introduced its own voice assistant

31:15

at the Build Developer Conference. Microsoft's entry is named Cortana, after the AI character from the Halo series of video games. Microsoft integrated Cortana to work with Windows 10, Xbox One, Windows Mobile, and a few other platforms as well, including apps that were meant for other operating systems like iOS and Android. Cortana's US voice is that of Jen Taylor. She actually is the voice actress who provided the voice for

31:42

the character of Cortana in the Halo games. That's kind of fun, and like Siri, Cortana can interface with apps as well as perform web searches. In November, Amazon got into the game with Alexa and the Amazon Echo. Through Amazon Echo, Alexa can serve not just as a voice assistant that can retrieve information and play streaming media and that kind of thing, but also as an interface in home automation applications, and to be fair, so can Google

32:08

Assistant through devices like Google Home. So you can use Alexa to interface directly with systems in your home, if they are compatible. And not surprisingly, Alexa can interface with Amazon's ordering system, allowing users to order products from Amazon

32:22

directly by speaking to Alexa. No shock there. According to Amazon, developers were inspired by the Star Trek series of shows, in which characters would speak out loud to computer systems and call for information or send commands to make various stuff happen. Amazon also released a developer kit to allow independent developers to create what are called Alexa skills. There's an old episode of TechStuff where I interviewed some folks from

32:45

Amazon to talk about this process. But essentially, developers will submit skills to Amazon, which can then publish those skills and allow anyone who has an Alexa-enabled device to activate those skills and make use of them. Individuals can even build their own personalized skills using a tool called Blueprints, which Amazon introduced in April. Now there are other examples I could point to. There's Samsung's Bixby,

33:09

which it introduced in March. There's SoundHound's virtual assistant called Hound that launched in March. But these were the ones that I really hear about the most frequently, so they were the ones I wanted to kind of cover, and they all work on a similar principle. The implementations are all particular to their specific brands, but they work under similar foundational principles of natural language processing, speech recognition, et cetera. And it's all about converging technologies

33:41

that took decades of hard work to make possible. Now I want to thank listener Nate, who was the one who set me on this trail by asking about speech recognition and natural language processing and these voice assistants. It was really interesting to dive into, very, very cool, fascinating stuff. Thanks a lot, Nate. If any of you out there have suggestions for future episodes of TechStuff, maybe it's a technology or a company or a person in tech. Maybe there's someone I should interview or have on as

34:10

a guest host, send me a message. The email address is techstuff@howstuffworks.com. Or drop me a line on Facebook or Twitter. The handle at both of those is TechStuffHSW. Don't forget to follow us on Instagram, and I'll talk to you again really soon. For more on this and thousands of other topics, visit HowStuffWorks.com.

Transcript source: Provided by creator in RSS feed