Get in touch with technology with TechStuff from howstuffworks dot com. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with HowStuffWorks and I love all things tech. Listener Nate wrote in and asked that I do an episode about personal digital assistants or virtual assistants or voice helpers.
This is hard because we don't really have a great term for these things, but I'm talking about applications like, and I apologize ahead of time if I activate your technology, Siri, Alexa, and Google Assistant. These are the sort of voice helpers that can respond to voice commands, as well as other means of input, in a way that makes them
seem almost intelligent. Now, as it turns out, that's actually a pretty complicated history because it requires a discussion about a lot of different connected ideas that were all
independent and then ultimately converged. We're talking about stuff like speech recognition, natural language processing, technology that was meant to improve accessibility, and a whole lot more. So it makes talking about these services somewhat challenging, because it's not like there was just one pathway that led to their development. They exist largely because of these independent but converging areas
of innovation. Much of the work that made these services possible took place concurrently, with different organizations all working towards similar but disconnected goals. So going by a strict timeline approach would be really hard, if not impossible, just because you have to jump around a lot to talk about different advances. So today I'm going to focus solely on speech recognition. This in itself is a huge topic, so it's more than enough for
a single episode of TechStuff. In the next episode, I'm going to dive more into natural language processing, which has some crossover with speech recognition, but it is its own thing. And then after that we'll take a look at how voice assistants like Siri and Alexa popped up over time. First, the idea of creating a machine that
could interpret speech is older than computers. If you listened to my episodes about the history of the turntable, you'll remember the phonautograph, designed by Édouard-Léon Scott de Martinville in eighteen fifty seven. The gadget had a small brush that was attached to a parchment diaphragm, and the bristles on the brush rested against a sheet of paper that itself was wrapped around a cylinder. On top of the
sheet of paper was a layer of soot. So to operate the device, you would turn the cylinder, the brush would drag across the soot on the paper, and you would shout at the diaphragm. The vibrations of sound would cause the parchment diaphragm to vibrate. That would make the brush vibrate and move against the paper, and that would create a pattern corresponding to the vibrations that were made by the parchment diaphragm. The phonautograph was supposed to aid
in the study of language and sound. The machine itself was not intended to interpret sound, but rather facilitate interpretation. A human would take a look at these tracings essentially and be able to analyze sound, or at least that was the intent. It didn't quite work out that way, but that was the concept behind it. Now, let's set
the Wayback Machine to the nineteen fifties. In nineteen fifty two, Bell Labs created the Audrey system, which was not a mean green mother from outer space, but rather the first documented speech recognizer system. It was an analog system, not a digital one. It was its own dedicated massive circuit, and it even had vacuum tubes in this thing, because this is before the transistor came into widespread use. It could recognize strings of digits spoken by its creator with about ninety percent accuracy.
If anyone else tried it, the accuracy dropped a bit. This already shows that speech recognition is tough because not everyone says things exactly the same way. I know that's not a news flash, but it is important for the concept of speech recognition. You also had to pause between strings of numbers. You couldn't just rattle them off conversationally. You had to put pauses in there. It was also an enormous piece of machinery. It took up a six-foot-high relay rack and it consumed a lot
of electricity. Then Big Blue, also known as IBM, had scientists and engineers working on the possibility of designing technologies that could recognize speech. They were working around the same time that Bell Labs was. Computer scientist Nathaniel Rochester, who designed an IBM computer called the seven oh one and also wrote the first assembler, headed up a group of engineers at IBM who were researching pattern recognition and
information theory. That work, which was early research into fundamental building blocks for artificial intelligence, would also become important for speech recognition. In the late nineteen fifties, William C. Dersch, another IBM computer scientist, developed a computer system as part of IBM's Advanced Systems Development Division laboratory, and it incorporated basic elements of speech recognition. He unveiled the device, called the IBM Shoebox, in nineteen sixty two at the World's Fair.
Using a microphone, you could speak basic digits from zero to nine, and also six additional control words like plus or minus, and the Shoebox would recognize the words and perform calculations. So essentially this was a basic voice-controlled calculator. While the application was limited, this showed off a remarkable achievement. Finding a way to program a machine to accept speech
as a command is a non-trivial problem. Throughout the nineteen sixties, computer scientists took a brute force sort of approach to solving speech recognition, which could work in very narrow applications such as the calculator approach, but was by its nature difficult to scale up. Even in the early nineteen seventies, the Speech Understanding Research Project from ARPA, the same organization that would help bring the Internet
into being, produced a brute force, template-based system called Harpy. While it was reliant upon brute force, Harpy, which came out of Carnegie Mellon research, could recognize about one thousand words. Harpy also made use of a process called beam search. This is a search strategy in which a search algorithm can consider multiple possible hits at a single time, rather than looking through a large data set for a specific perfect hit. Then the algorithm would determine the probability of
each of the hits being the right word. The number of potential hits is determined by a value called the beam width, a setting the speech recognition application designer can set. Beam search is a much more efficient way to suss out speech, and it's frequently used today, not just in speech recognition but also in natural language processing and other sequential models. But it gets super technical, so we're gonna leave it at that kind of high level approach.
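For anyone following along in text, here is a minimal sketch of the beam search idea just described. It is an illustration only, not how Harpy was actually built: the function name, the word choices, and the probabilities are all made up, and each step's probabilities are assumed to be independent of the earlier words, which a real recognizer would not assume.

```python
import math


def beam_search(step_probs, beam_width):
    """Keep only the `beam_width` most probable partial sequences at each step.

    `step_probs` is a list of dicts mapping a candidate word to its probability
    at that position (hypothetical numbers, not output from a real recognizer).
    """
    beams = [((), 0.0)]  # each beam is (words so far, summed log probability)
    for probs in step_probs:
        candidates = [
            (words + (word,), score + math.log(p))
            for words, score in beams
            for word, p in probs.items()
            if p > 0
        ]
        # Sort best-first and throw away everything outside the beam.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Made-up candidate hits for three spoken words.
steps = [
    {"turn": 0.6, "burn": 0.4},
    {"the": 0.7, "a": 0.3},
    {"volume": 0.8, "column": 0.2},
]
results = beam_search(steps, beam_width=2)
best_words, best_log_prob = results[0]
```

A wider beam keeps more candidate hits alive at each step, which costs more computation but makes it less likely that the right word sequence gets discarded early.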
But these systems still mapped all words to a template, one template per word. They didn't break words up into sounds, but looked for a match against a database of established vocabulary words, which meant that if you did not pronounce the word the same way as it was represented in the database, you might not get a hit. You would have to get it close enough to that template for you to be able to get a hit. This is a big problem. People speak with accents or dialects, or
they may have difficulty replicating certain sounds. The brute force approach often meant you'd have to say the same word a few times with clear enunciation and long pauses to get a hit. And again, it just didn't scale very well. It wasn't until the late nineteen seventies that computer scientists were able to find a different approach that would power more modern speech recognition systems. And let's go through some of the steps that are necessary, from the
basic physical attributes of speech to the processing of the information. First, speech, like all sound, ultimately is a physical phenomenon. It is vibration. We produce these vibrations with vocal cords and our lips, teeth, and tongue according to the rules of whatever language we are speaking. These vibrations travel through a medium such as the air, and then they get picked up by something else,
like someone else's ears or a microphone or whatever. But at this stage we're talking about physical vibrations, an analog form of input. Computers do not directly interpret physical vibrations. Computers process digital information, and speech is an analog phenomenon. So the first thing we need for a computer to recognize speech is some sort of analog-to-digital converter that can accept the analog information and then translate it
into digital information. The ADC would typically sample speech by taking precise measurements of the sound at frequent intervals, such as thousands of times per second, so you can almost think of it like snapshots, like pictures. The ADC is measuring quantifiable elements of the sound every time it takes a sample. That might include stuff like amplitude and frequency, or volume and pitch
if you're talking about how we perceive sound. There's usually some sort of noise filter incorporated into this step as well to help remove any unwanted sounds from the signal. The system has to be able to recognize which signals represent a command and which ones are not important. This is why I can do stuff like send vocal commands to a voice assistant, even if there's another conversation going on nearby, or if I have the radio or television on. Now,
I have a lot more to say about the technology that makes speech recognition possible, but before I get into that, let's take a quick break to thank our sponsor. So, a speech recognition system typically has a database of sound samples that will allow the recognition system to compare incoming
signals against that database. The speech recognition system might have to put the incoming sound through a process called temporal alignment, which is a fancy way of saying the system might have to slow down or speed up the incoming sound. You can think of this as like making a recording
and then almost immediately playing the recording back. Obviously, the speech recognition system can't change the speed at which you're speaking, though you might get a feature that prompts you to slow down or speed up. The message may say, could you say that again, but slower, that kind of thing. If you happen to be someone from the Northeastern United States, for example, you may frequently get these messages saying slow
the heck down. Temporal alignment allows the speech recognition system to look for matches between the incoming sound and the samples in the system's memory. The system must also divide up the sounds in the incoming signal into segments that represent specific sounds in the native language, such as the hard T sound. It looks for matches in its memory that represent phonemes, and a phoneme is a basic sound native to a specific language, to
a particular language, whichever one you're looking at. So, for example, the English language has about forty phonemes. Linguists actually get into some pretty vicious fights about exactly how many phonemes the English language has, but it's around forty. Some people argue that there are more phonemes; some say that some of the supposed additional phonemes are in fact repeats of existing ones. Other languages, though, will have a different number
of phonemes in them. Some may have far more than English, some may have fewer than English. The system then has to analyze the phonemes in sequence. So it's looking at these little markers that represent different sounds, and this is how the system can look for matches between a series of phonemes and the words that it can recognize. It can try and build words from these sounds. This
is way harder than I'm making it sound. Speech recognition systems have complicated statistical models to help them determine what a word might be. Even a simple speech recognition system will have a complex statistical model to recognize individual words. More sophisticated systems might also look at contextual information surrounding the phonemes. In other words, a really sophisticated system isn't just looking for a match in phonemes to suss out
what a single word is in a sentence. It's looking at the phonemes that came before and after to determine what those words were and to help increase the confidence level overall. So let me give an example. Let's say I have activated one of these voice assistants, and I've used whatever voice command activates it. I'm not going to do it here because some of you might be listening on those devices. And then I say turn the volume up
thirty percent. The speech recognition system begins to parse what I said by analyzing those sounds phoneme by phoneme, identifying them, analyzing them, trying to group them together to form words. When it thinks it's found a word, it assigns a certain probability to that, and when it starts to analyze the phonemes that make up the word volume, it's also looking at the words that came before, turn the, and it's looking at the words that
came after, up. That boosts the system's confidence overall that the keyword volume is in fact volume, and then it does what I told it to do. When I talk about confidence, I don't mean the system feels good about itself. I'm talking about probabilities. These systems largely work in the realm of probabilities. What is the probability that I said volume rather than some other word? For a speech recognition system to work, it needs to be able to assign a
confidence level to words. The higher the level, the more certain, quote unquote, the system is that it got things correct. Typically, computer engineers will design systems that will only execute a command or return a result of some sort if the system has reached a certain threshold of confidence, and if it hasn't, you won't get a result. So, for example, and this isn't about speech recognition exactly, but it illustrates
my point. IBM's Watson computer would not offer up an answer on Jeopardy unless it met a certain threshold of confidence in an answer, and I think it was about eighty percent. So if it was eighty percent certain that it had the right answer, it would buzz in. But if it was less than eighty percent sure, it would not put forth that answer. There are two broad types of statistical models in speech recognition systems today. There are others that could be used, but there are two broad
ones that tend to be used these days. They are the hidden Markov model and neural networks. The hidden Markov model, by the way, is overwhelmingly the most popular statistical model in speech recognition. It is the prevalent approach, and it works sort of how I just described. It looks at each phoneme and starts
to build out a pathway. If you think of this as like an actual physical path that you're following, you would start off with the first phoneme, which represents the beginning of the path, and that phoneme might eliminate other possible phonemes right away. By that, I mean it might be a sound that doesn't combine with certain other sounds within that language. There might be a phoneme that does not combine with other specific phonemes.
So imagine you have a path and originally it splits into tons of other pathways, but a couple of those pathways are blocked off with signs that say the pathway is closed. It's closed because those pathways represent phonemes that would never be paired with the initial one. You just don't get that sound in English. The closed paths would therefore be off limits, and only the open paths would be possibilities. Then the hidden Markov model would look at the next phoneme, the next step along
this pathway. That phoneme determines which of the viable path options is actually the one to follow. All the other options would be discarded, and so on. It would go all the way down the list of phonemes until the model arrives at a conclusion of the most likely word that was spoken. It assigns a probability score to each phoneme, thinking, I'm pretty sure the sound that I heard, quote unquote, was this. That helps the system make an educated guess as to what word was
actually spoken. Now, I've talked a lot about neural networks in the past. I'm just going to give them a quick cursory covering here, because they really aren't the dominant statistical model in speech recognition. Neural networks have nodes, computer nodes or algorithms, that act like a neuron, like a brain cell, and they execute operations
on data. The neurons also assign a probability score to that operation, which shows the system's confidence in the result, before they pass it on to another neuron in the network, which then executes another operation on the data, and so on. Ultimately the network produces an end result of all those operations and judges the probability of whether or not that result is the right one, and again, if it meets a certain threshold, then it's considered the correct answer or
the closest to correct that the system can manage. In any case, speech recognition systems have to be trained, and there are trillions of potential combinations of sounds that could represent different words. In the HowStuffWorks article How Speech Recognition Works, Ed Grabianowski, who is one of the powerhouses of the site and has written some of the best
articles on HowStuffWorks, gave a great example. He says, take the phrase recognize speech. The phonemes in that phrase happen to be pretty similar to a totally different phrase, which would be wreck a nice beach. So you have recognize speech or wreck a nice beach. The speech recognition software has to be able to determine the difference, or else the next thing you know, you're gonna have terminators
kicking sand in everyone's face, and that's no good. Alexander Waibel, who worked on that system called Harpy that I mentioned earlier, had another couple of examples. He said you might say euthanasia and get the result youth in Asia. Or you might say give me a new display and get the result give me a nudist play. If you've ever used something like Google transcripts, where if you had Google Voice and you were reading the voicemails,
you could get hilarious results because of this. The speech recognition, the speech to text feature, could end up spelling out truly ridiculous messages. I would get messages from my mother, and I only wish my mom would leave me messages the way that Google transcript thought she was leaving me messages, because they were the most crazy messages ever. But it's mostly because my mom has a Southern accent and so Google would often misinterpret what she was saying, so these
systems have to undergo hours of training. John Garofolo, a computer scientist who was cited in that HowStuffWorks article, had this to say. These statistical systems need lots of exemplary training data to reach their optimal performance, sometimes on the order of thousands of hours of human transcribed speech and hundreds of megabytes of text. These training data are used to create acoustic models of words, word lists, and
multi word probability networks. There is some art into how one selects, compiles, and prepares this training data for digestion by the system, and how the system models are tuned to a particular application. These details can make the difference between a well performing system and a poorly performing system, even when using the same basic algorithm. Speech recognition
also requires a decent amount of processing power. This was a limiting factor on speech recognition for a really long time. Systems were limited in their capabilities, which meant that for years, if you wanted to incorporate speech recognition in a computer system, most of the computer's processing power would have to be dedicated just to parsing speech. You couldn't do
much else on that machine. But since Moore's law has held up so well for decades, we got to a point where the processing capabilities of machines reached a stage where this isn't as big a concern. And another development that Google really helped pioneer definitely changed things. I'll talk more about that in our next section, but first let's
take another quick break to thank our sponsors. Okay, so advances in speech recognition in the late nineteen seventies paved the way for how most systems work these days, though of course the models have undergone multiple refinements and tweaks over time. The first speech recognition product to ever launch for consumers was a program called Dragon Dictate. The original version of Dragon Dictate, that is, because they still come out to this day, relied on discrete
speech recognition. Now, I don't mean you had to be secretive and hush hush about it. It's not that kind of discreet. Rather, I mean you had to pronounce each word clearly, with a pause between words. You could not speak conversationally, or the dictation software could not interpret what you were saying, so using the software would sound like this. It was limited and it was primitive compared to today's speech recognition products, but it was a groundbreaking product in
the early nineties. And it also cost somewhere between six thousand and nine thousand dollars. I saw differing accounts, but that would be between nine and fourteen grand in today's dollars, so a pretty expensive software package. Dragon still produces speech recognition technologies to this day, and of course they are much more adept at recognizing and transcribing speech than the original version was years ago. The software is also less expensive.
One version I saw retails for less than a hundred dollars, so a nice, big, deep price cut. Advancements in model design and processor speed meant that speech recognition technology advanced rather quickly. BellSouth released VAL, V A L, the voice portal. VAL was an automated interactive system that could respond
to questions over the phone. This was a basic implementation that would evolve over time into the systems you may have encountered when calling up automated menus, where it's press three or say three and that kind of thing, or do you have any questions, you can say anything from check my balance to, you know, that kind of stuff.
In two thousand five, DARPA, which is the same branch of the Department of Defense that used to be known as ARPA, so in other words, the same R and D arm that funded the creation of the Internet, funded a program called the Global Autonomous Language Exploitation project, or GALE. The purpose of this project was to advance research and development into automated
translation between languages. So not only were computers supposed to be able to recognize speech, but also translate that speech from one language into another, which adds another layer of complexity on top, right? Well, according to SRI International, the system should be able to, quote, automatically take multilingual newscasts, text documents, and other forms of communication and
make their information available to human queries, end quote. So it wouldn't just translate the information, which was already even more complicated than speech recognition. It could also index that information in a meaningful way so you could search for stuff. So layer upon layer of complexity for that project. The things that helped push speech recognition, as well as natural language processing, to new heights largely came from two competing companies,
Apple and Google. So let me explain that. In two thousand seven, Apple introduced the iPhone, which was the first truly successful consumer smartphone, especially here in the United States. The smartphone introduced a new era and form of computing. It created countless opportunities in numerous areas, including location based computing, mobile interactions, and speech recognition. The computer was in a
phone form factor. Phones are designed for us to talk into, so now you could walk around carrying a computer that was designed to transmit your voice. It was only a matter of time before someone figured out a way to leverage that for speech recognition. Google, meanwhile, was pioneering an approach to what would perform all the processing functions necessary to
support speech recognition. It was doing it in the cloud, so instead of the device itself having to do all that processing, the device would have a persistent connection to a server on the Internet, and the server would do the work. The device would just send the signal to the server. The server would process and analyze the signal and return the result back to the phone, and the phone was just acting as a transmitter. It wasn't
really having to do any of that analysis itself. So in two thousand and eight, Google launched the Google Voice Search app for the iPhone that would do all this speech recognition processing. You could speak into it and have Google search for whatever it was you were saying. But again, what was really going on was that the app was sending that speech signal over to a server that Google operated, which would then send the results back down
to the phone. But to the user it looked like the phone itself was doing all the work. The truth was it was simply a very basic application of true cloud computing, and that created a new method of rolling out speech recognition in apps and services. No longer did you have to worry about creating a really powerful piece of equipment. You could have that be on the back end. The piece of equipment the user had could be
a relatively underpowered terminal, essentially. Meanwhile, it also meant that Google could collect enormous samples of data, not necessarily to market to people or to identify specific individuals, but rather it could collect a lot of data for training its
speech recognition and natural language recognition models. Google could build out a much more robust model of human speech patterns because they had thousands of real world uses going on in real time. They could keep using that to build out and bolster their models, and that improved Google's speech recognition accuracy. Today, major speech recognition platforms typically have an error rate below five percent, which is pretty darn impressive.
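To make a figure like that concrete, here is a small sketch of one common way to score a recognizer, word error rate. This assumes the error rate mentioned above is a word error rate, which the episode does not actually specify, and the function name and sample sentences are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    `reference` is what was actually said and `hypothesis` is what the
    recognizer produced. Both are plain strings split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of six spoken words.
wer = word_error_rate(
    "turn the volume up thirty percent",
    "turn the column up thirty percent",
)
```

Under this measure, a rate below five percent would mean fewer than one word in twenty is substituted, dropped, or wrongly inserted.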
According to a comScore estimation, by twenty twenty, half of all searches on the Internet will be voice searches. So speech recognition, along with natural language processing, could lead to a future of ambient computing in which the environments we move through are effectively computer interfaces, and we can access them through voice commands and other ways of commanding, maybe gesture commands. But that seems like it might be better saved for our episode about voice assistants and where we're
headed with that technology. In our next episode, I'm going to really explore natural language processing, how it works, and how that field of research has evolved over the last few decades. It's also really fascinating, and it does, in fact cross over quite a bit with speech recognition. But natural language processing goes beyond speech. It also includes text,
and that will be our next episode. But if you have a suggestion for a future topic I should cover on TechStuff, send me a message and let me know about it. The email for the show is techstuff at howstuffworks dot com, or you can drop me a line on Facebook or Twitter. The handle for both of those is TechStuff HSW, and you can also follow us on Instagram. I would love it if you did, and I'll talk to you again really soon.
For more on this and thousands of other topics, visit howstuffworks dot com.
