How Smart Speakers Work

Speaker 1

00:04

Welcome to TechStuff, a production of iHeartRadio's HowStuffWorks. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio and I love all things tech, and guys, stick with me. I am fighting off a cold. You'll be able to hear it in my voice, I have no doubt. But you know, I wanted to get you guys a brand new episode. So we're gonna fight on

00:32

because the show must keep going. I think that's the saying. Oh no, this cold medicine is good, though. All right, anyway, I thought that we would do an episode about smart speakers because I wanted to kind of start this whole episode off with an old man observation, you know, a get-off-my-lawn kind of thing. And this is from our resident old man, Old Man Strickland. That meaning me. So, when I was young, speakers

01:01

were dumb. Now, I don't mean that speakers were useless, or that they were terrible, or that they were incapable of replicating certain frequencies or volumes of sound, or that they were limited in some other way, other than that they didn't, quote unquote, think. They didn't connect to any sort of computational engine in a meaningful way. You might have a set of speakers plugged into a computer, but that was just a one-way communications tool, right?

01:27

It was just a way to provide an outlet for sound that your computer was generating, nothing more than that. But contrast that with today, when we have numerous smart speakers on the market. These speakers act as a user interface between us and the Internet at large, often facilitated by a virtual assistant of some kind. Now with these speakers, we don't just listen to stuff like music and podcasts and the radio and you know, other traditional audio content.

01:57

We use them to find out information. We might link them to our calendars so that we can get reminders for upcoming appointments. We probably use them to ask about the weather report. I use mine at home for that all the time, or even more often than that, if you're at my house, you'll hear us use it to find out which foods are safe for us to feed to our dog. My doggie, Tibolt, absolutely loves our smart speaker because it frequently gives us permission to spoil him

02:24

with a carrot or a piece of banana. But how do these smart speakers work? How are they able to respond to our requests? And what are their limitations? How safe are they? That's the sort of stuff we're gonna be looking into in this episode of TechStuff, and we'll start off with the basics, which means we have

02:44

to start off with how speakers work in general. Now, this is something that I've covered before on TechStuff, but I want to go over it again from a high level because, well, I just find it fascinating that people figured out how to harness electricity to drive a motor so that it could in turn cause components to

03:02

replicate a recorded or transmitted sound. And really, motor is being too generous, but to drive an element to create vibrations that could replicate a sound that was made into another component, that whole thing just boggles my mind, that people are smart enough to figure that out. Okay, so to understand how speakers work, it first helps to understand how sound itself works. Sound is a physical phenomenon. Do do do do?

03:29

Sound is all about vibrations, and typically we experience sound when we pick up on changes in air pressure that enter through our ear canal and then affect the tympanic membrane, or eardrum. So it's all about these changes of air pressure, all about air molecules transmitting vibrations from a source outward in a radiating pattern from that source. So let's think of someone knocking on a door. For example, you're inside a house, someone's knocking

03:59

on your door. When that person's hand hits the door, it causes the door to vibrate, and that vibration transmits to the surrounding air molecules on the other side of the door. They get pushed by that vibration and then pulled when the wood is vibrating back towards its original position. So the air molecules vibrate, those air molecules cause the next surrounding layer of air molecules to vibrate as well, and so on and so forth. It's like

04:29

a cascade or domino effect. You get these little pockets of high and low air pressure that travel outward from that door. It spreads out as it travels any distance, and if you are close enough that you can still detect those changes in air pressure, you experience this by hearing the knocking on the door. Those vibrating air molecules lose a bit of energy as they

04:56

move outward, right? As they pass the vibration to the next layer, you lose a bit of energy with each transmission. So the sound gets quieter the further away you are because there's not as many air molecules vibrating; its amplitude has decreased. So if you are in hearing range, you can pick up on those changes of air pressure when they encounter the tympanic membrane in your ear canal. Those changes in pressure will cause a reaction in your middle and inner ear that will ultimately

05:25

get picked up by your brain, which interprets it as sound. Now, the frequency at which those fluctuations occur relates to the pitch that we hear, so faster vibrations are higher pitches, higher frequencies, higher notes, if you think of a musical scale. We perceive the force of the changes as volume, so lower forces, lower volume, right? And higher forces, higher volume.
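As a rough illustration of the pitch and volume point above, here is a minimal Python sketch. It treats a sound as a pure sine tone, where the frequency sets how many cycles happen per second (pitch) and the amplitude sets how large each swing is (volume). The sample rate, the two frequencies, and the function names are assumptions chosen for the example, not anything from the episode.

```python
import math

SAMPLE_RATE = 8000  # samples per second; an assumed value for this sketch


def sine_wave(frequency_hz, amplitude, duration_s=1.0):
    """Return samples of a pure tone, i.e. a pressure swing over time."""
    n = int(SAMPLE_RATE * duration_s)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * i / SAMPLE_RATE)
        for i in range(n)
    ]


def count_cycles(samples):
    """Count upward zero crossings, which approximates cycles in the clip."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)


def peak(samples):
    """Largest swing away from zero: a stand-in for loudness."""
    return max(abs(s) for s in samples)


# A low, quiet tone and a high, loud tone (illustrative numbers only).
low_quiet = sine_wave(220, 0.2)
high_loud = sine_wave(880, 0.8)

print("cycles per second:", count_cycles(low_quiet), "vs", count_cycles(high_loud))
print("peak amplitude:", peak(low_quiet), "vs", peak(high_loud))
```

The second tone packs about four times as many cycles into the same second (higher pitch) and swings four times as far from zero (higher volume), and the two properties can be changed independently.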

05:55

The human ear can hear a pretty decent range of frequencies, from twenty hertz, which means twenty cycles or twenty waves per second past a given point of reference, to twenty kilohertz. That's twenty thousand cycles or waves per second. So yeah, the cycle refers to the frequency of the sound wave. The lower the frequency, the lower the pitch. All right, and then our brain has to make meaning of all this, right? It's not just that it's picking

06:23

up on it. Our brain interprets this and we experience it as a sound we have heard. So it either matches this perceived sound with one we've encountered before, and then we say, oh, I know what that is, that's someone knocking at the door, or it might be, Holy Cala, I've never heard that sound in my life, I have no idea what it is. If the sound is language, then our brains have to derive the meaning from the perceived sound. We've heard someone say words, such as you're

06:56

hearing me say this. Then our brains have to take that collection of sounds and say, what does that actually mean? What is the context, what is the intent? What is the message here? Otherwise it would just be, you know, random noises that I'm making with my mouth. Alright, so we have a basic understanding of the physics of sound. Now to talk about speakers and microphones, and the reason I'm going to talk about both of them is that

07:24

the devices complement one another. You can think of one as being the other in reverse. Plus, with smart speakers we have to talk about microphones anyway, because smart speakers have microphones as well as the speaker element. So you can think of this as one long process of taking the physical phenomenon of sound waves, transforming that physical phenomenon into an electrical signal, taking the electrical signal, and changing it back into something that can produce the sound waves that

07:53

started the whole thing. So you're replicating the original sound waves with this end device, which in this case is a loudspeaker. So the microphone is the part of the process where you take the sound and you turn it into an electrical signal, and the speaker is where you take the electrical signal and you turn it back into actual sound. That's the simple way. But what's actually happening? Well, let's talk about it on a physical level. Sound waves go into

08:18

a microphone. So you've got these fluctuations in air pressure that encounter a microphone. I'm speaking into a microphone right now, so this is happening right now. Inside the microphone is a very thin diaphragm, typically made out of a very flexible plastic, and it's sort of like the skin of a drum. So as the changes in air pressure encounter the diaphragm, they cause the diaphragm to move back and forth. Well.

08:45

Attached to the diaphragm is a coil of conductive wire, and that coil wraps either around or near a permanent magnet. Magnets have magnetic fields. They have a north pole and a south pole, and there's a magnetic field that surrounds the magnet. And the electromagnetic effect means that if you move a coil of conductive wire through a magnetic field, it will produce a change in voltage in that coil, otherwise known as electromotive force, and that means electrical current

09:19

will flow through the coil. Now, if you have the end of that coil attached to a wire, a conductive wire for that current to flow through, you can send that current on to other components. So for our purposes, the component in question would be an amplifier, and I'll get to explaining why that is in just a moment, but first let's talk about loudspeakers, and the way a loudspeaker works is essentially the reverse of a microphone. You've got your permanent magnet, around or near which is a

09:51

coil of conductive wire. The wire is connected to a diaphragm, one much larger and typically made out of stiffer material than the plastic you'd find in a microphone. This is the element inside a speaker that will vibrate, that will push air and pull air as it moves either outward or inward. The electrical signal comes from a source, such as the microphone we were just using a second ago,

10:18

that comes into the loudspeaker and it flows through the coil. Now, when you have an electrical current flowing through a conductive coil, you generate a magnetic field because of the laws of electromagnetism. You've got the electromagnetic field generated as a result. Now that field will interact with the magnetic field of the permanent magnet. The permanent magnet always has a magnetic field. The coil only has one when electric current

10:46

is flowing through it. And as I said, magnets have a north pole and a south pole. And we also know that when we bring two magnets with their north poles together, they'll push against each other, right? Because like repels like. But if we turn one of those magnets around so that now it's a south pole and a north pole, they attract one another, you know,

11:08

opposites attract. So by having this magnetic field being generated by the coil, it starts to generate interactions with the magnetic field of the permanent magnet, so they start to push and pull against each other. Well, the coil is attached to that diaphragm, so it in turn drives the diaphragm to either push outward or pull inward.

11:36

That causes air molecules to vibrate, just as it would with any other, you know, source of sound, and it emanates outward from the loudspeaker, so you get a representation of the same sound that was going into the microphone. That sound got converted into an electrical current. The electrical current then was passed through a coil next to a permanent magnet to create the same sort of movement. It replicates the movement of the original diaphragm in the microphone and

12:07

generates the sound. So you get the replication of the sound that was made in the other location. It's pretty cool, I think. Now, I did mention earlier that you would need an amplifier. And the reason you need an amplifier is that the electrical signal generated by a microphone is far too weak to drive a loudspeaker's diaphragm. You just wouldn't have the juice to do it. It would be much, much less powerful than what the speaker would need.
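To make the amplifier's job concrete, here is a small Python sketch under stated assumptions: the microphone-level peak of 0.002 (think of it as roughly two millivolts), the gain of 1000, and all function names are hypothetical values picked for illustration, and a real amplifier is an analog circuit rather than a list multiplication. The sketch only shows that an ideal gain stage scales the size of the signal while leaving its cycle count, and so its pitch, unchanged.

```python
import math

SAMPLE_RATE = 8000  # samples per second; assumed for this sketch


def tone(frequency_hz, amplitude, n_samples=SAMPLE_RATE):
    """A pure tone standing in for the signal coming off a microphone."""
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * i / SAMPLE_RATE)
        for i in range(n_samples)
    ]


def amplify(samples, gain):
    """Idealized amplifier: multiply every sample by a constant gain."""
    return [gain * s for s in samples]


def upward_crossings(samples):
    """Count upward zero crossings, a proxy for the tone's frequency."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)


def peak(samples):
    """Largest swing away from zero, a proxy for signal strength."""
    return max(abs(s) for s in samples)


mic_signal = tone(440, 0.002)              # weak, microphone-level signal (assumed)
speaker_signal = amplify(mic_signal, 1000)  # hypothetical gain of 1000

print("peak before:", peak(mic_signal), "peak after:", peak(speaker_signal))
print("crossings before:", upward_crossings(mic_signal),
      "crossings after:", upward_crossings(speaker_signal))
```

The peak grows by the gain factor, while the number of zero crossings per second stays the same, which is the "louder, not higher" behavior you want from this stage.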

12:36

So chances are the diaphragm would either not move at all, because it would just be too stiff and would resist the movement too much, or it would move so weakly as to generate little to no sound, so it wouldn't do you any good. So the signal from the microphone has to first pass through an amplifier, which, as the name implies, takes an incoming signal and increases the amplitude of that signal, the volume, in other words. So it doesn't affect pitch, but it does affect the signal

13:03

strength and consequently the volume. And I've done episodes about amplifiers, including explaining the difference between amplifiers that use vacuum tubes and ones that use transistors, so I'm not going to go into that here. Besides, it doesn't really factor into our conversation about smart speakers anyway. It's just important for it to work with a microphone and speaker setup. Now, over the years, engineers have paired microphones and speakers in

13:29

lots of stuff. You've got telephones, you've got intercom systems, public address systems, handheld radios, all sorts of things, so that technology was well and truly mature before we ever got our first smart speaker. There wasn't much call to incorporate microphones into home speaker systems for many years. I mean, what would you actually use a microphone embedded in a

13:52

speaker for, before smart speakers? Typically you would have your speakers, and I'm talking about, like, sound system speakers, hooked up to some other dumb (as in not connected to a network) technology. So it might be a sound system or home entertainment setup with a television as the focal point, or maybe even, you know, a computer for the purposes of playing more

14:14

dynamic sounds for, like, video games and things like that. But for a very long time, these were all thought of as one-way communications applications, right? Like, the sound was coming from a source and it would get to us through the speakers, but we weren't meant to send sound back through those same channels. The information was just coming to you. You weren't sending anything back. But that

14:37

would all change in time. Now, one thing to keep in mind about smart speakers is that they are the product of several different technologies and lines of innovation and development that all converged together. The microphone and speaker technology is one of the oldest ones that we can point to as far as the fundamental underlying technology is concerned, the stuff that's been around since the late nineteenth century. Now there is one other we'll talk about that's even older.

15:03

But I don't want to spoil things. I'll just mention there is an even older line of development that goes into smart speakers than the microphone speaker stuff of the nineteenth century. Most of the other components, however, are much younger than that. One big one is speech or voice recognition. Creating computer systems that could detect noise was relatively simple. Right. You could have a computer connected to microphones and they could monitor the input from those microphones and any incoming

15:35

signal could be registered. Right, they could record an incoming signal that would indicate the microphone had detected a noise. That's child's play. That's easy to do. But teaching computers how to analyze those signals and decipher them so that the computer could display in text or otherwise act upon that sound in a meaningful way, that was much more difficult. There was an IBM engineer named William C. Dersch of the Advanced Systems Development Division who created an

16:06

early implementation of voice recognition. It was a very limited application, but it proved that the ability to interact with computers by voice was more than just science fiction. Within IBM, it was called the Shoebox. Dersch worked on this project in the early nineteen sixties, and what he produced was a machine that had a microphone attached to it. The machine could detect sixteen spoken words, which included the digits zero to nine plus some command indicators like plus,

16:39

minus, total, subtotal. You get the idea. So you could speak a string of numbers and then commands to this device, then ask it to total everything, and it would do so. So it was more or less a basic calculator with some voice interpretation incorporated into it. Now there's a great newsreel piece about this Shoebox. There's a demonstration of it, and it came out in nineteen sixty-one, and I love that newsreel because it has that great music you would hear in the background of those old

17:10

industrial and business films. Anyway, there's also a helpful chart that hangs in the background of that video where Dersch is actually explaining how it works. You can see a little bit behind him what is actually being analyzed, and he broke the words down into phonemes and syllables, phonemes being specific sounds that make up words. So, for example, the digit one is a single-syllable word

17:40

with a vowel sound right at the front. But you also have the word eight. That's another single-syllable word with a vowel sound right at the front, but it's different from one phonetically in that eight also has a plosive; it has that hard t at the end. So the Shoebox was limited not just in what words it could recognize, but also the types of voices it could recognize.
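Both limits just described, a tiny vocabulary and a narrow range of voices, can be shown with a toy Python sketch. This is not how the Shoebox hardware worked; it is a made-up template matcher in which every word is reduced to two hypothetical feature numbers, the vocabulary is cut down to six words, each digit word is treated as a whole number, and anything too far from a stored template is rejected. All names, feature values, and the distance threshold are assumptions for illustration.

```python
import math

# Hypothetical two-number "signatures" for a few words, imagined as having
# been measured from a single designer's voice. Values are invented.
TEMPLATES = {
    "one": (0.2, 0.1),
    "two": (0.8, 0.3),
    "eight": (0.3, 0.9),
    "plus": (0.9, 0.8),
    "minus": (0.1, 0.5),
    "total": (0.6, 0.6),
}
DIGITS = {"one": 1, "two": 2, "eight": 8}
MAX_DISTANCE = 0.15  # assumed tolerance; beyond this the sound is rejected


def recognize(features):
    """Return the closest stored word, or None if nothing is close enough."""
    word, template = min(
        TEMPLATES.items(), key=lambda item: math.dist(features, item[1])
    )
    return word if math.dist(features, template) <= MAX_DISTANCE else None


def run_calculator(words):
    """Act on recognized words: add or subtract digits until 'total'."""
    total, sign = 0, 1
    for word in words:
        if word in DIGITS:
            total += sign * DIGITS[word]
            sign = 1
        elif word == "plus":
            sign = 1
        elif word == "minus":
            sign = -1
        elif word == "total":
            break
    return total


# The designer saying "eight" lands close to the stored template.
print(recognize((0.32, 0.88)))
# Another speaker's "eight" drifts too far from the template and is rejected.
print(recognize((0.5, 0.95)))
print(run_calculator(["eight", "plus", "two", "minus", "one", "total"]))
```

In this sketch a second speaker's pronunciation falls outside the tolerance around the stored template, so the word is simply not recognized, which mirrors the voice-dependence problem on a very small scale.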

18:07

Get someone who has a different dialect or manner of speech, and the machine might not be able to understand them because they're not pronouncing the words the same way that Dersch did. This would be a big challenge in speech recognition moving forward, and it's also an example of where

18:24

we find bias creeping into technology. And it's not necessarily a conscious thing, but if you have people designing a system and they're designing it based off their own, you know, speech patterns, their own pronunciations, their own dialects, then it may be that the system they create works really well for them and less well for anyone who isn't them, and the further away you are from their manner of speaking, the more frustration you will encounter as

18:56

you try to interact with that technology. That's an example of bias, and in fact, if you read the histories of speech recognition and, as we'll get to later, natural language processing, you'll see a lot of people say it works great if you happen to be a white man, because the people who were designing it were primarily white men who were typically aiming for what is considered a non-accented American dialect somewhere in, you know, the Eastern

19:31

seaboard side. But that meant that if you did have an accent or a dialect, or you had a different vernacular, that it was harder for the systems to actually understand what you were saying. That's an example of bias. Well.

19:46

The general strategy was, again, to break up speech into constituent sound units, you know, those phonemes, and then to suss out which words were being spoken based on those phonemes, and that was done by digitizing the voice, transforming it from sound into data that represented stuff like the sound's frequency or pitch, and then matching up specific

20:08

signal signatures with specific phonemes. So generally the idea was that the computer system would monitor incoming sound, convert the sound into digital data, and compare the data it had received with information stored in a database in an effort to look for matches. The Shoebox database was just

20:26

sixteen words in size. Later ones would be much larger, but pretty quickly people realized this was not an efficient way of doing speech recognition, because the bigger the vocabulary, the more work-intensive it was to build out those databases. So it wasn't something that people thought would be sustainable for very large vocabularies. But the Shoebox marked the beginning of a serious effort to create machines that could accept audio cues as actual input, and as we'll see,

20:54

that's one important component for these smart speaker systems. I've got a lot more to say, but before I get into the next part, let's take a quick break. Now, obviously we didn't jump right into full voice recognition right after IBM's Shoebox innovation. The challenges related to building automated speech recognition systems were numerous, even for just a single language, because, as I said, you can have accents and dialects. One voice can have a very different tonal

21:28

quality from another, people speak at different speeds. Teaching machines how to recognize speech when the phonemes and pacing of that speech aren't consistent from speaker to speaker, that's really hard. This kind of gets back to the same sort of challenges you have when you're teaching machines how to recognize images. You know, you teach a human what a coffee mug is.

21:51

I always use this example, but you teach a human what a coffee mug is, and pretty soon they can extrapolate from that example and understand that coffee mugs can come in all different sizes and colors, and, you know, different designs and textures. We get it. Like, you see a couple of coffee mugs, you understand. Machines, though, they aren't able to do that. Machines, you know, you have to give them lots and lots and lots of different examples before they can start to pick up on

22:20

what things actually make a coffee mug. Same sort of thing with speech, right? So if you don't have consistency between speakers, it makes it very hard for machines to learn what people are saying. Now, it didn't take long for the tech industry at large to really dive into trying to solve this problem. In nineteen seventy-one, DARPA, that's the research and development division of the United States Department of Defense, got behind speech recognition in a big way. Now, remember

22:49

DARPA itself doesn't do research. The organization's purpose is to invite organizations to pitch projects that align with whatever DARPA's goals are, and DARPA would provide funding to the winning organizations to see these projects to completion if possible. So DARPA is really more of a vetting and funding organization. Anyway, in nineteen seventy-one DARPA created a five-year program called Speech Understanding Research, or SUR. The initial goal was pretty darn ambitious considering the capabilities of the

23:23

technology at the time. The project director, Larry Roberts, wanted a system that would be capable of recognizing a vocabulary of ten thousand words with less than ten percent error. After holding a few meetings with some of the leading computer engineers of the day, Roberts adjusted that goal significantly. After that adjustment, the target was going to be a system capable of recognizing one thousand words, not ten thousand.

23:50

Now, error levels still had to be less than ten percent, and the goal was for the system to be able to accept continuous speech, as opposed to very deliberate speech with pauses between each pair of words, which would not really be that useful. One person who was skeptical about the potential success of this project was John R. Pierce

24:16

of Bell Labs. He argued that any success would be limited so long as machines remained incapable of understanding the words, not just recognizing a word based on phonemes, but understanding what the word is. That is, Pierce felt that the machines needed some way to parse the language to get to the meaning of what was being said. That's an important idea that we will come back to in

24:38

just a bit. Now, among the companies and organizations that landed contracts with DARPA were Carnegie Mellon University; BBN, which actually played a big part in developing ARPANET, the predecessor to the Internet; Lincoln Laboratory; and several more, and very smart people began to create systems intended to recognize speech in meaningful ways. The names of the programs were a lot of fun. There was HWIM, which stood for Hear What I Mean, as in hear, as in listen, hear what I mean. That one

25:09

was from BBN. CMU introduced Hearsay, which was later designated as Hearsay-I, and then they came out with Hearsay-II. They also would demonstrate another one called Harpy. Oh, and there was a professor at CMU named Dr. James Baker who would design a system called Dragon in nineteen seventy-five that he would later leverage into a company with his wife, Dr. Janet M. Baker, in the nineteen eighties,

25:35

and they had a very successful business with speech recognition software. Now, I'm not going to go into each of those programs in deep detail, but rather just mention that they all helped advance the cause of creating systems that can recognize speech. One of the big developments that came out of all that work was a shift to probabilistic models, which would also play a really important part in another phase of developing the smart speaker. So what do I mean when

26:00

I say probabilistic? Well, as the name indicates, it all has to do with probabilities. Essentially, systems would analyze incoming phonemes and make guesses as to what was being said based on the probability of it being a given word or part of a word. The systems typically go with whatever word has the highest probability of being the correct one. Even with that approach, there are nuances to language that

26:26

are difficult to account for with a machine. So, for example, you have homonyms, in which you have two words that sound the same but have very different meanings and potentially spellings, like write, as in to write a sentence, or right, as in am I right or am I wrong? Or you could have a pair of words that sound like a single word and have confusion there, such as a door. You can say a door, meaning a single door, a door to go into a building, or you might say adore, as in, I adore

26:58

this podcast you're doing, Jonathan. That's sweet of you, thank you for saying that. So computer scientists were hard at work advancing both the capability of machines to make correct guesses at individual phonemes and then full words, as well as figuring out a way to teach machines to adjust guesses based on context. That requires a deeper understanding

27:21

of the language within which you're working. If you're aware of certain idioms, you can make a good guess at a word or phrase even if you didn't get a clean pass at it right. So, for example, the phrase it's raining cats and dogs just means it's raining a lot. And if a system included a database that indicated the phrase cats and dogs sometimes follows the phrase it's raining, then the system is more likely to guess the correct sequence of words instead of guessing something that sounded similar

27:52

but is wrong. For example, if it said, oh, they must have said it's raining bats and hogs, that would not make sense. So the systems estimate the probability that any given sequence of sounds within the database matches what the systems have just, quote unquote, heard. Progress in this area was steady but slow, and I'd argue that it was also a reminder that concepts like Moore's law do

28:18

not apply universally across technology. Rapid development in one particular domain of technology is not necessarily an indicator that the same sort of progress will be observed in all other areas of tech. We often get into the mistaken habit of believing that Moore's law applies to everything. Alright. So a related concept to voice recognition is something called natural language processing, and this relates back to how we humans tend to process information compared to the way machines tend

28:49

to do it. So we humans formulate ideas, we shape those ideas into words and sentences. We communicate them in some way to other people through that language. It may be through speech, it may be through text. It may even be through a nonverbal or non-literary way, but we communicate those ideas. Machines typically accept input, they perform some process or sequence of processes on that input, and then they supply an output of some sort. Machines do this

29:19

in machine language. That's a code that's far too difficult for humans to process easily. Binary is an example of machine language. Binary is represented as zeros and ones, which, when grouped together, can represent all sorts of stuff. But if you just looked at a big block of zeros and ones, it would mean nothing to you. It's not easy for humans to use, and then machines in turn are not natively able to understand human language, so there's

29:44

a language barrier there. Because of that, people created different programming languages. These languages provide layers of abstraction from the machine language. They make it easier to create programs, or directions that the computer should follow. So the person who's doing the programming is using a programming language that's easy for humans to use, which then gets converted into

30:08

machine language that the computers understand. But what if you could send commands to a computer using natural language, not even programming language? You could just speak in plain vernacular, whether it's English or any other language, the way humans communicate with one another. What if a computer could extract meaning from a sentence, understand what it was you wanted

30:30

the computer to do, and then respond appropriately? So imagine how much time you could save if you could just tell your computer what you wanted it to do, and it took care of the rest. If you had a powerful enough computer system with strong enough AI, maybe you could even potentially do something like describe a game that you would love to be able to play, like not a game that exists, a game in your head, and you could describe it to a computer and the

30:56

computer could actually program that game. Well, we're definitely not anywhere close to that yet, but we've made enormous progress with natural language processing. Now, the history of natural language processing isn't exactly an extension of voice recognition. It's actually more like a parallel line of investigation. And that's

31:16

because natural language processing doesn't require voice recognition. You can have an implementation in which you just write commands in natural language, you know, you type them out on a keyboard, and the machine then carries out those instructions. So much of the early work in natural language processing was in text-based communication rather than in speech. The history of natural language processing includes stuff like the Turing test,

31:41

named after Alan Turing. So the most common interpretation of the Turing test these days is that you've got a scenario in which a person is alone in a room with a computer terminal, they can type whatever they like into the computer terminal, and someone or something is responding to them in real time. Now it might be another person, or it might be a computer system that's responding to

32:04

that person. You run a whole bunch of test subjects through this process, and if the computer system is able to fool a certain percentage of those test subjects, like say thirty percent of them, that it is in fact another human and not a computer, it is said to have passed the Turing test, And typically we use that to mean the machine has given off the appearance of possessing intelligence similar to the one that we humans possess.

32:32

That gets beyond our scope for this episode, but it helps point out that stuff like speech recognition and natural language processing are both closely related to the field of artificial intelligence. In fact, they really belong within the artificial intelligence domain. The Turing test was more of a hypothetical. It was a bit of a cheeky way of saying, Hey, if you can't tell whether or not something is intelligent, it makes sense to treat it as if it actually

32:58

is intelligent. After all, we assume that every human with whom we interact possesses some level of intelligence based on those interactions, so why should we not extend the same courtesy to machines? Now, natural language processing would prove to be another super challenging problem to solve in computer science. Early work was done in translation algorithms, and these were programs that attempted to take phrases written in one language

33:24

and translate those automatically into a second language. At first, that seemed pretty straightforward, but you realize that's also pretty tricky, really. For one thing, you can't just translate word for word and keep the same order from one language to another. The syntax, or the rules that the language follows, could be different from language to language. In one language, you might use an infinitive such as to record, in the middle of a sentence, while another language might put

33:53

all the infinitives at the end of a sentence. So in one language, I might say I'm going to record a podcast in the studio right now, but in another language it might come out as I'm going a podcast in the studio right now to record. It starts to sound like Yoda. There was initial excitement around machine translation, but once computer scientists and linguists began to see the

34:16

scope of this challenge, their excitement faded a bit. Also, there was a lot of other stuff going on in the nineteen sixties and seventies that was demanding a lot of attention, such as the Space race. So for a while, this branch of computer science was given less attention than

34:32

other branches, and by less attention, I really mean funding. Now, when we come back, we'll talk a bit more about the advances that were necessary to support natural language processing, and we'll move on to how this would be another important component in smart speakers. But first, let's take another quick break. Okay, so early enthusiasm for natural language processing created a bit of a hype cycle that ultimately crashed into the telephone pole of unmet expectations. That was

35:10

a really bad metaphor. Anyway, natural language processing went through something similar to what we saw with virtual reality in the nineteen nineties. You know, people saw what was actually achievable, and then they compared that to what they thought they were going to get, and those two things didn't match up at all, and that really pulled the rug out of funding for natural language processing, which meant, of course,

35:35

that progress slowed way down. It kept going, but it was definitely on the back burner for a lot of projects. When interest renewed in the nineteen eighties, there had been a shift in thinking around natural language processing. Computer scientists were starting to look at statistical approaches similar to what was going on with speech recognition, building up probabilistic models in which a computer can start making what amounts to educated guesses at the meaning of a command or a phrase.
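
To make the idea of an educated guess concrete, here is a minimal sketch of a probabilistic model that guesses what a typed command means. It is a toy naive Bayes classifier, not the method any particular assistant uses. The training phrases, the intent labels, and the function names train and guess_intent are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical training phrases, each labeled with the intent it expresses.
TRAINING = [
    ("play some music", "play_music"),
    ("play a song", "play_music"),
    ("what is the weather today", "weather"),
    ("will it rain today", "weather"),
]


def train(examples):
    """Count how often each word appears under each intent."""
    word_counts = {}
    intent_counts = Counter()
    for text, intent in examples:
        intent_counts[intent] += 1
        word_counts.setdefault(intent, Counter()).update(text.split())
    return word_counts, intent_counts


def guess_intent(text, word_counts, intent_counts):
    """Return the intent with the highest (log) probability for the text."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_examples = sum(intent_counts.values())
    scores = {}
    for intent, counts in word_counts.items():
        # Start from how common the intent is overall.
        score = math.log(intent_counts[intent] / total_examples)
        words_seen = sum(counts.values())
        for word in text.split():
            # Add-one smoothing so an unseen word never zeroes out a guess.
            score += math.log((counts[word] + 1) / (words_seen + len(vocabulary)))
        scores[intent] = score
    return max(scores, key=scores.get)


word_counts, intent_counts = train(TRAINING)
print(guess_intent("play a tune", word_counts, intent_counts))
print(guess_intent("is it going to rain", word_counts, intent_counts))
```

The model has never seen the word tune, but the rest of the phrase still makes play_music the more probable guess. That is the sense in which the guess is educated rather than an exact rule match.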

36:06

Machine learning became an important component on the back end of these systems, and later artificial neural networks became an important part as well. A neural network processes information in a way that's sort of analogous to how our brains do it. You have nodes or neurons that connect to other nodes, and each node affects incoming data in a certain way, performing some sort of operation on it, and the degree to which they do that in one way

36:35

versus another is called the weight of that node. Computer scientists apply weights across the nodes in an effort to get a specific result in order to train these models. So you might feed a specific command into such a system, and you let it go through the computational process from the beginning of the neural network through to the end, and then you look at the result, and if the result is correct, well, that just means the system is already working as you intended it, which honestly is not

37:04

likely to happen early on. But if it's not correct, then you start adjusting the weights on those nodes in

37:12

order to affect the outcome. I almost think of it as like Plinko or pachinko, where you've got the little coin and you drop it down and it bounces on all the pegs, and sometimes you might think, all right, well, this time it's going to go right for that center slot, but it doesn't, and you think, well, maybe if I remove some of these pegs or I shift these pegs over a little bit, I can drop it in that same spot and hit the center.
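
The check-the-result-then-adjust-the-weights loop can be sketched with a single node. This is a perceptron learning the logical AND of two inputs, which is far simpler than the networks behind a voice assistant. The names predict, train, and EXAMPLES, the step size, and the number of passes are assumptions chosen for illustration.

```python
# Each example pairs two inputs with the output we want (logical AND).
EXAMPLES = [
    ((0, 0), 0),
    ((0, 1), 0),
    ((1, 0), 0),
    ((1, 1), 1),
]


def predict(weights, bias, inputs):
    """Run the inputs through the node: weighted sum, then a yes/no cutoff."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(examples, passes=20, step=1):
    """Repeatedly compare the output to the target and nudge the weights."""
    weights = [0, 0]
    bias = 0
    for _ in range(passes):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            # A correct answer gives error == 0, so nothing changes.
            weights = [w + step * error * x for w, x in zip(weights, inputs)]
            bias += step * error
    return weights, bias


weights, bias = train(EXAMPLES)
for inputs, target in EXAMPLES:
    print(inputs, "->", predict(weights, bias, inputs), "(wanted", target, ")")
```

Real systems have many layers of nodes and use more sophisticated update rules, but the shape is the same: run an example through, compare the result with what you wanted, and shift the weights a little when it is wrong.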

37:36

It's kind of like that, except you're talking about data, not physical moving parts. So you have to do this a lot, like up to millions of times, in order to try and train a system so that it responds appropriately to commands. And once it's trained, you can then test new commands on the system to see if it can parse them and respond appropriately. And in this way, the system quote unquote learns over time how to respond to commands. And then we have another component that's important

38:07

with smart speakers, and that's speech generation. So it's one thing to have a machine either broadcast or play back a recording of speech. It's another thing for a machine to generate brand new speech. In computer science, we call it speech synthesis. Now, this is the really old technology I was alluding to at the beginning of this episode,

38:29

speech synthesis. If you want to be really, you know, kind of technical about it, it actually predates every other technology I've mentioned up to this point, at least in its most rudimentary implementations. You have to go way back to the eighteenth century, the seventeen seventies, when a Russian smarty pants named Christian Kratzenstein was building a device that used acoustic resonators, these reeds that would vibrate,

38:57

and it was in an attempt to replicate basic vowel sounds. Now, even with such a working device, it would be really difficult to communicate anything meaningful unless you were, I guess, speaking whale like Dory in Finding Nemo. But it would be an early example of how people tried to create mechanical systems that could replicate speech or elements of speech.

39:18

Another inventor named Wolfgang von Kempelen built an acoustic mechanical speech machine that used reeds and tubes and a pressure chamber, and it was all meant to replicate various speech sounds. He had other elements to create sounds like plosives, those hard sounds that I mentioned earlier in the episode. So he had all these different elements that, working together, could create parts of the sounds that we humans make

39:47

when we speak. He also built a supposed chess playing machine, and it turned out that the chess playing part was a hoax. So unfortunately, because that device was a hoax, a lot of people dismiss his other work, which was legitimate. So by fudging on one thing, he kind of cast doubt on everything he had ever done. Skipping ahead quite a bit, we get to Homer Dudley, which is a

40:15

fantastic name. He unveiled the Voder, or Voice Operating Demonstrator, device at the New York World's Fair in nineteen thirty nine. It consisted of a complex series of controls and it sort of reminds me of something like a musical instrument, kind of like a synthesizer, but with extra controlling units. Like there was a wrist element, there was a pedal. There's a lot of stuff that made it very complex, and with a lot of practice, you could create specific

40:47

sounds from this synthesizer. You could even create words or full sentences, though from what I understand, it was incredibly challenging to do. It was a very steep learning curve, but it demonstrated the possibility of electronic synthesized speech. Now, there was a lot of work done in this field by lots of different talented scientists and engineers, and someday I'll have to do a full episode on the history

41:14

of speech synthesis. It's really fascinating, but it's far too big a topic to cover in its entirety in this episode. By the late nineteen sixties we had our first text to speech system, and by the late nineteen seventies and early nineteen eighties, the state of the art had progressed quite a bit and we were starting to get to a point where we could create very understandable computer voices. They weren't natural, they didn't sound like people, but you

41:41

could understand what they were saying. And finally, something else that would enable smart speakers and virtual assistants was the pairing of improved network connectivity and cloud computing. That removes the need for the device that you're interacting with to do all the processing on its own. So, if you think about the history of computing, we used to have mainframes with dumb terminals that attached to the mainframe,

42:05

so the terminal wasn't doing any computing. It was just tapping into the mainframe computer, which was sending results back to the terminal. Then you get to the era of personal computers, where you had a device sitting on your desk that did all the computing and it didn't connect to anything else. Then we get up to networking and the Internet, where we suddenly had the capability of having really powerful computers or grids of computers that were able

42:31

to take on processing power, and you just send the request out to the Internet and you get the response back. That's the basis of cloud computing. So your command or message or whatever relays back to servers on the cloud that then process it and send the proper response to whatever device you're interacting with, and then you get the result. So in the case of the smart speaker, it might be playing a specific song or giving you a weather report or whatever it

43:02

might be. Now, if the speakers were doing some of that computation themselves, that would be an example of edge computing, where the processing takes place at least in part, at the edge of a network at those end points. But for now, most of the implementations we see send data back to the cloud to get the right response, so you have to have a persistent Internet connection. These devices

43:25

are not useful without that connection. You do have some smart speakers that can connect to another device like a smartphone via Bluetooth, so you could do things that way, but without those connections, the smart speaker turns into, you know, just a dumb speaker, or sometimes just a paperweight. Now, this collection of technologies and disciplines is what enabled Apple to introduce Siri in two thousand and eleven, and Siri

43:52

is a virtual assistant. Siri's origins actually trace back to the Stanford Research Institute and a group of guys, Tom Gruber, Adam Cheyer, and Dag Kittlaus, who had been working on the concept since the nineteen nineties, and when Apple launched the iPhone in two thousand seven, they saw the iPhone as a potential platform for this virtual assistant that they had been building, and they thought, well, this is perfect because the iPhone has a microphone, so the assistant can

44:20

respond to voice commands. It has a speaker, so it could communicate back to the user. It could do all sorts of stuff. We can tap into the interoperability of apps on the device. It's a perfect platform for us to

44:33

deploy this. So they developed an app once the opportunity arose, because apps were not available for development immediately when Apple launched the iPhone, and once they did launch that app, within a month, less than a month, Steve Jobs was on the phone calling them up and offering to buy the technology, which of course they would agree to, and it would become an integrated component in Apple's iPhone line afterward. And that's where voice assistants kind of lived

45:02

for a few years. They mostly lived on smartphones like the iPhone. But in November two thousand fourteen, Amazon introduced the Amazon Echo smart speaker, which was originally only available for Prime members, and it had its own virtual assistant named Alexa, and thus the smart speaker era officially began. Now, there are plenty of other smart speakers that are on the market these days. There are products from Google like

45:28

Google Home. There are Sonos speakers that can connect to services like Amazon's Alexa or Google's Assistant, and we're probably going to see a ton more, both from companies that piggyback onto services from the big providers like Google and Amazon, and maybe some that are trying to make a go of it with their own branded virtual assistants and services. Smart speakers respond to commands after they

45:52

quote unquote hear a wake up word or phrase. Now, I'm gonna make up a wake up phrase right now so that I don't set off anyone's smart speaker or smart watch or smartphone or smart car or whatever it might be. So this is just a fictional example of a wake up phrase. So let's say I have a smart speaker and the wake up phrase for my smart

46:15

speaker happens to be hey there, Genie. Well, my smart speaker has a microphone, so it can detect when I say that, but really it's constantly detecting all sounds in its environment. The microphone is always active. It has to be in order to be able to pick up on when I say the wake up phrase. So the microphone is always active on most smart speakers. There are some where you can program it so that it will only activate if you first touch the speaker and that wakes it up.

46:47

There are some that you can do that with, but for the most part, they're always listening. While the speaker can quote unquote hear everything, it's not listening to everything. In other words, it's not monitoring the specific things being said. At least that's what we've been told. And honestly, that makes a ton of sense from an operational standpoint.

47:07

And the reason I say that is that the sheer amount of information that would be flooding in from all the microphones on all the smart devices from any one provider that happened to be deployed all over the world, that would be an astounding amount of data. And sifting through all that data to find stuff that's useful would take an enormous amount of effort and time and

47:29

processing power. So while you could have all the microphones listening in all over the place, finding out who to listen to at what time would be a lot trickier and probably not worth the effort it would take to pull something like that off. So what these speakers and other devices are actually doing is looking for a signal

47:50

that matches the one that represents the wake phrase. So when I say, hey there, Genie, the microphone picks up my voice, which the mic then translates into an electrical signal, which gets digitized and compared against the digital fingerprint of the predesignated wake up phrase. And in this case, the two phrases match. It's like a fingerprint matching something that was left at a site. So that turns the speaker into an active listener rather than a passive one.
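
Here is a rough sketch of that fingerprint comparison and the passive-to-active switch. Real devices use trained acoustic models, so treat this as an illustration only. The four-number fingerprint, the threshold value, and the function names similarity, is_wake_phrase, and handle_audio are all invented for the example.

```python
import math

# Hypothetical stored fingerprint of the wake phrase (made-up feature values).
WAKE_FINGERPRINT = [0.9, 0.1, 0.4, 0.8]
# Hypothetical cutoff for how close a sound must be to count as a match.
MATCH_THRESHOLD = 0.95


def similarity(a, b):
    """Cosine similarity: 1.0 means the two feature lists point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_wake_phrase(features):
    """Compare digitized audio features against the stored fingerprint."""
    return similarity(features, WAKE_FINGERPRINT) >= MATCH_THRESHOLD


def handle_audio(features, state):
    """Return (new_state, sent_to_cloud) for one chunk of audio."""
    if state == "passive":
        # Nothing leaves the device while it waits for the wake phrase.
        if is_wake_phrase(features):
            return "active", False
        return "passive", False
    # Active: the next thing heard is treated as the command and sent off.
    return "passive", True


state = "passive"
for chunk in ([0.1, 0.9, 0.8, 0.1], [0.88, 0.12, 0.41, 0.79], [0.3, 0.3, 0.3, 0.3]):
    state, sent = handle_audio(chunk, state)
    print(state, sent)
```

In this sketch only the chunk that arrives after a successful match is marked as sent to the cloud. Everything heard in passive mode is checked locally and dropped.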

48:20

It's ready to accept a command or a question and to respond to me. But if I didn't say, hey there, Genie, then the speaker would remain in passive mode because it wouldn't have a digital fingerprint that matches the one of the wake up phrase. Everything stays at the local level, and none of my sweet secret speech gets transmitted across the internet. It's all staying right there. At least that's what we've been told. And again I don't have any reason to disbelieve this, but it is

48:50

something to keep in mind. You are talking about devices that have microphones. Of course, if you have a smartphone, you've already got one of those or a cell phone. In general, you've got a device with a microphone on

49:00

it near you pretty much all the time. Now, once I do make a request with my smart speaker, the speaker then sends that request up to the cloud where it gets processed. It's analyzed, and then a proper response is returned to me, whether that is playing a song or giving me information I've asked for, or maybe even interacting with some other smart device in my home, such as adjusting the brightness of my smart lights in

49:26

my house. Now, if the system is not sure about whatever it was I just said, it will probably return an error phrase. So maybe I'm too far away from the speaker, so it couldn't quote unquote hear me really well. Or maybe I've got a mouthful of peanut butter or something, as I'm wont to do. Then I'm going to get something like I'm sorry, I don't know how to do that, or I'm sorry, I didn't understand you, and then I'd have to repeat it. Now,

49:53

smart speakers are pretty cool. However, they do represent another piece of technology that you have to network to other devices, including your own home network, and as such that means that they represent a potential vulnerability in a network. It doesn't mean they're automatically vulnerable, but it means that every time you are connecting something to your network, then you're

50:18

creating another potential attack vector for a hacker. Right now, if everything is super strong, it doesn't really effectively change your safety in any meaningful way. But if one of those things that you connect to your network is less strong than the others, you're looking at the weakest link situation, where a hacker with the right know-how and tools could potentially target that part of your network

50:45

to get entry into everything else. And when you're talking about a smart speaker, you're talking about a device that has an active microphone on it. So potentially, if someone were able to compromise a smart speaker, they would be able to listen in on anything that was within range of that

51:03

smart speaker's microphone. So that's why you have to at least be cognizant of that. Do your research, make sure the devices you're connecting to your network are rated well from a security standpoint, and when you're setting things up and you have to create passwords, create strong passwords that are not used anywhere else. The harder you make things, the more likely hackers will just pass you by, not

51:30

because you're too tough to crack. Never get it into your head that you're too strong to be hacked, but rather if there's someone who's weaker, then the hackers are going to go after that person instead. So just don't be the weak person. Practice really good security behaviors, and you're more likely to discourage attackers and they'll go on to someone else. Um, especially if you're talking about newbies who don't really know their way around, they're

52:00

just using tools that other people have designed. They get discouraged very quickly. They'll move on to someone else because there's always another potential target. I'm curious about you guys, whether or not you have any smart speakers in your life, and uh if you find them useful. I find mine pretty useful. I use it for a very narrow range of things. I don't tend to use it. I definitely don't use it to its full potential. I know that

52:26

because once in a blue moon, I'll just try something and I'm amazed at what happens when I get a response. But for the most part, I'm asking about the weather, what I can feed my dog, whether or not it can turn on the lights, and that's about it. Or occasionally playing a song. Um, but I'm curious what you guys are using them for. Reach out to me on social networks. I'm on Facebook and I'm on Twitter, and the handle for both of those is TechStuffHSW. Also use those handles

52:56

if you have suggestions for future episodes. If you've got, you know, an idea for either a company or a technology or a theme in tech you'd really like me to tackle, let me know there and I'll talk to you again really soon. Tech Stuff is a production of I Heart Radio's How Stuff Works. For more podcasts from I Heart Radio, visit the I Heart Radio app, Apple Podcasts, or wherever you listen to your favorite shows.

Transcript source: Provided by creator in RSS feed