TechStuff Classic: The Dirt on Digital Audio

Speaker 1

00:04

Get in touch with technology with TechStuff from HowStuffWorks.com. Hey everybody, it's Jonathan Strickland here with TechStuff classic episodes. We're doing some Saturday morning reruns for you guys. This is a special series where we're going to dig up some classic episodes of TechStuff and present them to you guys who may not have had a chance to listen to them, especially if you're a brand new listener. First of all, welcome. If

00:33

that's the case, I hope you enjoy these episodes. This one is called The Dirt on Digital Audio, and it is an episode all about the actual technical process of recording audio into a digital format and what that requires because it's very different from the analog style. I hope you guys enjoy it. This episode was originally published on

00:55

November twenty three, two thousand sixteen. And just in case you're listening to this one in the far future, I'm recording this in two thousand eighteen, so we're gonna time travel a bit and listen to this classic episode, The Dirt on Digital Audio. So to start it all off, we all have to take a quick trip to Germany, so anyone who is not in Germany, get your passport. I was actually in Germany not that long ago. I

01:22

got to visit Berlin and had a wonderful time. And in Germany there's a company called Fraunhofer-Gesellschaft. And you might wonder, well, what does this company do? They think. I joke that my profession, the title that I should put on my business card, should say professional smart person. Well, no joke, that's what these people are. They specialize in research and development, applied research. It's a whole company that specializes in applied research. And

01:54

it's huge. It encompasses sixty seven institutes and research units across Germany. Well, back in the eighties, there was a researcher named Karlheinz Brandenburg, and Karlheinz made a breakthrough, uh, and came up with this clever idea about encoding audio. He was actually working towards creating a way that would allow for high audio quality transfer but having a low bit rate sampling, so that file

02:30

sizes and transfer times wouldn't get out of control. Because you got to remember, this is the eighties. This is before the World Wide Web was a thing. That wouldn't happen until the early nineties, so the Internet was very young. In fact, they weren't even looking at the Internet as a method of distribution for this particular type of encoded audio. They were looking at using this

02:51

to transmit across telephone lines. So they needed to have something that was going to be high quality but low space. So what the heck does that mean? All right. Well, digital audio and analog audio are very different things. So to understand that, we need to look at how sound works and how we describe sound, because that informs how we can capture sound and replicate those qualities digitally. So stick with me. We're gonna go back to school for

03:22

some basic sound science. And this goes back to the way sound physically moves through a medium, whether that's a solid or through the air or through water. Sound is vibration. Now we sense this primarily through hearing it or sometimes feeling it. If it's the right frequency and the right amplitude, we can actually feel sound. Anyone who's stood close to, say, a subwoofer that was really blasting out bass notes, you know what I'm talking about. You can feel it

03:54

pressing against you. Well, sound travels through the air when molecules vibrate against each other, and this creates instances of increased pressure and decreased pressure at what is a hyperlocal level. We're not talking about weather maps here. We're talking about tiny little areas. So this increase and decrease in pressure is something that we can sense as sound. When those changes in pressure affect a diaphragm, such as one that's in a microphone or maybe your eardrum,

04:25

for example, it causes the diaphragm to actually move. So increased pressure pushes the diaphragm in, and decreased pressure doesn't really pull the diaphragm out. I mean, you could say it pulls the diaphragm out, but to be more accurate, the diaphragm actually pushes outward because the pressure on the outside is lower than the pressure on the inside. But you get what I'm saying. The diaphragm begins to flex inward and outward depending upon the amount of pressure

04:56

that it's encountering. You can imagine this being kind of like a drum, not an eardrum, but an actual drum, and striking it. Uh, that's the same sort of thing. So sound is the fluctuations of pressure, which we can diagram as a wave, or a waveform, on an x y axis. So the horizontal line, that axis represents time that has passed, and the vertical axis represents the amplitude or the volume

05:26

of the sound wave. The wavelength of the sound, which is the distance between successive points on a wave, such as the successive crests on a wave, that tells you a lot about the frequency. So sound moves at a constant rate through a given medium, but it moves at different rates through different media. So in other words, it moves at a different speed through a solid than it does through air. If the crests of each sound wave

05:53

are really close together, that's a high frequency sound. More waves will pass through an arbitrary point within a second. The waves that are spaced further apart, that would be a lower frequency sound. Higher frequency sounds have a higher pitch than lower frequency sounds. So if you hold a single note at a constant frequency, you'll have what is called a simple harmonic motion. That means the vibrations are moving at a constant rate inward and outward. The cycle

06:22

is constant. A tuning fork is a good example of this. So if you hear a clear C note played on a musical instrument, that could be a simple harmonic motion. It won't be, but it could be. I'll tell you why it won't be in a minute. So the frequency of vibration doesn't change, and so you would get this very clear note as a result, And if you were to diagram it, you would have very regular crests and troughs, all of the same amplitude and distance from each other.
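For anyone following along in text, here is a minimal sketch of that steady, single-frequency tone in Python. The function name, the 262 hertz value (roughly middle C), and the 8000 points per second used to draw the curve are all illustrative assumptions, not anything from the episode.

```python
import math

def pure_tone(frequency_hz, amplitude, duration_s, points_per_second=8000):
    """Return (time, pressure) pairs for one steady sine-wave tone."""
    count = int(duration_s * points_per_second)
    return [
        (n / points_per_second,
         amplitude * math.sin(2 * math.pi * frequency_hz * n / points_per_second))
        for n in range(count)
    ]

# 262 Hz is roughly middle C; the exact value is an assumption for illustration.
wave = pure_tone(frequency_hz=262, amplitude=1.0, duration_s=0.5)

# Time between successive crests: the same for every cycle of a pure tone.
period_s = 1 / 262
```

Plotting the pairs in `wave` with time on the horizontal axis would give the regular crests and troughs described above, all the same height and the same distance apart.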

06:53

The frequency and volume would remain constant, assuming of course, that you're not trying to change the frequency or volume. Now, this is where I point out most musical instruments don't produce a single clear note, even if played expertly. They actually create several resonant frequencies. So every physical object resonates at several different frequencies. You've probably seen this in various programs.

07:19

MythBusters did one about bridges, the idea being that if you were to have a group of people marching on a bridge at the bridge's resonant frequency, it could cause the bridge to start to vibrate and swing out of control. Well, there's a reason for this. You may have also seen videos of people singing a certain note and causing a

07:39

crystal glass to shatter. That's because that crystal glass does have a resonant frequency, and if you can hit that resonant frequency at the right volume, you can cause the glass to start to deform, or the crystal in this case, to deform to a point where it loses integrity and it shatters as a result. Well, the resonation of an object is dependent upon lots of different factors, and in fact, most stuff will resonate at different frequencies, but at different intensities.

08:10

Like there might be one sweet spot, one specific frequency that will have the greatest effect, but other related frequencies may also have an effect. It will just be to a lesser extent. Well, if you were to pluck a guitar string, say you've tuned it to whatever note, it doesn't matter. Let's say you tuned it to G and

08:31

you play the G string on your guitar. The note that you will hear really over all others will be G. That is going to be the one that will sound the loudest. But it will also play resonant frequencies at a decreased amplitude, in other words, at decreased volume, so you still hear the intended note above everything else, above all the other resonant frequencies. This is called a complex tone, and that collection of frequencies and their amplitudes

08:59

is called the spectrum of sound. You get a full spectrum. Now, some of the components of that complex tone will be, uh, imperceptible to you. They'll be so quiet that you wouldn't really notice them. They might affect the overall quality of the sound, but in such a subtle way that it may be difficult for you to even put it into words. Each of those little components is called a partial.
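As a rough sketch, a complex tone can be modeled as several sine waves added together, one per partial, each with its own amplitude. The version below assumes an idealized string whose partials sit at whole-number multiples of the loudest frequency; the function name and the list of amplitudes are made up for illustration.

```python
import math

def complex_tone(fundamental_hz, partial_amplitudes, t):
    """Sum partials at 1x, 2x, 3x... the fundamental at time t (in seconds)."""
    return sum(
        amp * math.sin(2 * math.pi * fundamental_hz * (k + 1) * t)
        for k, amp in enumerate(partial_amplitudes)
    )

# Hypothetical spectrum: each partial is quieter than the one before it.
spectrum = [1.0, 0.5, 0.25, 0.125]

# 196 Hz is roughly the open G string on a guitar (illustrative value).
sample = complex_tone(196.0, spectrum, t=0.001)
```

The first entry in `spectrum` is the note you hear above everything else, and the later entries are the quieter components that color the sound.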

09:23

So in the example of a guitar string, the partials are all integer multiples of the same fundamental frequency, and the sound has a harmonic spectrum. But as you get further away from that fundamental frequency, the amplitude decreases significantly. So, like I said, you get far enough away, they are technically there, but they might be imperceptible to you. Now, some sounds have frequencies that aren't integer multiples of a fundamental

09:51

frequency and are inharmonic. Uh, certain bells, like if you hear a bell ring, you can probably pick out a couple of different frequencies there that are not harmonic frequencies. These are very complex sounds, and to our perception, if it's complex enough, it can seem like there's no single discernible pitch, like there's no fundamental frequency over all the others. If it's complex enough, we call it noise.

10:17

That is the technical term. It is noise. Now, the unit we use to measure frequency is the hertz, uh, H E R T Z. Typical human hearing ranges from twenty hertz, which means a wave will pass a given arbitrary point twenty times within a second, all the way up to twenty kilohertz, which means a wave will pass a particular point in time twenty thousand times in a second, or a particular point on your waveform twenty

10:47

thousand times in a second. And most of our sensitivity tends to be between one or two kilohertz up to four or five kilohertz. That's generally where we have human voices, and we've really gotten good at picking those out over everything else. So our sensitivity of hearing is really concentrated between one kilohertz and four kilohertz, or two and five, depending upon whom you ask. Now we get back over to amplitude. That is referring to the height of the wave. It also refers to

11:18

the volume the loudness of something. Amplitude means bigness. So how big is the sound, Well, the greater the amplitude, the louder it is. And amplitudes can have an enormous range and affect how we perceive sounds. So, for example, take a really complicated classical piece of music. It's just

11:38

easy to explain it in that term. You might have a stretch in that classical piece of music in which all the instruments are more or less playing at a similar volume, so the sound from each instrument section has a similar amplitude. But then there might be one segment where an instrument group or maybe even a single soloist has an increased amplitude and increased volume. It rises over the rest of the orchestra, and that peak of the amplitude is called the attack of the sound, and the

12:10

entire range of amplitudes is called the amplitude envelope. Now this is important when we get to MP3s, because the way we perceive these sounds, uh, that has everything to do with the way the MP3 was designed. The whole point of the MP3 was to try and create a small file size to represent what we can hear and kind of ignore everything else. We'll get to that in a little bit more time.

12:40

So this is really interesting to me. If you take a sound and you double its amplitude, you increase the amplitude by twofold, a listener would not necessarily feel that the sound is twice as loud. Human hearing is incredibly subjective, and typically for most listeners, it would require much more than doubling the sound's amplitude for them to feel that the sound itself was twice as loud. This perception of volume is important when we get to the lossy formats

13:14

for audio files. Now I've given you all this information, and I know everyone is probably thinking, you know, I learned this in primary school, elementary school. All of this is really familiar to me, and you're maybe rolling your eyes because it's so basic. But I think it's important to have that refresher so that you can understand the difference between sound as we experience it and sound as the way we encode it digitally and replicate it digitally.

13:46

For one thing, this illustrates how sound in the real world is a continuum. It's a continuum both in frequency and amplitude. You can have sound changing in frequency very smoothly from one pitch to another. You can also have sound increase or decrease in amplitude in a very smooth way. And it is continuous, it's unbroken. It can have smooth transitions. And these qualities provide challenges when we want to describe something digitally because at the heart of digital information is

14:23

the bit, the basic unit of information. It is a unit of information that only has two states, zero or one, essentially off or on. When you get down to defining information in just two states, then you start to look at something that is continuous and you realize this is going to be a challenge. How do I describe a continuous experience in very discrete amounts of information? And that's when we get to the methodology we've developed

14:57

to digitally encode sound. I'm going to get into that in just a minute, but before I do that, let's take a quick break to thank our sponsor. All right, let's get back into it. So we've talked about the nature of sound. Analog sound, by the way, tries to replicate exactly what we would experience in nature. It tries to create this continuous experience, so you get these smooth waves of frequencies and amplitudes. And that's why some people argue that analog styles of sound recordings are

15:44

superior to digital ones. I don't necessarily think they're right, but they often feel that way. So something like a vinyl album, which is an analog format of digital, or sorry, an analog format of music storage, I should say sound storage, uh, they think that that is superior to, say, a CD, which is a digital storage format. Uh, and who's to say? I mean, like, if your sense of hearing is incredibly well tuned, you might be able to pick up on

16:18

some differences. Or if someone did a really terrible job encoding music digitally, then that might reveal itself to you as well. Uh, but this is one of those things where I think a lot of people feel they can tell the difference, but if they would do a double blind test, they might be surprised at how difficult it is. If everything's working the way it should, then there shouldn't be a perceptible difference. At any rate, digital

16:48

audio has two really important factors: sample rate and bit depth, or to another extent, bit rate. We'll talk about bit rate as well. So the sample rate refers to how many times you reference an analog sound to create the digital version. So sound, like I said, is uninterrupted in the analog world. You've got that nice waveform in the analog world. That's not how the digital world works. In the digital world, we have to describe that sound in a

17:21

series of discrete snippets of sound. It's probably easiest to describe this with an analogy to movies on film. If you work with film, like you're creating a movie on film, then you know that you're not looking at a real moving picture when you see the film played out at the cinema. Instead, what you're looking at is a series of photographs. If you take a film strip and you look at it under a light, you'll see it's one

17:56

after another photograph. It's just a series of pictures. It's only when you play them back at the right speed and you project them onto a screen that you get the illusion of continuous motion. But it's not really continuous. It's just this series of photographs played at twenty four frames per second in the case of actual film. So that ends up being very analogous to the way we encode digital audio. You take the analog recording and you take

18:26

snapshots of sound. The more frequently you take those snapshots, the higher your sample rate. So in other words, if you did one a second, your sample rate would be awful. You would have a sample rate of one. But the higher the sample rate, the closer your digital representation will

18:43

be to the frequency in the analog sound format. Actually, what's really important to remember is that your sample rate has to be about twice, actually does have to be at least twice, what the highest frequency sound is in your recording. It has to be, because if it's not, it cannot encode that sound accurately. It's kind of interesting, and you might wonder, how do we take these snapshots in

19:09

the first place? Well, if you're capturing audio, let's say we're recording to digital. So we've got a microphone set up and we're recording to digital media storage. Like, let's just say we're recording straight to someone's hard drive. So we're talking into a microphone, recording to a hard drive. So you're using an analog microphone, let's say. You would need an analog to digital converter. Now, this particular component can receive discrete voltages from another device like your microphone.

19:41

So your microphone is converting sound into, uh, differences in voltage. That's essentially how it communicates, so that it can then send that to some other element. In this case, it's sending it to the analog to digital converter so that it can be stored digitally on your hard drive. So this analog to digital converter references, or samples, the discrete voltage many times every second in order to create a

20:12

digital representation of the analog sound. It converts the voltages into numbers in a process called quantization, and we express those numbers in bits, so these are zeros and ones. When you want to play the digital audio, a digital to analog converter does the same process in reverse. So it takes this digital information, these zeros and ones, and converts it into a series of discrete voltages, which then can be amplified and sent to a speaker and create sound.
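Here is a minimal sketch of what that sampling and quantization step might look like in Python. It is not how any particular converter chip works. The voltage range of minus one to plus one, the 8000 hertz sample rate, the eight bit codes, and the `microphone` function standing in for a real signal are all assumptions chosen to keep the example small.

```python
import math

def quantize(voltage, bits, v_min=-1.0, v_max=1.0):
    """Map a voltage in [v_min, v_max] to an integer code from 0 to 2**bits - 1."""
    levels = 2 ** bits
    clamped = min(max(voltage, v_min), v_max)
    fraction = (clamped - v_min) / (v_max - v_min)
    # The top of the range would land on `levels`, so cap it at the last code.
    return min(int(fraction * levels), levels - 1)

def sample_and_quantize(signal, sample_rate_hz, duration_s, bits):
    """Read signal(t) at evenly spaced times and quantize each reading."""
    count = int(sample_rate_hz * duration_s)
    return [quantize(signal(n / sample_rate_hz), bits) for n in range(count)]

def microphone(t):
    """Hypothetical stand-in for a microphone voltage: a steady 440 Hz tone."""
    return 0.8 * math.sin(2 * math.pi * 440 * t)

codes = sample_and_quantize(microphone, sample_rate_hz=8000, duration_s=0.25, bits=8)
```

Each entry in `codes` is one snapshot of the voltage, stored as a whole number that fits in eight bits. A digital to analog converter would walk the same list in order and turn each number back into a voltage.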

20:44

So all of that's really important. But now let's talk about some concrete examples, and the best way to do this is to go with compact discs, because we have a standard sample rate for compact discs, and that standard sample rate is forty four point one kilohertz to create CD quality audio. That means that the audio is sampled forty four thousand, one hundred times every second. Wait, I hear you say, the range of human hearing, you said, only goes from twenty hertz to twenty

21:15

kilohertz. If it only goes up to twenty kilohertz, why are you sampling at forty four thousand, one hundred times every second? If it's twenty thousand times a second for the frequency, why go up to forty four thousand, one hundred? Is there some relationship between that and the CD sample rate? And the answer is yes. So there is a theorem called the Nyquist-Shannon sampling theorem, and that states that the sample rate must be at least twice the maximum frequency of a recording in order to describe the

21:46

frequency properly. So the general thought is the maximum frequency most humans can hear is twenty kilohertz. And for that reason, Philips and Sony, when they were working to create the CD format to make it a standard, they decided on forty four point one kilohertz as that standard sample rate for CD audio. It was more than double the top frequency generally considered to be in the upper level of human hearing. But what happens if you were

22:11

to lower the sampling rate? What if you sampled at, let's say, sixteen kilohertz, so sixteen thousand times a second you sample it? Well, that means you would only be able to record and replicate any sound with a frequency of eight kilohertz or less, so eight thousand hertz or less. But if you had any sound that was greater than eight thousand hertz or eight kilohertz, anything higher than that, it would be folded down to fit below the eight

22:46

kilohertz limit. Perceptually, that means the sounds you would hear in the playback could include frequencies that were not present in the original performance of that sound. So let's say that I'm using a sample rate of sixteen, uh, you know, kilohertz, and someone is playing a musical instrument and they play a note that's at a nine kilohertz frequency. Well, because I'm sampling at sixteen kilohertz,

23:15

my limit for frequencies is eight kilohertz. If you play something at nine kilohertz, what happens is the recording seems to fold the sound back, and it folds it back by the same amount that the sound goes over the sample rate, or rather the Nyquist limit, I should say, not the sample rate itself, but the Nyquist limit. So a nine kilohertz sound is played. My limit is eight

23:45

kilohertz. Well, nine kilohertz is one kilohertz more than eight, so it folds it back and the sound you would hear on the recording would be seven kilohertz. So the original sound is nine kilohertz. The playback sound is seven kilohertz, and you would

24:03

hear something recorded that wasn't actually played. That's why you have to have a really high sample rate, so that you don't have these instances where sound gets folded back into the frequency range, because otherwise what you are hearing is not an accurate representation of what was actually generated, what you were trying to record. This whole phenomenon, by the way, is called foldover or sometimes aliasing.
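The folding arithmetic from that example can be written out as a small Python function. This is a sketch for a single pure tone only, and the function name is made up for illustration.

```python
def alias_frequency(tone_hz, sample_rate_hz):
    """Return the frequency heard on playback after sampling a pure tone."""
    nyquist_hz = sample_rate_hz / 2
    # Sampling cannot tell apart tones that differ by a whole sample rate.
    folded = tone_hz % sample_rate_hz
    # Anything above the Nyquist limit mirrors back below it.
    if folded > nyquist_hz:
        folded = sample_rate_hz - folded
    return folded

print(alias_frequency(9_000, 16_000))  # 7000
print(alias_frequency(5_000, 16_000))  # 5000
```

The nine kilohertz tone sits one kilohertz above the eight kilohertz limit, so it comes back one kilohertz below it, at seven. The five kilohertz tone is already under the limit and passes through unchanged.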

24:33

So that's sample rate. But then we've got bit depth. Now, this is all about measuring the volume or amplitude of a sound. So you have a range. You just make an arbitrary range to say, like, we're gonna go quietest to loudest, and you just define what that range is. It could literally be any range. Let's say you say zero to one hundred. Zero is dead silence, no sound at all. One hundred is as loud as the sound ever gets.

25:02

It's the peak volume of sound. That means you can describe all the different volumes within that recording as a number between zero and one hundred. But let's say you take that same recording and instead of making the range zero to one hundred, you say it's zero to two thousand. You haven't made the volume louder. The volume is still the exact same as it was when you called the range zero to one hundred. But what you have done is added more units. You've created more precise steps between

25:38

absolute silence and as loud as it gets. So you've just increased the size of the range so that you can be more precise in the differences in volume. And this is really important. So let's say that you've got a sound that you rank at seventy eight and another sound that you rank at seventy nine, and that's gonna be the same for both of these ranges. Uh, just

26:01

two different examples, actually. So you've got your zero to one hundred range, and a seventy eight would be seventy eight percent of the loudest sound in the entire recording, and a seventy nine would be seventy nine percent of the loudest sound in the entire recording. That's actually a pretty hefty jump. But let's say we instead went with that zero to two thousand range and you still had seventy

26:25

eight and seventy nine. Well, seventy eight would represent three point nine percent of the full volume and seventy nine would represent three point nine five percent of the full volume. In other words, you'd be able to mark much more subtle differences in volume, and that means you can have more nuance in your recording. And since we're talking about a natural sound to start off with, so you're taking a natural sound and you're trying to digitize it. Smooth

26:55

changes in amplitude are possible in natural sound. Using a broader range to describe the volume is best if you want to get an accurate representation or resolution of that sound. Going back to that zero to one hundred range, changes in volume would be more chunky. Two sounds that have slight differences in amplitude would end up being defined as being identical because you wouldn't have the precision. You know, you couldn't say this one's seventy eight and a half. It

27:25

would either be seventy eight or seventy nine. So you could have two sounds that in greater precision you could tell the difference between their volumes. But if you have that lower, that more shallow bit depth, you wouldn't be able to tell the difference. You would lose that nuance, that subtlety. This is part of the reason why people say, like, a lot of the modern music has, uh, lower ranges and changes in volume, like the

27:53

loudest loud parts and the softest soft parts. That range has decreased over time, which a lot of people have argued has meant that music has gotten less complex and therefore, in some minds, less interesting. That's on a related, uh, kind of philosophy to what I'm talking about here. So you want to have those smaller steps between each unit so you can create greater resolution, more smoothness to

28:23

the recorded audio. And it's actually the bit rate in CD audio that will help make the sound seem smooth. So if you ever listened to eight bit music, you know, like the kind from old video game consoles, that sound is really harsh and sort of chunky. It has an appeal, but it's not, you know, it's not smooth at all. It can create an amazing effect, but if you want to represent true analog sound, it's not awesome. But if you went up to sixteen bit, that's CD quality bit depth,

28:59

it's much better. Uh, professional recording studios will do twenty four bit or thirty two bit because they're gonna do a lot of post processing work on those audio files. And when you do that post processing work, if you do it at sixteen bit, the stuff you're doing, the changes you make, can become noticeable, and most times you don't want that. You don't want it to, you know, you don't want it to stand out from the rest

29:24

of the audio file. But that's the only reason they go up to twenty four bit or thirty two bit. There'd be no point in playing it back at that rate, that bit depth, because human hearing is not so adept to tell the difference, at least not for most humans. So if you played back a recording at sixteen bit and another one at twenty four bit, and it's the same piece, most people would not be able to tell the difference because you've already reached a resolution that equals

29:55

the precision of human hearing. Keeping in mind, again, human hearing is subjective. Not everyone is equal. There's some people who have incredible hearing who may be able to pick out that difference. I am not one of those people, but I am a person who's going to tell you. We'll get to the last section in just a bit, but first let's take another quick break to thank our sponsor.
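To put numbers on the volume ranges from this segment, here is a short Python sketch. The function names are made up for illustration; the ranges of one hundred and two thousand and the values seventy eight and seventy nine come straight from the example.

```python
def percent_of_full_volume(value, range_max):
    """Express a volume reading as a percentage of the loudest possible reading."""
    return 100 * value / range_max

def step_size_percent(range_max):
    """How much of full volume one step covers on a zero-to-range_max scale."""
    return 100 / range_max

print(percent_of_full_volume(78, 100))   # 78.0
print(percent_of_full_volume(78, 2000))  # 3.9
print(percent_of_full_volume(79, 2000))  # 3.95
print(step_size_percent(100))            # 1.0
print(step_size_percent(2000))           # 0.05
```

On the zero to one hundred scale, the smallest change you can write down is one percent of full volume. On the zero to two thousand scale it is five hundredths of a percent, which is why the wider range can capture subtler changes.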

30:27

All right, so bit depth, what we just talked about, that can be thought of as how well the sound is described, and the sampling rate is how frequently or how much the sound is described. And CD audio quality has sixteen bit audio. That means that they actually have sixty five thousand, five hundred thirty six different levels of

30:50

volume that they can describe within an audio track. So my example of zero to two thousand, that is primitive compared to CD audio, because it has the sixteen bit style, sixty five thousand, five hundred thirty six different levels. And how is that possible? Well, when we say sixteen bit, remember, a bit represents two states, zero or one. So you take the number two and then you raise it to the power of sixteen. Uh, so you multiply two by itself sixteen times and you get sixty five thousand,

31:27

five hundred thirty six. So that's where that number comes from. Now, with your digital sample, you have a collection of points that roughly replicate the shape of an analog sound wave. It's gonna look a little funky, but you'll be able to see what the frequency and amplitude generally was of the original recording if you were to

31:48

plot this on an x y axis. But if you were just to connect each successive point with a straight line, even as close together as they would be, because you're looking at forty four thousand, one hundred times a second, it would sound pretty awful. So we actually use an algorithm called interpolation to join the points smoothly to imitate a sound waveform, and that gives a musical playback program

32:13

the ability to replicate an analog waveform. And that's actually called pulse code modulation or PCM. And if you store audio, uh, intact this way, you would have what we call a lossless audio file, which means exactly what it sounds like. None of that data would ever get filtered out of the file. Even if the sounds were beyond the range of human hearing, they would be recorded and you would have a lossless file format. Those files tend to be quite big, depending upon how long

32:47

a recording you make, of course. All right, so now here's where it gets a little confusing. And I think I even said bit rate a couple of times when I really meant bit depth earlier. But up to this point, I really was talking about bit depth. So my apologies to all of you out there if a bit rate slipped through, because I did not mean it. Now I'm going to talk about bit rate and show you how it's different

33:10

than bit depth. Bit rate refers to the amount of data audio uses per second, or requires per second of recording, and you derive bit rate from the bit depth and the sampling rate. It's represented as bits per second. So again, let's go to CD quality sound. That makes it easy. You have forty four thousand, one hundred samples per second. You've got sixteen bits or two bytes, because remember a byte is eight bits, so you've got two bytes to describe each sample. So

33:45

two bytes for forty four thousand, one hundred samples per second. Uh, plus you probably are gonna have to multiply that by two because you're probably recording in stereo, so you have to do that once for each track. So you get that number, then you have to multiply that by sixty seconds to determine how much data per minute you are creating when you're recording, and with CD quality audio, that ends up being about ten megabytes of data per minute. Now these days that's not really that big a deal because we're dealing with

34:18

super fast internet speeds and enormous hard drives. But just a few years ago, that was considered to be a really sizeable file, I mean an enormous file, and so if you wanted to find a way to distribute digital audio so it didn't take up too much space, you had to figure out how you could compress those files and make them smaller, make them more manageable. And now we can finally get back to Germany and Herr Brandenburg. You thought we left him behind. We didn't. He was

34:52

just part of a flashback. So let's go to the MP3. First of all, it gets its name from the Moving Picture Experts Group, also known as MPEG. It was part of a project that MPEG was doing that was looking at ways of compressing audio, along with the work that they were doing with video files. It's actually named after the process that they developed, called MPEG Audio Layer three. So yes, there was a layer one

35:21

and a layer two. Layer three was a refinement of the approach and was the one that was actually successful in the market. Now, Brandenburg was working with an instructor. He was pursuing, Brandenburg was pursuing a PhD at the time and trying to come up with a practical means of transmitting digital audio across phone lines, and in the process he began to experiment with algorithms that could take digital audio information and determine which bits are significant. Anything

35:51

that was deemed insignificant could be discarded. So the thinking was that information we cannot perceive as human beings is worthless. There's no point in preserving it in an audio file format. It's just taking up space with something we can't even perceive when we play it back, so there's no reason to replicate it, there's no reason to record it. Leave it out,

36:12

and that way you could compress digital audio files. Or to put it another way, if the algorithm determined that a sound was outside the range of human hearing, it would drop it from the encoding process, so you get a sound file much smaller than the more accurate representative version. So the lossless version would be more accurate to the

36:32

original sound. But this new version, what we would call a lossy version, a compressed file, would be able to replicate it pretty well if it's designed properly, and maybe, if you design it well enough, to the point that you couldn't tell the difference between the two. Uh. That

36:50

took some time. That was not easy to do. So the new file, the new version, the compressed one, the lossy format, would only have the actual relevant data, and from that point forward, the challenge was to determine what are the benchmarks to figure out what is relevant versus what is irrelevant, because if you lose too much information, you change the quality of the recording, meaning it's no

37:16

longer an accurate representation of the original sound. So you might say that any sound below twenty hertz isn't relevant because it's below the range of your typical human's ability to hear. You might say that anything above twenty thousand hertz or twenty kilohertz is irrelevant because humans typically can't hear sounds above that frequency. You might say that sounds at a certain amplitude or lower are irrelevant

37:46

because they're so quiet that humans wouldn't hear them. Or you might say that if a certain sound is at a lower amplitude and a different sound is at a higher amplitude, the higher amplitude sound is drowning out the lower amplitude sound, and so we humans don't really perceive the lower amplitude sound. This is where we get into psychoacoustics. It's not just what we hear, but how we perceive

38:11

the sound itself. And a lot of that went into formulating the algorithms to figure out how to compress this music in a way where you get a recording that represents the original without, you know, compromising too much and still getting the file size to a manageable size. And these are the decisions you have to make to figure out which bits of information you keep and which ones

38:35

you ditch. Brandenburg and a team were working on refining this approach in the late eighties and early nineties, and he said, at one point he thought he had nailed it, and then he heard an a cappella song. It was Tom's Diner by Suzanne Vega. And then he listened to the compressed MP three version of that song, using the version of MP three that had been developed up to that point, and he said it ruined the song. It

39:05

trashed it. It sounded terrible. He said that other representations of music seemed fine with this particular approach, but when they went with this stripped down a cappella song, with this particular kind of, you're in the middle of a space

39:19

listening to Suzanne Vega sing, it ruined her voice. And so the team began to tweak the compression algorithms to correct for this problem, and it took a lot of work to figure out, okay, well, what are the elements of sound that we messed with that have created this issue, and ultimately they were finally able to create an MP

39:39

three file that didn't distort or ruin the recording. Brandenburg said he listened to that song somewhere between five hundred and a thousand times, and then he saw Suzanne Vega perform live and he was able to recognize all of those subtle changes in her voice because he had paid such close attention to it during the process of tweaking this algorithm. He said, ultimately, the real telling thing is

40:06

he still enjoyed the song, which says a lot about him. Me, I can't stand that song, but maybe it's just because to me there's a point where it just sounds like someone is just singing about what they're doing. And I do that every day. No one gave me a record deal. All right, so getting back to MP three, they had finalized the file format and created the standard, but it was just one of several possibilities for encoding audio and it didn't

40:35

immediately take off. It wasn't immediately adopted by consumers. The team had identified the Internet as a possible distribution method for MP three files, rather than just over telephone lines. They said, well, technically we could send MP threes across the Internet, so you could send manageable sized files across this network. On July fourteenth, nineteen ninety five, they created the file extension dot MP three. Now it would take a little bit longer for software to take advantage of this.

41:10

One of the early programs was Winamp, which made MP three decoding accessible, and from that point the file format began to take off. To follow would be dedicated MP three players and sites that allowed people to upload and download compressed audio files, which also indicated a rise in piracy, and then in response to the rise in piracy, we saw an increase in DRM strategies, digital rights management or copy protection if you prefer, and that all really ended up shaping a lot of the policies

41:44

and strategies that affect the Internet today. So you could say that the MP three is one of the reasons why the Internet is the way it is right now, and why arguments both for and against net neutrality have been formulated in certain ways. A lot of it is shaped by the MP three. So that kind of wraps up this discussion about digital audio in general and a

42:10

little bit on MP three files. In the next episode of this series, I will dive into a more technical explanation of what is actually going on with the MP three compression algorithms, and I bet you can't wait to learn all about fast Fourier transforms. I know I can't. And like I said, I have other episodes to sprinkle in between this one and the next one and then the third one, so that way you won't just get

42:36

digital audio overload. And if you guys have any comments or questions or suggestions for show topics or people I should interview, or maybe people I should have on as a guest host, shoot 'em my way. My email is tech stuff at how stuff works dot com, or you can always drop me a line on Facebook or Twitter with the handle tech stuff HSW, and I'll talk to you guys again really soon. For more on this and thousands of other topics, visit how stuff works dot com.
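
The CD quality arithmetic from earlier in the episode (bit depth times sampling rate, times two for stereo, then times sixty seconds) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and counting a megabyte as one million bytes is an assumption the episode doesn't spell out.

```python
def bit_rate_bps(sample_rate_hz, bit_depth, channels):
    """Bits of uncompressed audio produced per second of recording."""
    return sample_rate_hz * bit_depth * channels


def megabytes_per_minute(sample_rate_hz, bit_depth, channels):
    """Uncompressed data per minute, assuming 1 megabyte = 1,000,000 bytes."""
    bytes_per_second = bit_rate_bps(sample_rate_hz, bit_depth, channels) / 8
    return bytes_per_second * 60 / 1_000_000


# CD quality: 44,100 samples per second, 16 bits (2 bytes) per sample, stereo.
cd_bit_rate = bit_rate_bps(44_100, 16, 2)
cd_mb_per_min = megabytes_per_minute(44_100, 16, 2)

print(cd_bit_rate)              # 1411200 bits per second
print(round(cd_mb_per_min, 2))  # 10.58 megabytes per minute
```

That works out to roughly the "about ten megabytes of data per minute" figure mentioned in the episode.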
