Techstuff Classic: How MP3 Compression Works

Speaker 1

00:04

Get in touch with technology with TechStuff from HowStuffWorks.com. Hey everybody, and welcome to TechStuff. I'm Jonathan Strickland. I'm the host of the show, and this is a Saturday morning rerun episode where we take a classic episode of TechStuff and we present it to you guys who may have missed it. I've been talking a lot about tech and music recently. If you've been listening to the recent episodes, you know all about that,

00:31

and there have been some great discussions. But it also requires a little bit of knowledge of previous episodes at times, and I know it can be tricky to dig through the archives. So in this classic episode, I talk about how the MP3 compression format works, so that you can actually understand how MP3 works as opposed to something like MIDI, and you can get an appreciation for the differences between the two formats. This episode

00:58

originally published in January two thousand and seventeen. That's a whole year ago, more than that. Now we're in April twenty eighteen as I record this. I hope you enjoy this classic episode. I hope it gives you a deeper appreciation of the technical aspect of creating digital music, and I'll see you guys on the other side. So let's remember that the heart of digital information is the bit. That's either a zero or a one. The basic

01:28

unit of information for digital formats: zeros and ones. Now we can use those zeros and ones to describe all sorts of information, from text to audio to video, and really pretty much anything you can think of that's represented digitally. Ultimately, when you get down to it, it's a bunch of zeros and ones. So let's say you start off with your uncompressed audio file. You've got this enormous audio file in front of you. It's made up of zeros and ones.
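
To put a rough number on "enormous", here's a minimal sketch. It assumes CD-style audio: forty four thousand, one hundred samples per second, sixteen bits per sample, and two channels for stereo. Those figures aren't stated at this point in the episode, and the function name is made up for illustration.

```python
# Rough size of uncompressed audio, counted in zeros and ones.
SAMPLE_RATE_HZ = 44_100   # samples per second (CD audio)
BITS_PER_SAMPLE = 16      # assumption: CD-style 16-bit samples
CHANNELS = 2              # assumption: stereo


def uncompressed_bits(seconds):
    """Number of bits needed to store `seconds` of raw audio."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS * seconds


bits = uncompressed_bits(180)  # a three minute song
print(bits, "bits, or about", round(bits / 8 / 1_000_000, 1), "megabytes")
```

Under those assumptions, a three minute song comes to a few hundred million bits before any compression happens.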

01:57

How do you make that file smaller? So in the physical world, we can compress stuff, right? We can apply physical pressure to things. Think about packing a suitcase. You can make sure you get that extra outfit in if you just press it down hard enough and get that zipper zipped before it can burst open. But once you get to a certain level of compression, you cannot make things smaller, at least not without hurting yourself or whatever it is

02:21

you're trying to compress. Digital files are a little different because you cannot physically cram the zeros and ones closer together. It doesn't work like that. These are abstract things. You can't make them smaller, right? You can't decrease the font. It doesn't work that way. The numbers represent two different states. So if you want to create a smaller audio file containing the recording that was in a larger audio file,

02:47

you have to start getting creative. Now, in the last part of this series, I talked about how the MP3 compression algorithm was born from an applied research institution in Germany, and the team behind the MP3 wanted to find a way to compress audio, specifically music, for transmission over phone lines. Eventually this evolved into the Moving Picture Experts Group Audio Layer III compression methodology, better known as the MP3, and there are also MPEG-2

03:18

and MPEG-4 standards. MPEG-2, by the way, is the basis of compression on DVDs, although the actual DVD format is really a modification of MPEG-2. And MPEG-4 is a compression strategy for audio and video that's frequently used in lots of different capacities, including streaming media services. So by the late nineteen seventies, researchers began to explore the possibility of leveraging psychoacoustics to figure out

03:42

how to compress audio. And psychoacoustics refers to the way we perceive sound, and also the physiological effects of sound on us. So this involves not just our physical sense of hearing, but also our brains and the way our brains interpret sound. So, for example, there's a psychoacoustic phenomenon that's called the Haas effect, H-A-A-S, and I think it's pretty interesting. So

04:08

here's how the Haas effect works. If you hear the exact same sound coming from different directions, but the two sounds arrive within thirty to forty milliseconds of each other, your brain will be convinced that you really only heard one sound and it came from the direction that hit you first. So let's say a sound's coming from directly in front of you and from your left, and you get both of them within that thirty to forty millisecond range, and you hear the one coming from ahead of you

04:39

first. You're convinced that you only heard that sound once and it came from dead on straight ahead of you. Your brain kind of discounts the one that came from the left, although it can reinforce it, which ends up being really useful if you're planning out PA systems for stage shows. I'm not joking. That really is the way that people plan those things out. It's pretty neat. Humans perceive sounds in a way that's

05:04

not necessarily representational of all the sounds surrounding us. You can think of your brain as the filter between your understanding and what reality actually is. A lot of stuff goes on where it ends up getting rid of information. Your brain just says, you know what, he or she doesn't need that, it's just gonna confuse things, we're gonna dump it. And that's kind of how it works. It's all on an unconscious level. It's not like you're

05:30

actively working to do this. So let's say you're in a relatively busy hallway, and there could be a lot of sounds in that hallway, stuff that's going on constantly around you. Maybe there are doors opening and closing. Maybe there are footsteps going up and down the hallway. Maybe someone's shoes are squeaking against the linoleum floor. People are chattering away in there. But you are having a conversation with someone, so you turn your focus on that person and other

05:57

sounds seemingly fade away. They're still there, but they're not important. So in this example, you would actually call those other sounds a distraction, and you would really focus on the conversation. That also shows how we're able to consciously direct our sense, our perception of hearing. So both of these factors come into play. Now, one thing that MP3 encoding takes advantage of is something called masking, and there are a couple of different variations of the masking effect. One

06:27

of them is called frequency masking. So let's say you've got two sound frequencies that are similar, perhaps they're just a few hertz apart. Remember, frequencies are measured in hertz, which is really the number of oscillations per second. So let's say you've got a sound that's at, I don't know, one thousand hertz, and another one that's at one

06:52

thousand and ten hertz. Now, the human ear is precise enough to be able to tell the difference between two sounds that are at least two hertz apart from each other. That's how precise our resolution of hearing is. It's at that level. But if you get two sounds played at the same time and they are that close together in frequency, and one of those frequencies is played at a greater volume than the other, our brains will pick up on the louder sound and ignore the quieter sound,

07:23

even though both of them are present. What becomes important at that point is the amplitude. Now, the further apart in frequency you get, the less of an effect that has. So if there are two pitches, one of them noticeably louder than the other, but they're far enough apart, you will hear both of them. It only works if the two pitches are relatively close together,

07:45

and there's not a universal formula for frequency masking. As you get closer to the boundaries of human hearing, frequency masking becomes easier, so if it's a really low pitch or a really high pitch, it's easier to get away with it. Once you start getting into what is thought of as the sweet spot for human hearing, which is generally considered to be between two and five kilohertz, you need a greater difference in volume or a smaller

08:10

difference in frequency in order for masking to work. Frequency masking, at any rate. But then there's also temporal masking, and you might say, okay, I got it, temporal, that means time. Indeed it does, my friend. This describes the effect of a short but loud sound masking a softer sound for a short time. The weird thing is the loud sound can actually mask sounds that precede it slightly, not

08:37

by a whole lot, but a little bit. MP3 compression takes advantage of both frequency and temporal masking when it's trying to determine which data needs to be included and which data can be dumped, because it won't affect your perception of whatever the audio file is in the first place. So you also probably remember I talked about the physical limitation to what we humans can hear, no matter what our brains might be up to. So this doesn't have to do with our brains, you know,

09:04

filtering through the information that's coming in. This has to do with the physical limitations of the human ear. In the last episode of the series, I said typical human hearing.

09:14

Keep in mind, typical, there are exceptions. It covers the range of frequencies between about twenty hertz and twenty kilohertz, or twenty thousand hertz. So twenty to twenty thousand. Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right? And as you get older, your ability to perceive those higher frequencies starts to diminish. So most adults actually have an upper range closer to sixteen kilohertz, not twenty. Kids,

09:44

they can hear those higher pitches. You may have heard the story about how some convenience stores experimented with getting rid of teenage loiterers by projecting out these super high pitches that adults could not hear but kids could, and it discouraged kids from hanging out at the convenience store and loitering. I love that idea so much. Anyway, that's because I'm old and my hearing

10:13

is terrible. Well, remember I also mentioned you can detect changes in pitch at two hertz increments. If you get below two hertz in change, like if it's just a one hertz difference between two frequencies, it's too small a difference for us to detect. To us, it will sound exactly the same. So if you were to hear a frequency at one thousand one hertz, or one point zero zero one kilohertz, and one at one point zero zero two kilohertz, you wouldn't notice the difference. They would sound

10:47

exactly the same to you. So if you're gonna take audio and compress it, one step you could consider is eliminating anything that's outside the actual range of frequencies that we can hear, or simplifying any changes in frequency that are smaller than two hertz. If you take all that data and you say it is physically impossible for a human to perceive this, get rid of that information, then in theory it wouldn't have any effect on the

11:14

rest of the recording. But how do you go further than that, right? How do you create a method so that you can really compress this file? You want a method that will preserve the important sounds while potentially ignoring all the unimportant or incidental sounds. And you want it to be automatic, because if you have to do it manually, then that's going to take countless hours just to edit a single sound file. So that was the challenge that the MP3 research team

11:44

faced as a group. Now, their solution, which ultimately created even more challenges, was to come up with what was essentially a simulated human ear and brain. They needed to replicate the experience of perceiving music so that an algorithm could evaluate every sound in an audio file and judge if it was in fact relevant enough to include in the final compressed version. If a sound were imperceptible, then it wouldn't make sense to include it in the MP3 file.
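
Here's a toy sketch of that kind of judgment, combining the hearing range and the frequency masking idea from earlier. It's only an illustration under loud assumptions: the function name, the fifty hertz gap, and the twenty decibel loudness gap are invented for this example, and as noted above there's no universal formula for frequency masking. It is not the psychoacoustic model a real encoder uses.

```python
def audible_components(components, max_gap_hz=50.0, min_level_gap_db=20.0):
    """Keep only the tones a listener would plausibly notice.

    components: list of (frequency_hz, level_db) pairs.
    The two thresholds are made-up illustration values.
    """
    # Physical limit: drop anything outside roughly 20 Hz to 20 kHz.
    in_range = [(f, db) for f, db in components if 20.0 <= f <= 20_000.0]

    kept = []
    for f, db in in_range:
        # A tone counts as masked if a much louder tone sits close by in frequency.
        masked = any(
            abs(f - other_f) <= max_gap_hz and other_db - db >= min_level_gap_db
            for other_f, other_db in in_range
        )
        if not masked:
            kept.append((f, db))
    return kept


tones = [(1000.0, 70.0), (1010.0, 40.0), (5000.0, 40.0), (25_000.0, 60.0)]
print(audible_components(tones))
```

In this made-up input, the quiet tone right next to the loud one gets dropped, the equally quiet tone far away in frequency survives, and the tone above twenty kilohertz is thrown out because nobody could hear it anyway.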

12:15

So by leaving out all the irrelevant data, they could make the audio information take up less bandwidth. The file itself would be smaller because you just dumped everything that wasn't important. So the team used an algorithm called Low-Complexity Adaptive Transform Coding, or LC-ATC, as the foundation for their research. This was kind of their starting point, and this is an approach that tries to do away with redundancy as much as possible,

12:43

and it also incorporates adaptation to perceptual requirements. Also, MP3s owe a lot to the MPEG Layer II standard. So Layer II obviously came out before Layer III, and so a lot of the features of Layer III are really legacy features from Layer II. In other words, the MP3 group kind of got stuck with them because otherwise they would have had a problem

13:09

with backwards compatibility. So the result is kind of a clunky arrangement under the hood, and some of the features may make very little sense when I go through them, but some of that is because it's a holdover from an earlier compression strategy, which isn't terribly satisfying as an answer. But the reason many parts of the MP3 compression algorithm are the way they are is because that's the way we've always done it. So next I'm gonna dive

13:35

into the phases of compression. But before I do that, let's all take a deep breath and take a moment to thank our sponsor. And we're back. So there are two big phases we'll need to talk about with MP3 compression. The first phase is analysis and the second phase is the actual compression itself. And after that there's the process of decoding an MP3 for playback. But that's way simpler once we get an understanding of how

14:13

the encoding process actually happens. So let's begin with analysis. Now, this is the part where the encoder has to figure out which frequencies within an audio range, or recording rather, are important or perceptible. So how does a program, an encoder, figure out what we can hear and what we cannot hear? All right, time to get technical. So you start off with your pulse code modulation audio file, or PCM file.

14:45

And you might remember I talked about PCM audio in the first episode of this series, but just in case you don't, it's a lossless digital audio file. The actual format could be a WAV or AIFF or something along those lines, but the important thing to keep in mind is that it is uncompressed. Now, that means those files tend to be pretty big. This is our raw material that we want to take and squish down

15:09

to a more manageable, transferable size. And in our last episode in this series, I also mentioned that the standard for CD audio is a sample rate of forty four point one kilohertz. And we learned that you need a sample rate at least twice the frequency of the highest frequency in your recording, and since human hearing tops out at around twenty kilohertz, the standard for

15:32

CDs is forty four point one kilohertz. The MP3 standard can support lots of different sample rates, but forty four point one kilohertz is pretty much the common standard. So you've got a number of samples with your audio file, and that number will depend upon how long the audio file is. You've got forty four thousand, one hundred samples per second, actually twice that for stereo. But for the purposes of this discussion, let's kind of stick with mono sound so that I don't start having math coming out

16:02

of my ears. And we're still in the very easy, simple part as far as math goes. We haven't gotten to the complicated stuff yet. All right, so you've got forty four thousand, one hundred samples per second. To compress it into an MP3 format, the algorithm first groups all of these samples into collections called frames. So take those forty four thousand, one hundred per second, and then you start saying, okay,

16:27

we're gonna group you in batches. Each batch is called a frame, and each frame contains one thousand, one hundred fifty two samples. Now that's specifically to maintain backwards compatibility with MPEG Layer II, which established that one thousand, one hundred fifty two number. But we're not talking about MPEG Layer II. We're talking about MPEG Layer III, and that means we have to get a little more complicated. So each

16:52

frame consists of two subgroups called granules. So each granule has five hundred seventy six samples. Five seventy six times two is one thousand, one fifty two, so five seventy six samples per granule. Now, technically MP3 encoders only work on one granule at a time, but they may reference the granules immediately before and immediately after the current one in order to see how the audio within the file changes over time. All right, so now you've got your granules of five hundred seventy six

17:25

samples each. Then the MP3 encoder runs the samples through a filter bank, which sorts the sound into thirty two frequency ranges. Are you crazy about the numbers yet, Dylan? Are you? Dylan's nodding. Dylan, it gets worse from here. So you have thirty two frequency ranges, which is another nod to the Layer II method, which used those thirty two ranges for encoding purposes. But we're not talking about Layer II, are we? No, we're talking

17:54

MP3, gosh darn it. That means we take those thirty two ranges and we subdivide them by a factor of eighteen. That means we have five hundred seventy six bands of frequencies, each band containing one five hundred seventy sixth of the frequency range of the original sample. So what does that actually mean? This is actually pretty easy. The bands are not limited to a specific number for their

18:21

frequency range, right? The bands don't mean that on band number one it goes from twenty hertz up to a certain range, and on band five seventy six it ends at twenty kilohertz. That's not what it means. They're dependent upon the original audio. So if the original audio contains sounds within a narrow range of frequencies, the five seventy six bands will be more precise. But if the original recording has a vast range of frequencies, the

18:50

bands are less precise. So another way to think about this is with a pizza. So let's say you get an extra large pizza and you cut it into eight equal slices, and then you get a small pizza and you cut that into eight equal slices. Well, in both cases each slice is one eighth of a pizza. But the extra large pizza slice is bigger than the small pizza slice. It all depends on the size of the pizza. So in this case, it depends upon

19:21

the range of frequencies. And Dylan, do you think we could go for some pizza, you know, just put the episode on hold and go get pizza? Dylan's nodding. It's great for audio. Yeah, so, pizza. We'll be right back. Okay, that was good pizza. Now, oh, man, I got a whole bunch more notes. Okay, well, let's go ahead and do the rest of this. All right, so you've got your sound divided up into those five seventy six sub-bands of frequencies, you know,

19:49

the thing I compared to pizza slices earlier. Now you get two different mathematical processes applied to this data. One is the fast Fourier transform, or FFT, and the other is the modified discrete cosine transform, or MDCT. Now, I am not going to dive deeply into how these transforms work, because frankly, they are beyond my mathematical understanding. But I know what they do. I just cannot explain the process, like how they do what they do. So I'm going to give you the

20:24

explanation of what they do, what the outcome of each of these transform processes happens to be, but I'm not going to be able to tell you the actual mathematical steps involved in each because I don't math so good, guys. But let's start with the fast Fourier transform. So a transform is kind of what it sounds like. It's all

20:42

about transforming information in some way. So in this particular case, the FFT transforms the frequency bands we just talked about into data that can be further analyzed by a psychoacoustic model that's in the encoder. So this is that simulated human ear and brain we were talking about earlier. So what the encoder does is it analyzes each bit of data and looks for signs that it represents audio that wouldn't be perceived by a human. So it's

21:14

looking for any potential for masking. So are there collections of frequencies that are grouped close together, and is one of those frequencies louder than the others? You might be able to do away with those softer frequencies because of frequency masking. The encoder will also look at whether or not the audio has a lot of complexity to it, if it has a lot of changes, or if it's

21:36

just relatively steady or simple audio. Any transient sounds that are present in the audio might end up causing temporal masking, so it'll analyze those as well and see if that's a possibility. So really what it's looking for is, you know, just any really loud sounds that stand out above the rest of the recording. That's what the FFT is doing. So what about the modified discrete cosine transform? Well, this is happening in parallel with the FFT,

22:05

and the samples get sorted into different patterns called windows. And the criterion for sorting all has to do with whether the sample represents a steady sound or a varied sound. So if you have a simple steady sound, that goes into a long window. If there's a lot of variation in the sound, like there are a lot of consonants in a vocal line, or it's like a drum solo or something like that, it would get sorted into a series of three short windows. And each short window contains

22:36

one hundred ninety two samples. That amounts to about four whole milliseconds, so four thousandths of a second, in three patterned windows. So you've got these windows now, either long windows for simple sounds or short windows for the more complex sounds, and then the modified discrete cosine transform kicks into gear. It looks at each long window or set of three short windows and converts them into a set of spectral values. To some of you, that probably sounds meaningless. So let's

23:06

talk about spectral analysis for a second. First, I was very disappointed to learn that spectral analysis doesn't involve a psychologist talking to a ghost about its emotional state. So bummer. But spectral analysis is when you look at a spectrum of information, like a spectrum of frequencies or related information like energy states. That's what this transform does. It takes data that originally represented a slice of time in a

23:35

sound waveform. That's what a sample is. A sample is an instant of time in a waveform. And it converts that into information representing sound as energy across a range of frequencies. Now, you can plot out spectral information in a lot of different ways, but one common method is to use brightness to indicate energy levels. Higher energy levels are brighter patches

23:59

in your visual representation of spectral data. High frequencies would appear at the top of a spectral view. Like, imagine a box, and at the top of the box, that's where you would find high frequencies. At the bottom of the box is where you find low frequencies, and it's

24:14

just lots of patches of color. The really bright patches of color represent very high energy frequencies, so they could be high or low in actual frequency, but we're talking about energy levels, not whether it's a high or low pitch. Looking left to right represents the passing of time, and looking along any vertical point shows you the actual frequency or pitch, and then the respective energy level is the brightness.
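
Here's a small sketch of that picture in code: time across, frequency up and down, energy as the value in each cell. To be clear about the assumptions, this uses a plain, slow discrete Fourier transform written out by hand, not the FFT or the MDCT an MP3 encoder actually runs, and the test tones and function names are made up for the example. The only numbers taken from the episode are the sample rate and the five hundred seventy six samples per granule.

```python
import cmath
import math

SAMPLE_RATE = 44_100   # samples per second, from the episode
GRANULE = 576          # samples per granule, from the episode


def tone(freq_hz, n_samples):
    """n_samples of a sine wave at freq_hz, as floats between -1 and 1."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]


def energy_spectrum(block):
    """Energy in each frequency bin of one block, via a naive O(N^2) DFT."""
    n = len(block)
    spectrum = []
    for k in range(n // 2):  # bins from 0 Hz up to half the sample rate
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(block))
        spectrum.append(abs(total) ** 2)
    return spectrum


bin_width = SAMPLE_RATE / GRANULE  # width of one bin in hertz

# Two granules of made-up audio: a lower tone, then a higher tone.
samples = tone(10 * bin_width, GRANULE) + tone(40 * bin_width, GRANULE)

# One column of the "box" per granule; time runs left to right.
columns = [energy_spectrum(samples[start:start + GRANULE])
           for start in range(0, len(samples), GRANULE)]

# The brightest patch in each column is the bin holding the most energy.
loudest_bins = [column.index(max(column)) for column in columns]
print(loudest_bins)
```

Each entry in `columns` is one vertical strip of the box, and the brightest cell in each strip lines up with the tone placed in that granule. A real spectral view would simply draw those energy values as brightness.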

24:42

So it's kind of like looking at sound as a wave, but instead of being a wave, you're looking at information that indicates frequency range and energy level. That representation is actually kind of analogous to how we hear audio. So an encoder can analyze the spectral view and start to

24:58

filter out the data we wouldn't perceive due to psychoacoustics. Now, after all that processing, the encoder looks at the frequency sub-bands and the levels of spectral intensity for each, and that information can then be used for the next phase, which is compression. But right now I think we could all stand a little decompression, so let's take another quick break to thank our sponsor. All right, so now you're ready to compress your analyzed audio. Good for you, and

25:37

by you I mean encoders. This has to be simpler than that analysis segment, right? I mean, that got a little crazy with all the different bands and sub-bands and windows and frames and granules. Sadly, it gets more complicated. All right, so there are two layers of compression going on with MPEG Layer III. One of those layers depends upon the psychoacoustic analysis and the other doesn't. So why

26:07

would you use two layers with different strategies like that? Well, the reason is that one strategy is great for complex audio with lots of components, but not so great with simpler sounds, and the other strategy is kind of the opposite. So the psychoacoustic approach is the one that's really good for complicated sounds. If you've got a lot of volume changes, lots of different frequencies, it's just complicated and

26:31

rich sound. You've got a lot of opportunities to look for masking and other acoustic elements that limit the actual sounds that people perceive. So it means there are a lot of chances for you to fudge by dropping all the stuff that people probably wouldn't notice anyway. And if you take a piece that's got a lot of elements at varying volumes, there are likely several opportunities to

26:54

do this. But if you're talking about relatively straightforward audio with few components, few changes in volume, there's really not a whole lot of data you can ditch without it actually affecting the quality of the audio in a perceptible way. And this is part of what Brandenburg, that guy I was talking about in our first episode in this series, discovered when he was working with the MP3 standard and he was listening

27:22

back to that Suzanne Vega a cappella track, Tom's Diner. He was listening to a compressed version of it, and he said it was terrible. He said it ruined the quality of the audio. And part of that is because that particular song is fairly simple. There's just not a lot of opportunity to take advantage of masking and other tricks without potentially compromising the quality. So they decided to also incorporate some traditional compression strategies, which worked better

27:50

with those types of recordings. So the MP3 format takes advantage of both the traditional approach and the psychoacoustic approach, and that allows the encoder to compress files into a smaller size without just following a single strategy. It doesn't have to do a one size fits all for all elements of audio. Now, combining those two strategies requires a little more mathematical gymnastics. So let's go back to those five seventy six frequency bins. You know, those sub-bands

28:20

we talked about earlier. You gotta quantize those suckers. What does that mean? It means assigning a quantity to each frequency bin. You have to give it a quantity of some sort so that you can end up judging how much data you can get away with dropping. So to do this, the encoder sorts those five seventy six bins into twenty two scale factor bands. How you doing over there, Dylan? Just checking in on you. Okay, Dylan's got a thousand yard stare going. I hope

28:53

you guys are doing okay over there. All right, so before smoke starts coming out of your ears, let me explain what the scale factor bands are all about. The whole purpose of the scale factor bands is to determine how the information will be stored within the compressed state. So you want to get away with as little data as possible before affecting sound quality. So if you can say the same thing in a shorter space without affecting the quality of what it is you're saying, you go

29:22

with it. Brevity is the soul of compression. So if we were talking about language, I would say it's more efficient to say it's raining outside, or even just it's raining, because you would assume that it would be outside where the rain is happening, and it would be inefficient for me to say it's coming down like cats and dogs out there. It's not as efficient as saying it's raining.

29:49

So if you can get away with shorter statements without affecting the actual quality, and you could argue that switching from it's coming down like cats and dogs out there to it's raining changes the quality, and that could be a valid argument. But if you can get away

30:06

with shorter without affecting quality, you do it. So each scale factor band is represented by a quantity. Then the encoder divides that quantity by a given number called the quantizer, which is the same across the entire frequency spectrum for that recording. The resulting number is then rounded up or down to a whole number. And here's an important point. Individual scale factor bands can be scaled up or down for more or less precision to represent the actual value

30:41

of those bands. So what the heck does all that mean? Well, the purpose of dividing and rounding is just to simplify the data to reduce the amount you need in order to store the information. So let's go with a totally

30:53

hypothetical example. Let's say you've got a scale factor band and you've decided you're representing that scale factor band with the quantity seven eight four zero, seven thousand, eight hundred forty, and you've chosen the number one hundred to quantize your data, meaning that you will divide each scale factor band's quantity by one hundred. So this is seven thousand, eight hundred forty. You divide it by one hundred, and the scale factor for this particular

31:24

band, you have determined, is one point zero. That means that once you get that result where you've divided the quantity by the quantizer, you multiply by one. That means there's no change. You multiply by one, you get the same number. More on that in a bit. Okay, so you take that seven thousand, eight hundred forty, you divide it by one hundred. That gives you seventy eight point four. Well, now you have to round that number, so you round

31:48

it down to seventy eight. Now, when you have a decoder and you're ready to play back the information, it comes across this quantity, seventy eight, and it knows what the quantizer number was, so it multiplies by one hundred to get back to seven thousand, eight hundred. So the replicated number is actually forty off from the original number. The original number again was seven thousand, eight hundred forty. The replicated number is seven thousand, eight hundred. Now those

32:16

inconsistencies manifest as noise in the actual playback. So if you wanted to increase the precision of any given scale factor band, you could do so by changing the scale factor number. So in that example just now, I said the number was one point zero, meaning there's no change to that result. But I could have said it was ten, which means we would multiply the quantized number by ten.

32:39

So we would take that seven thousand, eight hundred forty, divide it by one hundred, you get seventy eight point four, then multiply by ten to get seven eight four. So when the decoder decompresses the file, it would reverse this whole thing. It would divide by ten and multiply by a hundred. You would end up getting seven thousand, eight hundred forty again, which means that you wouldn't introduce any noise to the file.
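
The arithmetic in that example can be checked with a few lines of code. This is a minimal sketch of the divide, scale, and round steps exactly as described here, with the same numbers. The function names `encode` and `decode` are made up for illustration, and a real MP3 encoder's quantizer is more involved than this.

```python
from fractions import Fraction  # exact arithmetic, so rounding behaves predictably


def encode(value, quantizer, scale_factor):
    """Divide by the quantizer, apply the band's scale factor, round to a whole number."""
    return round(Fraction(value) / quantizer * scale_factor)


def decode(stored, quantizer, scale_factor):
    """Undo the scale factor, then multiply back by the quantizer."""
    return Fraction(stored) * quantizer / scale_factor


for scale_factor in (1, 10):
    stored = encode(7840, 100, scale_factor)
    rebuilt = decode(stored, 100, scale_factor)
    # Columns: scale factor, stored whole number, decoded value, error (the "noise")
    print(scale_factor, stored, rebuilt, abs(7840 - rebuilt))
```

With a scale factor of one, the stored value is seventy eight and the decoded value lands forty away from the original. With a scale factor of ten, the stored value is seven eight four and the decoder gets the original back exactly, at the cost of storing a bigger number.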

33:00

You would have a perfect representation. But in some cases the encoder may determine that any noise that you generate wouldn't be noticed or it wouldn't impact the quality of the audio enough for it to be a problem because of other factors for that particular scale factor band, like maybe it's really quiet, or maybe it's really complex. So in those cases, you could reduce the scale factor number by making it something else, like point one instead of

33:26

one point oh. So that means you would multiply the quantized number by point one, so the seventy eight point four would become seven point eight four, and then you have to round it to get a whole integer, so you get eight. Seven point eight four rounds up to eight. Now, when a decoder decompresses the audio and multiplies eight by one hundred, that quantizer that we've talked about so much.

33:49

Uh and uh. Actually at this point it would have to be eight thousand because it's also taking into account the scale factor, so it's multiplying it by a thousand, not just a hundred. So you would get a number that would pop up to eight thousand. And remember the original was seven thousand, eight hundred forty. So you look at the difference between these two: the original is seven thousand, eight hundred forty, the new number is eight thousand. There's a pretty

34:12

big difference there. That change might introduce enough noise for it to be a problem. So how does the encoder determine if a scale factor band is meeting the proper criteria? How can it tell if there is, uh, too much noise or if the noise falls below the threshold? Well, it goes through what's called a Huffman coding process. At this point, Dylan is currently just staring at the

34:37

wall and drool is coming out. Huffman coding process. It converts scale factor bands into binary strings, and the process goes through a series of tables to determine if the data within the scale factor band requires more or less precision to describe the sound without affecting the audio quality. So Huffman coding is a process where you start with a large number of possibilities and you begin to narrow it down. Uh, some people describe it as the

35:01

coding equivalent of twenty questions. So you ask your first question, like animal, vegetable, or mineral. You get an answer, say animal. Well, that first answer eliminates a ton of other possibilities and narrows the focus, like anything that doesn't pertain to animal, you can automatically discount because you already

35:20

know it can't apply to that answer. With MP three compression, this means making certain the number of bits representing a granule, because remember I mentioned that in the MP three format you have frames, and each frame has one thousand, one hundred fifty two samples and consists of two granules with five hundred seventy six samples each. So when you answer the first question, it eliminates a lot of other possibilities and narrows the focus.
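To make the twenty questions idea a bit more concrete, here is a minimal Python sketch of textbook Huffman coding. Each bit in a code works like one yes or no question, and values that show up often get answered in fewer questions. This is a sketch under stated assumptions: the function name `huffman_code` and the list of quantized values are made up for illustration, and as far as I understand it a real MP3 encoder picks from predefined Huffman tables rather than building a fresh one like this.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code where frequent symbols get shorter bit strings."""
    counts = Counter(symbols)
    # Each heap entry: (total count, unique tie-breaker, {symbol: code so far})
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Only one distinct symbol: give it a single-bit code
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent groups; one more leading bit tells them apart
        count_a, _, table_a = heapq.heappop(heap)
        count_b, _, table_b = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in table_a.items()}
        merged.update({sym: "1" + code for sym, code in table_b.items()})
        heapq.heappush(heap, (count_a + count_b, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical quantized values for one small stretch of audio
quantized = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 78]
table = huffman_code(quantized)
bits = "".join(table[value] for value in quantized)
print(table)
print(len(bits), "bits in total")
```

In this made-up data, zero is the most common value so it ends up with the shortest code, and the rare seventy eight gets one of the longest.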

35:46

So like with animal, vegetable, mineral, if I say animal, you're not gonna ask any questions that have to do with minerals or vegetables, only because it wouldn't make sense. You know, those aren't gonna apply. Same thing with MP threes, except this time it means making certain the number of bits representing a granule. Remember, there are two granules per frame with the MP three layer. Uh, you want to make sure that the number of bits representing that

36:12

granule matches the chosen bit rate for compression. So if after going through this process, the encoder says, hey, this granule has more bits than what's allowed. It's too many bits. We gotta get rid of some of these, the encoder can adjust the scale factor band so that there's less precision, meaning that multiplier, in other words, that I talked about earlier, and thus reduce the amount

36:35

of data needed to represent that particular granule. If a granule comes in under the bit rate, the encoder can increase the precision to reduce noise and fill that granule out properly so that it matches the actual threshold. After all this, the pairs of granules become frames within the MP three files, and the only other component in an MP three file apart from these frames is the I D three metadata. And

37:04

this is pretty simple. This is like a header and it comes before all the frames in the audio file and contains information about the file itself, which can include stuff like the title of a song, an artist name, an album title, other stuff like that. It can also include copyright information as well as information about the file itself, such as whether or not it's a stereo recording or a

37:25

mono recording. So when you use a decoder like an MP three player, it takes this compressed information, these representations that the music has been reduced to, and it converts that Huffman data back into the quantized format, scales the data back up to its original size or a close approximation. Remember, the uncompressed version may actually be off by a significant amount depending upon each individual granule. And all of that data gets combined into a new

38:01

PCM sample that can be played back to you. And that's all there is to it. Nothing could be easier, all right. That took a lot out of me, so I got really technical, and I apologize if I lost any of you out there, or for those of you who have a lot of experience working on compression algorithms, for oversimplifying in several cases. But now we've got a full episode about this, and I hope you have a better understanding of how a big sound file can be

38:28

reduced to a smaller sound file. Next time, I'll just say magic. It will make everyone happier. If you guys have any questions for me, or comments or suggestions, anything like that, send me a message. My email is tech Stuff at how stuff works dot com, or you can drop me a line on Facebook or Twitter. The handle at both of those is tech Stuff H S W. And I'll talk to you guys again really soon. For more on this and thousands of other topics, visit how stuff works dot com, wh

Transcript source: Provided by creator in RSS feed
