How MP3 Compression Works

Speaker 1

00:04

Get in touch with technology with TechStuff from HowStuffWorks.com. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. In a recent episode I explored how digital audio works and gave kind of a brief history of the MP3 file format. I warned you back then that that was part one of a three-part series, and today we're gonna explore part two. So I hadn't forgotten about it. We're back to it, and today we're gonna do a deeper dive with

00:36

MP3s. How do they compress audio? How can you take a file filled with information and make it a smaller size? What do you have to give up in order to make files smaller? Today we're gonna try to unravel the technical mystery behind the MP3. And I am not going to lie to you people.

00:55

This is gonna get a bit, you know, mathy. And I was an English major, so you mathematicians out there, get ready with your corrections, because I'm probably gonna make some overgeneralizations for the purposes of my own sanity. There does get to be a point where, to really get into the technical details, it would likely be impossible for me to describe it in a way that would

01:21

make sense and be accurate. And I have given my producer Dylan the mandate that, should I get too cryptic and incomprehensible with my explanation, he is to intervene in whatever way he sees fit. Just not in the face, Dylan. Not in the face. It's the moneymaker, man. I gotta take care of it. So let's remember that the heart of digital information is the bit, which is either a zero or a one. The basic unit of

01:56

information for digital formats: zeros and ones. Now we can use those zeros and ones to describe all sorts of information, from text to audio to video, and really pretty much anything you can think of that's represented digitally. Ultimately, when you get down to it, it's a bunch of zeros and ones. So let's say you start off with your uncompressed audio file. You've got this enormous audio file in front of you. It's made up of zeros and ones.

02:24

How do you make that file smaller? So in the real world, we can compress stuff, right, we can apply physical pressure to things. Think about packing a suitcase. You can make sure you get that extra outfit in if you just press it down hard enough and get that zipper zipped before it can burst open. But once you get to a certain level of compression, you cannot make things smaller, at least not without hurting yourself or whatever

02:48

it is you're trying to compress. Digital files are a little different because you cannot physically cram the zeros and ones closer together. It doesn't work like that. These are abstract things. You can't make them smaller, right. You can't decrease the font. It doesn't work that way. The numbers represent two different states. So if you want to create a smaller audio file containing the recording that was in a larger audio file, you have to start getting creative now.

03:17

In the last part of this series, I talked about how the MP3 compression algorithm was born at an applied research institution in Germany, and the team behind the MP3 wanted to find a way to compress audio, specifically music, for transmission over phone lines. Eventually, this evolved into the Moving Picture Experts Group Audio Layer III compression methodology, better known as the MP3, and there are also the

03:44

MPEG-2 and MPEG-4 standards. MPEG-2, by the way, is the basis of compression on DVDs, although the actual DVD format is really a modification of MPEG-2, and MPEG-4 is a compression strategy for audio and video that's frequently used in lots of different capacities, including streaming media services. So by the late nineteen seventies, researchers began to explore the possibility of leveraging psychoacoustics to

04:08

figure out how to compress audio. Psychoacoustics refers to the way we perceive sound, and also the physiological effects of sound on us. So this involves not just our physical sense of hearing, but also our brains and the way our brains interpret sound. So, for example, there's a psychoacoustic phenomenon that's called the Haas effect, H-A-A-S, and I think it's pretty interesting. So

04:35

here's how the Haas effect works. If you hear the exact same sound coming from different directions, but the two sounds arrive within thirty to forty milliseconds of each other, your brain will be convinced that you really only heard one sound and it came from the direction that hit

04:53

you first. So let's say a sound is coming from directly in front of you and also from your left, and you get both of them within that thirty to forty millisecond range, and you hear the one coming from ahead of you first. To you, you're convinced that you only heard that sound once and it came from dead straight ahead of you. Your brain kind of discounts the one that came from the left, although it can reinforce it, which ends up being really useful if you're planning out

05:22

PA systems for stage shows. I'm not joking. That really is the way that people plan those things out. It's pretty neat. Humans perceive sounds in a way that's not necessarily representative of all the sounds surrounding us. You can think of your brain as the filter between your understanding and what reality actually is. A lot of stuff goes on where your brain ends up getting rid of information. It just says, you know what, he or she doesn't need that, it's just gonna confuse things. We're

05:52

gonna dump it. And that's kind of how it works. It's all on an unconscious level. It's not like you're actively working to do this. So let's say you're in a relatively busy hallway, and there could be a lot of sounds in that hallway, stuff that's going on constantly around you. Maybe there are doors opening and closing. Maybe there are footsteps going up and down the hallway. Maybe someone's shoes are squeaking against the linoleum floor. People are chattering

06:17

away in there. But you are having a conversation with someone, so you turn your focus to that person and other sounds seemingly fade away. They're still present, but they're not important. So in this example, you would actually call those other sounds a distraction and you would really focus on the conversation. That also shows how we're able to consciously direct our perception of hearing. So both of these factors

06:43

come into play. Now, one thing that MP3 encoding takes advantage of is something called masking, and there are a couple of different variations of the masking effect. One of them is called frequency masking. So let's say you've got two sound frequencies that are similar. Perhaps they're just a few hertz apart. Remember, frequencies are measured in hertz,

07:04

which is really the number of oscillations per second. So let's say you've got a sound that's at, I don't know, one thousand hertz, and another one that's at one thousand and ten hertz. Now, the human ear is precise enough to be able to tell the difference between two sounds that are at least two hertz apart from each other. That's how precise our resolution of hearing is;

07:33

it's at that level. But if you get two sounds played at the same time and they are that close together in frequency, and one of those frequencies is played at a greater volume than the other, our brains will pick up on the louder sound and ignore the quieter sound, even though both of them are present. What becomes important at that point is the amplitude. Now, the further apart in frequency you get, the less of an effect that has.
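As a rough illustration of the frequency-masking idea just described, here is a toy rule in Python. It is only a sketch: the function name `is_masked` and both thresholds (`max_gap_hz`, `min_level_gap_db`) are made-up values for illustration, not figures from the episode or from any real psychoacoustic model, which would vary the thresholds with pitch.

```python
def is_masked(quiet_hz, quiet_db, loud_hz, loud_db,
              max_gap_hz=50.0, min_level_gap_db=10.0):
    """Toy masking rule with hypothetical thresholds.

    The quieter tone counts as masked only when it is close in frequency
    to the louder tone AND the louder tone is sufficiently louder.
    """
    close_enough = abs(quiet_hz - loud_hz) <= max_gap_hz
    loud_enough = (loud_db - quiet_db) >= min_level_gap_db
    return close_enough and loud_enough


# Two tones ten hertz apart, one much louder: the quiet one is treated as masked.
near_result = is_masked(quiet_hz=1010, quiet_db=40, loud_hz=1000, loud_db=70)

# Same levels, but far apart in pitch: both tones would be heard.
far_result = is_masked(quiet_hz=3000, quiet_db=40, loud_hz=1000, loud_db=70)

print(near_result, far_result)
```

A real encoder would not use fixed numbers like these; the point is only that closeness in frequency and difference in loudness both have to line up before data can be dropped.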

08:00

So if you get far enough apart where they are two pitches, one of them noticeably louder than the other, but they're far enough apart, you will hear both of them. It only works if the two pitches are relatively close together, and there's not a universal formula for frequency masking. As you get closer to the boundaries of human hearing, frequency masking becomes easier. So if it's a really low pitch or a really high pitch, it's easier to get away

08:23

with it. Once you start getting into what is thought of as the sweet spot for human hearing, which is generally considered to be between two and five kilohertz, you need a greater difference in volume or a smaller difference in frequency in order for masking to work. Frequency masking, at any rate. But then there's also temporal masking, and you might say, okay, I got it. Temporal, that means time. Indeed it does, my friend. This describes the effect of a short but loud sound masking a softer

08:56

sound for a short time. The weird thing is the loud sound can actually mask sounds that precede it slightly, not by a whole lot, but a little bit. MP3 compression takes advantage of both frequency and temporal masking when it's trying to determine which data needs to be included and which data can be dumped, because it won't affect your perception of whatever the audio file is in

09:19

the first place. So you also probably remember I talked about the physical limitation to what we humans can hear, no matter what our brains might be up to. So this doesn't have to do with our brains, you know, filtering through the information that's coming in. This has to do with the physical limitations of the human ear. In the last episode of the series, I said typical human hearing.

09:41

Keep in mind, typical; there are exceptions. It covers the range of frequencies between about twenty hertz and twenty kilohertz, or twenty thousand hertz. So twenty to twenty thousand. Higher frequencies represent higher pitches in sound, lower frequencies lower pitches, right? And as you get older, your ability to perceive those higher frequencies starts to diminish. So most adults actually have an upper range closer to sixteen kilohertz, not twenty.

10:11

Kids, they can hear those higher pitches. You may have heard the story about how some convenience stores experimented with getting rid of teenage loiterers by projecting super high pitches that adults could not hear but kids could, and it discouraged kids from hanging out at the convenience store and loitering. I love that idea so much. Anyway, that's because I'm old and my

10:39

hearing is terrible. Well, remember I also mentioned you can detect changes in pitch at two hertz increments. If you get below two hertz of change, like if it's just a one hertz difference between two frequencies, it's too fine a resolution for us to detect. To us, it will sound exactly the same. So if you were to hear a frequency at one thousand one hertz, or one point zero zero one kilohertz, and one at one point zero zero two kilohertz, you wouldn't notice the difference. They would

11:13

sound exactly the same to you. So if you're gonna take audio and compress it, one step you could consider is eliminating anything that's outside the actual range of frequencies that we can hear, or simplifying any changes in frequency that are smaller than two hertz. If you take all that data and you say it is physically impossible for a human to perceive this, get rid of that information, then in theory it wouldn't have any effect on the

11:41

rest of the recording. But how do you go further than that? Right, how do you create a method so that you can really compress this file? You want a method that will preserve the important sounds while potentially ignoring all the unimportant or incidental sounds. And you want it to be automatic, because if you have to do it manually, then that's going to take countless hours just to edit a single sound file. So that was the challenge that the MP3 research

12:11

team faced as a group. Now, their solution, which ultimately created even more challenges, was to come up with what was essentially a simulated human ear and brain. They needed to replicate the experience of perceiving music so that an algorithm could evaluate every sound in an audio file and judge if it in fact was relevant enough to include in the final compressed version. If a sound were imperceptible, then it wouldn't make sense to include it in the

12:41

MP3 file. So by leaving out all the irrelevant data, they can make the audio information take up less bandwidth. The file itself would be smaller because you just dumped everything that wasn't important. So the team used an algorithm called low complexity adaptive transform coding, or LC-ATC, as the foundation for their research. This was kind of their starting point, and this is an approach that tries to do away with redundancy as much

13:10

as possible. And it also incorporates adaptation to perceptual requirements. Also, MP3s owe a lot to the MPEG Layer II standard. So Layer II obviously came out before Layer III, and so a lot of the features of Layer III are really legacy features from Layer II. In other words, the MP3 group kind of got stuck with them because otherwise they would have had a problem

13:36

with backwards compatibility. So the result is kind of a clunky arrangement under the hood, and some of the features may make very little sense when I go through them, but some of that is because it's a holdover from an earlier compression strategy, which isn't terribly satisfying as an answer. But the reason many parts of the MP3 compression algorithm are the way they are is because that's the way we've always done it. So next I'm

14:01

gonna dive into the phases of compression. But before I do that, let's all take a deep breath and take a moment to thank our sponsor. And we're back. So there are two big phases we'll need to talk about with MP3 compression. The first phase is analysis and the second phase is the actual compression itself. And after that there's the process of decoding an MP3 for playback. But that's way simpler once we get an understanding of

14:40

how the encoding process actually happens. So let's begin with analysis. Now, this is the part where the standard has to figure out which frequencies within an audio range, or recording rather, are important or perceptible. So how does a program, an encoder, figure out what we can hear and what we cannot hear? All right, time to get technical. So you start off with your pulse code modulation audio file,

15:10

or PCM file. And you might remember I talked about PCM audio in the first episode of this series, but just in case you don't, it's a lossless digital audio file. The actual format could be a WAV or AIFF or something along those lines, but the important thing to keep in mind is that it is uncompressed. Now, that means those files tend to be pretty big. This is our raw material that we want to take and

15:36

squish down to a more manageable, transferable size. And in our last episode in this series, I also mentioned that the standard for CD audio is a sample rate of forty-four point one kilohertz, and we learned that you need a sample rate at least twice the highest frequency in your recording, and since human hearing tops out at around twenty kilohertz, the standard

15:59

for CDs is forty-four point one kilohertz. The MP3 standard can support lots of different sample rates, but forty-four point one kilohertz is pretty much the common standard. So you've got a number of samples with your audio file, and that number will depend upon how long the audio file is. You've got forty-four thousand one hundred samples per second, actually twice that for stereo, but for the purposes of this discussion, let's kind of stick with mono sound so that I don't start having

16:29

math coming out of my ears. And we're still in the very easy, simple part as far as math goes. We haven't gotten to the complicated stuff yet. All right, so you've got forty-four thousand one hundred samples per second. To compress it into an MP3 format, the algorithm first groups all of these samples into collections called frames. So take those forty-four thousand one hundred per second, and then you start saying, okay, we're gonna group you in batches.

16:56

Each batch is called a frame, and each frame contains one thousand one hundred fifty-two samples. Now that's specifically to maintain backwards compatibility with MPEG Layer II, which established that one thousand one hundred fifty-two number. But we're not talking about MPEG Layer II. We're talking about MPEG Layer III, and that means we have to get a little more complicated. So each frame consists of two subgroups called granules.
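The grouping described so far can be sketched in a few lines of Python. This is a minimal illustration rather than real encoder code: the function names are made up, each granule is simply taken to be half of a frame, and zero-padding the last partial frame is an assumption made here for convenience.

```python
SAMPLE_RATE = 44_100        # samples per second, mono
SAMPLES_PER_FRAME = 1_152   # frame size mentioned in the episode
GRANULES_PER_FRAME = 2      # each frame splits into two granules


def split_into_frames(samples):
    """Group samples into frames; the last frame is zero-padded (an assumption)."""
    frames = []
    for start in range(0, len(samples), SAMPLES_PER_FRAME):
        frame = list(samples[start:start + SAMPLES_PER_FRAME])
        frame += [0] * (SAMPLES_PER_FRAME - len(frame))
        frames.append(frame)
    return frames


def split_into_granules(frame):
    """Cut one frame into equal halves."""
    size = len(frame) // GRANULES_PER_FRAME
    return [frame[i * size:(i + 1) * size] for i in range(GRANULES_PER_FRAME)]


one_second = [0] * SAMPLE_RATE          # one second of silent mono audio
frames = split_into_frames(one_second)
granules = split_into_granules(frames[0])

print(len(frames), len(granules), len(granules[0]))  # 39 2 576
```

One second of mono audio does not divide evenly into frames (44,100 / 1,152 is a little over 38), which is why the sketch ends up with 39 frames, the last one partly padding.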

17:25

So each granule has five hundred seventy-six samples. Five hundred seventy-six times two is one thousand one hundred fifty-two, so five hundred seventy-six samples per granule. Now, technically MP3 encoders only work on one granule at a time, but they may reference the granules immediately before and immediately after the current one in order to see how the audio within the file changes over time. All right, so now you've got

17:49

your granules of five hundred seventy-six samples each. Then the MP3 encoder runs the samples through a filter bank, which sorts the sound into thirty-two frequency ranges. Are you crazy about the numbers yet, Dylan? Are you? Dylan's nodding. Dylan, it gets worse from here. So you have thirty-two frequency ranges, which is another nod to the Layer II method, which used those thirty-two ranges for encoding purposes. But we're not talking about Layer II, really. No,

18:20

we're talking MP3, gosh darn it. That means we take those thirty-two ranges and we subdivide them by a factor of eighteen. That means we have five hundred seventy-six bands of frequencies, each band containing one five-hundred-seventy-sixth of the frequency range of the original sample. So what does that actually mean? This is actually pretty easy. The bands are not limited to a specific number for

18:48

their frequency range, right? It doesn't mean that band number one goes from twenty hertz up to a certain point and band five hundred seventy-six ends at twenty kilohertz. That's not what it means. They're dependent upon the original audio. So if the original audio contains sounds within a narrow range of frequencies, the five hundred seventy-six bands will be more precise. But if the original recording has a vast range of frequencies, the bands are less precise. So another way to think about this

19:21

is with a pizza. So let's say you get an extra-large pizza and you cut it into eight equal slices. And then you get a small pizza and you cut that into eight equal slices. Well, in both cases each slice is one eighth of a pizza. But the extra-large pizza slice is bigger than the small pizza slice. It all depends on the size of the pizza. So in this case, it depends upon the range of frequencies. And Dylan, do you think we could go for some pizza, you know, just

19:53

put the episode on hold and go get pizza? Dylan's nodding. It's great for audio. Yeah, so, pizza. We'll be right back. Okay, that was good pizza. Now, oh man, I've got a whole bunch more notes. Okay, well, let's go ahead and do the rest of this. All right, so you've got your sound divided up into those five hundred seventy-six sub-bands of frequencies, you know, the thing I compared to pizza slices earlier. Now you

20:19

get two different mathematical processes applied to this data. One is the fast Fourier transform, or FFT, and the other is the modified discrete cosine transform, or MDCT. Now I am not going to dive deeply into how these transforms work because frankly, they are beyond my mathematical understanding. I know what they do. I just cannot explain the process of how they do what they do. So I'm going to give you the explanation of what they do, what the outcome of each

20:54

of these transform processes happens to be. But I'm not going to be able to tell you the actual mathematical steps involved in each, because I don't math so good, guys. But let's start with the fast Fourier transform. So a transform is kind of what it sounds like. It's all about transforming information in some way. So in this particular case, the FFT transforms the frequency bands we just talked about into data that can be further analyzed by

21:22

a psychoacoustic model that's in the encoder. So this is that simulated human ear and brain we were talking about earlier. So what the encoder does is it analyzes each bit of data and looks for signs that it represents audio that wouldn't be perceived by a human. So it's

21:41

looking for any potential masking possibilities. So are there collections of frequencies that are grouped close together, and is one of those frequencies louder than the others? If so, you might be able to do away with those softer frequencies because of frequency masking. The encoder will also look at whether or not the audio has a lot of complexity to it, if it has a lot of changes, or if it's

22:03

just relatively steady or simple audio. Any transient sounds that are present in the audio might end up causing temporal masking, so it'll analyze those as well and see if that's a possibility. So really what it's looking for is, you know, just any really loud sounds that stand out above the rest of the recording. That's what the FFT

22:26

is doing. So what about the modified discrete cosine transform? Well, this is happening in parallel with the FFT, and the samples get sorted into different patterns called windows, and the criterion for sorting all has to do with whether the sample represents a steady sound or a varied sound. So a simple steady sound goes into a long window. If there's a lot of variation in the sound, like there are a lot of consonants in a vocal line or it's like a drum solo

22:56

or something like that, it would get sorted into a series of three short windows, and each short window contains one hundred ninety-two samples. That amounts to about four whole milliseconds, so four thousandths of a second, in each of the three windows. So you've got these windows now, either long windows for simple sounds or short windows for the more complex sounds. And then the modified discrete cosine transform kicks into gear.
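The timing here is easy to check with a little arithmetic. The sketch below assumes a sample rate of 44,100 samples per second and assumes each short window is one third of a 576-sample granule (192 samples); the helper name `duration_ms` is made up for this example.

```python
SAMPLE_RATE = 44_100      # samples per second (assumed CD-style rate)
GRANULE_SAMPLES = 576     # one granule
SHORT_WINDOWS = 3         # a granule of busy audio is split into three short windows


def duration_ms(sample_count, sample_rate=SAMPLE_RATE):
    """How many milliseconds of audio a given number of samples covers."""
    return 1000.0 * sample_count / sample_rate


short_window_samples = GRANULE_SAMPLES // SHORT_WINDOWS

print(short_window_samples)                          # 192
print(round(duration_ms(short_window_samples), 2))   # 4.35
print(round(duration_ms(GRANULE_SAMPLES), 2))        # 13.06
```

So "about four milliseconds" is a fair rounding for a short window, and a whole granule covers roughly thirteen milliseconds of sound at this sample rate.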

23:24

It looks at each long window or set of three short windows and converts them into a set of spectral values. To some of you, that probably sounds meaningless. So let's talk about spectral analysis for a second. First, I was very disappointed to learn that spectral analysis doesn't involve a psychologist talking to a ghost about its emotional state, so bummer. But spectral analysis is when you look at a spectrum of information, like a spectrum of frequencies or related information

23:54

like energy states. That's what this transform does. It takes data that originally represents a slice of time in a sound waveform (that's what a sample is; a sample is an instant of time in a waveform) and converts it into information representing sound as energy across a range of frequencies. Now, you can plot out spectral information in a lot of different ways, but one common method is to use brightness to indicate energy levels. Higher energy levels are brighter patches

24:26

in your visual representation of spectral data. High frequencies would appear at the top of a spectral view, like imagine a box, and at the top of the box that's where you would find high frequencies, at the bottom of the box that's where you find low frequencies, and it's

24:41

just lots of patches of color. The really bright patches of color represent very high-energy frequencies, so they could be high or low in actual frequency, but we're talking about energy levels, not whether it's a high or low pitch. Looking left to right represents the passing of time, and looking along any vertical point shows you the actual frequency or pitch, and then the respective energy level is the brightness.
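To make that picture concrete, here is a small Python sketch that builds the numbers behind such a spectral view. It uses a plain discrete Fourier transform on short blocks rather than the MDCT an MP3 encoder actually uses, and the sample rate, block size, and test tones are all made-up values chosen so the result is easy to read. Each column is a slice of time, each row a frequency, and each value is the energy that would be drawn as brightness.

```python
import cmath
import math

SAMPLE_RATE = 8_000   # hypothetical low rate to keep the example small
BLOCK = 64            # samples per time slice (one column of the picture)


def dft_magnitudes(block):
    """Energy per frequency bin for one block, using a naive DFT."""
    n = len(block)
    magnitudes = []
    for k in range(n // 2):
        total = sum(block[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        magnitudes.append(abs(total) / n)
    return magnitudes


def spectrogram(samples, block_size=BLOCK):
    """One column of magnitudes per non-overlapping block of samples."""
    return [dft_magnitudes(samples[s:s + block_size])
            for s in range(0, len(samples) - block_size + 1, block_size)]


def tone(freq_hz, count):
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(count)]


# A low tone followed by a higher tone.
signal = tone(500, 4 * BLOCK) + tone(2_000, 4 * BLOCK)
columns = spectrogram(signal)

# The "brightest" row in each column, converted back to hertz.
peaks = []
for column in columns:
    brightest = max(range(len(column)), key=column.__getitem__)
    peaks.append(brightest * SAMPLE_RATE / BLOCK)

print(peaks)
```

Reading the printed list left to right is the same as reading the picture left to right: the bright patch sits low for the first four slices and then jumps higher when the second tone starts.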

25:09

So it's kind of like looking at sound as a wave, but instead of being a wave, you're looking at information that indicates frequency range and energy level. That representation is actually kind of analogous to how we hear audio. So an encoder can analyze the spectral view and start to

25:25

filter out the data we wouldn't perceive due to psychoacoustics. Now, after all that processing, the encoder looks at the frequency sub-bands and the levels of spectral intensity for each, and that information can then be used for the next phase, which is compression. But right now I think we could all stand a little decompression, so let's take another quick break to thank our sponsor. All right, so now you're ready to compress your analyzed audio. Good for you, and

26:04

by you I mean encoders. This has to be simpler than that analysis segment, right, I mean that got a little crazy with all the different bands and sub bands and windows and frames and granules. Sadly it gets more complicated, all right. So there are two layers of compression going on with MPEG Layer three. One of those layers depends upon the psychoacoustic analysis and the other doesn't. So why

26:34

would you use two layers with different strategies like that? Well, the reason is that one strategy is great for complex audio with lots of components, but not so great with simpler sounds, and the other strategy is kind of the opposite. So the psychoacoustic approach is the one that's really good

26:49

for complicated sounds. If you've got a lot of volume changes, lots of different frequencies, just complicated and rich sound, you've got a lot of opportunity to look for masking and other acoustic elements that limit the actual sounds that people perceive. So it means there are a lot of chances for you to fudge by dropping

27:11

all the stuff that people probably wouldn't notice anyway. And if you take a piece that's got a lot of elements at varying volumes, there are likely several opportunities to do this. But if you're talking about relatively straightforward audio with few components, few changes in volume, there's really not a whole lot of data you can ditch without it actually affecting the quality of the audio in a

27:35

perceptible way. And this is part of what Brandenburg, that guy I was talking about in our first episode in this series, discovered when he was working with the MP3 standard and he was listening back to that Suzanne Vega a cappella track, Tom's Diner. He was listening to a compressed version of it, and he said it was terrible. He said it ruined the quality of the audio. And part of that is because that

28:01

particular song is fairly simple. There's just not a lot of opportunity to take advantage of masking and other tricks without potentially compromising the quality. So they decided to also incorporate some traditional compression strategies, which work better with

28:17

those types of recordings. So the MP3 format takes advantage of both the traditional approach and the psychoacoustic approach, and that allows the encoder to compress files into a smaller size without just following a single strategy. It doesn't have to do a one-size-fits-all for all elements of audio. Now, combining those two strategies requires a little more mathematical gymnastics. So let's go back to those five hundred seventy-six frequency bins. You know, those sub-bands

28:47

we talked about earlier. You've got to quantize those suckers. What does that mean? It means assigning a quantity to each frequency bin. You have to give it a quantity of some sort so that you can end up judging how much data you can get away with dropping. So to do this, the encoder sorts those five hundred seventy-six bins into twenty-two scale factor bands. How you doing over there, Dylan? Just checking in on you. Okay, Dylan's got a thousand-yard stare going. I hope

29:20

you guys are doing okay over there? All right, So before smoke starts coming out of your ears, let me explain what the scale factor bands are all about. The whole purpose of the scale factor bands is to determine how the information will be stored within the compressed state. So you want to get away with as little data as possible before affecting sound quality. So if you can say the same thing in a shorter space without affecting the quality of what it is you're saying, you go

29:49

with it. Brevity is the soul of compression. So if we were talking about language, I would say it's more efficient to say it's raining outside, or even just it's raining, because you would assume that it would be outside where the rain is happening, and it would be inefficient for me to say it's coming down like cats and dogs out there. It's not as efficient as saying it's raining.

30:16

So if you can get away with shorter statements without affecting the actual quality, and you could argue that switching from it's coming down like cats and dogs out there to it's raining changes the quality, and that could be a valid argument. But if you can get away

30:33

with shorter without affecting quality, you do it. So each scale factor band is represented by a quantity. Then the encoder divides that quantity by a given number called the quantizer, which is the same across the entire frequency spectrum for that recording. The resulting number is then rounded up or down to a whole number. And here's an important point. Individual scale factor bands can be scaled up or down for more or less precision to represent the actual value

31:08

of those bands. So what the heck does all that mean? Well, the purpose of dividing and rounding is just to simplify the data to reduce the amount you need in order to store the information. So let's go with a totally

31:20

hypothetical example. Let's say you've got a scale factor band and you've decided you're representing that scale factor band with the quantity seven-eight-four-zero, seven thousand, eight hundred forty, and you've chosen the number one hundred to quantize your data, meaning that you will divide each scale factor band's quantity by one hundred. So this is seven thousand, eight hundred forty. You divide it by one hundred. And the scale factor for this particular band, you have determined,

31:52

is one point zero. That means that once you get that result where you've divided the quantity by the quantizer, you multiply by one. That means there's no change. Multiply by one, you get the same number. More on that in a bit. Okay, so you take that seven thousand, eight hundred forty and you divide it by one hundred. That gives you seventy-eight point four. Well, now you have to

32:14

round that number, so you round it down to seventy-eight. Now, when you have a decoder and you're ready to play back the information, it comes across this quantity, the seventy-eight, and it knows what the quantizer number was, so it multiplies by one hundred to get back to seven thousand, eight hundred. So the replicated number is actually forty off from the original number. The original number, again, was seven thousand,

32:38

eight hundred forty, the replicated number is seven thousand, eight hundred. Now, those inconsistencies manifest as noise in the actual playback. So if you wanted to increase the precision of any given scale factor band, you could do so by changing the scale factor number. So in that example, just now, I said the number was one point zero, meaning there's no change to that result. But I could have said it was ten, which means we would multiply the quantized number

33:05

by ten. So we would take that seven thousand, eight hundred forty, divide by one hundred to get seventy-eight point four, then multiply by ten to get seven hundred eighty-four. So when the decoder decompresses the file, it would reverse this whole thing: it would divide by that scale factor of ten and multiply by a hundred. You would end up getting seven thousand, eight hundred forty again, which means that you wouldn't introduce any noise

33:27

to the file. You would have a perfect representation. But in some cases, the encoder may determine that any noise that you generate wouldn't be noticed or it wouldn't impact the quality of the audio enough for it to be a problem because of other factors for that particular scale factor band, like maybe it's really quiet, or maybe it's really complex. So in those cases, you could reduce the scale factor number by making it something else like point

33:52

one instead of one point oh. So that means you would multiply the quantized number by point one, so the seventy eight point four would become seven point eight four, and then you have to round it to get a whole integer, so you get eight. Seven point eight four rounds up to eight. Now, when a decoder decompresses

34:09

the audio, it multiplies eight by one hundred, that quantizer that we've talked about so much, uh, and actually at this point the result would have to be eight thousand, because it's also taking into account the scale factor, so it's multiplying it by a thousand, not just a hundred. So you would get a number that would pop up to eight thousand. And remember, the original was seven thousand, eight

34:32

hundred forty. So you look at the difference between these two: the original is seven thousand, eight hundred forty, the new number is eight thousand. There's a pretty big difference there. That change might introduce enough noise for it to be a problem. So how does the encoder determine if a scale factor band is meeting the proper criteria? How can it tell if there is, uh, too much noise or if the noise falls below the threshold? Well, it goes through what

34:56

is called a Huffman coding process. At this point, Dylan is currently just staring at the wall and drool is coming out. Huffman coding process: it converts scale factor bands into binary strings, and the process goes through a series of tables to determine if the data within the scale factor band requires more or less precision to describe the sound without affecting the audio quality. So, Huffman coding is

35:22

a process where you start with a large number of possibilities and you begin to narrow it down. Uh, some people describe it as the coding equivalent of twenty questions. So you ask your first question, like animal, vegetable or mineral. You get an answer, so: animal. Well, that first answer eliminates a ton of other possibilities and narrows the focus. Like, anything that doesn't pertain to animal you can automatically discount, because you already know it can't apply to that answer.
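
To make the twenty questions idea a little more concrete, here is a minimal Python sketch of Huffman coding in general: values that show up often get short bit strings, rare values get longer ones, and each bit works like a yes-or-no answer that rules out part of the table. The function names and the sample values are made up for illustration. As far as I know, MP3 encoders pick from predefined Huffman tables rather than building a tree from scratch like this, so treat it as a picture of the idea, not of what an encoder actually does.

```python
import heapq
from collections import Counter


def huffman_codes(values):
    """Return {value: bit string}; frequent values get shorter strings."""
    counts = Counter(values)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct value still needs one bit to be written down.
        return {next(iter(counts)): "0"}
    # Heap entries are (count, tie_breaker, {value: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(n, i, {v: ""}) for i, (v, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n0, _, left = heapq.heappop(heap)   # rarest group
        n1, _, right = heapq.heappop(heap)  # next rarest group
        merged = {v: "0" + c for v, c in left.items()}
        merged.update({v: "1" + c for v, c in right.items()})
        heapq.heappush(heap, (n0 + n1, tie, merged))
        tie += 1
    return heap[0][2]


def encode(values, codes):
    """Concatenate the bit string for each value."""
    return "".join(codes[v] for v in values)


def decode(bits, codes):
    """Read bits one at a time until they match a code, then start over."""
    lookup = {c: v for v, c in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in lookup:
            out.append(lookup[current])
            current = ""
    return out


# Hypothetical quantized values: lots of zeros, a few larger numbers.
samples = [0, 0, 0, 0, 0, 0, 1, 1, 1, 78, 0, 0, 1, 2, 0, 0]
codes = huffman_codes(samples)
bits = encode(samples, codes)
print(codes)
print(len(bits), "bits instead of", len(samples) * 8)
```

The zeros, which are the most common value here, end up with a one-bit code, so the whole list fits in far fewer bits than a fixed eight bits per value would need, and decoding gets the exact same list back.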

35:51

With MP three compression, this means making certain the number of bits representing a granule... because remember, I mentioned that in the MP three format you have frames, and each frame has a thousand, one hundred fifty two samples and consists of two granules with five hundred seventy six samples each. So when you answer the first question, it eliminates a lot

36:11

of other possibilities and narrows the focus. So like with animal, vegetable, mineral, if I say animal, you're not gonna ask any questions that have to do with minerals or vegetables, because it wouldn't make sense. You know, those aren't gonna apply. Same thing with MP threes, except this time it means making certain the number of bits representing a granule.
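
Since precision is what gets traded against bits, it may help to see the quantizer and scale factor arithmetic from a few minutes ago in one place. This is a minimal Python sketch of the simplified math used in this episode (value seven thousand, eight hundred forty, quantizer one hundred, scale factors of one, ten, and point one). The function names are hypothetical, and real MP3 encoders use a more involved quantization formula, so this only mirrors the worked example.

```python
def quantize(value, quantizer, scale_factor=1.0):
    """Encoder side: divide by the quantizer, apply the scale factor, round."""
    # Python's round() sends exact .5 ties to the even neighbor; none of the
    # numbers in this example land on a tie.
    return round(value * scale_factor / quantizer)


def dequantize(stored, quantizer, scale_factor=1.0):
    """Decoder side: undo the scale factor, then multiply by the quantizer."""
    return round(stored * quantizer / scale_factor)


def round_trip(value, quantizer, scale_factor):
    """Return (stored integer, rebuilt value, size of the error)."""
    stored = quantize(value, quantizer, scale_factor)
    rebuilt = dequantize(stored, quantizer, scale_factor)
    return stored, rebuilt, abs(value - rebuilt)


original = 7840
quantizer = 100
for scale_factor in (1.0, 10, 0.1):
    stored, rebuilt, error = round_trip(original, quantizer, scale_factor)
    print(f"scale factor {scale_factor}: stored {stored}, "
          f"rebuilt {rebuilt}, off by {error}")
```

With a scale factor of one the rebuilt value is off by forty, with ten it comes back exactly, and with point one it is off by one hundred sixty. The bigger the stored number, the more bits it takes to write down, and that is the trade the encoder is making.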

36:31

Remember, there are two granules per frame with the MP three layer. Uh, you want to make sure that the number of bits representing that granule matches the chosen bit rate for the compression. So if, after going through this process, the encoder says, hey, this granule has more bits than what's allowed, it's too

36:48

many bits, we gotta get rid of some of these, the encoder can adjust the scale factor band so that there's less precision, meaning that multiplier, in other words, that I talked about earlier, and thus reduce the amount of data needed to represent that particular granule. If a granule comes in under the bit rate, the encoder can increase the precision to reduce noise and fill that granule

37:15

out properly so it matches the actual threshold. After all this, the pairs of granules become frames within the MP three file. And the only other component in an MP three file apart from these frames is the ID three metadata. This is pretty simple. This is like a header, and it comes before all the frames in the audio file and contains information about the file itself, which can include stuff like the title of a song, an artist name,

37:42

an album title, other stuff like that. It can also include copyright information as well as information about the file itself, such as whether or not it's a stereo recording or a mono recording. So when you use a decoder like an MP three player, it takes this compressed information, these representations that the music has been reduced to, and it converts that Huffman data back into the quantized format, scales the data back up to its original size or

38:14

close approximation. Remember, the decompressed version may actually be off by a significant amount depending upon each individual granule. And all of that data gets recombined into a new PCM sample that can be played back to you. And that's all there is to it. Nothing could be easier. All right, that took a lot out of me, so I got really technical, and I apologize if I lost any of you out there, or for those of you who have a lot of experience working on compression algorithms,

38:46

for oversimplifying in several cases. But now we've got a full episode about this, and I hope you have a better understanding of how a big sound file can be reduced to a smaller sound file. Next time, I'll just say magic. It will make everyone happier. But I hope you guys appreciated this. In the next episode in this series it will be far less technical. I'm going to

39:09

be more historical. I'm going to talk about the progression of the MP three player, how it came about, how it evolved, and how the iPod ended up becoming the dominant brand in a sea of MP three players, and then maybe kind of explore where MP three players are today, like how many are there, how big is the market? Are people still buying them? That kind of question. If you guys have any questions for me, or comments

39:37

or suggestions, anything like that, send me a message. My email is tech Stuff at how stuff works dot com, or you can drop me a line on Facebook or Twitter. The handle at both of those is tech stuff H S W, and I'll talk to you guys again really soon. For more on this and thousands of other topics, visit how stuff works dot com.

Transcript source: Provided by creator in RSS feed
