Compression - podcast episode cover

Compression

Oct 23, 202346 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

Matt and Ben talk about how compression works, specifically deflate, which is apparently everywhere. Ben gets particular about compression ratios. Matt explains how to compress /dev/random by sorting it first.

Transcript

Matt Godbolt

Hey, Ben.

Ben Rady

Hey Matt.

Matt Godbolt

What are we gonna talk about today?

Ben Rady

Uh, I think we're gonna talk about compression today.

Matt Godbolt

So like squishing things?

Ben Rady

Yeah. Squishing, squishing bits, takings taking bits and squishing them. So like, which, you know, if you

Matt Godbolt

Squish a zero, like flat, it becomes a one.

Ben Rady

Yeah. Right.

Matt Godbolt

That kind of thing. I see.

Ben Rady

That's one way, unless you squish it down, in which case it becomes an underscore.

Matt Godbolt

Oh, or a minus.

Ben Rady

Well, if you gotta squish it from both sides to make a minus.

Matt Godbolt

Oh, That's true.

Ben Rady

These are the compression algorithms that we need.

Matt Godbolt

These are the compression algorithms that we're here to talk about today. Which one of those is z Lib

Ben Rady

Yeah. Let's talk about that

Matt Godbolt

What do you wanna talk about? I mean, like, what, what is, what even are a compression?

Ben Rady

? Uh, so yeah. Aside from turning, you know, zeros into other various characters on your keyboard,

Matt Godbolt

Which is not really what they do. Just to be clear,

Ben Rady

Um, yeah. I mean, you know, quite often it, it makes sense to trade, uh, a little bit of extra CPU time in order to get a little bit less storage space or basically just bytes because you're transferring something over a network or whatever it might be.

Matt Godbolt

Right.

Ben Rady

Lots of other contexts. Um, and so you want to take a bunch of data and shrink it down into something smaller. Uh, and there are many algorithms for this and, uh, we're gonna talk about them today. But I think the first thing we should talk about are sort of the different trade-offs because it's, it's like, well, why wouldn't I wanna make my data smaller?

Matt Godbolt

Right.

Ben Rady

And, and how do I wanna make my data smaller? And other considerations that you might have when choosing one of these algorithms.

Matt Godbolt

I mean, so obviously the first thing that springs to mind is how good is the compression algorithm at squishing my data in the first place? Like, given a input file, how much smaller can it be after it's been compressed? And that's really easy and objective to measure. You know, you take your a hundred K file and you apply your compressor to it, and you like go, Hey, it's 200 bytes now. That's awesome. Fantastic. And this one makes it 190 bytes. You're like, that's obviously better. Right. Let's just use the 190 byte one. Right. So that's the first thing is the compression ratio.

Ben Rady

Yeah. And people do tend to talk about it as a ratio. Right. Or, or sometimes a percentage. Right. You'll get like, you know, 90% compression, which the way I interpret 90% compression is you took something that, uh, it's sort of, it's a little, I feel like it's a little weird when you, when you say that, 'cause it's like I took a hundred bytes and I compressed it down to 10 bytes.

Matt Godbolt

Right? I saved 90% of the original

Ben Rady

Size. I that's 90% of the original, right? Yeah. Which, you know, it's like, so I got 10% was the result

Matt Godbolt

Oh yeah. That's,

Ben Rady

But that sort of breaks that model too, right? So I think it's really important when you're talking about, and I, this is true all the time when you're talking about percentages and ratios and things like that, where it's like, it's kind of easy to lose track of what the numerator and the denominator were. Um, so I feel like when, when, when you're discussing these things, it never hurts to sort of be explicit,

Matt Godbolt

Right? So when we're talking about compression ratios, well, I dunno that we'll have any concrete things , but then Yeah. In, in that, in, in that specification, 90% would mean it is 90% smaller. Means that Yeah. 90% of the original size went away, and you're left with a file that is 10% of the original, right? . That makes, that makes sense. So , obviously, naively you want to pick the thing that compresses your input to be the smallest output, right?

Ben Rady

Right.

Matt Godbolt

But there are other trade-offs, obviously, right.

Ben Rady

Well, there's compression how long it takes to do the compression.

Matt Godbolt

Right. Uh,

Ben Rady

And there's also how long it takes to decompress it.

Matt Godbolt

Right? Right.

Ben Rady

Uh, and for certain algorithms there's how lossy it is if it is lossy.

Matt Godbolt

Right. Right. That's a good point. I, I suppose, yeah, the, the, the idea that we could talk about loss, e compression as well might be too much for one podcast.

Ben Rady

Yeah. I mean, at least asking the question is always a good one, right? Like, is this actually lossless compression? 'cause not every compression algorithm is,

Matt Godbolt

Can I assume that when I decompress it, I get the exact same bytes that I put into the compressor? That seems , it's kind of the definition of lossless, uh, versus image compression and stuff, which actually loses some of the data and you don't get it quite back as we've all seen from

Ben Rady

That audio video can all be a little lossy. Yeah. Yeah,

Matt Godbolt

Yeah. Yeah. . So obviously compression time and decompression time. Uh, another thing that I think's an interesting one to consider is how, how much code it takes to decompress your input. Because kind of an implied input to your, uh, decompression data is the thing that can decompress it now, obviously in, in on a Unix system or whatever, you know, you've got gunzip and it's like, okay, there's five, you know, 500, uh, bytes of, sorry, 500 K of, of executable somewhere. You don't have to worry about it because everybody has gunzip somewhere. And so you can kind of like, oh, all right, we can decompress this. But like, if you are working on an embedded system, which has got limited resources, which is one of the main reasons you might want to do compression, like, Hey, we've got an SSD and we need to read things off of the SSD and decompress them into ram to run them or to do whatever, then you realize that you actually have to budget for the decompression algorithm and the amount of code space that it takes up. And if you have like a, you know, a compression algorithm, which is like, Hey, I have a lookup table, and the lookup table is 200 megabytes of like values, canned values, then, and you just store in your compressed version, you're like, well, it's just number seven in that table.

Ben Rady

Right, right. Yeah. That's a great consideration. Actually, hadn't really thought that.

Matt Godbolt

I mean, at it's top of mind just because of like looking at, uh, uh, compression like old eight bit machines that, that I'm looking at like, oh, this is cool. Look how, how much you can compress. But unfortunately the decoder is like five K and that's too big

Ben Rady

Yeah. Yeah. Yeah. And I mean, I guess part and parcel to that is kind of like, uh, fonts also have this property is like there's a certain percentage of these things are gonna be available in any environment that you're delivering the data to, because again , a lot of times when you're compressing stuff, it's in order to send it somewhere. And the question is, can the person on the other side actually, uh, uncompress it? Right. And so depending on the exact transport mechanism and the clients involved and a bunch of other factors, you may or may not have access to those algorithms and thinking about which ones are gonna be more ubiquitous and which ones are gonna be like, oh, actually we need to first send the decompression library and then we can send Yeah. The, uh, data itself. Yeah.

Matt Godbolt

Actual data, right?

Ben Rady

Yeah. Is, is, uh, a sort of another dimension of that in a way.

Matt Godbolt

And I mean, if we talk about ubiquitous, I mean, the most ubiquitous compression algorithm that I'm aware of would be deflate, which is the algorithm that powers zip files and gz and every web, um, transaction you've ever done. Um , although I guess Z standard has now taken that over some, but yeah. Alright. Most

Ben Rady

Okay. Yeah.

Matt Godbolt

So one of the things that I used to do a million years ago on eight bit machines was, um, you would observe that very often you'd have sequences of bites that repeat. So you've got some, usually like a picture, an image, or a Sprite. In this instance, you'd be like, well, okay, it's the, it's the value zero 80 times followed by a one and a seven and another three zeros, and then another a hundred zeros, and then whatever. That doesn't make sense, but whatever, you know that, so even just eyeballing in a hexed editor, you can see that most of it is zeros. And so one might just go a simple, um, compression algorithm would be, okay if I see a zero count how many more zeros there are , and then when the, the, the de the, the compressed output is zero, followed by how many zeros there were.

And, uh, that's great. If zero is the only thing you want to do compression for, and this is like run length encoding as is RLE, um, where we're saying, Hey, I'm gonna look at how long the run of this particular single number is , and I'm just gonna repeat it now. That's cool. It makes sense. The decoder is easy. You read bites and every time and you just copy them. But every time you see a zero, you read the next byte and you write out that many zeros straightforward. Right. And it's pretty effective. And obviously you can imagine that the decoder is trivial. The encoder is pretty trivial as well. But the first and obvious thing you might notice is, well, what if there's only one zero? Right now you are sending zero followed by a one.

So you made it bigger, Right, and that's twice as big

And then you can try to r e compress 512 bytes, and if it gets bigger than 512 bytes, you go, ah, let's just store it as is. And then you just put the flag that says this is just exactly as, um, it's just byte for byte, copy the next 512 bytes. And of course you can generalize that, but you know, in the worst case, now what we have is every 512 bytes, we add one byte that says this isn't compressed. And now we still made it bigger, but hopefully not as bad as it would be if every byte was, was there. And that's another reason. So the sort of metadata about the data that explains how the compressor is configured or what how to decompress the next bit can take up more space than if it was just, uh, raw data. . So that's run length encoding, that's one of the easier ones. What else could we do to compress data?

Ben Rady

Uh, well, that's a good question. I don't have an answer to that.

Matt Godbolt

What is the most common letter in the English language?

Ben Rady

Uh, e

Matt Godbolt

RE?

Ben Rady

Uh, well, normally it's, um, one byte.

Matt Godbolt

One byte

Ben Rady

Right? Yeah.

Matt Godbolt

But we know that E is more common, the most if we're compressing, let's say text. E is more common than almost every other letter. , uh, well, is the most common letter, sorry. Um, so what if we could say, let's forget about bytes. Let's just treat this as a stream of bits. What if I said I could encode an E with only two bits, like maybe let's just say zero, and then one is an E as in the binary zero one , and then every other letter for the sake of ease over the air, and not without a

Ben Rady

Right?

Matt Godbolt

Right. So this is a kind of like, you know, you think of it like the rle in terms of we're trading off something to encode something which is hopefully more common or is more compressed, but we're, but we've made it slightly more pessimistic in another case. But, um, now if you imagine like a third of all of your text is an E, you've dropped that third down to two bits, and everything else is nine bits. Now I can't do the math in my head, it's terrible to do. But, but, um, there is a way to construct a optimal tree of bits, ones, and zeros that give each character that you could possibly have inside your output a variable number of bits based on how common it is or not. So while an E may be one zero, um, I can't remember what I just said, zero one, whatever. And now maybe the next most common letter is, I dunno, t that might be 1 0 1, and that's unambiguously a prefix. That only means it's T and it's three bits now . And then you can start building a tree with all of the other things. And until you get to like, you know, the Q and the Z take maybe even now 12 bits to, to unambiguously encode. Um, but now, you know, effectively we are storing the letters with codes that are inversely...inversely proportional in length to how common they are,

And now are we, uh, the, the, the document we're storing, provided it's an English document, um, and fits this sort of statistical, um, arrangement of probabilities of how likely each letter is. Now we can store it in considerably less, uh, bits. I mean, it, it may make sense, you know, like, you know, like that's, there's, there's a reason why the, the, the phone number for emergency services is Right. Just 1, 1 2 or 9, 9, 9.

Ben Rady

9, 1, 1.

Matt Godbolt

What is it? Whatever. Oh, yeah. That's what it is. What's the number?

Ben Rady

You should probably know what that is. If you live in the United States,

Matt Godbolt

Okay. I'm gonna write that down. It's like, yeah. Uh, yeah. So I was gonna say that, that like, phone numbers are, have like this sort of like feeling of like, there aren't ambiguous, you know, 9 1 1. Doesn't code, that's not the beginning of someone else's number. You never accidentally code it, but that means that anyone else starts with like a zero or a one or something like that, which is like a reason why. So you were gonna say something?

Ben Rady

Oh, I was just gonna say, so one of the things that I've heard about, uh, frequently when talking about different compression algorithms is using a dictionary, uh, as a, as a compression mechanism . And one of the things that I've heard about that is that you do, and, and I don't know if this is universally true, but it's something to at least that I've considered in the past, is you, you either need to do this thing where you're saying like, okay, well let's assume that we're gonna be compressing English words. Like, okay, that's a big assumption, but sure. For certain applications that makes sense. In that case, you can kind of define some of these things up front, including something like a dictionary, right? So you can make just a dictionary of words, right. And you can have each, uh, byte sequence refer to whole words instead of individual letters.

And that would be another way, potentially, uh, to do this. And, you know, there's like, you know, somewhere in the order of tens of thousands of words in the English language, so your dictionary wouldn't be all that big . And, you know, maybe you get good compression outta that. Alternatively, you could build the dictionary after you've run through the data and found the most frequently occurring patterns or words or sequences of bytes or whatever you might find. But the disadvantage of that is that now you sort of have to run through everything before you can build that dictionary, or you need to find, have an algorithm that'll build it on the fly or something like that. Um, and so that seems like a whole category of things where you're sort of trading off between either knowing what kind of data you're processing beforehand or doing some things post hoc after you've run through the data that might make it a little bit more difficult to work with if you were, for example, trying to do compression, um, as is like a stream, like you, you, you don't ever wanna have to seek through the data multiple times or use a whole bunch of memory while you're compressing it. And so you have these, these trade-offs that you need to make when you're, when you're doing that. Um, are there algorithms that you've, compression algorithms that you've worked with before that sort of make these trade-offs in different ways?

Matt Godbolt

Well, absolutely. So, um, we're actually slowly walking towards what deflate is going to be, which is a cool, cool thing, you know, and, and totally unintentional, which is great. Uh, but, um, so just on that one thing you mentioned about like, um, if it was English words and the probabilities, and the thing as I described, obviously if you knew the compressor and the decompressor both knew it was English words and they'd previously agreed on the frequency of letters, sorry, letters, I'm talk about the letters thing first. They've previously agreed on the frequency of letters, then they would both already know that E is zero one and, and P is whatever, 1 0 0 1 or whatever like that. But if they don't know that, or if you want the, as you say, um, have a sort of dynamic idea about, well, I ran through the data and I counted how many of each letter there were

and now I'm gonna build my, my, my tree of like how many bits, code for which letter uniquely for every time, every piece of data, then you need to actually transmit that tree to the other end so that the other end can say, okay, this is, we agree on the code book of like how many bits mean which letter. So that's a downside with, um, uh, dynamically sending that information across. 'cause the, the tree is not free , it takes up some space, right? , so already . So you have to offset the compression you get from the tree with the fact that you have to, to send the tree at all. One of the cooler tricks is like exactly as you said, is what if you instead of treated each byte individually, you also said, well, dictionary elements themselves are just other symbols. So now we're just, instead of compressing character at a time, bite at a time, we'll just call them symbols. And let's say the first 255 symbols are the literal value zero to 255 of the bite, zero to 255 .

Ben Rady

Okay.

Matt Godbolt

And then let's say symbol 256 is the first of my dictionary. Mm-hmm.

Ben Rady

,

Matt Godbolt

Right? And it just means the letter, I dunno the word cat, right. And the 257 means dog and whatever. Right. And you, again, if you could previously agree this on both sides, um, the tree building step that we talked about, where depending on the relative probability or uh, number of occurrences of each of the symbols that you see, doesn't care how many bits long those symbols are. Right? You can have a billion symbols in that tree if you wanted. And it will still be the case that the most commonly occurring ones will have the shortest bit encoding and the least frequently occurring ones will have the longest one. And obviously as you put more and more things in, in the limiting case, that can be a very, very long, uh, prefix. So to the point where you're like, well, this is longer than the word that it's coding for in the dictionary, let's not bother with this anymore.

We might as well just encode it using smaller, uh, words. So a way of naturally extending the table of, um, of characters and their prefixes. It's just to throw in words as well, or pre previously agreed dictionary components so that you know, hey, this is this, this pattern appears in my data more often than not. So, um, when you see 1 0 1 1, that actually means the entire string cats are cool. Right. Because that happens in my, my application a lot. Right? Right. That's the kind of thing you could do. And um, but there's another sort of aspect which is, um, now I have to either transmit that dictionary. Boo!

Ben Rady

Yes.

Matt Godbolt

Or we have to previously agree on it again. And now we're back into the world of like, oh my golly, uh, how big is this dictionary

Ben Rady

Right? Right. And if you do transmit it, you have to transmit it it after you've examined the data. Correct. Which means if you're, and maybe I'm wrong about this, but if you're decompressing on the other side, you can't really decompress then in a stream because you don't get the dictionary until the very end.

Matt Godbolt

Well, usually.

Ben Rady

Or is that not true?

Matt Godbolt

Just, you'd have to get the dictionary up ahead, uh, because the compressor needs the dictionary as well, if you were to do it that way. Right. Because as you're going through, you can you, if you are, I'm assuming that you have a static tree that you've built that is like, here are all the probabilities, and they therefore don't change. But obviously we can talk about how you can actually update them. And that sort of leads us naturally to the next thing, which is what if you didn't transmit the dictionary at all?

Ben Rady

Okay. Yeah.

Matt Godbolt

'cause if I've, if I've just transmitted the letters C, A and T, because that's the first three letters of a document that I'm, I'm sending , I have actually also just transmitted the letter, the, the word cat , right? It, it took three bytes, or it took however many the relative probabilities of the C, the A and the T were to to transmit, uh, to transmit. But now I effectively have a dictionary record for the, for the word cat. It is three bites back from where we're currently decompressing. If you've just wr if you're the decompressor and you've just written out the C, the A and the T 'cause you've decoded them. . So what, instead of sending an explicit dictionary of things that I've previously agreed, the dictionary entries themselves are effectively references back from earlier in the document.

Ben Rady

Hmm. Interesting.

Matt Godbolt

And so now I, anytime I want to say the word cat, I can say, oh, well, depending on where you are in the output, go back however many bites to get to the, the, the C and copy three bytes, C and A and a t . And so now I've effectively got a dynamic dictionary that, um, allows me to refer back to anything I've previously transmitted. And so suddenly I, if I've transmitted the word cat, I can say cat over and over and over again with just one symbol. And that's pretty cool.

Even cooler is what if I had the letter a 700 times in a row? Do you remember in our original description, we were talking, well, you know, you have run length in coding. I could transmit something that says, Hey, there's an A followed by. There's a ton of them. All right. So, you know , just copy that A 600 and, sorry, you know, there's 700 of these A's in a row . But if I say, um, transmit the code here is an A, so the decompressor is written out an A and now I say, Hey, there's a dictionary element I would like you to use it is go back one character and copy 699 times.

Ben Rady

Yeah. Okay.

Matt Godbolt

So what I've just done is I'm, if you can visualize it, I'm reading the A immediately that I've just decompressed and I'm copying it to the current location, and then the two pointers are gonna walk along, and so that A is gonna get smeared out 700 times . And so this dictionary compression that we've just talked about, where I can back reference to a parti, a previous part of the, the document gives us rle for free, that is run length encoding. I can just take an A and then I transmit the dictionary entry that says, go back one copy seven, 600 . And I've got my run length encoding.

Ben Rady

So the, the size of that, uh, amount of data that refers to how far back in the stream you go. Right. Whether it's four bytes, eight bytes, 16 bytes, however big it is. Am I correct in assuming that that is the amount of previous data that you need to hold in memory in order to decompress the, the data as you stream through it?

Matt Godbolt

Exactly. Right. This is referred to in, uh, as a sliding window because it's the, the last

M bytes usually something like 64 K that you, even if you are a streaming decompressor and you're passing the bytes onto some other thing, you know, you're printing them directly to the screen, whatever you're doing , because of this, you still have to hang onto the 64 K of prior data in RAM so that when the next byte comes from the, the, the wire that says, Hey, no copy back 32 K, uh, and you know, it's 27 bytes from there, you need to be able to get access to that still. So it obviously it does. Um, and I guess this is something we didn't talk about, like the runtime memory requirements of

Ben Rady

Right.

Matt Godbolt

The decompressor, right. You know, you have to keep something around in order to, to to, to to, to be able to decompress.

Ben Rady

Yeah, that's right. Yeah. That is another consideration that we forgot to consider.

Matt Godbolt

Forgot. Yeah. An unconsidered consideration.

Ben Rady

Right? Yeah.

Matt Godbolt

So anyway, we, so what we've got so far is we have, uh, a tree that lets us encode symbols. And symbols are not just bytes. They could be dictionary elements. And, um, some of those, and those dictionary elements are actually now phrased as go back N and copy M bytes. Um, now this is glossing over this, the actual specifics of deflate, which actually has different trees for different things or whatever. But like, let's just stick with this for the simp quote, simplicity, um, of, of trying to explain it to somebody who's walking their dog at the moment, or is is pottering around the house.

On the train. Exactly. Yeah. So, um, you mentioned, and a, a number of times, you know, having to do multiple passes and the, the dealing of, of streaming and things like that.

Ben Rady

On the train?

Matt Godbolt

Unfortunately that is something that the compressor is forced to do in order to be able to make these determinations. What it's likely, and, and there, although there are algorithms for online building of the trees and whatever, um, typically what happens is that the compressor chunks, its input data into sort of reasonable sizes, and then the compressor normally tries a few different types of algorithms, uh, exactly as I described right at the beginning with the rle, you know, like it goes, Hey, let's try doing it with, uh, just a static tree. Let's try it with a dynamic tree. Let's try it with back references. Let's just, uh, and, and each time it says, you know, does this, does this work out as being smaller? And it would pick the one, the block that is the, the, the smallest, um, o of them all.

And ultimately there is a didn't compress type block, you know, storing it only, which only has a couple of bits at the beginning that says, oh, sugar, I couldn't actually compress this. And then here's just the literal data. And so that is the hedge for like the un compressibility and it bounds the maximum, uh, size of the, of the compressed thing to be, you know, only a tiny percentage bigger than the input in the worst possible case. So having done this, it's now run through, it's, it, it, it sort of encodes, uh, the dictionary based compression. It's trying, and, and that is for what it's worth a very difficult problem, right. It has to kind of search through the space of like, should I go back two bites and copy one bite or should I go back further in the document.

and carry copy slightly more. Now going further back, I might be able to get a longer match that, you know, 'cause maybe I've got CAT and I've got ca and I can build CAT by going back 700 bytes and copying three bytes. Or I've recently done a ca I can go back a smaller distance to get just the C and A copy those two and then just output a T. And those are equivalent and the output stream, but the exact bit patterns of how to describe those two operations of going back a long way and copying three versus going back a short distance will change the overall output compression, uh, uh, ratio because, um, maybe the nearer ones are used more often. And so, you know, they, they, they'll have a smaller coding in the tree. Yeah. Okay. And that kind of stuff.

So it's not, there isn't like a one true encoding, and that's essentially when you are, you are, you know, gzipping something and you do gzip -9, you're telling it to try really, really, really hard and go through all sorts of different combinations and try different encodings to see if it'll work out as well as, um, there are some internal acceleration structures that are sized accordingly. Yeah. Um, when you do that, and it's why it takes longer in the compressor because it's essentially searching through to find what might be the best, um, encoding of the data.

Ben Rady

Yeah. I didn't realize that when you're, when you're telling these algorithms to basically try harder, what they're doing is trying more internal algorithms, more approaches, I guess is one way to think about it.

Matt Godbolt

Amongst other things. Yeah. I mean, so most them end up having to do, um, I mean if you imagine what it looks like you're trying to find a prefix, um, in some data you've already sent out, so you've got 64K of the stuff you've previously sent, and then you're looking at the next byte in the compressor and you're saying, oh, what is the longest match in the 64K buffer I've got behind me. Um, and the one way to do that is to just try, go back all 64 K and look for the, but that's a hugely order N squared operation you're now doing. Right. Right. And so most things do some kind of stuff where they hash a few bytes and go, well, I, let's keep a hash table of like the prefixes of A, B, C, and I look at my hash table and kind of go, oh, I've not seen A, B, C, or if I have, it was too far ago, nevermind.

Um, and, and then as I say, there's all these other sort of su subtles about like, well, maybe you wanna prefer nearer matches than further matches, because you're more likely to say, copy three bytes ago over and over and over again, rather than copy six thou 64,922 bytes ago, which is a very, you know, unique encoding compared to nearby. Right. So, so, so there's a whole bunch of trade-offs, uh, up there. But anyway, so you've done this a number of times, and then you are, you are gonna tell the decompressor, Hey, I picked to use, um, a dynamic tree, which I'm gonna transmit to you in this block. And, um, uh, then, you know, obviously it's dictionary decompression after that. So the tree gets transmitted and there's a pretty compact representation for the tree in terms of which bits mapped to which symbols, and then the, the, the decompressor has to generate that tree and then obviously look through, um, and just copy from before.

But those, those chunks we're talking about these like 64K-ish blocks that the compressor has chosen to like look ahead and kind of make its determination, those are different from the sliding window of 64K. Right. That's a, that's a separate, those are separate separate things there. Right. So, because the compressor could just choose to look at the whole file and say, actually this is this technique that I'm about to do. This tree that I'm gonna build is good for the whole file. Let's send you the whole file. But for pragmatic reasons of not having to go all the way to the end of the file and buffer it all in memory, if you're a streaming thing, usually it's chunked into smaller pieces. Chunking into smaller pieces also has a nice side effect because it means that you get a chance to give a different tree every 32K, 64K, whatever the algorithm decides, the compressor.

And so if you have a text file that's in different languages, you know, hey, the first half of this is in English and then it goes into Norwegian, maybe you need a different set of like, you know, right. Um, vowels, uh, more often or whatever, diuretic marks and stuff like that. Or, you know, more generally in your, um, in your data file format, you are compressing, there's likely to be a big header at the beginning, which has some characteristic, and then a whole bunch of like other data in the middle, which is a different thing. So, so it gives it an opportunity to kind of reset and change the, um, the relative frequencies of all of the symbols that are occurring depending on, you know, on the data. Um, which of course is another degree of freedom the compressor has, Hey, I could choose not to do that.

It could do the whole thing. It could try, it could, it could try doing one K, then two K, then four K, then it's got a lot of degrees of freedom there. Which means, you know, again, it's, it can be, uh, difficult to write an, an optimal or provably optimal compressor. And this is all deflate for what it's worth, this is all what deflate does. Deflate has, um, dictionary base, which, um, if people know what Lempel-Ziv LZ 77, um, I think recently one Ziv, oh, Lempel, one of the two died recently. There was a, which was, which was sad, but anyway, uh, Leal Ziv is like all this dictionary based thing where you have like these back references to n bytes ago copy m and there are a hundred different ways of encoding it or choosing to, to, to do it. And there's all the different lz, 70 sevens and other, uh, numbers.

I can't remember what the other one is now. Lz4? No, it's gone. Uh, then obviously the, the, um, the tree, uh, the, the tree is called a Huffman tree, and it's a huffman encoding tree. It's a way of building these unambiguous, um, sequences of bit patterns that will encode a node in a tree. . And the length of that sequence is provably the shortest, unique sequence that will get you to that particular point in the tree based on its relative, um, uh, occurrence. How, how often it occurs. So that's pretty cool. Um, and deflate is built, built of those two primitives in various, like, you know, clever bit packing, encoding and other bits and bit, bit bits and pieces, sort of, the devil is in the detail with a lot of these things, but there are some really other left field ways of, of compressing data.

Okay. I mean, you mentioned lossy.

Ben Rady

Yeah. Right.

Matt Godbolt

So obviously you could decide to just, you know, I dunno, divide everything by two and, and then, uh, uh, and then that immediately you've, you've, you've saved one bit for every Right. This has not changed in any meaningful way. The contents of the data, it's just rearranged it.

Ben Rady

Okay. Okay.

Matt Godbolt

But the, the side effect of the Bo's wheeler transform is that it is sort of sorted. Your data, your data is more sorted, it's more ordered than it was before 'cause of the transform than it was before.

Ben Rady

Interesting.

Matt Godbolt

Right? Yeah. It is interesting. You know, you take a megabyte of data, you apply the BW T to it, and it's a permutation of it. And the data, if you looked at it, was more likely to be a, a, a, a, a, a, a, a, B, B, B, B, B, C, C, C, D, D, d, E. Right. A Right. And it might go back to a, again, it's not, you know, it can't, it's not just sorting it. Yeah. Uh, and, and again, you, there's a great Wikipedia, uh, article on the bwt, and you, it's got a picture and you can see what it's doing.

Ben Rady

Interesting.

Matt Godbolt

The way of of reversing it is more tricky than you might think, but it, it can be done, right? Yeah. Okay. So if you apply the bwt before you throw it into a compressor,

Ben Rady

Right, yeah.

Matt Godbolt

What you've done is you've made it more compressible!

Ben Rady

Right. Right. Yes.

Matt Godbolt

Which is cool. And that's what, um, oh, which one is it? Tar dot. Not xz. What's, no, 'cause that's another one, one of the, um, uh, uh, bz2, uh, yeah, the BZ, um, uh, extension BZ2. Uh, for like tar bz. T two is is Burrow's wheeler, I think the B inside that. Oh, okay. Of B zip. Um, so it applies it before it, it, it runs it into the, the compressor. And actually while we're talking about that, there's another general sort of technique for compression, which is to apply a domain specific filter to your data before you throw it into a compressor.

So for example, you might know that there are certain sequences in, uh, say arm assembly code, or arm machine code, I should say that are very common, but they are relative to their, or rather they're absolute locations, right. They are absolute locations in the file or relatives. Mm-hmm. No, no, they're relative. Sorry, let me get it straight. They are relative. So you can imagine there's a, a branch instruction, then it has a relative destination. Um, and so, but if you are, if if every branch, if every third branch is calling the same absolute destination they're encoding will be different because relative to where that branch is, it's a different offset, right? Which means it's, you know, unique. But if you were to extract that out and say, okay, well these are all going to the same location, I'm gonna make them absolute right? Yeah.

Ben Rady

Then they'll compress better.

Matt Godbolt

Now suddenly it's the same bits every time, every time we see that branch, whatever, or something like that. I've just butchered it, obviously. Um, but, um, yeah. So there are certain things where if you know ahead of time that you can make the data more compressible in a reversible way, BWT being like a more generalized way of doing it. Then that's a, that's, that's another, uh, a technique, but obviously the decompressor has to know how to undo that too. And it has to know that that process has happened. And very often those are incredibly application specific.

Ben Rady

Yeah. Interesting. So it's like you're, you're intentionally introducing duplication in your data in a re uh, reversible way because it makes the compressions algorithms job much easier.

Matt Godbolt

Right.

Ben Rady

Uh, and you know, I've never actually done the experiment before of just taking a, a, like a bunch of random text sorting it and then compressing it,

Matt Godbolt

I mean, it's a measure, I mean, generally speaking is the measure of, uh, of how much disorder in the file is the entropy. And, you know, that gives you some kind of lower bound on how much you can compress a file. So obviously if you took a completely random file , you know, if you catted dev random Right. And you compressed it, you'd imagine it shouldn't compress at all. It should be slightly bigger. Right. But obviously if you sorted it, uh, what you're likely to have is there equal number of zeros, ones two, threes, 4, 5, 6. It's all about 255.

Ben Rady

Yeah. It would compress really well,

Matt Godbolt

And you'd imagine it would compress down to almost nothing because it'd be like, yeah, there's 250 million zeros, there's 250 million ones, there's 250 and so on. And you're like, all right. So in the limiting case you can get a, a really good ratio, but obviously it's not reversible.

Ben Rady

Right. Right, right, right, right

Matt Godbolt

But if you can come up with a way of encoding how to make, put the zeros back in the right place in the final document, then , you are sort of edging towards a compression algorithm, but the amount of data it would take to encode the positions of all the zeros, for example, and then all the positions of the ones should be the same amount of data as you put in in the first place. So you, you know, there's no free lunch there, but it might be better to encode or it might be easy to encode, which is like essentially what the, the BWT is, is doing. Yeah.

Ben Rady

Yeah.

Matt Godbolt

And I'm sure, I'm sure I've butchered some description of it there and we'll get people tweeting at us, uh, or mastedon-ing at us, um, for those things, but , one thing I did wanna say, actually just while, while I'm here, 'cause I've got the page open in the background now I've been thinking about it, is that, that sometimes, um, the, the trade off you wanna make in the compressor is,

Ben Rady

Uh,

Matt Godbolt

You know, like I imagine your Google and you wanna send out the Google, uh, you know, there's a single Google dot PNG or Google dot, whatever , right? Yeah. And like literally every third web transfer will be of this in the whole world is on Google png. Right. You know, so, you know, every bite you can save on that PNG is worth its weight in gold. Right. It Right. I dunno how much a byte weighs, but you know, whatever. Given that PNG internally is using deflate, which is another, another thing it might be worth saying, I don't care about if it takes me six days to compress this PNG, if it saves me one byte that'll pay for itself over the course of probably a, a day

Ben Rady

Millions and millions and millions of page loads. Yeah,

Matt Godbolt

Exactly. And so there are some, um, compression libraries that are dedicated to generating almost perfect bitstreams of like, this is the smallest conceivable, um, sequence of valid deflate bits that will cause the output to be what you want it to be. And one of those is something called Zopfli, Z O P F L I, it hasn't been touched in years. Um, but it is a way of creating, um, uh, and compressing a, a, um, uh, a source document and generating essentially the optimal output. And it takes hours,

Ben Rady

interesting.

Matt Godbolt

A really long time. And it's very limited. You can't do very much with it. Um, the only reason I know about it is because of, um, hobby projects that use this to compress stuff on a PC to be then put onto a, an eight bit computer where again, that trade off's fine. You know, like the source data is small, it's, you know, 20 k, but you wanna compress it as much as actually possible for your, you know, your eight K demo you're trying to get working or your four K demo you're trying to get working on your, your 6502 based machine. And so it's, it's worth saying, well, I'll spend 30 minutes in my build squishing it, those two more bytes 'cause I can do something cool with it, or again, it'd be worth it.

Ben Rady

That's neat.

Matt Godbolt

And I think the way that they work is actually they start backwards. They start at the end of the file and work backwards, and they try and like find the optimal thing that would encode what remains something like that. Somebody explained it to me once, uh, over a pint and, uh, it made perfect sense at the time

Ben Rady

How, how did that work? Yeah, exactly.

Matt Godbolt

Right. Uh, so I'm gonna just do honor all mentions now we've talked about deflate all the way through , um, Google and Facebook have both got their own like, uh, versions, uh, of, again, I think they're all using the same techniques and slicing and dicing this, the same tricks but with different trade-offs. But Z standard, which is now like the seemingly the most defacto compression, at least in our company, uh Z standard is, is coming. Uh, and snappy, they're both libraries again that, that, that are not deflate, but they are, they use broadly the same, um, parts, but yeah, honorable mentions for those only because, you know, we've talked about gzip, which is, you know, ancient really. Right.

Ben Rady

Uh,

Matt Godbolt

Right. And they, you know, those things, the Snappy and Z standard are usually faster and have comparable, if not slightly better compression ratios. too, so , uh, and, uh, yeah, one last thing there was there, somebody posted a, um, uh, so we talked about PNG just then, which is, you know, very complicated file format and that ultimately the internals of it are in fact deflate compressed and whatever. Um, some random, uh, coder just came up with their own image file format using a relatively made up off the fly, like, well, one bit. And then it's like, this is how we encode the next pixel two bits, and this is how we code it in the next pixel, whatever. And, um, it beats PNG hands down for the lossless. So there is still work room in the world for compression formats that are incredibly bespoke. So this is just for image data, but

Ben Rady

Still though, I mean, it's, yeah, there's a lot of image data in the world that's pretty impressive

Matt Godbolt

Exactly. I, I wish I could remember it. And again, I'm sure we'll put something in the notes that says what it really is,

Ben Rady

. Right, exactly.

Matt Godbolt

But yeah, as you can tell, I love this stuff. Um, if I remember rightly, I've got somewhere I've got like a, a, an online view where you can type in words and see how the, um, a dictionary compressor would see it and how a Hoffman tree would see it.

Ben Rady

Oh, well, we're definitely gonna have to put that in the notes. [Matt's referring to https://mattgodbolt.github.io/zindex-pres/#/4/2 where the tree and LZ compression update as you edit the text]

Matt Godbolt

We'll link that in the notes Yeah. Uh, too, so that you can have a little play around and kind of give a more of a visceral understanding about what's going on behind this. But, um, yeah, it's fun.

Ben Rady

Yeah. That that's really cool. Really cool. All right. Uh, maybe part one of two, I don't know. I'm not gonna make that claim. We'll, we'll see, we'll see if it works out like that. Alright. Right man. Well, I'll talk to you later.

Matt Godbolt

Maybe who knows Until next time my friend.

Transcript source: Provided by creator in RSS feed: download file
For the best experience, listen in Metacast app for iOS or Android