
How File Compression Works (The Basics)


8m read · Nov 3, 2024

Hey guys, this is Matt Kendell, and today I want to talk a little bit about file compression. Now, file compression is something pretty much everyone uses every day, whether you know it or not. I just think it's a lot simpler than people believe, and that it's actually pretty straightforward to understand, at least roughly, how compression is doing what it's doing.

So, what I want to do with this video is give a couple of really simple examples and actually walk through how compression might work on them. The main example I'm going to be using is DNA sequences, and the reason they're such a good example is that they're pretty simple: an entire string of DNA is made up of just four letters, which makes them easy to play around with in examples.

Let's start off by just talking about how we might encode a DNA sequence without any compression. This will probably seem pretty straightforward to a lot of you, but for people who haven't done any programming before, it might be new. So, let's say, just to start things off, that a DNA sequence is made up of these four letters and each letter is equally probable; they're evenly distributed. It's just a random string of these letters, and we want to figure out how to encode it as a string of ones and zeros so that we can store it on a computer or send it over the internet, something like that.

One way to do this is to just assign two bits, a two-digit binary sequence, to each letter. So, for G I would have 00, for C I would have 01, and so on. I chose these arbitrarily; you can assign whichever two-bit sequence you want to whichever letter. The key is that there are four different ways you can combine two bits, and I'm assigning one of those to each letter.

So, if I have a sequence like ACTGCT, like I do here at the bottom, I would look up the two-bit code for A, the two-bit code for C, and so on, and write down a binary string for the DNA. This is how we would encode DNA using this very basic encoding scheme. There's no actual compression going on yet; this is just a way to encode DNA without any compression, and it's really easy to decode: if I see a string of bits and I want to turn it back into DNA, I just look at the pairs of bits and figure out which letter each one corresponds to.
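To make this concrete, here's a minimal sketch of that fixed-width scheme in Python. Only G as 00 and C as 01 are pinned down above; the 10 and 11 I've given to A and T are just my own arbitrary picks.

```python
# Fixed-width encoding: every letter gets exactly 2 bits.
# G=00 and C=01 come from the example above; A=10 and T=11
# are arbitrary picks (any assignment of the four patterns works).
FIXED_CODE = {"G": "00", "C": "01", "A": "10", "T": "11"}

def encode_fixed(dna: str) -> str:
    return "".join(FIXED_CODE[letter] for letter in dna)

def decode_fixed(bits: str) -> str:
    lookup = {code: letter for letter, code in FIXED_CODE.items()}
    # Decoding is easy: just read the bits two at a time.
    return "".join(lookup[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(encode_fixed("ACTGCT"))                # 100111000111
print(decode_fixed(encode_fixed("ACTGCT")))  # ACTGCT
```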

So, now I want to consider a more complex example where different letters are actually more or less likely. I'm going to call these lumpy DNA sequences. For instance, at the bottom, I've included an example: you see that G is a lot more probable, C is somewhat more probable, and A and T don't happen as much. This is just an example of a very simple kind of data where certain symbols are more probable than others and happen more frequently. This is not something I've just made up; the English language, for instance, is like this, where certain letters are much more frequent than others.

So, it might be really helpful to figure out a better way to encode this than our old scheme, where each letter gets the same number of bits. Here is a way we could encode DNA that's a little bit different from how we did it before. In this scheme, G, which is the most probable, is encoded as a single bit; it's just 0. C, which is the second most probable, is encoded as two bits; it's 10. And A and T are both encoded as three bits: A is 110 and T is 111.
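Written out in code, the encoder for this variable-length scheme might look something like this minimal sketch:

```python
# The variable-length code from above: more probable letters
# get shorter codes.
VAR_CODE = {"G": "0", "C": "10", "A": "110", "T": "111"}

def encode_var(dna: str) -> str:
    return "".join(VAR_CODE[letter] for letter in dna)

print(encode_var("GCAG"))  # "0" + "10" + "110" + "0" = "0101100"
```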

At first, this might seem a little weird, because I said G was 0; so if you're encoding data and you see a G, you send a 0 as the next bit. But then other codes have zeros in them too: C has a 0 in it, A has a 0 in it. So how is this actually a valid code? Before I show you why this might actually be better than the old code, where everything was the same number of bits, I just want to demonstrate that this is a valid code and that you can encode and decode with it.

The way I'm going to demonstrate this is just by going through an example, and hopefully you'll start to get a feel for why this is actually a valid code. So, let's say we're given the string of binary that I have at the bottom of the screen here, and we're told it was encoded as follows: if the encoder saw a G in the DNA sequence they were encoding, they sent a 0; if they saw a C, they sent a 10; if they saw an A, they sent a 110, etc. These are the bits we've now received from them, and we want to figure out what letters they had been looking at that caused them to generate this string of bits.

So, just to jump right in, we're going to look at the first bit and figure out which letters it could correspond to. If we look at the codes for all the letters, all of them start with 1 except for G. So, we can immediately eliminate G, and we know it has to be one of the other three letters, because if the person had been looking at a G when they encoded this, there would be a 0 there.

Now we look at the next bit, which is a 1, so we know it can't be a C, because C's code is 10. Since we see 11 here, it has to be one of the two letters whose codes start with 11. And finally, looking at the third bit, which is a 0, it has to be an A, because A is the only code that reads 110. So now, we've figured out the first letter the encoder saw when they sent us the string of binary.

An interesting thing to note is that when we started looking at the string of binary, we didn't actually know how many bits the first letter would take up. It could have been one bit or two bits or three bits. But as we kept reading, we learned it had to be three bits; it couldn't have been two bits, because that would mean it was a C, and it just wasn't a C: it didn't start with 10.

So now, let's go ahead and try to read the second letter. We see a 1, so it has to start with 1, and then we see a 0, so it has to be C; that's the only code that starts with 10. Let's do the third one: we see a 1, so that narrows it down to three letters; we see another 1, which narrows it down to two; and then we see a third 1, so it has to be a T. And the next one is a 0, so immediately it has to be a G.

You could continue this process, but the key is that at any given point, we've seen some bits, we can figure out all of the letters whose codes start with that string of bits, and we can use that to decode. Continuing the same process decodes the rest of the string.
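Here's that decoding loop written out as a small sketch, following the same left-to-right process as the walkthrough:

```python
VAR_CODE = {"G": "0", "C": "10", "A": "110", "T": "111"}

def decode_var(bits: str) -> str:
    lookup = {code: letter for letter, code in VAR_CODE.items()}
    letters, chunk = [], ""
    for bit in bits:
        chunk += bit
        # Because no code is a prefix of another code, the first
        # time our chunk matches a code, the match is unambiguous.
        if chunk in lookup:
            letters.append(lookup[chunk])
            chunk = ""
    return "".join(letters)

print(decode_var("110101110"))  # A (110), C (10), T (111), G (0) -> ACTG
```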

So hopefully by now, I've convinced you that it's actually possible to encode things like DNA or other kinds of data without the restriction of having the same number of bits for each letter. We can actually use a code like the one I've shown here to have certain letters take up more space than other letters, and this is going to be really important for our story about how compression works.

The next thing I want to do is demonstrate that this code for the lumpy DNA sequences we've been working with is actually better than the original code. So, if you'll recall, the original code, on average, used two bits per letter because every letter was two bits. But now, I want to figure out on average how many bits per letter this new code uses in our lumpy DNA sequences, and to do that we're gonna use a weighted average.

So, 50% of the time we have one bit, so we're gonna do 0.5 times 1. That just represents, you know, half of the time we have one bit. 25% of the time we're going to be using two bits because C happens 25% of the time; you know, 25% of the letters in these lumpy sequences are C. So, we're gonna add onto our weighted average 0.25 times 2 bits.

Then the other two letters are three bits each, but they each happen 12.5% of the time, so we do 0.125 times 3 and add that twice. This gives us a weighted average: a measure of, if we encoded a long string of DNA, say 20 letters, roughly how many bits it would take. In other words, it tells us how many bits the average letter takes up.

And the answer in this case, if we actually do the math like I've done here, is 1.75 bits per letter. This is less than the 2 bits per letter we got with the original code. The reason we were able to do better with lumpy DNA sequences than we could with regular DNA sequences is that we were able to use our knowledge of how frequently different letters happen.
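Here's the same weighted average in code, just to check the arithmetic:

```python
# Expected bits per letter under the lumpy distribution.
probs = {"G": 0.5, "C": 0.25, "A": 0.125, "T": 0.125}
code_lengths = {"G": 1, "C": 2, "A": 3, "T": 3}

avg_bits = sum(probs[x] * code_lengths[x] for x in probs)
print(avg_bits)  # 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75
```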

At this point, I feel it's important to note that this encoding scheme is only better for lumpy DNA sequences. If we were to encode regular DNA sequences, where all the letters happen with the same probability, this would actually do worse than encoding things as 2 bits each: the average would be 0.25 times 1 plus 0.25 times 2 plus 0.25 times 3 plus 0.25 times 3, which comes out to 2.25 bits per letter. The reason it's better here is that we're encoding lumpy DNA sequences with this lumpy code. You want the lumpiness of the code to match the lumpiness of the data you're encoding, in a sense.

So now, I want to move on and talk more generally about how compression might actually work in the real world, not just in an example like this where we have these lumpy DNA sequences. One very basic example is the English language. If I want to compress a bunch of text, I'll probably gain a lot by encoding it in such a way that common letters like E and T and A are a lot smaller than less frequent letters like Q. So, Q might actually be pretty big; it might take up a lot of bits. But since I don't have to encode it very often, it's worth it, because I can make things like E and T, which happen a lot more frequently, take up less space.
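You can see this lumpiness for yourself by counting letters in any English sample; here's a quick sketch (the sentence is just an arbitrary sample I picked):

```python
from collections import Counter

sample = ("compression works by giving short codes to the letters "
          "that appear most often and longer codes to the rare ones")
counts = Counter(c for c in sample if c.isalpha())

print(counts.most_common(3))  # frequent letters like e, t, and o sit at the top
print(counts["q"])            # rare letters like q barely appear at all
```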

The way more advanced compression works is that it moves beyond this idea of letter-by-letter encoding, like we did with the DNA sequences, and actually builds a dictionary. A compression algorithm will look at the data it's about to compress, and maybe, let's say, it's English, because we all presumably know English if we're watching this. It might pick out certain words that happen a lot, or certain pairs of letters, or other very common things.

Like, maybe CH happens a lot, because that's a common pair of letters in English, or maybe the word “and” happens a lot, or the word “that” happens a lot. It'll build a dictionary of these words and fragments, and it will encode them such that really frequent entries of the dictionary take up less space than less frequent ones.
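I won't walk through a real dictionary coder here, but you can see the effect with Python's built-in zlib module, which implements DEFLATE (the algorithm behind zip and gzip): an LZ77 dictionary stage followed by the same kind of variable-length coding we built above.

```python
import zlib

# DEFLATE pairs a dictionary stage (LZ77) with variable-length
# Huffman codes, so repetitive text shrinks dramatically.
text = ("that and that and the other thing that happens a lot " * 50).encode()

compressed = zlib.compress(text)
print(len(text), "->", len(compressed))  # compressed size is far smaller

# Round-trip check: decompressing gives back exactly what we put in.
assert zlib.decompress(compressed) == text
```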

You can kind of imagine these algorithms getting creative, because basically, the better you can model and predict what's going to happen in your data, the better you can compress it by making the likely things smaller and the less likely things larger. That trade-off is unavoidable; you can't get away with just making everything smaller. But anyway, that's kind of where I want to end things.

I don't want to get into too many specifics, because then I would have to cover particular compression algorithms. My idea was just to make compression seem a little more accessible to you and to get across the two main ideas: encoding things in binary in a variable-length way, so some things are longer than others, and using probabilities, so that the likely things are smaller and the less likely things are bigger. This is just a high-level overview of how compression happens and how your computer does it.

So, thanks for watching! I hope you enjoyed it, and goodbye!
