Guaranteed compression

A few eons ago I was a member of a network called Fidonet, which had an echoarea (the equivalent of a mailing list) dedicated to file compressors. As I was fascinated by them at the time, I was naturally a subscriber.

Every once in a while someone would come in and ask if there could be a compressor that were guaranteed to compress the input data by at least one byte. The response was invariably that no, there couldn’t be one, and the argument was based on the pigeonhole principle: if there were such a compressor, you could use it to compress some data, then use it on the compressed data to compress it even more, and so on until you are left with a byte, which would mean that there are only 256 possible distinct files in all of the Universe. This is clearly absurd, so there can’t be such a compressor.

While I agree with the conclusion, I find the argument unsound, so let’s explore it with the help of a close cousin of this compressor everyone asked about.

There is a compressor that is guaranteed to either produce an output the same size of the input, or produce an output one bit smaller than the input. This compressor simply interprets the input data as a big binary number and subtracts 2 from it. The corresponding decompressor is also very easy: just interpret the input data as a big binary number and add 2 to it.

It is easy to prove that the compressor will always produce an output file that is either the same number of bits as the input file, or 1 bit smaller. Therefore, you can use the compressor, get an output file, use the compressor on the output file, get a new output file, use the compressor on the new output file, and so on until you get a file that only contains 1 bit that can be 0 or 1. According to the argument above, this would mean that only two documents exist in the whole Universe; however this is clearly not true, so what’s the catch?

The catch is that we can’t reconstruct the original document with the compressor’s output alone (the 0 or 1 bit). We need more information to do that: we need to know how many times we applied the compressor so we can apply the decompressor an equal number of times to reconstruct the original file. So let’s add the number of compression operations to the last compressed file. How big is the resulting file?

The compressor works by subtracting 2 from the number we interpret the input file as. If I apply the compressor once, I will have subtracted 2; if I apply it twice, I will have subtracted 4 from the original file; if I apply it 1000 times I will have subtracted 2000. To get a single bit I need to apply the compressor enough times that the original file minus twice the number of applications is 0 or 1. That means that, if I call the original file “\(i\)”, the ultimately compressed bit “\(c\)” and the number of compressions “\(n\)”, I have that \(i=2*n+c\).

Therefore, to compress the input file \(i\) you will need \(n\) operations, and \(n\) is one half of \(i\). You will now need to append the number \(n\) to the output of the last compression step to be able to reconstruct the original file. The input file \(i\) has \(b\) bytes, and \(n\) is one half of \(i\), so \(n\) has \(b-1\) bytes. Therefore, you will have to store 1 bit for the compressed data, plus \(b-1\) bits for \(n\), which amounts to \(b\), which is the length of the original file!

So yes, while it is not true that being able to reduce a file to 1 bit or 1 byte by successive compressions means that there are only 2 or 256 files in the Universe, it is still true that doing so will be of no benefit.