Compression is a fascinating subject because it appears, at the surface, to create something from nothing and nothing from something. And it's been a part of computing since, well, before computers. Morse Code, in use since 1838, used shorter codeword substitution for certain letters, a function which is the very basis of compression.
It wasn't until 1951 however that the science started in earnest with then PhD student David Huffman, building on the earlier work of luminaries Claude Shannon and Robert Fano, devising Huffman encoding - a technique still used today.
By the 60s new methods were emerging that improved on Huffman, and by the 70s compression utilities were commonplace on Unix systems. Much better algorithms were also being developed, including the famous (for compression geeks at least) LZ77 and LZ78 by Lempel and Ziv, which led in the 80s to LZW - one of the most ubiquitous techniques used today in everything from GIF and TIFF images through to modems, file systems, and as a basis of other compressors. Indeed, LZW gave rise to PKZIP in 1989, and if you don't know what zip is you probably shouldn't be reading this site.
By the late 80s it was becoming apparent that, even advanced as they were, new methods were needed for the deluge of formats for images, audio, and video. This spurred the introduction of JPG in 1992 and MP3 in 1996, and the rest as they say is history. Literally, compression has changed our world.
The importance of compression
While once the development of compression was partly driven by limited diskspace, memory, and bandwidth (hands up who used V.42bis on their US Robotics 56k modems?), it certainly hasn't gone out of fashion as technology has leapfrogged. Today it plays a role in both hardware and software from the tiniest of embedded devices to behemoth supercomputers.
And even though we have more memory, storage, and bandwidth than ever before, the volume of data we play with has also increased and so compression continues to play an important role.
It's used to maximise memory and storage in your phone, and in transmitting and receiving the calls you make. It's used in your wireless router on its compressed file system and in wireless networking, on your camera as it saves images, in your portable media player with music and its onboard OS, it's even used in your car (well, if it's kinda modern). It's used in the games you play to maximise system resources through compressed audio and texture formats. And it's ubiquitous all over embedded industrial and military applications.
Even the pages of the Atomic magazine are sent digitally to the printers in a compressed format.
And it hasn't just helped advance technology; it's created entire new industries - such as the iPod, and the transition from VHS to DVD. From communicating to satellites on the other side of our solar system, to the humble interweb back down here on Earth, compression is everywhere.
Now that we've covered the what and the why, let's look at the how.
Substitution
To be sure, compression is a hefty science; most people learn about it through expensive tomes of hallowed knowledge. Nevertheless, I'll endeavour to cover the basics here, because that's part of the beauty of compression - even though today's complex algorithms could do your head in, they nonetheless work off the same principles I'll cover here, because at the end of the day, they're all trying to do the same thing.
The first of these is substitution. As mentioned earlier, Morse Code used some substitution and so did Huffman. In its simplest form, we can demonstrate a form of substitution known as RLE, or Run-Length Encoding. RLE simply substitutes the repetition of a character with a value for that many characters repeated, such that a string 'aaaaabbbbbcccc' would become 'a5b5c4'. Saving or sending the RLE version would take less time, yet the original data is there if you know how to decode (and you do).
But let's take it a step further. When you save a text document the letters in the words are saved as their 8-bit binary ASCII representations. For example, the letter 'e' is '01100101'. If we wanted to create a compressed version, we don't have to stick with 8-bit values for the whole alphabet, and instead might choose to allocate less bits for some letters. For example:
a - 00
b - 01
c - 100
d - 101
e - 110
And so on. Note in practice there's a bit more to it, as you'd need to determine when an '01' belongs to a 'b' and not a segment for 'd' when decoding, but the theory is what's important for the moment. You can encode around half the alphabet with 4-bits (up to 16 possible values), so if you were to run a document through this compressor you could (theoretically) shave around 25 per cent off (half the alphabet, half the number of bits) the file size.
But that depends on the letters being used doesn't it? Only half your letters can be represented by 4-bits, and if those letters aren't used, then there's no saving.
So let's improve it. It's accepted that in the English language letters such as 'e', 'o', 't', 'i', and 'n' are much more common than the rest of the alphabet. Even sticking with our current encoding method, we could improve the performance of the technique by assigning these most common letters to the shorter bit substitutions:
e - 00
o - 01
t - 100
i - 101
n - 110
While this is a simple example, it reflects the principles of compression and how much of the work goes into finding new ways to encode more data into less bits. Some advances take an already effective method, and add to it, just as LZ77 and LZ78 led to LZW, and LZW led to PKZIP.
Which, in turn, leads us to dictionaries.
Dictionaries
It's implicit in our above example that we've already got a table in mind that maps letters to numbers. This is known as a dictionary, as it's a reference for the algorithm to look up the values for each letter. So what if you added whole words to the dictionary, and represented words with codes in the output? Suddenly words like 'the', 'david', 'makeover', and 'atomic' (you just had to... -ed) can all be reduced to numbers like 00, 01, 10, 11. The space savings increase. Indeed, entire paragraphs can be represented by a handful of numbers.
So dictionaries are another principle compressors use today, but their implementation is a science within itself.
Our above example is a static dictionary; we know in advance the substitutions we want to make. But static dictionaries have limitations. For one, what if the source file doesn't use many of the words in our dictionary? Feeding French into it, for example, won't yield a good result. Additionally, the decompressor needs to know the dictionary to decode the compressed form, so that needs to be sent (or be already present) with the output data.
An alternative is a semi-adaptive dictionary, one which is created by first scanning the input file and building a dictionary based on this. This would provide greater flexibility and better performance, but it's impractical and slow to scan a file in advance (think about compressing a 1GB file), not to mention uses up more system resources. Also, the dictionary needs to be sent with the output file, and a large dictionary can be counter-productive to the efficiency of the compression.
Which is why most dictionaries used today are fully-adaptive. This type of dictionary is created on the fly as the file itself is compressed, and it has a number of advantages. For example the first time the compressor comes across the word 'atomic' it stores it as is, and adds it to the dictionary. The next time is sees 'atomic', it simply adds an index reference to the dictionary. This method doesn't use many resources, is fast because it's not scanning in advance, and best of all you don't need to send the dictionary with the file - instead the decompressor simply performs the process in reverse: it'll see the word 'atomic', add it to its dictionary and assign an index, and thereafter expand all references of the index to 'atomic' as it writes the output file.
Lean, mean, and nifty cool. Which leads us to sliding windows.
Sliding Windows
Another principle that's used often comes about as a function of the dictionary. It's all well and good to build a dictionary on the fly, but especially with large files of disparate or binary data, a dictionary is going to have limits. How do you define a word or phrase (like 'lean' and 'mean') to match further in the file? How many bytes do you grab, and how large do you let the dictionary become?
Leaving aside a discussion on choosing phrases for a moment (as this is, also, a science within itself), one method of both optimising performance and keeping resource usage under control is to use a sliding input window. Instead of trying to process a file as a whole, and continually add words into the dictionary, only a given 'window' of data is processed at a time, and the dictionary built around it. As data is read, it's added to the start of the window while data at the end is moved out, hence it 'slides'. For example a 64KB widnow would process 64KB at a time as the input stream moves through it.
A sliding window allows a compressor to adapt to the content of the source, and prevent the dictionary from becoming stale. For example, if a file contained both text and binary data, the dictionary built on the text segment wouldn't help much when the binary data starts feeding in. It's inefficient to keep expanding the dictionary (see above), and the current contents don't help, so you simply let the dictionary continue to develop on the fly as the compressor starts finding phrases in the binary data.
Naturally this means that the bigger a sliding window, and the bigger a dictionary, the better a compressing algorithm can perform. However this also uses more system resources (memory and processing power), and takes longer to run, up to and including a breakeven point where the trade-off isn't worth it. Hence, as you're aware, when you're looking at compressors like Zip, Rar, and 7Zip you'll always find options for speed/space tradeoffs, and some may even let you set a sliding window dictionary size. Methods that squeeze the most data take longer to do.
As before, all forms of compression are about finding ways to reduce a given input stream into a shorter version of the same. However not all methods and algorithms work best for different types of data. Images, audio, and video files aren't quite so amiable to the methods that work well with text for example, because the pattern of data in these files follows different conventions. Which is of course why we have compressors and formats specifically tailored to them.
Image, video, and audio
As mentioned earlier, formats like GIF, JPG, TIFF and many others, not to mention MP3, Ogg, WMV and the rest are all borne from compressors optimised for these target sources. These are also, of themselves, a science. For example unlike text and binary, which have their own patterns, the data in images can often consist of lots of data that, while not always changing from one sequence to the next, can consist of variance of data - such as a color gradient. Rather than repeatedly store the actual data for every pixel, for example, a compressed image format might store simply the variance between one pixel and the next. Then, if you lax the threshold of what constitutes a variance, you can suddenly save a lot of data, though of course with a trade-off, which we'll get onto in a moment.
Video, too, has its own specific traits. Some video compressors treat each frame as an image, and compress the images individually, but just as the data within an image can be repetitive (or closely related), the same is true of video as well. When an actor is talking on screen, their mouth might move but the background is static, and the data between all those repeated frames can be substituted, and thus compressed, to save space. These two methods are known as intraframe and interframe compression respectively.
And audio, of course, is another beast. Using music as an example, rather than store the data of the actual waveforms, the differences between the change in a waveform itself can be stored. Some techniques work by modelling what is expected next in a waveform, and storing the variance of the actual waveform to the predicted model.
Of course all of these methods also share something in common (with the exception of some formats) - they are a lossy approach to compression. While we generally notice if words go missing in a text document, and binaries don't take kindly to bits disappearing, lossless compression isn't necessarily required for images, video, and audio. As humans we have limits to our perception, and tend not to mind or notice degradation of quality up to a certain point. Said another way, there's sometimes a lot of data a compressor can simply throw away entirely with these formats because its loss is imperceptible, or acceptable, to us. And this, of course, can save a lot more space. The art of good lossy techniques lies in determining what can be dropped and what can be compressed to approximate equivalents, and still provide an almost indistinguishable result from the original. Or, of course, electing to save even more space and be more generous with approximations until we do notice the difference, but it's a choice we make for the quality/space tradeoff - such as low or high bit-rate audio and video formats.