DictionariesIt's implicit in our above example that we've already got a table in mind that maps letters to numbers. This is known as a dictionary, as it's a reference for the algorithm to look up the values for each letter. So what if you added whole words to the dictionary, and represented words with codes in the output? Suddenly words like 'the', 'david', 'makeover', and 'atomic' (you just had to... -ed) can all be reduced to numbers like 00, 01, 10, 11. The space savings increase. Indeed, entire paragraphs can be represented by a handful of numbers.
So dictionaries are another principle compressors use today, but their implementation is a science within itself.
Our above example is a static dictionary; we know in advance the substitutions we want to make. But static dictionaries have limitations. For one, what if the source file doesn't use many of the words in our dictionary? Feeding French into it, for example, won't yield a good result. Additionally, the decompressor needs to know the dictionary to decode the compressed form, so that needs to be sent (or be already present) with the output data.
An alternative is a semi-adaptive dictionary, one which is created by first scanning the input file and building a dictionary based on this. This would provide greater flexibility and better performance, but it's impractical and slow to scan a file in advance (think about compressing a 1GB file), not to mention uses up more system resources. Also, the dictionary needs to be sent with the output file, and a large dictionary can be counter-productive to the efficiency of the compression.
Which is why most dictionaries used today are fully-adaptive. This type of dictionary is created on the fly as the file itself is compressed, and it has a number of advantages. For example the first time the compressor comes across the word 'atomic' it stores it as is, and adds it to the dictionary. The next time is sees 'atomic', it simply adds an index reference to the dictionary. This method doesn't use many resources, is fast because it's not scanning in advance, and best of all you don't need to send the dictionary with the file - instead the decompressor simply performs the process in reverse: it'll see the word 'atomic', add it to its dictionary and assign an index, and thereafter expand all references of the index to 'atomic' as it writes the output file.
Lean, mean, and nifty cool. Which leads us to sliding windows.
Sliding WindowsAnother principle that's used often comes about as a function of the dictionary. It's all well and good to build a dictionary on the fly, but especially with large files of disparate or binary data, a dictionary is going to have limits. How do you define a word or phrase (like 'lean' and 'mean') to match further in the file? How many bytes do you grab, and how large do you let the dictionary become?
Leaving aside a discussion on choosing phrases for a moment (as this is, also, a science within itself), one method of both optimising performance and keeping resource usage under control is to use a sliding input window. Instead of trying to process a file as a whole, and continually add words into the dictionary, only a given 'window' of data is processed at a time, and the dictionary built around it. As data is read, it's added to the start of the window while data at the end is moved out, hence it 'slides'. For example a 64KB widnow would process 64KB at a time as the input stream moves through it.
A sliding window allows a compressor to adapt to the content of the source, and prevent the dictionary from becoming stale. For example, if a file contained both text and binary data, the dictionary built on the text segment wouldn't help much when the binary data starts feeding in. It's inefficient to keep expanding the dictionary (see above), and the current contents don't help, so you simply let the dictionary continue to develop on the fly as the compressor starts finding phrases in the binary data.
Naturally this means that the bigger a sliding window, and the bigger a dictionary, the better a compressing algorithm can perform. However this also uses more system resources (memory and processing power), and takes longer to run, up to and including a breakeven point where the trade-off isn't worth it. Hence, as you're aware, when you're looking at compressors like Zip, Rar, and 7Zip you'll always find options for speed/space tradeoffs, and some may even let you set a sliding window dictionary size. Methods that squeeze the most data take longer to do.
As before, all forms of compression are about finding ways to reduce a given input stream into a shorter version of the same. However not all methods and algorithms work best for different types of data. Images, audio, and video files aren't quite so amiable to the methods that work well with text for example, because the pattern of data in these files follows different conventions. Which is of course why we have compressors and formats specifically tailored to them.
Issue: 133 | February, 2012