Friday February 10, 2012 1:35 PM AEST

Principles of compression

By Ashton Mills
14:51 Mar 2, 2009 | 1 Comment
Tags: compression | archiving
«  »
Principles of compression

Dictionaries
It's implicit in our above example that we've already got a table in mind that maps letters to numbers. This is known as a dictionary, as it's a reference for the algorithm to look up the values for each letter. So what if you added whole words to the dictionary, and represented words with codes in the output? Suddenly words like 'the', 'david', 'makeover', and 'atomic' (you just had to... -ed) can all be reduced to numbers like 00, 01, 10, 11. The space savings increase. Indeed, entire paragraphs can be represented by a handful of numbers.

So dictionaries are another principle compressors use today, but their implementation is a science within itself.

Our above example is a static dictionary; we know in advance the substitutions we want to make. But static dictionaries have limitations. For one, what if the source file doesn't use many of the words in our dictionary? Feeding French into it, for example, won't yield a good result. Additionally, the decompressor needs to know the dictionary to decode the compressed form, so that needs to be sent (or be already present) with the output data.

An alternative is a semi-adaptive dictionary, one which is created by first scanning the input file and building a dictionary based on this. This would provide greater flexibility and better performance, but it's impractical and slow to scan a file in advance (think about compressing a 1GB file), not to mention uses up more system resources. Also, the dictionary needs to be sent with the output file, and a large dictionary can be counter-productive to the efficiency of the compression.

Which is why most dictionaries used today are fully-adaptive. This type of dictionary is created on the fly as the file itself is compressed, and it has a number of advantages. For example the first time the compressor comes across the word 'atomic' it stores it as is, and adds it to the dictionary. The next time is sees 'atomic', it simply adds an index reference to the dictionary. This method doesn't use many resources, is fast because it's not scanning in advance, and best of all you don't need to send the dictionary with the file - instead the decompressor simply performs the process in reverse: it'll see the word 'atomic', add it to its dictionary and assign an index, and thereafter expand all references of the index to 'atomic' as it writes the output file.

Lean, mean, and nifty cool. Which leads us to sliding windows.

Sliding Windows
Another principle that's used often comes about as a function of the dictionary. It's all well and good to build a dictionary on the fly, but especially with large files of disparate or binary data, a dictionary is going to have limits. How do you define a word or phrase (like 'lean' and 'mean') to match further in the file? How many bytes do you grab, and how large do you let the dictionary become?

Leaving aside a discussion on choosing phrases for a moment (as this is, also, a science within itself), one method of both optimising performance and keeping resource usage under control is to use a sliding input window. Instead of trying to process a file as a whole, and continually add words into the dictionary, only a given 'window' of data is processed at a time, and the dictionary built around it. As data is read, it's added to the start of the window while data at the end is moved out, hence it 'slides'. For example a 64KB widnow would process 64KB at a time as the input stream moves through it.

A sliding window allows a compressor to adapt to the content of the source, and prevent the dictionary from becoming stale. For example, if a file contained both text and binary data, the dictionary built on the text segment wouldn't help much when the binary data starts feeding in. It's inefficient to keep expanding the dictionary (see above), and the current contents don't help, so you simply let the dictionary continue to develop on the fly as the compressor starts finding phrases in the binary data.

Naturally this means that the bigger a sliding window, and the bigger a dictionary, the better a compressing algorithm can perform. However this also uses more system resources (memory and processing power), and takes longer to run, up to and including a breakeven point where the trade-off isn't worth it. Hence, as you're aware, when you're looking at compressors like Zip, Rar, and 7Zip you'll always find options for speed/space tradeoffs, and some may even let you set a sliding window dictionary size. Methods that squeeze the most data take longer to do.

As before, all forms of compression are about finding ways to reduce a given input stream into a shorter version of the same. However not all methods and algorithms work best for different types of data. Images, audio, and video files aren't quite so amiable to the methods that work well with text for example, because the pattern of data in these files follows different conventions. Which is of course why we have compressors and formats specifically tailored to them.

 
«  »
 
This article appeared in the February, 2009 issue of Atomic.

Behind the scenes with Mass Effect 3! GTX 560 VGA round-up! Essential Skyrim tweaks to improve your game! Plus reviews, news, hardware, more games, and easy to following modding guides for PC builders. ON SALE NOW!
1 Comment
ahsoka
Mar 2, 2009 3:37 PM
Awesome article. Keep up the good work.
Comments have been disabled on this article.
 
Latest Competitions
 
Atomic Magazine

Issue: 133 | February, 2012

Atomic is a magazine aimed squarely at computer enthusiasts, gamers, and serious PC upgraders.

Every month we bring you the latest reviews of new technology and PC components, in depth features on everything from overclocking to console hacking, and gaming previews and interviews.
 
Latest Comments
 
Latest User Reviews
Battlefield 3 is the new benchmark online FPS
90%
A very fun and realistic multiplayer ride.
 
Antec Kuhler 920 - liquid cool
90%
Antec Kuhler 920 silent but effientive out of the box no maintence water cooling kit
 
Antec's Lanboy Air - our new favourite case
90%
Antec Lan boy Air in red a very cool design
 
Antec's Lanboy Air - our new favourite case
90%
This product overall is awesome.
 
MSI's GT780 laptop as fast as it gets
90%
Nice laptop
 
 
Close Get the February, 2012 issue of Atomic mailed to you for $8.95, including postage.

SubscribeBuy nowDigital Version