Sunday November 22, 2009 5:21 AM AEST

Principles of compression

  • Email a Friend
  • Print Page
«  »
Principles of compression
By Ashton Mills
Mar 2, 2009 | 1 Comment
Tags: compression | archiving

Dictionaries
It's implicit in our above example that we've already got a table in mind that maps letters to numbers. This is known as a dictionary, as it's a reference for the algorithm to look up the values for each letter. So what if you added whole words to the dictionary, and represented words with codes in the output? Suddenly words like 'the', 'david', 'makeover', and 'atomic' (you just had to... -ed) can all be reduced to numbers like 00, 01, 10, 11. The space savings increase. Indeed, entire paragraphs can be represented by a handful of numbers.

So dictionaries are another principle compressors use today, but their implementation is a science within itself.

Our above example is a static dictionary; we know in advance the substitutions we want to make. But static dictionaries have limitations. For one, what if the source file doesn't use many of the words in our dictionary? Feeding French into it, for example, won't yield a good result. Additionally, the decompressor needs to know the dictionary to decode the compressed form, so that needs to be sent (or be already present) with the output data.

An alternative is a semi-adaptive dictionary, one which is created by first scanning the input file and building a dictionary based on this. This would provide greater flexibility and better performance, but it's impractical and slow to scan a file in advance (think about compressing a 1GB file), not to mention uses up more system resources. Also, the dictionary needs to be sent with the output file, and a large dictionary can be counter-productive to the efficiency of the compression.

Which is why most dictionaries used today are fully-adaptive. This type of dictionary is created on the fly as the file itself is compressed, and it has a number of advantages. For example the first time the compressor comes across the word 'atomic' it stores it as is, and adds it to the dictionary. The next time is sees 'atomic', it simply adds an index reference to the dictionary. This method doesn't use many resources, is fast because it's not scanning in advance, and best of all you don't need to send the dictionary with the file - instead the decompressor simply performs the process in reverse: it'll see the word 'atomic', add it to its dictionary and assign an index, and thereafter expand all references of the index to 'atomic' as it writes the output file.

Lean, mean, and nifty cool. Which leads us to sliding windows.

Sliding Windows
Another principle that's used often comes about as a function of the dictionary. It's all well and good to build a dictionary on the fly, but especially with large files of disparate or binary data, a dictionary is going to have limits. How do you define a word or phrase (like 'lean' and 'mean') to match further in the file? How many bytes do you grab, and how large do you let the dictionary become?

Leaving aside a discussion on choosing phrases for a moment (as this is, also, a science within itself), one method of both optimising performance and keeping resource usage under control is to use a sliding input window. Instead of trying to process a file as a whole, and continually add words into the dictionary, only a given 'window' of data is processed at a time, and the dictionary built around it. As data is read, it's added to the start of the window while data at the end is moved out, hence it 'slides'. For example a 64KB widnow would process 64KB at a time as the input stream moves through it.

A sliding window allows a compressor to adapt to the content of the source, and prevent the dictionary from becoming stale. For example, if a file contained both text and binary data, the dictionary built on the text segment wouldn't help much when the binary data starts feeding in. It's inefficient to keep expanding the dictionary (see above), and the current contents don't help, so you simply let the dictionary continue to develop on the fly as the compressor starts finding phrases in the binary data.

Naturally this means that the bigger a sliding window, and the bigger a dictionary, the better a compressing algorithm can perform. However this also uses more system resources (memory and processing power), and takes longer to run, up to and including a breakeven point where the trade-off isn't worth it. Hence, as you're aware, when you're looking at compressors like Zip, Rar, and 7Zip you'll always find options for speed/space tradeoffs, and some may even let you set a sliding window dictionary size. Methods that squeeze the most data take longer to do.

As before, all forms of compression are about finding ways to reduce a given input stream into a shorter version of the same. However not all methods and algorithms work best for different types of data. Images, audio, and video files aren't quite so amiable to the methods that work well with text for example, because the pattern of data in these files follows different conventions. Which is of course why we have compressors and formats specifically tailored to them.

 
«  »
 
This article appeared in the February, 2009 issue of Atomic.

The latest issue is on sale now!

Want to learn all about Diablo III? Want to find out what the best Solid State Drive is on the market today, and how to look after it? Want to catch up on the latest hardware, games and in depth tech from Australia's best enthusiast mag?

Get your copy today :)
1 Comment
Thoughts on this article? Add a comment below.
ahsoka
Mar 2, 2009 3:37 PM
Awesome article. Keep up the good work.
Login or register to submit a comment.
 
 
 
Atomic Magazine

Issue: 107 | December, 2009

Atomic is a magazine aimed squarely at computer enthusiasts, gamers, and serious PC upgraders.

Every month we bring you the latest reviews of new technology and PC components, in depth features on everything from overclocking to console hacking, and gaming previews and interviews.
 
Latest Comments
"Signed"
by Vanoyen | Nov 22, 2009 3:32 AM
 
"I got an XP pro oem with a game build rig 18 months ago and continued to ignore Vista, to my ..."
by TonyB | Nov 21, 2009 10:24 PM
 
"Holy shit, batman.

*runs"
by colganaitor | Nov 21, 2009 7:17 PM
 
""sudo preupgrade"
...failed to download installer metadata
------------
So ..."
by wlayton27 | Nov 21, 2009 8:16 AM
 
"^ I find with CoD4 that I can jump on an empty server and be joined by 6-12 others before the ..."
by Ezekill | Nov 20, 2009 10:10 PM
Latest User Reviews
Shenmue II
10%
asdfasdf
 
EVGA X58 Classified
90%
great board, a few things could be better
 
EVGA X58 Classified
90%
Gorgeous looking
 
Sapphire 4890
90%
So good, I immediately wanted a second one!
 
MSI 790FX-GD70 motherboard
90%
Allmost the prefect gaming board