Saturday November 21, 2009 3:58 PM AEST

Principles of compression

  • Email a Friend
  • Print Page
 »
Principles of compression
By Ashton Mills
Mar 2, 2009 | 1 Comment
Tags: compression | archiving

We take it for granted, this everyday innovation. Yet it's one of the building blocks of most everything we do with technology.

Compression is a fascinating subject because it appears, at the surface, to create something from nothing and nothing from something. And it's been a part of computing since, well, before computers. Morse Code, in use since 1838, used shorter codeword substitution for certain letters, a function which is the very basis of compression.

It wasn't until 1951 however that the science started in earnest with then PhD student David Huffman, building on the earlier work of luminaries Claude Shannon and Robert Fano, devising Huffman encoding - a technique still used today.

By the 60s new methods were emerging that improved on Huffman, and by the 70s compression utilities were commonplace on Unix systems. Much better algorithms were also being developed, including the famous (for compression geeks at least) LZ77 and LZ78 by Lempel and Ziv, which led in the 80s to LZW - one of the most ubiquitous techniques used today in everything from GIF and TIFF images through to modems, file systems, and as a basis of other compressors. Indeed, LZW gave rise to PKZIP in 1989, and if you don't know what zip is you probably shouldn't be reading this site.

By the late 80s it was becoming apparent that, even advanced as they were, new methods were needed for the deluge of formats for images, audio, and video. This spurred the introduction of JPG in 1992 and MP3 in 1996, and the rest as they say is history. Literally, compression has changed our world.

The importance of compression
While once the development of compression was partly driven by limited diskspace, memory, and bandwidth (hands up who used V.42bis on their US Robotics 56k modems?), it certainly hasn't gone out of fashion as technology has leapfrogged. Today it plays a role in both hardware and software from the tiniest of embedded devices to behemoth supercomputers.

And even though we have more memory, storage, and bandwidth than ever before, the volume of data we play with has also increased and so compression continues to play an important role.

It's used to maximise memory and storage in your phone, and in transmitting and receiving the calls you make. It's used in your wireless router on its compressed file system and in wireless networking, on your camera as it saves images, in your portable media player with music and its onboard OS, it's even used in your car (well, if it's kinda modern). It's used in the games you play to maximise system resources through compressed audio and texture formats. And it's ubiquitous all over embedded industrial and military applications.

Even the pages of the Atomic magazine are sent digitally to the printers in a compressed format.

And it hasn't just helped advance technology; it's created entire new industries - such as the iPod, and the transition from VHS to DVD. From communicating to satellites on the other side of our solar system, to the humble interweb back down here on Earth, compression is everywhere.

Now that we've covered the what and the why, let's look at the how.

Substitution
To be sure, compression is a hefty science; most people learn about it through expensive tomes of hallowed knowledge. Nevertheless, I'll endeavour to cover the basics here, because that's part of the beauty of compression - even though today's complex algorithms could do your head in, they nonetheless work off the same principles I'll cover here, because at the end of the day, they're all trying to do the same thing.

The first of these is substitution. As mentioned earlier, Morse Code used some substitution and so did Huffman. In its simplest form, we can demonstrate a form of substitution known as RLE, or Run-Length Encoding. RLE simply substitutes the repetition of a character with a value for that many characters repeated, such that a string 'aaaaabbbbbcccc' would become 'a5b5c4'. Saving or sending the RLE version would take less time, yet the original data is there if you know how to decode (and you do).

But let's take it a step further. When you save a text document the letters in the words are saved as their 8-bit binary ASCII representations. For example, the letter 'e' is '01100101'. If we wanted to create a compressed version, we don't have to stick with 8-bit values for the whole alphabet, and instead might choose to allocate less bits for some letters. For example:

a - 00
b - 01
c - 100
d - 101
e - 110

And so on. Note in practice there's a bit more to it, as you'd need to determine when an '01' belongs to a 'b' and not a segment for 'd' when decoding, but the theory is what's important for the moment. You can encode around half the alphabet with 4-bits (up to 16 possible values), so if you were to run a document through this compressor you could (theoretically) shave around 25 per cent off (half the alphabet, half the number of bits) the file size.

But that depends on the letters being used doesn't it? Only half your letters can be represented by 4-bits, and if those letters aren't used, then there's no saving.

So let's improve it. It's accepted that in the English language letters such as 'e', 'o', 't', 'i', and 'n' are much more common than the rest of the alphabet. Even sticking with our current encoding method, we could improve the performance of the technique by assigning these most common letters to the shorter bit substitutions:

e - 00
o - 01
t - 100
i - 101
n - 110

While this is a simple example, it reflects the principles of compression and how much of the work goes into finding new ways to encode more data into less bits. Some advances take an already effective method, and add to it, just as LZ77 and LZ78 led to LZW, and LZW led to PKZIP.

Which, in turn, leads us to dictionaries.

 

 
 »
 
This article appeared in the February, 2009 issue of Atomic.

The latest issue is on sale now!

Want to learn all about Diablo III? Want to find out what the best Solid State Drive is on the market today, and how to look after it? Want to catch up on the latest hardware, games and in depth tech from Australia's best enthusiast mag?

Get your copy today :)
1 Comment
Thoughts on this article? Add a comment below.
ahsoka
Mar 2, 2009 3:37 PM
Awesome article. Keep up the good work.
Login or register to submit a comment.
 
 
 
Atomic Magazine

Issue: 107 | December, 2009

Atomic is a magazine aimed squarely at computer enthusiasts, gamers, and serious PC upgraders.

Every month we bring you the latest reviews of new technology and PC components, in depth features on everything from overclocking to console hacking, and gaming previews and interviews.
 
Latest Comments
"Fucking signed.

"
by index680i | Nov 21, 2009 2:54 PM
 
""sudo preupgrade"
...failed to download installer metadata
------------
So ..."
by wlayton27 | Nov 21, 2009 8:16 AM
 
"I thought Vista outlived it's usefulness about the same time it was released , lol"
by mr.gargoyle | Nov 21, 2009 12:28 AM
 
"^ I find with CoD4 that I can jump on an empty server and be joined by 6-12 others before the ..."
by Ezekill | Nov 20, 2009 10:10 PM
 
"check

LOMAC
DCS Black Shark
X-plane"
by Bastard Child | Nov 20, 2009 8:13 PM
Latest User Reviews
Shenmue II
10%
asdfasdf
 
EVGA X58 Classified
90%
great board, a few things could be better
 
EVGA X58 Classified
90%
Gorgeous looking
 
Sapphire 4890
90%
So good, I immediately wanted a second one!
 
MSI 790FX-GD70 motherboard
90%
Allmost the prefect gaming board