Chapter IV.4. Data Compression Algorithms

The main idea behind data compression is to shrink information so that it takes up less space. Not only does compressed data take up less storage space, but it also takes less time to transfer.

There are two types of data compression algorithms: lossless and lossy.

With lossless compression, the algorithm compresses the data without losing any information. Lossless compression is used for archiving and for compressing multiple files into a single file, such as a ZIP archive. Think of lossless compression as a way to pack existing data more efficiently, like refolding clothes to make them fit in a suitcase.

With lossy compression, the algorithm actually loses some information to compress data. Typically, this lost information isn't noticed anyway. The most common examples of lossy compression involve MP3 audio files and video files. When compressing audio, the MP3 standard throws out the audio parts that the human ear can't distinguish. When compressing video, compression algorithms toss out colors that the human eye doesn't notice. By throwing out information, lossy compression algorithms can save space, much like throwing away clothes to make packing the remaining clothes in a suitcase easier.

Because lossy compression throws out data, it can compress the same data much smaller than lossless compression. However, lossy compression has only limited use. You wouldn't want to use lossy compression when storing documents because you can't afford to lose any data, for example. If you absolutely must preserve data, use lossless compression. If you can afford to lose some data in exchange for tighter compression, use lossy compression.

Lossless Data Compression Algorithms

The basic idea behind lossless data compression is to find a way to pack data in a smaller space more efficiently without losing any of the data in the process. To do this, lossless data compression algorithms are typically optimized for specific data, such as text, audio, or video, although the general principles remain the same no matter what type of data the algorithm is compressing.

Run-length encoding

The simplest lossless data compression algorithm is run-length encoding (RLE). Basically, this method looks for redundancy and replaces runs of repeated data with a much shorter code. Suppose you had the following 17-character string:

WWWBBWWWWBBBBWWWW

RLE looks for redundant data and condenses it into a 10-character string, like this:

3W2B4W4B4W

The number in front of each letter identifies how many characters the code replaced, so in this example, the first two characters, 3W, represent WWW; 2B represents BB; 4W represents WWWW; and so on.
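
For readers who'd like to see the idea in code, here is a minimal run-length encoder written in Python. It's only a sketch: it assumes single-character symbols, and it gives every run a count, even a run of just one character.

def rle_encode(text):
    encoded = ""
    i = 0
    while i < len(text):
        run = 1
        # Count how many times the current character repeats in a row.
        while i + run < len(text) and text[i + run] == text[i]:
            run += 1
        encoded += str(run) + text[i]   # for example, WWW becomes 3W
        i += run
    return encoded

print(rle_encode("WWWBBWWWWBBBBWWWW"))   # prints 3W2B4W4B4W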

Note

Run-length encoding is used by fax machines because most images consist of mainly white space with occasional black areas that represent letters or drawings.

The Burrows-Wheeler transform algorithm

One problem with run-length encoding is that it works best when repetitive characters appear grouped together. With a string like WBWBWB#, RLE can't compress anything because no groups of W and B characters are bunched together. (However, a smart version of the RLE algorithm notices the two-character repetitive string WB and encodes the string as 3(WB)#, which tells the computer to repeat the two-character pattern WB three times.)

When redundant characters appear scattered, run-length encoding can be made more efficient by first transforming the data to group identical characters together and then using run-length encoding to compress the data. That's the idea behind the Burrows-Wheeler transform (BWT) algorithm, developed by Michael Burrows and David Wheeler.

The BWT algorithm must use a character that marks the end of the data, such as the # symbol. Then the BWT algorithm works in three steps. First, it generates every possible rotation of the text, as shown in the Rotate column of Table 4-1. Second, it sorts those lines, as shown in the Sort column of Table 4-1. Third, it outputs the last character of each sorted line, which groups identical characters together, as shown in the Output column of Table 4-1. In this example, the BWT algorithm transforms the string WBWBWB# into WWW#BBB.

Table IV.4-1. Rotating and Sorting Data

Rotate     Sort       Output
WBWBWB#    BWBWB#W    W
#WBWBWB    BWB#WBW    W
B#WBWBW    B#WBWBW    W
WB#WBWB    WBWBWB#    #
BWB#WBW    WBWB#WB    B
WBWB#WB    WB#WBWB    B
BWBWB#W    #WBWBWB    B

At this point, the BWT algorithm hasn't compressed any data; it has merely rearranged the data to group identical characters together, which makes run-length encoding more efficient. Run-length encoding can now convert the WWW#BBB string into 3W#3B, thus compressing the overall data.
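
Here is a short Python sketch of the forward transform, assuming the same sort order used in Table IV.4-1 (B sorts first, then W, then the # marker):

def bwt_transform(text):
    order = {"B": 0, "W": 1, "#": 2}        # sort order used in Table IV.4-1
    # Step 1: build every rotation of the text.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Step 2: sort the rotations.
    rotations.sort(key=lambda line: [order[c] for c in line])
    # Step 3: output the last character of each sorted rotation.
    return "".join(line[-1] for line in rotations)

print(bwt_transform("WBWBWB#"))   # prints WWW#BBB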

After compressing data, you'll eventually need to uncompress that same data. Uncompressing this data (3W#3B) creates the original BWT output of WWW#BBB, which contains all the characters of the original, uncompressed data but not in the right order. To retrieve the original order of the uncompressed data, the BWT algorithm repetitively goes through two steps, as shown in Figure 4-1.

The BWT algorithm works in reverse by repeatedly adding the original BWT output (WWW#BBB) as a new column and then sorting the lines, repeating these two steps a number of times equal to the length of the string. So retrieving the original data from a 7-character string takes seven adding and sorting steps.

After the final add-and-sort step, the BWT algorithm looks for the only line that ends with the end-of-data character (#), which identifies the original, uncompressed data. The BWT algorithm is simple both to understand and to implement, which makes it a handy way to speed up ordinary run-length encoding.
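
The reverse process can be sketched in Python the same way, assuming the same B, W, # sort order. The snippet below simply repeats the add-and-sort step once per character and then picks the only line that ends with the # marker:

def bwt_inverse(bwt_output):
    order = {"B": 0, "W": 1, "#": 2}
    sort_key = lambda line: [order[c] for c in line]
    lines = [""] * len(bwt_output)
    for _ in range(len(bwt_output)):
        # Add the BWT output as a new first column, then sort the lines.
        lines = sorted((c + line for c, line in zip(bwt_output, lines)), key=sort_key)
    for line in lines:
        if line.endswith("#"):        # the original data ends with the marker
            return line

print(bwt_inverse("WWW#BBB"))   # prints WBWBWB#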

Figure IV.4-1. Reconstructing the original data from the BWT transformation.

Dictionary encoding

Run-length encoding is a simple algorithm that works well when redundant characters are grouped together but doesn't work as well with redundant data scattered throughout. An alternative to RLE is dictionary encoding. The basic idea behind dictionary encoding is to replace large chunks of data with much smaller codes. Suppose you had the following text:

See Dick. See Jane.

You could replace the redundant text See with a simple code, such as 1, and create a new string:

1 Dick. 1 Jane.

Now you could replace Dick with one code and Jane with another, such as 2 and 3 respectively, to create a compressed version, like this:

1 2. 1 3.

Uncompressing this data means replacing the codes with the actual data, using a dictionary. Each code refers to the actual data, so looking up 1 retrieves the See string, 2 retrieves Dick, and 3 retrieves Jane, as shown in Figure 4-2.
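
As a toy illustration, here is a Python sketch of that uncompression step, using the hypothetical codes 1, 2, and 3 chosen above:

dictionary = {"1": "See", "2": "Dick", "3": "Jane"}

def uncompress(compressed):
    words = []
    for token in compressed.split():
        code = token.rstrip(".")                # keep any trailing period
        # Look up the code in the dictionary; unknown tokens pass through as-is.
        words.append(dictionary.get(code, code) + token[len(code):])
    return " ".join(words)

print(uncompress("1 2. 1 3."))   # prints See Dick. See Jane.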

Figure IV.4-2. Uncompressing data requires using a dictionary to replace codes with actual data.

If you know the data you want to compress ahead of time, such as condensing the entire contents of an encyclopedia on a DVD, you can optimize the dictionary to create the smallest codes, which represent the most common data chunks, such as the word the. In most cases, you don't know the data to compress ahead of time, so you need an algorithm that can analyze data, create a dictionary on the fly, and then compress data using that dictionary. Three popular dictionary encoding algorithms are LZ77, LZ78, and LZW.

The LZ77 algorithm

The LZ77 algorithm was created by two computer scientists, Abraham Lempel and Jacob Ziv, who first published their algorithm in 1977 (hence the name LZ77). The LZ77 algorithm works by looking for repetitive data. Rather than storing this repetitive data in a separate dictionary, the LZ77 algorithm remembers the location of that data in the original file.

When the algorithm finds this same data stored somewhere else, it removes this data (compressing the overall information) and substitutes a pointer to the previously recognized data, as shown in Figure 4-3.

Figure IV.4-3. The LZ77 algorithm replaces redundant data with pointers.

Because pointers take up less space than the actual data, the LZ77 algorithm compresses information. The more redundant data, the more efficient the compression.
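
The following Python sketch captures the spirit of the approach, though it's a simplified illustration rather than the real LZ77 format. It scans the text and, whenever the next several characters already appeared earlier, emits a (distance, length) pointer back to that earlier copy instead of the characters themselves:

def lz77_compress(text, min_match=3):
    output = []                                  # mix of literal characters and pointers
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(i):                       # search everything seen so far
            length = 0
            while (i + length < len(text)
                   and j + length < i
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            output.append((best_dist, best_len)) # pointer to previously seen data
            i += best_len
        else:
            output.append(text[i])               # literal character
            i += 1
    return output

print(lz77_compress("See Dick. See Jane."))      # the second "See " becomes the pointer (10, 4)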

The LZ78 algorithm

One problem with the LZ77 algorithm is that it stores redundant data directly in the compressed data itself. To improve compression, the same computer scientists developed a variation of the LZ77 algorithm — the LZ78 algorithm.

The LZ78 algorithm removes all redundant data and stores it in a separate dictionary. Then the algorithm substitutes the redundant data with much smaller codes stored in the dictionary. By removing this data, the LZ78 algorithm can compress data even further than the LZ77 algorithm.

To uncompress this data, the computer follows each code to the dictionary to retrieve the appropriate data chunk.
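
A rough Python sketch of this approach follows. It builds the dictionary as it reads the text and emits (dictionary index, next character) pairs, where index 0 means that no previously stored phrase applies:

def lz78_compress(text):
    dictionary = {}                              # phrase -> dictionary index
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                         # keep extending a known phrase
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                                   # flush any leftover phrase
        output.append((dictionary[phrase], ""))
    return output

print(lz78_compress("WBWBWBWB"))   # prints [(0, 'W'), (0, 'B'), (1, 'B'), (3, 'W'), (2, '')]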

The LZW algorithm

The LZW algorithm gets its name from Terry Welch, who created his own variation of the LZ78 algorithm, dubbed the LZW algorithm. The LZW algorithm works by creating a dictionary, just like the LZ78 algorithm. But whereas the LZ78 algorithm creates a dictionary of codes that are all the same size, the LZW algorithm creates a dictionary of codes of different sizes.

When compressing text, the LZW algorithm starts out by creating a dictionary of individual letters. Assuming all uppercase letters, A would be represented by 1, B by 2, and so on. However, substituting a number for a single character isn't likely to save much space, so the LZW algorithm continues examining the text for repetitive multiple-character strings to store as a number, such as AB, ABC, ABCD, and so on.

Like most compression algorithms, the LZW algorithm works best on data that contains redundant information, like this:

IAMSAMSAMIAM#

First, the LZW algorithm creates a dictionary of single characters represented by numbers. I gets stored as 9, A as 1, M as 13, and S as 19.

When the LZW algorithm finds the second letter A, it doesn't encode the letter A all over again because it's done that once already. Instead, the algorithm encodes the next two characters, which happen to be AM, and assigns this two-character combination to the next available number, which is 27. (Numbers 1 through 26 are assigned to the individual letters of the alphabet.)

When the algorithm sees the letter S again, it encodes the next two-character string, SA, as the number 28. Then it finds the letter M again, so it encodes the next two-character string, MI, as the number 29. Finally, it sees the letter A again, so it checks the next two-character string, which is AM. Because the algorithm already encoded AM (as the number 27), the algorithm expands to encode the three-character string, AM#, as the number 30, as shown in Figure 4-4.

At the beginning of the data, the LZW algorithm isn't very efficient because it's still building its dictionary. As the dictionary grows with larger amounts of redundant data, the algorithm can replace those large chunks of redundant data with small numeric codes.
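
Below is a compact Python sketch of a textbook-style LZW encoder. It starts with codes 1 through 26 for the letters A through Z and assigns new codes, beginning at 27, to longer and longer strings as it encounters them. The exact codes it produces differ slightly from the simplified walk-through above, and its starting dictionary covers only the letters A through Z, so the sample input omits the # marker; the principle is the same, though: repeated strings collapse into single numeric codes.

def lzw_compress(text):
    codes = {chr(ord("A") + i): i + 1 for i in range(26)}   # A..Z -> 1..26
    next_code = 27
    output = []
    current = ""
    for ch in text:
        if current + ch in codes:
            current += ch                        # keep growing a known string
        else:
            output.append(codes[current])        # emit the code for the known string
            codes[current + ch] = next_code      # remember the longer string
            next_code += 1
            current = ch
    if current:
        output.append(codes[current])            # emit the final string's code
    return output

print(lzw_compress("IAMSAMSAMIAM"))   # prints [9, 1, 13, 19, 28, 30, 13, 27, 13]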

Note

The LZW algorithm is used to compress graphic images stored in the Graphic Interchange Format (GIF). The algorithm was patented in 1985, and the patent holder, Unisys, demanded royalties from software companies that sold programs that could create GIF files. This patent problem caused computer scientists to create and promote an alternate graphics format, Portable Network Graphics (PNG). However, the PNG format never replaced the GIF file format, especially after the LZW patent expired on June 20, 2003.

Figure IV.4-4. The LZW algorithm stores increasingly larger strings as numbers.

Lossy Data Compression

Lossy data compression shrinks data through a combination of packing data more efficiently (like lossless compression) and throwing out chunks of data that aren't considered crucial. As a result, lossy compression is used less often for text (where losing data is unacceptable because a single missing word or number can alter the entire meaning of the text) and more often for audio, graphics, and video.

Basically, lossy data compression shrinks data much more than lossless compression because it can pack data more efficiently, like lossless compression, while also saving additional space by throwing out small chunks of data that aren't missed anyway.

Warning

Most lossy compression methods use lossless compression algorithms in addition to throwing out unnecessary data.

For example, the human eye and ear can only distinguish a fixed range of colors and sounds, so lossy compression simply removes colors and audio that most people don't notice. When done selectively, compressed audio, graphics, or video can be indistinguishable from the original, but at a certain point, lossy compression eventually degrades the original to an unacceptable level, as shown in Figure 4-5.

Figure IV.4-5. Comparison of compressed graphic images.

Note

A specific method for compressing audio or video files is called a codec, short for COmpressor-DECompressor. Some popular audio codecs include MP3, AAC (Advanced Audio Coding), and WMA (Windows Media Audio). Some popular video codecs include RealVideo, WMV (Windows Media Video), and MPEG-4.

The trick behind lossy compression is knowing which data can be removed without degrading quality too far. In an audio file, such as an MP3 file, lossy compression throws out the audio portion that's beyond the human hearing range. In graphics, an image might consist of three shades of blue that are so close as to be nearly indistinguishable. That's when the algorithm strips out the two least-used shades of blue and replaces them with the most frequently used shade of blue. This saves space by reducing the number of colors to store in the file.
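
A hypothetical sketch of that color-reduction idea follows, with made-up shade names standing in for three nearly indistinguishable shades of blue:

from collections import Counter

# Made-up pixel data: three shades of blue that look nearly identical.
pixels = ["blue1", "blue2", "blue1", "blue3", "blue1", "blue2", "blue1"]

counts = Counter(pixels)
keep = counts.most_common(1)[0][0]               # the most frequently used shade

# Strip out the two least-used shades and replace them with the kept shade.
compressed = [keep if shade != keep else shade for shade in pixels]

print(len(set(pixels)), "shades reduced to", len(set(compressed)))   # 3 shades reduced to 1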

Video basically saves successive still images, so lossy compression can save space by looking for identical backgrounds between video frames. Rather than store the same background multiple times, lossy compression stores the background only once and uses that identical image multiple times. Because the same background may appear in several video frames, this technique can shrink the size of a video considerably.

Another way to compress data is to lower the bit rate. The bit rate defines how many bits are used to store each second of audio, such as 96 Kbps or 160 Kbps (kilobits per second). The more bits used, the greater the quality but the larger the file size. The fewer bits used, the less storage space required, which reduces the file size. That's why a 96 Kbps MP3 file is smaller than the same file saved as a 160 Kbps MP3 file. The 96 Kbps file can't store as much data as the 160 Kbps file, which means lower audio quality.
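
A quick back-of-the-envelope calculation shows the difference in size. Assuming a hypothetical 4-minute song, the file size is roughly the bit rate multiplied by the playing time:

duration_seconds = 4 * 60                        # a hypothetical 4-minute song
for kbps in (96, 160):
    size_bytes = kbps * 1000 * duration_seconds / 8          # convert bits to bytes
    print(kbps, "Kbps ->", round(size_bytes / 1_000_000, 1), "MB")
# prints roughly 2.9 MB at 96 Kbps versus 4.8 MB at 160 Kbps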

When compressing a file, lossy compression may use constant bit rate (CBR) or variable bit rate (VBR) compression. CBR compresses the entire file at a single, fixed bit rate, which makes compression faster. Unfortunately, this also means that silent portions of an audio file get compressed at the same rate as noisier parts, resulting in less-than-optimum compression.

VBR alters the bit rate, depending on the complexity of the data. This improves quality but at the cost of a slower compression time. For even higher quality, some compression algorithms offer two-pass VBR, which means the program analyzes the file twice to get the maximum quality and the smallest file size possible, but at the expense of much slower compression speed.

All types of compression involve a trade-off. With lossless compression, the trade-off is between size and speed: the smaller you want to compress the file, the longer it takes. With lossy compression, the trade-off is mostly between size and quality: the smaller the file size, the lower the overall quality. Both lossless and lossy compression algorithms are necessary, depending on which type better suits your needs.
