11

Entropy, Predictive Coding and Quantization

Chapter Outline

In this chapter, we will discuss some of the basic concepts of data compression, including video and image compression. Until now we have considered only uncompressed video formats, such as RGB or YCrCb, where each pixel is individually represented (although this is not strictly true for 4:2:2 or 4:2:0 forms of YCrCb). However, high levels of compression are possible with little loss of video quality. Reducing the data needed to represent an individual image or a sequence of video frames is very important when considering how much storage is needed on a camera flash chip or computer hard disk, or the bandwidth needed to transport cable or satellite television, or stream video to a computer or handheld wireless device.

11.1 Entropy

We will start with the concept of entropy. Some readers may recall from studies in thermodynamics or physics that entropy is a measure of the disorderliness of a system. Further, the second law of thermodynamics states that in a closed system entropy can only increase, never decrease. In the study of compression, and in the related field of error correction, entropy can be thought of as a measure of unpredictability, and it can be applied to a set of digital data.

The less predictable a set of digital data is, the more information it carries. Here is a simple example: assume that a bit is equally likely to be either zero or one. By definition, this carries one bit of information. Now assume that this bit is known to be one with 100% certainty. It carries no information, because the outcome is predetermined. This relationship can be generalized as:

Info of outcome = log2 (1 / probability of outcome) = − log2 (probability of outcome)

Let’s look at another example. If there is a four outcome event, with equal probability of outcome1, outcome2, outcome3, or outcome4:

 Outcome 1: Probability = 0.25, encode as 00.

 Outcome 2: Probability = 0.25, encode as 01.

 Outcome 3: Probability = 0.25, encode as 10.

 Outcome 4: Probability = 0.25, encode as 11.

The entropy can be defined as the sum of the probabilities of each outcome multiplied by the information conveyed by that outcome.

Entropy = prob(outcome 1) × info(outcome 1) + prob(outcome 2) × info(outcome 2) + … + prob(outcome n) × info(outcome n)

For our simple example:

Entropy = 0.25 × log2(1 / 0.25) + 0.25 × log2(1 / 0.25) + 0.25 × log2(1 / 0.25) + 0.25 × log2(1 / 0.25) = 2 bits

This is intuitive – two bits is normally what would be used to convey one of four possible outcomes.

In general, the entropy is the highest when the outcomes are equally probable, and therefore totally random. When this is not the case, and the outcomes are not random, the entropy is lower, and it may be possible to take advantage of this and reduce the number of bits to represent the data sequence.

If the probabilities are not equal, for example:

 Outcome 1: Probability = 0.5, encode as 00.

 Outcome 2: Probability = 0.25, encode as 01.

 Outcome 3: Probability = 0.125, encode as 10.

 Outcome 4: Probability = 0.125, encode as 11.

Entropy = 0.5 × log2(1 / 0.5) + 0.25 × log2(1 / 0.25) + 0.125 × log2(1 / 0.125) + 0.125 × log2(1 / 0.125)  = 1.75 bits
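These entropy calculations are easy to verify numerically. Below is a minimal Python sketch (our own illustration, not part of the original text) that computes the entropy of a discrete probability distribution and reproduces the 2-bit and 1.75-bit results above.

import math

def entropy(probabilities):
    """Entropy in bits: sum of p * log2(1/p) over all outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

# Four equally likely outcomes: 2 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0

# Unequal probabilities: 1.75 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75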

11.2 Huffman Coding

Since the entropy is less than two bits, in theory we should be able to convey this information using fewer than two bits, on average. What if we encoded these outcomes differently, as below:

 Outcome 1: Probability = 0.5, encode as 0.

 Outcome 2: Probability = 0.25, encode as 10.

 Outcome 3: Probability = 0.125, encode as 110.

 Outcome 4: Probability = 0.125, encode as 111.

Half the time we would represent the data with one bit, one quarter of the time with two bits, and one quarter of the time with three bits.

Average number of bits = 0.5 × 1 + 0.25 × 2 + 0.125 × 3 + 0.125 × 3 = 1.75

The nice thing about this encoding is that the bits can be put together into a continuous bit stream, and unambiguously decoded.

010110111110100….

The bit stream above can only represent one possible outcome sequence.

outcome1, outcome2, outcome3, outcome4, outcome3, outcome2, outcome1
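This unambiguous decoding is easy to demonstrate. Here is a minimal sketch of our own, using the codeword table above:

# Codeword table from the example above (outcome -> bits)
codewords = {"outcome1": "0", "outcome2": "10", "outcome3": "110", "outcome4": "111"}
# Invert the table for decoding (bits -> outcome)
decode_table = {bits: name for name, bits in codewords.items()}

def decode(bitstream):
    """Decode the bit stream by accumulating bits until a codeword matches."""
    decoded, buffer = [], ""
    for bit in bitstream:
        buffer += bit
        if buffer in decode_table:                # no codeword is a prefix of another,
            decoded.append(decode_table[buffer])  # so the first match is the correct one
            buffer = ""
    return decoded

print(decode("010110111110100"))
# ['outcome1', 'outcome2', 'outcome3', 'outcome4', 'outcome3', 'outcome2', 'outcome1']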

An algorithm known as Huffman coding works in this manner, by assigning the shortest codewords (the bit pattern that each outcome is mapped to, or encoded as) to the outcomes of highest probability. For example, Huffman codes are used in JPEG-based image compression. This concept was used over 150 years ago in Morse code for telegraphs, where each letter of the alphabet is mapped to a series of short dots and long dashes.

Common letters:

 E ·

 I ··

 T -

Uncommon letters:

 X -··-

 Y -·--

 Z --··

In Morse code, the higher probability letters are encoded in a few dots, or a single dash. The lower probability letters use several dashes and dots. This minimized, on average, the amount of time required by the operator to send a telegraph message.
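Returning to Huffman coding, the code for the four-outcome example above can be constructed in a few lines of Python. This is a generic textbook-style sketch using the standard heapq module, not the specific code tables used in JPEG:

import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code by repeatedly merging the two least probable nodes."""
    tiebreak = count()  # unique integers avoid comparing dicts when probabilities tie
    heap = [(p, next(tiebreak), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)
        p2, _, codes2 = heapq.heappop(heap)
        # Prefix the codes of each branch with 0 or 1 and merge the two branches
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

probs = {"outcome1": 0.5, "outcome2": 0.25, "outcome3": 0.125, "outcome4": 0.125}
codes = huffman_code(probs)
print(codes)  # {'outcome1': '0', 'outcome2': '10', 'outcome3': '110', 'outcome4': '111'}
print(sum(probs[s] * len(c) for s, c in codes.items()))  # average length = 1.75 bits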

11.3 Markov Source

Further opportunities for optimization arise when the probability of each successive outcome or symbol is dependent upon previous outcomes. An obvious example is that if the nth letter is a “q”, you can be pretty sure the next letter will be “u”. Another example is that if the nth letter is a “t”, this raises the probability that the following letter will be an “h”. Data sources that have this kind of dependency relationship between successive symbols are known as Markov sources. This can lead to more sophisticated encoding schemes. Common groups of letters with high probabilities can be mapped to specific codewords. Examples are the letter pairs “st” and “tr”, or “the”.

This can occur when we map the probabilities of a given letter based on the few letters preceding. In essence, we are making predictions based on the probability of certain letter groups appearing in any given construct of the English language. Of course, different languages, even if using the same alphabet, will have a different set of multi-letter probabilities, and therefore different codeword mappings. An easy example in the United States is the sequence of letters: HAWAI_. Out of 26 possibilities, it is very likely the next letter is an “I”.

In a first order Markov source, a given symbol’s probability is dependent upon the previous symbol. In a second order Markov source, a given symbol’s probability is dependent upon the previous two symbols, and so forth. The average entropy of a symbol tends to decrease as the order of the Markov source increases. The complexity of the system also increases as the order of the Markov source increases.

A first order binary Markov source is described in Figure 11.1. The transition probabilities depend only on the current state.


Figure 11.1 Markov diagram.

In this case, the entropy can be found as the weighted sum of the conditional entropies corresponding to the transition probabilities of the state diagram. The probability of a zero given that the previous state is a one is written as P(0|1).

The probability of each state can be found by solving the probability equations:

P(0) = P(0) × P(0|0) + P(1) × P(0|1) = P(0) × 0.8 + P(1) × 0.4

A zero can be produced (that is, we arrive at the state = 0 circle) starting from either state zero or state one.

P(1) = P(1) × P(1|1) + P(0) × P(1|0) = P(1) × 0.6 + P(0) × 0.2

A one can be produced (that is, we arrive at the state = 1 circle) starting from either state zero or state one.

By definition P(0) + P(1) = 1

From this, we can solve for P(0) and P(1):

 P(0) = 2/3

 P(1) = 1/3

Now we can solve for the entropy associated with each state circle.

For state zero and state one:

Entropy0 = 0.8 × log2(1 / 0.8) + 0.2 × log2(1 / 0.2) = 0.258 + 0.464 = 0.722 bits

Entropy1 = 0.6 × log2(1 / 0.6) + 0.4 × log2(1 / 0.4) = 0.442 + 0.529 = 0.971 bits

The entropy of the Markov source or system is then given by:

P(0) × Entropy0 + P(1) × Entropy1 = 2/3 × 0.722 + 1/3 × 0.971 = 0.805 bits
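The state probabilities and the source entropy can be checked numerically. A small sketch, assuming the transition probabilities of Figure 11.1 as used in the equations above:

import math

# Transition probabilities from Figure 11.1 (as used in the equations above)
P00, P10 = 0.8, 0.2   # leaving state 0: P(0|0), P(1|0)
P11, P01 = 0.6, 0.4   # leaving state 1: P(1|1), P(0|1)

# Stationary probabilities from P(0) = P(0)*P(0|0) + P(1)*P(0|1) and P(0) + P(1) = 1
p1 = P10 / (P10 + P01)          # = 0.2 / 0.6 = 1/3
p0 = 1.0 - p1                   # = 2/3

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

h0 = entropy([P00, P10])        # ~0.722 bits (state 0)
h1 = entropy([P11, P01])        # ~0.971 bits (state 1)
print(p0, p1)                   # 0.666..., 0.333...
print(p0 * h0 + p1 * h1)        # ~0.805 bits per symbol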

11.4 Predictive Coding

Recognizing the sequence “HAWAII” is a form of predictive coding. The probabilities of any Markov source, such as language, can be mapped in multi-letter sequences, each with an associated probability. These in turn can be encoded using Huffman coding methods, to produce a more efficient representation (fewer bits) of the outcome sequence.

This same idea can be applied to images. As we have previously seen, video images are built line by line, from top to bottom, and pixel by pixel, from left to right. Therefore, for a given pixel, the pixels above and to the left are available to help predict the next pixel. For our purposes, let us assume an RGB video frame, with eight bits per color for each pixel. Each color can have a value from 0 to 255.

The unknown pixel value, for each color, can be predicted from the pixels immediately to the left, immediately above, and diagonally above. Three simple predictors are given in Figure 11.2.


Figure 11.2 Three examples of video predictors.

The entire frame could be iteratively predicted from just three initial pixels, but this is not likely to be a very useful prediction. The usefulness of these predictors becomes apparent when used in conjunction with differential encoding.
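As a concrete illustration, here is a minimal sketch of three simple predictors of this kind (these are common textbook choices; the exact predictors shown in Figure 11.2 may differ):

def predict_left(left, above, above_left):
    """Predict the pixel as equal to its left neighbor."""
    return left

def predict_above(left, above, above_left):
    """Predict the pixel as equal to the pixel above it."""
    return above

def predict_planar(left, above, above_left):
    """Planar-style predictor: left + above - diagonal, common in lossless image coding."""
    return left + above - above_left

# e.g. predict_planar(left=120, above=122, above_left=119) -> 123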

11.5 Differential Encoding

Suppose we take the actual pixel value and subtract the predicted pixel value. This is differential encoding, and this difference is used to represent the pixel value for that location. For example, to store a video frame we would perform the differential encoding process, and store the results.

To restore the original video data for display, we simply compute the predicted pixel value (using the previously restored pixel data) and then add the stored differentially encoded value. This recovers the original pixel data. (The three pixels in the upper left corner of the frame are stored in their original representation, and are used to create the initial predicted pixel data during the restoration process.)
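Here is a sketch of this round trip on a single row of pixels, using a simple left-neighbor predictor for clarity (our own illustration; a real encoder would use two-dimensional predictors such as those in Figure 11.2):

def diff_encode(row):
    """Store the first pixel as-is, then the difference from the predicted (left) pixel."""
    encoded = [row[0]]
    for i in range(1, len(row)):
        predicted = row[i - 1]            # left-neighbor predictor
        encoded.append(row[i] - predicted)
    return encoded

def diff_decode(encoded):
    """Rebuild each pixel by adding the stored difference to the prediction."""
    row = [encoded[0]]
    for i in range(1, len(encoded)):
        predicted = row[i - 1]
        row.append(predicted + encoded[i])
    return row

row = [118, 120, 121, 121, 125, 130, 131]
diffs = diff_encode(row)          # [118, 2, 1, 0, 4, 5, 1]: small values clustered near zero
assert diff_decode(diffs) == row  # the round trip is exact (lossless)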

All of this sounds unnecessarily complicated, so what is the purpose here? The differential encoder outputs are most likely to be much smaller values than the original pixel data, due to the correlation between nearby pixels. Since statistically the differentially encoded data is just representing the errors of the predictor, this signal is likely to have the great majority of values concentrated around zero, or to have a very compact histogram. In contrast, the original pixel values are likely to span the entire color value space, and therefore are a high entropy data source with an equal or uniform probability distribution.

Differential encoding sends less information than if the video frame were simply sent pixel by pixel across the whole frame. The entropy of the data stream has been reduced, because the correlation between adjacent pixels has been largely eliminated. Without differential encoding, we have no prior information about the video frames to be processed, so the pixel values are assumed to be equally distributed: each of the 256 possible values has a probability of 1/256, giving an entropy of eight bits per pixel. The possible outcomes of the differential encoder tend to be much more probable for small values (due to the correlation with nearby pixels) and much less likely for larger values. As we saw in the simple example above, when the probabilities are not evenly distributed, the entropy is lower. Lower entropy means less information. Therefore, on average, significantly fewer bits are required to represent the image, and the only cost is the increased complexity of the predictor and differential encoding computations.

Let’s assume that averaged over a complete video frame, the probabilities work out thus:

Pixel color value: probability = 1/256 for values in the range 0 to 255

And that after differential encoding, the probability distribution comes out as:

Differential color value: probability = 1/16 for a value of 0
Differential color value: probability = 1/25 for values in the ranges −8 to −1 and 1 to 8
Differential color value: probability = 1/400 for values in the ranges −32 to −9 and 9 to 32
Differential color value: probability = 1/5575 for values in the ranges −255 to −33 and 33 to 255

We earlier defined entropy as:

Entropy = prob(outcome 1) × info(outcome 1) + prob(outcome 2) × info(outcome 2) + … + prob(outcome n) × info(outcome n)

with info of outcome = log2 (1 / probability of that outcome)

In the first case, we have 256 outcomes, each of probability 1/256:

Entropy = 256 × (1 / 256) × log2 (1 / (1 / 256) ) = 8 bits

In the second case, we have 511 possible outcomes, with one of four probabilities:

Entropy = 1 × (1 / 16) × log2 (1 / (1 / 16)) + 16 × (1 / 25) × log2 (1 / (1 / 25)) + 48 × (1 / 400) × log2 (1 / (1 / 400)) + 446 × (1 / 5575) × log2 (1 / (1 / 5575)) = 0.250 + 2.972 + 1.037 + 0.996 = 5.255 bits
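Both figures can be reproduced with the same entropy helper as before (a quick sketch mirroring the arithmetic above):

import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Case 1: 256 equally likely pixel values
print(entropy([1 / 256] * 256))                                        # 8.0 bits

# Case 2: the contrived distribution of differential values (511 outcomes)
diff_probs = [1 / 16] + [1 / 25] * 16 + [1 / 400] * 48 + [1 / 5575] * 446
print(entropy(diff_probs))                                             # ~5.255 bits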

11.6 Lossless Compression

The entropy has been significantly reduced through the use of the differential coder. While this step-function probability distribution is obviously contrived, typical entropy values for various differential encoders across actual video frame data tend to be in the range of four to five and a half bits. Huffman coding, or a similar mapping of the values to bit codewords, can reduce the number of bits used to transmit each color-plane pixel to about five bits, compared to the original eight bits. This has been achieved without any loss of video information, meaning the reconstructed video is identical to the original. This is known as lossless compression, as there is no loss of information in the compression (encoding) and decompression (decoding) process.

Much higher degrees of compression are possible if we are willing to accept some loss of video information, which can result in video quality degradation. This is known as lossy compression. Note that with lossy compression, each time the video is compressed and decompressed, some information is lost, and there will be a resultant impact on video quality. The trick is to achieve high levels of compression without noticeable video degradation.

11.7 Quantization

Many of the techniques used in compression, such as the DCT, Huffman coding and predictive coding, are fully reversible with no loss in video quality. Quantization is often the principal area of compression where information is irretrievably lost, and consequently decoded video will suffer quality degradation. With care, this degradation can be almost imperceptible to the viewer.

Quantization occurs when a signal with many or an infinite number of possible values must be mapped into a finite set of values. In digital signal processing, signals are represented as binary numbers, with 2^n possible values mapping into an n-bit representation.

For example, suppose we want to represent the range −1 to +1 (well, almost +1) using an 8-bit fixed-point number. With 2^8 or 256 possible values to map across this range, the step size is 1/128, which is 0.0078125. Let’s say the signal has an actual value of 0.5030. How closely can this value be represented? What if the signal is 1/10 the level of the first sample, or 0.0503? And again, consider a signal at 1/10 the level of the second sample, or 0.00503. Below is a table showing the closest representations just above and below each of these signal levels, and the resultant error between the actual signal value and its 8-bit representation.

The actual error level remains more or less in the same range over the different signal levels. This error will fluctuate, depending upon the exact signal value, but with our 8-bit signed example it will always be less than 1/128, or 0.0078125. This fluctuating error signal appears as a form of noise (called quantization noise), or unwanted signal, in the digital video processing. It can be modeled as an injection of noise when simulated in an algorithm with unlimited numerical precision.

Table 11.1 Nearest 8-bit representations and resulting errors for the example signal values.
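The values that Table 11.1 summarizes can be computed directly. Here is a small sketch (our own, assuming the signed fixed-point format described above, with a range of −1 to just under +1 and a step of 1/128 for 8 bits):

def quantization_error(value, bits=8):
    """Show the nearest representable values and errors for a signed fixed-point format."""
    step = 2.0 / (2 ** bits)            # 1/128 for 8 bits, 1/2048 for 12 bits
    below = (value // step) * step      # nearest representable value at or below
    above = below + step                # nearest representable value above
    for q in (below, above):
        error = value - q
        print(f"{bits}-bit: value {value:.5f} -> {q:.7f}, "
              f"error {error:+.7f} ({100 * abs(error) / value:.1f}%)")

for v in (0.5030, 0.0503, 0.00503):
    quantization_error(v, bits=8)

For 0.5030 the error is well under 1% of the signal level, while for 0.00503 it can be as large as the signal itself.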

When the signal level is fairly large for the allowable range, (0.503 is close to one half the maximum value), the percentage error is small – less than 1%. As the signal level gets smaller, the error percentage gets larger, as the table indicates.

Quantization noise is always present and is, on average, the same level (any noise-like signal will rise and fall randomly, so we usually concern ourselves with the average level). But as the input signal decreases in level, the quantization noise becomes relatively more significant. Eventually, for very small input signal levels, the quantization noise can become so significant that it degrades the quality of whatever signal processing is to be performed. Think of it as like static on a car radio: as you get further from the radio station, the radio signal gets weaker, and eventually the static noise makes it difficult or unpleasant to listen to, even if you increase the volume.

So what can we do if our signal is sometimes strong (0.503 for example), and sometimes weak (0.00503 for example)? Another way of describing this is to say that the signal has a large dynamic range. The dynamic range describes the ratio between the largest and smallest value of the signal, in this case 100.

Suppose we exchange our 8-bit representation for a 12-bit representation. Then our maximum range is still from −1 to +1, but our step size is now 1/2048, which is 0.000488. Let’s make a 12-bit table similar to the 8-bit example.

This is a significant difference. The actual error is always less than our step size, 1 / 2048. But the error as a percentage of signal level is dramatically improved and this is our concern in signal processing. Because of the much smaller step size of the 12-bit representation, the quantization noise is much less, allowing even small signals to be represented with acceptable precision. Another way of describing this is to introduce the concept of signal to noise power ratio, or SNR. This describes the power of the largest signal compared to the background noise. This can be very easily seen on a frequency domain or spectral plot of a signal. There can be many sources of noise, but, for now, we are only considering the quantization noise introduced by the digital representation of the video pixel values.

Table 11.2 Nearest 12-bit representations and resulting errors for the example signal values.
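The same hypothetical helper from the 8-bit sketch can be rerun with 12 bits to produce the comparison that Table 11.2 summarizes:

# Reusing the hypothetical quantization_error() helper from the 8-bit sketch above
for v in (0.5030, 0.0503, 0.00503):
    quantization_error(v, bits=12)   # step = 1/2048, so each error shrinks by a factor of 16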

What we have just described is the uniform quantizer, where all the step sizes are equal. However, there are alternate quantizing mappings, where the step size varies across the signal amplitude. For example, a quantizer could be designed to give the same SNR across the signal range, requiring a small step size when the signal amplitude is small, and a larger step size as the signal increases in value. The idea is to provide a near constant quantization error as a percentage of the signal value. This type of quantizing is performed on the voice signals in the US telephone system, known as μ-law encoding.
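For reference, here is a sketch of the μ-law companding curve used in that telephone application (the standard μ = 255 formula, shown only to illustrate non-uniform step sizes, not as part of any video pipeline):

import math

MU = 255.0  # standard mu-law parameter for North American telephony

def mu_law_compress(x):
    """Map x in [-1, 1] through the mu-law curve; small values are expanded,
    so a subsequent uniform quantizer effectively has finer steps near zero."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse mapping back to the linear domain."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

for x in (0.01, 0.1, 0.5):
    y = mu_law_compress(x)
    print(f"{x:5.2f} -> {y:.3f} -> {mu_law_expand(y):.4f}")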

Another possible quantization scheme could be to use a small step size for regions of the signal where there is a high probability of signal amplitude occurring, and a larger step for regions where signal amplitude has less likelihood of occurring.

Uniform quantizing is by far the most commonly used scheme in video signal processing, because we are often not simply encoding amplitude, representing small or large signals. In many cases, we are encoding color information, where the values map to various color intensities rather than to signal amplitude.

If we try to use the likelihood of different values to optimize the quantizer, we have to assume that the probability distribution of the video data is known. Alternatively, the uniform quantizer can be followed by some type of Huffman coding or differential encoding, which will optimize the signal representation for the minimum average number of bits.

Vector quantization is also commonly used. In the preceding discussion, each value or signal is quantized individually with a given bit representation. With vector quantization, a collection of related signals is quantized together to a single representation using a given number of bits.

For example, if we assume the pixel is represented in RGB format, with 8 bits used for each color, the color red can be represented in 2^8 or 256 different intensities.

Each pixel uses a total of 24 bits, for a total of 2^24, or about 16 million, possible values. Intuitively, this seems excessive, as it is unlikely that the human eye can distinguish between that many colors. So a vector quantizer might map the 16 million possible values into a color table of 256 total colors, allowing 256 selected combinations of red, green and blue. This mapping results in just eight bits to represent each pixel.

This seems reasonable, but the complexity lies in mapping 16 million possible inputs to the allowed 256 representations, or color codewords. If done using a look-up table, memory of 16 million bytes would be required for this quantization, with each memory location containing one of the 256 color codewords. This is excessive, so some sort of mapping algorithm or computation is required to map the 16 million possible color combinations to the closest color codeword. It is hopefully becoming clear that most methods to compress video, or other data for that matter, come at the expense of increased complexity and increased computational rates.
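Here is a sketch of that mapping step (a hypothetical, hard-coded 256-entry palette is used as a stand-in; real vector quantizers typically design the codebook from training data, for example with a k-means style algorithm):

def nearest_codeword(pixel, palette):
    """Map one RGB pixel (tuple of three 0-255 values) to the index of the
    closest palette entry, using squared Euclidean distance in RGB space."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(palette)), key=lambda i: dist2(pixel, palette[i]))

# Hypothetical 256-color palette: a coarse 8x8x4 RGB grid as a stand-in codebook
palette = [(r * 36, g * 36, b * 85) for r in range(8) for g in range(8) for b in range(4)]

index = nearest_codeword((200, 30, 90), palette)   # 8-bit codeword (0..255)
print(index, palette[index])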

11.8 Decibels

Signal to Noise Power Ratio (SNR) is usually expressed in decibels (denoted dB), using a logarithmic scale. The SNR of a digitally represented signal can be determined by the following equation:

SNR due to quantization (dB) = 6.02 × (number of bits) + 1.76

Each additional bit of the signal representation gains about 6 dB of SNR. Using this formula, an 8-bit representation is capable of an SNR of about 50 dB, a 12-bit representation about 74 dB, and a 16-bit representation up to about 98 dB (often quoted simply as 48, 72 and 96 dB using the 6 dB per bit rule of thumb). This only accounts for the effect of quantization noise: in practice there are other effects which can also degrade the SNR of a system.
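This relationship is easy to check numerically (a small sketch, plain arithmetic):

def quantization_snr_db(bits):
    """Ideal SNR limit due only to quantization noise, for a full-scale signal."""
    return 6.02 * bits + 1.76

for bits in (8, 12, 16):
    print(bits, "bits ->", round(quantization_snr_db(bits), 1), "dB")
# 8 bits -> 49.9 dB, 12 bits -> 74.0 dB, 16 bits -> 98.1 dB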

There is one last important point on decibels. This is a very commonly used term in many areas of digital signal processing. A decibel is simply a ratio, commonly of amplitude, power or voltage, similar to a percentage. But because of the extremely high ratios commonly encountered (a billion is not uncommon), it is convenient to express the ratio logarithmically. The logarithmic expression also allows chains of circuits, or signal processing operations, each with its own ratio (say of output power to input power), to simply be added up to find the final ratio.

A common area of confusion is differentiating between signal levels or amplitude (voltage if an analog circuit) and signal power. Power measurements are virtual in the digital world, but can be directly measured in analog circuits with which video systems interface, such as the RF amplifiers and analog signal levels for circuits in head-end cable systems, or video satellite transmission.

There are two definitions of dB commonly used:

dB (voltage) = dB (digital value) = 20 × log (voltage signal 1 / voltage signal 2)

dB (power) = 10 × log (power signal 1 / power signal 2)

The designation of “signal 1” and “signal 2” depends on the situation. For example, with an RF power amplifier (analog), the dB of gain will be the 10 log (output power / input power). For digital applications, the dB of SNR will be the 20 log (maximum input signal / quantization noise signal level).

dB can refer to many different ratios in video system designs and it is easy to get confused about whether to use a multiplicative factor of 10 or 20, if the reasoning behind these numbers is unclear.

Power is proportional to the square of voltage. If a given voltage is doubled in a circuit, four times as much power is dissipated. This goes back to a basic Ohm’s law equation:

Power = Voltage^2 / Resistance

In many analog circuits, signal power is used because that is what the lab instruments work with, and while different systems may use different resistance levels, power is universal (however, 50 ohm is the most common standard in most analog systems).

When voltage is squared, this needs to be taken into account in the computation of the logarithmic decibel relation. Remember, log(x^y) = y × log(x). Hence the exponent of 2 on the voltage ratio becomes a multiplying factor, changing the “10” to a “20”.

In the digital world, the concepts of resistance and power do not exist. A given signal has a specific amplitude, expressed in a digital numerical format (such as signed fractional or integer, for example).

It is important to understand how dB increases under the two measurement methods. Let’s look at a doubling of the amplitude ratio and a doubling of the power ratio:

6.02 dB (voltage) = 6.02 dB (digital value) = 20 × log (2 / 1)

3.01 dB (power) = 10 × log (2 / 1)

This is why shifting a digital signal left one bit (multiplying by 2) will cause a 6 dB signal power increase, and why so often the term 6 dB/bit is used in conjunction with ADCs, DACs or digital systems in general.

By the same reasoning, doubling in power to an RF engineer means a 3 dB increase. This will impact upon the entire system: coding gain, as used with error correcting code methods, is based upon power. All signals at antenna interfaces are defined in terms of power, and the decibels used will be power ratios.

In both systems, ratios of equal power or voltage are 0 dB. For example, a unity gain amplifier has a gain of 0 dB.

0 dB (power) = 10 × log (1 / 1)

A loss is expressed as a negative dB. For example, a circuit whose output is equal to ½ the input power:

−3.01 dB (power) = 10 × log (1 / 2)
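The two ratio definitions can be captured in a couple of helper functions (a small sketch reproducing the numbers above):

import math

def db_amplitude(ratio):
    """Decibels for voltage / amplitude / digital-value ratios (20 log10)."""
    return 20.0 * math.log10(ratio)

def db_power(ratio):
    """Decibels for power ratios (10 log10)."""
    return 10.0 * math.log10(ratio)

print(db_amplitude(2))    #  6.02 dB: doubling an amplitude
print(db_power(2))        #  3.01 dB: doubling a power
print(db_power(1))        #  0.00 dB: unity gain
print(db_power(0.5))      # -3.01 dB: halving the power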
