3

Processing for compression

Virtually all compression systems rely on some combination of the basic processes outlined in this chapter. In order to understand compression a good grasp of filtering and transforms is essential along with motion estimation for video applications. These processes only express the information in the best way for the actual compression stage, which almost exclusively begins by using requantizing to shorten the wordlength. In this chapter the principles of filters and transforms will be explored, along with motion estimation and requantizing. These principles will be useful background for the subsequent chapters.

3.1 Introduction

In compression systems it is very important to remember that the valuable commodity is information. For practical reasons the way information is represented may have to be changed a number of times during its journey. If accuracy or realism is a goal these changes must be well engineered. In the course of this book we shall explore changes between the continuous domain of real sounds and images, the discrete domain of sampled data and the discrete spatial and temporal frequency domains. A proper understanding of these processes requires familiarity with the principles of filters and transforms.

Figure 3.1 shows an optical system of finite resolution. If an object containing an infinitely sharp line is presented to this system, the image will be an intensity function known in optics as a point spread function. Such functions are almost invariably symmetrical in optics. There is no movement or change here, the phenomenon is purely spatial. A point spread function is a spatial impulse response. All images passing through the optical system are convolved with it.

images

Figure 3.1 In optical systems an infinitely sharp line is reproduced as a point spread function (a) which is the impulse response of the optical path. Scanning either object or image produces an analog time-variant waveform (b). The scanned object waveform can be converted to the scanned image waveform with an electrical filter having an impulse response which is an analog of the point spread function. (c) The object and image may also be sampled or the object samples can be converted to the image samples by a filter with an analogous discrete impulse response.

Figure 3.1(b) shows that the object may be scanned by an analog system to produce a waveform. The image may also be scanned in this way. These waveforms are now temporal. However, the second waveform may be obtained in another way, using an analog filter in series with the first scanned waveform which has an equivalent impulse response. This filter must have linear phase, i.e. its impulse response must be symmetrical.

Figure 3.1(c) shows that the object may also be sampled in which case all samples but one will have a value of zero. The image may also be sampled, and owing to the point spread function, there will now be a number of non-zero sample values. However, the image samples may also be obtained by passing the input sample into a digital filter having the appropriate impulse response. Note that it is possible to obtain the same result as (c) by passing the scanned waveform of (b) into an ADC and storing the samples in a memory.

Clearly there are a number of equivalent routes leading to the same result. One consequence is that optical systems and sampled systems can simulate one another. This gives us considerable freedom to perform processing in the most advantageous domain which gives the required result. There are many parallels between analog, digital and optical filters, which this chapter treats as a common subject.

It should be clear from Figure 3.1 why video signal paths need to have linear phase. In general, analog circuitry and filters tend not to have linear phase because they must be causal which means that the output can only occur after the input. Figure 3.2(a) shows a simple RC network and its impulse response. This is the familiar exponential decay due to the capacitor discharging through the resistor (in series with the source impedance which is assumed here to be negligible). The figure also shows the response to a squarewave at (b). With other waveforms the process is inevitably more complex.

images

Figure 3.2 (a) The impulse response of a simple RC network is an exponential decay. This can be used to calculate the response to a square wave, as in (b).

Filtering is unavoidable. Sometimes a process has a filtering effect which is undesirable, for example the limited frequency response of an audio amplifier or loss of resolution in a lens, and we try to minimize it. On other occasions a filtering effect is specifically required. Analog or digital filters, and sometimes both, are required in ADCs, DACs, in the data channels of digital recorders and transmission systems and in DSP. Optical filters may also be necessary in imaging systems to convert between sampled and continuous images. Optical systems used in displays and in laser recorders also act as spatial filters.1

Figure 3.3 shows that impulse response testing tells a great deal about a filter. In a perfect filter, all frequencies should experience the same time delay. If some groups of frequencies experience a different delay from others, there is a group-delay error. As an impulse contains an infinite spectrum, a filter suffering from group-delay error will separate the different frequencies of an impulse along the time axis.

A pure delay will cause a phase shift proportional to frequency, and a filter with this characteristic is said to be phase-linear. The impulse response of a phase-linear filter is symmetrical. If a filter suffers from group-delay error it cannot be phase-linear. It is almost impossible to make a perfectly phase-linear analog filter, and many filters have a group-delay equalization stage following them which is often as complex as the filter itself. In the digital domain it is straightforward to make a phase-linear filter, and phase equalization becomes unnecessary.

images

Figure 3.3 If a filter is not phase-linear, different frequencies will emerge at different times if an impulse is input. This is undesirable in video circuitry.

Because of the sampled nature of the signal, whatever the response at low frequencies may be, all PCM channels (and sampled analog channels) act as low-pass filters because they cannot contain frequencies above the Nyquist limit of half the sampling frequency.

3.2 Transforms

Transforms are a useful subject because they can help to understand processes which cause undesirable filtering or to design filters. The information itself may be subject to a transform. Transforming converts the information into another analog. The information is still there, but expressed with respect to temporal or spatial frequency rather than time or space. Instead of binary numbers representing the magnitude of samples, there are binary numbers representing the magnitude of frequency coefficients. The close relationship of transforms to compression technology makes any description somewhat circular as Figure 3.4 shows. The solution adopted in this chapter is to introduce a number of filtering-related topics, and to interpret these using transforms whenever a useful point can be illustrated.

Transforms are only a different representation of the same information. As a result what happens in the frequency domain must always be consistent with what happens in the time or space domains. A filter may modify the frequency response of a system, and/or the phase response, but every combination of frequency and phase response has a corresponding impulse response in the time domain.

images

Figure 3.4 Transforms are extensively found in convergent systems. They may be used to explain the operation of a process, or a process may actually create a transform. Here the relationship between transforms and DVB is shown.

images

Figure 3.5 If a signal having a given spectrum is passed into a filter, multiplying the two spectra will give the output spectrum at (a). Equally transforming the filter frequency response will yield the impulse response of the filter. If this is convolved with the time domain waveform, the result will be the output waveform, whose transform is the output spectrum (b).

Figure 3.5 shows the relationship between the domains. On the left is the frequency domain. Here an input signal having a given spectrum is input to a filter having a given frequency response. The output spectrum will be the product of the two functions. If the functions are expressed logarithmically in deciBels, the product can be obtained by simple addition.

On the right, the time-domain output waveform represents the convolution of the impulse response with the input waveform. However, if the frequency transform of the output waveform is taken, it must be the same as the result obtained from the frequency response and the input spectrum. This is a useful result because it means that when image or audio sampling is considered, it will be possible to explain the process in both domains. In fact if this is not possible the phenomenon is incompletely understood.

3.3 Convolution

When a waveform is input to a system, the output waveform will be the convolution of the input waveform and the impulse response of the system. Convolution can be followed by reference to a graphic example in Figure 3.6. Where the impulse response is asymmetrical, the decaying tail occurs after the input. As a result it is necessary to reverse the impulse response in time so that it is mirrored prior to sweeping it through the input waveform. The output voltage is proportional to the shaded area shown where the two impulses overlap. If the impulse response is symmetrical, as would be the case with a linear phase filter, or in an optical system, the mirroring process is superfluous.

The same process can be performed in the sampled or discrete time domain as shown in Figure 3.7. The impulse and the input are now a set of discrete samples which clearly must have the same sample spacing. The impulse response only has value where impulses coincide. Elsewhere it is zero. The impulse response is therefore stepped through the input one sample period at a time. At each step, the area is still proportional to the output, but as the time steps are of uniform width, the area is proportional to the impulse height and so the output is obtained by adding up the lengths of overlap. In mathematical terms, the output samples represent the convolution of the input and the impulse response by summing the coincident cross-products.

In digital-to-analog convertors, the impulse response of the reconstruction filter is convolved with the sample pulses to create a continuous output waveform. In a wavelet decoder the output is reconstructed by convolving the wavelet function with a pulse whose amplitude is determined by a coefficient.

3.4 FIR and IIR filters

Filters can be described in two main classes, as shown in Figure 3.8, according to the nature of the impulse response. Finite-impulse response (FIR) filters are always stable and, as their name suggests, respond to an impulse once, as they have only a forward path. In the temporal domain, the time for which the filter responds to an input is finite, fixed and readily established. The same is therefore true about the distance over which an FIR filter responds in the spatial domain. FIR filters can be made perfectly phase linear if a significant processing delay is accepted. Most filters used for image processing, sampling rate conversion and oversampling fall into this category.

images

Figure 3.6 In the convolution of two continuous signals (the impulse response with the input), the impulse must be time reversed or mirrored. This is necessary because the impulse will be moved from left to right, and mirroring gives the impulse the correct timedomain response when it is moved past a fixed point. As the impulse response slides continuously through the input waveform, the area where the two overlap determines the instantaneous output amplitude. This is shown for five different times by the crosses on the output waveform.

images

Figure 3.7 In discrete time convolution, the mirrored impulse response is stepped through the input one sample period at a time. At each step, the sum of the cross-products is used to form an output value. As the input in this example is a constant height pulse, the output is simply proportional to the sum of the coincident impulse response samples. This figure should be compared with Figure 3.6.

images

Figure 3.8 An FIR filter (a) responds only once to an input, whereas the output of an IIR filter (b) continues indefinitely rather like a decaying echo.

Infinite-impulse response (IIR) filters respond to an impulse indefinitely and are not necessarily stable, as they have a return path from the output to the input. For this reason they are also called recursive filters. As the impulse response in not symmetrical, IIR filters are not phase linear. Audio equalizers often employ recursive filters.

3.5 FIR filters

An FIR filter performs convolution of the input waveform with its own impulse response. It does this by graphically constructing the impulse response for every input sample and superimposing all these responses. It is first necessary to establish the correct impulse response. Figure 3.9(a) shows an example of a low-pass filter which cuts off at images of the sampling rate. The impulse response of an ideal low-pass filter is a sin x/x curve, where the time between the two central zero crossings is the reciprocal of the cut-off frequency. According to the mathematics, the waveform has always existed, and carries on for ever.

The peak value of the output coincides with the input impulse. This means that the filter cannot be causal, because the output has changed before the input is known. Thus in all practical applications it is necessary to truncate the extreme ends of the impulse response, which causes an aperture effect, and to introduce a time delay in the filter equal to half the duration of the truncated impulse in order to make the filter causal. As an input impulse is shifted through the series of registers in Figure 3.9(b), the impulse response is created, because at each point it is multiplied by a coefficient as in Figure 3.9(c).

images

Figure 3.9 (a) The impulse response of an LPF is a sin x/x curve which stretches from −∞ to +∞ in time. The ends of the response must be neglected, and a delay introduced to make the filter causal. (b) The structure of an FIR LPF. Input samples shift across the register and at each point are multiplied by different coefficients. (c) When a single unit sample shifts across the circuit of Figure 3.9(b), the impulse response is created at the output as the impulse is multiplied by each coefficient in turn.

These coefficients are simply the result of sampling and quantizing the desired impulse response. Clearly the sampling rate used to sample the impulse must be the same as the sampling rate for which the filter is being designed. In practice the coefficients are calculated, rather than attempting to sample an actual impulse response. The coefficient wordlength will be a compromise between cost and performance. Because the input sample shifts across the system registers to create the shape of the impulse response, the configuration is also known as a transversal filter. In operation with real sample streams, there will be several consecutive sample values in the filter registers at any time in order to convolve the input with the impulse response.

Simply truncating the impulse response causes an abrupt transition from input samples which matter and those which do not. Truncating the filter superimposes a rectangular shape on the time-domain impulse response. In the frequency domain the rectangular shape transforms to a sin x/x characteristic which is superimposed on the desired frequency response as a ripple. One consequence of this is known as Gibb’s phenomenon; a tendency for the response to peak just before the cut-off frequency.2,3 As a result, the length of the impulse which must be considered will depend not only on the frequency response but also on the amount of ripple which can be tolerated. If the relevant period of the impulse is measured in sample periods, the result will be the number of points or multiplications needed in the filter. Figure 3.10 compares the performance of filters with different numbers of points. A high-quality digital audio FIR filter may need in excess of 100 points.

Rather than simply truncate the impulse response in time, it is better to make a smooth transition from samples which do not count to those that do. This can be done by multiplying the coefficients in the filter by a window function which peaks in the centre of the impulse. Figure 3.11 shows some different window functions and their responses. The rectangular window is the case of truncation, and the response is shown at I. A linear reduction in weight from the centre of the window to the edges characterizes the Bartlett window II, which trades ripple for an increase in transition-region width. At III is shown the Hann window, which is essentially a raised cosine shape. Not shown is the similar Hamming window, which offers a slightly different trade-off between ripple and the width of the main lobe. The Blackman window introduces an extra cosine term into the Hamming window at half the period of the main cosine period, reducing Gibb’s phenomenon and ripple level, but increasing the width of the transition region. The Kaiser window is a family of windows based on the Bessel function, allowing various tradeoffs between ripple ratio and main lobe width. Two of these are shown in IV and V.

images

Figure 3.10 The truncation of the impulse in an FIR filter caused by the use of a finite number of points (N) results in ripple in the response. Shown here are three different numbers of points for the same impulse response. The filter is an LPF which rolls off at 0.4 of the fundamental interval. (Courtesy Philips Technical Review)

images

Figure 3.11 The effect of window functions. At top, various window functions are shown in continuous form. Once the number of samples in the window is established, the continuous functions shown here are sampled at the appropriate spacing to obtain window coefficients. These are multiplied by the truncated impulse response coefficients to obtain the actual coefficients used by the filter. The amplitude responses I–V correspond to the window functions illustrated. (Responses courtesy Philips Technical Review)

images

Figure 3.12 The Dolph window shape is shown at (a). The frequency response is at (b). Note the constant height of the response ripples.

The Dolph window4 shown in Figure 3.12 results in an equiripple filter which has the advantage that the attenuation in the stopband never falls below a certain level.

Filter coefficients can be optimized by computer simulation. One of the best-known techniques used is the Remez exchange algorithm, which converges on the optimum coefficients after a number of iterations.

In the example of Figure 3.13, a low-pass FIR filter is shown which is intended to allow downsampling by a factor of two. The key feature is that the stopband must have begun before one half of the output sampling rate. This is most readily achieved using a Hamming window because it was designed empirically to have a flat stopband so that good aliasing attenuation is possible. The width of the transition band determines the number of significant sample periods embraced by the impulse. The Hamming window doubles the width of the transition band. This determines in turn both the number of points in the filter, and the filter delay. For the purposes of illustration, the number of points is much smaller than would normally be the case in an audio application.

images

Figure 3.13 A downsampling filter using the Hamming window

images

Figure 3.14 Frequency response of a 49-point transversal filter with infinite precision (solid line) shows ripple due to finite window size. Quantizing coefficients to twelve bits reduces attenuation in the stopband. (Responses courtesy Philips Technical Review)

As the impulse is symmetrical, the delay will be half the impulse period. The impulse response is a sin x/x function, and this has been calculated in the figure. The equation for the Hamming window function is shown with the window values which result. The sin x/x response is next multiplied by the Hamming window function to give the windowed impulse response shown. If the coefficients are not quantized finely enough, it will be as if they had been calculated inaccurately, and the performance of the filter will be less than expected. Figure 3.14 shows an example of quantizing coefficients. Conversely, raising the wordlength of the coefficients increases cost.

The FIR structure is inherently phase linear because it is easy to make the impulse response absolutely symmetrical. The individual samples in a digital system do not know in isolation what frequency they represent, and they can only pass through the filter at a rate determined by the clock. Because of this inherent phase-linearity, an FIR filter can be designed for a specific impulse response, and the frequency response will follow.

The frequency response of the filter can be changed at will by changing the coefficients. A programmable filter only requires a series of PROMs to supply the coefficients; the address supplied to the PROMs will select the response. The frequency response of a digital filter will also change if the clock rate is changed, so it is often less ambiguous to specify a frequency of interest in a digital filter in terms of a fraction of the fundamental interval rather than in absolute terms. The configuration shown in Figure 3.9 serves to illustrate the principle. The units used on the diagrams are sample periods and the response is proportional to these periods or spacings, and so it is not necessary to use actual figures.

Where the impulse response is symmetrical, it is often possible to reduce the number of multiplications, because the same product can be used twice, at equal distances before and after the centre of the window. This is known as folding the filter. A folded filter is shown in Figure 3.15.

images

Figure 3.15 A seven-point folded filter for a symmetrical impulse response. In this case K1 and K7 will be identical, and the input sample can be multipled once, and the product fed into the output shift system in two different places. The centre coefficient K4 appears once. In an even-numbered filter the centre coefficient would also be used twice.

3.6 Interpolation

Interpolation is an important enabling technology on which a large number of practical digital video devices are based. Some of the applications of interpolation are set out below:

1   Digital video standards often differ from the sampling standard used in MPEG, particularly with respect to the disposition of colour samples. Interpolation is needed to format video into the standard MPEG expects and to return the signal to its original format after decoding. In low-bit rate applications such as Internet video, the picture resolution may be reduced by downsampling. For display the number of pixels may have to be increased again.
2   In digital audio, different sampling rates exist today for different purposes. Rate conversion allows material to be exchanged freely between such formats. Low-bit rate compressed channels may require a reduced sampling rate as an input format.
3   In scaleable compression, a number of different pixel array sizes can be decoded, depending on how much of the compressed bitstream is decoded. Decoding at high resolution involves upsampling the lowerresolution image before adding detail to it.
4   Changing between sampling formats may be necessary in order to view an incoming image on an available display. This technique is generally known as resizing and is essentially a two-dimensional sampling rate conversion, the rate in this case being the spatial frequency of the pixels.
5   In MPEG-4 a compression tool known as warping is introduced. This allows the changes in perspective as objects turn to be coded as a few vectors instead of a large picture change. The warping process is a twodimensional interpolation.

There are three basic but related categories of interpolation, as shown in Figure 3.16. The most straightforward (a) changes the sampling rate by an integer ratio, up or down. The timing of the system is thus simplified because all samples (input and output) are present on edges of the higher-rate sampling clock. Such a system is generally adopted for oversampling convertors; the exact sampling rate immediately adjacent to the analog domain is not critical, and will be chosen to make the filters easier to implement.

Next in order of difficulty is the category shown at (b) where the rate is changed by the ratio of two small integers. Samples in the input periodically time-align with the output. Such devices can be used for converting between the various rates of ITU-601.

images

Figure 3.16 Categories of rate conversion. (a) Integer-ratio conversion, where the lowerrate samples are always coincident with those of the higher rate. There are a small number of phases needed. (b) Fractional-ratio conversion, where sample coincidence is periodic. A larger number of phases is required. Example here is conversion from 50.4 kHz to 44.1 kHz (8/7). (c) Variable-ratio conversion, where there is no fixed relationship, and a large number of phases are required.

The most complex rate-conversion category is where there is no simple relationship between input and output sampling rates, and in fact they may vary. This situation, shown at (c), is known as variable-ratio conversion. The temporal or spatial relationship of input and output samples is arbitrary. This problem will be met in effects machines which zoom or rotate images and in MPEG-4 warping.

The technique of integer-ratio conversion is used in conjunction with oversampling convertors in digital video and audio and in motion estimation and scaleable compression systems where sub-sampled or reduced-resolution versions of an input image are required.

Figure 3.17(a) shows the spectrum of a typical sampled system where the sampling rate is a little more than twice the analog bandwidth. Attempts to reduce the sampling rate by simply omitting samples, a process known as decimation, will result in aliasing, as shown in Figure 3.17(b). Intuitively it is obvious that omitting samples is the same as if the original sampling rate was lower. In order to prevent aliasing, it is necessary to incorporate low-pass filtering into the system where the cut-off frequency reflects the new, lower, sampling rate. An FIR type low-pass filter could be installed, as described earlier in this chapter, immediately prior to the stage where samples are omitted, but this would be wasteful, because for much of its time the FIR filter would be calculating sample values which are to be discarded. The more effective method is to combine the low-pass filter with the decimator so that the filter only calculates values to be retained in the output sample stream. Figure 3.17(c) shows how this is done. The filter makes one accumulation for every output sample, but that accumulation is the result of multiplying all relevant input samples in the filter window by an appropriate coefficient. The number of points in the filter is determined by the number of input samples in the period of the filter window, but the number of multiplications per second is obtained by multiplying that figure by the output rate. If the filter is not integrated with the decimator, the number of points has to be multiplied by the input rate. The larger the rate-reduction factor, the more advantageous the decimating filter ought to be, but this is not quite the case, as the greater the reduction in rate, the longer the filter window will need to be to accommodate the broader impulse response.

images

Figure 3.17 The spectrum of a typical digital sample stream at (a) will be subject to aliasing as in (b) if the baseband width is not reduced by an LPF. At (c) an FIR low-pass filter prevents aliasing. Samples are clocked transversely across the filter at the input rate, but the filter only computes at the output sample rate. Clearly this will only work if the two are related by an integer factor.

images

Figure 3.18 In integer-ratio sampling, rate increase can be obtained in two stages. First, zero-value samples are inserted to increase the rate, and then filtering is used to give the extra samples real values. The filter necessary will be an LPF with a response which cuts off at the Nyquist frequency of the input samples.

When the sampling rate is to be increased by an integer factor, additional samples must be created at even spacing between the existing ones. There is no need for the bandwidth of the input samples to be reduced since, if the original sampling rate was adequate, a higher one must also be adequate.

Figure 3.18 shows that the process of sampling-rate increase can be thought of in two stages. First, the correct rate is achieved by inserting samples of zero value at the correct instant, and then the additional samples are given meaningful values by passing the sample stream through a low-pass filter which cuts off at the Nyquist frequency of the original sampling rate. This filter is known as an interpolator, and one of its tasks is to prevent images of the lower input-sampling spectrum from appearing in the extended baseband of the higher-rate output spectrum.

All sampled systems have finite bandwidth and need a reconstruction filter to remove the frequencies above the baseband due to sampling. After reconstruction, one infinitely short digital sample ideally represents a sin x/x pulse whose central peak width is determined by the response of the reconstruction filter, and whose amplitude is proportional to the sample value. This implies that, in reality, one sample value has meaning over a considerable timespan, rather than just at the sample instant. Were this not true, it would be impossible to build an interpolator.

Performing interpolation steps separately is inefficient. The bandwidth of the information is unchanged when the sampling rate is increased; therefore the original input samples will pass through the filter unchanged, and it is superfluous to compute them. The combination of the two processes into an interpolating filter minimizes the amount of computation.

As the purpose of the system is purely to increase the sampling rate, the filter must be as transparent as possible, and this implies that a linear-phase configuration is mandatory, suggesting the use of an FIR structure. Figure 3.19 shows that the theoretical impulse response of such a filter is a sin x/x curve which has zero value at the position of adjacent input samples. In practice this impulse cannot be implemented because it is infinite. The impulse response used will be truncated and windowed as described earlier.

images

Figure 3.19 A single sample results in a sin x/x waveform after filtering in the analog domain. At a new, higher, sampling rate, the same waveform after filtering will be obtained if the numerous samples of differing size shown here are used. It follows that the values of these new samples can be calculated from the input samples in the digital domain in an FIR filter.

To simplify this discussion, assume that a sin x/x impulse is to be used. There is a strong parallel with the operation of a DAC where the analog voltage is returned to the time-continuous state by summing the analog impulses due to each sample. In a digital interpolating filter, this process is duplicated.5

If the sampling rate is to be doubled, new samples must be interpolated exactly halfway between existing samples. The necessary impulse response is shown in Figure 3.20; it can be sampled at the output sample period and quantized to form coefficients. If a single input sample is multiplied by each of these coefficients in turn, the impulse response of that sample at the new sampling rate will be obtained. Note that every other coefficient is zero, which confirms that no computation is necessary on the existing samples; they are just transferred to the output. The intermediate sample is computed by adding together the impulse responses of every input sample in the window. The figure shows how this mechanism operates. If the sampling rate is to be increased by a factor of four, three sample values must be interpolated between existing input samples. Figure 3.21 shows that it is only necessary to sample the impulse response at one-quarter the period of input samples to obtain three sets of coefficients which will be used in turn. In hardware-implemented filters, the input sample which is passed straight to the output is transferred by using a fourth filter phase where all coefficients are zero except the central one, which is unity.

Fractional ratio conversion allows interchange between different images having different pixel array sizes. Fractional ratios also occur in the vertical axis of standards convertors. Figure 3.16 showed that when the two sampling rates have a simple fractional relationship m/n, there is a periodicity in the relationship between samples in the two streams. It is possible to have a system clock running at the least-common multiple frequency which will divide by different integers to give each sampling rate.6

The existence of a common clock frequency means that a fractional-ratio convertor could be made by arranging two integer-ratio convertors in series. This configuration is shown in Figure 3.22(a). The input sampling rate is multiplied by m in an interpolator, and the result is divided by n in a decimator. Although this system would work, it would be grossly inefficient, because only one in n of the interpolator’s outputs would be used. A decimator followed by an interpolator would also offer the correct sampling rate at the output, but the intermediate sampling rate would be so low that the system bandwidth would be quite unacceptable.

images

Figure 3.20 A 2 × oversampling interpolator. To compute an intermediate sample, the input samples are imagined to be sin x/x impulses, and the contributions from each at the point of interest can be calculated. In practice, rather more samples on either side need to be taken into account.

images

Figure 3.21 In 4 × oversampling, for each set of input samples, four phases of coefficients are necessary, each of which produces one of the oversampled values.

As has been seen, a more efficient structure results from combining the processes. The result is exactly the same structure as an integer-ratio interpolator, and requires an FIR filter. The impulse response of the filter is determined by the lower of the two sampling rates, and, as before, it prevents aliasing when the rate is being reduced, and prevents images when the rate is being increased. The interpolator has sufficient coefficient phases to interpolate m output samples for every input sample, but not all of these values are computed; only interpolations which coincide with an output sample are performed. It will be seen in Figure 3.22(b) that input samples shift across the transversal filter at the input sampling rate, but interpolations are only performed at the output sample rate. This is possible because a different filter phase will be used at each interpolation.

images

Figure 3.22 At (a), fractional-ratio conversion of 3/4 in this example is by increasing to 4 × input prior to reducing by 3 ×. The inefficiency due to discarding previously computed values is clear. At (b), efficiency is raised since only needed values will be computed. Note how the interpolation phase changes for each output. Fixed coefficients can no longer be used.

In the previous examples, the sample rate or spacing of the filter output had a constant relationship to the input, which meant that the two rates had to be phase-locked. This is an undesirable constraint in some applications, including warping processors. In a variable-ratio interpolator, values will exist for the points at which input samples were made, but it is necessary to compute what the sample values would have been at absolutely any point in two dimensions between available samples. The general concept of the interpolator is the same as for the fractional-ratio convertor, except that an infinite number of filter phases is ideally necessary. Since a realizable filter will have a finite number of phases, it is necessary to study the degradation this causes. The desired continuous temporal or spatial axis of the interpolator is quantized by the phase spacing, and a sample value needed at a particular point will be replaced by a value for the nearest available filter phase. The number of phases in the filter therefore determines the accuracy of the interpolation.

The effects of calculating a value for the wrong point are identical to those of sampling with clock jitter, in that an error occurs proportional to the slope of the signal. The result is program-modulated noise. The higher the noise specification, the greater the desired time accuracy and the greater the number of phases required. The number of phases is equal to the number of sets of coefficients available, and should not be confused with the number of points in the filter, which is equal to the number of coefficients in a set (and the number of multiplications needed to calculate one output value).

The sampling jitter accuracy necessary for eight-bit working is measured in picoseconds. This implies that something like 32 filter phases will be required for adequate performance in an eight-bit sampling-rate convertor.

3.7 Downsampling filters

In compression pre-processing, it is often necessary to downsample the input picture. This may be needed to obtain the desired picture size, e.g. 360 pixels across, from a standard definition picture which is 720 pixels across. 4:2:2 inputs will need vertical filtering to produce 4:2:0 which most of the MPEG-2 profiles use.

Figure 3.23 shows that when a filter is used for downsampling, the stopband rejection is important because poor performance here results in aliasing. In an MPEG environment aliasing is particularly undesirable on an input signal as after the DCT stage aliasing will result in spurious coefficients which will require a higher bit rate to convey. If this bit rate is not available the picture quality will be impaired. Consequently the performance criteria for MPEG downsampling filters is more stringent than for general use and filters will generally require more points than in other applications.

images

Figure 3.23 Downsampling filters for MPEG prefiltering must have good stopband rejection to avoid aliasing when the sampling rate is reduced. Stopband performance is critical for a pre-processing filter.

3.8 The quadrature mirror filter

Audio compression often uses a process known as band splitting which splits up the audio spectrum into a series of frequency ranges. Band splitting is complex and requires a lot of computation. One bandsplitting method which is useful is quadrature mirror filtering.7 The QMF is is a kind of twin FIR filter which converts a PCM sample stream into two sample streams of half the input sampling rate, so that the output data rate equals the input data rate. The frequencies in the lower half of the audio spectrum are carried in one sample stream, and the frequencies in the upper half of the spectrum are carried in the other. Whilst the lower frequency output is a PCM band-limited representation of the input waveform, the upper frequency output is not. A moment’s thought will reveal that it could not be because the sampling rate is not high enough. In fact the upper half of the input spectrum has been heterodyned down to the same frequency band as the lower half by the clever use of aliasing. The waveform is unrecognizable, but when heterodyned back to its correct place in the spectrum in an inverse step, the correct waveform will result once more. Figure 3.24 shows how the idea works.

images

Figure 3.24 The sample stream shown would ordinarily represent the waveform shown in (a), but if it is known that the original signal could exist only between two frequencies then the waveform in (b) must be the correct one. A suitable bandpass reconstruction filter, or synthesis filter, will produce the waveform in (b).

Sampling theory states that the sampling rate needed must be at least twice the bandwidth in the signal to be sampled. If the signal is band limited, the sampling rate need only be more than twice the signal bandwidth not the signal frequency. Downsampled signals of this kind can be reconstructed by a reconstruction or synthesis filter having a bandpass response rather than a low pass response. As only signals within the passband can be output, it is clear from Figure 3.24 that the waveform which will result is the original as the intermediate aliased waveform lies outside the passband.

Figure 3.25 shows the operation of a simple QMF. At (a) the input spectrum of the PCM audio is shown, having an audio baseband extending up to half the sampling rate and the usual lower sideband extending down from there up to the sampling frequency. The input is passed through an FIR low-pass filter which cuts off at one-quarter of the sampling rate to give the spectrum shown at (b). The input also passes in parallel through a second FIR filter which is physically identical, but the coefficients are different. The impulse response of the FIR LPF is multiplied by a cosinusoidal waveform which amplitude modulates it. The resultant impulse gives the filter a frequency response shown at (c). This is a mirror image of the LPF response. If certain criteria are met, the overall frequency response of the two filters is flat. The spectra of both (b) and (c) show that both are oversampled by a factor of 2 because they are half-empty. As a result both can be decimated by a factor of 2, which is the equivalent of dropping every other sample. In the case of the lower half of the spectrum, nothing remarkable happens. In the case of the upper half, it has been resampled at half the original frequency as shown at (d). The result is that the upper half of the audio spectrum aliases or heterodynes to the lower half.

images

Figure 3.25 The quadrature mirror filter. In (a) the input spectrum has an audio baseband extending up to half the sampling rate. The input is passed through an FIR low-pass filter which cuts off at one-quarter of the sampling rate to give the spectrum shown in (b). The input also passes in parallel through a second FIR filter whose impulse response has been multiplied by a cosinusoidal waveform in order to amplitude-modulate it. The resultant impulse gives the filter a mirror image frequency response shown in (c). The spectra of both (b) and (c) show that both are oversampled by a factor of two because they are half empty. As a result both can be decimated by a factor of two, resulting in (d) in two identical Nyquist-sampled frequency bands of half the original width.

An inverse QMF will recombine the bands into the original broadband signal. It is a feature of a QMF/inverse QMF pair that any energy near the band edge which appears in both bands due to inadequate selectivity in the filtering reappears at the correct frequency in the inverse filtering process provided that there is uniform quantizing in all the sub-bands. In practical coders, this criterion is not met, but any residual artifacts are sufficiently small to be masked.

The audio band can be split into as many bands as required by cascading QMFs in a tree. However, each stage can only divide the input spectrum in half. In some coders certain sub-bands will have passed through one splitting stage more than others and will have half their bandwidth.8 A delay is required in the wider sub-band data for time alignment.

A simple quadrature mirror is computationally intensive because sample values are calculated which are later decimated or discarded, and an alternative is to use polyphase pseudo-QMF filters9 or wave filters10 in which the filtering and decimation process is combined. Only wanted sample values are computed. In a polyphase filter a set of samples is shifted into position in the transversal register and then these are multiplied by different sets of coefficients and accumulated in each of several phases to give the value of a number of different samples between input samples. In a polyphase QMF, the same approach is used.

Figure 3.26 shows an example of a 32-band polyphase QMF having a 512-sample window. With 32 sub-bands, each band will be decimated to images of the input sampling rate. Thus only one sample in 32 will be retained after the combined filter/decimate operation. The polyphase QMF only computes the value of the sample which is to be retained in each subband. The filter works in 32 different phases with the same samples in the transversal register. In the first phase, the coefficients will describe the impulse response of a low-pass filter, the so-called prototype filter, and the result of 512 multiplications will be accumulated to give a single sample in the first band. In the second phase the coefficients will be obtained by multiplying the impulse response of the prototype filter by a cosinusoid at the centre frequency of the second band. Once more 512 multiply accumulates will be required to obtain a single sample in the second band. This is repeated for each of the 32 bands, and in each case a different centre frequency is obtained by multiplying the prototype impulse by a different modulating frequency. Following 32 such computations, 32 output samples, one in each band, will have been computed. The transversal register then shifts 32 samples and the process repeats.

images

Figure 3.26 In polyphase QMF the same input samples are subject to computation using coefficient sets in many different time-multiplexed phases. The decimation is combined with the filtering so only wanted values are computed.

The principle of the polyphase QMF is not so different from the techniques used to compute a frequency transform and effectively blurs the distinction between sub-band coding and transform coding.

3.9 Filtering for video noise reduction

The basic principle of all video noise reducers is that there is a certain amount of correlation between the video content of successive frames, whereas there is no correlation between the noise content.

A basic recursive device is shown in Figure 3.27. There is a frame store which acts as a delay, and the output of the delay can be fed back to the input through an attenuator, which in the digital domain will be a multiplier. In the case of a still picture, successive frames will be identical, and the recursion will be large. This means that the output video will actually be the average of many frames. If there is movement of the image, it will be necessary to reduce the amount of recursion to prevent the generation of trails or smears. Probably the most famous examples of recursion smear are the television pictures sent back of astronauts walking on the moon. The received pictures were very noisy and needed a lot of averaging to make them viewable. This was fine until the astronaut moved. The technology of the day did not permit motion sensing.

images

Figure 3.27 A basic recursive device feeds back the output to the input via a frame store which acts as a delay. The characteristics of the device are controlled totally by the values of the two coefficients K1 and K2 which control the multipliers.

The noise reduction increases with the number of frames over which the noise is integrated, but image motion prevents simple combining of frames. It will be seen in section 3.14 that if motion estimation is available, the image of a moving object in a particular frame can be integrated from the images in several frames which have been superimposed on the same part of the screen by displacements derived from the motion measurement. The result is that greater reduction of noise becomes possible.11

In a median filter, sample values adjacent to the one under examination are considered. These may be in the same place in previous or subsequent images, or nearby in the same image. A median filter computes the distribution of values on all its input points. If the value of the centre point lies centrally within the distribution then it is considered to be valid and is passed to the output without change. In this case the median filter has no effect whatsoever. However, if the value of the centre point is at the edge of the distribution it is considered to be in error due to impulsive noise and a different input point, nearest the mean, is selected as the output pixel. Effectively a pixel value from nearby is used to conceal the error. The median filter is very effective against dropouts, film dirt and bit errors.

3.10 Warping

Warping is a technique which allows the texture of an object correctly to be mapped onto its surface. The concept is simple, whereas the implementation is not. Figure 3.28 shows a simple example. A flag is stretched absolutely flat so that the true shape of the pattern can be seen. However, when the flag blows in the wind, the pattern which can be seen is not the same because the flag is no longer flat. Warping is a technique which, inter alia, allows the appearance of a flag correctly to be computed from a knowledge of what the flag looked like when it was flat.

Figure 3.29 shows another example. Here a simple rectangular object such as a bus is turning a corner. In one picture, the outline of the bus may be a rectangle, whereas in the next it has become a trapezoid. In MPEG-2 it would be necessary to send a considerable amount of data to convert the first picture into the second. In MPEG-4 the facility exists to transmit warping codes so that the texture of the side of the bus is warped in the decoder to the new shape. This requires fewer data to be transmitted.

The principle of warping is the same as the technique used by cartographers for centuries. Cartographers are faced with the problem that the earth is round, and paper is flat. In order to produce flat maps, it is necessary to project the features of the round original onto a flat surface. There are a number of different ways of projecting maps, and all of them must by definition produce distortion. The effect of this distortion is that distances measured near the extremities of the map appear further than they actually are. Another effect is that great circle routes (the shortest or longest path between two places on a planet) appear curved on a projected map. The type of projection used is usually printed somewhere on the map, a very common system being that due to Mercator. Clearly the process of mapping involves some threedimensional geometry in order to simulate the paths of light rays from the map so that they appear to have come from the curved surface. Video effects machines work in exactly the same way. It is an interesting demonstration of the evolving power of microelectronics that the circuitry developed for DVEs costing hundreds of thousands of dollars in the 1980s is now just part of an MPEG-4 decoder chip.

images

Figure 3.28 Warping is a process which allows the appearance of a non-flat surface to be computed.

images

Figure 3.29 An example of warping in compression. The change in perspective as the vehicle turns would require a large residual in MPEG-2, but with MPEG-4 much of the new shape can be obtained by warping the pixels from an earlier picture.

The distortion of maps means that things are not where they seem. In timesharing computers, every user appears to have his own identical address space in which his program resides, despite the fact that many different programs are simultaneously in the memory. In order to resolve this contradiction, memory management units are constructed which add a constant value to the address which the user thinks he has (the virtual address) in order to produce the physical address. As long as the unit gives each user a different constant, they can all program in the same virtual address space without one corrupting another’s programs. Because the program is no longer where it seems to be, the term mapping was introduced. The address space of a computer is one dimensional, but a video frame expressed as rows and columns of pixels can be considered to have a two-dimensional address as in Figure 3.30. Video manipulators work by mapping the pixel addresses in two dimensions.

images

Figure 3.30 The entire TV picture can be broken down into uniquely addressable pixels.

Every pixel in the frame array has an address. The address is two dimensional, because in order to uniquely specify one pixel, the column address and the row address must be supplied. It is possible to transform a picture by simultaneously addressing in rows and columns. It was discovered some time ago in connection with computer graphics that the two-dimensional problem can sometimes be converted into two onedimensional problems.12 Essentially if a horizontal transform affecting whole rows of pixels independently of other rows is performed on the array, followed by or preceded by a vertical transform which affects entire columns independently of other columns, the effect will be the same as if a two-dimensional transform had been performed. This is the principle of separability.

There are many different manipulations needed in warping, and the approach here will be to begin with the simplest, which require the least processing, and to graduate to the most complex, introducing the necessary processes at each stage. Initially the examples given will be one dimensional. Figure 3.31 shows a single row of pixels which are held in a buffer where each can be addressed individually and transferred to another. If a constant is added to the read address, the selected pixel will be to the right of the place where it will be put. This has the effect of moving the picture to the left. If the buffer represented a column of pixels, the picture would be moved vertically. As these two transforms can be controlled independently, the picture could be moved diagonally.

If the read address is multiplied by a constant, say 2, the effect is to bring samples from the input closer together on the output, so that the picture size is reduced. Again independent control of the horizontal and vertical transforms is possible, so that the aspect ratio of the picture can be modified. Clearly the secret of these manipulations is in the constants fed to the address generators. The added constant represents displacement, and the multiplied constant represents magnification. A multiplier constant of less than one will result in the picture getting larger. Figure 3.31 also shows, however, that there is a problem. If a constant of 0.5 is used, to make the picture twice as large, half of the addresses generated are not integers. A memory does not understand an address of two and a half! If an arbitrary magnification is used, nearly all the addresses generated are non-integer. A similar problem crops up if a constant of less than one is added to the address in an attempt to move the picture less than the pixel spacing. The solution to the problem is to use the interpolation techniques described earlier in this chapter. Because the input image is spatially sampled, those samples contain enough information to represent the brightness and colour all over the screen.

When the address generator produces an address of 2.5, it actually means that what is wanted is the value of the signal interpolated halfway between pixel 2 and pixel 3. The output of the address generator will thus be split into two parts. The integer part will become the memory address, and the fractional part is the phase of the necessary interpolation. In order to interpolate pixel values a digital filter is necessary.

In order to follow the operation of warping, some knowledge of perspective is necessary. Stated briefly, the phenomenon of perspective is due to the angle subtended to the eye by objects being a function not only of their size but also of their distance. Figure 3.32 shows that the size of an image on the rear wall of a pinhole camera can be increased either by making the object larger or bringing it closer. In the absence of stereoscopic vision, it is not possible to tell which has happened. The pinhole camera is very useful for study of perspective, and has indeed been used by artists for that purpose. The clinically precise perspective of Canaletto paintings was achieved through the use of the camera obscura (’darkened room’ in Italian).13

images

Figure 3.31 Address generation is the fundamental process behind transforms.

images

Figure 3.32 The image on the rear of the pinhole camera is identical for the two solid objects shown because the size of the object is proportional to distance, and the subtended angle remains the same. The image can be made larger (dashed) by making the object larger or moving it closer.

images

Figure 3.33 In a planar rotation effect the source plane ABCD is the rectangular input picture. If it is rotated through the angle θ, ray tracing to a single eye at left will produce a trapezoidal image A ‘B’ C ‘D’ on the target. Magnification will now vary with position on the picture.

Warping processors work by producing the correct subtended angles which the brain perceives as a three-dimensional effect. Figure 3.33 shows that to a single eye, there is no difference between a threedimensional scene and a two-dimensional image formed where rays traced from features to the eye intersect an imaginary plane.

Figure 3.33 also shows that when a plane input picture is rotated about a horizontal axis, the distance from the top of the picture to the eye is no longer the same as the distance from the bottom of the picture to the eye. The result is that the top and bottom edges of the picture subtend different angles to the eye, and where the rays cross the target plane, the image has become trapezoidal. There is now no such thing as the magnification of the picture. The magnification changes continuously from top to bottom of the picture, and if a uniform grid is input, after a perspective rotation it will appear non-linear as the diagram shows.

images

Figure 3.34 When a picture is warped, the pixels no longer register with the standard locations. Interpolation in two dimensions is used to create pixels on the standard grid.

Figure 3.34 shows that when a picture is warped, this causes the pixels in the picture to fail to register with the standard pixel spacing. The solution is two-dimensional interpolation. One pixel value actually represents the peak brightness of a two-dimensional intensity function, which is the effect of the modulation transfer function of the system on an infinitely small point. In order to compute an interpolated value, it is necessary to add together the contribution from all relevant samples, at the point of interest. Each contribution can be obtained by looking up the value of a unity impulse curve at the distance from the input pixel to the output pixel to obtain a coefficient, and multiplying the input pixel value by that coefficient.

The process of taking several pixel values, multiplying each by a different coefficient and summing the products can be performed by the FIR (finite impulse response) configuration described earlier. The impulse response of the filter necessary depends on the magnification. Where the picture is being enlarged, the impulse response can be the same as at normal size, but as the size is reduced, the impulse response has to become broader (corresponding to a reduced spatial frequency response) so that more input samples are averaged together to prevent aliasing. The coefficient store will need a three-dimensional structure, such that the magnification and the interpolation phase in each axis must both be supplied to obtain a set of coefficients. The magnification can easily be obtained by comparing successive outputs from the address generator. Alternatively, the coefficients could be computed dynamically.

3.11 Transforms and duality

The duality of transforms provides an interesting insight into what is happening in common processes. Fourier analysis holds that any periodic waveform can be reproduced by adding together an arbitrary number of harmonically related sinusoids of various amplitudes and phases. Figure 3.35 shows how a square wave can be built up of harmonics. The spectrum can be drawn by plotting the amplitude of the harmonics against frequency. It will be seen that this gives a spectrum which is a decaying wave. It passes through zero at all even multiples of the fundamental. The shape of the spectrum is a sin x/x curve. If a square wave has a sin x/x spectrum, it follows that a filter with a rectangular impulse response will have a sin x/x spectrum.

images

Figure 3.35 Fourier analysis of a square wave into fundamental and harmonics. A, amplitude; δ, phase of fundamental wave in degrees; 1, first harmonic (fundamental); 2, odd harmonics 3–15; 3, sum of harmonics 1–15; 4, ideal square wave.

A low-pass filter has a rectangular spectrum, and this has a sin x/x impulse response. These characteristics are known as a transform pair. In transform pairs, if one domain has one shape of the pair, the other domain will have the other shape. Figure 3.36 shows a number of transform pairs.

At (a) a square wave has a sin x/x spectrum and a sin x/x impulse has a square spectrum. In general the product of equivalent parameters on either side of a transform remains constant, so that if one increases, the other must fall. If (a) shows a filter with a wider bandwidth, having a narrow impulse response, then (b) shows a filter of narrower bandwidth which has a wide impulse response. This is duality in action. The limiting case of this behaviour is where one parameter becomes zero, the other goes to infinity. At (c) a time-domain pulse of infinitely short duration has a flat spectrum. Thus a flat waveform, i.e. DC, has only zero in its spectrum. The impulse response of the optics of a laser disk (d) has a sin2x/x2 intensity function, and this is responsible for the triangular falling frequency response of the pickup. The lens is a rectangular aperture, but as there is no such thing as negative light, a sin x/x impulse response is impossible. The squaring process is consistent with a positive-only impulse response. Interestingly the transform of a Gaussian response in still Gaussian.

images

Figure 3.36 Transform pairs. At (a) the dual of a rectangle is a sin x/x function. If one is time domain, the other is frequency domain. At (b), narrowing one domain widens the other. The limiting case of this is (c). Transform of the sinx/x squared function is triangular (d).

Duality also holds for sampled systems. A sampling process is periodic in the time domain. This results in a spectrum which is periodic in the frequency domain. If the time between the samples is reduced, the bandwidth of the system rises. Figure 3.37(a) shows that a continuous time signal has a continuous spectrum whereas at (b) the frequency transform of a sampled signal is also discrete. In other words sampled signals can only be analysed into a finite number of frequencies. The more accurate the frequency analysis has to be, the more samples are needed in the block. Making the block longer reduces the ability to locate a transient in time. This is the Heisenberg inequality which is the limiting case of duality, because when infinite accuracy is achieved in one domain, there is no accuracy at all in the other.

images

Figure 3.37 Continuous time signal (a) has continuous spectrum. Discrete time signal (b) has discrete spectrum.

3.12 The Fourier transform

Figure 3.35 showed that if the amplitude and phase of each frequency component is known, linearly adding the resultant components in an inverse transform results in the original waveform. The ideal Fourier transform operates over an infinite time. In practice the time span has to be constrained, resulting in a short-term Fourier transform (STFT). In digital systems the waveform is expressed as a (finite) number of discrete samples. As a result the Fourier transform analyses the signal into an equal number of discrete frequencies. This is known as a discrete Fourier transform or DFT in which the number of frequency coefficients is equal to the number of input samples. The fast Fourier transform is no more than an efficient way of computing the DFT.14

It will be evident from Figure 3.35 that the knowledge of the phase of the frequency component is vital, as changing the phase of any component will seriously alter the reconstructed waveform. Thus the DFT must accurately analyse the phase of the signal components.

There are a number of ways of expressing phase. Figure 3.38 shows a point which is rotating about a fixed axis at constant speed. Looked at from the side, the point oscillates up and down at constant frequency. The waveform of that motion is a sine wave, and that is what we would see if the rotating point were to translate along its axis whilst we continued to look from the side.

One way of defining the phase of a waveform is to specify the angle through which the point has rotated at time zero (T = 0). If a second point is made to revolve at 90° to the first, it would produce a cosine wave when translated. It is possible to produce a waveform having arbitrary phase by adding together the sine and cosine wave in various proportions and polarities. For example, adding the sine and cosine waves in equal proportion results in a waveform lagging the sine wave by 45°.

images

Figure 3.38 The origin of sine and cosine waves is to take a particular viewpoint of a rotation. Any phase can be synthesized by adding proportions of sine and cosine waves.

Figure 3.38 shows that the proportions necessary are respectively the sine and the cosine of the phase angle. Thus the two methods of describing phase can be readily interchanged.

The discrete Fourier transform spectrum-analyses a string of samples by searching separately for each discrete target frequency. It does this by multiplying the input waveform by a sine wave, known as the basis function, having the target frequency and adding up or integrating the products. Figure 3.39(a) shows that multiplying by basis functions gives a non-zero integral when the input frequency is the same, whereas Figure 3.39(b) shows that with a different input frequency (in fact all other different frequencies) the integral is zero showing that no component of the target frequency exists. Thus from a real waveform containing many frequencies all frequencies except the target frequency are excluded. The magnitude of the integral is proportional to the amplitude of the target component.

images

Figure 3.39 The input waveform is multiplied by the target frequency and the result is averaged or integrated. In (a) the target frequency is present and a large integral results. With another input frequency the integral is zero as in (b). The correct frequency will also result in a zero integral shown in (c) if it is at 90° to the phase of the search frequency. This is overcome by making two searches in quadrature.

Figure 3.39(c) shows that the target frequency will not be detected if it is phase shifted 90° as the product of quadrature waveforms is always zero. Thus the discrete Fourier transform must make a further search for the target frequency using a cosine basis function. It follows from the arguments above that the relative proportions of the sine and cosine integrals reveals the phase of the input component. Thus each discrete frequency in the spectrum must be the result of a pair of quadrature searches.

Searching for one frequency at a time as above will result in a DFT, but only after considerable computation. However, a lot of the calculations are repeated many times over in different searches. The fast Fourier transform gives the same result with less computation by logically gathering together all the places where the same calculation is needed and making the calculation once.

The amount of computation can be reduced by performing the sine and cosine component searches together. Another saving is obtained by noting that every 180° the sine and cosine have the same magnitude but are simply inverted in sign. Instead of performing four multiplications on two samples 180° apart and adding the pairs of products it is more economical to subtract the sample values and multiply twice, once by a sine value and once by a cosine value.

The first coefficient is the arithmetic mean which is the sum of all the sample values in the block divided by the number of samples. Figure 3.40 shows how the search for the lowest frequency in a block is performed. Pairs of samples are subtracted as shown, and each difference is then multiplied by the sine and the cosine of the search frequency. The process shifts one sample period, and a new sample pair is subtracted and multiplied by new sine and cosine factors. This is repeated until all the sample pairs have been multiplied. The sine and cosine products are then added to give the value of the sine and cosine coefficients respectively.

It is possible to combine the calculation of the DC component which requires the sum of samples and the calculation of the fundamental which requires sample differences by combining stages shown in Figure 3.41(a) which take a pair of samples and add and subtract them. Such a stage is called a butterfly because of the shape of the schematic. Figure 3.41(b) shows how the first two components are calculated. The phase rotation boxes attribute the input to the sine or cosine component outputs according to the phase angle. As shown, the box labelled 90° attributes nothing to the sine output, but unity gain to the cosine output. The 45° box attributes the input equally to both components.

Figure 3.41(c) shows a numerical example. If a sine wave input is considered where zero degrees coincides with the first sample, this will produce a zero sine coefficient and non-zero cosine coefficient. Figure 3.41(d) shows the same input waveform shifted by 90°. Note how the coefficients change over.

Figure 3.41(e) shows how the next frequency coefficient is computed. Note that exactly the same first-stage butterfly outputs are used, reducing the computation needed.

images

Figure 3.40 An example of a filtering search. Pairs of samples are subtracted and multiplied by sampled sine and cosine waves. The products are added to give the sine and cosine components of the search frequency.

A similar process may be followed to obtain the sine and cosine coefficients of the remaining frequencies. The full FFT diagram for eight samples is shown in Figure 3.42(a) on page 141. The spectrum this calculates is shown in Figure 3.42(b) on page 141. Note that only half of the coefficients are useful in a real band-limited system because the remaining coefficients represent frequencies above one half of the sampling rate.

In STFTs the overlapping input sample blocks must be multiplied by window functions. The principle is the same as for the application in FIR filters shown in section 3.5. Figure 3.43 (page 142) shows that multiplying the search frequency by the window has exactly the same result except that this need be done only once and much computation is saved. Thus in the STFT the basis function is a windowed sine or cosine wave.

The FFT is used extensively in such applications as phase correlation, where the accuracy with which the phase of signal components can be analysed is essential. It also forms the foundation of the discrete cosine transform.

images

images

images

images

Figure 3.41 The basic element of an FFT is known as a butterfly as in (a) because of the shape of the signal paths in a sum and difference system. The use of butterflies to compute the first two coefficients is shown in (b). An actual example is given in (c) which should be compared with the result of (d) with a quadrature input. In (e) the butterflies for the first two coefficients form the basis of the computation of the third coefficient. (See pages 138–140 for parts (c)–(e).)

3.13 The discrete cosine transform (DCT)

The DCT is a special case of a discrete Fourier transform in which the sine components of the coefficients have been eliminated leaving a single number. This is actually quite easy. Figure 3.44(a) (page 142) shows a block of input samples to a transform process. By repeating the samples in a time-reversed order and performing a discrete Fourier transform on the double-length sample set a DCT is obtained. The effect of mirroring the input waveform is to turn it into an even function whose sine coefficients are all zero. The result can be understood by considering the effect of individually transforming the input block and the reversed block.

Figure 3.44(b) (page 142) shows that the phase of all the components of one block is in the opposite sense to those in the other. This means that when the components are added to give the transform of the double length block all the sine components cancel out, leaving only the cosine coefficients, hence the name of the transform.15 In practice the sine component calculation is eliminated. Another advantage is that doubling the block length by mirroring doubles the frequency resolution, so that twice as many useful coefficients are produced. In fact a DCT produces as many useful coefficients as input samples.

For image processing two-dimensional transforms are needed. In this case for every horizontal frequency, a search is made for all possible vertical frequencies. A two-dimensional DCT is shown in Figure 3.45 on page 143. The DCT is separable in that the two-dimensional DCT can be obtained by computing in each dimension separately. Fast DCT algorithms are available.16

Figure 3.45 shows how a two-dimensional DCT is calculated by multiplying each pixel in the input block by terms which represent sampled cosine waves of various spatial frequencies. A given DCT coefficient is obtained when the result of multiplying every input pixel in the block is summed. Although most compression systems, including JPEG and MPEG, use square DCT blocks, this is not a necessity and rectangular DCT blocks are possible and are used in, for example, Digital Betacam and DVC.

The DCT is primarily used in MPEG because it converts the input waveform into a form where redundancy can be easily detected and removed. More details of the DCT can be found in Chapter 5.

3.14 The wavelet transform

The wavelet transform was not discovered by any one individual, but has evolved via a number of similar ideas and was only given a strong mathematical foundation relatively recently.17,18 Much of the early work was performed in France, where the term ondelette was coined. Wavelet is an anglicized equivalent. The Fourier transform is based on periodic signals and endless basis functions and requires windowing to a shortterm transform in practical use. The wavelet transform is fundamentally windowed and there is no assumption of periodic signals. The basis functions employed are not endless sine waves, but are finite on the time axis. Wavelet transforms do not use a fixed window, but instead the window period is inversely proportional to the frequency being analysed.

images

Figure 3.42 In (a) is the full butterfly diagram for an FFT. The spectrum this computes is shown in (b).

As a result a useful combination of time and frequency resolutions is obtained. High frequencies corresponding to transients in audio or edges in video are transformed with short basis functions and therefore are accurately located. Low frequencies are transformed with long basis functions which have good frequency resolution.19,20 Figure 3.46 shows that in the pure time (or space in imaging) domain, there are only samples or pixels. These are taken at vanishingly small instants and so cannot contain any frequency information. At the other extreme is the Fourier domain. Here there are only coefficients representing the amplitude of endless sinusoids which cannot carry any time/space information.

images

Figure 3.43 Multiplication of a windowed block by a sine wave basis function is the same as multiplying the raw data by a windowed basis function but requires less multiplication as the basis function is constant and can be pre-computed.

images

Figure 3.44 The DCT is obtained by mirroring the input block as shown in (a) prior to an FFT. The mirroring cancels out the sine components as in (b), leaving only cosine coefficients.

images

Figure 3.45 A two-dimensional DCT is calculated as shown here. Starting with an input pixel block one calculation is necessary to find a value for each coefficient. After 64 calculations using different basis functions the coefficient block is complete.

images

Figure 3.46 When in the pure time domain, nothing is known about frequency and vice versa. These pure domains are academic and require infinitely large and small quantities to implement. Wavelets operate in the real world where time and frequency information is simultaneously present.

The Heisenberg uncertainty principle simply observes that at any position on the continuum between extremes a certain combination of time and frequency accuracy can be obtained. Clearly the human senses operate with a combination of both. Wavelets are transforms which operate in the time/frequency domain as shown in the figure. There are pixels at one end of the scale and coefficients at the other. Wavelet parameters fall somewhere in between the pure pixel and the pure coefficient. In addition to describing the amount of energy at a certain frequency, they may also describe where that energy existed. There seems to be no elegant contraction of coefficient and pixel to produce an English term for a wavelet parameter, but the Franglais terms pixelette and ondel have a certain charm.

The fundamental component of a wavelet transform is the filter decimator shown in Figure 3.47(a). This is a complementary high- and low-pass filter which divides the input bandwidth in half. The high-pass filter impulse response is a wavelet. If a block of samples or pixels of a certain size is input, each output is decimated by a factor of two so that the same number of samples is present on the output. There is a strong parallel with the quadrature mirror filter here. The low-pass output is a reduced resolution signal and the high-pass output represents the detail which must be added to make the original signal. The high-pass process is a form of differentiation. Figure 3.48 shows that these two sample streams can be recombined. Each one is upsampled by a factor of two by an appropriate form of interpolator and then the two sample streams are added.

images

Figure 3.47 (a) The fundamental component of a wavelet transform is a filter decimator. (b) These stages can be cascaded.

images

Figure 3.48 This wavelet decoder can perfectly recover the transformed data of Figure 3.47.

Provided that the decoding filters have responses which are appropriate to the responses of the encoding filters, the output sample stream will be identical to the original. This is known as perfect reconstruction and the practical use of wavelets relies upon finding sets of four filters which combine to display this characteristic.

Figure 3.47(b) shows that stages are cascaded in such a way that the low-pass process iterates. At each stage the number of samples representing the original block has halved, but exactly the same filter pair is employed. As a result the wavelet at each stage appears to have become twice as long with respect to the input block.

As a result of this cascading, wavelets are naturally and fundamentally scaleable. At the end of a cascade is a heavily band-limited signal. Adding coefficients from the next frequency band doubles the resolution available. Adding coefficients from the next band further doubles it.

The wavelet transform outputs the same amount of data as is input. Figure 3.49(a) shows a one-dimensional example of a four-stage cascade into which sixteen samples are fed. At the first stage eight difference values are created. At the next, four difference values are created. At the next there are two. Finally a single difference value is created along with a single output value from the sequence of four low-pass filters which represents the average brightness of the sixteen pixels. The number of output values is then:

images

Figure 3.49 (a) A four-stage one-dimensional wavelet cascade. (b) When used with twodimensional pixel arrays, each wavelet stage divides the data into one quarter and three quarters. (c) Cascading the process of (b).

8 + 4 + 2 + 1 differences + 1 average = 16

Like other transforms, the wavelet itself does not result in any compression. However, what it does do is to represent the original information in such a way that redundancy is easy to identify.

Figure 3.50 shows that that a set of wavelets or basis functions can be obtained simply by dilating (scaling) a single wavelet on the time or space axis by a factor of two each time. Each wavelet contains the same number of cycles such that as the frequency reduces, the wavelet gets longer. In a block of fixed size, a large number of short wavelets are required to describe the highest frequency, whereas the number of wavelets at the next resolution down would be halved. The desirable result is that as frequency rises, the spatial or temporal resolution increases. Figure 3.51(a) shows the fixed time/frequency characteristic of the Fourier transform. This can be improved by using window sizes which are a function of frequency as in (b). The wavelet transform in (c) has better temporal resolution as frequency increases due to the dilation of the basis function.

The dual of this temporal behaviour is that the frequency discrimination of the wavelet transform is a constant fraction of the signal frequency. In a filter bank such a characteristic would be described as ‘constant Q’. Figure 3.52 shows the division of the frequency domain by a wavelet transform is logarithmic whereas in the Fourier transform the division is uniform. The logarithmic coverage is effectively dividing the frequency domain into octaves.

When using wavelets with two-dimensional pixel arrays, vertical and horizontal filtering is required. Figure 3.49(b) shows that after a single stage, decimation by two in two axes means that the low-pass output contains one quarter of the original data, whereas the difference data require the other three quarters.

images

Figure 3.50 Unlike discrete Fourier transforms, wavelet basis functions are scaled so that they contain the same number of cycles irrespective of frequency. As a result their frequency discrimination ability is a constant proportion of the centre frequency.

images

Figure 3.51 (a) In transforms greater certainty in the time domain leads to less certainty in the frequency domain and vice versa. Some transform coders split the spectrum as in (b) and use different window lengths in the two bands. In the recently developed wavelet transform the window length is inversely proportional to the frequency, giving the advantageous time/frequency characteristic shown in (c).

Figure 3.49(c) shows that cascading this process results in an ever smaller low-pass pixel block which eventually converges to a single average brightness code.

3.15 The importance of motion compensation

The optic flow axis is the locus of some point on a moving object which will be in a different place in successive pictures. Any device which computes with respect to the optic flow axis is said to be motion compensated. Until recently the amount of computation required in motion compensation was too expensive, but as this is no longer the case the technology has become very important in moving image portrayal systems.

Figure 3.53(a) shows an example of a moving object which is in a different place in each of three pictures. The optic flow axis is shown. The object is not moving with respect to the optic flow axis and if this axis can be found some very useful results are obtained. The process of finding the optic flow axis is called motion estimation. Motion estimation is literally a process which analyses successive pictures and determines how objects move from one to the next. It is an important enabling technology because of the way it parallels the action of the human eye.

images

Figure 3.52 Wavelet transforms divide the frequency domain into octaves instead of the equal bands of the Fourier transform.

Figure 3.53(b) shows that if the object does not change its appearance as it moves, it can be portrayed in two of the pictures by using data from one picture only, simply by shifting part of the picture to a new location. This can be done using vectors as shown. Instead of transmitting a lot of pixel data, a few vectors are sent instead. This is the basis of motioncompensated compression which is used extensively in MPEG as will be seen in Chapter 5.

Figure 3.53(c) shows that if a high-quality standards conversion is required between two different frame rates, the output frames can be synthesized by moving image data, not through time, but along the optic flow axis. This locates objects where they would have been if frames had been sensed at those times, and the result is a judder-free conversion. This process can be extended to drive image displays at a frame rate higher than the input rate so that flicker and background strobing are reduced. This technology is available in certain high-quality consumer television sets. This approach may also be used with 24 Hz film to eliminate judder in telecine machines.

images

Figure 3.53 Motion compensation is an important technology. (a) The optic flow axis is found for a moving object. (b) The object in picture (n + 1) and (n + 2) can be re-created by shifting the object of picture n using motion vectors. MPEG uses this process for compression. (c) A standards convertor creates a picture on a new timebase by shifting object data along the optic flow axis. (d) With motion compensation a moving object can still correlate from one picture to the next so that noise reduction is possible.

Figure 3.53(d) shows that noise reduction relies on averaging two or more images so that the images add but the noise cancels. Conventional noise reducers fail in the presence of motion, but if the averaging process takes place along the optic flow axis, noise reduction can continue to operate.

The way in which eye tracking avoides aliasing is fundamental to the perceived quality of television pictures. Many processes need to manipulate moving images in the same way in order to avoid the obvious difficulty of processing with respect to a fixed frame of reference. Processes of this kind are referred to as motion compensated and rely on a quite separate process which has measured the motion.

Motion compensation is also important where interlaced video needs to be processed as it allows the best possible de-interlacing performance.

It should be clear that any motion-compensated process must be adaptive in order to handle edits or cuts which effectively terminate one set of optic flow axes and replace them with another. In general, attempting to code a cut using motion compensation will be significantly less efficient than using other tools.

3.16 Motion-estimation techniques

There are three motion-estimation tools that are to be found in various applications: block matching, gradient matching and phase correlation. Each has its their own characteristic strengths and weaknesses so that practical systems will often combine these tools in some way.

3.16.1 Block matching

This is the simplest technique to follow. In a given picture, a block of pixels is selected and stored as a reference. If the selected block is part of a moving object, a similar block of pixels will exist in the next picture, but not in the same place. As Figure 3.54 shows, block matching simply moves the reference block around over the second picture looking for matching pixel values. When a match is found, the displacement needed to obtain it is used as a basis for a motion vector.

If there is a match over the whole of the block, the moving area must be bigger than the block. On the other hand, the edge of the moving area may cut through the block. In this case correlation will only be obtained over that part of the block which is moving.

Whilst simple in concept, block matching requires an enormous amount of computation because every possible motion must be tested over the assumed range. Thus if the object is assumed to have moved over a sixteen-pixel range, then it will be necessary to test sixteen different horizontal displacements in each of sixteen vertical positions; in excess of 65 000 positions. At each position every pixel in the block must be compared with every pixel in the second picture. In typical video, displacements of twice the figure quoted here may be found, particularly in sporting events, and the computation then required becomes enormous.

images

Figure 3.54 In block matching the search block has to be positioned at all possible relative motions within the search area and a correlation measured at each one.

One way of reducing the amount of computation is to perform the matching in stages where the first stage is inaccurate but covers a large motion range but the last stage is accurate but covers a small range.16 The first matching stage is performed on a heavily filtered and subsampled picture, which contains far fewer pixels. When a match is found, the displacement is used as a basis for a second stage which is performed with a less heavily filtered picture. Eventually the last stage takes place to any desired accuracy. This hierarchical approach does reduce the computation required, but it suffers from the problem that the filtering of the first stage may make small objects disappear and they can never be found by subsequent stages if they are moving with respect to their background. Many televised sports events contain small, fast-moving objects. As the matching process depends upon finding similar luminance values, this can be confused by objects moving into shade or fades.

The simple block-matching systems described above can only measure motion to the nearest pixel. If more accuracy is required, interpolators will be needed to shift the image by sub-pixel distances before attempting a match. The complexity rises once more. In compression systems an accuracy of half a pixel is accurate enough for most purposes, although AVC offers quarter-pixel precision.

3.16.2 Gradient matching

At some point in a picture, the function of brightness with respect to distance across the screen will have a certain slope, known as the spatial luminance gradient. If the associated picture area is moving, the slope will traverse a fixed point on the screen and the result will be that the brightness now changes with respect to time. This is a temporal luminance gradient. Figure 3.55 shows the principle. For a given spatial gradient, the temporal gradient becomes steeper as the speed of movement increases. Thus motion speed can be estimated from the ratio of the spatial and temporal gradients.21

The method only works well if the gradient remains essentially constant over the displacement distance; a characteristic which is not necessarily present in real video. In practice there are numerous processes which can change the luminance gradient. When an object moves so as to obscure or reveal the background, the spatial gradient will change from field to field even if the motion is constant. Variations in illumination, such as when an object moves into shade, also cause difficulty.

images

Figure 3.55 The principle of gradient matching. The luminance gradient across the screen is compared with that through time.

Block and gradient matching processes can be assisted in various ways that cut down computation by eliminating in advance searches that will not result in correlations. One such method is to project motion in previous pictures forwards to the present picture to act as a basis for the calculation. One of the most powerful ways of steering a block matcher is phase correlation.

3.16.3 Phase correlation

Phase correlation works by performing a discrete Fourier transform on two successive fields and then subtracting all the phases of the spectral components. The phase differences are then subject to a reverse transform which directly reveals peaks whose positions correspond to motions between the fields.22,23 The nature of the transform domain means that if the distance and direction of the motion is measured accurately, the area of the screen in which it took place is not. This may or may not be a consequence of Heisenberg’s uncertainty theorem. Thus in practical systems the phase-correlation stage is followed by a matching stage not dissimilar to the block-matching process. However, the matching process is steered by the motions from the phase correlation, and so there is no need to attempt to match at all possible motions. By attempting matching on measured motion the overall process is made much more efficient.

One way of considering phase correlation is that by using the Fourier transform to break the picture into its constituent spatial frequencies the hierarchical structure of block matching at various resolutions is in fact performed in parallel. In this way small objects are not missed because they will generate high-frequency components in the transform.

Although the matching process is simplified by adopting phase correlation, the Fourier transforms themselves require complex calculations. The high performance of phase correlation would remain academic if it were too complex to put into practice. However, if realistic values are used for the motion speeds which can be handled, the computation required by block matching actually exceeds that required for phase correlation.

The elimination of amplitude information from the phase-correlation process ensures that motion estimation continues to work in the case of fades, objects moving into shade or flashguns firing.

The details of the Fourier transform have been described in section 3.5. A one-dimensional example of phase correlation will be given here by way of introduction. A line of luminance, which in the digital domain consists of a series of samples, is a function of brightness with respect to distance across the screen. The Fourier transform converts this function into a spectrum of spatial frequencies (units of cycles per picture width) and phases.

images

Figure 3.56 The definition of phase linearity is that phase shift is proportional to frequency. In phase-linear systems the waveform is preserved, and simply moves in time or space.

All television signals must be handled in linear-phase systems. A linear-phase system is one in which the delay experienced is the same for all frequencies. If video signals pass through a device which does not exhibit linear phase, the various frequency components of edges become displaced across the screen. Figure 3.56 shows what phase linearity means. If the left-hand end of the frequency axis (DC) is considered to be firmly anchored, but the right-hand end can be rotated to represent a change of position across the screen, it will be seen that as the axis twists evenly the result is phase-shift proportional to frequency. A system having this characteristic is said to display linear phase.

In the spatial domain, a phase shift corresponds to a physical movement. Figure 3.57 shows that if between fields a waveform moves along the line, the lowest frequency in the Fourier transform will suffer a given phase shift, twice that frequency will suffer twice that phase shift and so on. Thus it is potentially possible to measure movement between two successive fields if the phase differences between the Fourier spectra are analysed. This is the basis of phase correlation.

Figure 3.58 shows how a one-dimensional phase correlator works. The Fourier transforms of two lines from successive fields are computed and expressed in polar (amplitude and phase) notation (see section 3.5). The phases of one transform are all subtracted from the phases of the same frequencies in the other transform. Any frequency component having significant amplitude is then normalized, or boosted to full amplitude.

The result is a set of frequency components which all have the same amplitude, but have phases corresponding to the difference between two fields. These coefficients form the input to an inverse transform. Figure 3.59(a) shows what happens. If the two fields are the same, there are no phase differences between the two, and so all the frequency components are added with zero degree phase to produce a single peak in the centre of the inverse transform. If, however, there was motion between the two fields, such as a pan, all the components will have phase differences, and this results in a peak shown in Figure 3.59(b) which is displaced from the centre of the inverse transform by the distance moved. Phase correlation thus actually measures the movement between fields.

images

Figure 3.57 In a phase-linear system, shifting the video waveform across the screen causes phase shifts in each component proportional to frequency.

images

Figure 3.58 The basic components of a phase correlator.

In the case where the line of video in question intersects objects moving at different speeds, Figure 3.59(c) shows that the inverse transform would contain one peak corresponding to the distance moved by each object.

Whilst this explanation has used one dimension for simplicity, in practice the entire process is two dimensional. A two-dimensional Fourier transform of each field is computed, the phases are subtracted, and an inverse two-dimensional transform is computed, the output of which is a flat plane out of which three-dimensional peaks rise. This is known as a correlation surface.

Figure 3.60(a) shows some examples of a correlation surface. At (a) there has been no motion between fields and so there is a single central peak. At (b) there has been a pan and the peak moves across the surface. At (c) the camera has been depressed and the peak moves upwards.

images

Figure 3.59 (a) The peak in the inverse transform is central for no motion. (b) In the case of motion the peak shifts by the distance moved. (c) If there are several motions, each one results in a peak.

images

Figure 3.60 (a) A two-dimensional correlation surface has a central peak when there is no motion. (b) In the case of a pan, the peak moves laterally. (c) A camera tilt moves the peak at right angles to the pan.

Where more complex motions are involved, perhaps with several objects moving in different directions and/or at different speeds, one peak will appear in the correlation surface for each object.

It is a fundamental strength of phase correlation that it actually measures the direction and speed of moving objects rather than estimating, extrapolating or searching for them. The motion can be measured to sub-pixel accuracy without excessive complexity.

However, the consequences of uncertainty are that accuracy in the transform domain is incompatible with accuracy in the spatial domain. Although phase correlation accurately measures motion speeds and directions, it cannot specify where in the picture these motions are taking place. It is necessary to look for them in a further matching process. The efficiency of this process is dramatically improved by the inputs from the phase-correlation stage.

3.17 Motion-compensated displays

MPEG coding is frequently used with film-originated material in DVD and in DVB. When this is done the pictures are coded at their native rate of 24 (or sometimes 25) Hertz. Figure 3.61(a) shows the time axis of film, where entire frames are simultaneously exposed, or sampled, at typically 24 Hz. The result is that the image is effectively at right angles to the time axis. During filming, some of the frame period is required to transport the film, and the shutter is closed whilst this takes place. The temporal aperture or exposure is thus somewhat shorter than the frame period.

When 25 Hz film is displayed with a simple MPEG decoder each frame of a film is generally output twice to produce a flicker frequency of 50 Hz. The result with a moving object is that the motion is not properly portrayed and there is judder. Figure 3.61(b) shows the origin of the judder.

In 60 Hz areas the film frames are decoded at 24 Hz, but odd frames are output as three fields, even frames result in two fields. This is the well-known 3/2 pulldown system used to convert film rate to 60 Hz video.

Motion portrayal (or lack of it) in this case is even worse.

In fact the 24/25 Hz frame rate output of an MPEG decoder is a perfect application for motion compensation. As Figure 3.62 shows, a motioncompensated standards conversion process is used to output whatever frame rate is required without judder, leading to much-improved subjective quality.

images

Figure 3.61 (a) The optic flow axis of the original scene is distorted by the double projection of each frame. (b) A tracking eye sees a double image in the presence of motion.

images

Figure 3.62 A film with a frame rate of 24 Hz cannot be displayed directly because of flicker. Using a motion-compensated standards conversion process extra frames can be synthesized in which moving objects are correctly positioned. Any television picture rate can then be obtained from film.

3.18 Camera-shake compensation

As video cameras become smaller and lighter, it becomes increasingly difficult to move them smoothly and the result is camera shake. This is irritating to watch, as well as requiring a higher bit rate in compression systems. There are two solutions to the problem, one which is contained within the camera, and one which can be used at some later time on the video data.

Figure 3.63 shows that image-stabilizing cameras contain miniature gyroscopes which produce an electrical output proportional to their rate of turn about a specified axis. A pair of these, mounted orthogonally, can produce vectors describing the camera shake. This can be used to oppose the shake by shifting the image. In one approach, the shifting is done optically. Figure 3.64 shows a pair of glass plates with the intervening spaced filled with transparent liquid. By tilting the plates a variableangle prism can be obtained and this is fitted in the optical system before the sensor. If the prism plates are suitably driven by servos from the gyroscopic sensors, the optical axis along which the camera is looking can remain constant despite shake. Shift is also possible by displacing some of the lens elements.

Alternatively, the camera can contain a DVE where the vectors from the gyroscopes cause the CCD camera output to be shifted horizontally or vertically so that the image remains stable. This approach is commonly used in consumer camcorders.

A great number of video recordings and films already exist in which there is camera shake. Film also suffers from weave in the telecine machine. In this case the above solutions are inappropriate and a suitable signal processor is required. Figure 3.65 shows that motion compensation can be used. If a motion estimator is arranged to find the motion between a series of pictures, camera shake will add a fixed component in each picture to the genuine object motions. This can be used to compute the optic flow axis of the camera, independently of the objects portrayed.

images

Figure 3.63 Image-stabilizing cameras sense shake using a pair of orthogonal gyros which sense movement of the optical axis.

images

Figure 3.64 Image-stabilizing cameras. (a) The image is stabilized optically prior to the CCD sensors. (b) The CCD output contains image shake, but this is opposed by the action of a DVE configured to shift the image under control of the gyro inputs.

images

Figure 3.65 In digital image stabilizing the optic flow axis of objects in the input video is measured as in (a). This motion is smoothed to obtain a close approximation to the original motion (b). If this is subtracted from (a) the result is the camera shake motion (c) which is used to drive the image stabilizer.

Operating over several pictures, the trend in camera movement can be separated from the shake by filtering, to produce a position error for each picture. Each picture is then shifted in a DVE in order to cancel the position error. The result is that the camera shake is gone and the camera movements appear smooth. In order to prevent the edges of the frame moving visibly, the DVE also performs a slight magnification so that the edge motion is always outside the output frame.

3.19 Motion-compensated de-interlacing

The most efficient way of de-interlacing is to use motion compensation. Figure 3.66 shows that when an object moves in an interlaced system, the interlace breaks down with respect to the optic flow axis. If the motion is known, two or more fields can be shifted so that a moving object is in the same place in both. Pixels from both fields can then be used to describe the object with better resolution than would be possible from one field alone. It will be seen from Figure 3.67 that the combination of two fields in this way will result in pixels having a highly irregular spacing and a special type of filter is needed to convert this back to a progressive frame with regular pixel spacing. At some critical vertical speeds there will be alignment between pixels in adjacent fields and no improvement is possible, but at other speeds the process will always give better results.

images

Figure 3.66 In the presence of vertical motion or motion having a vertical component, interlace breaks down and the pixel spacing with respect to the tracking eye becomes irregular.

images

Figure 3.67 A de-interlacer needs an interpolator which can operate with input samples which are positioned arbitrarily rather than regularly.

3.20 Compression and requantizing

In compression systems the goal is to achieve coding gain by using fewer bits to represent the same information. The band-splitting filters and transform techniques decribed earlier in this chapter do not achieve any coding gain. Their job is to express the information in a form in which redundancy can be identified. Paradoxically the output of a transform or a filter actually has a longer wordlength than the input because the integer input samples are multiplied by fractions. These processes have actually increased the redundancy in the signal. This section is concerned with the subsequent stage where the compression actually takes place.

Coding gain is obtained in one simple way: by shortening the wordlength of data words so that fewer bits are needed. These data words may be waveform samples in a sub-band-based system or coefficients in a transform-based system. In both cases positive and negative values must be handled. Figure 3.68(a) shows various signal levels in two’s complement coding. As the level falls a phenomenon called sign extension takes place where more and more bits at the most significant end of the word simply copy the sign bit (which is the MSB). Coding gain can be obtained by eliminating the redundant sign extension bits as Figure 3.68(b) shows.

images

Figure 3.68 In two’s complement coding, sign extension takes place as the level falls. These sign-extended bits are redundant and can be eliminated by multiplying by a leveldependent factor and neglecting the trailing zeros as shown in (b).

Taking bits out of the middle of a word is not straightforward and in practice the solution is to multiply by a level-dependent factor to eliminate the sign extension bits. If this is a power of 2 the useful bits will simply shift left up to the sign bit. The right-hand zeros are simply omitted from the transmission. On decoding a compensating division must be performed. The multiplication factor must be transmitted along with the compressed data so this can be done. Clearly if only the sign extension bits are eliminated this process is lossless because exactly the same data values are available at the decoder.

The reason for sub-band filtering and transform coding now becomes clear because in real signals the levels in most sub-bands and the value of most coefficients are considerably less than the highest level.

In many cases the coding gain obtained in this way will not be enough and the wordlength has to be shortened even more. Following the multiplication, a larger number of bits are rounded off from the least significant end of the word. The result is that the same signal range is retained but it is expressed with less accuracy. It is as if the original analog signal had been converted using fewer quantizing steps, hence the term requantizing.

During the decoding process an inverse quantizer will be employed to convert the compressed value back to its original form. Figure 3.69 shows a requantizer and an inverse quantizer. The inverse quantizer must divide by the same gain factor as was used in the compressor and reinsert trailing zeros up to the required wordlength.

A non-uniform quantization process may be used in which the quantizing steps become larger as the signal amplitude increases. As the quantizing steps are made larger, more noise will be suffered, but the noise is arranged to be generated at frequencies where it will not be perceived, or on signal values which occur relatively infrequently.

images

Figure 3.69 Coding gain is obtained by shortening the sample or coefficient wordlength so fewer bits are needed. The input values are scaled or amplified to near maximum amplitude prior to rounding off the low-order bits. The scale factor must be transmitted to allow the process to be reversed at the decoder.

images

Figure 3.70 Shortening the wordlength of a sample reduces the number of codes which can describe the voltage of the waveform. This makes the quantizing steps bigger, hence the term requantizing. It can be seen that simple truncation or omission of low-order bits does not give analogous behaviour. Rounding is necessary to give the same result as if the larger steps had been used in the original conversion.

Shortening the wordlength of a sample from the LSB end reduces the number of quantizing intervals available without changing the signal amplitude. As Figure 3.70 shows, the quantizing intervals become larger and the original signal is requantized with the new interval structure. It will be seen that truncation does not meet the above requirement as it results in signal-dependent offsets because it always rounds in the same direction. Proper numerical rounding is essential for accuracy. Rounding in two’s complement is a little more complex than in pure binary.

If the parameter to be requantized is a transform coefficient the result after decoding is that the frequency which is reproduced has an incorrect amplitude due to quantizing error. If all the coefficients describing a signal are requantized to roughly the same degree then the error will be uniformly present at all frequencies and will be noise-like.

However, if the data are time-domain samples, as in, for example, an audio sub-band coder, the result will be different. Although the original audio conversion would have been correctly dithered, the linearizing random element in the low-order bits will be some way below the end of the shortened word. If the word is simply rounded to the nearest integer the linearizing effect of the original dither will be lost and the result will be quantizing distortion. As the distortion takes place in a band-limited system the harmonics generated will alias back within the band. Where the requantizing process takes place in a sub-band, the distortion products will be confined to that sub-band as shown in Figure 3.71.

images

Figure 3.71 Requantizing a band-limited signal causes harmonics which will always alias back within the band.

In practice, the wordlength of samples should be shortened in such a way that the requantizing error is converted to noise rather than distortion. One technique which meets this requirement is to use digital dithering24 prior to rounding.

Digital dither is a pseudo-random sequence of numbers. If it is required to simulate the analog dither signal of Figure 2.30, then it is obvious that the noise must be bipolar so that it can have an average voltage of zero. Two’s complement coding must be used for the dither values.

Figure 3.72 shows a simple digital dithering system for shortening sample wordlength. The output of a two’s complement pseudo-random sequence generator of appropriate wordlength is added to input samples prior to rounding. The most significant of the bits to be discarded is examined in order to determine whether the bits to be removed sum to more or less than half a quantizing interval. The dithered sample is either rounded down, i.e. the unwanted bits are simply discarded, or rounded up, i.e. the unwanted bits are discarded but one is added to the value of the new short word. The rounding process is no longer deterministic because of the added dither which provides a linearizing random component.

The probability density of the pseudo-random sequence is important. Lipshitz et al.25 found that uniform probability density produced noise modulation, in which the amplitude of the random component varies as a function of the amplitude of the samples. A triangular probability density function obtained by adding together two pseudo-random sequences eliminated the noise modulation to yield a signal-independent white-noise component in the least significant bit.

images

Figure 3.72 In a simple digital dithering system, two’s complement values from a random number generator are added to low-order hits of the input. The dithered values are then rounded up or down according to the value of the bits to be removed. The dither linearizes the requantizing.

References

1.    Ray, S.F., Applied Photographic Optics, Ch. 17.   Oxford: Focal Press (1988)
2.    van den Enden, A.W.M. and Verhoeckx, N.A.M., Digital signal processing: theoretical background. Philips Tech. Rev., 42, 110–144 (1985)
3.    McClellan, J.H., Parks, T.W. and Rabiner, L.R., A computer program for designing optimum FIR linear-phase digital filters. IEEE Trans. Audio and Electroacoustics, AU-21, 506–526 (1973)
4.    Dolph, C.L., A current distribution for broadside arrays which optimises the relationship between beam width and side-lobe level. Proc. IRE, 34, 335–348 (1946)
5.    Crochiere, R.E. and Rabiner, L.R., Interpolation and decimation of digital signals – a tutorial review. Proc. IEEE, 69, 300–331 (1981)
6.    Rabiner, L.R., Digital techniques for changing the sampling rate of a signal. In Digital Audio, edited by B. Blesser, B. Locanthi and T.G. Stockham Jr, pp. 79–89, New York: Audio Engineering Society (1982)
7.    Jayant, N.S. and Noll, P., Digital Coding of Waveforms: Principles and applications to speech and video. Englewood Cliffs, NJ: Prentice Hall (1984)
8.    Theile, G., Stoll, G. and Link, M., Low bit rate coding of high quality audio signals: an introduction to the MASCAM system. EBU Tech. Review, No. 230, 158–181 (1988)
9.    Chu, P.L., Quadrature mirror filter design for an arbitrary number of equal bandwidth channels. IEEE Trans. ASSP, ASSP-33, 203–218 (1985)
10.    Fettweis, A., Wave digital filters: theory and practice. Proc. IEEE, 74, 270–327 (1986)
11.    Weiss, P. and Christensson, J., Real time implementation of sub-pixel motion estimation for broadcast applications. IEE Digest, 1990/128
12.    Newman, W.M. and Sproull, R.F., Principles of Interactive Computer Graphics. Tokyo: McGraw-Hill (1979)
13.    Gernsheim, H., A Concise History of Photography. London: Thames and Hudson, 9–15 (1971)
14.    Kraniauskas, P., Transforms in Signals and Systems. Ch. 6.   Wokingham: Addison-Wesley (1992)
15.    Ahmed, N., Natarajan, T. and Rao, K., Discrete cosine transform. IEEE Trans. Computers, C-23, 90–93 (1974)
16.    De With, P.H.N., Data Compression Techniques for Digital Video Recording. PhD Thesis, Technical University of Delft (1992)
17.    Goupillaud, P., Grossman, A. and Morlet, J., Cycle-octave and related transforms in seismic signal analysis. Geoexploration, 23, 85–102, Elsevier Science (1984/5)
18.    Daubechies, I., The wavelet transform, time–frequency localisation and signal analysis. IEEE Trans. Info. Theory, 36, No. 5, 961–1005 (1990)
19.    Rioul, O. and Vetterli, M., Wavelets and signal processing. IEEE Signal Process. Mag., 14–38 (Oct. 1991)
20.    Strang, G. and Nguyen, T., Wavelets and Filter Banks. Wellesly, MA: Wellesley-Cambridge Press (1996)
21.    Limb, J.O. and Murphy, J.A., Measuring the speed of moving objects from television signals. IEEE Trans. Commun., 474–478 (1975)
22.    Thomas, G.A., Television motion measurement for DATV and other applications. BBC Res. Dept. Rept, RD 1987/11 (1987)
23.    Pearson, J.J. et al., Video rate image correlation processor. SPIE, Vol. 119, Application of digital image processing, IOCC (1977)
24.    Vanderkooy, J. and Lipshitz, S.P., Digital dither. Presented at the 81st Audio Engineering Society Convention (Los Angeles, 1986), Preprint 2412 (C-8)
25.    Lipshitz, S.P., Wannamaker, R.A. and Vanderkooy, J., Quantization and dither: a theoretical survey. J. Audio Eng. Soc., 40, 355–375 (1992)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset