20

Memory Considerations When Building a Video Processing Design

Chapter Outline

20.1 The Frame Buffer
20.2 Calculating External Memory Bandwidth Required
20.3 Calculating On-Chip Memory
20.4 Conclusion

Video processing – especially HD video processing – is highly compute- and memory-intensive. When building an HD video-processing signal chain in hardware, whether in an ASIC or an FPGA, you must be aware of both the computation resources and the internal and external memory resources that will be needed.

This chapter uses simple example designs to explain how to calculate memory resources. First we will look at external memory bandwidth, since in any HD video-processing system not all of the pixels required for calculation can be stored on-chip, which makes external DDR memory a very important consideration. To establish this bandwidth we will look at the frame-buffer function and the motion-adaptive deinterlacing function.

We will then assess internal on-chip memory requirements – especially for functions like the video scaler which processes lines of video within a given frame.

20.1 The Frame Buffer

This function is widely used in a range of video-processing signal chains. Frame buffers do exactly what the name suggests – they buffer video frames, not just lines, in an external DDR memory. Buffering is frequently required to match data rates and thus smooth out video burstiness. A frame buffer takes in a frame of video – line by line – and transfers it to external memory. The buffer then reads either that frame, or another frame, from the DDR memory into internal on-chip memory. It may not transfer the entire frame; usually it transfers only the few lines of video that are required for processing.

To implement this functionality a frame buffer needs a writer block and a reader block: the writer stores input pixels in memory, and the reader retrieves video frames from the memory and outputs them. See Figure 20.1.


Figure 20.1 Double and triple buffering.

Figure 20.1 shows a simple implementation of two types of frame buffer – a double frame buffer and a triple frame buffer. As the name suggests, a double frame buffer stores two frames in the external DDR memory, while a triple frame buffer stores three frames in memory.

Let’s look first at the double frame buffer. When double buffering is in use, two frame buffers are used in external RAM. At any time, one buffer is used by the writer component to store input pixels, while the second buffer is used by the reader component that reads pixels from the memory. When both the writer and reader finish processing a frame, the buffers are exchanged. Double buffering is often used when the frame rate is the same at the input and at the output sides, but the pixel rate is highly irregular at one or both sides. For example, if a function is putting out one pixel on each clock edge, but the next function needs nine pixels in order to do its work, you would insert a frame buffer between the two functions. Double buffering is useful in solving throughput issues in the data path and is often used when a frame has to be received or sent in a short period of time compared with the overall frame rate.
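
To make the buffer-swapping behavior concrete, here is a minimal, hypothetical Python sketch of double buffering. A real hardware implementation uses two DDR address ranges and handshaking between the writer and reader components, but the swap logic is the same idea; the class and method names are purely illustrative.

# Minimal double-buffer model: the writer fills one buffer while the reader drains
# the other, and the two are swapped once both have finished a frame.
class DoubleBuffer:
    def __init__(self, frame_size):
        self.buffers = [[None] * frame_size, [None] * frame_size]
        self.write_idx, self.read_idx = 0, 1

    def write_frame(self, pixels):
        # Writer side: store the incoming frame in its current buffer.
        self.buffers[self.write_idx][:len(pixels)] = pixels

    def read_frame(self):
        # Reader side: output the frame held in its current buffer.
        return list(self.buffers[self.read_idx])

    def swap(self):
        # Called when both the writer and the reader have completed a frame.
        self.write_idx, self.read_idx = self.read_idx, self.write_idx

buf = DoubleBuffer(frame_size=4)
buf.write_frame([1, 2, 3, 4])
buf.swap()
print(buf.read_frame())    # [1, 2, 3, 4]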


Figure 20.2 Clipping a video frame before it is passed on to a video scaler.

When frame dropping and frame repeating are not allowed, the frame buffer needs to provide a double-buffering function. However, if frame dropping and/or frame repeating are allowed, the frame buffer provides a triple-buffering function and can be used to perform simple frame-rate conversion.

When triple buffering is in use, three frame buffers are used in external RAM. As in the case of double buffering, the writer and reader components each lock one buffer – the writer to store input pixels to memory and the reader to read output pixels from memory. The third frame buffer is a spare that allows the input and output sides to swap buffers asynchronously. Triple buffering allows simple frame-rate conversion to be performed when the input and output sides are pushing and pulling frames at different rates. In addition, by further controlling the dropping or repeating behavior, the input and output can be kept synchronized.

Let’s look at some examples where you may need to insert a frame buffer. Figure 20.2 shows an example where we are clipping a video frame before passing it on to a video scaler. This is commonly done to zoom into a portion of an image.

First we clip a corner of the video frame. The remaining pixels are no longer valid – only the clipped pixels go on to the scaler. While the frame rate (frames/sec) is the same, the pixel rates at the input of the clipper and at the input of the scaler are different. If the clipper output is passed directly to the scaler, we will have periods of valid pixel data and periods of invalid pixel data – we effectively have a bursty video stream after the clipper.

The bursty video stream would cause the scaler to stall from time to time – and this will impact downstream processing. To mitigate that effect, a double-buffering frame buffer should be inserted between the clipper and the scaler.

Another example is shown in Figure 20.3, where a down-scaler output is connected to a function, such as an alpha blender, that blends two or more video streams. A down-scaler produces fewer pixels than it takes in, so the output of this function is bursty – valid pixels followed by invalid data. A down-scaler is also called a decimation filter, and as a rule these filters produce bursty video data.


Figure 20.3 A down-scaler output connected to an alpha blender.

As before, we should insert a double-buffering frame buffer between the scaler and the alpha blending mixer to minimize the chance of starving all the downstream modules and creating unstable output at the alpha blending mixer.

In fact, 2-D FIR filters, 2-D median filters and deinterlacers (using the bob algorithm) all produce bursty video streams. Depending on the application requirements, a double-buffering frame buffer may need to be connected after each of these functions.

20.2 Calculating External Memory Bandwidth Required

In video-signal chains, the two functions that require the most external memory transactions are the frame buffer and the deinterlacer. The frame-buffer memory transactions are described above. The deinterlacer – especially the motion-adaptive deinterlacer – requires external memory transactions because it compares fields to determine whether there is motion, and those fields and the associated motion values are held in external memory.

In this chapter we will calculate the memory bandwidth for a hypothetical signal chain that includes a frame buffer and a deinterlacer – so that you can appreciate the staggering amount of memory bandwidth required in HD video processing.

Figure 20.4 shows an example video design. We will focus on the two deinterlacers and the two frame buffers, and for each of these components we will calculate the worst-case memory bandwidth.


Figure 20.4 An example video design.

A deinterlacer has to read four fields (for calculation purposes, two frames) and calculate a motion value (vector) for each pixel. This value is then compared with the previous motion value, so the deinterlacer must also read in the previous motion vector and write out the final motion vector.

As shown in Figure 20.5, a motion-adaptive deinterlacer requires five (master) accesses to the DDR memory:

• One field write (@ input rate).

• Two field reads (@ output rate) – four fields in two accesses.

• One motion vector write.

• One motion vector read.


Figure 20.5 A motion-adaptive deinterlacer requires five (master) accesses to the DDR memory.

So to calculate the memory bandwidth demanded by this deinterlacer, we will first assume that this is a simple 4:4:4 video with 10 bits per color plane for each pixel. This means that each pixel requires 30 bits to be represented. The input format to the deinterlacer is fields and the output format is frames.

• Input format: 1080i, 60 fields/sec, 10-bit color
  • 1920 × 1080 × 30 bits × 60/2 = 1.866 Gbit/s

• Output format: 1080p, 60 frames/sec, 10-bit color
  • 1920 × 1080 × 30 bits × 60 = 3.732 Gbit/s

Let’s also assume that the motion vector calculated is represented as a 10-bit value. So there will be one motion vector read and one motion vector write.

• Motion format: only use 10 bits for the motion values:
  • 1920 × 1080 × 10 bits × 60/2 = 0.622 Gbit/s

The total memory access required for the deinterlacer can thus be calculated as:

• 1 × write at input rate: 1.866 Gbit/s.

• 1 × write at motion rate: 0.622 Gbit/s.

• 1 × read at motion rate: 0.622 Gbit/s.

• 2 × read at output rate: 7.464 Gbit/s.

Total: 10.574 Gbit/s ← this is for 4:4:4 video.
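
If you want to reproduce these numbers yourself, the short Python sketch below performs the same calculation. The variable names are purely illustrative, and the small difference from the total above is only rounding (the text rounds each term before summing).

# Worst-case external-memory bandwidth of a motion-adaptive deinterlacer,
# 1080i in / 1080p out, 4:4:4 video with 10 bits per color plane.
WIDTH, HEIGHT = 1920, 1080
OUTPUT_FPS = 60               # output frames/s; the 1080i input carries 60 fields/s
BITS_PER_PIXEL = 30           # 4:4:4, 10 bits per color plane
BITS_PER_MOTION = 10          # one motion value per pixel

pixels_per_frame = WIDTH * HEIGHT

input_write  = pixels_per_frame * BITS_PER_PIXEL * OUTPUT_FPS / 2    # ~1.866 Gbit/s
output_reads = 2 * pixels_per_frame * BITS_PER_PIXEL * OUTPUT_FPS    # ~7.465 Gbit/s
motion_write = pixels_per_frame * BITS_PER_MOTION * OUTPUT_FPS / 2   # ~0.622 Gbit/s
motion_read  = motion_write                                          # ~0.622 Gbit/s

total = input_write + output_reads + motion_write + motion_read
print(f"Deinterlacer total: {total / 1e9:.3f} Gbit/s")                # ~10.57 Gbit/s for 4:4:4

# Setting BITS_PER_PIXEL = 20 reproduces the 4:2:2 exercise that follows.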

Just as an exercise, let’s see what happens if we use 4:2:2 video. The pixels are now represented by 20 bits; the motion values, however, are still 10 bits each.

• Input format: 1080i, 60 fields/sec, 10-bit color
  • 1920 × 1080 × 20 bits × 60/2 = 1.24 Gbit/s

• Output format: 1080p, 60 frames/sec, 10-bit color
  • 1920 × 1080 × 20 bits × 60 = 2.48 Gbit/s

• Motion format: only use 10 bits for the motion values
  • 1920 × 1080 × 10 bits × 60/2 = 0.622 Gbit/s

Memory access:

• 1 × write at input rate: 1.24 Gbit/s.

• 1 × write at motion rate: 0.622 Gbit/s.

• 1 × read at motion rate: 0.622 Gbit/s.

• 2 × read at output rate: 4.96 Gbit/s.

Total: 7.44 Gbit/s ← roughly a 30% reduction in DDR memory bandwidth. This should give you an appreciation of why many video designs chroma subsample, and later upsample, at multiple points in the chain – to conserve precious memory and bandwidth resources.

Since in the majority of designs you will be dealing with 4:2:2 video, we will use the second figure of 7.44 Gbit/s.

Frame buffers are easier to calculate, as they just write a frame and read a frame – the bandwidth of each port is simply 1920 × 1080 × 20 bits × 60 fps ≈ 2.49 Gbit/s, or about 4.98 Gbit/s for the write and read combined.

Now let’s introduce a real-life constraint. We are using 20-bit YCbCr video data sampled in the 4:2:2 scheme, but what if the external memory has a 256-bit data bus? There is a mismatch.

This means that during each burst to the external memory, we are only able to transmit or receive 12 pixels (or 240 bits) total. This indicates that, for each transfer, 16 bits are wasted. Or you could describe this as 1.33 bits wasted for every pixel transmitted/received.

In the same fashion, with 10-bit motion values, only 25 motion values can be transmitted/received during each burst. Therefore, for each transfer, 6 bits are wasted, or 0.24 of a bit is wasted for every motion value transferred/received.
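
The burst-packing arithmetic is easy to check with a couple of lines of Python (again just a sketch with illustrative names):

# Whole samples per 256-bit burst and the bits wasted per burst / per sample.
BUS_WIDTH = 256

def packing(bits_per_sample):
    samples_per_burst = BUS_WIDTH // bits_per_sample
    wasted_per_burst = BUS_WIDTH - samples_per_burst * bits_per_sample
    return samples_per_burst, wasted_per_burst, wasted_per_burst / samples_per_burst

print(packing(20))   # (12, 16, 1.33...) -> 12 pixels per burst, 1.33 bits wasted per pixel
print(packing(10))   # (25, 6, 0.24)     -> 25 motion values per burst, 0.24 bits wasted each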

Therefore we must recalculate the memory bandwidth for both the deinterlacer and the frame buffer, as shown in Table 20.1.

Table 20.1 Calculating the memory bandwidth for the deinterlacer and the frame buffer


After we figure out the penalties for each burst of video data and motion values, we can compute the worst-case external-memory bandwidth requirement for each IP. To do that, we first calculate the memory requirement for the worst-case input/output format and motion values, and then sum the bandwidth requirement over each write/read port. In this case, each motion-adaptive deinterlacer, with the motion-bleed option on, has a worst-case requirement of 7.909 Gbit/s, and each frame buffer requires an external memory bandwidth of 5.308 Gbit/s. The burst penalty does not change the numbers dramatically – but it is an important consideration when designing real-life video signal chains.
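
These per-IP figures follow directly from the "effective" bits per sample implied by the burst packing: 256/12 ≈ 21.33 bits per 20-bit pixel and 256/25 = 10.24 bits per 10-bit motion value. The sketch below assumes those effective widths and reproduces the totals quoted above; the variable names are illustrative.

# Worst-case bandwidth per IP after the 256-bit burst penalty (4:2:2 video, 10-bit motion).
WIDTH, HEIGHT, FPS = 1920, 1080, 60
pixels = WIDTH * HEIGHT

eff_pixel_bits  = 256 / 12    # ~21.33 effective bits per 20-bit pixel
eff_motion_bits = 256 / 25    # 10.24 effective bits per 10-bit motion value

# Motion-adaptive deinterlacer: one field write, two frame reads, one motion write, one motion read.
deinterlacer = (pixels * eff_pixel_bits * FPS / 2          # field write   ~1.327 Gbit/s
                + 2 * pixels * eff_pixel_bits * FPS        # frame reads   ~5.308 Gbit/s
                + 2 * pixels * eff_motion_bits * FPS / 2)  # motion write + read ~1.274 Gbit/s

# Frame buffer: one frame write and one frame read per output frame.
frame_buffer = 2 * pixels * eff_pixel_bits * FPS           # ~5.308 Gbit/s

print(f"deinterlacer: {deinterlacer / 1e9:.3f} Gbit/s")    # ~7.91 (7.909 in Table 20.1)
print(f"frame buffer: {frame_buffer / 1e9:.3f} Gbit/s")    # ~5.308
system = 2 * deinterlacer + 2 * frame_buffer
print(f"system total: {system / 1e9:.1f} Gbit/s")          # ~26.4 (26.434 in Table 20.2)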

Since we have two video paths in the design that we started with, we need two deinterlacers and two frame buffers. The total worst-case system memory bandwidth requirement is 26.434 Gbit/s, as shown in Table 20.2.

Table 20.2 The total worst-case system memory bandwidth requirement

Function Bandwidth (Gbit/s)
2 × deinterlacer 2 × 7.909 = 15.818
2 × frame buffer 2 × 5.308 = 10.616
Total 26.434

If you are using DDR2 memory running at 267 MHz with a 64-bit interface, the theoretical peak memory bandwidth is:

266.67 MHz × 64 bits × 2 (both clock edges used) = 34.133 Gbit/s

This means that for this design you need a memory efficiency of 26.434/34.133 ≈ 77.4%. Efficiency in this context accounts for multiple masters contending to read from or write to the memory; in some cases the deinterlacer’s memory access may be stalled while the frame buffer is using the memory. That would cause the entire processing pipeline to stall, so the memory subsystem has to be designed so that it meets the required efficiency.
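
As a quick check, the same arithmetic in a couple of lines (a sketch, using the figures above):

# Theoretical peak of a 64-bit DDR2 interface at 267 MHz versus the system demand.
peak_bandwidth = 266.67e6 * 64 * 2     # both clock edges used, ~34.13 Gbit/s
system_demand  = 26.434e9              # worst-case requirement from Table 20.2
print(f"required efficiency: {system_demand / peak_bandwidth:.1%}")   # roughly 77%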

20.3 Calculating On-Chip Memory

Video scaling will use the most on-chip memory. This calculation is fairly simple, but depends on the complexity of the scaling function.

For example, an important part of upsampling and downsampling is choosing the appropriate filter kernel. This can help preserve the sharpness of edges during interpolation and avoid aliasing during decimation. Filter response aside, resource usage is another relevant and often-overlooked aspect of the decision-making process. It is important to realize that an Nv × Nh filter kernel translates into Nv + Nh multipliers and Nv line buffers.

This means that if you choose a 4 × 4 filter kernel, you will need enough on-chip memory to store four lines of video. A video line buffer stores all the pixels of a single line in on-chip FPGA memory. The size and configuration of this line buffer depend upon many factors.

As we have seen before, each pixel is generally represented by three color planes – RGB, YCrCb, etc. Typically each color plane in turn is encoded using 8, 10 or even 12 bits. We also have to factor chroma sub-sampling into our calculations.

The number of bits required to store one line of video depends on multiple color space variables as shown in Table 20.3.

Table 20.3 The number of bits required to store one line of video


To minimize the number of memory blocks, it is extremely important to choose the right memory configuration. Since high definition is appearing in all facets of the video market, a line-buffer size of 1920 pixels (a typical HD resolution being 1920 × 1080) must be considered. With 4:2:2 chroma subsampling and 10-bit color, each pixel requires 20 bits.

The ideal configuration of a memory block when implementing a 1080p HD video line buffer would therefore be 20 bits wide and 1920 words (about 2K) deep.

Altera FPGAs have M9K RAM blocks that are designed to accommodate HD video. Each RAM block can be configured as 2K × 4 bits. Figure 20.6 shows that placing five of these blocks in parallel enables a video line buffer with a memory-bit efficiency of 93.75% (1920/2048) to be implemented easily.


Figure 20.6 Five M9K RAM memory blocks in parallel.
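
A final sketch ties the line-buffer sizing together. It assumes the 2K × 4 block-RAM configuration described above; other block-RAM configurations would change the block count and efficiency, and the function name is illustrative.

import math

# On-chip storage for one 1920-pixel HD video line built from 2K x 4 memory blocks.
LINE_PIXELS = 1920
BLOCK_DEPTH, BLOCK_WIDTH = 2048, 4          # one M9K configured as 2K x 4

def line_buffer(bits_per_pixel):
    bits_per_line = LINE_PIXELS * bits_per_pixel
    blocks_in_parallel = math.ceil(bits_per_pixel / BLOCK_WIDTH)
    depth_efficiency = LINE_PIXELS / BLOCK_DEPTH
    return bits_per_line, blocks_in_parallel, depth_efficiency

print(line_buffer(20))   # 4:2:2, 10-bit: 38,400 bits, 5 blocks, 93.75% depth efficiency
print(line_buffer(30))   # 4:4:4, 10-bit: 57,600 bits, 8 blocks

# A 4 x 4 scaler kernel needs four such line buffers, i.e. 20 blocks for 4:2:2 10-bit video.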

When selecting FPGAs for HD video processing, the available configuration options of the embedded memory blocks will determine the number of video line buffers that can be implemented. A well-planned and flexible block RAM configuration leads to high bit-efficiency and will allow the design to fit into a smaller and more economical device.

20.4 Conclusion

When implementing HD video processing designs, memory resources are generally the more important consideration, given how memory-intensive video designs are. This chapter gives you a sense of what to expect both in terms of internal on-chip memory and also external DDR memory.
