21

Debugging FPGA-based Video Systems

Andrew Draper

Altera video engineering

Chapter Outline

In this chapter we will discuss some of the strategies you can use for debugging a video system built in an FPGA. The examples use Altera’s video debugging tools and methodology, although the concepts can be applied generally.

Before moving on to the video-specific parts of debugging it is worth checking that the design has synthesized correctly and has passed a number of basic sanity checks.

21.1 Timing Analysis

Hardware designs that run from a clock need to meet a number of timing constraints. The two most basic of these exist to prevent errors if a signal changes while it is being sampled by a register:

• The input to a register must be stable for a time before the clock edge on which it is sampled – referred to as the setup time.

image

Figure 21.1 Setup and hold times.

• The input to a register must remain stable for a time after the clock edge on which it is sampled – referred to as the hold time.

Most signals originate from registers in the same clock domain, the outputs of which change just after the clock edge (i.e. there is a delay going through the register). There are also delays while the signals pass through combinational logic, and further delays if the signals need to be routed across the chip to their destination. The sum of these delays is known as the propagation delay.

The mathematical relationship between the delays is expressed by the following two equations which must be satisfied for all paths within the chip:

propagation delay + setup time ≤ clock period

propagation delay ≥ hold time
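These two inequalities can be checked numerically. The following sketch (in Python rather than a vendor timing tool, with made-up delay values) computes the setup and hold slack for a single register-to-register path; both slacks must be non-negative for the path to meet timing:

```python
def timing_slack(prop_delay_ns, setup_ns, hold_ns, clock_period_ns):
    """Return (setup_slack, hold_slack) for one register-to-register path.

    Both slacks must be >= 0 for the path to meet timing:
      setup: prop_delay + setup <= clock_period
      hold:  prop_delay >= hold
    """
    setup_slack = clock_period_ns - (prop_delay_ns + setup_ns)
    hold_slack = prop_delay_ns - hold_ns
    return setup_slack, hold_slack

# Illustrative values: a 150 MHz clock has a period of ~6.67 ns.
setup_slack, hold_slack = timing_slack(
    prop_delay_ns=5.2, setup_ns=0.8, hold_ns=0.3, clock_period_ns=6.67)
meets_timing = setup_slack >= 0 and hold_slack >= 0
```

The numbers here are invented for the example; a real tool evaluates these inequalities for every path on the chip, at several delay corners.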

There are more complex timing issues when signals cross from one clock domain to another but these are usually handled by specially designed library components.

A hardware design where these equations are satisfied for all signals on the chip is said to meet timing. A design which does not meet timing will usually fail in subtle and unexpected ways so further debugging is not usually productive.

21.1.1 Check that the Design Meets Timing

During synthesis the layout tool will place the logic within the chip and then run a timing analysis to check that the design meets the setup and hold requirements of the chip that will implement it. If these requirements are not met, the tool will adjust the layout and run the timing analysis again, continuing until the analysis passes or the tool gives up and reports the failing paths.

For the timing analysis stage the designer must provide scripts to tell the tool what timing behavior is required. These scripts are written by the hardware designer and shipped with the library component (if you write your own hardware with multiple clock domains then you will need to provide these scripts yourself). If the scripts are incorrect, or if the clock speeds they specify are lower than the actual speed of the clock, then the design can fail on real hardware even though the timing analysis reports a pass.

A timing failure in one part of the circuit can cause problems elsewhere in the design, because if one part of the design fails to meet timing then the tool will stop rearranging the design throughout the chip. It will report the errors that have caused it to stop processing, but may suppress errors for other areas of the design which have not been completely processed. Thus a timing failure in one part of the chip is said to hide failures elsewhere in the design.

The propagation delay varies with several factors:

• The temperature of the silicon within the chip: recent chips run fastest at room temperature and slowest at the top and bottom ends of their temperature range.

• Manufacturing variations can change the propagation delay between one batch of chips and the next. Manufacturers partly deal with this by measuring the speed of chips after production and assigning a higher speed grade (and price) to those with lower propagation delays, but there is still a small variation within each speed grade.

• Small changes in the supply voltage: tolerances within the power supply components allow for differences between the supply voltages from one board to another.

The timing analysis tool will usually check the timing multiple times with different timing models – for example, it will check both the maximum and minimum propagation delays to cover the temperature and manufacturing variations.

All timing models for a design must pass before it can be used in a production system. A classic field diagnostic, used when timing passes only at lower temperatures, is a liberal application of freezer spray to the chip: this can make a design work for a minute or two – often long enough to indicate that timing is the cause of failures.

21.1.2 Fix Your Design if it Does Not Meet Timing

Here we will be referring back to the two basic timing equations above. In most chips the propagation delay is significantly larger than the hold time, so the first equation is the harder to satisfy. This equation can always be satisfied by decreasing the clock frequency (increasing the clock period) but this is usually unsatisfactory, especially for video designs where a minimum clock frequency is required to process all the pixels in a frame.

Another method is to buy a chip at a faster speed grade. Unfortunately this has cost implications, and it is not an option when previously shipped products need to be upgraded – the hardware is already in the field.

Other methods of making a design meet timing include: inserting buffer registers to reduce the length of combinational paths; changing the layout to place critical registers closer to each other; and reducing the fan out of signals (which can increase switching speed and make the layout simpler).

If the timing tool reports hold-time violations that reducing the clock speed will not fix, design changes are required. Refer to FPGA Design: Best Practices for Team-based Design (Simpson), ISBN-13: 978-1441963383.

If you are using library components to create your design then the component designer will have already considered these issues and may have included parameters which help their component meet timing (usually in exchange for an increase in size or latency).

Many libraries include components called pipeline bridges (or similar names) which can be used to easily insert buffer registers into all the signals of a bus without affecting its behavior.

21.2 The SystemConsole Debugger

As we will use SystemConsole as an example of a tool running on a debug host we will now provide a basic introduction.

Debug tools usually refer to the system being debugged as the target – the system which you use to debug the target is the debug host. The host will be connected to the target via one or more debug cables (nowadays these are normally JTAG, USB or Ethernet – though debugging over other media is possible).

To enable debugging, the system designer places debug agents within the target system. These agents are sometimes packaged within other components – for example, most processor components now contain a debug module – or they can be explicitly instantiated by the system designer.

Those debug agents that use a JTAG interface to communicate with the host are automatically connected to the JTAG pins on the device by the Quartus software. In the current Altera software, debug agents using other cables (USB and Ethernet) must be explicitly connected to the pins on the device.

image

Figure 21.2 Clock sense indication.

21.3 Check That Clocks and Resets are Working

Incorrectly functioning clocks or resets are a common cause of design failures, which should be ruled out early in the debug process – even experienced engineers have wasted hours of time debugging apparently failed systems where the clock has been disabled or the wire supplying the clock signal from a test device to the board has been knocked off.

Other causes of clock failures include phase-locked loops (PLLs) which are unable to lock because their input signal has too much jitter or is outside the acceptable range of input frequencies.

Reset signals can also become stuck – either holding part of the design in reset permanently or never resetting it. If a design is not reset then it does not start in a consistent state, and may get into a state that its designer did not intend. Sometimes a design will get out of these unusual states and sometimes it will become stuck.

FPGA designs with reset faults sometimes work because the configuration logic within the FPGA sets most registers to their defined reset state at the end of configuration.

Most debug tools, for example the Altera SystemConsole tool, provide ways to check that clocks are running and resets are behaving correctly. In SystemConsole the explorer window shows a green clock badge on nodes that have a running clock and a red clock badge (with associated tooltip) on nodes which can sense the clock but do not detect it running.

It also provides the jtag_debug service to give scripted access to the clock sensing hardware. The TCL below shows an example of its use:

set jd [lindex [get_service_paths jtag_debug] 0]

open_service jtag_debug $jd

puts "Clock running: [jtag_debug_sense_clock $jd]"

puts "Reset status: [jtag_debug_sample_reset $jd]"

21.4 Clocked and Flow Controlled Video Streams

As you have read in earlier chapters, most digital video protocols send video frames between boards using a clock and a series of synchronization signals. This is simple to explain but it is an inefficient way to communicate within a device, as all processing modules need to be ready to process data on every clock cycle within the frame, but will be idle during the synchronization intervals.

Using a flow-controlled interface is more flexible because it simplifies processing blocks and allows them to spread the data processing over the whole frame time. Flow-controlled interfaces provide a way to control the flow of data in both directions – the source can indicate on which cycles there is data present, and the sink can apply backpressure when it is not ready to accept data. In the Avalon-ST flow-controlled interface the valid signal indicates that the source has data and the ready signal indicates that the sink is able to accept it (i.e. is not backpressuring the source).
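The valid/ready handshake can be modelled in a few lines of Python. This is a simplified sketch of the handshake only – the full Avalon-ST specification also defines start/end-of-packet and other signals:

```python
def count_transfers(valid, ready):
    """Count beats transferred on a valid/ready interface.

    A beat moves only on cycles where the source asserts `valid`
    AND the sink asserts `ready` (i.e. it is not backpressuring).
    """
    return sum(1 for v, r in zip(valid, ready) if v and r)

# Source has data on cycles 0, 1 and 3; sink backpressures on cycle 1,
# so transfers happen on cycles 0 and 3 only.
valid = [1, 1, 0, 1]
ready = [1, 0, 1, 1]
```

Note that neither side alone determines when data moves: the transfer happens only when both parties agree, which is what makes the interface safe to stall from either end.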

If you are building a system from library components, most problems will occur when converting from clocked-video streams to flow-controlled video streams, and vice versa.

21.5 Debugging Tools

Several debugging tools are available, the most basic of which are an oscilloscope, a logic analyzer or (within an FPGA) an embedded logic analyzer (such as Altera’s SignalTap tool). These tools provide a high-resolution view of the data being transferred on a number of signals.

If you have data integrity issues between boards, then low-level debugging tools such as these can be used to diagnose the problem. Unfortunately, once the signals between boards or within devices are clean, these tools typically provide too much data to diagnose the types of problems that appear at higher levels.

Higher-level debug tools provide a way to trace the data passing through the system and display it as video packets. The amount of data in a video system is more than can easily be transferred, so it must be compressed to allow it to be transferred to the debug host and analysed.

The highest level of compression can be achieved by ignoring most of the pixel values and only transferring control packets and statistics about the data flow – for example, a count of the number of clock cycles where data was transferred, was not available to transfer from the source or was back-pressured by the sink.
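As an illustration of this compression, every cycle can be classified into one of three buckets and only the counts shipped to the host. The bucket names below are illustrative, not the Altera trace system’s actual record format:

```python
def cycle_stats(valid, ready):
    """Summarize a flow-controlled interface as three per-window counters:
    transferred, source-starved (no data available), and backpressured."""
    stats = {"transferred": 0, "no_data": 0, "backpressured": 0}
    for v, r in zip(valid, ready):
        if v and r:
            stats["transferred"] += 1
        elif not v:
            stats["no_data"] += 1        # source had nothing to send
        else:
            stats["backpressured"] += 1  # sink refused data
    return stats
```

Three small counters per window replace thousands of pixel values, yet they still answer the key debugging question: is this interface limited by the source, the sink, or neither?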

The Altera trace system is instantiated when you are building a video design within the QSYS environment. Two parts are needed: a trace monitor component for each interface to be traced and a trace system component which transfers trace data packets to the host.

A video trace monitor component needs to be inserted into each video stream you want to monitor. This component is non-intrusive – it has no effect on the video data going through the stream. The video trace monitor component reads the signals being transmitted and sends summaries to the trace system. You will need to parameterize the video trace monitor to match the type of data being sent and to match the trace system data-width.

The trace system component takes the reports from the trace monitors and buffers them before sending the results to the host over JTAG or USB, where they are reconstructed for the user. You will need to parameterize this component to select the type of connection to the host, the number of monitors, the buffer size, etc.

image

Figure 21.3 Trace monitors and the trace system.

image

Figure 21.4 Example of decoded trace output.

The SystemConsole host application decodes and displays the received packets to show the data as it passes through the system. Each video packet is displayed as one line in the display. The sections below describe common video errors and how to recognize them in the trace output.

Debug tools are also available which allow the debug host to access memory mapped slaves within the target. The Altera JTAG Avalon Master and USB Debug Master components are explicitly designed to do this: if you do not have such a component available then most processor debuggers can be used in a similar way.

21.6 Converting from Clocked to Flow-controlled Video Streams

In a functioning system the input to the flow-controlled domain will send data as it becomes available. The system needs to transfer, on average, one line’s worth of pixels in each line scan time. The transfer of data will normally be controlled by “valid”, with “ready” de-asserted only occasionally, so that “valid” selects the cycles on which data is transferred.

The number of cycles on which “valid” is asserted depends on the ratio between the screen resolution and the clock rate in the flow-controlled domain. If the clock is just sufficient for the highest resolution then “valid” will be asserted on most cycles within the main part of the frame. At lower resolutions “valid” will only be asserted on a proportion of the cycles.
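The expected proportion of “valid” cycles is simple arithmetic: the active pixel rate divided by the flow-controlled clock rate. A Python sketch with illustrative numbers (one pixel per beat, blanking ignored):

```python
def expected_valid_fraction(width, height, fps, clock_hz):
    """Fraction of clock cycles on which `valid` should be asserted,
    averaged over a frame (active pixels only, one pixel per beat)."""
    pixel_rate = width * height * fps
    return pixel_rate / clock_hz

# With an illustrative 150 MHz flow-controlled clock:
frac_1080p = expected_valid_fraction(1920, 1080, 60, 150e6)  # high duty cycle
frac_720p = expected_valid_fraction(1280, 720, 60, 150e6)    # much lower
```

Comparing the fraction the trace actually reports against this expectation is a quick sanity check: a much lower measured figure suggests the source is being starved or backpressured.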

The “ready” signal to the clocked video input should not be the main source of flow control on the frame, so it is typically de-asserted only for short periods to synchronise with the sink. One common problem is that if “ready” is de-asserted for too long then the memory buffer in the video input block can overflow.

Attaching a streaming video monitor to the output of the video input block can help detect overflow situations – if the video input block is backpressured (by de-asserting “ready”) for too long then it will abandon the backpressured frame and send a short packet. This can be seen on the trace.

The trace also reports the number of not-ready cycles within each packet and the time interval between packets. This can be used to check that the interface is being mostly flow-controlled by “valid” rather than “ready”.
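One way to reason about when the input buffer overflows: the clocked-video source cannot be stalled, so pixels accumulate in the FIFO on every backpressured cycle. A simplified Python model (one pixel arriving per active cycle, illustrative FIFO depth):

```python
def fifo_overflows(ready, depth):
    """Model a clocked-video input FIFO: one pixel arrives every cycle,
    one pixel leaves on each cycle where the sink asserts `ready`.
    Returns True if occupancy ever exceeds `depth` (overflow)."""
    occupancy = 0
    for r in ready:
        occupancy += 1          # the clocked video source never stops
        if r and occupancy > 0:
            occupancy -= 1      # sink drains one pixel
        if occupancy > depth:
            return True
    return False

# A 16-entry FIFO survives a 10-cycle stall but not a 20-cycle one.
```

The model makes the failure condition concrete: overflow occurs whenever the longest run of de-asserted “ready” exceeds the FIFO depth, which is why short stalls are harmless but sustained backpressure is not.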

If the clocked-video input block has a control port then the debug master can be used to check the overflow sticky bit in the status register. This bit will be set if there has been an overflow since it was last checked – note that if you have software monitoring and clearing this bit then reading it from the debugger will not be reliable.

21.7 Converting from Flow-controlled to Clocked Video Streams

The clocked-video output component converts flow-controlled video packets into a clocked video signal. The flow control on the input to this component is controlled by the “ready” signal, which essentially pulls data out of the interface as it is needed.

If the source is unable to provide data at a sufficient rate then the FIFO in this component will empty. This is referred to as underflow. At this point the component tries to re-synchronize, sending out blank video data and reading continuously from the input until the start of the new frame appears, when it will re-start the output video.
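The underflow condition is the mirror image of the overflow case at the input: the clocked output must emit a pixel on every active cycle, so the FIFO drains whenever the source fails to supply data. A simplified Python model (one pixel leaving per cycle, illustrative parameters):

```python
def fifo_underflows(valid, depth, prefill):
    """Model the clocked-video output FIFO: one pixel leaves every cycle,
    one arrives on each cycle where the source asserts `valid`.
    Returns True if the FIFO ever runs empty (underflow)."""
    occupancy = prefill
    for v in valid:
        if v and occupancy < depth:
            occupancy += 1      # source delivers a pixel
        if occupancy == 0:
            return True         # nothing to send on this cycle: underflow
        occupancy -= 1          # the clocked output never stops
    return False

# With 4 pixels of prefill, a steady source is fine but a 5-cycle gap is not.
```

This is why such components prefill their FIFO before starting a frame: the prefill is the budget of source stall cycles the output can ride through.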

The clocked-video output component latches the underflow indication – the underflow sticky-bit is set when an underflow occurs. You can use the debug master or software on an embedded processor to check this bit. As with the overflow bit in the clocked-video input, if embedded software is monitoring and resetting the bit then reading it from the debugger will not be reliable.

The video trace monitor can also indicate when there are problems with underflow. Normally the stream going into the clocked-video output is controlled by “ready”, but if there is a problem with underflow then “ready” will not be de-asserted during the re-synchronization process. The resulting lack of backpressure is visible in the captured video packet summaries.

21.8 Free-running Streaming Video Interfaces

The clock rate within a flow-controlled video system is normally set to provide sufficient bandwidth on the streaming ports for a picture of maximum resolution to be transmitted (with a small amount of overhead to allow for jitter).

The flow control signals – “ready” near the video output or “valid” near the video input – ensure that processing does not run faster than the incoming video stream. If processing runs too far ahead then frames will be missed and the picture will be jerky.

This can happen if the design has instantiated multiple triple-buffer components. Triple buffers do not flow-control their inputs or their outputs (except temporarily when waiting for memory accesses). A video pipeline between two triple buffers will run at the processing clock speed rather than staying in sync with the video frames.

If part of the video pipeline is allowed to free run then this will waste memory bandwidth. It can also reduce picture quality as the input triple buffer will duplicate frames to keep its output busy while the output triple buffer will delete frames to match the frame rate on the output. The overall effect will be that some frames are output multiple times while other frames are not output at all.

The solution to the free-running problem is to replace all but one of the triple buffers with a double-buffer component. The double buffer does no frame rate conversion so will not allow its input and output to run more than one frame apart. This will provide flow control to the central part of the system.

The video trace monitor can also be used to detect free-running streaming video components. Examining the flow-control statistics will report that there is no backpressure or unavailable data – i.e. “ready” and “valid” will be high for most of the frame.

The timing information on the captured video packets reports the average frame rate passing through the monitor. If the streaming video interface is free running then the frame rate in parts of the video pipeline will be much faster than expected.
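A sketch of how such a check might look, using the per-packet timing from the trace (the 10% threshold is an illustrative choice, not a tool default):

```python
def is_free_running(frame_intervals_us, expected_fps, tolerance=0.1):
    """Flag a monitored stream as free-running if its measured frame rate
    is well above the expected display rate (here, more than 10% fast)."""
    avg_interval_s = sum(frame_intervals_us) / len(frame_intervals_us) / 1e6
    measured_fps = 1.0 / avg_interval_s
    return measured_fps > expected_fps * (1 + tolerance)

# A 60 fps stream should show ~16667 us between frames; a free-running
# pipeline between two triple buffers shows much shorter intervals.
```

A free-running section of the pipeline typically shows a frame rate limited only by the processing clock, so the discrepancy is usually large and easy to spot.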

21.9 Insufficient Memory Bandwidth

Some video processing components, such as a color space converter, can process the video data one pixel at a time. Others need to store the pixels between input and output – the simplest examples are the buffer components that write the input pixels to a frame buffer in memory and read from memory (with different timing) to create the output pixel stream.

Components using a frame buffer demand a large amount of memory bandwidth – the sum of the bandwidth of the input and output data rates. If the memory subsystem is not designed correctly then it will not be able to provide this bandwidth. This will cause excessive flow control of the input and/or output which in turn will make FIFOs in other components overflow or underflow as described previously.
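The bandwidth demand is worth working out explicitly. A sketch for a single frame buffer, with illustrative numbers (one write and one read stream, blanking and burst overheads ignored; the DDR figures are assumptions for the example, not a specific device’s):

```python
def frame_buffer_bandwidth_bytes(width, height, fps, bytes_per_pixel):
    """Memory bandwidth a simple frame buffer needs: the input stream is
    written once and the output stream read once, so the total is
    2 x (pixels per second x bytes per pixel)."""
    stream = width * height * fps * bytes_per_pixel
    return 2 * stream

# 1080p60 at 3 bytes/pixel needs ~746 MB/s in total.
demand = frame_buffer_bandwidth_bytes(1920, 1080, 60, 3)

# Compare against what the memory can deliver at a given arbiter
# efficiency, e.g. an assumed 32-bit interface at 400 MHz double data rate.
theoretical = 400e6 * 2 * 4   # bytes/s: clock x 2 (DDR) x 4 bytes wide
usable = theoretical * 0.90   # 90% efficiency from a well-tuned arbiter
enough = usable >= demand
```

Running this arithmetic early, for the worst-case resolution and every frame buffer in the design, avoids discovering a bandwidth shortfall only when FIFOs start overflowing in hardware.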

Because of their size, most frame buffers are stored in external memory, which is usually shared between multiple different memory-mapped masters. Even in the case where memory is not shared, a double- or triple-buffer component has two masters, one to write and the other to read.

When there are multiple masters for the same memory-mapped slave, an arbiter is needed to share the slave’s bandwidth between the masters. In some cases the arbiter is inserted automatically as part of the bus fabric – in other cases it is explicitly inserted by the user as a separate component, or as part of the slave component.

The Altera multi-port front-end component is a specialized arbiter which understands the costs of different DDR accesses and can be configured to maximize bus efficiency. When used correctly this component can achieve memory-bandwidth efficiency of over 90% – i.e. the number of cycles lost due to bank opens, closes, read-after-write delays and other DDR performance hazards is less than 10%.

Setting up the arbiter to achieve high efficiency is sometimes complex, as the interface priorities need to be set correctly so that low-latency masters are serviced quickly. Most video component masters will use only as much bandwidth as is needed for the selected video resolution, although when free-running (as described above) they will use as much bandwidth as is available – possibly locking out lower-priority masters from the memory.

A processor master does not normally have a bandwidth limit – it will consume as much bandwidth as it can. However, most processors are only able to pipeline a limited number of memory accesses, so the latency of the memory limits the bandwidth they can actually use. Processors are normally put at the lowest priority to prevent them from starving video masters, which have a bandwidth target.

Many arbiters, including those supported by the Altera external memory interface toolkit, have an optional efficiency monitoring feature which collects statistics about the bandwidths and latencies seen by different masters. This efficiency monitor can be used to check that the memory is running at a sufficiently high overall bandwidth, and can help with optimization when it is not.

21.10 Check Data Within Stream

During the prototype stage most components have bugs that must be fixed. The usual hardware flow is to fix these bugs in simulation, where the visibility into the system is good.

This is harder for video components as the high data rates mean that complex components can take several minutes to simulate each frame. For edge-case bugs, which occur once every few hours on video data, this would mean many days of simulation before a bug occurs. These bugs are only really debuggable in hardware.

Most debug components, including the Altera trace system, can be set up to continuously capture data into a circular buffer. When the trace system is triggered it stops capturing data, sending its stored data to the host for analysis. Ignoring the activity of the system significantly before the trigger lets you concentrate on the immediate causes of the bug, rather than having to wade through large amounts of captured data.
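The capture-until-trigger behavior can be modelled with a bounded circular buffer. A Python sketch (real trace systems can also keep post-trigger samples; this simplified version stops at the trigger):

```python
from collections import deque

def capture_around_trigger(samples, is_trigger, depth):
    """Continuously capture into a circular buffer of `depth` samples and
    stop when the trigger fires, returning the pre-trigger history.
    A simplified model of a trace system / embedded logic analyzer."""
    buf = deque(maxlen=depth)   # old samples fall off the front automatically
    for s in samples:
        buf.append(s)
        if is_trigger(s):
            return list(buf)    # history leading up to and including the trigger
    return None                 # trigger never fired
```

The circular buffer is what makes hour-long captures feasible: memory use is fixed at `depth` samples no matter how long the system runs before the bug appears.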

What drives the trigger signal? For hard-to-find bugs you might write custom hardware, which monitors various parts of the system and sends a trigger when misbehavior is detected. This is difficult, can be error-prone and is not always necessary.

Most component vendors ship bus protocol monitors that are used in simulation to check that the signals on a bus do not violate the specification. For example, many memory-mapped buses require that once an access has started the address signals remain stable until the access is accepted by the slave. A master that changes the address lines halfway through its transaction will be detected by the monitor.
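This address-stability rule can be expressed as a small checker. A simplified Python model (single outstanding access, no pipelining – real bus monitors check many more rules than this one):

```python
def address_stable_violations(transactions):
    """Check the rule that a master must hold the address stable from the
    cycle an access starts until the slave accepts it.

    `transactions` is a list of (address, accepted) tuples, one per cycle.
    Returns the cycle indices where the address changed before acceptance.
    """
    violations = []
    start_addr = None
    for i, (addr, accepted) in enumerate(transactions):
        if start_addr is None:
            start_addr = addr            # an access starts this cycle
        elif addr != start_addr:
            violations.append(i)         # address moved mid-access
        if accepted:
            start_addr = None            # access complete; a new one may start
    return violations
```

A synthesizable monitor implementing this check would raise its error output on any violating cycle, which is exactly the signal to wire to the trace system's trigger input.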

Bus monitors are used extensively during simulation and some of them are now synthesizable, so they can be temporarily included within an FPGA design. Connecting the error output of the monitor to the trigger input of a trace system, or embedded logic analyzer, will let the user capture the events leading up to, and just after, the error.

In streaming video systems, application-aware bus monitors can also detect higher-level errors: for example, a component which outputs data packets that do not match the size described in the preceding control packet can be logged and/or reported as an error.

The video trace system will also show legal edge cases which might trigger a bug – for example, two control packets preceding a data packet is legal (the second control packet takes priority) but is not handled correctly by some components.

21.11 Summary

Debugging video systems can be daunting, especially when the only visible symptom is the output of a black picture.

Many trace components are available to provide visibility into the system and narrow down the location of the bug which is causing the symptom. Careful use of these components can save significant time during development.

In some cases the trace components can be left active in shipped systems. Remote debugging can then be used on units running real data – this can be especially valuable when the bug is triggered by almost-standard data generated by other equipment that is installed at only one broadcaster, in one distant country.
