Chapter 15
Pipeline Optimization
“We should forget about small efficiencies, say about 97% of the
time: Premature optimization is the root of all evil.”
—Donald Knuth
As we saw in Chapter 2, the process of rendering an image is based on a
pipelined architecture with three conceptual stages: application, geometry,
and rasterizer. At any given moment, one of these stages, or the commu-
nication path between them, will always be the bottleneck—the slowest
stage in the pipeline. This implies that the bottleneck stage sets the limit
for the throughput, i.e., the total rendering performance, and so is a prime
candidate for optimization.
Optimizing the performance of the rendering pipeline resembles the
process of optimizing a pipelined processor (CPU) [541] in that it consists
mainly of two steps. First, the bottleneck of the pipeline is located. Second,
that stage is optimized in some way; and after that, step one is repeated if
the performance goals have not been met. Note that the bottleneck may
or may not be located at the same place after the optimization step. It is
a good idea to put only enough effort into optimizing the bottleneck stage
so that the bottleneck moves to another stage. Several other stages may
have to be optimized before this stage becomes the bottleneck again. For
this reason, effort should not be wasted on over-optimizing a stage.
The location of the bottleneck may change within a frame. At one mo-
ment the geometry stage may be the bottleneck because many tiny triangles
are rendered. Later in the frame the rasterizer could be the bottleneck be-
cause triangles covering large parts of the screen are rendered. So, when
we talk about, say, the rasterizer stage being the bottleneck, we mean it is
the bottleneck most of the time during that frame.
Another way to capitalize on the pipelined construction is to recognize
that when the slowest stage cannot be optimized further, the other stages
can be made to work just as much as the slowest stage. This will not change
performance, since the speed of the slowest stage will not be altered, but the
extra processing can be used to improve image quality. For example, say
that the bottleneck is in the application stage, which takes 50 milliseconds
(ms) to produce a frame, while the others each take 25 ms. This means
that without changing the speed of the rendering pipeline (50 ms equals
20 frames per second), the geometry and the rasterizer stages could also
do their work in 50 ms. For example, we could use a more sophisticated
lighting model or increase the level of realism with shadows and reflections,
assuming that this does not increase the workload on the application stage.
Pipeline optimization is a process in which we first maximize the ren-
dering speed, then allow the stages that are not bottlenecks to consume as
much time as the bottleneck. That said, this idea does not apply to newer
architectures such as the Xbox 360, which automatically load-balance com-
putational resources (more on this in a moment).
This exception is an excellent example of a key principle. When reading
this chapter, the dictum
KNOW YOUR ARCHITECTURE
should always be in the back of your mind, since optimization techniques
vary greatly for different architectures. A related dictum is, simply, “Mea-
sure.”
15.1 Profiling Tools
There are a number of worthwhile tools available for profiling use of the
graphics accelerator and CPU. Such tools are useful both for locating bot-
tlenecks and for optimizing. Examples include PIX for Windows (for Di-
rectX), gDEBugger (for OpenGL), NVIDIA’s NVPerfKit suite of tools,
ATI's GPU PerfStudio [1401], and Apple's OpenGL Profiler.
As an example, PIX for Windows provides real-time performance eval-
uation by providing counters for a wide variety of data, such as the number
of draw calls, state changes, texture and shader calls, CPU and GPU idle
time, locks on various resources, read and write I/O, the amount of mem-
ory used, etc. This data can be displayed overlaid on the application itself.
Figure 15.1 was rendered with this technique.
PIX can capture all the DirectX calls made within a given frame for
later analysis or playback. Examining this stream can show whether and
where unnecessary API calls are being made. PIX can also be used for
pixel debugging, showing the frame buffer history for a single pixel.
While these tools can provide developers with most of the information
they need, sometimes other data is needed that does not fit the mold.
Pelzer [1001] presents a number of useful techniques to display debugging
information.
Figure 15.1. PIX run atop a DirectX program, showing HUD information about the
number of draw calls performed. (Image from ATI’s Chimp Demo with Microsoft’s
PIX overlaid.)
15.2 Locating the Bottleneck
The first step in optimizing a pipeline is to locate the bottleneck. One
way of finding bottlenecks is to set up a number of tests, where each test
decreases the amount of work a particular stage performs. If one of these
tests causes the frames per second to increase, the bottleneck stage has
been found. A related way of testing a stage is to reduce the workload on
the other stages without reducing the workload on the stage being tested.
If performance does not change, the bottleneck is the stage where the work-
load was not altered. Performance tools can provide detailed information
on what API calls are expensive, but do not necessarily pinpoint exactly
what stage in the pipeline is slowing down the rest. Even when they do,
such as the “simplified experiments” provided by NVPerfAPI, it is impor-
tant to understand the idea behind each test.
What follows is a brief discussion of some of the ideas used to test
the various stages, to give a flavor of how such testing is done. There
are a number of documents available that discuss testing and optimiza-
tions for specific architectures and features [164, 946, 1065, 1400]. A per-
fect example of the importance of understanding the underlying hard-
ware comes with the advent of the unified shader architecture. The Xbox
360 (see Section 18.4.1) uses this architecture, and it forms the basis of
high-end cards from the end of 2006 on. The idea is that vertex, pixel,
and geometry shaders all use the same functional units. The GPU takes
care of load balancing, changing the proportion of units assigned to ver-
tex versus pixel shading. As an example, if a large quadrilateral is ren-
dered, only a few shader units could be assigned to vertex transformation,
while the bulk are given the task of fragment processing. For this ar-
chitecture, pinpointing whether the bottleneck is in the vertex or pixel
shader stage is moot [1400]. Either this combined set of stages or another
stage will still be the bottleneck, however, so we discuss each possibility
in turn.
15.2.1 Testing the Application Stage
If the platform being used is supplied with a utility for measuring the
workload on the processor(s), that utility can be used to see if your program
uses 100 percent (or near that) of the CPU processing power. For Windows
there is the Task Manager, and for Macs, there is the Activity Monitor. For
Unix systems, there are usually programs called top or osview that show
the process workload on the CPU(s). If the CPU is in constant use, your
program is CPU-limited. This is not always foolproof, since you may be
waiting for the hardware to complete a frame, and this wait operation is
sometimes implemented as a busy-wait. Using a code profiler to determine
where the time is spent is better. AMD has a tool called CodeAnalyst for
analyzing and optimizing the code run on their line of CPUs, and Intel has
a similar tool called VTune that can analyze where the time is spent in the
application or in a driver. There are other places where time can go, e.g.,
the D3D runtime sits between the application and the driver. This element
converts API calls to device-independent commands, which are stored in a
command buffer. The runtime sends the driver these commands in batches,
for efficiency.
A smarter way to test for CPU limits is to send down data that causes
the other stages to do little or no work. For some systems this can be
accomplished by simply using a null driver (a driver that accepts calls but
does nothing) instead of a real driver [946]. This effectively sets an upper
limit on how fast you can get the entire program to run, because you do not
use the graphics hardware, and thus, the CPU is always the bottleneck. By
doing this test, you get an idea on how much room for improvement there
is for the stages not run in the application stage. That said, be aware that
using a null driver can also hide any bottleneck due to driver processing
itself and communication between stages.
Another more direct method is to underclock the CPU, if possible [164,
946]. If performance drops in direct proportion to the CPU rate, the ap-
plication is CPU-bound. Finally, the process of elimination can be used: If
none of the GPU stages are the bottleneck, the CPU must be. This same
underclocking approach can be done for many GPUs for various stages with
programs such as Coolbits or EnTech’s PowerStrip. These underclocking
methods can help identify a bottleneck, but can sometimes cause a stage
that was not a bottleneck before to become one. The other option is to
overclock, but you did not read that here.
15.2.2 Testing the Geometry Stage
The geometry stage is the most difficult stage to test. This is because
if the workload on this stage is changed, then the workload on one or
both of the other stages is often changed as well. To avoid this problem,
Cebenoyan [164] gives a series of tests working from the rasterizer stages
back up the pipeline.
There are two main areas where a bottleneck can occur in the geometry
stage: vertex fetching and processing. One way to see whether the bottleneck
is due to object data transfer is to increase the size of the vertex format. This can be
done by sending several extra texture coordinates per vertex, for example.
If performance falls, this area is the bottleneck.
Vertex processing is done by the vertex shader or the fixed-function
pipeline’s transform and lighting functionality. For the vertex shader bot-
tleneck, testing consists of making the shader program longer. Some care
has to be taken to make sure the compiler is not optimizing away these
additional instructions. For the fixed-function pipeline, processing load
can be increased by turning on additional functionality such as specular
highlighting, or by changing light sources to more complex forms (e.g.,
spotlights).
15.2.3 Testing the Rasterizer Stage
This stage consists of three separate substages: triangle setup, the pixel
shader program, and raster operations. Triangle setup is almost never the
bottleneck, as it simply joins vertices into triangles [1400]. The simplest
way to test if raster operations are the bottleneck is by reducing the bit
depth of color output from 32 (or 24) bit to 16 bit. The bottleneck is found
if the frame rate increases considerably.
Once raster operations are ruled out, the pixel shader program’s effect
can be tested by changing the screen resolution. If a lower screen resolution
causes the frame rate to rise appreciably, the pixel shader is the bottleneck,
at least some of the time. Care has to be taken if a level-of-detail system