is in place. A smaller screen is likely to also simplify the models displayed,
lessening the load on the geometry stage.
Another approach is the same as that taken with vertex shader programs, to add more instructions to see the effect on execution speed. Again, it is important to determine that these additional instructions were not optimized away by the compiler.
15.3 Performance Measurements
Before delving into different ways of optimizing the performance of the
graphics pipeline, a brief presentation of performance measurements will
be given. One way to express the performance of the geometry stage is
in terms of vertices per second. Similarly, one way to express the fill rate of the rasterizer stage is in terms of pixels per second (or sometimes texels per second).
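As a rough, purely illustrative calculation (the figures are ours, not vendor specifications): filling a 1920 × 1200 frame at 60 frames per second with an average depth complexity of 4 requires about 1920 · 1200 · 4 · 60 ≈ 553 million pixels per second of fill rate.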
Note that graphics hardware manufacturers often present peak rates,
which are at best hard to reach. Also, since we are dealing with a pipelined
system, true performance is not as simple as listing these kinds of numbers.
This is because the location of the bottleneck may move from one time to
another, and the different pipeline stages interact in different ways during
execution. It is educational to find out what the peak rates represent, and
try to duplicate them: Usually you will find the bus getting in the way,
but the exercise is guaranteed to generate insights into the architecture like
nothing else can.
When it comes to measuring performance for CPUs, the trend has been
to avoid IPS (instructions per second), FLOPS (floating point operations
per second), gigahertz, and simple short benchmarks. Instead, the preferred method is to use clock cycle counters [1], or to measure wall clock times for a range of different, real programs [541], and then compare the running times for these. Following this trend, most independent benchmarks instead measure the actual frame rate in fps for a number of given scenes, and for a number of different screen resolutions, antialiasing settings, etc. Many graphics-heavy games include a benchmarking mode or
have one created by a third party, and these benchmarks are commonly
used in comparing GPUs.
To be able to see the potential effects of pipeline optimization, it is
important to measure the total rendering time per frame with double
buffering disabled, i.e., in single-buffer mode. This is because with double
buffering turned on, swapping of the buffers occurs only in synchronization with the frequency of the monitor, as explained in the example in
Section 2.1. Other benchmarking tips and numerous optimizations are
provided in NVIDIA’s [946] and ATI’s guides [1400].
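As a minimal sketch of such a measurement (the function names renderFrame and waitForGPU are hypothetical application hooks; waitForGPU might wrap glFinish in an OpenGL-based renderer), the frame is timed with a wall clock and the GPU is explicitly drained, so no buffer swap or monitor synchronization is involved:

// Minimal frame-timing sketch; names are ours. renderFrame() is assumed
// to issue all of the frame's draw calls, and waitForGPU() to block until
// the GPU is idle (e.g., glFinish() in OpenGL), so that the whole frame
// cost is measured without waiting on a buffer swap.
#include <chrono>
#include <cstdio>

extern void renderFrame();   // hypothetical application hook
extern void waitForGPU();    // hypothetical wrapper, e.g., around glFinish()

double timeOneFrameMs()
{
    auto start = std::chrono::high_resolution_clock::now();
    renderFrame();
    waitForGPU();
    auto stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

void reportAverage(int frames)
{
    double totalMs = 0.0;
    for (int i = 0; i < frames; ++i)
        totalMs += timeOneFrameMs();           // average over many frames
    std::printf("%.3f ms/frame (%.1f fps)\n",
                totalMs / frames, 1000.0 * frames / totalMs);
}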
15.4 Optimization
Once a bottleneck has been located, we want to optimize that stage to boost
the performance. Optimizing techniques for the application, geometry, and
rasterizer stages are presented next. Some of these optimizations trade off
quality for speed, and some are just clever tricks for making the execution
of a certain stage faster.
15.4.1 Application Stage
The application stage is optimized by making the code faster and the memory accesses of the program faster or fewer. Detailed code optimization is
out of the scope of this book, and optimization techniques usually differ
from one CPU manufacturer to another.
One of the first steps to take is to turn on the optimization flags for the
compiler. There are usually a number of different flags, and you will have
to check which of these apply to your code. Make few assumptions about
what optimization options to use. For example, setting the compiler to “minimize code size” instead of “optimize for speed” may result in faster
code because memory caching performance is improved. Also, if possible,
try different compilers, as these are optimized in different ways, and some
are markedly superior.
For code optimization, it is crucial to locate the place in the code where
most of the time is spent. A good code profiler is key in finding these code
hot spots, where most time is spent. Optimization efforts are then made
in these places. Such locations in the program are often inner loops, pieces
of the code that are executed many times each frame.
The basic rule of optimization is to try a variety of tactics: Reexamine
algorithms, assumptions, and code syntax, trying as many variants as possible. CPU architecture and compiler performance often limit the user’s
ability to form an intuition about how to write the fastest code, so question
your assumptions and keep an open mind. For example, Booth [126] shows
a piece of code in which a seemingly useless array access actually feeds the
cache and speeds the overall routine by 25 percent.
A cache is a small fast-memory area that exists because there is usually
much coherence in a program, which the cache can exploit. That is, nearby
locations in memory tend to be accessed one after another (spatial locality),
and code is often accessed sequentially. Also, memory locations tend to be
accessed repeatedly (temporal locality), which the cache also exploits [282].
Processor caches are fast to access, second only to registers for speed. Many
fast algorithms work to access data as locally (and as little) as possible.
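As a small, self-contained illustration of spatial locality (the array and function names here are our own), the two routines below compute the same sum over a row-major array; the first walks memory in the order it is laid out, while the second strides across it and typically incurs far more cache misses:

// Same work, different memory access patterns over a row-major array.
const int N = 1024;
static float grid[N][N];          // rows are contiguous in memory in C++

float sumRowMajor()               // sequential accesses: good spatial locality
{
    float sum = 0.0f;
    for (int row = 0; row < N; ++row)
        for (int col = 0; col < N; ++col)
            sum += grid[row][col];
    return sum;
}

float sumColumnMajor()            // strides N floats per access: poor locality
{
    float sum = 0.0f;
    for (int col = 0; col < N; ++col)
        for (int row = 0; row < N; ++row)
            sum += grid[row][col];
    return sum;
}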
Registers and local caches form one end of the memory hierarchy, which extends next to dynamic random access memory (DRAM), then to storage
on hard disks. At one end are small amounts of very expensive and
very fast memory, at the other are large amounts of slow and cheap storage.
Between each level of the hierarchy the speed drops by some noticeable
factor. For example, processor registers are usually accessed in one clock
cycle, while L1 cache memory is accessed in a few cycles. Each change
in level has an increase in latency in this way. The one level change to
be avoided if at all possible is accessing the hard disk, which can have
hundreds of thousands of cycles of latency. Sometimes latency can be
hidden by the architecture. Section 18.4.2 describes this technique for the PLAYSTATION® 3 system, but it is always a factor that must be taken into account.
Below we address memory issues and then present tricks and methods
for writing fast code. These considerations apply to most pipelined CPUs
with a memory hierarchy and a cache at the topmost level(s). The rules are
constantly changing, though, so it helps to know your target architecture
well. Also, keep in mind that the hardware will change, so some optimizations made now may eventually become useless or counterproductive.
Memory Issues
The memory hierarchy can be a major source of slowdown on CPU architectures, much more than inefficiency in computation. Years ago the number of arithmetic instructions was the key measure of an algorithm’s efficiency; now the key is memory access patterns. Below is a list of pointers that should be kept in mind when programming.
• Assume nothing when it comes to the system memory routines (or
anything else, for that matter). For example, if you know the direc-
tion of a copy, or that no overlap of source and destination occurs,
an assembly loop using the largest registers available is the quickest
way to do a copy. This is not necessarily what the system’s memory
routines provide.
• Data that is accessed sequentially in the code should also be stored sequentially in memory. For example, when rendering a triangle mesh,
store texture coordinate #0, normal #0, color #0, vertex #0, texture
coordinate #1, normal #1, etc., sequentially in memory if they are
accessed in that order.
• Avoid pointer indirection, jumps, and function calls, as these may significantly decrease the performance of the cache. You get pointer indirection when you follow a pointer to another pointer and so on. This is typical for linked lists and tree structures; use arrays instead, if possible. McVoy and Staelin [850] show a code example that follows a linked list through pointers. This causes cache misses for data
both before and after, and their example stalls the CPU more than
100 times longer than it takes to follow the pointer (if the cache
could provide the address of the pointer). Smits [1200] notes how
flattening a pointer-based tree into a list with skip pointers considerably improves hierarchy traversal. Using a van Emde Boas layout is
another way to help avoid cache misses—see Section 14.1.4.
• The default memory allocation and deletion functions may be slow on some systems, so it is often better to allocate a large pool of memory at start-up for objects of the same size, and then use your own allocation and free routines for handling the memory of that pool [72, 550]. Libraries such as Boost provide pool allocation; a small sketch of the idea is given after this list. That said, for languages with garbage collection, such as C# and Java, pools can actually reduce performance.
• Better yet, try to avoid allocating or freeing memory altogether within
the rendering loop, e.g., allocate scratch space once, and have stacks,
arrays, and other structures only grow (using a variable or flags to
note which elements should be treated as deleted).
• On PCs, aligning frequently used data structures to multiples of exactly 32 bytes can noticeably improve overall performance by using cache lines optimally [1144]; a short alignment sketch also follows this list. Check the cache line size on your computer; on Pentium IVs, the L1 (level 1) cache line size is 64 bytes, and the L2 (level 2) cache line size is 128 bytes; so for these, it is better to align to either 64 or 128 bytes. On an AMD Athlon, the
cache line size is 64 bytes for both L1 and L2. Compiler options
can help, but it helps to design your data structures with alignment,
called padding, in mind. Tools such as VTune and CodeAnalyst for
Windows and Linux and the open-source Valgrind for Linux can help
identify caching bottlenecks.
• Try different organizations of data structures. For example, Hecker
[525] shows how a surprisingly large amount of time was saved by
testing a variety of matrix structures for a simple matrix multiplier.
An array of structures,
struct Vertex {float x, y, z;};
Vertex myvertices[1000];
or a structure of arrays,
struct VertexChunk {float x[1000], y[1000], z[1000];};
VertexChunk myvertices;
may work better for a given architecture. This second structure is
better for using SIMD commands, but as the number of vertices goes
up, the chance of a cache miss increases. As the array size increases,
a hybrid scheme,
struct Vertex4 {float x[4], y[4], z[4];};
Vertex4 myvertices[250];
may be the best choice that works well with existing code.
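Expanding on the pool-allocation point in the list above, here is a minimal sketch of a fixed-size pool; the class and member names are our own, and real pool libraries (such as Boost’s) add growth, alignment, and thread-safety policies that are omitted here.

#include <cstddef>
#include <new>
#include <vector>

// Minimal fixed-size pool sketch: one start-up allocation, then O(1)
// allocate/deallocate through a free list. Illustrative only; T must be
// default-constructible as written.
template <typename T>
class Pool
{
    struct alignas(alignof(T)) Slot { unsigned char bytes[sizeof(T)]; };
public:
    explicit Pool(std::size_t count) : m_storage(count)
    {
        m_free.reserve(count);
        for (Slot& s : m_storage)          // thread every slot onto the free list
            m_free.push_back(&s);
    }
    T* allocate()                          // nullptr when the pool is exhausted
    {
        if (m_free.empty())
            return nullptr;
        Slot* s = m_free.back();
        m_free.pop_back();
        return new (s) T();                // construct in place
    }
    void deallocate(T* p)                  // p must have come from allocate()
    {
        p->~T();
        m_free.push_back(reinterpret_cast<Slot*>(p));
    }
private:
    std::vector<Slot> m_storage;           // the single start-up allocation
    std::vector<Slot*> m_free;             // currently unused slots
};

A renderer might, for example, create one such pool per object type at start-up and recycle entries every frame instead of calling new and delete.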
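Similarly, for the cache-line alignment point, a small C++11 sketch (the structure and its contents are hypothetical) shows padding a frequently used structure out to a whole 64-byte cache line:

// Hypothetical per-particle data padded out to a full 64-byte cache line,
// so that elements of an array of these never straddle cache lines.
struct alignas(64) ParticleState
{
    float position[3];
    float velocity[3];
    float age;      // 28 bytes of payload; the compiler pads to 64
};

static_assert(sizeof(ParticleState) % 64 == 0,
              "ParticleState should occupy whole cache lines");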
Code Issues
The list that follows gives some techniques for writing fast code that are
particularly relevant to computer graphics. Such methods vary with the
compiler and the evolving CPU, but most of the following have held for
some years.
• Single Instruction Multiple Data (SIMD) instruction sets, such as Intel’s SSE series, AMD’s 3DNow! series, and the AIM Alliance’s AltiVec, could be used in many situations with great performance gains. Typically, 2–4 elements can be computed in parallel, which is perfect for vector operations. Alternately, four tests can be done in parallel. For example, Ericson [315] provides SIMD tests for comparing four separate pairs of objects simultaneously for overlap; a sketch of this kind of test follows this list.
• Using float-to-long conversion is slow on Pentiums, due to compliance with the C ANSI standard. A custom assembly language routine
can save significant time (while sacrificing ANSI compliance) [251,
664]. Intel added SSE instructions for integer truncation and other
operations to improve performance [347].
• Avoid using division as much as possible. Such instructions can take 2.5 or more times longer to execute than multiplying [591]. For example, take normalizing a three-dimensional vector. Conceptually, this is done by dividing all elements in the vector by the length of the vector. Instead of dividing three times, it is faster to compute 1/d, where d is the length of the vector, and then multiply all elements by this factor; a small sketch follows this list.
• Math functions such as sin, cos, tan, exp, arcsin, etc., are expensive and should be used with care. If lower precision is acceptable, develop or use approximate routines (possibly using SIMD instructions) where only the first few terms in the MacLaurin or Taylor series (see Equation B.2) are used. Since memory accesses can be expensive on modern CPUs, this is strongly preferred over using lookup tables; a small series sketch also follows this list.
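As a sketch of the kind of four-wide test mentioned in the SIMD item above (the function and parameter names are ours; the intrinsics are Intel’s SSE), four pairs of 1D intervals are tested for overlap in parallel:

#include <xmmintrin.h>   // SSE intrinsics

// Test four 1D interval pairs for overlap at once. Lane i holds the pair
// (aMin[i], aMax[i]) versus (bMin[i], bMax[i]). Returns a 4-bit mask;
// bit i is set if pair i overlaps.
int overlap4(const float aMin[4], const float aMax[4],
             const float bMin[4], const float bMax[4])
{
    __m128 amin = _mm_loadu_ps(aMin);
    __m128 amax = _mm_loadu_ps(aMax);
    __m128 bmin = _mm_loadu_ps(bMin);
    __m128 bmax = _mm_loadu_ps(bMax);
    // Overlap when aMin <= bMax and bMin <= aMax, in all four lanes at once.
    __m128 c1 = _mm_cmple_ps(amin, bmax);
    __m128 c2 = _mm_cmple_ps(bmin, amax);
    return _mm_movemask_ps(_mm_and_ps(c1, c2));
}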
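A sketch of the division advice applied to vector normalization (the Vec3 type and function name are ours):

#include <cmath>

struct Vec3 { float x, y, z; };

// One divide and three multiplies instead of three divides.
inline void normalize(Vec3& v)
{
    float d = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    if (d > 0.0f)
    {
        float inv = 1.0f / d;
        v.x *= inv;
        v.y *= inv;
        v.z *= inv;
    }
}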
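And a sketch of a truncated MacLaurin series for sine (our own routine; it is only reasonable for small arguments, so production code would first reduce the argument into a range such as [−π/2, π/2]):

// sin(x) ~= x - x^3/3! + x^5/5! - x^7/7!, evaluated in Horner form.
inline float sinApprox(float x)
{
    float x2 = x * x;
    return x * (1.0f - x2 * (1.0f / 6.0f
                   - x2 * (1.0f / 120.0f
                   - x2 * (1.0f / 5040.0f))));
}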