is in place. A smaller screen is likely to also simplify the models displayed,
lessening the load on the geometry stage.
Another approach is the same as that taken with vertex shader programs, to add more instructions to see the effect on execution speed. Again, it is important to determine that these additional instructions were not optimized away by the compiler.
15.3 Performance Measurements
Before delving into different ways of optimizing the performance of the
graphics pipeline, a brief presentation of performance measurements will
be given. One way to express the performance of the geometry stage is
in terms of vertices per second. Similarly, one way to express the fill rate of the rasterizer stage is in terms of pixels per second (or sometimes texels per second).
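As a rough, purely illustrative calculation (the figures are ours, not vendor specifications): filling a 1920 × 1200 frame at 60 frames per second with an average depth complexity of 4 requires about 1920 · 1200 · 4 · 60 ≈ 553 million pixels per second of fill rate.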
Note that graphics hardware manufacturers often present peak rates,
which are at best hard to reach. Also, since we are dealing with a pipelined
system, true performance is not as simple as listing these kinds of numbers.
This is because the location of the bottleneck may move from one time to
another, and the different pipeline stages interact in different ways during
execution. It is educational to find out what the peak rates represent, and
try to duplicate them: Usually you will find the bus getting in the way,
but the exercise is guaranteed to generate insights into the architecture like
nothing else can.
When it comes to measuring performance for CPUs, the trend has been
to avoid IPS (instructions per second), FLOPS (floating point operations
per second), gigahertz, and simple short benchmarks. Instead, the preferred method is to use clock cycle counters [1], or to measure wall clock times for a range of different, real programs [541], and then compare the running times for these. Following this trend, most independent benchmarks instead measure the actual frame rate in fps for a number of given scenes, and for a number of different screen resolutions, antialiasing settings, etc. Many graphics-heavy games include a benchmarking mode or
have one created by a third party, and these benchmarks are commonly
used in comparing GPUs.
To be able to see the potential effects of pipeline optimization, it is
important to measure the total rendering time per frame with double
buffering disabled, i.e., in single-buffer mode. This is because with double
buffering turned on, swapping of the buffers occurs only in synchronization with the frequency of the monitor, as explained in the example in
Section 2.1. Other benchmarking tips and numerous optimizations are
provided in NVIDIA’s [946] and ATI’s guides [1400].
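As a minimal sketch of such a measurement (the function names renderFrame and waitForGPU are hypothetical application hooks; waitForGPU might wrap glFinish in an OpenGL-based renderer), the frame is timed with a wall clock and the GPU is explicitly drained, so no buffer swap or monitor synchronization is involved:

// Minimal frame-timing sketch; names are ours. renderFrame() is assumed
// to issue all of the frame's draw calls, and waitForGPU() to block until
// the GPU is idle (e.g., glFinish() in OpenGL), so that the whole frame
// cost is measured without waiting on a buffer swap.
#include <chrono>
#include <cstdio>

extern void renderFrame();   // hypothetical application hook
extern void waitForGPU();    // hypothetical wrapper, e.g., around glFinish()

double timeOneFrameMs()
{
    auto start = std::chrono::high_resolution_clock::now();
    renderFrame();
    waitForGPU();
    auto stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

void reportAverage(int frames)
{
    double totalMs = 0.0;
    for (int i = 0; i < frames; ++i)
        totalMs += timeOneFrameMs();           // average over many frames
    std::printf("%.3f ms/frame (%.1f fps)\n",
                totalMs / frames, 1000.0 * frames / totalMs);
}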
15.4 Optimization
Once a bottleneck has been located, we want to optimize that stage to boost
the performance. Optimizing techniques for the application, geometry, and
rasterizer stages are presented next. Some of these optimizations trade off
quality for speed, and some are just clever tricks for making the execution
of a certain stage faster.
15.4.1 Application Stage
The application stage is optimized by making the code faster and the memory accesses of the program faster or fewer. Detailed code optimization is
out of the scope of this book, and optimization techniques usually differ
from one CPU manufacturer to another.
One of the first steps to take is to turn on the optimization flags for the
compiler. There are usually a number of different flags, and you will have
to check which of these apply to your code. Make few assumptions about
what optimization options to use. For example, setting the compiler to “minimize code size” instead of “optimize for speed” may result in faster
code because memory caching performance is improved. Also, if possible,
try different compilers, as these are optimized in different ways, and some
are markedly superior.
For code optimization, it is crucial to locate the place in the code where
most of the time is spent. A good code profiler is key in finding these code
hot spots, where most time is spent. Optimization efforts are then made
in these places. Such locations in the program are often inner loops, pieces
of the code that are executed many times each frame.
The basic rule of optimization is to try a variety of tactics: Reexamine
algorithms, assumptions, and code syntax, trying as many variants as possible. CPU architecture and compiler performance often limit the user’s
ability to form an intuition about how to write the fastest code, so question
your assumptions and keep an open mind. For example, Booth [126] shows
a piece of code in which a seemingly useless array access actually feeds the
cache and speeds the overall routine by 25 percent.
A cache is a small fast-memory area that exists because there is usually
much coherence in a program, which the cache can exploit. That is, nearby
locations in memory tend to be accessed one after another (spatial locality),
and code is often accessed sequentially. Also, memory locations tend to be
accessed repeatedly (temporal locality), which the cache also exploits [282].
Processor caches are fast to access, second only to registers for speed. Many
fast algorithms work to access data as locally (and as little) as possible.
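As a small, self-contained illustration of spatial locality (the array and function names here are our own), the two routines below compute the same sum over a row-major array; the first walks memory in the order it is laid out, while the second strides across it and typically incurs far more cache misses:

// Same work, different memory access patterns over a row-major array.
const int N = 1024;
static float grid[N][N];          // rows are contiguous in memory in C++

float sumRowMajor()               // sequential accesses: good spatial locality
{
    float sum = 0.0f;
    for (int row = 0; row < N; ++row)
        for (int col = 0; col < N; ++col)
            sum += grid[row][col];
    return sum;
}

float sumColumnMajor()            // strides N floats per access: poor locality
{
    float sum = 0.0f;
    for (int col = 0; col < N; ++col)
        for (int row = 0; row < N; ++row)
            sum += grid[row][col];
    return sum;
}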
Registers and local caches form one end of the memory hierarchy, which extends next to dynamic random access memory (DRAM), then to storage
on hard disks. At one end are small amounts of very expensive and
very fast memory, at the other are large amounts of slow and cheap storage.
Between each level of the hierarchy the speed drops by some noticeable
factor. For example, processor registers are usually accessed in one clock
cycle, while L1 cache memory is accessed in a few cycles. Each change
in level has an increase in latency in this way. The one level change to
be avoided if at all possible is accessing the hard disk, which can have
hundreds of thousands of cycles of latency. Sometimes latency can be
hidden by the architecture. Section 18.4.2 describes this technique for the PLAYSTATION® 3 system, but it is always a factor that must be taken into account.
Below we address memory issues and then present tricks and methods
for writing fast code. These considerations apply to most pipelined CPUs
with a memory hierarchy and a cache at the topmost level(s). The rules are
constantly changing, though, so it helps to know your target architecture
well. Also, keep in mind that the hardware will change, so some optimizations made now may eventually become useless or counterproductive.
Memory Issues
The memory hierarchy can be a major source of slowdown on CPU architectures, much more than inefficiency in computation. Years ago the number of arithmetic instructions was the key measure of an algorithm’s efficiency; now the key is memory access patterns. Below is a list of pointers that should be kept in mind when programming.
• Assume nothing when it comes to the system memory routines (or
anything else, for that matter). For example, if you know the direc-
tion of a copy, or that no overlap of source and destination occurs,
an assembly loop using the largest registers available is the quickest
way to do a copy. This is not necessarily what the system’s memory
routines provide.
• Data that is accessed sequentially in the code should also be stored sequentially in memory. For example, when rendering a triangle mesh,
store texture coordinate #0, normal #0, color #0, vertex #0, texture
coordinate #1, normal #1, etc., sequentially in memory if they are
accessed in that order.
• Avoid pointer indirection, jumps, and function calls, as these may significantly decrease the performance of the cache. You get pointer indirection when you follow a pointer to another pointer and so on. This is typical for linked lists and tree structures; use arrays instead, if possible. McVoy and Staelin [850] show a code example that follows a linked list through pointers. This causes cache misses for data
both before and after, and their example stalls the CPU more than
100 times longer than it takes to follow the pointer (if the cache
could provide the address of the pointer). Smits [1200] notes how
flattening a pointer-based tree into a list with skip pointers considerably improves hierarchy traversal. Using a van Emde Boas layout is
another way to help avoid cache misses—see Section 14.1.4.
• The default memory allocation and deletion functions may be slow on some systems, so it is often better to allocate a large pool of memory at start-up for objects of the same size, and then use your own allocation and free routines for handling the memory of that pool [72, 550]. Libraries such as Boost provide pool allocation; a small sketch of the idea is given after this list. That said, for languages with garbage collection, such as C# and Java, pools can actually reduce performance.
• Better yet, try to avoid allocating or freeing memory altogether within
the rendering loop, e.g., allocate scratch space once, and have stacks,
arrays, and other structures only grow (using a variable or flags to
note which elements should be treated as deleted).
• On PCs, aligning frequently used data structures to multiples of exactly 32 bytes can noticeably improve overall performance by using cache lines optimally [1144]; a short alignment sketch also follows this list. Check the cache line size on your computer; on Pentium IVs, the L1 (level 1) cache line size is 64 bytes, and the L2 (level 2) cache line size is 128 bytes; so for these, it is better to align to either 64 or 128 bytes. On an AMD Athlon, the
cache line size is 64 bytes for both L1 and L2. Compiler options
can help, but it helps to design your data structures with alignment,
called padding, in mind. Tools such as VTune and CodeAnalyst for
Windows and Linux and the open-source Valgrind for Linux can help
identify caching bottlenecks.
• Try different organizations of data structures. For example, Hecker
[525] shows how a surprisingly large amount of time was saved by
testing a variety of matrix structures for a simple matrix multiplier.
An array of structures,
struct Vertex {float x, y, z;};
Vertex myvertices[1000];
or a structure of arrays,
struct VertexChunk {float x[1000], y[1000], z[1000];};
VertexChunk myvertices;
may work better for a given architecture. This second structure is
better for using SIMD commands, but as the number of vertices goes
up, the chance of a cache miss increases. As the array size increases,
a hybrid scheme,
struct Vertex4 {float x[4], y[4], z[4];};
Vertex4 myvertices[250];
may be the best choice that works well with existing code.
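Expanding on the pool-allocation point in the list above, here is a minimal sketch of a fixed-size pool; the class and member names are our own, and real pool libraries (such as Boost’s) add growth, alignment, and thread-safety policies that are omitted here.

#include <cstddef>
#include <new>
#include <vector>

// Minimal fixed-size pool sketch: one start-up allocation, then O(1)
// allocate/deallocate through a free list. Illustrative only; T must be
// default-constructible as written.
template <typename T>
class Pool
{
    struct alignas(alignof(T)) Slot { unsigned char bytes[sizeof(T)]; };
public:
    explicit Pool(std::size_t count) : m_storage(count)
    {
        m_free.reserve(count);
        for (Slot& s : m_storage)          // thread every slot onto the free list
            m_free.push_back(&s);
    }
    T* allocate()                          // nullptr when the pool is exhausted
    {
        if (m_free.empty())
            return nullptr;
        Slot* s = m_free.back();
        m_free.pop_back();
        return new (s) T();                // construct in place
    }
    void deallocate(T* p)                  // p must have come from allocate()
    {
        p->~T();
        m_free.push_back(reinterpret_cast<Slot*>(p));
    }
private:
    std::vector<Slot> m_storage;           // the single start-up allocation
    std::vector<Slot*> m_free;             // currently unused slots
};

A renderer might, for example, create one such pool per object type at start-up and recycle entries every frame instead of calling new and delete.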
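Similarly, for the cache-line alignment point, a small C++11 sketch (the structure and its contents are hypothetical) shows padding a frequently used structure out to a whole 64-byte cache line:

// Hypothetical per-particle data padded out to a full 64-byte cache line,
// so that elements of an array of these never straddle cache lines.
struct alignas(64) ParticleState
{
    float position[3];
    float velocity[3];
    float age;      // 28 bytes of payload; the compiler pads to 64
};

static_assert(sizeof(ParticleState) % 64 == 0,
              "ParticleState should occupy whole cache lines");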
Code Issues
The list that follows gives some techniques for writing fast code that are
particularly relevant to computer graphics. Such methods vary with the
compiler and the evolving CPU, but most of the following have held for
some years.
• Single Instruction Multiple Data (SIMD) instruction sets, such as Intel’s SSE series, AMD’s 3DNow! series, and the AIM Alliance’s AltiVec, could be used in many situations with great performance gains. Typically, 2–4 elements can be computed in parallel, which is perfect for vector operations. Alternately, four tests can be done in parallel. For example, Ericson [315] provides SIMD tests for comparing four separate pairs of objects simultaneously for overlap; a sketch of this kind of test follows this list.
• Using float-to-long conversion is slow on Pentiums, due to compliance with the C ANSI standard. A custom assembly language routine
can save significant time (while sacrificing ANSI compliance) [251,
664]. Intel added SSE instructions for integer truncation and other
operations to improve performance [347].
• Avoid using division as much as possible. Such instructions can take 2.5 or more times longer to execute than multiplying [591]. For example, take normalizing a three-dimensional vector. Conceptually, this is done by dividing all elements in the vector by the length of the vector. Instead of dividing three times, it is faster to compute 1/d, where d is the length of the vector, and then multiply all elements by this factor; a small sketch follows this list.
• Math functions such as sin, cos, tan, exp, arcsin, etc., are expensive and should be used with care. If lower precision is acceptable, develop or use approximate routines (possibly using SIMD instructions) where only the first few terms in the MacLaurin or Taylor series (see Equation B.2) are used. Since memory accesses can be expensive on modern CPUs, this is strongly preferred over using lookup tables; a small series sketch also follows this list.
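As a sketch of the kind of four-wide test mentioned in the SIMD item above (the function and parameter names are ours; the intrinsics are Intel’s SSE), four pairs of 1D intervals are tested for overlap in parallel:

#include <xmmintrin.h>   // SSE intrinsics

// Test four 1D interval pairs for overlap at once. Lane i holds the pair
// (aMin[i], aMax[i]) versus (bMin[i], bMax[i]). Returns a 4-bit mask;
// bit i is set if pair i overlaps.
int overlap4(const float aMin[4], const float aMax[4],
             const float bMin[4], const float bMax[4])
{
    __m128 amin = _mm_loadu_ps(aMin);
    __m128 amax = _mm_loadu_ps(aMax);
    __m128 bmin = _mm_loadu_ps(bMin);
    __m128 bmax = _mm_loadu_ps(bMax);
    // Overlap when aMin <= bMax and bMin <= aMax, in all four lanes at once.
    __m128 c1 = _mm_cmple_ps(amin, bmax);
    __m128 c2 = _mm_cmple_ps(bmin, amax);
    return _mm_movemask_ps(_mm_and_ps(c1, c2));
}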
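A sketch of the division advice applied to vector normalization (the Vec3 type and function name are ours):

#include <cmath>

struct Vec3 { float x, y, z; };

// One divide and three multiplies instead of three divides.
inline void normalize(Vec3& v)
{
    float d = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    if (d > 0.0f)
    {
        float inv = 1.0f / d;
        v.x *= inv;
        v.y *= inv;
        v.z *= inv;
    }
}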
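And a sketch of a truncated MacLaurin series for sine (our own routine; it is only reasonable for small arguments, so production code would first reduce the argument into a range such as [−π/2, π/2]):

// sin(x) ~= x - x^3/3! + x^5/5! - x^7/7!, evaluated in Horner form.
inline float sinApprox(float x)
{
    float x2 = x * x;
    return x * (1.0f - x2 * (1.0f / 6.0f
                   - x2 * (1.0f / 120.0f
                   - x2 * (1.0f / 5040.0f))));
}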