15.4. Optimization 707
Conditional branches are often expensive, though most processors
have branch prediction, which means as long as the branches can be
consistently predicted, the cost can be low. However, a mispredicted
branch is often very expensive on some architectures, especially those
with deep pipelines.
Unroll small loops in order to get rid of the loop overhead. However,
this makes the code larger and thus may degrade cache performance.
Also, branch prediction usually works well on loops. Sometimes the
compiler can do the loop unrolling for you.
Use inline code for small functions that are called frequently.
Lessen floating-point precision when reasonable. For example, on
an Intel Pentium, floating-point division normally takes 39 cycles at
80 bits of precision, but only 19 cycles at 32 bits (however, at any
precision, division by a power of 2 takes around 8 cycles) [126]. When
choosing float instead of double, remember to attach an f at the
end of constants. Otherwise, they, and whole expressions, may be cast
to double. So float x = 2.42f; may be faster than float x = 2.42;.
Lower precision is also better because less data is then sent down the
graphics pipeline. Graphics computations are often fairly forgiving.
If a normal vector is stored as three 32-bit floats, it has enough ac-
curacy to point from Earth to a rock on Mars with sub-centimeter
precision [1207]. This level of precision is usually a bit more than is
needed.
Virtual methods, dynamic casting, (inherited) constructors, and pass-
ing structs by value have some efficiency penalties. In one case re-
ported to us, 40% of the time spent in a frame was used on the vir-
tual inheritance hierarchy that managed models. Blinn [109] presents
techniques for avoiding overhead for evaluating vector expressions in
C++.
15.4.2 API Calls
Throughout this book we have given advice based on general trends in
hardware. For example, indexed vertex buffers (in OpenGL, vertex buffer
objects) are usually the fastest way to provide the accelerator with geomet-
ric data (see Section 12.4.5). This section deals with some of the features
of the API and how to use them to best effect.
One problem touched upon in Section 12.4.2 that we would like to revisit
here is the small batch problem. This is by far the most significant factor
708 15. Pipeline Optimization
affecting performance in modern APIs. DirectX 10 had specific changes in
its design to combat this bottleneck, improving performance by a factor of
2× [1066], but the problem remains significant. Simply put, a few large meshes
are much more efficient to render than many small ones. This is because
there is a fixed-cost overhead associated with each API call, a cost paid for
processing a primitive, regardless of size. For example, Wloka [1365] shows
that drawing two (relatively small) triangles per batch was a factor of 375×
away from the maximum throughput for the GPU tested. Instead of 150
million triangles per second, the rate was 0.4 million, for a 2.7 GHz CPU.
This rate dropped to 0.1 million, 1500× slower, for a 1.0 GHz CPU. For a
scene consisting of many small and simple objects, each with only
a few triangles, performance is entirely CPU-bound by the API; the GPU
has no ability to increase it. That is, the processing time on the CPU for
the draw call is greater than the amount of time the GPU takes to actually
draw the mesh, so the GPU is starved.
Figure 15.2. Batching performance benchmarks for an Intel Core 2 Duo 2.66 GHz CPU
using an NVIDIA G80 GPU, running DirectX 10. Batches of varying size were run and
timed under different conditions. The “Low” conditions are for triangles with just the
position and a constant-color pixel shader; the other set of tests is for reasonable meshes
and shading. “Single” is rendering a single batch many times. “Instancing” reuses
the mesh data and puts the per-instance data in a separate stream. “Constants” is a
DirectX 10 method where instance data is put in constant memory. As can be seen, small
batches hurt all methods, but instancing gives proportionally much faster performance.
At a few hundred polygons, performance levels out, as the bottleneck becomes how fast
vertices are retrieved from the vertex buffer and caches. (Graph courtesy of NVIDIA
Corporation.)
Back in 2003, the breakpoint where the API was the bottleneck was
about 130 triangles per object. The breakpoint for NVIDIA’s GeForce 6
and 7 series with typical CPUs is about 200 triangles per draw call [75].
In other words, when batches have 200 triangles or less (and the triangles
are not large or shaded in a complex fashion), the bottleneck is the API on
the CPU side; an infinitely fast GPU would not change this performance,
since the GPU is not the bottleneck under these conditions. A slower GPU
would of course make this breakpoint lower. See Figure 15.2 for more recent
tests.
Wloka’s rule of thumb, borne out by others, is that “you get X batches
per frame,” and that X depends on the CPU speed. This is a maximum
number of batches given the performance of just the CPU; batches with a
large number of polygons or expensive rasterization, making the GPU the
bottleneck, lower this number. This idea is encapsulated in his formula:
X = BCU/F, (15.1)
where B is the number of batches per second for a 1 GHz CPU, C is GHz
rating of the current CPU, U is the amount of the CPU that is dedicated
to calling the object API, F is the target frame rate in fps, and X is
the computed number of batches callable per frame. Wloka gives B as a
constant of 25,000 batches per second for a 1 GHz CPU at 100% usage.
This formula is approximate, and some API and driver improvements can
be done to lower the impact of the CPU on the pipeline (which may increase
B). However, with GPUs increasing in speed faster than CPUs (about
3.0–3.7× for GPUs versus 2.2× for CPUs over an eighteen-month span),
the trend is toward each batch containing more triangles, not fewer. That
said, a few thousand polygons per mesh is enough to avoid the API being
the bottleneck and keep the GPU busy.
EXAMPLE: BATCHES PER FRAME. For a 3.7 GHz CPU, a budget of 40% of
the CPU’s time spent purely on object API calls, and a 60 fps frame rate,
the formula evaluates to

X = (25,000 batches/GHz × 3.7 GHz × 0.40 usage) / 60 fps = 616 batches/frame.
For this configuration, usage budget, and goals, the CPU limits the appli-
cation to roughly 600 batches that can be sent to the GPU each frame.
There are a number of ways of ameliorating the small batch problem,
and they all have the same goal: fewer API calls. The basic idea of batching
is to combine a number of objects into a single object, so that only one
API call is needed to render the set.
Figure 15.3. Vegetation instancing. All objects the same color in the lower image are
rendered in a single draw call [1340]. (Image from CryEngine1 courtesy of Crytek.)
Combining can be done one time and the buffer reused each frame for
sets of objects that are static. For dynamic objects, a single buffer can be
filled with a number of meshes. The limitation of this basic approach is
that all objects in a mesh need to use the same set of shader programs, i.e.,
the same material. However, it is possible to merge objects with different
colors, for example, by tagging each object’s vertices with an identifier.
This identifier is used by a shader program to look up what color is used
to shade the object. This same idea can be extended to other surface
attributes. Similarly, textures attached to surfaces can also hold identifiers
as to which material to use. Light maps of separate objects need to be
combined into texture atlases or arrays [961].
However, such practices can be taken too far. Adding branches and
different shading models to a single pixel shader program can be costly.
Sets of fragments are processed in parallel. If all fragments do not take
the same branch, then both branches must be evaluated for all fragments.
Care has to be taken to avoid making pixel shader programs that use an
excessive number of registers. The number of registers used influences the
number of fragments that a pixel shader can handle at the same time in
parallel. See Section 18.4.2.
The other approach to minimize API calls is to use some form of in-
stancing. Most APIs support the idea of having an object and drawing it
a number of times in a single call. So instead of making a separate API
call for each tree in a forest, you make one call that renders many copies
of the tree model. This is typically done by specifying a base model and
providing a separate data structure that holds information about each spe-
cific instance desired. Beyond position and orientation, other attributes
could be specified per instance, such as leaf colors or curvature due to the
wind, or anything else that could be used by shader programs to affect the
model. Lush jungle scenes can be created by liberal use of instancing. See
Figure 15.3. Crowd scenes are a good fit for instancing, with each character
appearing unique by having different body parts from a set of choices. Fur-
ther variation can be added by random coloring and decals [430]. Instancing
can also be combined with level of detail techniques [158, 279, 810, 811].
See Figure 15.4 for an example.
In theory, the geometry shader could be used for instancing, as it can
create duplicate data of an incoming mesh. In practice, this method is often
slower than using instancing API commands. The intent of the geometry
shader is to perform local, small scale amplification of data [1311].
Another way for the application to improve performance is to minimize
state changes by grouping objects with a similar rendering state (vertex and
pixel shader, texture, material, lighting, transparency, etc.) and rendering
them sequentially. When changing the state, there is sometimes a need
to wholly or partially flush the pipeline. For this reason, changing shader