15.4. Optimization 707
Conditional branches are often expensive, though most processors
have branch prediction, which means as long as the branches can be
consistently predicted, the cost can be low. However, a mispredicted
branch is often very expensive on some architectures, especially those
with deep pipelines.
Unroll small loops in order to get rid of the loop overhead. However,
this makes the code larger and thus may degrade cache performance.
Also, branch prediction usually works well on loops. Sometimes the
compiler can do the loop unrolling for you.
Use inline code for small functions that are called frequently.
Lessen floating-point precision when reasonable. For example, on
an Intel Pentium, floating-point division normally takes 39 cycles at
80 bits of precision, but only 19 cycles at 32 bits (however, at any
precision, division by a power of 2 takes around 8 cycles) [126]. When
choosing float instead of double, remember to attach an f at the
end of constants. Otherwise, they, and whole expressions, may be cast
to double. So float x = 2.42f; may be faster than float x = 2.42;.
Lower precision is also better because less data is then sent down the
graphics pipeline. Graphics computations are often fairly forgiving.
If a normal vector is stored as three 32-bit floats, it has enough ac-
curacy to point from Earth to a rock on Mars with sub-centimeter
precision [1207]. This level of precision is usually a bit more than is
needed.
Virtual methods, dynamic casting, (inherited) constructors, and pass-
ing structs by value have some efficiency penalties. In one case re-
ported to us, 40% of the time spent in a frame was used on the vir-
tual inheritance hierarchy that managed models. Blinn [109] presents
techniques for avoiding overhead for evaluating vector expressions in
C++.
15.4.2 API Calls
Throughout this book we have given advice based on general trends in
hardware. For example, indexed vertex buffers (in OpenGL, vertex buffer
objects) are usually the fastest way to provide the accelerator with geomet-
ric data (see Section 12.4.5). This section deals with some of the features
of the API and how to use them to best effect.
One problem touched upon in Section 12.4.2 that we would like to revisit
here is the small batch problem. This is by far the most significant factor
708 15. Pipeline Optimization
affecting performance in modern APIs. DirectX 10 had specific changes in
its design to combat this bottleneck, improving performance by a factor of
2× [1066], but the problem remains significant. Simply put, a few large meshes
are much more efficient to render than many small ones. This is because
there is a fixed-cost overhead associated with each API call, a cost paid for
processing a primitive, regardless of size. For example, Wloka [1365] shows
that drawing two (relatively small) triangles per batch was a factor of 375×
away from the maximum throughput for the GPU tested. Instead of 150
million triangles per second, the rate was 0.4 million, for a 2.7 GHz CPU.
This rate dropped to 0.1 million, 1500× slower, for a 1.0 GHz CPU. For a
scene consisting of many small and simple objects, each with only
a few triangles, performance is entirely CPU-bound by the API; the GPU
has no ability to increase it. That is, the processing time on the CPU for
the draw call is greater than the amount of time the GPU takes to actually
draw the mesh, so the GPU is starved.
Figure 15.2. Batching performance benchmarks for an Intel Core 2 Duo 2.66 GHz CPU
using an NVIDIA G80 GPU, running DirectX 10. Batches of varying size were run and
timed under different conditions. The “Low” conditions are for triangles with just the
position and a constant-color pixel shader; the other set of tests is for reasonable meshes
and shading. “Single” is rendering a single batch many times. “Instancing” reuses
the mesh data and puts the per-instance data in a separate stream. “Constants” is a
DirectX 10 method where instance data is put in constant memory. As can be seen, small
batches hurt all methods, but instancing gives proportionally much faster performance.
At a few hundred polygons, performance levels out, as the bottleneck becomes how fast
vertices are retrieved from the vertex buffer and caches. (Graph courtesy of NVIDIA
Corporation.)
Back in 2003, the breakpoint where the API was the bottleneck was
about 130 triangles per object. The breakpoint for NVIDIA’s GeForce 6
and 7 series with typical CPUs is about 200 triangles per draw call [75].
In other words, when batches have 200 triangles or less (and the triangles
are not large or shaded in a complex fashion), the bottleneck is the API on
the CPU side; an infinitely fast GPU would not change this performance,
since the GPU is not the bottleneck under these conditions. A slower GPU
would of course make this breakpoint lower. See Figure 15.2 for more recent
tests.
Wloka’s rule of thumb, borne out by others, is that “you get X batches
per frame,” and that X depends on the CPU speed. This is a maximum
number of batches given the performance of just the CPU; batches with a
large number of polygons or expensive rasterization, making the GPU the
bottleneck, lower this number. This idea is encapsulated in his formula:
X = BCU/F, (15.1)
where B is the number of batches per second for a 1 GHz CPU, C is GHz
rating of the current CPU, U is the amount of the CPU that is dedicated
to calling the object API, F is the target frame rate in fps, and X is
the computed number of batches callable per frame. Wloka gives B as a
constant of 25,000 batches per second for a 1 GHz CPU at 100% usage.
This formula is approximate, and some API and driver improvements can
be done to lower the impact of the CPU on the pipeline (which may increase
B). However, with GPUs increasing in speed faster than CPUs (about
3.0–3.7× for GPUs versus 2.2× for CPUs over an eighteen-month span),
the trend is toward each batch containing more triangles, not fewer. That
said, a few thousand polygons per mesh is enough to avoid the API being
the bottleneck and keep the GPU busy.
EXAMPLE: BATCHES PER FRAME. For a 3.7 GHz CPU, a budget of 40% of
the CPU’s time spent purely on object API calls, and a 60 fps frame rate,
the formula evaluates to

X = (25,000 batches/GHz × 3.7 GHz × 0.40 usage) / 60 fps = 616 batches/frame.
For this configuration, usage budget, and goals, the CPU limits the appli-
cation to roughly 600 batches that can be sent to the GPU each frame.
There are a number of ways of ameliorating the small batch problem,
and they all have the same goal: fewer API calls. The basic idea of batching
is to combine a number of objects into a single object, so that only one
API call is needed to render the set.
Figure 15.3. Vegetation instancing. All objects the same color in the lower image are
rendered in a single draw call [1340]. (Image from CryEngine1 courtesy of Crytek.)
Combining can be done one time and the buffer reused each frame for
sets of objects that are static. For dynamic objects, a single buffer can be
filled with a number of meshes. The limitation of this basic approach is
that all objects in a mesh need to use the same set of shader programs, i.e.,
the same material. However, it is possible to merge objects with different
colors, for example, by tagging each object’s vertices with an identifier.
This identifier is used by a shader program to look up what color is used
to shade the object. This same idea can be extended to other surface
attributes. Similarly, textures attached to surfaces can also hold identifiers
as to which material to use. Light maps of separate objects need to be
combined into texture atlases or arrays [961].
However, such practices can be taken too far. Adding branches and
different shading models to a single pixel shader program can be costly.
Sets of fragments are processed in parallel. If all fragments do not take
the same branch, then both branches must be evaluated for all fragments.
Care has to be taken to avoid making pixel shader programs that use an
excessive number of registers. The number of registers used influences the
number of fragments that a pixel shader can handle at the same time in
parallel. See Section 18.4.2.
The other approach to minimize API calls is to use some form of in-
stancing. Most APIs support the idea of having an object and drawing it
a number of times in a single call. So instead of making a separate API
call for each tree in a forest, you make one call that renders many copies
of the tree model. This is typically done by specifying a base model and
providing a separate data structure that holds information about each spe-
cific instance desired. Beyond position and orientation, other attributes
could be specified per instance, such as leaf colors or curvature due to the
wind, or anything else that could be used by shader programs to affect the
model. Lush jungle scenes can be created by liberal use of instancing. See
Figure 15.3. Crowd scenes are a good fit for instancing, with each character
appearing unique by having different body parts from a set of choices. Fur-
ther variation can be added by random coloring and decals [430]. Instancing
can also be combined with level of detail techniques [158, 279, 810, 811].
See Figure 15.4 for an example.
In theory, the geometry shader could be used for instancing, as it can
create duplicate data of an incoming mesh. In practice, this method is often
slower than using instancing API commands. The intent of the geometry
shader is to perform local, small scale amplification of data [1311].
Another way for the application to improve performance is to minimize
state changes by grouping objects with a similar rendering state (vertex and
pixel shader, texture, material, lighting, transparency, etc.) and rendering
them sequentially. When changing the state, there is sometimes a need
to wholly or partially flush the pipeline. For this reason, changing shader