Chapter 15
Pipeline Optimization
“We should forget about small efficiencies, say about 97% of the
time: Premature optimization is the root of all evil.”
—Donald Knuth
As we saw in Chapter 2, the process of rendering an image is based on a
pipelined architecture with three conceptual stages: application, geometry,
and rasterizer. At any given moment, one of these stages, or the commu-
nication path between them, will always be the bottleneck—the slowest
stage in the pipeline. This implies that the bottleneck stage sets the limit
for the throughput, i.e., the total rendering performance, and so is a prime
candidate for optimization.
Optimizing the performance of the rendering pipeline resembles the
process of optimizing a pipelined processor (CPU) [541] in that it consists
mainly of two steps. First, the bottleneck of the pipeline is located. Second,
that stage is optimized in some way; and after that, step one is repeated if
the performance goals have not been met. Note that the bottleneck may
or may not be located at the same place after the optimization step. It is
a good idea to put only enough effort into optimizing the bottleneck stage
so that the bottleneck moves to another stage. Several other stages may
have to be optimized before this stage becomes the bottleneck again. For
this reason, effort should not be wasted on over-optimizing a stage.
The location of the bottleneck may change within a frame. At one mo-
ment the geometry stage may be the bottleneck because many tiny triangles
are rendered. Later in the frame the rasterizer could be the bottleneck be-
cause triangles covering large parts of the screen are rendered. So, when
we talk about, say, the rasterizer stage being the bottleneck, we mean it is
the bottleneck most of the time during that frame.
Another way to capitalize on the pipelined construction is to recognize
that when the slowest stage cannot be optimized further, the other stages
can be made to work just as much as the slowest stage. This will not change
performance, since the speed of the slowest stage will not be altered, but the
extra processing can be used to improve image quality. For example, say
that the bottleneck is in the application stage, which takes 50 milliseconds
(ms) to produce a frame, while the others each take 25 ms. This means
that without changing the speed of the rendering pipeline (50 ms equals
20 frames per second), the geometry and the rasterizer stages could also
do their work in 50 ms. For example, we could use a more sophisticated
lighting model or increase the level of realism with shadows and reflections,
assuming that this does not increase the workload on the application stage.
Pipeline optimization is a process in which we first maximize the ren-
dering speed, then allow the stages that are not bottlenecks to consume as
much time as the bottleneck. That said, this idea does not apply to newer
architectures such as the Xbox 360, which automatically load-balance com-
putational resources (more on this in a moment).
This exception is an excellent example of a key principle. When reading
this chapter, the dictum
KNOW YOUR ARCHITECTURE
should always be in the back of your mind, since optimization techniques
vary greatly for different architectures. A related dictum is, simply, “Mea-
sure.”
15.1 Profiling Tools
There are a number of worthwhile tools available for profiling use of the
graphics accelerator and CPU. Such tools are useful both for locating bot-
tlenecks and for optimizing. Examples include PIX for Windows (for Di-
rectX), gDEBugger (for OpenGL), NVIDIA’s NVPerfKit suite of tools,
ATI's GPU PerfStudio [1401], and Apple's OpenGL Profiler.
As an example, PIX for Windows provides real-time performance eval-
uation by providing counters for a wide variety of data, such as the number
of draw calls, state changes, texture and shader calls, CPU and GPU idle
time, locks on various resources, read and write I/O, the amount of mem-
ory used, etc. This data can be displayed overlaid on the application itself.
Figure 15.1 was rendered with this technique.
PIX can capture all the DirectX calls made within a given frame for
later analysis or playback. Examining this stream can show whether and
where unnecessary API calls are being made. PIX can also be used for
pixel debugging, showing the frame buffer history for a single pixel.
While these tools can provide developers with most of the information
they need, sometimes other data is needed that does not fit the mold.
Pelzer [1001] presents a number of useful techniques to display debugging
information.
Figure 15.1. PIX run atop a DirectX program, showing HUD information about the
number of draw calls performed. (Image from ATI’s Chimp Demo with Microsoft’s
PIX overlaid.)
15.2 Locating the Bottleneck
The first step in optimizing a pipeline is to locate the bottleneck. One
way of finding bottlenecks is to set up a number of tests, where each test
decreases the amount of work a particular stage performs. If one of these
tests causes the frames per second to increase, the bottleneck stage has
been found. A related way of testing a stage is to reduce the workload on
the other stages without reducing the workload on the stage being tested.
If performance does not change, the bottleneck is the stage where the work-
load was not altered. Performance tools can provide detailed information
on what API calls are expensive, but do not necessarily pinpoint exactly
what stage in the pipeline is slowing down the rest. Even when they do,
such as the “simplified experiments” provided by NVPerfAPI, it is impor-
tant to understand the idea behind each test.
What follows is a brief discussion of some of the ideas used to test
the various stages, to give a flavor of how such testing is done. There
are a number of documents available that discuss testing and optimiza-
tions for specific architectures and features [164, 946, 1065, 1400]. A per-
fect example of the importance of understanding the underlying hard-
ware comes with the advent of the unified shader architecture. The Xbox
360 (see Section 18.4.1) uses this architecture, and it forms the basis of
high-end cards from the end of 2006 on. The idea is that vertex, pixel,
and geometry shaders all use the same functional units. The GPU takes
care of load balancing, changing the proportion of units assigned to ver-
tex versus pixel shading. As an example, if a large quadrilateral is ren-
dered, only a few shader units could be assigned to vertex transformation,
while the bulk are given the task of fragment processing. For this ar-
chitecture, pinpointing whether the bottleneck is in the vertex or pixel
shader stage is moot [1400]. Either this combined set of stages or another
stage will still be the bottleneck, however, so we discuss each possibility
in turn.
15.2.1 Testing the Application Stage
If the platform being used is supplied with a utility for measuring the
workload on the processor(s), that utility can be used to see if your program
uses 100 percent (or near that) of the CPU processing power. For Windows
there is the Task Manager, and for Macs, there is the Activity Monitor. For
Unix systems, there are usually programs called top or osview that show
the process workload on the CPU(s). If the CPU is in constant use, your
program is CPU-limited. This is not always foolproof, since you may be
waiting for the hardware to complete a frame, and this wait operation is
sometimes implemented as a busy-wait. Using a code profiler to determine
where the time is spent is better. AMD has a tool called CodeAnalyst for
analyzing and optimizing the code run on their line of CPUs, and Intel has
a similar tool called VTune that can analyze where the time is spent in the
application or in a driver. There are other places where time can go, e.g.,
the D3D runtime sits between the application and the driver. This element
converts API calls to device-independent commands, which are stored in a
command buffer. The runtime sends the driver these commands in batches,
for efficiency.
A smarter way to test for CPU limits is to send down data that causes
the other stages to do little or no work. For some systems this can be
accomplished by simply using a null driver (a driver that accepts calls but
does nothing) instead of a real driver [946]. This effectively sets an upper
limit on how fast you can get the entire program to run, because you do not
use the graphics hardware, and thus, the CPU is always the bottleneck. By
doing this test, you get an idea on how much room for improvement there
is for the stages not run in the application stage. That said, be aware that
using a null driver can also hide any bottleneck due to driver processing
itself and communication between stages.
Another more direct method is to underclock the CPU, if possible [164,
946]. If performance drops in direct proportion to the CPU rate, the ap-
plication is CPU-bound. Finally, the process of elimination can be used: If
none of the GPU stages are the bottleneck, the CPU must be. This same
underclocking approach can be done for many GPUs for various stages with
programs such as Coolbits or EnTech’s PowerStrip. These underclocking
methods can help identify a bottleneck, but can sometimes cause a stage
that was not a bottleneck before to become one. The other option is to
overclock, but you did not read that here.
15.2.2 Testing the Geometry Stage
The geometry stage is the most difficult stage to test. This is because
if the workload on this stage is changed, then the workload on one or
both of the other stages is often changed as well. To avoid this problem,
Cebenoyan [164] gives a series of tests working from the rasterizer stages
back up the pipeline.
There are two main areas where a bottleneck can occur in the geometry
stage: vertex fetching and processing. One way to see whether the bottleneck
is due to object data transfer is to increase the size of the vertex format. This can be
done by sending several extra texture coordinates per vertex, for example.
If performance falls, this area is the bottleneck.
Vertex processing is done by the vertex shader or the fixed-function
pipeline’s transform and lighting functionality. For the vertex shader bot-
tleneck, testing consists of making the shader program longer. Some care
has to be taken to make sure the compiler is not optimizing away these
additional instructions. For the fixed-function pipeline, processing load
can be increased by turning on additional functionality such as specular
highlighting, or by changing light sources to more complex forms (e.g.,
spotlights).
15.2.3 Testing the Rasterizer Stage
This stage consists of three separate substages: triangle setup, the pixel
shader program, and raster operations. Triangle setup is almost never the
bottleneck, as it simply joins vertices into triangles [1400]. The simplest
way to test if raster operations are the bottleneck is by reducing the bit
depth of color output from 32 (or 24) bit to 16 bit. The bottleneck is found
if the frame rate increases considerably.
Once raster operations are ruled out, the pixel shader program’s effect
can be tested by changing the screen resolution. If a lower screen resolution
causes the frame rate to rise appreciably, the pixel shader is the bottleneck,
at least some of the time. Care has to be taken if a level-of-detail system