700 15. Pipeline Optimization
fect example of the importance of understanding the underlying hard-
ware comes with the advent of the unified shader architecture. The Xbox
360 (see Section 18.4.1) uses this architecture, and it forms the basis of
high-end cards from the end of 2006 on. The idea is that vertex, pixel,
and geometry shaders all use the same functional units. The GPU takes
care of load balancing, changing the proportion of units assigned to ver-
tex versus pixel shading. As an example, if a large quadrilateral is ren-
dered, only a few shader units could be assigned to vertex transformation,
while the bulk are given the task of fragment processing. For this ar-
chitecture, pinpointing whether the bottleneck is in the vertex or pixel
shader stage is moot [1400]. Either this combined set of stages or another
stage will still be the bottleneck, however, so we discuss each possibility
in turn.
15.2.1 Testing the Application Stage
If the platform being used is supplied with a utility for measuring the
workload on the processor(s), that utility can be used to see if your program
uses 100 percent (or near that) of the CPU processing power. For Windows
there is the Task Manager, and for Macs, there is the Activity Monitor. For
Unix systems, there are usually programs called top or osview that show
the process workload on the CPU(s). If the CPU is in constant use, your
program is CPU-limited. This test is not foolproof, since you may be
waiting for the hardware to complete a frame, and this wait operation is
sometimes implemented as a busy-wait. Using a code profiler to determine
where the time is spent is better. AMD has a tool called CodeAnalyst for
analyzing and optimizing the code run on their line of CPUs, and Intel has
a similar tool called VTune that can analyze where the time is spent in the
application or in a driver. There are other places where time can go. For
example, the D3D runtime sits between the application and the driver. This element
converts API calls to device-independent commands, which are stored in a
command buffer. The runtime sends the driver these commands in batches,
for efficiency.
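
The busy-wait caveat above can be illustrated with a small sketch. It compares CPU time with wall-clock time for two kinds of wait, using only Python's standard `time` module. The helper names (`cpu_utilization`, `sleeping_wait`, `busy_wait`) are made up for this illustration, and the waits merely stand in for "waiting on the hardware to finish a frame"; no graphics API is involved.

```python
import time

def cpu_utilization(work, *args):
    """Return (cpu_seconds / wall_seconds, wall_seconds) for one call of work."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return (cpu / wall if wall > 0 else 0.0), wall

def sleeping_wait(seconds):
    # Blocking wait: the process yields the CPU while it waits.
    time.sleep(seconds)

def busy_wait(seconds):
    # Spin until the deadline: burns CPU time although no useful work is done.
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass

if __name__ == "__main__":
    for name, wait in (("sleeping", sleeping_wait), ("busy", busy_wait)):
        ratio, wall = cpu_utilization(wait, 0.2)
        print(f"{name} wait: {ratio:.0%} CPU over {wall:.3f} s")
```

On a typical system the busy wait reports a utilization close to 100 percent even though the program is only waiting, which is why a utilization monitor alone can mislabel a program as CPU-limited and why a profiler gives a more reliable picture.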
A smarter way to test for CPU limits is to send down data that causes
the other stages to do little or no work. For some systems this can be
accomplished by simply using a null driver (a driver that accepts calls but
does nothing) instead of a real driver [946]. This effectively sets an upper
limit on how fast you can get the entire program to run, because you do not
use the graphics hardware, and thus, the CPU is always the bottleneck. By
doing this test, you get an idea of how much room for improvement there
is for the stages not run in the application stage. That said, be aware that
using a null driver can also hide any bottleneck due to driver processing
itself and communication between stages.
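
The idea behind the null-driver test can be sketched in application code. This is only an analogy under stated assumptions: a real null driver is swapped in at the driver level, whereas here a hypothetical `NullBackend` class replaces a hypothetical `SimulatedBackend` whose `draw` call sleeps to mimic the cost of the later stages. All class names, function names, and parameters are invented for the illustration.

```python
import time

class NullBackend:
    """Hypothetical stand-in for a null driver: accepts calls, does nothing."""
    def draw(self, mesh):
        pass

    def present(self):
        pass

class SimulatedBackend:
    """Hypothetical stand-in for a real driver; sleeps to mimic later stages."""
    def __init__(self, seconds_per_draw):
        self.seconds_per_draw = seconds_per_draw

    def draw(self, mesh):
        time.sleep(self.seconds_per_draw)

    def present(self):
        pass

def application_stage(num_objects):
    # Placeholder CPU work (think animation or culling) producing "meshes".
    return [sum(range(200)) + i for i in range(num_objects)]

def time_frames(backend, num_frames=5, num_objects=20):
    """Return the average seconds per frame when rendering through backend."""
    start = time.perf_counter()
    for _ in range(num_frames):
        for mesh in application_stage(num_objects):
            backend.draw(mesh)
        backend.present()
    return (time.perf_counter() - start) / num_frames

if __name__ == "__main__":
    null_time = time_frames(NullBackend())
    full_time = time_frames(SimulatedBackend(0.001))
    print(f"application stage only: {null_time * 1000:.2f} ms/frame")
    print(f"with simulated backend: {full_time * 1000:.2f} ms/frame")
    print(f"time spent outside the application stage: "
          f"{(full_time - null_time) * 1000:.2f} ms/frame")
```

The frame time measured with `NullBackend` approximates the cost of the application stage alone, and the difference between the two measurements suggests how much time goes to everything after it. As with a real null driver, the sketch leaves out any cost the driver itself would add.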