Chapter 3
The Graphics Processing Unit
“The display is the computer.”
—Jen-Hsun Huang
Historically, hardware graphics acceleration has started at the end of the
pipeline, first performing rasterization of a triangle’s scanlines. Successive
generations of hardware have then worked back up the pipeline, to the point
where some higher level application-stage algorithms are being committed
to the hardware accelerator. Dedicated hardware’s only advantage over
software is speed, but speed is critical.
Over the past decade, graphics hardware has undergone an incredible
transformation. The first consumer graphics chip to include hardware ver-
tex processing (NVIDIA’s GeForce 256) shipped in 1999. NVIDIA coined
the term graphics processing unit (GPU) to differentiate the GeForce 256
from the previously available rasterization-only chips, and it stuck [898].
Over the next few years, the GPU evolved from configurable implementa-
tions of a complex fixed-function pipeline to highly programmable “blank
slates” where developers could implement their own algorithms. Pro-
grammable shaders of various kinds are the primary means by which the
GPU is controlled. The vertex shader enables various operations (includ-
ing transformations and deformations) to be performed on each vertex.
Similarly, the pixel shader processes individual pixels, allowing complex
shading equations to be evaluated per pixel. The geometry shader allows
the GPU to create and destroy geometric primitives (points, lines, trian-
gles) on the fly. Computed values can be written to multiple high-precision
buffers and reused as vertex or texture data. For efficiency, some parts
of the pipeline remain configurable, not programmable, but the trend is
towards programmability and flexibility [123].
[Figure 3.1 diagram: the stages, in pipeline order: Vertex Shader, Geometry Shader, Clipping, Screen Mapping, Triangle Setup, Triangle Traversal, Pixel Shader, Merger.]
Figure 3.1. GPU implementation of the rendering pipeline. The stages are color coded
according to the degree of user control over their operation. Green stages are fully
programmable. Yellow stages are configurable but not programmable, e.g., the clipping
stage can optionally perform culling or add user-defined clipping planes. Blue stages are
completely fixed in their function.
3.1 GPU Pipeline Overview
The GPU implements the geometry and rasterization conceptual pipeline
stages described in Chapter 2. These are divided into several hardware
stages with varying degrees of configurability or programmability. Fig-
ure 3.1 shows the various stages color coded according to how programmable
or configurable they are. Note that these physical stages are split up slightly
differently than the functional stages presented in Chapter 2.
The vertex shader is a fully programmable stage that is typically used
to implement the “Model and View Transform,” “Vertex Shading,” and
“Projection” functional stages. The geometry shader is an optional, fully
programmable stage that operates on the vertices of a primitive (point, line
or triangle). It can be used to perform per-primitive shading operations,
to destroy primitives, or to create new ones. The clipping, screen mapping,
triangle setup, and triangle traversal stages are fixed-function stages that
implement the functional stages of the same names. Like the vertex and
geometry shaders, the pixel shader is fully programmable and performs the
“Pixel Shading” functional stage. Finally, the merger stage is somewhere be-
tween the full programmability of the shader stages and the fixed operation
of the other stages. Although it is not programmable, it is highly config-
urable and can be set to perform a wide variety of operations. Of course,
it implements the “Merging” functional stage, in charge of modifying the
color, Z-buffer, blend, stencil, and other related buffers.
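For example, a minimal pass-through geometry shader, written in HLSL (a
shading language introduced in the next section), might look like the following
sketch; all names here are ours, not part of any API:

    struct GSVert
    {
        float4 pos : SV_Position;
    };

    [maxvertexcount(3)]
    void GSMain(triangle GSVert tri[3],
                inout TriangleStream<GSVert> stream)
    {
        // Appending the three input vertices passes the triangle through;
        // returning without appending destroys the primitive, and emitting
        // additional strips would create new ones.
        for (int i = 0; i < 3; i++)
            stream.Append(tri[i]);
        stream.RestartStrip();
    }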
Over time, the GPU pipeline has evolved away from hard-coded op-
eration and toward increasing flexibility and control. The introduction
of programmable shader stages was the most important step in this evo-
lution. The next section describes the features common to the various
programmable stages.
3.2 The Programmable Shader Stage
Modern shader stages (i.e., those that support Shader Model 4.0, introduced
with DirectX 10 on Windows Vista) use a common-shader core. This means that the
vertex, pixel, and geometry shaders share a programming model. We dif-
ferentiate in this book between the common-shader core, the functional
description seen by the applications programmer, and unified shaders, a
GPU architecture that maps well to this core. See Section 18.4. The
common-shader core is the API; having unified shaders is a GPU feature.
Earlier GPUs had less commonality between vertex and pixel shaders and
did not have geometry shaders. Nonetheless, most of the design elements
for this model are shared by older hardware; for the most part, older ver-
sions’ design elements are either simpler or missing, not radically different.
So, for now we will focus on Shader Model 4.0 and discuss older GPUs’
shader models in later sections.
Describing the entire programming model is well beyond the scope of
this book, and there are many documents, books, and websites that al-
ready do so [261, 338, 647, 1084]. However, a few comments are in order.
Shaders are programmed using C-like shading languages such as HLSL, Cg,
and GLSL. These are compiled to a machine-independent assembly lan-
guage, also called the intermediate language (IL). Previous shader models
allowed programming directly in the assembly language, but as of DirectX
10, programs in this language are visible as debug output only [123]. This
assembly language is converted to the actual machine language in a sep-
arate step, usually in the drivers. This arrangement allows compatibility
across different hardware implementations. This assembly language can
be seen as defining a virtual machine, which is targeted by the shading
language compiler.
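As a small, self-contained illustration, here is a sketch of a vertex and pixel
shader pair in HLSL; all names (VSMain, worldViewProj, and so on) are
illustrative:

    cbuffer PerObject
    {
        float4x4 worldViewProj;  // concatenated model, view, and projection
    };

    struct VSIn  { float3 pos : POSITION;    float3 normal : NORMAL;    };
    struct VSOut { float4 pos : SV_Position; float3 normal : TEXCOORD0; };

    // Vertex shader: transform each incoming vertex into clip space.
    VSOut VSMain(VSIn v)
    {
        VSOut o;
        o.pos    = mul(float4(v.pos, 1.0), worldViewProj);
        o.normal = v.normal;
        return o;
    }

    // Pixel shader: evaluate a simple shading equation for each pixel.
    float4 PSMain(VSOut p) : SV_Target
    {
        float3 lightDir = normalize(float3(0.5, 1.0, 0.25));
        float  diffuse  = saturate(dot(normalize(p.normal), lightDir));
        return float4(diffuse.xxx, 1.0);
    }

The shading language compiler turns such source code into the intermediate
assembly language, which the driver later maps to the GPU’s actual machine
language.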
This virtual machine is a processor with various types of registers and
data sources, programmed with a set of instructions. Since many graph-
ics operations are done on short vectors (up to length 4), the processor
has 4-way SIMD (single-instruction multiple-data) capabilities. Each regis-
ter contains four independent values. 32-bit single-precision floating-point
scalars and vectors are the basic data types; support for 32-bit integers has
recently been added, as well. Floating-point vectors typically contain data
such as positions (xyzw), normals, matrix rows, colors (rgba), or texture
coordinates (uvwq). Integers are most often used to represent counters,
indices, or bit masks. Aggregate data types such as structures, arrays,
and matrices are also supported. To facilitate working with vectors, swiz-
zling, the replication of any vector component, is also supported. That
is, a vector’s elements can be reordered or duplicated as desired. Simi-
larly, masking, where only the specified vector elements are used, is also
supported.
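In HLSL notation, swizzling and masking look like this (a sketch; the function
name is ours):

    float4 SwizzleExample(float4 v)
    {
        float4 repl = v.xxxx;  // replication: broadcast one component
        float4 rev  = v.wzyx;  // swizzle: reorder the components freely
        float3 sub  = v.xyz;   // masking: use only the listed components
        float2 uv   = v.zw;    // any subset, in any order, is allowed
        return float4(sub * repl.x, rev.y + uv.x);
    }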
A draw call invokes the graphics API to draw a group of primitives,
so causing the graphics pipeline to execute. Each programmable shader
stage has two types of inputs: uniform inputs, with values that remain
constant throughout a draw call (but can be changed between draw calls),
[Figure 3.2 diagram: the shader virtual machine, with its varying input registers (16/16/32), constant registers (16 buffers of 4096 registers), textures (128 arrays of 512 textures), temporary registers (4096), and output registers (16/32/8).]
Figure 3.2. Common-shader core virtual machine architecture and register layout, under
DirectX 10. The maximum available number is indicated next to each resource. Three
numbers separated by slashes refer to the limits for vertex, geometry, and pixel shaders
(from left to right).
and varying inputs, which are different for each vertex or pixel processed
by the shader. A texture is a special kind of uniform input that once was
always a color image applied to a surface, but that now can be thought of
as any large array of data. It is important to note that although shaders
have a wide variety of inputs, which they can address in different ways,
the outputs are extremely constrained. This is the most significant way
in which shaders are different from programs executing on general-purpose
processors. The underlying virtual machine provides special registers for
the different types of inputs and outputs. Uniform inputs are accessed
via read-only constant registers or constant buffers, so called because their
contents are constant across a draw call. The number of available constant
registers is much larger than the number of registers available for varying
inputs or outputs. This is because the varying inputs and outputs need to
be stored separately for each vertex or pixel, and the uniform inputs are
stored once and reused across all the vertices or pixels in the draw call.
The virtual machine also has general-purpose temporary registers, which
are used for scratch space. All types of registers can be array-indexed
using integer values in temporary registers. The inputs and outputs of the
shader virtual machine can be seen in Figure 3.2.
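A sketch in HLSL of these input and output types (all names are illustrative):

    cbuffer PerFrame             // uniform inputs, held in a constant buffer;
    {                            // read-only and fixed across a draw call
        float4x4 viewProj;
        float4   lightDir;
    };

    Texture2D    diffuseTex;     // a texture: also a uniform input
    SamplerState linearSampler;

    struct VSInput               // varying inputs: differ for each vertex
    {
        float3 position : POSITION;
        float2 uv       : TEXCOORD0;
    };

    // The output is tightly constrained: here, one clip-space position.
    float4 VSMain(VSInput v) : SV_Position
    {
        return mul(float4(v.position, 1.0), viewProj);
    }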
Operations that are common in graphics computations are efficiently
executed on modern GPUs. Typically, the fastest operations are scalar and
vector multiplications, additions, and their combinations, such as multiply-
add and dot-product. Other operations, such as reciprocal, square root,
sine, cosine, exponentiation, and logarithm, tend to be slightly more costly
but still fairly speedy. Texturing operations (see Chapter 6) are efficient,
but their performance may be limited by factors such as the time spent
waiting to retrieve the result of an access. Shading languages expose the
most common of these operations (such as additions and multiplications)
via operators such as * and +. The rest are exposed through intrinsic
functions, e.g., atan(), dot(), log(), and many others. Intrinsic functions
also exist for more complex operations, such as vector normalization and
reflection, cross products, matrix transpose and determinant, etc.
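The following HLSL sketch exercises several of these operators and intrinsics
(the function name and arguments are ours):

    float4 OpsExample(float4 a, float4 b, float4 c,
                      float3 n, float3 l, float3x3 m)
    {
        float4 mad = a * b + c;           // multiply-add: typically very fast
        float  d   = dot(n, l);           // dot product
        float  inv = 1.0 / max(d, 1e-6);  // reciprocal (guarded)
        float  s   = sqrt(abs(d));        // square root
        float  lg  = log(abs(d) + 1e-6);  // logarithm
        float3 nn  = normalize(n);        // vector normalization
        float3 r   = reflect(-l, nn);     // reflection about a normal
        float3 cr  = cross(n, l);         // cross product
        float3x3 t = transpose(m);        // matrix transpose
        float  det = determinant(m);      // matrix determinant
        return mad * (d + inv + s + lg + det)
               + float4(r + cr + mul(t, nn), 1.0);
    }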
The term flow control refers to the use of branching instructions to
change the flow of code execution. These instructions are used to implement
high-level language constructs such as “if” and “case” statements, as well
as various types of loops. Shaders support two types of flow control. Static
flow control branches are based on the values of uniform inputs. This means
that the flow of the code is constant over the draw call. The primary benefit
of static flow control is to allow the same shader to be used in a variety of
different situations (e.g., varying numbers of lights). Dynamic flow control
is based on the values of varying inputs. This is much more powerful than
static flow control but is more costly, especially if the code flow changes
erratically between shader invocations. As discussed in Section 18.4.2, a
shader is evaluated on a number of vertices or pixels at a time. If the flow
selects the “if” branch for some elements and the “else” branch for others,
both branches must be evaluated for all elements (and the unused branch
for each element is discarded).
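Both kinds of flow control can be seen in this HLSL sketch (the names and the
limit of eight lights are illustrative):

    cbuffer Lights
    {
        int    numLights;       // a uniform input
        float4 lightDirs[8];
    };

    float4 PSMain(float3 normal : TEXCOORD0,
                  float4 color  : COLOR0) : SV_Target
    {
        float3 n   = normalize(normal);
        float  sum = 0.0;

        // Static flow control: the trip count is a uniform, so every pixel
        // in the draw call takes the same path; one shader handles any
        // number of lights.
        for (int i = 0; i < numLights; i++)
            sum += saturate(dot(n, lightDirs[i].xyz));

        // Dynamic flow control: this condition depends on a varying input
        // and may diverge between neighboring pixels, which costs speed.
        if (color.a < 0.5)
            sum *= color.a;

        return float4(sum.xxx, 1.0);
    }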
Shader programs can be compiled offline before program load or during
run time. As with any compiler, there are options for generating different
output files and for using different optimization levels. A compiled shader
is stored as a string of text, which is passed to the GPU via the driver.
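For instance, Microsoft’s fxc command-line compiler can be run offline; a
typical (illustrative) invocation picks a target profile and entry point, sets
the optimization level, and can also emit the assembly-language listing as
debug output:

    fxc /T vs_4_0 /E VSMain /O3 /Fo shader.vso /Fc shader.asm shader.hlsl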
3.3 The Evolution of Programmable Shading
The idea of a framework for programmable shading dates back to 1984 with
Cook’s shade trees [194]. A simple shader and its corresponding shade tree
are shown in Figure 3.3. The RenderMan Shading Language [30, 1283] was
developed from this idea in the late 1980s and is still widely used today for
film production rendering. Before GPUs supported programmable shaders
natively, there were several attempts to implement programmable shading
operations in real time via multiple rendering passes. The Quake III: Arena
scripting language was the first widespread commercial success in this area
in 1999 [558, 604]. In 2000, Peercy et al. [993] described a system that trans-
lated RenderMan shaders to run in multiple passes on graphics hardware.
They found that GPUs lacked two features that would make this approach