Chapter 3
The Graphics Processing Unit
“The display is the computer.”
—Jen-Hsun Huang
Historically, hardware graphics acceleration has started at the end of the
pipeline, first performing rasterization of a triangle’s scanlines. Successive
generations of hardware have then worked back up the pipeline, to the point
where some higher level application-stage algorithms are being committed
to the hardware accelerator. Dedicated hardware’s only advantage over
software is speed, but speed is critical.
Over the past decade, graphics hardware has undergone an incredible
transformation. The first consumer graphics chip to include hardware ver-
tex processing (NVIDIA’s GeForce 256) shipped in 1999. NVIDIA coined
the term graphics processing unit (GPU) to differentiate the GeForce 256
from the previously available rasterization-only chips, and it stuck [898].
Over the next few years, the GPU evolved from configurable implementa-
tions of a complex fixed-function pipeline to highly programmable “blank
slates” where developers could implement their own algorithms. Pro-
grammable shaders of various kinds are the primary means by which the
GPU is controlled. The vertex shader enables various operations (includ-
ing transformations and deformations) to be performed on each vertex.
Similarly, the pixel shader processes individual pixels, allowing complex
shading equations to be evaluated per pixel. The geometry shader allows
the GPU to create and destroy geometric primitives (points, lines, trian-
gles) on the fly. Computed values can be written to multiple high-precision
buffers and reused as vertex or texture data. For efficiency, some parts
of the pipeline remain configurable, not programmable, but the trend is
towards programmability and flexibility [123].
[Figure 3.1 diagram: the stages, in pipeline order: Vertex Shader, Geometry Shader, Clipping, Screen Mapping, Triangle Setup, Triangle Traversal, Pixel Shader, Merger.]
Figure 3.1. GPU implementation of the rendering pipeline. The stages are color coded
according to the degree of user control over their operation. Green stages are fully
programmable. Yellow stages are configurable but not programmable, e.g., the clipping
stage can optionally perform culling or add user-defined clipping planes. Blue stages are
completely fixed in their function.
3.1 GPU Pipeline Overview
The GPU implements the geometry and rasterization conceptual pipeline
stages described in Chapter 2. These are divided into several hardware
stages with varying degrees of configurability or programmability. Fig-
ure 3.1 shows the various stages color coded according to how programmable
or configurable they are. Note that these physical stages are split up slightly
differently than the functional stages presented in Chapter 2.
The vertex shader is a fully programmable stage that is typically used
to implement the “Model and View Transform,” “Vertex Shading,” and
“Projection” functional stages. The geometry shader is an optional, fully
programmable stage that operates on the vertices of a primitive (point, line
or triangle). It can be used to perform per-primitive shading operations,
to destroy primitives, or to create new ones. The clipping, screen mapping,
triangle setup, and triangle traversal stages are fixed-function stages that
implement the functional stages of the same names. Like the vertex and
geometry shaders, the pixel shader is fully programmable and performs the
“Pixel Shading” functional stage. Finally, the merger stage is somewhere be-
tween the full programmability of the shader stages and the fixed operation
of the other stages. Although it is not programmable, it is highly config-
urable and can be set to perform a wide variety of operations. Of course,
it implements the “Merging” functional stage, in charge of modifying the
color, Z-buffer, blend, stencil, and other related buffers.
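For example, a minimal pass-through geometry shader, written in HLSL (a
shading language introduced in the next section), might look like the following
sketch; all names here are ours, not part of any API:

    struct GSVert
    {
        float4 pos : SV_Position;
    };

    [maxvertexcount(3)]
    void GSMain(triangle GSVert tri[3],
                inout TriangleStream<GSVert> stream)
    {
        // Appending the three input vertices passes the triangle through;
        // returning without appending destroys the primitive, and emitting
        // additional strips would create new ones.
        for (int i = 0; i < 3; i++)
            stream.Append(tri[i]);
        stream.RestartStrip();
    }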
Over time, the GPU pipeline has evolved away from hard-coded op-
eration and toward increasing flexibility and control. The introduction
of programmable shader stages was the most important step in this evo-
lution. The next section describes the features common to the various
programmable stages.
3.2 The Programmable Shader Stage
Modern shader stages (i.e., those that support Shader Model 4.0, introduced
with DirectX 10 on Windows Vista) use a common-shader core. This means that the
vertex, pixel, and geometry shaders share a programming model. We dif-
ferentiate in this book between the common-shader core, the functional
description seen by the applications programmer, and unified shaders, a
GPU architecture that maps well to this core. See Section 18.4. The
common-shader core is the API; having unified shaders is a GPU feature.
Earlier GPUs had less commonality between vertex and pixel shaders and
did not have geometry shaders. Nonetheless, most of the design elements
for this model are shared by older hardware; for the most part, older ver-
sions’ design elements are either simpler or missing, not radically different.
So, for now we will focus on Shader Model 4.0 and discuss older GPUs’
shader models in later sections.
Describing the entire programming model is well beyond the scope of
this book, and there are many documents, books, and websites that al-
ready do so [261, 338, 647, 1084]. However, a few comments are in order.
Shaders are programmed using C-like shading languages such as HLSL, Cg,
and GLSL. These are compiled to a machine-independent assembly lan-
guage, also called the intermediate language (IL). Previous shader models
allowed programming directly in the assembly language, but as of DirectX
10, programs in this language are visible as debug output only [123]. This
assembly language is converted to the actual machine language in a sep-
arate step, usually in the drivers. This arrangement allows compatibility
across different hardware implementations. This assembly language can
be seen as defining a virtual machine, which is targeted by the shading
language compiler.
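As a small, self-contained illustration, here is a sketch of a vertex and pixel
shader pair in HLSL; all names (VSMain, worldViewProj, and so on) are
illustrative:

    cbuffer PerObject
    {
        float4x4 worldViewProj;  // concatenated model, view, and projection
    };

    struct VSIn  { float3 pos : POSITION;    float3 normal : NORMAL;    };
    struct VSOut { float4 pos : SV_Position; float3 normal : TEXCOORD0; };

    // Vertex shader: transform each incoming vertex into clip space.
    VSOut VSMain(VSIn v)
    {
        VSOut o;
        o.pos    = mul(float4(v.pos, 1.0), worldViewProj);
        o.normal = v.normal;
        return o;
    }

    // Pixel shader: evaluate a simple shading equation for each pixel.
    float4 PSMain(VSOut p) : SV_Target
    {
        float3 lightDir = normalize(float3(0.5, 1.0, 0.25));
        float  diffuse  = saturate(dot(normalize(p.normal), lightDir));
        return float4(diffuse.xxx, 1.0);
    }

The shading language compiler turns such source code into the intermediate
assembly language, which the driver later maps to the GPU’s actual machine
language.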
This virtual machine is a processor with various types of registers and
data sources, programmed with a set of instructions. Since many graph-
ics operations are done on short vectors (up to length 4), the processor
has 4-way SIMD (single-instruction multiple-data) capabilities. Each regis-
ter contains four independent values. 32-bit single-precision floating-point
scalars and vectors are the basic data types; support for 32-bit integers has
recently been added, as well. Floating-point vectors typically contain data
such as positions (xyzw), normals, matrix rows, colors (rgba), or texture
coordinates (uvwq). Integers are most often used to represent counters,
indices, or bit masks. Aggregate data types such as structures, arrays,
and matrices are also supported. To facilitate working with vectors, swiz-
zling, the replication of any vector component, is also supported. That
is, a vector’s elements can be reordered or duplicated as desired. Simi-
larly, masking, where only the specified vector elements are used, is also
supported.
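In HLSL notation, swizzling and masking look like this (a sketch; the function
name is ours):

    float4 SwizzleExample(float4 v)
    {
        float4 repl = v.xxxx;  // replication: broadcast one component
        float4 rev  = v.wzyx;  // swizzle: reorder the components freely
        float3 sub  = v.xyz;   // masking: use only the listed components
        float2 uv   = v.zw;    // any subset, in any order, is allowed
        return float4(sub * repl.x, rev.y + uv.x);
    }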
A draw call invokes the graphics API to draw a group of primitives,
so causing the graphics pipeline to execute. Each programmable shader
stage has two types of inputs: uniform inputs, with values that remain
constant throughout a draw call (but can be changed between draw calls),
[Figure 3.2 diagram: the shader virtual machine, with its varying input registers (16/16/32), constant registers (16 buffers of 4096 registers), textures (128 arrays of 512 textures), temporary registers (4096), and output registers (16/32/8).]
Figure 3.2. Common-shader core virtual machine architecture and register layout, under
DirectX 10. The maximum available number is indicated next to each resource. Three
numbers separated by slashes refer to the limits for vertex, geometry, and pixel shaders
(from left to right).
and varying inputs, which are different for each vertex or pixel processed
by the shader. A texture is a special kind of uniform input that once was
always a color image applied to a surface, but that now can be thought of
as any large array of data. It is important to note that although shaders
have a wide variety of inputs, which they can address in different ways,
the outputs are extremely constrained. This is the most significant way
in which shaders are different from programs executing on general-purpose
processors. The underlying virtual machine provides special registers for
the different types of inputs and outputs. Uniform inputs are accessed
via read-only constant registers or constant buffers, so called because their
contents are constant across a draw call. The number of available constant
registers is much larger than the number of registers available for varying
inputs or outputs. This is because the varying inputs and outputs need to
be stored separately for each vertex or pixel, and the uniform inputs are
stored once and reused across all the vertices or pixels in the draw call.
The virtual machine also has general-purpose temporary registers, which
are used for scratch space. All types of registers can be array-indexed
using integer values in temporary registers. The inputs and outputs of the
shader virtual machine can be seen in Figure 3.2.
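A sketch in HLSL of these input and output types (all names are illustrative):

    cbuffer PerFrame             // uniform inputs, held in a constant buffer;
    {                            // read-only and fixed across a draw call
        float4x4 viewProj;
        float4   lightDir;
    };

    Texture2D    diffuseTex;     // a texture: also a uniform input
    SamplerState linearSampler;

    struct VSInput               // varying inputs: differ for each vertex
    {
        float3 position : POSITION;
        float2 uv       : TEXCOORD0;
    };

    // The output is tightly constrained: here, one clip-space position.
    float4 VSMain(VSInput v) : SV_Position
    {
        return mul(float4(v.position, 1.0), viewProj);
    }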
Operations that are common in graphics computations are efficiently
executed on modern GPUs. Typically, the fastest operations are scalar and
vector multiplications, additions, and their combinations, such as multiply-
add and dot-product. Other operations, such as reciprocal, square root,
sine, cosine, exponentiation, and logarithm, tend to be slightly more costly
but still fairly speedy. Texturing operations (see Chapter 6) are efficient,
but their performance may be limited by factors such as the time spent
waiting to retrieve the result of an access. Shading languages expose the
most common of these operations (such as additions and multiplications)
via operators such as * and +. The rest are exposed through intrinsic
functions, e.g., atan(), dot(), log(), and many others. Intrinsic functions
also exist for more complex operations, such as vector normalization and
reflection, cross products, matrix transpose and determinant, etc.
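The following HLSL sketch exercises several of these operators and intrinsics
(the function name and arguments are ours):

    float4 OpsExample(float4 a, float4 b, float4 c,
                      float3 n, float3 l, float3x3 m)
    {
        float4 mad = a * b + c;           // multiply-add: typically very fast
        float  d   = dot(n, l);           // dot product
        float  inv = 1.0 / max(d, 1e-6);  // reciprocal (guarded)
        float  s   = sqrt(abs(d));        // square root
        float  lg  = log(abs(d) + 1e-6);  // logarithm
        float3 nn  = normalize(n);        // vector normalization
        float3 r   = reflect(-l, nn);     // reflection about a normal
        float3 cr  = cross(n, l);         // cross product
        float3x3 t = transpose(m);        // matrix transpose
        float  det = determinant(m);      // matrix determinant
        return mad * (d + inv + s + lg + det)
               + float4(r + cr + mul(t, nn), 1.0);
    }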
The term flow control refers to the use of branching instructions to
change the flow of code execution. These instructions are used to implement
high-level language constructs such as “if” and “case” statements, as well
as various types of loops. Shaders support two types of flow control. Static
flow control branches are based on the values of uniform inputs. This means
that the flow of the code is constant over the draw call. The primary benefit
of static flow control is to allow the same shader to be used in a variety of
different situations (e.g., varying numbers of lights). Dynamic flow control
is based on the values of varying inputs. This is much more powerful than
static flow control but is more costly, especially if the code flow changes
erratically between shader invocations. As discussed in Section 18.4.2, a
shader is evaluated on a number of vertices or pixels at a time. If the flow
selects the “if” branch for some elements and the “else” branch for others,
both branches must be evaluated for all elements (and the unused branch
for each element is discarded).
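Both kinds of flow control can be seen in this HLSL sketch (the names and the
limit of eight lights are illustrative):

    cbuffer Lights
    {
        int    numLights;       // a uniform input
        float4 lightDirs[8];
    };

    float4 PSMain(float3 normal : TEXCOORD0,
                  float4 color  : COLOR0) : SV_Target
    {
        float3 n   = normalize(normal);
        float  sum = 0.0;

        // Static flow control: the trip count is a uniform, so every pixel
        // in the draw call takes the same path; one shader handles any
        // number of lights.
        for (int i = 0; i < numLights; i++)
            sum += saturate(dot(n, lightDirs[i].xyz));

        // Dynamic flow control: this condition depends on a varying input
        // and may diverge between neighboring pixels, which costs speed.
        if (color.a < 0.5)
            sum *= color.a;

        return float4(sum.xxx, 1.0);
    }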
Shader programs can be compiled offline before program load or during
run time. As with any compiler, there are options for generating different
output files and for using different optimization levels. A compiled shader
is stored as a string of text, which is passed to the GPU via the driver.
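For instance, Microsoft’s fxc command-line compiler can be run offline; a
typical (illustrative) invocation picks a target profile and entry point, sets
the optimization level, and can also emit the assembly-language listing as
debug output:

    fxc /T vs_4_0 /E VSMain /O3 /Fo shader.vso /Fc shader.asm shader.hlsl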
3.3 The Evolution of Programmable Shading
The idea of a framework for programmable shading dates back to 1984 with
Cook’s shade trees [194]. A simple shader and its corresponding shade tree
are shown in Figure 3.3. The RenderMan Shading Language [30, 1283] was
developed from this idea in the late 1980s and is still widely used today for
film production rendering. Before GPUs supported programmable shaders
natively, there were several attempts to implement programmable shading
operations in real time via multiple rendering passes. The Quake III: Arena
scripting language was the first widespread commercial success in this area
in 1999 [558, 604]. In 2000, Peercy et al. [993] described a system that trans-
lated RenderMan shaders to run in multiple passes on graphics hardware.
They found that GPUs lacked two features that would make this approach