Figure 18.6. A textured quadrilateral consists of two triangles. To the left, linear inter-
polation is used over each triangle, and to the right, perspective-correct interpolation is
used. Notice how linear interpolation ruins the three-dimensional effect.
interpolated value versus position forms a hyperbola. It is also called ratio-
nal linear interpolation, because a numerator and denominator are linearly
interpolated.
As an example, assume a triangle should be rasterized, and that each vertex is

\[
\mathbf{r}_i = \Bigl( \mathbf{p}_i, \; \frac{1}{w_i}, \; \frac{u_i}{w_i}, \; \frac{v_i}{w_i} \Bigr). \tag{18.1}
\]
Here, p_i = (p_ix, p_iy, p_iz) is the point after projection and division by its w-component, which is denoted w_i. The (u_i, v_i) are texture coordinates at
the vertices. This is shown in Figure 18.7. To find the coordinates of a pixel on the horizontal scanline shown in gray, one must first linearly interpolate r_0 and r_1 to get r_3, and also linearly interpolate r_2 and r_1 to get r_4. Then linear interpolation of r_3 and r_4 can be used to find an interpolated r for each pixel on the gray scanline. The interpolated p-values are correct as is. The other interpolated surface values, (u/w, v/w), are multiplied by a computed value w to obtain (u, v). To obtain w itself, the linearly interpolated value 1/w is used, with a division done per pixel to compute w = 1/(1/w).
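As a concrete illustration of these per-pixel steps, here is a minimal C++ sketch (not from the text; all names are illustrative) of interpolating the vector r = (1/w, u/w, v/w) and recovering (u, v) with one division per pixel:

struct Attribs {
    float oneOverW; // linearly interpolated 1/w
    float uOverW;   // linearly interpolated u/w
    float vOverW;   // linearly interpolated v/w
};

// Linear interpolation of (1/w, u/w, v/w); this screen-space
// interpolation is not perspective correct on its own.
Attribs lerp(const Attribs& a, const Attribs& b, float t) {
    return { a.oneOverW + t * (b.oneOverW - a.oneOverW),
             a.uOverW   + t * (b.uOverW   - a.uOverW),
             a.vOverW   + t * (b.vOverW   - a.vOverW) };
}

// Per pixel: recover w = 1/(1/w), then multiply to get the true (u, v).
void recoverUV(const Attribs& r, float& u, float& v) {
    float w = 1.0f / r.oneOverW; // the one division per pixel
    u = r.uOverW * w;
    v = r.vOverW * w;
}

A real rasterizer folds these operations into its attribute interpolators; the sketch only shows the data flow.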
Figure 18.7. A triangle being rasterized. The vectors r_i are linearly interpolated over the triangle.
This implementation of the interpolation is only one choice. Another
possibility is to compute delta slopes for texture interpolation: Δ(u/w)/Δx,
Δ(u/w)/Δy, Δ(v/w)/Δx, Δ(v/w)/Δy, Δ(1/w)/Δx, and Δ(1/w)/Δy. Again, the u/w, v/w, and 1/w values can be updated with only additions between neighboring pixels, and then the texture coordinates (u, v) are computed as before.
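As a hedged sketch of this delta-slope variant, an inner scanline loop might look as follows in C++; the per-triangle slope values (duOverW_dx and so on) are assumed to have been computed once during triangle setup, and all names are invented for illustration:

void scanline(int xStart, int xEnd,
              float uOverW, float vOverW, float oneOverW, // values at xStart
              float duOverW_dx, float dvOverW_dx, float dOneOverW_dx)
{
    for (int x = xStart; x <= xEnd; ++x) {
        float w = 1.0f / oneOverW; // per-pixel division, as before
        float u = uOverW * w;
        float v = vOverW * w;
        // ... sample the texture at (u, v) and shade the pixel at x ...
        uOverW   += duOverW_dx;    // incremental update: additions only
        vOverW   += dvOverW_dx;
        oneOverW += dOneOverW_dx;
    }
}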
An in-depth discussion of perspective-correct interpolation is provided
by Blinn [103, 105, 108]. It should be noted that in current GPUs, edge
functions [1016] are often used to determine whether a sample is inside
a triangle. An edge function is simply the equation of a two-dimensional
line in implicit form (see Equation A.47 on page 907). As an interesting
side effect, perspective-correct interpolation can also be done by using the
evaluated edge functions [836].
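For concreteness, a minimal sketch of an edge-function inside test follows, assuming counterclockwise vertex winding with y pointing up; actual GPUs typically evaluate these functions incrementally in fixed point, with tie-breaking rules for samples that lie exactly on an edge:

struct Vec2 { float x, y; };

// Signed edge function for the implicit 2D line through a and b,
// evaluated at sample point p. Positive means p is to the left of
// the directed edge a->b (counterclockwise winding, y up).
float edgeFunction(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// A sample is inside the triangle when all three edge functions agree in sign.
bool inside(const Vec2& v0, const Vec2& v1, const Vec2& v2, const Vec2& p) {
    return edgeFunction(v0, v1, p) >= 0.0f &&
           edgeFunction(v1, v2, p) >= 0.0f &&
           edgeFunction(v2, v0, p) >= 0.0f;
}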
18.3 Architecture
In this section we will first present a general architecture for a real-time
computer graphics system. The geometry and rasterizer stages, as well
as a texture cache mechanism, will also be discussed here. Bandwidth,
memory, ports, and buses are important components in a graphics ar-
chitecture, as is latency. These topics are discussed in Sections 18.3.2–
18.3.5. To reduce bandwidth requirements, buffers can be compressed and
occlusion culling can be performed. Such algorithms are treated in Sec-
tions 18.3.6 and 18.3.7. Section 18.3.8 describes the programmable culling
unit, a hardware construction for culling unnecessary instructions in the
fragment shader.
18.3.1 General
Graphics accelerators have generally evolved from the end of the pipeline
backward. For example, the first PC accelerators did little more than draw
interpolated or textured spans of pixels rapidly. Triangle setup was then
added in 1998 by the 3Dfx Voodoo 2. NVIDIA introduced geometry stage
acceleration, known as hardware transform and lighting (T&L), with the
GeForce 256 in 1999 [360]. However, the fixed-function implementation
limited the usefulness of this hardware stage. With the introduction of
vertex and pixel shaders in 2000, graphics hardware gained both the speed
and programmability needed to overcome purely CPU-based renderers in
essentially all areas. The modern accelerator is often called a graphics
processing unit (GPU) because of these capabilities.
The host is the computer system without any graphics hardware, i.e.,
it is a system with CPU(s), memory, buses, etc. Applications run on the
host, i.e., on the CPU(s). The interface between the application and the
graphics hardware is called a driver.
The GPU is often thought of in different terms than the CPU. A GPU
is not a serial processor like the CPU, but is rather a dataflow or stream
processor. It is the appearance of data at the beginning of the pipeline
that causes computation to occur, and a limited set of inputs is needed to
perform the computation. This different processing model lends itself in
particular to problems where each datum is affected by only nearby data.
One active area of research is how to apply the GPU to non-rendering
problems of this type, such as fluid flow computation and collision de-
tection. This form of computation is called general purpose computation
on the GPU, abbreviated as GPGPU. AMD’s CTM [994] and NVIDIA’s
CUDA [211] are toolkits that aid the programmer in using the GPU for
non-graphical applications, as well as alternative rendering methods such
as ray tracing. Other architectures, such as Intel’s Larrabee [1221], aim
to provide graphics acceleration using CPU-centric processors with texture
samplers. The Larrabee architecture is expected to be able to attack a
wider range of problems that can benefit from parallelization.
Pipelining and Parallelization
Two ways to achieve faster processing in graphics hardware are pipelining
and parallelism. These techniques are most often used in conjunction with
each other. See Section 2.1 for the basics of pipelining. When discussing
pipelining in this section, we do not mean the conceptual or the functional
stages, but rather the physical pipeline stages, i.e., the actual blocks built
in silicon that are executed in parallel. It is worth repeating that divid-
ing a certain function (for example, the lighting computation of a vertex)
into n pipeline stages ideally gives a performance increase of n. In general,
graphics hardware for polygon rendering is much simpler to pipeline than
a CPU. For example, the Intel Pentium IV has 20 pipeline stages, which
should be compared to NVIDIA’s GeForce3, which has 600–800 stages (de-
pending on which path is taken). The reasons for this are many, including
that the CPU is operation-driven, while most of the GPU is data-driven,
that pixels are rendered independent of each other, and that some parts of
the GPU are highly specialized and fixed-function (not programmable).
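To spell out the ideal n-fold claim above: if a function takes total time T and is divided into n pipeline stages with stage times t_i summing to T, a new result is completed every max_i t_i. Hence

\[
\text{throughput} = \frac{1}{\max_i t_i} \;\le\; \frac{n}{T},
\]

with equality exactly when the stages are balanced, t_i = T/n, giving n times the original rate of 1/T.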
The clock frequency of a CPU is often much higher than that of a graph-
ics accelerator, but this hardly touches the huge advantage that modern
accelerators have over the CPU when used correctly. CPUs typically have
much higher clock frequencies because, among other reasons, CPU design-
ers spend much more time on optimizing the blocks. One reason for this is that CPUs have tens of pipeline stages, compared to hundreds or thou-
sands for GPUs. Also, for GPUs, the clock frequency is most often not the
Figure 18.8. The general architecture for a high-performance, parallel computer graphics
architecture. Each geometry unit (G) processes (i.e., transforms, projects, etc.) a portion
of the geometry to be rendered, and each rasterizer unit (R) takes care of a portion of
the pixel calculations.
bottleneck—rather, memory bandwidth is.⁵ Another reason why graphics
algorithms can run fast in hardware is that it is relatively simple to predict
what memory accesses will be done, and so efficient memory prefetching
can be achieved. The majority of memory accesses are reads, which also
simplifies and improves performance. Also note that a CPU cannot tol-
erate long latencies (when an instruction is executed, the result is usually
desired immediately), whereas a graphics accelerator can tolerate longer la-
tencies, as long as sufficiently many primitives can be rendered per second.
In addition, in the beginning of the 21st century, the most complex GPUs
surpassed the most complex PC CPUs in terms of number of transistors.
In conclusion, if the same chip area is used to build a dedicated graphics
accelerator as a standard CPU, hardware rendering is at least 100 times
faster than using software rendering with the CPU [816].

⁵In fact, memory bandwidth is most often the bottleneck for CPUs as well. However, CPU company marketing departments have found that a high clock frequency number is what sells best.
The other way of achieving faster graphics is to parallelize, and this
can be done in both the geometry and the rasterizer stages. The idea
is to compute multiple results simultaneously and then merge these at a
later stage. In general, a parallel graphics architecture has the appearance
shown in Figure 18.8. The result of the graphics database traversal is sent
to a number of geometry units (G’s) that work in parallel. Together, these
geometry units do the geometry-stage work of the rendering pipeline (see
Section 2.3), and forward their results to a set of rasterizer units (R’s),
which together implement the rasterizer stage in the rendering pipeline.
However, since multiple results are computed in parallel, some kind of
sorting must be done, so that the parallel units together render the image
that the user intended. Specifically, the sorting needed is from model space
to screen space (see Sections 2.3.1 and 2.4).
Figure 18.9. Taxonomy of parallel graphics architectures. A is the application, G is the
geometry stage, then follows the rasterizer comprised of FG, which is fragment gener-
ation, and FM, which is fragment merging. The architectures are (left to right): sort-
first, sort-middle, sort-last fragment, and sort-last image. (Illustration after Eldridge et
al. [302].)
It should be noted that the G’s and R’s may be identical units, i.e., unified shader cores, capable of
executing both geometry and per-pixel processing, and switching between
these. See Section 18.4.1 on how the Xbox 360 uses unified shaders. Even
if this is the case, it is important to understand where this sorting takes
place.
Molnar et al. [896] present a taxonomy for parallel architectures, based
on where in the parallel rendering pipeline the sort occurs. Eldridge et
al. [302] present a slight variation of this taxonomy, and we follow their
notation here. The sort can occur anywhere in the pipeline, which gives rise
to four different classes of parallel architectures, as shown in Figure 18.9.
These are called sort-first, sort-middle, sort-last fragment, and sort-last
image. The rasterizer stage is comprised of two internal stages, namely,
fragment generation (FG) and fragment merge (FM). FG generates the
actual locations inside a primitive, and forwards these to FM, which merges
the result, using the Z-buffer. Shading is not shown, but is, in general, done
either at the end of FG, or in the beginning of FM. Also note that there
may be, for example, twice as many FGs as there are geometry units. Any
configuration is possible.
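As a minimal sketch of what an FM unit does conceptually (the buffer layout and fragment fields here are invented, not any particular GPU’s design), the merge is essentially a depth test against the Z-buffer:

#include <cstdint>
#include <limits>
#include <vector>

struct Fragment { int x, y; float z; uint32_t color; };

struct FrameBuffer {
    int width, height;
    std::vector<float>    depth; // the Z-buffer
    std::vector<uint32_t> color;

    FrameBuffer(int w, int h)
        : width(w), height(h),
          depth(w * h, std::numeric_limits<float>::max()),
          color(w * h, 0) {}

    // Merge one generated fragment: it survives only if it is
    // closer to the viewer than what is already stored.
    void merge(const Fragment& f) {
        int i = f.y * width + f.x;
        if (f.z < depth[i]) { // standard "less than" depth test
            depth[i] = f.z;
            color[i] = f.color;
        }
    }
};

In hardware this merge is fixed-function; the sketch only shows the logical operation.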
A sort-first-based architecture sorts primitives before the geometry
stage. The strategy is to divide the screen into a set of regions, and the
primitives inside a region are sent to a complete pipeline that “owns” that
region. See Figure 18.10. A primitive is initially processed enough to know
which region(s) it needs to be sent to—this is the sorting step. Sort-first