Figure 18.6. A textured quadrilateral consists of two triangles. To the left, linear inter-
polation is used over each triangle, and to the right, perspective-correct interpolation is
used. Notice how linear interpolation ruins the three-dimensional effect.
interpolated value versus position forms a hyperbola. It is also called ratio-
nal linear interpolation, because a numerator and denominator are linearly
interpolated.
As an example, assume a triangle should be rasterized, and that each vertex is

\[
\mathbf{r}_i = \Bigl( \mathbf{p}_i, \; \frac{1}{w_i}, \; \frac{u_i}{w_i}, \; \frac{v_i}{w_i} \Bigr). \tag{18.1}
\]
Here, p_i = (p_ix, p_iy, p_iz) is the point after projection and division by its w-component, which is denoted w_i. The (u_i, v_i) are texture coordinates at
the vertices. This is shown in Figure 18.7. To find the coordinates of a pixel on the horizontal scanline shown in gray, one must first linearly interpolate r_0 and r_1 to get r_3, and also linearly interpolate r_2 and r_1 to get r_4. Then linear interpolation of r_3 and r_4 can be used to find an interpolated r for each pixel on the gray scanline. The interpolated p-values are correct as is. The other interpolated surface values, (u/w, v/w), are multiplied by a computed value w to obtain (u, v). To obtain w itself, the linearly interpolated value 1/w is used, with a division done per pixel to compute w = 1/(1/w).
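As a concrete illustration of these per-pixel steps, here is a minimal C++ sketch (not from the text; all names are illustrative) of interpolating the vector r = (1/w, u/w, v/w) and recovering (u, v) with one division per pixel:

struct Attribs {
    float oneOverW; // linearly interpolated 1/w
    float uOverW;   // linearly interpolated u/w
    float vOverW;   // linearly interpolated v/w
};

// Linear interpolation of (1/w, u/w, v/w); this screen-space
// interpolation is not perspective correct on its own.
Attribs lerp(const Attribs& a, const Attribs& b, float t) {
    return { a.oneOverW + t * (b.oneOverW - a.oneOverW),
             a.uOverW   + t * (b.uOverW   - a.uOverW),
             a.vOverW   + t * (b.vOverW   - a.vOverW) };
}

// Per pixel: recover w = 1/(1/w), then multiply to get the true (u, v).
void recoverUV(const Attribs& r, float& u, float& v) {
    float w = 1.0f / r.oneOverW; // the one division per pixel
    u = r.uOverW * w;
    v = r.vOverW * w;
}

A real rasterizer folds these operations into its attribute interpolators; the sketch only shows the data flow.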
Figure 18.7. A triangle being rasterized. The vectors r_i are linearly interpolated over the triangle.
This implementation of the interpolation is only one choice. Another
possibility is to compute delta slopes for texture interpolation: Δ(u/w)/Δx,
Δ(u/w)/Δy, Δ(v/w)/Δx, Δ(v/w)/Δy, Δ(1/w)/Δx, and Δ(1/w)/Δy. Again, the u/w, v/w, and 1/w values can be updated with only additions between neighboring pixels, and then the texture coordinates (u, v) are computed as before.
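As a hedged sketch of this delta-slope variant, an inner scanline loop might look as follows in C++; the per-triangle slope values (duOverW_dx and so on) are assumed to have been computed once during triangle setup, and all names are invented for illustration:

void scanline(int xStart, int xEnd,
              float uOverW, float vOverW, float oneOverW, // values at xStart
              float duOverW_dx, float dvOverW_dx, float dOneOverW_dx)
{
    for (int x = xStart; x <= xEnd; ++x) {
        float w = 1.0f / oneOverW; // per-pixel division, as before
        float u = uOverW * w;
        float v = vOverW * w;
        // ... sample the texture at (u, v) and shade the pixel at x ...
        uOverW   += duOverW_dx;    // incremental update: additions only
        vOverW   += dvOverW_dx;
        oneOverW += dOneOverW_dx;
    }
}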
An in-depth discussion of perspective-correct interpolation is provided
by Blinn [103, 105, 108]. It should be noted that in current GPUs, edge
functions [1016] are often used to determine whether a sample is inside
a triangle. An edge function is simply the equation of a two-dimensional
line in implicit form (see Equation A.47 on page 907). As an interesting
side effect, perspective-correct interpolation can also be done by using the
evaluated edge functions [836].
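For concreteness, a minimal sketch of an edge-function inside test follows, assuming counterclockwise vertex winding with y pointing up; actual GPUs typically evaluate these functions incrementally in fixed point, with tie-breaking rules for samples that lie exactly on an edge:

struct Vec2 { float x, y; };

// Signed edge function for the implicit 2D line through a and b,
// evaluated at sample point p. Positive means p is to the left of
// the directed edge a->b (counterclockwise winding, y up).
float edgeFunction(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// A sample is inside the triangle when all three edge functions agree in sign.
bool inside(const Vec2& v0, const Vec2& v1, const Vec2& v2, const Vec2& p) {
    return edgeFunction(v0, v1, p) >= 0.0f &&
           edgeFunction(v1, v2, p) >= 0.0f &&
           edgeFunction(v2, v0, p) >= 0.0f;
}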
18.3 Architecture
In this section we will first present a general architecture for a real-time
computer graphics system. The geometry and rasterizer stages, as well
as a texture cache mechanism, will also be discussed here. Bandwidth,
memory, ports, and buses are important components in a graphics ar-
chitecture, as is latency. These topics are discussed in Sections 18.3.2–
18.3.5. To reduce bandwidth requirements, buffers can be compressed and
occlusion culling can be performed. Such algorithms are treated in Sec-
tions 18.3.6 and 18.3.7. Section 18.3.8 describes the programmable culling
unit, a hardware construction for culling unnecessary instructions in the
fragment shader.
18.3.1 General
Graphics accelerators have generally evolved from the end of the pipeline
backward. For example, the first PC accelerators did little more than draw
interpolated or textured spans of pixels rapidly. Triangle setup was then
added in 1998 by the 3Dfx Voodoo 2. NVIDIA introduced geometry stage
acceleration, known as hardware transform and lighting (T&L), with the
GeForce 256 in 1999 [360]. However, the fixed-function implementation
limited the usefulness of this hardware stage. With the introduction of
vertex and pixel shaders in 2000, graphics hardware gained both the speed
and programmability needed to overcome purely CPU-based renderers in
essentially all areas. The modern accelerator is often called a graphics
processing unit (GPU) because of these capabilities.
The host is the computer system without any graphics hardware, i.e.,
it is a system with CPU(s), memory, buses, etc. Applications run on the
host, i.e., on the CPU(s). The interface between the application and the
graphics hardware is called a driver.
The GPU is often thought of in different terms than the CPU. A GPU
is not a serial processor like the CPU, but is rather a dataflow or stream
processor. It is the appearance of data at the beginning of the pipeline
that causes computation to occur, and a limited set of inputs is needed to
perform the computation. This different processing model lends itself in
particular to problems where each datum is affected by only nearby data.
One active area of research is how to apply the GPU to non-rendering
problems of this type, such as fluid flow computation and collision de-
tection. This form of computation is called general purpose computation
on the GPU, abbreviated as GPGPU. AMD’s CTM [994] and NVIDIA’s
CUDA [211] are toolkits that aid the programmer in using the GPU for
non-graphical applications, as well as alternative rendering methods such
as ray tracing. Other architectures, such as Intel’s Larrabee [1221], aim
to provide graphics acceleration using CPU-centric processors with texture
samplers. The Larrabee architecture is expected to be able to attack a
wider range of problems that can benefit from parallelization.
Pipelining and Parallelization
Two ways to achieve faster processing in graphics hardware are pipelining
and parallelism. These techniques are most often used in conjunction with
each other. See Section 2.1 for the basics of pipelining. When discussing
pipelining in this section, we do not mean the conceptual or the functional
stages, but rather the physical pipeline stages, i.e., the actual blocks built
in silicon that are executed in parallel. It is worth repeating that divid-
ing a certain function (for example, the lighting computation of a vertex)
into n pipeline stages ideally gives a performance increase of n. In general,
graphics hardware for polygon rendering is much simpler to pipeline than
a CPU. For example, the Intel Pentium IV has 20 pipeline stages, which
should be compared to NVIDIA’s GeForce3, which has 600–800 stages (de-
pending on which path is taken). The reasons for this are many, including
that the CPU is operation-driven, while most of the GPU is data-driven,
that pixels are rendered independent of each other, and that some parts of
the GPU are highly specialized and fixed-function (not programmable).
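To spell out the ideal n-fold claim above: if a function takes total time T and is divided into n pipeline stages with stage times t_i summing to T, a new result is completed every max_i t_i. Hence

\[
\text{throughput} = \frac{1}{\max_i t_i} \;\le\; \frac{n}{T},
\]

with equality exactly when the stages are balanced, t_i = T/n, giving n times the original rate of 1/T.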
The clock frequency of a CPU is often much higher than that of a graph-
ics accelerator, but this hardly touches the huge advantage that modern
accelerators have over the CPU when used correctly. CPUs typically have
much higher clock frequencies because, among other reasons, CPU design-
ers spend much more time on optimizing the blocks. One reason for this is that CPUs have tens of pipeline stages, compared to hundreds or thou-
sands for GPUs. Also, for GPUs, the clock frequency is most often not the
Figure 18.8. The general architecture for a high-performance, parallel computer graphics
architecture. Each geometry unit (G) processes (i.e., transforms, projects, etc.) a portion
of the geometry to be rendered, and each rasterizer unit (R) takes care of a portion of
the pixel calculations.
bottleneck—rather, memory bandwidth is.⁵ Another reason why graphics
algorithms can run fast in hardware is that it is relatively simple to predict
what memory accesses will be done, and so efficient memory prefetching
can be achieved. The majority of memory accesses are reads, which also
simplifies and improves performance. Also note that a CPU cannot tol-
erate long latencies (when an instruction is executed, the result is usually
desired immediately), whereas a graphics accelerator can tolerate longer la-
tencies, as long as sufficiently many primitives can be rendered per second.
In addition, in the beginning of the 21st century, the most complex GPUs
surpassed the most complex PC CPUs in terms of number of transistors.
In conclusion, if the same chip area is used to build a dedicated graphics
accelerator as a standard CPU, hardware rendering is at least 100 times
faster than using software rendering with the CPU [816].

⁵In fact, memory bandwidth is most often the bottleneck for CPUs as well. However, CPU company marketing departments have found that a high clock frequency number is what sells best.
The other way of achieving faster graphics is to parallelize, and this
can be done in both the geometry and the rasterizer stages. The idea
is to compute multiple results simultaneously and then merge these at a
later stage. In general, a parallel graphics architecture has the appearance
shown in Figure 18.8. The result of the graphics database traversal is sent
to a number of geometry units (G’s) that work in parallel. Together, these
geometry units do the geometry-stage work of the rendering pipeline (see
Section 2.3), and forward their results to a set of rasterizer units (R’s),
which together implement the rasterizer stage in the rendering pipeline.
However, since multiple results are computed in parallel, some kind of
sorting must be done, so that the parallel units together render the image
that the user intended. Specifically, the sorting needed is from model space
to screen space (see Sections 2.3.1 and 2.4).
Figure 18.9. Taxonomy of parallel graphics architectures. A is the application, G is the
geometry stage, then follows the rasterizer comprised of FG, which is fragment gener-
ation, and FM, which is fragment merging. The architectures are (left to right): sort-
first, sort-middle, sort-last fragment, and sort-last image. (Illustration after Eldridge et
al. [302].)
It should be noted that the G’s and R’s may be identical units, i.e., unified shader cores, capable of
executing both geometry and per-pixel processing, and switching between
these. See Section 18.4.1 on how the Xbox 360 uses unified shaders. Even
if this is the case, it is important to understand where this sorting takes
place.
Molnar et al. [896] present a taxonomy for parallel architectures, based
on where in the parallel rendering pipeline the sort occurs. Eldridge et
al. [302] present a slight variation of this taxonomy, and we follow their
notation here. The sort can occur anywhere in the pipeline, which gives rise
to four different classes of parallel architectures, as shown in Figure 18.9.
These are called sort-first, sort-middle, sort-last fragment, and sort-last
image. The rasterizer stage is comprised of two internal stages, namely,
fragment generation (FG) and fragment merge (FM). FG generates the
actual locations inside a primitive, and forwards these to FM, which merges
the result, using the Z-buffer. Shading is not shown, but is, in general, done
either at the end of FG, or in the beginning of FM. Also note that there
may be, for example, twice as many FGs as there are geometry units. Any
configuration is possible.
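As a minimal sketch of what an FM unit does conceptually (the buffer layout and fragment fields here are invented, not any particular GPU’s design), the merge is essentially a depth test against the Z-buffer:

#include <cstdint>
#include <limits>
#include <vector>

struct Fragment { int x, y; float z; uint32_t color; };

struct FrameBuffer {
    int width, height;
    std::vector<float>    depth; // the Z-buffer
    std::vector<uint32_t> color;

    FrameBuffer(int w, int h)
        : width(w), height(h),
          depth(w * h, std::numeric_limits<float>::max()),
          color(w * h, 0) {}

    // Merge one generated fragment: it survives only if it is
    // closer to the viewer than what is already stored.
    void merge(const Fragment& f) {
        int i = f.y * width + f.x;
        if (f.z < depth[i]) { // standard "less than" depth test
            depth[i] = f.z;
            color[i] = f.color;
        }
    }
};

In hardware this merge is fixed-function; the sketch only shows the logical operation.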
A sort-first-based architecture sorts primitives before the geometry
stage. The strategy is to divide the screen into a set of regions, and the
primitives inside a region are sent to a complete pipeline that “owns” that
region. See Figure 18.10. A primitive is initially processed enough to know
which region(s) it needs to be sent to—this is the sorting step. Sort-first