844 18. Graphics Hardware
Figure 18.10. Sort-first splits the screen into separate tiles and assigns a processor to
each tile, as shown here. A primitive is then sent to the processors whose tiles it
overlaps. This is in contrast to a sort-middle architecture, which needs to sort all triangles
after geometry processing has occurred. Only after all triangles have been sorted can
per-pixel rasterization start. (Images courtesy of Marcus Roth and Dirk Reiners.)
Sort-first is the least explored architecture for a single machine [303, 896]. It is a
scheme that does see use when driving a system with multiple screens or
projectors forming a large display, as a single computer is dedicated to each
screen [1086]. A system called Chromium [577] has been developed, which
can implement any type of parallel rendering algorithm using a cluster of
workstations. For example, sort-first and sort-last can be implemented
with high rendering performance.
The Mali 200 (Section 18.4.3) is a sort-middle architecture. Geometry
processors are given arbitrary sets of primitives, with the goal of load-
balancing the work. Also, each rasterizer unit is responsible for a screen-
space region, here called a tile. This may be a rectangular region of pixels,
or a set of scanlines, or some other interleaved pattern (e.g., every eighth
pixel). Once a primitive is transformed to screen coordinates, sorting oc-
curs to route the processing of this primitive to the rasterizer units (FGs
and FMs) that are responsible for that tile of the screen. Note that a
transformed triangle or mesh may be sent to several rasterizer units if it
overlaps more than one tile. One limitation of these tiling architectures
is the difficulty of using various post-processing effects such as blurs or
glows, as these require communication between neighboring pixels, and so
neighboring tiles, of computed results. If the maximum neighborhood of
the filter is known in advance, each tile can be enlarged by this amount to
capture the data needed and so avoid the problem, at the cost of redundant
computation along the seams.
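The routing step these tile-based designs perform can be sketched from a primitive's screen-space bounding box. The tile size, tile counts, and function name below are illustrative choices for this sketch, not those of any particular GPU:

```python
def overlapped_tiles(bbox, tile_size, tiles_x, tiles_y):
    """Return the (tx, ty) indices of every screen tile that a primitive's
    screen-space bounding box touches -- the sorting step of a sort-middle,
    tiled architecture. bbox is (xmin, ymin, xmax, ymax) in pixels."""
    xmin, ymin, xmax, ymax = bbox
    tx0 = max(0, int(xmin) // tile_size)
    ty0 = max(0, int(ymin) // tile_size)
    tx1 = min(tiles_x - 1, int(xmax) // tile_size)
    ty1 = min(tiles_y - 1, int(ymax) // tile_size)
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]

# A triangle whose bounds straddle a 16-pixel tile boundary is routed to
# two rasterizer units, so its transformed data is duplicated.
assert overlapped_tiles((10, 2, 20, 8), 16, 4, 4) == [(0, 0), (1, 0)]
# A small triangle inside one tile is sent to a single unit.
assert overlapped_tiles((3, 3, 7, 7), 16, 4, 4) == [(0, 0)]
```

Every tile in the returned list receives a copy of the primitive, which is exactly the redundant work described above for primitives overlapping more than one tile.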
The sort-last fragment architecture sorts the fragments after fragment
generation (FG) and before fragment merge (FM). An example is the GPU
of the PLAYSTATION® 3 system described in Section 18.4.2. Just as
with sort-middle, primitives are spread as evenly as possible across the
geometry units. One advantage with sort-last fragment is that there will
not be any overlap, meaning that a generated fragment is sent to only
one FM, which is optimal. However, it may be hard to balance fragment
generation work. For example, assume that a set of large polygons happens
to be sent down to one FG unit. The FG unit will have to generate all the
fragments on these polygons, but the fragments are then sorted to the FM
units responsible for the respective pixels. Therefore, imbalance is likely to
occur for such cases [303].
Figure 18.11. In sort-last image, different objects in the scene are sent to different
processors. Transparency is difficult to deal with when compositing separate rendered
images, so transparent objects are usually sent to all nodes. (Images courtesy of Marcus
Roth and Dirk Reiners.)
Finally, the sort-last image architecture sorts after the entire rasterizer
stage (FG and FM). A visualization is shown in Figure 18.11. PixelFlow
[328, 895] is one such example. This architecture can be seen as a set of
independent pipelines. The primitives are spread across the pipelines, and
each pipeline renders an image with depth. In a final composition stage,
all the images are merged with respect to their Z-buffers. The PixelFlow
architecture is also interesting because it used deferred shading, meaning
that it textured and shaded only visible fragments. It should be noted that
sort-last image cannot fully implement an API such as OpenGL, because
OpenGL requires that primitives be rendered in the order they are sent.
One problem with a pure sort-last scheme for large tiled display systems
is the sheer amount of image and depth data that needs to be transferred
between rendering nodes. Roth and Reiners [1086] optimize data trans-
fer and composition costs by using the screen and depth bounds of each
processor’s results.
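The final composition step can be sketched as a per-pixel depth test across pipelines. Representing each image as a list of (color, depth) pairs and the function name are assumptions made for illustration; they are not PixelFlow's actual interface:

```python
def depth_composite(images):
    """Merge independently rendered (color, depth) images by keeping,
    per pixel, the color with the smallest depth -- the composition
    stage of a sort-last image architecture.

    Each image is a list of (color, depth) pixels; all images cover
    the same pixels."""
    num_pixels = len(images[0])
    return [min((img[p] for img in images), key=lambda cd: cd[1])
            for p in range(num_pixels)]

# Two pipelines rendered the same three-pixel scanline; per pixel,
# the nearer surface survives.
a = [("red", 0.5), ("red", 0.2), ("red", 0.9)]
b = [("blue", 0.3), ("blue", 0.8), ("blue", 0.1)]
assert depth_composite([a, b]) == [("blue", 0.3), ("red", 0.2), ("blue", 0.1)]
```

Note that this per-pixel minimum is order-independent, which is precisely why such a scheme cannot honor an API's requirement that primitives be rendered in submission order.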
Eldridge et al. [302, 303] present “Pomegranate,” a sort-everywhere
architecture. Briefly, it inserts sort stages between the geometry stage
and the fragment generators (FGs), between FGs and fragment mergers
(FMs), and between FMs and the display. The work is therefore kept
more balanced as the system scales (i.e., as more pipelines are added).
The sorting stages are implemented as a high-speed network with point-
to-point links. Simulations showed a nearly linear performance increase as
more pipelines are added.
All the components in a graphics system (host, geometry processing,
and rasterization) connected together give us a multiprocessing system.
For such systems there are two problems that are well-known, and almost
always associated with multiprocessing: load balancing and communica-
tions [204]. FIFO (first-in, first-out) queues are often inserted into many
different places in the pipeline, so that jobs can be queued in order to
avoid stalling parts of the pipeline. For example, it is possible to put a
FIFO between the geometry and the rasterizer stage. The different sort
architectures described have different load balancing advantages and dis-
advantages. Consult Eldridge’s Ph.D. thesis [303] or the paper by Molnar
et al. [896] for more on these. The programmer can also affect the load
balance; techniques for doing so are discussed in Chapter 15. Communi-
cations can be a problem if the bandwidth of the buses is too low, or is
used unwisely. Therefore, it is of extreme importance to design a graphics
system so that the bottleneck does not occur in any of the buses, e.g., the
bus from the host to the graphics hardware. Bandwidth is discussed in
Section 18.3.4.
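The decoupling such a FIFO provides can be sketched as a bounded queue between a producer and a consumer stage. The class name and return conventions here are illustrative, not modeled on any real hardware interface:

```python
from collections import deque

class StageFifo:
    """A bounded FIFO decoupling two pipeline stages: the producer
    (e.g., geometry) stalls only when the queue is full, and the
    consumer (e.g., the rasterizer) idles only when it is empty."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.jobs = deque()

    def push(self, job):
        """Producer side: returns False when the queue is full (stall)."""
        if len(self.jobs) >= self.capacity:
            return False
        self.jobs.append(job)
        return True

    def pop(self):
        """Consumer side: returns None when the queue is empty (idle)."""
        return self.jobs.popleft() if self.jobs else None

fifo = StageFifo(capacity=2)
assert fifo.push("tri0") and fifo.push("tri1")
assert not fifo.push("tri2")      # full: the geometry stage must stall
assert fifo.pop() == "tri0"       # the rasterizer drains one job...
assert fifo.push("tri2")          # ...and geometry can proceed again
```

As long as the queue is neither full nor empty, momentary rate mismatches between the two stages do not stall either one.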
One significant development in commercial GPUs introduced in 2006
is the unified shader model. See Section 15.2 for a short description, and
Section 18.4.1, which describes the unified shader architecture of the Xbox
360.
Texture Access
Performance in terms of pure computation has grown exponentially
for many years. While processors have continued to increase in
speed at a rate in keeping with Moore’s Law, and graphics pipelines have
actually exceeded this rate [735], memory bandwidth and memory latency
have not kept up. Bandwidth is the rate at which data can be transferred,
and latency is the time between request and retrieval. While the capabil-
ity of a processor has been going up 71% per year, DRAM bandwidth is
improving about 25%, and latency a mere 5% [981]. For NVIDIA’s 8800
architecture [948], you can do about 14 floating-point operations per texel
access.6 Chip density rises faster than available bandwidth, so this ratio
will only increase [1400]. In addition, the trend is to use more and more
textures per primitive. Reading from texture memory is often the main con-
sumer of bandwidth [10]. Therefore, when reading out texels from memory,
care must be taken to reduce the bandwidth used and to hide latency.
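The quoted compute-to-texel ratio follows directly from the peak figures given in the footnote (520 GFLOPS against 38.4 billion texel accesses per second):

```python
# Peak figures quoted for the NVIDIA 8800 architecture.
peak_flops = 520e9           # floating-point operations per second
peak_texel_reads = 38.4e9    # texel accesses per second

flops_per_texel = peak_flops / peak_texel_reads
assert round(flops_per_texel) == 14   # "about 14 operations per texel access"
```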
To save bandwidth, most architectures use caches at various places in
the pipeline, and to hide latency, a technique called prefetching is often
used. Caching is implemented with a small on-chip memory (a few kilobytes)
where the results of recent texture reads are stored, and access is very
fast [489, 582, 583]. This memory is shared among all textures currently in
use. If neighboring pixels need to access the same or closely located texels,
they are likely to find these in the cache. This is what is done for standard
CPUs, as well. However, reading texels into the cache takes time, and most
often entire cache blocks (e.g., 32 bytes) are read in at once.
So, if a texel is not in the cache, it may take a relatively long time before it can
be found there. One solution employed by GPUs today to hide this latency
is to keep many fragments in flight at a time. Say the shader program is
about to execute a texture lookup instruction. If only one fragment is kept
in flight, we need to wait until the texels are available in the texture cache,
and this will keep the pipeline idle. However, assume that we can keep
100 fragments in flight at a time. First, fragment 0 will request a texture
access. However, since it will take many clock cycles before the requested
data is in the cache, the GPU executes the same texture lookup instruction
for fragment 1, and so on for all 100 fragments. When these 100 texture
lookup instructions have been executed, we are back at processing fragment
0. Say a dot product instruction, using the texture lookup as an argument,
follows. At this time, it is very likely that the requested data is in the
cache, and processing can continue immediately. This is a common way
of hiding latency in GPUs. This is also the reason why a GPU has many
registers. See the Xbox 360 description in Section 18.4.1 for an example.
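This round-robin scheme can be reduced to a toy model: issuing one texture request per fragment costs a cycle, so by the time the GPU returns to fragment 0, its data has had one cycle per in-flight fragment to arrive from memory. The fixed latency and the function name are simplifying assumptions for illustration:

```python
def idle_cycles_per_batch(fragments_in_flight, mem_latency_cycles):
    """Cycles the pipeline stalls waiting for fragment 0's texture data,
    assuming one texture request is issued per fragment per cycle."""
    return max(0, mem_latency_cycles - fragments_in_flight)

# One fragment in flight: the pipeline idles for nearly the whole latency.
assert idle_cycles_per_batch(1, 100) == 99
# 100 fragments in flight hide a 100-cycle latency completely.
assert idle_cycles_per_batch(100, 100) == 0
```

The model also shows why many registers are needed: each in-flight fragment must keep its intermediate values resident until its turn comes around again.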
6 This ratio was obtained using 520 GFLOPS as the computing power, and 38.4 billion texel accesses per second. It should be noted that these are peak performance numbers.
Mipmapping (see Section 6.2.2) is important for texture cache coher-
ence, since it enforces a maximum texel-pixel ratio. When traversing the
triangle, each new pixel represents a step in texture space of one texel at
most. Mipmapping is one of the few cases in rendering where a technique
improves both visuals and performance. There are other ways in which the
cache coherence of texture access can be improved. Texture swizzling is a
technique that can be used to improve the texture cache performance and
page locality. There are various methods of texture swizzling employed
by different GPU manufacturers, but all have the same goal: to maximize
texture cache locality regardless of the angle of access. We will describe a
swizzle pattern commonly used by NVIDIA GPUs—it is representative of
the patterns used.
Assume that the texture coordinates have been transformed to fixed-point
numbers: (u, v), where each of u and v has n bits. Bit number i of u is
denoted u_i. Then the remapping of (u, v) to a swizzled texture address is

texaddr(u, v) = texbaseaddr + (v_{n-1} u_{n-1} v_{n-2} u_{n-2} ... v_2 u_2 v_1 u_1 v_0 u_0) · texelsize.   (18.2)
Here, texelsize is the number of bytes occupied by one texel. The advantage
of this remapping is that it gives rise to the texel order shown in
Figure 18.12.

[Figure: an 8 × 8 grid of texels, each labeled with its byte address, 0 through 252 in steps of four, laid out in the swizzled order.]

Figure 18.12. Texture swizzling increases coherency of texel memory accesses. Note that
texel size here is four bytes, and that the texel address is shown in each texel's upper
left corner.
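Equation 18.2 is a bitwise interleave of u and v, i.e., a Morton (Z-order) curve. A small sketch, with an illustrative function name:

```python
def swizzled_address(u, v, n_bits, texel_size, base_addr=0):
    """Interleave the bits of v and u (v taking the higher bit of each
    pair) into a texel index, then scale by the texel size -- the
    remapping of Equation 18.2."""
    index = 0
    for i in range(n_bits):
        index |= ((u >> i) & 1) << (2 * i)
        index |= ((v >> i) & 1) << (2 * i + 1)
    return base_addr + index * texel_size

# Four-byte texels as in Figure 18.12: the 2x2 block at the origin maps
# to one contiguous 16-byte run, so nearby texels share cache blocks
# whether access steps in u or in v.
assert [swizzled_address(u, v, 3, 4) for v in (0, 1) for u in (0, 1)] == [0, 4, 8, 12]
assert swizzled_address(2, 0, 3, 4) == 16
assert swizzled_address(0, 2, 3, 4) == 32
```

Because every 2×2 block of texels lands in 16 consecutive bytes (and every 4×4 block in 64), a cache block fetched for one texel is likely to serve its neighbors regardless of the angle at which the texture is traversed.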