844 18. Graphics Hardware
Figure 18.10. Sort-first splits the screen into separate tiles and assigns a processor to
each tile, as shown here. A primitive is then sent to the processors whose tiles it
overlaps. This is in contrast to a sort-middle architecture, which needs to sort all triangles
after geometry processing has occurred. Only after all triangles have been sorted can
per-pixel rasterization start. (Images courtesy of Marcus Roth and Dirk Reiners.)
Sort-first is the least explored architecture for a single machine [303, 896]. It is a
scheme that does see use when driving a system with multiple screens or
projectors forming a large display, as a single computer is dedicated to each
screen [1086]. A system called Chromium [577] has been developed, which
can implement any type of parallel rendering algorithm using a cluster of
workstations. For example, sort-first and sort-last can be implemented
with high rendering performance.
The Mali 200 (Section 18.4.3) is a sort-middle architecture. Geometry
processors are given arbitrary sets of primitives, with the goal of load-
balancing the work. Also, each rasterizer unit is responsible for a screen-
space region, here called a tile. This may be a rectangular region of pixels,
or a set of scanlines, or some other interleaved pattern (e.g., every eighth
pixel). Once a primitive is transformed to screen coordinates, sorting oc-
curs to route the processing of this primitive to the rasterizer units (FGs
and FMs) that are responsible for that tile of the screen. Note that a
transformed triangle or mesh may be sent to several rasterizer units if it
overlaps more than one tile. One limitation of these tiling architectures
is the difficulty of using various post-processing effects such as blurs or
glows, as these require communication between neighboring pixels, and so
neighboring tiles, of computed results. If the maximum neighborhood of
the filter is known in advance, each tile can be enlarged by this amount to
capture the data needed and so avoid the problem, at the cost of redundant
computation along the seams.
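The routing step these tile-based designs perform can be sketched from a primitive's screen-space bounding box. The tile size, tile counts, and function name below are illustrative choices for this sketch, not those of any particular GPU:

```python
def overlapped_tiles(bbox, tile_size, tiles_x, tiles_y):
    """Return the (tx, ty) indices of every screen tile that a primitive's
    screen-space bounding box touches -- the sorting step of a sort-middle,
    tiled architecture. bbox is (xmin, ymin, xmax, ymax) in pixels."""
    xmin, ymin, xmax, ymax = bbox
    tx0 = max(0, int(xmin) // tile_size)
    ty0 = max(0, int(ymin) // tile_size)
    tx1 = min(tiles_x - 1, int(xmax) // tile_size)
    ty1 = min(tiles_y - 1, int(ymax) // tile_size)
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]

# A triangle whose bounds straddle a 16-pixel tile boundary is routed to
# two rasterizer units, so its transformed data is duplicated.
assert overlapped_tiles((10, 2, 20, 8), 16, 4, 4) == [(0, 0), (1, 0)]
# A small triangle inside one tile is sent to a single unit.
assert overlapped_tiles((3, 3, 7, 7), 16, 4, 4) == [(0, 0)]
```

Every tile in the returned list receives a copy of the primitive, which is exactly the redundant work described above for primitives overlapping more than one tile.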
The sort-last fragment architecture sorts the fragments after fragment
generation (FG) and before fragment merge (FM). An example is the GPU
of the PLAYSTATION® 3 system described in Section 18.4.2. Just as
with sort-middle, primitives are spread as evenly as possible across the
geometry units. One advantage with sort-last fragment is that there will
not be any overlap, meaning that a generated fragment is sent to only
one FM, which is optimal. However, it may be hard to balance fragment
generation work. For example, assume that a set of large polygons happens
to be sent down to one FG unit. The FG unit will have to generate all the
fragments on these polygons, but the fragments are then sorted to the FM
units responsible for the respective pixels. Therefore, imbalance is likely to
occur for such cases [303].
Figure 18.11. In sort-last image, different objects in the scene are sent to different
processors. Transparency is difficult to deal with when compositing separate rendered
images, so transparent objects are usually sent to all nodes. (Images courtesy of Marcus
Roth and Dirk Reiners.)
Finally, the sort-last image architecture sorts after the entire rasterizer
stage (FG and FM). A visualization is shown in Figure 18.11. PixelFlow
[328, 895] is one such example. This architecture can be seen as a set of
independent pipelines. The primitives are spread across the pipelines, and
each pipeline renders an image with depth. In a final composition stage,
all the images are merged with respect to their Z-buffers. The PixelFlow
architecture is also interesting because it used deferred shading, meaning
that it textured and shaded only visible fragments. It should be noted that
sort-last image cannot fully implement an API such as OpenGL, because
OpenGL requires that primitives be rendered in the order they are sent.
One problem with a pure sort-last scheme for large tiled display systems
is the sheer amount of image and depth data that needs to be transferred
between rendering nodes. Roth and Reiners [1086] optimize data trans-
fer and composition costs by using the screen and depth bounds of each
processor’s results.
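The final composition step can be sketched as a per-pixel depth test across pipelines. Representing each image as a list of (color, depth) pairs and the function name are assumptions made for illustration; they are not PixelFlow's actual interface:

```python
def depth_composite(images):
    """Merge independently rendered (color, depth) images by keeping,
    per pixel, the color with the smallest depth -- the composition
    stage of a sort-last image architecture.

    Each image is a list of (color, depth) pixels; all images cover
    the same pixels."""
    num_pixels = len(images[0])
    return [min((img[p] for img in images), key=lambda cd: cd[1])
            for p in range(num_pixels)]

# Two pipelines rendered the same three-pixel scanline; per pixel,
# the nearer surface survives.
a = [("red", 0.5), ("red", 0.2), ("red", 0.9)]
b = [("blue", 0.3), ("blue", 0.8), ("blue", 0.1)]
assert depth_composite([a, b]) == [("blue", 0.3), ("red", 0.2), ("blue", 0.1)]
```

Note that this per-pixel minimum is order-independent, which is precisely why such a scheme cannot honor an API's requirement that primitives be rendered in submission order.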
Eldridge et al. [302, 303] present “Pomegranate,” a sort-everywhere
architecture. Briefly, it inserts sort stages between the geometry stage
and the fragment generators (FGs), between FGs and fragment mergers
(FMs), and between FMs and the display. The work is therefore kept
more balanced as the system scales (i.e., as more pipelines are added).
The sorting stages are implemented as a high-speed network with point-
to-point links. Simulations showed a nearly linear performance increase as
more pipelines are added.
All the components in a graphics system (host, geometry processing,
and rasterization) connected together give us a multiprocessing system.
For such systems there are two problems that are well-known, and almost
always associated with multiprocessing: load balancing and communica-
tions [204]. FIFO (first-in, first-out) queues are often inserted into many
different places in the pipeline, so that jobs can be queued in order to
avoid stalling parts of the pipeline. For example, it is possible to put a
FIFO between the geometry and the rasterizer stage. The different sort
architectures described have different load balancing advantages and dis-
advantages. Consult Eldridge’s Ph.D. thesis [303] or the paper by Molnar
et al. [896] for more on these. The programmer can also affect the load
balance; techniques for doing so are discussed in Chapter 15. Communi-
cations can be a problem if the bandwidth of the buses is too low, or is
used unwisely. Therefore, it is of extreme importance to design a graphics
system so that the bottleneck does not occur in any of the buses, e.g., the
bus from the host to the graphics hardware. Bandwidth is discussed in
Section 18.3.4.
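The decoupling such a FIFO provides can be sketched as a bounded queue between a producer and a consumer stage. The class name and return conventions here are illustrative, not modeled on any real hardware interface:

```python
from collections import deque

class StageFifo:
    """A bounded FIFO decoupling two pipeline stages: the producer
    (e.g., geometry) stalls only when the queue is full, and the
    consumer (e.g., the rasterizer) idles only when it is empty."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.jobs = deque()

    def push(self, job):
        """Producer side: returns False when the queue is full (stall)."""
        if len(self.jobs) >= self.capacity:
            return False
        self.jobs.append(job)
        return True

    def pop(self):
        """Consumer side: returns None when the queue is empty (idle)."""
        return self.jobs.popleft() if self.jobs else None

fifo = StageFifo(capacity=2)
assert fifo.push("tri0") and fifo.push("tri1")
assert not fifo.push("tri2")      # full: the geometry stage must stall
assert fifo.pop() == "tri0"       # the rasterizer drains one job...
assert fifo.push("tri2")          # ...and geometry can proceed again
```

As long as the queue is neither full nor empty, momentary rate mismatches between the two stages do not stall either one.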
One significant development in commercial GPUs introduced in 2006
is the unified shader model. See Section 15.2 for a short description, and
Section 18.4.1, which describes the unified shader architecture of the Xbox
360.
Texture Access
Performance in terms of pure computation has grown exponentially
for many years. While processors have continued to increase in
speed at a rate in keeping with Moore’s Law, and graphics pipelines have
actually exceeded this rate [735], memory bandwidth and memory latency
have not kept up. Bandwidth is the rate at which data can be transferred,
and latency is the time between request and retrieval. While the capabil-
ity of a processor has been going up 71% per year, DRAM bandwidth is
improving about 25%, and latency a mere 5% [981]. For NVIDIA’s 8800
architecture [948], you can do about 14 floating-point operations per texel
access.6 Chip density rises faster than available bandwidth, so this ratio
will only increase [1400]. In addition, the trend is to use more and more
textures per primitive. Reading from texture memory is often the main con-
sumer of bandwidth [10]. Therefore, when reading out texels from memory,
care must be taken to reduce the bandwidth used and to hide latency.
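The quoted compute-to-texel ratio follows directly from the peak figures given in the footnote (520 GFLOPS against 38.4 billion texel accesses per second):

```python
# Peak figures quoted for the NVIDIA 8800 architecture.
peak_flops = 520e9           # floating-point operations per second
peak_texel_reads = 38.4e9    # texel accesses per second

flops_per_texel = peak_flops / peak_texel_reads
assert round(flops_per_texel) == 14   # "about 14 operations per texel access"
```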
To save bandwidth, most architectures use caches at various places in
the pipeline, and to hide latency, a technique called prefetching is often
used. Caching is implemented with a small on-chip memory (a few kilobytes)
where the results of recent texture reads are stored, and access is very
fast [489, 582, 583]. This memory is shared among all textures currently in
use. If neighboring pixels need to access the same or closely located texels,
they are likely to find these in the cache. This is what is done for standard
CPUs, as well. However, reading texels into the cache takes time, and most
often entire cache blocks (e.g., 32 bytes) are read in at once.
So, if a texel is not in the cache, it may take a relatively long time before it can
be found there. One solution employed by GPUs today to hide this latency
is to keep many fragments in flight at a time. Say the shader program is
about to execute a texture lookup instruction. If only one fragment is kept
in flight, we need to wait until the texels are available in the texture cache,
and this will keep the pipeline idle. However, assume that we can keep
100 fragments in flight at a time. First, fragment 0 will request a texture
access. However, since it will take many clock cycles before the requested
data is in the cache, the GPU executes the same texture lookup instruction
for fragment 1, and so on for all 100 fragments. When these 100 texture
lookup instructions have been executed, we are back at processing fragment
0. Say a dot product instruction, using the texture lookup as an argument,
follows. At this time, it is very likely that the requested data is in the
cache, and processing can continue immediately. This is a common way
of hiding latency in GPUs. This is also the reason why a GPU has many
registers. See the Xbox 360 description in Section 18.4.1 for an example.
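This round-robin scheme can be reduced to a toy model: issuing one texture request per fragment costs a cycle, so by the time the GPU returns to fragment 0, its data has had one cycle per in-flight fragment to arrive from memory. The fixed latency and the function name are simplifying assumptions for illustration:

```python
def idle_cycles_per_batch(fragments_in_flight, mem_latency_cycles):
    """Cycles the pipeline stalls waiting for fragment 0's texture data,
    assuming one texture request is issued per fragment per cycle."""
    return max(0, mem_latency_cycles - fragments_in_flight)

# One fragment in flight: the pipeline idles for nearly the whole latency.
assert idle_cycles_per_batch(1, 100) == 99
# 100 fragments in flight hide a 100-cycle latency completely.
assert idle_cycles_per_batch(100, 100) == 0
```

The model also shows why many registers are needed: each in-flight fragment must keep its intermediate values resident until its turn comes around again.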
6 This ratio was obtained using 520 GFLOPS as the computing power, and 38.4 billion texel accesses per second. It should be noted that these are peak performance numbers.
Mipmapping (see Section 6.2.2) is important for texture cache coher-
ence, since it enforces a maximum texel-pixel ratio. When traversing the
triangle, each new pixel represents a step in texture space of one texel at
most. Mipmapping is one of the few cases in rendering where a technique
improves both visuals and performance. There are other ways in which the
cache coherence of texture access can be improved. Texture swizzling is a
technique that can be used to improve the texture cache performance and
page locality. There are various methods of texture swizzling employed
by different GPU manufacturers, but all have the same goal: to maximize
texture cache locality regardless of the angle of access. We will describe a
swizzle pattern commonly used by NVIDIA GPUs—it is representative of
the patterns used.
Assume that the texture coordinates have been transformed to fixed-point
numbers: (u, v), where each of u and v has n bits. Bit number i of u is
denoted u_i. Then the remapping of (u, v) to a swizzled texture address is

texaddr(u, v) = texbaseaddr + (v_{n-1} u_{n-1} v_{n-2} u_{n-2} ... v_2 u_2 v_1 u_1 v_0 u_0) · texelsize.   (18.2)
Here, texelsize is the number of bytes occupied by one texel. The advantage
of this remapping is that it gives rise to the texel order shown in
Figure 18.12.

[Figure: an 8 × 8 grid of texels, each labeled with its byte address, 0 through 252 in steps of four, laid out in the swizzled order.]

Figure 18.12. Texture swizzling increases coherency of texel memory accesses. Note that
texel size here is four bytes, and that the texel address is shown in each texel's upper
left corner.
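Equation 18.2 is a bitwise interleave of u and v, i.e., a Morton (Z-order) curve. A small sketch, with an illustrative function name:

```python
def swizzled_address(u, v, n_bits, texel_size, base_addr=0):
    """Interleave the bits of v and u (v taking the higher bit of each
    pair) into a texel index, then scale by the texel size -- the
    remapping of Equation 18.2."""
    index = 0
    for i in range(n_bits):
        index |= ((u >> i) & 1) << (2 * i)
        index |= ((v >> i) & 1) << (2 * i + 1)
    return base_addr + index * texel_size

# Four-byte texels as in Figure 18.12: the 2x2 block at the origin maps
# to one contiguous 16-byte run, so nearby texels share cache blocks
# whether access steps in u or in v.
assert [swizzled_address(u, v, 3, 4) for v in (0, 1) for u in (0, 1)] == [0, 4, 8, 12]
assert swizzled_address(2, 0, 3, 4) == 16
assert swizzled_address(0, 2, 3, 4) == 32
```

Because every 2×2 block of texels lands in 16 consecutive bytes (and every 4×4 block in 64), a cache block fetched for one texel is likely to serve its neighbors regardless of the angle at which the texture is traversed.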