18. Graphics Hardware (6/10)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

854 18. Graphics Hardware

the latency when requesting access to texel data. One technique for hiding

this type of latency is described on page 847.

A diﬀerent type of latency is that from reading back data from GPU to

CPU. A good mental model is to think of the GPU and CPU as separate

computers working asynchronously, with communication between the two

taking some eﬀort. Latency from changing the direction of the ﬂow of

information can seriously hurt performance. When data is read back from

the accelerator, the pipeline often has to be ﬂushed before the read. During

this time, the CPU is idle. An example of a read-back mechanism that does

not produce a stall is the occlusion query. See Section 14.6.1. For occlusion

testing, the mechanism is to perform the query and then occasionally check

the GPU to see if the results of the query are available. While waiting for

the results, other work can then be done on both the CPU and GPU.

18.3.6 Buffer Compression

As can be seen in Equation 18.8, there are several components adding up

to the bandwidth traﬃc. To obtain an eﬃcient GPU architecture, one

needs to work on all fronts to reduce bandwidth usage. Even for occluded

triangles, there is still a substantial amount of Z-buﬀer bandwidth (d×Z

In addition, it would be advantageous to reduce the bandwidth traﬃc to

other buﬀers, such as color and stencil, as well. In this section, we will

present techniques to compress and decompress these types of buﬀers on

the ﬂy, i.e., as images are being rendered. This includes algorithms for

rapidly clearing the contents of the buﬀers. Automatic GPU occlusion

culling (called Z-culling), compression and decompression of buﬀers, and

rapid clears are typically implemented in the same subsystem in hardware,

since they share some parts of the implementation. For clarity, we present

buﬀer compression and fast clears together with the system ﬁrst, and then

discuss Z-culling.

Central to these algorithms is on-chip storage or a cache, which we

call the tile table, with additional information stored for each tile, i.e.,

rectangle, of pixels in the buﬀer. A block diagram of this algorithm is

shown in Figure 18.14, where we consider only Z-buﬀer compression, for

simplicity. However, this general architecture applies to other buﬀers, such

as color and stencil, as well. Each element in the tile table stores the status

of,say,an8× 8 tile of pixels in the frame buﬀer. This includes a state

and the maximum z-value, z

max

(and possibly z

min

as well—more about

that in Section 18.3.7), for the tile. As will be seen later, the z

max

can be

used for occlusion culling. The state of each tile can be either compressed,

uncompressed,orcleared. In general there could also be diﬀerent types of

compressed blocks. For example, one compressed mode might compress

down to 25% and another to 50%. These modes are used to implement a

18.3. Architecture 855

Tile table

8 x 8 uncompressed

z-values + z

max

Figure 18.14. Block diagram of hardware techniques for fast clear of the Z-buﬀer, com-

pression of Z-buﬀer, and Z-culling. Applies to other types of buﬀers too.

fast clear and compression of the buﬀer. When the system issues a clear

of the buﬀer, the state of each tile is set to cleared, and the frame buﬀer

proper is not touched. When the rasterizer needs to read the cleared buﬀer,

the decompressor unit ﬁrst checks the state, sees that the tile is cleared,

and so can send a buﬀer tile with all values set to the clear value for that

buﬀer (e.g., z

far

for the Z-buﬀer) without reading and uncompressing the

actual buﬀer. In this way, access to the buﬀer itself is minimized during

clears, so saving bandwidth. If the state is not cleared, then the buﬀer

for that tile has to be read. When the rasterizer has ﬁnished writing new

values to the tile, it is sent to the compressor, where an attempt is made at

compressing it. If there are two compression modes, both could be tried,

and the one that can compress that tile with smallest amount of bits is

used. Since we usually need lossless buﬀer compression, a fallback to using

uncompressed data is needed if all compression techniques fail. This also

implies that lossless buﬀer compression never can reduce memory usage in

the actual buﬀer—such techniques only reduce memory bandwidth usage.

In addition, for the Z-buﬀer, a new z

max

is also computed. This z

max

sent to the tile table. If compression succeeded, the tile’s state is set to

compressed and the data is sent in a compressed form. Otherwise, it is sent

uncompressed and the state is set to uncompressed.

The ATI Radeon cards from 2000 used a diﬀerential pulse code mod-

ulation (DPCM) scheme for compressing the 8 × 8 tiles [902]. This type

of algorithm is good when there is a great deal of coherence in the data

to be compressed. This is often the case for the contents of the Z-buﬀer,

as a triangle tends to cover many pixels, and objects consist of adjacent

triangles. In addition to a new algorithm for depth buﬀer compression, a

survey of existing depth buﬀer compression patents is given by Hasselgren

856 18. Graphics Hardware

and Akenine-M¨oller [511]. For color buﬀer compression, there is a similar

survey by Rasmusson et al. [1049], who also suggest that lossy compression

can be used for color buﬀer compression if the error is kept under strict

control. This may be reasonable in cases where reducing bandwidth is

extremely important, such as for a mobile phone.

18.3.7 Z-Culling and Early-Z

Testing the z-depth is a raster operation that is often done after almost all

the rest of the pipeline has done its work. Automatically killing a fragment

in a tile, before the pixel shader itself is accessed, is a useful optimization

performed by essentially all modern GPUs. This optimization is called

hierarchical Z-culling,orsimplyZ-culling. There are two variants, called

max

-culling and Z

min

-culling. These techniques are automatically used

by the GPU if the pixel shader is analyzed and found not to modify the

z-depth of the fragment [1065]. Doing so saves on pixel shader program

execution costs. In the next section an alternative culling technique is

described that can also be used for pixel shaders modifying the z-depth,

among other situations.

We start by describing Z

max

-culling. This is basically the same idea as

the hierarchical z-buﬀering algorithm (Section 14.6.2), with the exception

that each triangle is not tested for occlusion, and with the restriction that

the entire Z-pyramid is not used; instead only a few levels are used. To per-

form occlusion culling, the rasterizer fetches the z

max

for each tile of pixels

it processes, and it uses this value to exit the pipeline early for occluded

pixels. The size of the tiles may vary from architecture to architecture,

but 8 ×8 pixels is a common size [902]. For these techniques, the triangle

traversal visits a tile at a time. Imagine a triangle being rasterized, and

that a particular tile is visited. In order to test whether the triangle is

occluded inside this tile, we need to compute the minimum z-value, z

tri

min

on the triangle. If z

tri

min

max

, it is guaranteed that the triangle is oc-

cluded by previously rendered geometry in that tile, and all pixel shader

processing can be avoided. In practice, we cannot aﬀord to compute the ex-

act value of z

tri

min

, so instead, a conservative estimate is computed. Several

diﬀerent ways to compute z

tri

min

are possible, each with its own advantages

and disadvantages:

1. The minimum z-value of the three vertices of a triangle can be used

to test against the z

max

. This is not very accurate, but has little

overhead.

2. Evaluate the z-value at the four corners of a tile using the plane

equation of the triangle, and use the minimum of those.

18.3. Architecture 857

Best culling performance is obtained if these two strategies are combined.

This is done by taking the larger of the two z

min

-values. Using Z

max

-culling

means that both pixel shader processing and testing against the Z-buﬀer

can be avoided for most occluded pixels on rasterized primitives, and so

a signiﬁcant performance boost can be expected. If the tile or pixel is

determined not to be occluded, then processing continues as normal. It

should be noted that the description here use a hierarchical Z-buﬀer with

only two levels (the real Z-buﬀer, and the level at, say, 8 × 8 pixel tiles).

However, it is likely that GPUs are using hierarchical Z-buﬀers with more

than two levels.

The other technique is called Z

min

-culling, and the idea is to store

min

of all the pixels in a tile [13]. There are two uses for this. First,

it can be used to support diﬀerent depth tests. For the Z

max

-culling

method, we assumed a “less than” depth test. However, it would be

beneﬁcial if culling could be used with other depth tests as well, and if

min

and z

max

are both available, all depth tests can be supported in this

culling process. Second, it can be used to avoid Z-buﬀer reads. If a triangle

being rendered is deﬁnitely in front of all previously rendered

geometry, per-pixel depth testing is unnecessary. In some cases, Z-

buﬀer reads can be completely avoided, which further boosts

performance.

ATI’s implementation of this type of occlusion culling is called

HyperZ [1065], NVIDIA’s is called the Lightspeed Memory Architecture

(LMA). As always, occlusion culling beneﬁts from rendering front

to back.

This occlusion culling hardware can be of great beneﬁt to multipass

algorithms (see Section 7.9.1). The ﬁrst pass establishes the z-values and

the corresponding z

max

-values. Succeeding passes use all the z

max

-values,

which gives almost perfect occlusion culling for hidden triangles. Given

that pixel shaders are often expensive to evaluate, it can be proﬁtable to

render an extra pass at the beginning purely to establish the Z-buﬀer and

max

-values in advance.

If a tile survives Z-culling, there is still another test that a GPU can

apply per fragment, called early-Z [879, 1108]. The pixel shader program

is analyzed in the same way as for Z-culling. If it does not modify the z-

depth, and the rendering state is set for a typical Z-buﬀer test, then early-

Z can be performed. Before the execution of the pixel shader program,

the fragment’s precise z-depth value is computed and compared to the Z-

buﬀer’s stored value. If the fragment is occluded, it is discarded at this

point. This process thus avoids unnecessary execution of the pixel shader

program hardware. The early-Z test is often confused with Z-culling, but

it is performed by entirely separate hardware; either technique can be used

without the other.

858 18. Graphics Hardware

Another technique with a similar name, and similar intent, is the “early

z pass” method. This is a software technique, the idea being to render

geometry once and establish the Z-buﬀer depths. See Section 7.9.2.

18.3.8 PCU: Programmable Culling Unit

As seen in the previous section, Z-culling has to be disabled by the hard-

ware if the fragment shader writes out a custom depth per pixel. Render-

ing algorithms such as relief mapping and evaluating the sphere equation

inside a quad are two examples where the shader modiﬁes the z-depth.

Blythe [123] mentions that getting good performance relies on being able

to predict the shader output (on a tile basis). One reason why a fully

programmable output merger, i.e., blending, depth testing, stencil testing,

etc, was not included in DirectX 10 is that Z-culling is disabled when the

shader writes to the depth, for example.

The programmable culling unit (PCU) [513] solves this problem, but is

more general than that. The core idea is to run parts of the fragment shader

over an entire tile of pixels, and determine, for example, a lower bound on

the z-values, i.e., z

tri

min

(Section 18.3.7). In this example, the lower bound

can be used for Z-culling. Another example would be to determine if all

the pixels in a tile are completely in shadow, i.e., totally black. If this

is true, then the GPU can avoid executing large parts (or sometimes the

entire) fragment shader program for the current tile.

To compute these conservative bounds (e.g., z

tri

min

), the fragment shader

instructions are executed using interval arithmetic instead of ﬂoating-point

numbers. An interval, ˆa =[a

, a], is simply an interval of numbers, starting

from a

to a. An example is ˆa =[−1, 2], which is the entire set of numbers

from −1 to 2. Arithmetic operations can also be deﬁned to operate on

intervals [900]. Addition of two intervals results in an interval including

all possible combinations from the two interval operands. This can be

expressed as: ˆa +

b =[a

+ b, a + b]. As an example, [−1, 2] + [4, 5] = [3, 7].

If the fragment shader program contains a KIL instruction, all instruc-

tions that depend on this KIL are extracted into a cull shader program,

which will be executed once for the entire tile. All arithmetic operations

work on intervals in this cull shader, and all varying operands are intervals

as well. When the interval operand of the KIL instruction has been com-

puted, it is known whether all fragments in the tile can be killed without

executing the fragment shader for each fragment in the tile. Before start-

ing the execution of the cull shader, the varying input has to be converted

into intervals, and this is done using techniques similar to the one used for

computing z

tri

min

in the previous section.

As a simple example, imagine that the shader program computes diﬀuse

shading as d = l · n. It then tests this dot product to see if it is less than

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 18. Graphics Hardware (6/10)

Create new playlist

Sign In

Sign Up

Table of Contents for
18. Graphics Hardware (6/10)