i
i
i
i
i
i
i
i
854 18. Graphics Hardware
the latency when requesting access to texel data. One technique for hiding
this type of latency is described on page 847.
A different type of latency is that from reading back data from GPU to
CPU. A good mental model is to think of the GPU and CPU as separate
computers working asynchronously, with communication between the two
taking some effort. Latency from changing the direction of the flow of
information can seriously hurt performance. When data is read back from
the accelerator, the pipeline often has to be flushed before the read. During
this time, the CPU is idle. An example of a read-back mechanism that does
not produce a stall is the occlusion query. See Section 14.6.1. For occlusion
testing, the mechanism is to perform the query and then occasionally check
the GPU to see if the results of the query are available. While waiting for
the results, other work can then be done on both the CPU and GPU.
18.3.6 Buffer Compression
As can be seen in Equation 18.8, there are several components adding up
to the bandwidth traffic. To obtain an efficient GPU architecture, one
needs to work on all fronts to reduce bandwidth usage. Even for occluded
triangles, there is still a substantial amount of Z-buffer bandwidth (d×Z
r
).
In addition, it would be advantageous to reduce the bandwidth traffic to
other buffers, such as color and stencil, as well. In this section, we will
present techniques to compress and decompress these types of buffers on
the fly, i.e., as images are being rendered. This includes algorithms for
rapidly clearing the contents of the buffers. Automatic GPU occlusion
culling (called Z-culling), compression and decompression of buffers, and
rapid clears are typically implemented in the same subsystem in hardware,
since they share some parts of the implementation. For clarity, we present
buffer compression and fast clears together with the system first, and then
discuss Z-culling.
Central to these algorithms is on-chip storage or a cache, which we
call the tile table, with additional information stored for each tile, i.e.,
rectangle, of pixels in the buffer. A block diagram of this algorithm is
shown in Figure 18.14, where we consider only Z-buffer compression, for
simplicity. However, this general architecture applies to other buffers, such
as color and stencil, as well. Each element in the tile table stores the status
of,say,an8× 8 tile of pixels in the frame buffer. This includes a state
and the maximum z-value, z
max
(and possibly z
min
as well—more about
that in Section 18.3.7), for the tile. As will be seen later, the z
max
can be
used for occlusion culling. The state of each tile can be either compressed,
uncompressed,orcleared. In general there could also be different types of
compressed blocks. For example, one compressed mode might compress
down to 25% and another to 50%. These modes are used to implement a
i
i
i
i
i
i
i
i
18.3. Architecture 855
Tile table
8 x 8 uncompressed
z-values + z
max
Figure 18.14. Block diagram of hardware techniques for fast clear of the Z-buffer, com-
pression of Z-buffer, and Z-culling. Applies to other types of buffers too.
fast clear and compression of the buffer. When the system issues a clear
of the buffer, the state of each tile is set to cleared, and the frame buffer
proper is not touched. When the rasterizer needs to read the cleared buffer,
the decompressor unit first checks the state, sees that the tile is cleared,
and so can send a buffer tile with all values set to the clear value for that
buffer (e.g., z
far
for the Z-buffer) without reading and uncompressing the
actual buffer. In this way, access to the buffer itself is minimized during
clears, so saving bandwidth. If the state is not cleared, then the buffer
for that tile has to be read. When the rasterizer has finished writing new
values to the tile, it is sent to the compressor, where an attempt is made at
compressing it. If there are two compression modes, both could be tried,
and the one that can compress that tile with smallest amount of bits is
used. Since we usually need lossless buffer compression, a fallback to using
uncompressed data is needed if all compression techniques fail. This also
implies that lossless buffer compression never can reduce memory usage in
the actual buffer—such techniques only reduce memory bandwidth usage.
In addition, for the Z-buffer, a new z
max
is also computed. This z
max
is
sent to the tile table. If compression succeeded, the tile’s state is set to
compressed and the data is sent in a compressed form. Otherwise, it is sent
uncompressed and the state is set to uncompressed.
The ATI Radeon cards from 2000 used a differential pulse code mod-
ulation (DPCM) scheme for compressing the 8 × 8 tiles [902]. This type
of algorithm is good when there is a great deal of coherence in the data
to be compressed. This is often the case for the contents of the Z-buffer,
as a triangle tends to cover many pixels, and objects consist of adjacent
triangles. In addition to a new algorithm for depth buffer compression, a
survey of existing depth buffer compression patents is given by Hasselgren
i
i
i
i
i
i
i
i
856 18. Graphics Hardware
and Akenine-M¨oller [511]. For color buffer compression, there is a similar
survey by Rasmusson et al. [1049], who also suggest that lossy compression
can be used for color buffer compression if the error is kept under strict
control. This may be reasonable in cases where reducing bandwidth is
extremely important, such as for a mobile phone.
18.3.7 Z-Culling and Early-Z
Testing the z-depth is a raster operation that is often done after almost all
the rest of the pipeline has done its work. Automatically killing a fragment
in a tile, before the pixel shader itself is accessed, is a useful optimization
performed by essentially all modern GPUs. This optimization is called
hierarchical Z-culling,orsimplyZ-culling. There are two variants, called
Z
max
-culling and Z
min
-culling. These techniques are automatically used
by the GPU if the pixel shader is analyzed and found not to modify the
z-depth of the fragment [1065]. Doing so saves on pixel shader program
execution costs. In the next section an alternative culling technique is
described that can also be used for pixel shaders modifying the z-depth,
among other situations.
We start by describing Z
max
-culling. This is basically the same idea as
the hierarchical z-buffering algorithm (Section 14.6.2), with the exception
that each triangle is not tested for occlusion, and with the restriction that
the entire Z-pyramid is not used; instead only a few levels are used. To per-
form occlusion culling, the rasterizer fetches the z
max
for each tile of pixels
it processes, and it uses this value to exit the pipeline early for occluded
pixels. The size of the tiles may vary from architecture to architecture,
but 8 ×8 pixels is a common size [902]. For these techniques, the triangle
traversal visits a tile at a time. Imagine a triangle being rasterized, and
that a particular tile is visited. In order to test whether the triangle is
occluded inside this tile, we need to compute the minimum z-value, z
tri
min
,
on the triangle. If z
tri
min
>z
max
, it is guaranteed that the triangle is oc-
cluded by previously rendered geometry in that tile, and all pixel shader
processing can be avoided. In practice, we cannot afford to compute the ex-
act value of z
tri
min
, so instead, a conservative estimate is computed. Several
different ways to compute z
tri
min
are possible, each with its own advantages
and disadvantages:
1. The minimum z-value of the three vertices of a triangle can be used
to test against the z
max
. This is not very accurate, but has little
overhead.
2. Evaluate the z-value at the four corners of a tile using the plane
equation of the triangle, and use the minimum of those.
i
i
i
i
i
i
i
i
18.3. Architecture 857
Best culling performance is obtained if these two strategies are combined.
This is done by taking the larger of the two z
min
-values. Using Z
max
-culling
means that both pixel shader processing and testing against the Z-buffer
can be avoided for most occluded pixels on rasterized primitives, and so
a significant performance boost can be expected. If the tile or pixel is
determined not to be occluded, then processing continues as normal. It
should be noted that the description here use a hierarchical Z-buffer with
only two levels (the real Z-buffer, and the level at, say, 8 × 8 pixel tiles).
However, it is likely that GPUs are using hierarchical Z-buffers with more
than two levels.
The other technique is called Z
min
-culling, and the idea is to store
z
min
of all the pixels in a tile [13]. There are two uses for this. First,
it can be used to support different depth tests. For the Z
max
-culling
method, we assumed a “less than” depth test. However, it would be
beneficial if culling could be used with other depth tests as well, and if
z
min
and z
max
are both available, all depth tests can be supported in this
culling process. Second, it can be used to avoid Z-buffer reads. If a triangle
being rendered is definitely in front of all previously rendered
geometry, per-pixel depth testing is unnecessary. In some cases, Z-
buffer reads can be completely avoided, which further boosts
performance.
ATI’s implementation of this type of occlusion culling is called
HyperZ [1065], NVIDIA’s is called the Lightspeed Memory Architecture
(LMA). As always, occlusion culling benefits from rendering front
to back.
This occlusion culling hardware can be of great benefit to multipass
algorithms (see Section 7.9.1). The first pass establishes the z-values and
the corresponding z
max
-values. Succeeding passes use all the z
max
-values,
which gives almost perfect occlusion culling for hidden triangles. Given
that pixel shaders are often expensive to evaluate, it can be profitable to
render an extra pass at the beginning purely to establish the Z-buffer and
z
max
-values in advance.
If a tile survives Z-culling, there is still another test that a GPU can
apply per fragment, called early-Z [879, 1108]. The pixel shader program
is analyzed in the same way as for Z-culling. If it does not modify the z-
depth, and the rendering state is set for a typical Z-buffer test, then early-
Z can be performed. Before the execution of the pixel shader program,
the fragment’s precise z-depth value is computed and compared to the Z-
buffer’s stored value. If the fragment is occluded, it is discarded at this
point. This process thus avoids unnecessary execution of the pixel shader
program hardware. The early-Z test is often confused with Z-culling, but
it is performed by entirely separate hardware; either technique can be used
without the other.
i
i
i
i
i
i
i
i
858 18. Graphics Hardware
Another technique with a similar name, and similar intent, is the “early
z pass” method. This is a software technique, the idea being to render
geometry once and establish the Z-buffer depths. See Section 7.9.2.
18.3.8 PCU: Programmable Culling Unit
As seen in the previous section, Z-culling has to be disabled by the hard-
ware if the fragment shader writes out a custom depth per pixel. Render-
ing algorithms such as relief mapping and evaluating the sphere equation
inside a quad are two examples where the shader modifies the z-depth.
Blythe [123] mentions that getting good performance relies on being able
to predict the shader output (on a tile basis). One reason why a fully
programmable output merger, i.e., blending, depth testing, stencil testing,
etc, was not included in DirectX 10 is that Z-culling is disabled when the
shader writes to the depth, for example.
The programmable culling unit (PCU) [513] solves this problem, but is
more general than that. The core idea is to run parts of the fragment shader
over an entire tile of pixels, and determine, for example, a lower bound on
the z-values, i.e., z
tri
min
(Section 18.3.7). In this example, the lower bound
can be used for Z-culling. Another example would be to determine if all
the pixels in a tile are completely in shadow, i.e., totally black. If this
is true, then the GPU can avoid executing large parts (or sometimes the
entire) fragment shader program for the current tile.
To compute these conservative bounds (e.g., z
tri
min
), the fragment shader
instructions are executed using interval arithmetic instead of floating-point
numbers. An interval, ˆa =[a
, a], is simply an interval of numbers, starting
from a
to a. An example is ˆa =[1, 2], which is the entire set of numbers
from 1 to 2. Arithmetic operations can also be defined to operate on
intervals [900]. Addition of two intervals results in an interval including
all possible combinations from the two interval operands. This can be
expressed as: ˆa +
ˆ
b =[a
+ b, a + b]. As an example, [1, 2] + [4, 5] = [3, 7].
If the fragment shader program contains a KIL instruction, all instruc-
tions that depend on this KIL are extracted into a cull shader program,
which will be executed once for the entire tile. All arithmetic operations
work on intervals in this cull shader, and all varying operands are intervals
as well. When the interval operand of the KIL instruction has been com-
puted, it is known whether all fragments in the tile can be killed without
executing the fragment shader for each fragment in the tile. Before start-
ing the execution of the cull shader, the varying input has to be converted
into intervals, and this is done using techniques similar to the one used for
computing z
tri
min
in the previous section.
As a simple example, imagine that the shader program computes diffuse
shading as d = l · n. It then tests this dot product to see if it is less than
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset