18.4. Case Studies 859
zero (facing away from the light). If so, the program sends a KIL command,
which terminates the fragment if the dot product is less than zero. Assume
that the interval result of the dot product is [−1.2, −0.3]. This means that
all possible combinations of the light vectors, l, and the normals, n, for the
fragments in the current tile evaluate to a negative dot product. The tile
is then culled because the diffuse shading is guaranteed to be zero.
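The culling test described above can be sketched with a small interval type. This is only an illustration, not the PCU's actual instruction set: the `Interval` class and the function names are hypothetical, and a real implementation would also round interval bounds outward to stay conservative under floating-point error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] bounding a value over all fragments in a tile."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # The product's bounds are among the four endpoint products.
        products = (self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi)
        return Interval(min(products), max(products))


def interval_dot(l, n):
    """Dot product of two 3-vectors whose components are intervals."""
    return l[0] * n[0] + l[1] * n[1] + l[2] * n[2]


def tile_can_be_culled(l, n):
    # Cull only if even the upper bound of l . n is negative, i.e., every
    # fragment in the tile is guaranteed to face away from the light.
    return interval_dot(l, n).hi < 0.0
```

If the upper bound is zero or positive, some fragment in the tile may be lit, so the tile must be shaded normally.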
Assume that the tile size is 8 × 8 pixels, and that the cull shader program
uses four⁷ times [513] as many instructions as the fragment shader
program. This is so because evaluating the cull shader is done using interval
arithmetic, which is more costly than using plain floating-point arithmetic.
In this case, it is possible to get a 16× speedup (in theory) for tiles that can
be culled: the cull shader is evaluated only once per tile, at the cost of four
fragment shader executions, instead of running the fragment shader for all 64
pixels. For tiles that cannot be culled, there is actually a slight
slowdown, since you need to execute the PCU’s cull shader program once,
and then the normal 8 × 8 pixel shader programs. Overall, the performance
gains are significant, with a 1.4–2.1 times speedup. Currently there are no
commercial implementations available, but the PCU solves an important
problem. It allows programmers to use more general shader programs (e.g.,
modified z-depth) while maintaining performance. A research implemen-
tation shows that the size of a simple fragment shader unit increases by
less than 10% in terms of gates. It should be straightforward to add PCU
hardware to a unified shader architecture.
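The theoretical 16× figure and the slight slowdown for tiles that are not culled follow from a simple instruction count. The cost model below is a sketch under the assumptions stated in the text (8 × 8 tiles, a cull shader four times as expensive as one fragment shader execution); the function names are hypothetical.

```python
def tile_cost(frag_instr, culled, tile_w=8, tile_h=8, cull_factor=4):
    """Instructions executed for one tile when a cull shader is used."""
    cull_cost = cull_factor * frag_instr        # one cull shader run per tile
    if culled:
        return cull_cost                        # fragment shaders are skipped
    return cull_cost + tile_w * tile_h * frag_instr


def baseline_cost(frag_instr, tile_w=8, tile_h=8):
    """Instructions executed for one tile without any cull shader."""
    return tile_w * tile_h * frag_instr


def speedup(frag_instr, culled):
    return baseline_cost(frag_instr) / tile_cost(frag_instr, culled)
```

Under this model a culled tile costs 4 instead of 64 shader executions, and a tile that is not culled costs 68 instead of 64, a 6.25% overhead.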
18.4 Case Studies
In this section, three different graphics hardware architectures will be
presented. The Xbox 360 is described first, followed by the PLAYSTATION® 3
system architecture. Finally, an architecture that targets mobile devices,
called Mali, is presented.
18.4.1 Case Study: Xbox 360
The Xbox 360 is a game console built by Microsoft with a graphics hardware
solution by ATI/AMD [268]. It is built around the memory architecture
illustrated in Figure 18.15. The main GPU chip also acts as a memory
controller (often called the north bridge), so the CPU accesses the system
memory through the GPU chip. In a sense, this is a kind of unified
memory architecture (UMA). However, as can be seen, there is also memory
that only the GPU can access.
⁷ The exact number is somewhere between two and four per instruction. However,
the cull shader program is often shorter than the actual fragment shader program.
860 18. Graphics Hardware
Figure 18.15. Xbox 360 memory architecture. (Diagram labels: CPU with L1 and L2 caches; GPU chip/north bridge; system RAM, 512 MB; south bridge for USB, audio, network, etc.; daughter chip containing AZ logic and eDRAM. Link bandwidths shown: 10.8 GB/s (twice), 22.4 GB/s, 32 GB/s, and 256 GB/s.)
In this overview of the Xbox 360 architecture, we will focus on the
GPU and its daughter chip. One design choice in the Xbox 360 was to
use embedded DRAM (eDRAM) for the frame buffers, and this memory is
embedded in a separate daughter chip. The main point of this design is
that the daughter chip also has some extra logic, here called AZ, which
takes care of all alpha and depth testing, and related operations. Hence,
eDRAM is sometimes called “intelligent” RAM, because it can actually do
something more than just store content. Instead of being accessed in main
memory, the frame buffer is accessed through the eDRAM with AZ
logic, and this gives high performance at a low cost. The eDRAM has
10 × 1024² bytes, i.e., 10 MB of storage. In addition, the Xbox 360 is
designed around the concept of a unified shader architecture, which will be
described later.
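To get a feeling for what 10 MB of eDRAM can hold, the following sketch computes frame buffer footprints. The resolutions, the 4 bytes of color and 4 bytes of depth/stencil per sample, and the assumption that samples are stored uncompressed are illustrative choices, not figures taken from the text.

```python
EDRAM_BYTES = 10 * 1024**2   # 10 MB, as stated in the text


def framebuffer_bytes(width, height, samples_per_pixel=1,
                      color_bytes=4, depth_stencil_bytes=4):
    """Storage for color plus depth/stencil, assuming uncompressed samples."""
    per_sample = color_bytes + depth_stencil_bytes
    return width * height * samples_per_pixel * per_sample


def fits_in_edram(width, height, samples_per_pixel=1):
    return framebuffer_bytes(width, height, samples_per_pixel) <= EDRAM_BYTES
```

Under these assumptions, a 1280 × 720 buffer with one sample per pixel needs 7,372,800 bytes and fits, while the same buffer with four samples per pixel does not.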
A block diagram of the Xbox 360 GPU can be found in Figure 18.16.
Rendering commences when the command processor starts reading commands
from the main memory. These can be either the start of a rendering
batch or state changes. Drawing commands are then forwarded to
the vertex grouper and tessellator (VGT) unit. This unit may receive a
stream of vertex indices, which it groups into primitives (e.g., triangles or
lines). This unit also acts as a vertex cache after transforms, and hence
already-transformed vertices may be reused. In addition, tessellation can
be performed here, as described in Section 13.6.
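The two roles of the VGT, grouping indices into primitives and reusing already-transformed vertices, can be sketched as follows. The FIFO replacement policy and the cache size are assumptions made for illustration; the text does not specify how the hardware cache is organized.

```python
from collections import deque


def group_triangles(indices):
    """Group a flat index stream into triangles (triangle-list layout)."""
    return [tuple(indices[i:i + 3]) for i in range(0, len(indices) - 2, 3)]


def count_vertex_shader_runs(indices, cache_size):
    """Count vertex shader executions with a FIFO post-transform cache."""
    cache = deque(maxlen=cache_size)   # oldest entry is evicted when full
    runs = 0
    for idx in indices:
        if idx not in cache:           # miss: the vertex must be transformed
            runs += 1
            cache.append(idx)
    return runs
```

For two triangles sharing an edge, `[0, 1, 2, 2, 1, 3]`, a cache of four entries needs only four vertex shader executions instead of six.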
Figure 18.16. Block diagram of the Xbox 360 graphics processor. (Units shown: command processor; vertex grouper & tessellator; primitive assembler; triangle traversal; hierarchical Z/stencil; sequencer; shader pipe interpolators; unified shader pipes with ALU sets 0, 1, and 2; vertex cache; texture cache; texture pipes; shader export; backend central; memory hub with two memory controllers; and the connections to and from the daughter chip.)
The instruction set for the Xbox 360 is a superset of SM 3.0 (not a full
SM 4.0), and its vertex and pixel shader languages have converged, i.e.,
they share the same common core. For this reason, a single shader element
can be designed and implemented that can execute both vertex and pixel
shader programs. This is the main idea of a unified shader architecture.
In the Xbox 360, shader execution is always done on vectors of 64 vertices
or pixels. Each such vector is executed as a thread, and up to 32 vertex
threads or 64 pixel threads can be active at a time. So while a thread
is waiting for, say, a texture fetch from memory, another thread can be
scheduled for execution on the unified shader pipes. When the requested
memory content is available, the previous thread can be switched back to
active, and execution can continue. This greatly helps to hide latency, and
it also provides for much better load balancing than previous non-unified
architectures. As an example of this, consider a typical GPGPU application
that renders a screen-size quad with long pixel shader programs. In this
case, few shader cores will be used for vertex computations in the beginning,
and then all shader cores will be used for per-pixel computations. Note also
that all execution is performed on 32-bit floating point numbers, and that
all threads share a very large register file containing registers for the shader
computations. For the Xbox 360, there are 24,576 registers. In general,
the more registers a shader needs, the fewer threads can be in flight at a
time. This concept will be discussed in detail in the next case study, on
the PLAYSTATION® 3 system.
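The trade-off between register usage and the number of threads in flight can be made concrete with a simplified model. It assumes that every entry of a 64-wide vector needs its own copy of each register and that the whole register file is available to one type of thread; the real allocation scheme may differ.

```python
TOTAL_REGISTERS = 24_576     # register file size given in the text
VECTOR_WIDTH = 64            # vertices or pixels per thread
MAX_VERTEX_THREADS = 32
MAX_PIXEL_THREADS = 64


def threads_in_flight(regs_per_shader, max_threads):
    """Threads that fit in the register file (simplified model)."""
    if regs_per_shader <= 0:
        raise ValueError("a shader must use at least one register")
    regs_per_thread = regs_per_shader * VECTOR_WIDTH
    return min(max_threads, TOTAL_REGISTERS // regs_per_thread)
```

In this model, a pixel shader using 4 registers can keep all 64 pixel threads active, one using 8 registers is limited to 48, and one using 24 registers to 16, which leaves fewer threads to switch to while waiting for memory.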
The sequencer (Figure 18.16) receives vertex indices from the VGT. Its
core task is to schedule threads for execution. Depending on what needs
to be processed, this is done by sending vertex and pixel vectors to units
such as the vertex cache, the texture pipes, or the unified shader pipes.
In the unified shader pipes unit, there are three sets of ALUs (arithmetic
logic units), each consisting of 16 small shader cores. One such set of cores
can execute one operation on a vertex vector (64 entries) or pixel vector
(64 entries) over four clock cycles. Each shader core can execute one vector
operation and one scalar operation per cycle.
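The four-cycle figure follows directly from the vector width and the number of cores per ALU set. The sketch below also gives an idealized cycle estimate for a batch of threads; it assumes perfect utilization and no stalls, which real workloads will not achieve.

```python
ALU_SETS = 3
CORES_PER_SET = 16
VECTOR_WIDTH = 64


def cycles_per_vector_instruction():
    """Cycles for one ALU set to apply one instruction to a whole vector."""
    return -(-VECTOR_WIDTH // CORES_PER_SET)    # ceiling division


def ideal_shader_cycles(num_vectors, instructions_per_shader):
    """Idealized cycles to shade num_vectors threads on all ALU sets."""
    rounds = -(-num_vectors // ALU_SETS)        # vectors processed 3 at a time
    return rounds * instructions_per_shader * cycles_per_vector_instruction()
```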
After a vertex or pixel shader program has been executed, the results
are forwarded to the shader export. If a vertex shader has been executed,
all the outputs, such as transformed vertex coordinates and texture
coordinates, are “exported” back into the primitive assembler. The primitive
assembler obtains the triangle connectivity from the VGT (through a
deep FIFO), and “imports” the transformed vertices from the shader export
unit. Note that FIFO buffers are there to avoid stalls during operations
with long latency. The primitive assembler then performs triangle setup,
clipping, and viewport culling. The remaining information is then forwarded to the triangle
traversal unit, which operates on 8 × 8 tiles. The reason to work with
tiles is that texture caching is more efficient, and buffer compression and
various Z-culling techniques can be used. Tiles are sent to the hierarchical
Z/stencil unit, where Z-culling (see Section 18.3.7) and early stencil
rejection are done. This unit stores 11 bits for depth (z_max) and one bit for
stencil per 16 samples in the depth buffer. Each quad is tested against the
z_max for occlusion, and up to 16 quads can be accepted/rejected per cycle.
Surviving 2 × 2 pixel quads are sent back to the triangle traversal unit.
Since the ALU sets can handle a vector of 64 pixels at a time, the triangle
traversal unit packs 16 quads (i.e., 16 × 2 × 2) into a vector. Note that
there can be quads from different primitives in a vector. A similar process
is used in the PLAYSTATION® 3 system, and will be explained further
in that section.
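A conservative occlusion test against a coarse, 11-bit maximum depth can be sketched as follows. The sketch assumes depths in [0, 1], a less-than depth test, and that the stored value is rounded up so that it never underestimates the true maximum; these details are assumptions, not a description of the actual hardware.

```python
import math

Z_BITS = 11
Z_LEVELS = (1 << Z_BITS) - 1     # largest 11-bit value, 2047


def quantize_zmax(z):
    """Round a depth in [0, 1] up to 11 bits (never below the true value)."""
    return min(Z_LEVELS, math.ceil(z * Z_LEVELS))


def block_zmax(sample_depths):
    """Coarse z_max stored for one block of depth buffer samples."""
    return quantize_zmax(max(sample_depths))


def quad_is_occluded(quad_zmin, stored_zmax):
    # The quad is hidden only if its nearest depth lies behind the farthest
    # depth already stored in the block.
    return quad_zmin * Z_LEVELS > stored_zmax
```

Because the stored value is rounded up, the test may accept a quad that is in fact hidden, but it never rejects a visible one.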
A vector of quad fragments with their barycentric coordinates is then
sent to the shader pipe interpolators, which perform all vertex attribute
interpolation, and place these values into the register file of the unified
shader.
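Attribute interpolation from barycentric coordinates amounts to a weighted sum of the three vertex attributes. The following minimal sketch assumes that the barycentric coordinates passed in are already perspective-correct; the function name is hypothetical.

```python
def interpolate_attribute(bary, a0, a1, a2):
    """Interpolate a per-vertex attribute (tuple of floats) at one pixel.

    bary holds the barycentric weights (u, v, w) of the pixel with respect
    to the triangle's three vertices, with u + v + w = 1.
    """
    u, v, w = bary
    return tuple(u * x + v * y + w * z for x, y, z in zip(a0, a1, a2))
```

For example, texture coordinates (0, 0), (4, 0), and (0, 8) at the vertices give (1, 4) at the barycentric position (0.25, 0.25, 0.5).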
The texture cache is a 32 kB four-way set associative cache that stores
uncompressed texture data. The texture pipes retrieve texels from the
texture cache, and can compute 16 bilinear filtered samples per cycle. So,
trilinear mipmapping then runs at half the speed, but this still amounts to
eight trilinear filtered samples per clock. Adaptive anisotropic filtering is
also handled here.
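Trilinear filtering runs at half the bilinear rate because each trilinear sample consists of two bilinear lookups, one in each of two adjacent mipmap levels, followed by a blend. The sketch below uses single-channel textures stored as lists of rows, integer texel positions, and clamped addressing, all of which are simplifying assumptions.

```python
import math

BILINEAR_PER_CYCLE = 16
TRILINEAR_PER_CYCLE = BILINEAR_PER_CYCLE // 2   # two bilinear lookups each


def lerp(a, b, t):
    return a + (b - a) * t


def bilinear(texture, x, y):
    """Bilinearly filtered sample at texel-space position (x, y)."""
    height, width = len(texture), len(texture[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def texel(ix, iy):
        ix = min(max(ix, 0), width - 1)          # clamp to the texture edge
        iy = min(max(iy, 0), height - 1)
        return texture[iy][ix]

    top = lerp(texel(x0, y0), texel(x0 + 1, y0), fx)
    bottom = lerp(texel(x0, y0 + 1), texel(x0 + 1, y0 + 1), fx)
    return lerp(top, bottom, fy)


def trilinear(mip_fine, mip_coarse, x, y, level_fraction):
    """Blend bilinear samples from two adjacent mipmap levels."""
    fine = bilinear(mip_fine, x, y)
    coarse = bilinear(mip_coarse, x / 2, y / 2)  # coarse level is half size
    return lerp(fine, coarse, level_fraction)
```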
When the unified shaders have executed a pixel shader program for a
64-entry vector, they “export” the pixel shader outputs, which typically consist
of color. Two quads (2 × 2 × 2 pixels) can be handled per cycle. Depth can
also be exported (but this is seldom done), which costs an extra cycle. The
backend central groups pixel quads and reorders them to best utilize the
eDRAM bandwidth. Note that at this point, we have “looped through”
the entire architecture. The execution starts when triangles are read from
memory, and the unified shader pipes execute a vertex shader program for
the vertices. Then, the resulting information is rerouted back to the setup
units and then injected into the unified shader pipes, in order to execute a
pixel shader program. Finally, pixel output (color + depth) is ready to be
sent to the frame buffer.
The daughter chip of the Xbox 360 performs merging operations, and
the available bandwidth to it from the GPU chip is 32 GB/s. Eight pixels, each with four samples,
can be sent per clock cycle. A sample stores a representation of a 32-bit
color and uses lossless Z-compression for depth. Alternatively, 16 pixels can
be handled if only depth is used. This can be useful for shadow volumes
and when rendering out depth only, e.g., for shadow mapping or as a pre-
pass for deferred shading. In the daughter chip, there is extra logic (called
“AZ” in Figure 18.15) for handling alpha blending, stencil testing, and
depth testing. When this has been done, there are 256 GB/s available
directly to the eDRAM in the daughter chip.
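The two bandwidth figures can be related to the per-clock pixel rates with some simple arithmetic. Note that the 500 MHz clock rate and the byte counts per pixel and per sample used below are assumptions introduced for this sketch; they are not given in the text.

```python
CLOCK_HZ = 500e6      # ASSUMPTION: GPU clock rate, not stated in the text


def gb_per_second(bytes_per_clock, clock_hz=CLOCK_HZ):
    return bytes_per_clock * clock_hz / 1e9


# ASSUMPTION: each pixel sent to the daughter chip carries 4 bytes of color
# and 4 bytes of (compressed) depth.
LINK_BYTES_PER_CLOCK = 8 * (4 + 4)

# ASSUMPTION: inside the daughter chip, each of the 8 x 4 samples is both
# read and written, with 4 bytes of color and 4 bytes of depth per sample.
EDRAM_BYTES_PER_CLOCK = 8 * 4 * (4 + 4) * 2
```

Under these assumptions, the link carries 64 bytes per clock (32 GB/s) and the eDRAM interface 512 bytes per clock (256 GB/s), which matches the figures quoted above.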
As an example of the great advantage of this design, we consider alpha
blending. Usually, one first needs to read the colors from the color buffer,
then blend the current color (generated by a pixel shader), and then send
it back to the color buffer. With the Xbox 360 architecture, the generated
color is sent to the daughter chip, and the rest of the communication is han-
dled inside that chip. Since depth compression also is used and performed
in the daughter chip, similar advantages accrue.
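The saving for alpha blending can be illustrated by counting the color data that crosses the bus between the GPU and the memory holding the color buffer. The blend equation (the non-premultiplied over operator) and the 4 bytes per color are assumptions chosen for the example.

```python
def blend_over(src_rgb, src_alpha, dst_rgb):
    """Blend a source color over a destination color."""
    return tuple(src_alpha * s + (1.0 - src_alpha) * d
                 for s, d in zip(src_rgb, dst_rgb))


def bus_bytes_per_blended_pixel(blend_in_memory_chip, bytes_per_color=4):
    """Color bytes crossing the GPU-to-memory bus for one blended pixel."""
    if blend_in_memory_chip:
        return bytes_per_color          # only the source color is sent
    return 2 * bytes_per_color          # read destination, write the result
```

When the blend is performed next to the memory, the destination read and the result write never leave the chip, so the bus traffic for color is halved in this model.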
When a frame has been rendered, all the samples of the pixels reside
in the eDRAM in the daughter chip. The back buffer needs to be sent to
main memory, so it can be displayed onscreen. For multi-sampled buffers,
the downsampling is done by the AZ unit in the daughter chip, before it is
sent over the bus to main memory.
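Downsampling a multi-sampled buffer can be sketched as averaging the samples of each pixel. A box filter is assumed here for simplicity; the text does not state which filter the AZ unit uses.

```python
def resolve_pixel(samples):
    """Average the color samples of one pixel (box filter)."""
    count = len(samples)
    return tuple(sum(channel) / count for channel in zip(*samples))


def resolve(buffer):
    """Resolve a 2D buffer whose elements are lists of per-pixel samples."""
    return [[resolve_pixel(pixel) for pixel in row] for row in buffer]
```

Only the resolved pixels need to be sent over the bus, which for four samples per pixel is a quarter of the sample data held in the eDRAM.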
18.4.2 Case Study: The PLAYSTATION® 3 System
The PLAYSTATION 3 system⁸ is a game system built by Sony Computer
Entertainment. The architecture of the PLAYSTATION 3 system can be
seen in Figure 18.17. The Cell Broadband Engine™ was developed by
⁸ “PlayStation,” “PLAYSTATION,” and the “PS” Family logo are registered
trademarks and “Cell Broadband Engine” is a trademark of Sony Computer Entertainment
Inc. The “Blu-ray Disc” name and logo are trademarks.