864 18. Graphics Hardware
[Figure 18.17 diagram labels: Cell Broadband Engine; RSX; main memory, 256 MB XDR (25.6 GB/s); video memory, 256 MB GDDR3 (20.8 GB/s); links between the Cell Broadband Engine and the RSX (15 GB/s and 20 GB/s); south bridge (2.5 GB/s) with BD-ROM, HDD, WiFi, GBit Ethernet, USB, and Bluetooth; video and audio outputs.]
Figure 18.17. PLAYSTATION 3 architecture. (Illustration after Perthuis [1002].)
Sony Computer Entertainment in partnership with IBM and Toshiba, and
is the CPU (central processing unit) of the system. The RSX®, developed
by NVIDIA, serves as the GPU (graphics processing unit).
As can be seen, the memory of the PLAYSTATION 3 system is split
into two separate pools. Main memory consists of 256 MB of Rambus-developed
XDR memory, connected to the Cell Broadband Engine. The
RSX has its own pool of 256 MB of GDDR3 memory. Although the GDDR3
memory operates at a higher frequency, the XDR memory delivers
higher bandwidth because it can transfer eight bits over each pin in
a clock cycle (as opposed to two bits for GDDR3). The Cell Broadband
Engine and RSX can access each other's memory over the FlexIO™ interface
(also developed by Rambus) connecting the two, although the full
bandwidth can only be used when the RSX is accessing main memory. The
Cell Broadband Engine accesses video memory at lower data rates, espe-
cially when reading from video memory. For this reason, PLAYSTATION 3
application developers try to have the Cell Broadband Engine read from
video memory as little as possible. The south bridge chip, also connected
to the Cell Broadband Engine via FlexIO, is used to access a variety of
I/O devices. These include Blu-ray Disc and magnetic drives, wireless
game controllers, and network access devices. Video and audio outputs are
connected to the RSX.
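The bandwidth figures can be checked with a short calculation. The sketch below assumes that peak bandwidth is the clock rate times the bits transferred per pin per clock times the number of data pins. The clock rates (400 MHz for XDR, 650 MHz for GDDR3) and pin counts (64 and 128) are assumptions that do not appear in the text; they were chosen because they reproduce the 25.6 GB/s and 20.8 GB/s values in Figure 18.17.

```python
def peak_bandwidth_gb_per_s(clock_hz, bits_per_pin_per_clock, data_pins):
    """Peak bandwidth in GB/s, taking 1 GB as 10**9 bytes."""
    bits_per_second = clock_hz * bits_per_pin_per_clock * data_pins
    return bits_per_second / 8 / 1e9

# Assumed values (not from the text): clock rates and data pin counts.
xdr_gb_s = peak_bandwidth_gb_per_s(400e6, 8, 64)     # eight bits per pin per clock
gddr3_gb_s = peak_bandwidth_gb_per_s(650e6, 2, 128)  # two bits per pin per clock

print(f"XDR: {xdr_gb_s:.1f} GB/s, GDDR3: {gddr3_gb_s:.1f} GB/s")
```

Under these assumptions, the lower-clocked XDR memory comes out ahead, because the fourfold difference in bits per pin per clock outweighs its narrower interface and lower frequency.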
The PLAYSTATION®3 GPU: The RSX®
We will start with a description of the GPU. The RSX is essentially a
modified GeForce 7800. The block diagram of the GPU is shown in Fig-
ure 18.18.
The geometry stage of the PLAYSTATION 3 system, located on the
top half of Figure 18.18, supports programmable vertex shaders. It has
18.4. Case Studies 865
[Figure 18.18 diagram labels: Vertex Shader (×8); Cull, Clip & Setup; Quad Dispatch; Quad Pixel Shader (×6); Fragment Crossbar; Merge (×2); memory buses to video memory (20.8 GB/s) and to the Cell Broadband Engine and main memory (15 GB/s and 20 GB/s). Data paths: vertices, fragments, pixels, textures, programs & constants.]
Figure 18.18. RSX architecture.
eight vertex shader (geometry) units that execute programs in parallel
on multiple vertices, enabling a throughput of eight instructions (including
compound multiply-add instructions) per clock. These units support vertex
shader model 3.0 (see Section 3.3.1), and have access to the same constants
and textures.
There are four caches in the geometry stage, one for textures and three
for vertices. The vertex shader units share a small (2 kilobytes) L1 texture
cache. Any misses go to the L2 texture cache, which is shared between the
vertex and pixel shader units. The vertex cache before the vertex shader
units is called Pre T&L (transform and lighting), and the one immediately
after the vertex shader units is called Post T&L. Pre T&L has 4 kilobytes of
storage, and Post T&L has, in practice, storage for roughly 24 vertices. The
task of the Pre T&L vertex cache is to avoid redundant memory fetches.
When a vertex that is needed for a triangle can be found in the Pre T&L
vertex cache, the memory fetch for that vertex data can be avoided. The
Post T&L vertex cache, on the other hand, is there to avoid processing
the same vertex with the vertex shader more than once. This can happen
because a vertex is, on average, shared by six triangles. Both these
caches can improve performance tremendously.
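To illustrate the effect of the Post T&L vertex cache, the following sketch counts how many times the vertex shader must run for an indexed triangle list. It is a toy model, not a description of the RSX: the FIFO replacement policy, the function names, and the grid mesh are assumptions made for illustration. Only the capacity of roughly 24 vertices comes from the text.

```python
from collections import deque

def count_vertex_shader_runs(indices, cache_size=24):
    """Count vertex shader invocations for an index stream with a FIFO cache."""
    cache = deque(maxlen=cache_size)  # the oldest entry is evicted automatically
    runs = 0
    for index in indices:
        if index not in cache:
            runs += 1                 # miss: this vertex has to be shaded
            cache.append(index)
    return runs

def grid_indices(width, height):
    """Indexed triangle list for a grid of width x height cells, row by row."""
    indices = []
    for y in range(height):
        for x in range(width):
            v0 = y * (width + 1) + x  # top-left vertex of the cell
            v1 = v0 + 1               # top-right
            v2 = v0 + width + 1       # bottom-left
            v3 = v2 + 1               # bottom-right
            indices += [v0, v1, v2, v1, v3, v2]
    return indices

indices = grid_indices(8, 8)
shaded = count_vertex_shader_runs(indices)
print(len(indices), "indices,", shaded, "vertex shader runs")
```

Without a cache, every one of the index references would trigger a vertex shader run. How close a real mesh gets to shading each vertex only once depends on the order in which its triangles are submitted.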
Due to the memory requirements of a shaded vertex located in the
Post T&L vertex cache, it may take quite some time to fetch it from
that cache. Therefore, a Primitive Assembly cache is inserted after the
Post T&L vertex cache. This cache can store only four fully shaded ver-
tices, and its task is to avoid fetches from the Post T&L vertex cache.
For example, when rendering a triangle strip, two vertices from the pre-
vious triangle are used, along with another new vertex, to create a new
triangle. If those two vertices already are located in the Primitive As-
sembly cache, only one vertex is fetched from the Post T&L. It should
be noted, however, that as with all caches, the hit rate is not perfect,
so when the desired data is not in the cache, it needs to be fetched or
recomputed.
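The triangle strip case can be sketched as follows. The model simply remembers the vertices of the previous triangle and counts how many vertices of the next triangle are not among them, and so would have to come from the Post T&L vertex cache. The function name and the bookkeeping are hypothetical, and winding order is ignored.

```python
def assemble_strip(strip_indices):
    """Return a list of (triangle, fetches) pairs for a triangle strip.

    `fetches` counts the vertices of each triangle that were not held from
    the previous triangle and must be fetched from the Post T&L cache.
    """
    held = []      # vertices currently held, modeling the small assembly cache
    result = []
    for i in range(2, len(strip_indices)):
        triangle = tuple(strip_indices[i - 2:i + 1])
        fetches = sum(1 for v in triangle if v not in held)
        held = list(triangle)
        result.append((triangle, fetches))
    return result

for triangle, fetches in assemble_strip([0, 1, 2, 3, 4]):
    print(triangle, "fetches:", fetches)
```

Only the first triangle of the strip needs all three vertices fetched; each later triangle needs one.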
When all the vertices for a triangle have been assembled in the Primitive
Assembly cache, they are forwarded to the cull, clip, and setup block. This
block implements clipping, triangle setup, and triangle traversal, so it can
be seen as straddling the boundary between the geometry and rasteriza-
tion stages. In addition, the cull, clip, and setup block also implements two
types of culling: backface culling (discussed in Section 14.2), and Z-culling
(described in Section 18.3.7). Since this block generates all the fragments
that are inside a triangle, the PLAYSTATION 3 system can be seen as a
sort-last fragment architecture when only the GPU is taken into consider-
ation. As we shall see, usage of the Cell Broadband Engine can introduce
another level of sorting, making the PLAYSTATION 3 system a hybrid of
sort-first and sort-last fragment.
Fragments are always generated in 2 × 2-pixel units called quads. All
the pixels in a quad must belong to the same triangle. Each quad is sent to
one of six quad pixel shader (rasterizer) units. Each of these units processes
all the pixels in a quad simultaneously, so these units can be thought of as
comprising 24 pixel shader units in total. However, in practice, many of the
quads will have some invalid pixels (pixels that are outside the triangle),
especially in scenes composed of many small triangles. These invalid pixels
are still processed by the quad pixel shader units, and their shading results
are discarded. In the extreme case where each quad has one valid pixel,
this inefficiency can decrease pixel shader throughput by a factor of four.
See McCormack et al.’s paper [842] for different rasterization strategies.
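The cost of invalid pixels can be estimated with a small software model. The sketch below tests pixel centers against a triangle using edge functions, groups pixels into 2 × 2 quads, and compares the number of valid pixels with the number of pixels shaded. It is a simplified illustration: there is no fill rule for shared edges, and the function names are hypothetical rather than taken from the RSX.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge function: positive if p lies to the left of the edge a -> b."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def covered(tri, px, py):
    """True if point p is inside or on the triangle, for either winding."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    w0 = edge(x1, y1, x2, y2, px, py)
    w1 = edge(x2, y2, x0, y0, px, py)
    w2 = edge(x0, y0, x1, y1, px, py)
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0)

def quad_efficiency(tri, width, height):
    """Return (valid_pixels, shaded_pixels) when shading in 2 x 2 quads."""
    valid = 0
    quads = 0
    for qy in range(0, height, 2):
        for qx in range(0, width, 2):
            n = sum(covered(tri, x + 0.5, y + 0.5)
                    for y in (qy, qy + 1) for x in (qx, qx + 1))
            if n:              # a quad is issued if any of its pixels is valid
                quads += 1
                valid += n
    return valid, quads * 4

small = quad_efficiency([(0.2, 0.2), (1.0, 0.2), (0.2, 1.0)], 8, 8)
large = quad_efficiency([(0, 0), (8, 0), (0, 8)], 8, 8)
print("small triangle:", small, "large triangle:", large)
```

The small triangle covers a single pixel center, yet a full quad of four pixels is shaded for it, which is the factor-of-four loss described above. The large triangle wastes pixels only along its diagonal edge.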
The pixel shader microcode is arranged into passes, each of which con-
tains a set of computations that can be performed by a quad pixel shader
unit in one clock cycle. A pixel shader will execute some number of passes,
proportional to the shader program length. A quad pixel shader unit con-
tains a texture processor that can do one texture read operation (for the
four pixels in a quad), as well as two arithmetic processors, each of which
can perform up to two vector operations (totaling a maximum vector width
of four) or one scalar operation. Possible vector operations include multi-
ply, add, multiply-add, and dot product. Scalar operations tend to be more
complex, like reciprocal square root, exponent, and logarithm. There is also
a branch processor that can perform one branch operation. Since there are
many scheduling and dependency restrictions, it is not to be expected that
each pass will fully utilize all the processors.
The pixel shader is run over a batch of quads. This is handled by the
quad dispatch unit, which sends the quads through the quad pixel shader
units, processing the first pixel shader pass for all the quads in a batch.
Then the second pass is processed for all quads, and so on, until the pixel
shader is completed. This arrangement has the dual advantages of allowing
very long pixel shaders to be executed, and hiding the very long latencies
common to texture read operations. Latency is hidden because the result of
a texture read for a given quad will not be required until the next pass, by
which time hundreds of quads will have been processed, giving the texture
read time to complete.
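The order of execution within a batch can be sketched as two nested loops, with passes on the outside and quads on the inside. The function names are hypothetical, and the model ignores the fact that several quad pixel shader units work in parallel.

```python
def schedule(num_quads, num_passes):
    """Return the (pass, quad) execution order for one batch."""
    return [(p, q) for p in range(num_passes) for q in range(num_quads)]

def work_between_passes(order, quad):
    """Work items issued between consecutive passes of the given quad."""
    positions = [i for i, (_, q) in enumerate(order) if q == quad]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

order = schedule(num_quads=200, num_passes=3)
print(work_between_passes(order, quad=0))
```

In this model, every quad waits for all the other quads in the batch to be processed before its next pass begins, which is the time available for its texture read to complete.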
The quad’s state is held in a buffer. Since the state includes the values
of temporary variables, the number of quads in a batch is inversely propor-
tional to the memory required to store the state of the temporary registers.
This will depend on the number and precision of the registers, which can
be specified as 32-bit floats or 16-bit half floats.
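The inverse relationship can be made concrete with a small calculation. The buffer size of 64 kilobytes used here is purely hypothetical, as is the assumption that each temporary register holds four components; neither value is given in the text.

```python
def quads_per_batch(buffer_bytes, num_registers, bytes_per_component):
    """Quads that fit in the state buffer (assumes 4-component registers)."""
    bytes_per_pixel = num_registers * 4 * bytes_per_component
    bytes_per_quad = 4 * bytes_per_pixel   # a quad is 2 x 2 pixels
    return buffer_bytes // bytes_per_quad

BUFFER_BYTES = 64 * 1024  # hypothetical buffer size, for illustration only

print(quads_per_batch(BUFFER_BYTES, num_registers=4, bytes_per_component=4))
print(quads_per_batch(BUFFER_BYTES, num_registers=4, bytes_per_component=2))
print(quads_per_batch(BUFFER_BYTES, num_registers=8, bytes_per_component=4))
```

Halving the register precision doubles the number of quads in a batch, and doubling the number of registers halves it.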
Batches are typically several hundred pixels in size. This can cause
problems with dynamic flow control. If the conditional branches do not
go the same way for all the pixels in the batch, performance can degrade
significantly. This means that dynamic conditional branching is only effi-
cient if the condition remains constant over large portions of the screen.
As discussed in Section 3.6, processing pixels in quads also enables the
computation of derivatives, which has many applications.
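As a minimal sketch of how a quad makes derivatives available, the differences between neighboring pixels in the quad approximate the rate of change of a value in x and y. The differencing scheme shown (top row for x, left column for y) is an assumption for illustration; the text does not state which differences the hardware uses.

```python
def quad_derivatives(values):
    """Finite-difference derivatives of a value over a 2 x 2 quad.

    `values` is [[top_left, top_right], [bottom_left, bottom_right]].
    """
    (tl, tr), (bl, br) = values
    ddx = tr - tl  # change per one-pixel step in x, taken along the top row
    ddy = bl - tl  # change per one-pixel step in y, taken along the left column
    return ddx, ddy

def u(x, y):
    """Example texture coordinate that varies linearly over the screen."""
    return 0.25 * x + 0.5 * y

quad = [[u(10, 20), u(11, 20)], [u(10, 21), u(11, 21)]]
print(quad_derivatives(quad))
```

For a linearly varying value, the differences recover its slopes exactly. One common use of such derivatives is choosing a mipmap level for a texture read.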
Each quad pixel shader unit has a 4 kilobyte L1 texture cache that
is used to provide texel data for texture read operations. Misses go to
the L2 texture cache, which is shared between the vertex and pixel shader
units. The L2 texture cache is divided into 96 kilobytes for textures in main
memory, and 48 kilobytes for textures in video memory. Texture reads from
main memory go through the Cell Broadband Engine, resulting in higher
latency. This latency is why more L2 cache space is allocated for main
memory. The RSX® also supports texture swizzling to improve cache
coherence, using the pattern shown in Figure 18.12. See Section 18.3.1 for
more information on texture caching.
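Texture swizzling can be sketched as a mapping from texel coordinates to a memory index. The example assumes a Morton (Z-order) pattern, in which the bits of x and y are interleaved. Whether this matches the RSX pattern bit for bit is an assumption, since Figure 18.12 is not reproduced here.

```python
def part1by1(n):
    """Spread the low 16 bits of n so that a zero bit separates each bit."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def swizzle(x, y):
    """Morton index of texel (x, y): x bits in even positions, y bits in odd."""
    return part1by1(x) | (part1by1(y) << 1)

for y in range(4):
    print([swizzle(x, y) for x in range(4)])
```

Texels that are close together in two dimensions tend to receive nearby indices, so a small 2D neighborhood tends to fall within few cache lines.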
After the pixel shader is run, the final color of each fragment is sent to
the fragment crossbar, where the fragments are sorted and distributed to
the merge units, whose task it is to merge the fragment color and depth
with the pixel values in the color and Z-buffer. Thus it is here that alpha
blending, depth testing, stencil testing, alpha testing, and writing to the
color buffer and the Z-buffer occur. The merge units in the RSX can be
configured to support up to 4× multisampling, in which case the color and
Z-buffers will contain multiple values for each pixel (one per sample).
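The work of a merge unit for a single fragment can be sketched as follows. This is a minimal software model under stated assumptions: a "less than" depth test, standard "over" alpha blending, and a depth write for every fragment that passes. Stencil testing, alpha testing, and multisampling are omitted, and the function name is hypothetical.

```python
def merge_fragment(color_buffer, z_buffer, x, y, frag_rgba, frag_z):
    """Depth-test one fragment and alpha-blend it into the buffers.

    Returns True if the fragment passed the depth test and was written.
    """
    if frag_z >= z_buffer[y][x]:   # assumed "less than" depth test
        return False               # fragment is hidden; buffers are unchanged
    r, g, b, a = frag_rgba
    dr, dg, db = color_buffer[y][x]
    color_buffer[y][x] = (a * r + (1 - a) * dr,
                          a * g + (1 - a) * dg,
                          a * b + (1 - a) * db)
    z_buffer[y][x] = frag_z
    return True

color = [[(0.0, 0.0, 0.0)]]  # 1 x 1 color buffer, cleared to black
depth = [[1.0]]              # 1 x 1 Z-buffer, cleared to the far plane
print(merge_fragment(color, depth, 0, 0, (1.0, 0.0, 0.0, 0.5), 0.5), color, depth)
print(merge_fragment(color, depth, 0, 0, (0.0, 1.0, 0.0, 1.0), 0.75), color, depth)
```

The first fragment passes the depth test and is blended with the cleared color. The second lies behind it and is rejected.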
The merge units also handle Z-compression and decompression. Both
the color buffer and Z-buffer can be configured to have a tiled memory
architecture. Either buffer can be in video or main memory, but in practice,
most games put them in video memory. Tiling is roughly similar to texture
swizzling, in that memory ordering is modified to improve coherence of
memory accesses. However, the reordering is done on a coarser scale. The
exact size of a tile depends on the buffer format and whether it is in video
or main memory, but it is on the order of 32 × 32 pixels.
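A tiled layout can be sketched as an address calculation in which all the pixels of one tile are stored contiguously. The sketch assumes square 32 × 32 tiles stored row by row, with pixels in row-major order inside each tile; the actual RSX layout is not specified in the text, and the function name is hypothetical.

```python
def tiled_offset(x, y, width, tile=32):
    """Pixel index of (x, y) in a buffer that is stored tile by tile."""
    tiles_per_row = (width + tile - 1) // tile       # round up to whole tiles
    tile_index = (y // tile) * tiles_per_row + (x // tile)
    within_tile = (y % tile) * tile + (x % tile)     # row-major inside the tile
    return tile_index * tile * tile + within_tile

WIDTH = 1280
print(tiled_offset(0, 1, WIDTH))   # next row, same tile
print(tiled_offset(32, 0, WIDTH))  # first pixel of the second tile
```

In a linear layout, moving down one row in a buffer 1280 pixels wide skips 1280 pixels in memory. In the tiled layout the same step skips only 32, so accesses that are local in screen space stay local in memory.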
The PLAYSTATION®3 CPU: The Cell Broadband Engine™
While the RSX® is fundamentally similar to GPUs found in other systems,
the Cell Broadband Engine is more unusual and is perhaps a sign of future
developments in rendering systems. The architecture of the Cell Broadband
Engine can be seen in Figure 18.19.
Besides a conventional PowerPC processor (the PowerPC Processor Element,
or PPE), the Cell Broadband Engine contains eight additional Syn-
[Figure 18.19 diagram labels: PPE with L1 (32 KB I/D) and L2 (512 KB); SPE 0 through SPE 6, each with an LS (256 KB) and DMA; EIB; MIC (Memory Interface Controller); XIO to main memory (25.6 GB/s); FlexIO 0 and FlexIO 1; I/O; RSX (15 GB/s and 20 GB/s); south bridge (2.5 GB/s).]
Figure 18.19. Cell Broadband Engine architecture. (Illustration after Perthuis [1002].)