18. Graphics Hardware (8/10)



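[Figure 18.17: block diagram. The Cell Broadband Engine connects to 256 MB of XDR main memory (25.6 GB/s); the RSX connects to 256 MB of GDDR3 video memory (20.8 GB/s) and drives the video and audio outputs. The links between the Cell Broadband Engine and the RSX are labeled 15 GB/s and 20 GB/s. The south bridge link is labeled 2.5 GB/s and serves BD-ROM, HDD, WiFi, GBit Ethernet, USB, and Bluetooth.]

Figure 18.17. PLAYSTATION 3 architecture. (Illustration after Perthuis [1002].)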

Sony Computer Entertainment in partnership with IBM and Toshiba, and is the CPU (central processing unit) of the system. The RSX, developed by NVIDIA, serves as the GPU (graphics processing unit).

As can be seen, the memory of the PLAYSTATION 3 system is split into two separate pools. Main memory consists of 256 MB of Rambus-developed XDR memory, connected to the Cell Broadband Engine. The RSX has its own pool of 256 MB of GDDR3 memory. Although the GDDR3 memory operates at a higher frequency, the XDR memory operates at a higher bandwidth due to its ability to transfer eight bits over each pin in a clock cycle (as opposed to two bits for GDDR3). The Cell Broadband Engine and RSX can access each other’s memory over the FlexIO interface (also developed by Rambus) connecting the two, although the full bandwidth can only be used when the RSX is accessing main memory. The Cell Broadband Engine accesses video memory at lower data rates, especially when reading from video memory. For this reason, PLAYSTATION 3 application developers try to have the Cell Broadband Engine read from video memory as little as possible. The south bridge chip, also connected to the Cell Broadband Engine via FlexIO, is used to access a variety of I/O devices. These include Blu-ray Disc and magnetic drives, wireless game controllers, and network access devices. Video and audio outputs are connected to the RSX.
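The relationship between clock frequency, bits per pin per clock, and peak bandwidth can be sketched with a little arithmetic. The clock rates and bus widths below are assumptions, not values stated in the text; they were chosen because they reproduce the 25.6 GB/s and 20.8 GB/s figures shown in Figure 18.17. The function name is hypothetical.

```python
def peak_bandwidth_gb_per_s(clock_hz, bits_per_pin_per_clock, data_pins):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) of a memory interface."""
    bits_per_second = clock_hz * bits_per_pin_per_clock * data_pins
    return bits_per_second / 8 / 1e9


# Assumed parameters (not given in the text): 400 MHz clock and 64 data pins
# for XDR, 650 MHz clock and 128 data pins for GDDR3.
xdr = peak_bandwidth_gb_per_s(400e6, 8, 64)     # eight bits per pin per clock
gddr3 = peak_bandwidth_gb_per_s(650e6, 2, 128)  # two bits per pin per clock

print(f"XDR:   {xdr:.1f} GB/s")
print(f"GDDR3: {gddr3:.1f} GB/s")
```

Under these assumptions, XDR comes out ahead even with a lower clock and half as many data pins, because it moves four times as many bits per pin in each cycle.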

The PLAYSTATION 3 GPU: The RSX

We will start with a description of the GPU. The RSX is essentially a modified GeForce 7800. The block diagram of the GPU is shown in Figure 18.18.

The geometry stage of the PLAYSTATION 3 system, located on the top half of Figure 18.18, supports programmable vertex shaders. It has eight vertex shader (geometry) units that execute programs in parallel on multiple vertices, enabling a throughput of eight instructions (including compound multiply-add instructions) per clock. These units support vertex shader model 3.0 (see Section 3.3.1), and have access to the same constants and textures.

[Figure 18.18: block diagram. Eight vertex shader units feed the cull, clip & setup block, followed by quad dispatch, six quad pixel shader units, the fragment crossbar, and the merge units. A memory bus connects to video memory (20.8 GB/s) and to the Cell Broadband Engine and main memory (15 GB/s, 20 GB/s). The legend distinguishes the flow of vertices, fragments, pixels, textures, and programs & constants.]

Figure 18.18. RSX architecture.

There are four caches in the geometry stage, one for textures and three for vertices. The vertex shader units share a small (2 kilobytes) L1 texture cache. Any misses go to the L2 texture cache, which is shared between the vertex and pixel shader units. The vertex cache before the vertex shader units is called Pre T&L (transform and lighting), and the one immediately after the vertex shader units is called Post T&L. Pre T&L has 4 kilobytes of storage, and Post T&L has, in practice, storage for roughly 24 vertices. The task of the Pre T&L vertex cache is to avoid redundant memory fetches. When a vertex that is needed for a triangle can be found in the Pre T&L vertex cache, the memory fetch for that vertex data can be avoided. The Post T&L vertex cache, on the other hand, is there to avoid processing the same vertex with the vertex shader more than once. This can happen because a vertex is, on average, shared by six triangles. Both these caches can improve performance tremendously.
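The effect of the Post T&L vertex cache can be illustrated with a small simulation that counts how many times the vertex shader would run for a given index stream. This is a minimal sketch: the FIFO replacement policy is an assumption (the text does not state the policy), and the function name is hypothetical.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size=24):
    """Count vertex shader invocations for an index stream.

    A vertex found in the cache is reused; a miss runs the shader and
    stores the result. FIFO replacement is assumed for illustration.
    """
    cache = deque(maxlen=cache_size)  # oldest entry is evicted automatically
    runs = 0
    for index in indices:
        if index not in cache:
            runs += 1
            cache.append(index)
    return runs


# Four triangles that share vertices (indexed triangle list).
indices = [0, 1, 2,  2, 1, 3,  2, 3, 4,  3, 5, 4]
print(count_vertex_shader_runs(indices))                # with a 24-entry cache
print(count_vertex_shader_runs(indices, cache_size=1))  # nearly no reuse
```

With the 24-entry cache each of the six distinct vertices is shaded once, whereas a one-entry cache forces almost every index to be reshaded.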

Due to the memory requirements of a shaded vertex located in the Post T&L vertex cache, it may take quite some time to fetch it from that cache. Therefore, a Primitive Assembly cache is inserted after the Post T&L vertex cache. This cache can store only four fully shaded vertices, and its task is to avoid fetches from the Post T&L vertex cache. For example, when rendering a triangle strip, two vertices from the previous triangle are used, along with another new vertex, to create a new triangle. If those two vertices are already located in the Primitive Assembly cache, only one vertex is fetched from the Post T&L vertex cache. It should be noted, however, that as with all caches, the hit rate is not perfect, so when the desired data is not in the cache, it needs to be fetched or recomputed.
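The triangle strip example can be sketched as follows. The code splits a strip into triangles and counts how many vertices would have to be fetched from the Post T&L vertex cache when a four-entry Primitive Assembly cache is in front of it. The FIFO eviction and the function name are assumptions made for illustration, and the alternating winding order of strips is ignored.

```python
def strip_fetches(strip, cache_size=4):
    """Return (triangles, fetches) for a triangle strip.

    'fetches' counts vertices not found in the small assembly cache.
    FIFO eviction is assumed; winding order is ignored for simplicity.
    """
    cache = []  # oldest vertex first
    fetches = 0
    triangles = []
    for i in range(len(strip) - 2):
        triangle = tuple(strip[i:i + 3])
        for vertex in triangle:
            if vertex not in cache:
                fetches += 1
                cache.append(vertex)
                if len(cache) > cache_size:
                    cache.pop(0)
        triangles.append(triangle)
    return triangles, fetches


triangles, fetches = strip_fetches([0, 1, 2, 3, 4, 5])
print(len(triangles), "triangles,", fetches, "fetches")
```

In this sketch the first triangle costs three fetches and every later triangle costs one, compared with three fetches per triangle if there were no assembly cache.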

When all the vertices for a triangle have been assembled in the Primitive Assembly cache, they are forwarded to the cull, clip, and setup block. This block implements clipping, triangle setup, and triangle traversal, so it can be seen as straddling the boundary between the geometry and rasterization stages. In addition, the cull, clip, and setup block also implements two types of culling: backface culling (discussed in Section 14.2), and Z-culling (described in Section 18.3.7). Since this block generates all the fragments that are inside a triangle, the PLAYSTATION 3 system can be seen as a sort-last fragment architecture when only the GPU is taken into consideration. As we shall see, usage of the Cell Broadband Engine can introduce another level of sorting, making the PLAYSTATION 3 system a hybrid of sort-first and sort-last fragment.

Fragments are always generated in 2 × 2-pixel units called quads. All the pixels in a quad must belong to the same triangle. Each quad is sent to one of six quad pixel shader (rasterizer) units. Each of these units processes all the pixels in a quad simultaneously, so these units can be thought of as comprising 24 pixel shader units in total. However, in practice, many of the quads will have some invalid pixels (pixels that are outside the triangle), especially in scenes composed of many small triangles. These invalid pixels are still processed by the quad pixel shader units, and their shading results are discarded. In the extreme case where each quad has one valid pixel, this inefficiency can decrease pixel shader throughput by a factor of four. See McCormack et al.’s paper [842] for different rasterization strategies.
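The cost of partially covered quads can be estimated with a simple software rasterizer. The sketch below tests pixel centers against the triangle edges, groups covered pixels into 2 × 2 quads, and compares valid pixels with pixels actually shaded. It is an illustration only: it uses inclusive edge tests rather than a hardware fill rule, and the function names are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def quad_coverage(tri, width, height):
    """Return (valid_pixels, quads_touched) for one triangle.

    Pixels are sampled at their centers. Edges are treated as inclusive,
    which is a simplification of a real fill rule.
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return 0, 0  # degenerate triangle
    covered = set()
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            if area > 0:
                inside = w0 >= 0 and w1 >= 0 and w2 >= 0
            else:
                inside = w0 <= 0 and w1 <= 0 and w2 <= 0
            if inside:
                covered.add((x, y))
    quads = {(x // 2, y // 2) for x, y in covered}
    return len(covered), len(quads)


for tri in [((0, 0), (8, 0), (0, 8)), ((0.2, 0.2), (0.9, 0.3), (0.3, 0.9))]:
    valid, quads = quad_coverage(tri, 8, 8)
    print(f"valid={valid} shaded={4 * quads} efficiency={valid / (4 * quads):.2f}")
```

The large triangle wastes only a few pixels along its diagonal edge, while the sub-pixel triangle occupies a whole quad for a single valid pixel, which is the factor-of-four worst case described above.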


The pixel shader microcode is arranged into passes, each of which contains a set of computations that can be performed by a quad pixel shader unit in one clock cycle. A pixel shader will execute some number of passes, proportional to the shader program length. A quad pixel shader unit contains a texture processor that can do one texture read operation (for the four pixels in a quad), as well as two arithmetic processors, each of which can perform up to two vector operations (totaling a maximum vector width of four) or one scalar operation. Possible vector operations include multiply, add, multiply-add, and dot product. Scalar operations tend to be more complex, like reciprocal square root, exponent, and logarithm. There is also a branch processor that can perform one branch operation. Since there are many scheduling and dependency restrictions, it is not to be expected that each pass will fully utilize all the processors.

The pixel shader is run over a batch of quads. This is handled by the quad dispatch unit, which sends the quads through the quad pixel shader units, processing the first pixel shader pass for all the quads in a batch. Then the second pass is processed for all quads, and so on, until the pixel shader is completed. This arrangement has the dual advantages of allowing very long pixel shaders to be executed, and hiding the very long latencies common to texture read operations. Latency is hidden because the result of a texture read for a given quad will not be required until the next pass, by which time hundreds of quads will have been processed, giving the texture read time to complete.

The quad’s state is held in a buffer. Since the state includes the values of temporary variables, the number of quads in a batch is inversely proportional to the memory required to store the state of the temporary registers. This will depend on the number and precision of the registers, which can be specified as 32-bit floats or 16-bit half floats.
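The inverse relationship between register usage and batch size can be made concrete with a small calculation. The buffer size used here is a made-up value, since the text does not give one, and the assumption that each temporary register is a four-component vector is also ours. The function name is hypothetical.

```python
def quads_per_batch(buffer_bytes, num_registers, bits_per_component):
    """Number of quads whose temporary state fits in the state buffer.

    Assumes four-component vector registers and four pixels per quad.
    """
    bytes_per_register = 4 * bits_per_component // 8
    bytes_per_quad = 4 * num_registers * bytes_per_register
    return buffer_bytes // bytes_per_quad


BUFFER_BYTES = 64 * 1024  # hypothetical buffer size, not from the text

for registers, bits in [(4, 32), (4, 16), (8, 32)]:
    quads = quads_per_batch(BUFFER_BYTES, registers, bits)
    print(f"{registers} registers at {bits} bits: {quads} quads per batch")
```

Halving the precision doubles the batch size, and doubling the register count halves it, which in turn affects how well texture latency can be hidden.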

Batches are typically several hundred pixels in size. This can cause problems with dynamic flow control. If the conditional branches do not go the same way for all the pixels in the batch, performance can degrade significantly. This means that dynamic conditional branching is only efficient if the condition remains constant over large portions of the screen. As discussed in Section 3.6, processing pixels in quads also enables the computation of derivatives, which has many applications.
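Derivatives can be computed within a quad because neighboring pixels are evaluated together, so differences between their values are available. The sketch below takes the four values of some shader quantity across a quad and returns finite differences in x and y. Using the top row and left column for the differences is an assumption; the text does not say how the hardware forms them.

```python
def quad_derivatives(values):
    """Finite-difference derivatives of a quantity across a 2x2 quad.

    'values' is [[top_left, top_right], [bottom_left, bottom_right]].
    Returns (d/dx, d/dy) using the top row and the left column.
    """
    (top_left, top_right), (bottom_left, _bottom_right) = values
    ddx = top_right - top_left
    ddy = bottom_left - top_left
    return ddx, ddy


def u(x, y):
    """Example texture coordinate that varies linearly over the screen."""
    return 0.25 * x + 0.5 * y


quad = [[u(10, 4), u(11, 4)],
        [u(10, 5), u(11, 5)]]
print(quad_derivatives(quad))
```

A typical use of such derivatives is selecting a mipmap level from the rate at which texture coordinates change between adjacent pixels.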

Each quad pixel shader unit has a 4 kilobyte L1 texture cache that is used to provide texel data for texture read operations. Misses go to the L2 texture cache, which is shared between the vertex and pixel shader units. The L2 texture cache is divided into 96 kilobytes for textures in main memory, and 48 kilobytes for textures in video memory. Texture reads from main memory go through the Cell Broadband Engine, resulting in higher latency. This latency is why more L2 cache space is allocated for main memory. The RSX also supports texture swizzling to improve cache coherence, using the pattern shown in Figure 18.12. See Section 18.3.1 for more information on texture caching.
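Figure 18.12 is not reproduced in this excerpt, so the following sketch assumes a Morton (Z-order) layout, which is a common swizzle pattern; the actual RSX pattern may differ. The idea is that the bits of the x and y texel coordinates are interleaved, so texels that are close in 2D are also close in memory. Function names are hypothetical.

```python
def spread_bits(n):
    """Insert a zero bit between each of the low 16 bits of n."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton_index(x, y):
    """Swizzled texel index: x bits in even positions, y bits in odd ones."""
    return spread_bits(x) | (spread_bits(y) << 1)


# Print the swizzled index of each texel in a 4x4 texture.
for y in range(4):
    print([morton_index(x, y) for x in range(4)])
```

In this layout every aligned 2 × 2 block of texels occupies four consecutive indices, so a bilinear lookup tends to touch fewer cache lines than it would in a row-major layout.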

After the pixel shader is run, the final color of each fragment is sent to the fragment crossbar, where the fragments are sorted and distributed to the merge units, whose task it is to merge the fragment color and depth with the pixel values in the color and Z-buffer. Thus it is here that alpha blending, depth testing, stencil testing, alpha testing, and writing to the color buffer and the Z-buffer occur. The merge units in the RSX can be configured to support up to 4× multisampling, in which case the color and Z-buffers will contain multiple values for each pixel (one per sample).
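The work of a merge unit for a single sample can be sketched in a few lines. This is a simplified illustration, not the RSX implementation: it assumes a "less than" depth comparison, depth writes enabled, and standard "over" alpha blending, all of which are configurable state in practice. Stencil and alpha testing are omitted, and the function name is hypothetical.

```python
def merge_fragment(color_buffer, z_buffer, x, y, rgb, alpha, z):
    """Depth-test a fragment and blend it into the buffers.

    Returns True if the fragment passed the depth test and was written.
    """
    if z >= z_buffer[y][x]:
        return False  # fragment is hidden; buffers stay unchanged
    dst = color_buffer[y][x]
    color_buffer[y][x] = tuple(
        alpha * src_c + (1.0 - alpha) * dst_c for src_c, dst_c in zip(rgb, dst)
    )
    z_buffer[y][x] = z
    return True


# A 1x1 render target cleared to black with the far depth value 1.0.
color = [[(0.0, 0.0, 0.0)]]
depth = [[1.0]]

print(merge_fragment(color, depth, 0, 0, (1.0, 0.5, 0.0), 0.5, 0.5), color, depth)
print(merge_fragment(color, depth, 0, 0, (0.0, 1.0, 0.0), 1.0, 0.75), color, depth)
```

The first fragment is closer than the cleared depth and is blended in; the second lies behind it and is rejected.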

The merge units also handle Z-compression and decompression. Both the color buffer and Z-buffer can be configured to have a tiled memory architecture. Either buffer can be in video or main memory, but in practice, most games put them in video memory. Tiling is roughly similar to texture swizzling, in that memory ordering is modified to improve coherence of memory accesses. However, the reordering is done on a coarser scale. The exact size of a tile depends on the buffer format and whether it is in video or main memory, but it is on the order of 32 × 32 pixels.
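A tiled layout can be illustrated with a simple address calculation. The sketch assumes square 32 × 32 tiles stored one after another, with pixels in row-major order inside each tile; the actual RSX tile layout is not specified in the text, and the function name is hypothetical.

```python
def tiled_pixel_index(x, y, width, tile=32):
    """Index of pixel (x, y) in a buffer stored as consecutive square tiles.

    Tiles are ordered row by row, and pixels inside a tile are row-major.
    """
    tiles_per_row = (width + tile - 1) // tile  # round up for partial tiles
    tile_index = (y // tile) * tiles_per_row + (x // tile)
    index_in_tile = (y % tile) * tile + (x % tile)
    return tile_index * tile * tile + index_in_tile


WIDTH = 64  # example buffer width in pixels
for x, y in [(0, 0), (31, 0), (0, 1), (32, 0), (0, 32), (33, 33)]:
    print((x, y), tiled_pixel_index(x, y, WIDTH))
```

Moving down one row inside a tile advances the index by only 32 pixels rather than by the full buffer width, so small screen-space neighborhoods stay close together in memory.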

The PLAYSTATION 3 CPU: The Cell Broadband Engine

While the RSX is fundamentally similar to GPUs found in other systems, the Cell Broadband Engine is more unusual and is perhaps a sign of future developments in rendering systems. The architecture of the Cell Broadband Engine can be seen in Figure 18.19.

Besides a conventional PowerPC processor (the PowerPC Processor Element, or PPE), the Cell Broadband Engine contains eight additional Syn-

[Figure 18.19: block diagram. The SPEs (256 KB each, each with a DMA unit) and the PPE (L1: 32 KB I/D; 512 KB) are connected by the EIB. The MIC (memory interface controller) connects over XIO to main memory at 25.6 GB/s. FlexIO connects to the RSX (15 GB/s and 20 GB/s) and to the south bridge (2.5 GB/s).]

Figure 18.19. Cell Broadband Engine architecture. (Illustration after Perthuis [1002].)
