18. Graphics Hardware (7/10)


18.4. Case Studies 859

zero (facing away from the light). If so, the program sends a KIL command,

which terminates the fragment if the dot product is less than zero. Assume

that the interval result of the dot product is [−1.2, −0.3]. This means that

all possible combinations of the light vectors, l, and the normals, n, for the

fragments in the current tile evaluate to a negative dot product. The tile

is then culled because the diﬀuse shading is guaranteed to be zero.

Assume that the tile size is 8 × 8 pixels, and that the cull shader program uses four times [513] as many instructions as the fragment shader

program. This is so because evaluating the cull shader is done using interval

arithmetic, which is more costly than using plain ﬂoating-point arithmetic.

In this case, it is possible to get a 16× speedup (in theory) for tiles that can

be culled, since the cull shader is only evaluated once for a tile and then

ends the execution. For tiles that cannot be culled, there is actually a slight

slowdown, since you need to execute the PCU’s cull shader program once,

and then the normal 8 × 8 pixel shader programs. Overall the performance

gains are signiﬁcant, with a 1.4–2.1 times speedup. Currently there are no

commercial implementations available, but the PCU solves an important

problem. It allows programmers to use more general shader programs (e.g.,

modiﬁed z-depth) while maintaining performance. A research implemen-

tation shows that the size of a simple fragment shader unit increases by

less than 10% in terms of gates. It should be straightforward to add PCU

hardware to a uniﬁed shader architecture.
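The culling test and the speedup arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the PCU's actual instruction set: the `Interval` class, the function names, and the cost model (one cull shader run costs `cull_cost_factor` times one fragment's shader run) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] bounding a value over all fragments in a tile."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # The product's bounds are the extremes of the four endpoint products.
        products = (self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi)
        return Interval(min(products), max(products))


def interval_dot(a, b):
    """Interval dot product of two vectors whose components are intervals."""
    total = Interval(0.0, 0.0)
    for x, y in zip(a, b):
        total = total + x * y
    return total


def can_cull_tile(light_bounds, normal_bounds):
    """Cull only if the dot product is negative for every possible fragment."""
    return interval_dot(light_bounds, normal_bounds).hi < 0.0


def culled_tile_speedup(tile_pixels=64, cull_cost_factor=4.0):
    """Fragment shader runs saved, divided by the cost of one cull shader run."""
    return tile_pixels / cull_cost_factor


def unculled_tile_slowdown(tile_pixels=64, cull_cost_factor=4.0):
    """Relative cost when the cull shader runs and the tile is shaded anyway."""
    return (tile_pixels + cull_cost_factor) / tile_pixels
```

Under these assumptions an 8 × 8 tile gives 64 / 4 = 16 for the culled case, and (64 + 4) / 64 = 1.0625 for the unculled case, which matches the "slight slowdown" described in the text.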

18.4 Case Studies

In this section, three different graphics hardware architectures will be presented. The Xbox 360 is described first, followed by the PLAYSTATION 3 system architecture. Finally, an architecture that targets mobile devices, called Mali, is presented.

18.4.1 Case Study: Xbox 360

The Xbox 360 is a game console built by Microsoft with a graphics hardware

solution by ATI/AMD [268]. It is built around the memory architecture

illustrated in Figure 18.15. The main GPU chip also acts as a memory

controller (often called the north bridge), so the CPU accesses the system

memory through the GPU chip. In a sense, this is a kind of unified

memory architecture (UMA). However, as can be seen, there is also memory

that only the GPU can access.

Footnote to the “four times” figure in the PCU discussion above: the exact number is somewhere between two and four per instruction. However, the cull shader program is often shorter than the actual fragment shader program.


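Figure 18.15. Xbox 360 memory architecture. (Diagram not reproduced. Its blocks are: CPU with L1 and L2 caches; GPU chip acting as north bridge; daughter chip with eDRAM; system RAM, 512 MB; south bridge for USBs, audio, network, etc. Bandwidth labels in the diagram: 10.8 GB/s, 22.4 GB/s, 32 GB/s, 256 GB/s.)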

In this overview of the Xbox 360 architecture, we will focus on the

GPU and its daughter chip. One design choice in the Xbox 360 was to

use embedded DRAM (eDRAM) for the frame buﬀers, and this memory is

embedded in a separate daughter chip. The main point of this design is

that the daughter chip also has some extra logic, here called AZ, which

takes care of all alpha and depth testing, and related operations. Hence,

eDRAM is sometimes called “intelligent” RAM, because it can actually do

something more than just store content. Instead of going to main memory, frame buffer accesses go through the eDRAM with its AZ logic, which gives high performance at a low cost. The eDRAM has 10 × 1024² bytes, i.e., 10 MB of storage. In addition, the Xbox 360 is

designed around the concept of a uniﬁed shader architecture, which will be

described later.

A block diagram of the Xbox 360 GPU can be found in Figure 18.16.

Rendering commences when the command processor starts reading com-

mands from the main memory. This can be either the start of a render-

ing batch, or state changes. Drawing commands are then forwarded to

the vertex grouper and tessellator (VGT) unit. This unit may receive a

stream of vertex indices, which it groups into primitives (e.g., triangles or

lines). This unit also acts as a vertex cache after transforms, and hence

already-transformed vertices may be reused. In addition, tessellation can

be performed here, as described in Section 13.6.
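The grouping and reuse behavior of the VGT can be illustrated with a small sketch. The cache size of 16 entries and the FIFO replacement policy are assumptions chosen for the example; the text does not state how the Xbox 360 vertex cache is organized.

```python
from collections import OrderedDict


def group_triangles(indices):
    """Group a flat index stream into triangles (three indices each)."""
    usable = len(indices) - len(indices) % 3  # ignore an incomplete tail
    return [tuple(indices[i:i + 3]) for i in range(0, usable, 3)]


def count_transforms(indices, cache_size=16):
    """Count vertex shader runs needed with a FIFO post-transform cache.

    cache_size and the FIFO policy are hypothetical, for illustration only.
    """
    cache = OrderedDict()
    transforms = 0
    for index in indices:
        if index in cache:
            continue                   # already transformed: reuse the result
        transforms += 1                # cache miss: run the vertex shader
        cache[index] = True
        if len(cache) > cache_size:
            cache.popitem(last=False)  # evict the oldest entry
    return transforms
```

For two triangles sharing an edge, such as the index stream `[0, 1, 2, 2, 1, 3]`, only four of the six referenced vertices need to be transformed when the cache is large enough to hold them.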

Figure 18.16. Block diagram of the Xbox 360 graphics processor. (Diagram not reproduced. Its units are: command processor; vertex grouper & tessellator; sequencer; primitive assembler; triangle traversal; hierarchical Z/stencil; shader pipe interpolators; unified shader pipes with ALU sets 0, 1, and 2; shader export; backend central; texture cache; texture pipes; vertex cache; memory hub with two memory controllers; and the connection to and from the daughter chip.)

The instruction set for the Xbox 360 is a superset of SM 3.0 (not a full SM 4.0), and its vertex and pixel shader languages have converged, i.e., they share the same common core. For this reason, a single shader element can be designed and implemented that can execute both vertex and pixel shader programs. This is the main idea of a unified shader architecture. In the Xbox 360, shader execution is always done on vectors of 64 vertices or pixels. Each such vector is executed as a thread, and up to 32 vertex

threads or 64 pixel threads can be active at a time. So while a thread

is waiting for, say, a texture fetch from memory, another thread can be

scheduled for execution on the uniﬁed shader pipes. When the requested

memory content is available, the previous thread can be switched back to

active, and execution can continue. This greatly helps to hide latency, and

it also provides for much better load balancing than previous non-uniﬁed

architectures. As an example of this, consider a typical GPGPU application

that renders a screen-size quad with long pixel shader programs. In this

case, few shader cores will be used for vertex computations in the beginning,

and then all shader cores will be used for per-pixel computations. Note also

that all execution is performed on 32-bit ﬂoating point numbers, and that

all threads share a very large register ﬁle containing registers for the shader

computations. For the Xbox 360, there are 24,576 registers. In general,

the more registers a shader needs, the fewer threads can be in ﬂight at a

time. This concept will be discussed in detail in the next case study, on the PLAYSTATION 3 system.
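The trade-off between register usage and threads in flight can be made concrete with a simplified model. It assumes that a thread reserves its registers for each of its 64 entries and that the register file is divided evenly among threads; the real allocation scheme is not described in the text, so treat the numbers as illustrative.

```python
TOTAL_REGISTERS = 24_576   # register file size stated in the text
VECTOR_ENTRIES = 64        # vertices or pixels per thread, from the text
MAX_PIXEL_THREADS = 64     # upper limit stated in the text
MAX_VERTEX_THREADS = 32    # upper limit stated in the text


def threads_in_flight(registers_per_entry, max_threads):
    """Threads that fit in the register file under the simplified model.

    Assumption: a thread needs registers_per_entry registers for each of its
    64 entries, and nothing else limits occupancy.
    """
    if registers_per_entry <= 0:
        raise ValueError("a shader must use at least one register")
    registers_per_thread = registers_per_entry * VECTOR_ENTRIES
    return min(max_threads, TOTAL_REGISTERS // registers_per_thread)
```

In this model a pixel shader using up to 6 registers per entry still reaches the limit of 64 threads, while one using 24 registers per entry leaves room for only 16, so fewer threads are available to hide memory latency.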

The sequencer (Figure 18.16) receives vertex indices from the VGT. Its

core task is to schedule threads for execution. Depending on what needs

to be processed, this is done by sending vertex and pixel vectors to units

such as the vertex cache, the texture pipes, or the uniﬁed shader pipes.

In the uniﬁed shader pipes unit, there are three sets of ALUs (arithmetic

logic units), each consisting of 16 small shader cores. One such set of cores

can execute one operation on a vertex vector (64 entries) or pixel vector

(64 entries) over four clock cycles. Each shader core can execute one vector

operation and one scalar operation per cycle.
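The four-cycle figure follows directly from the numbers in this paragraph, as the short calculation below shows. The constant and function names are made up for the example, and the peak rate assumes that all three ALU sets are busy at the same time.

```python
import math

ALU_SETS = 3          # from the text
CORES_PER_SET = 16    # from the text
VECTOR_ENTRIES = 64   # vertices or pixels per vector, from the text


def cycles_per_vector_op(entries=VECTOR_ENTRIES, cores=CORES_PER_SET):
    """Clock cycles for one ALU set to apply one operation to a whole vector."""
    return math.ceil(entries / cores)


def peak_entries_per_cycle(sets=ALU_SETS, cores=CORES_PER_SET):
    """Vector entries processed per clock if every ALU set is occupied."""
    return sets * cores
```

With 16 cores working on 64 entries, each core handles four entries in turn, hence four cycles per operation.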

After a vertex or pixel shader program has been executed, the results

are forwarded to the shader export. If a vertex shader has been executed,

all the outputs, such as transformed vertex coordinates and texture coor-

dinates, are “exported” back into the primitive assembler. The primitive

assembler obtains how a triangle is connected from the VGT (through a

deep FIFO), and “imports” the transformed vertices from the shader export

unit. Note that FIFO buﬀers are there to avoid stalls during operations

with long latency. Then it performs triangle setup, clipping, and view-

port culling. The remaining information is then forwarded to the triangle

traversal unit, which operates on 8 × 8 tiles. The reason to work with

tiles is that texture caching is more eﬃcient, and buﬀer compression and

various Z-culling techniques can be used. Tiles are sent to the hierarchical

Z/stencil unit, where Z-culling (see Section 18.3.7) and early stencil rejection are done. This unit stores 11 bits for depth (z_max) and one bit for stencil per 16 samples in the depth buffer. Each quad is tested against the z_max for occlusion, and up to 16 quads can be accepted/rejected per cycle. Surviving 2 × 2 pixel quads are sent back to the triangle traversal unit.

Since the ALU sets can handle a vector of 64 pixels at a time, the triangle

traversal unit packs 16 quads (i.e., 16 × 2 × 2) into a vector. Note that

there can be quads from diﬀerent primitives in a vector. A similar process

is used in the PLAYSTATION 3 system, and will be explained further in that section.
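The z_max test and the packing of quads into vectors can be sketched as follows. The depth convention (smaller z is closer, depth test "less than"), the round-up quantization to 11 bits, and all function names are assumptions made for this illustration; the text only states that 11 bits of z_max are stored and that 16 quads form a vector.

```python
import math

Z_MAX_BITS = 11          # from the text
QUADS_PER_VECTOR = 16    # 16 quads x 2 x 2 pixels = 64 pixels, from the text


def quantize_z_max(z_max, bits=Z_MAX_BITS):
    """Store z_max at reduced precision, rounding up to stay conservative."""
    levels = (1 << bits) - 1
    return math.ceil(z_max * levels) / levels


def quad_is_occluded(quad_z_min, stored_z_max):
    """A quad is hidden if its nearest depth lies behind everything stored."""
    return quad_z_min > stored_z_max


def pack_quads(quads, quads_per_vector=QUADS_PER_VECTOR):
    """Pack surviving quads into vectors; quads may come from different primitives."""
    return [quads[i:i + quads_per_vector]
            for i in range(0, len(quads), quads_per_vector)]
```

Rounding the stored value up means the test can only fail to reject a hidden quad, never reject a visible one, which is what a conservative culling test requires.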

A vector of quad fragments with their barycentric coordinates is then

sent to the shader pipe interpolators, which perform all vertex attribute

interpolation, and place these values into the register ﬁle of the uniﬁed

shader.

The texture cache is a 32 kB four-way set associative cache that stores

uncompressed texture data. The texture pipes retrieve texels from the

texture cache, and can compute 16 bilinear ﬁltered samples per cycle. So,

trilinear mipmapping then runs at half the speed, but this still amounts to

eight trilinear ﬁltered samples per clock. Adaptive anisotropic ﬁltering is

also handled here.
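The halving of the sample rate follows from how trilinear filtering is built: one trilinear sample blends two bilinear samples taken from adjacent mipmap levels. The sketch below uses single-channel textures stored as nested lists and a simplified texel-coordinate convention, both of which are assumptions for the example.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t


def bilinear(texture, u, v):
    """Bilinear sample of a 2D list of floats at texel coordinates (u, v)."""
    height, width = len(texture), len(texture[0])
    x0 = min(max(int(u), 0), width - 1)
    y0 = min(max(int(v), 0), height - 1)
    x1 = min(x0 + 1, width - 1)    # clamp at the texture edge
    y1 = min(y0 + 1, height - 1)
    fx, fy = u - int(u), v - int(v)
    top = lerp(texture[y0][x0], texture[y0][x1], fx)
    bottom = lerp(texture[y1][x0], texture[y1][x1], fx)
    return lerp(top, bottom, fy)


def trilinear(fine, coarse, u, v, level_fraction):
    """Blend bilinear samples from two mip levels: two bilinear lookups."""
    return lerp(bilinear(fine, u, v),
                bilinear(coarse, u / 2, v / 2),
                level_fraction)


def trilinear_samples_per_clock(bilinear_per_clock=16):
    """Each trilinear sample consumes two bilinear samples."""
    return bilinear_per_clock // 2
```

With 16 bilinear samples available per cycle, this gives the eight trilinear samples per clock quoted in the text.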


When the unified shaders have executed a pixel shader program for a 64-entry vector, they “export” the pixel shader outputs, which typically consist of color. Two quads (2 × 2 × 2 pixels) can be handled per cycle. Depth can

also be exported (but this is seldom done), which costs an extra cycle. The

backend central groups pixel quads and reorders them to best utilize the

eDRAM bandwidth. Note that at this point, we have “looped through”

the entire architecture. The execution starts when triangles are read from

memory, and the unified shader pipes execute a vertex shader program for

the vertices. Then, the resulting information is rerouted back to the setup

units and then injected into the uniﬁed shader pipes, in order to execute a

pixel shader program. Finally, pixel output (color + depth) is ready to be

sent to the frame buﬀer.

The daughter chip of the Xbox 360 performs merging operations, and

the available bandwidth is 32 GB/s. Eight pixels, each with four samples,

can be sent per clock cycle. A sample stores a representation of a 32-bit

color and uses lossless Z-compression for depth. Alternatively, 16 pixels can

be handled if only depth is used. This can be useful for shadow volumes

and when rendering out depth only, e.g., for shadow mapping or as a pre-

pass for deferred shading. In the daughter chip, there is extra logic (called

“AZ” in Figure 18.15) for handling alpha blending, stencil testing, and

depth testing. When this has been done, there are 256 GB/s available

directly to the DRAM memory in the daughter chip.

As an example of the great advantage of this design, we consider alpha

blending. Usually, one ﬁrst needs to read the colors from the color buﬀer,

then blend the current color (generated by a pixel shader), and then send

it back to the color buﬀer. With the Xbox 360 architecture, the generated

color is sent to the daughter chip, and the rest of the communication is han-

dled inside that chip. Since depth compression also is used and performed

in the daughter chip, similar advantages accrue.
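A rough sketch of the saving for alpha blending is given below. It counts only color traffic between the GPU and the frame buffer memory, assumes 4 bytes per color value (the 32-bit color mentioned earlier), and uses the common "over" blend equation; these simplifications and the function names are assumptions for the example.

```python
BYTES_PER_COLOR = 4  # 32-bit color, as mentioned in the text


def blend_over(src, dst, alpha):
    """Standard alpha blend: src * alpha + dst * (1 - alpha), per channel."""
    return tuple(s * alpha + d * (1.0 - alpha) for s, d in zip(src, dst))


def bus_bytes_conventional(pixels):
    """Read the old color, then write the blended color, over the same bus."""
    return pixels * 2 * BYTES_PER_COLOR


def bus_bytes_daughter_chip(pixels):
    """Only the generated color crosses the bus; blending happens in the chip."""
    return pixels * BYTES_PER_COLOR
```

In this simplified model the traffic between the GPU and the frame buffer is halved for blended pixels, because the read-modify-write cycle stays inside the daughter chip.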

When a frame has been rendered, all the samples of the pixels reside

in the eDRAM in the daughter chip. The back buﬀer needs to be sent to

main memory, so it can be displayed onscreen. For multi-sampled buﬀers,

the downsampling is done by the AZ unit in the daughter chip, before it is

sent over the bus to main memory.
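The downsampling step can be pictured as averaging the samples of each pixel before the result leaves the daughter chip. The box filter used here is an assumption, since the text does not say which filter the AZ unit applies, and the function names are illustrative.

```python
def resolve_pixel(samples):
    """Average the color samples of one pixel (box filter, assumed)."""
    count = len(samples)
    channels = len(samples[0])
    return tuple(sum(s[c] for s in samples) / count for c in range(channels))


def resolve_buffer(multisampled):
    """Resolve a list of pixels, each given as a list of color samples."""
    return [resolve_pixel(samples) for samples in multisampled]
```

With four samples per pixel, resolving in the daughter chip means one color per pixel is sent to main memory instead of four.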

18.4.2 Case Study: The PLAYSTATION 3 System

The PLAYSTATION 3 system is a game system built by Sony Computer Entertainment. The architecture of the PLAYSTATION 3 system can be seen in Figure 18.17. The Cell Broadband Engine was developed by

“PlayStation,” “PLAYSTATION,” and the “PS” Family logo are registered trade-

marks and “Cell Broadband Engine” is a trademark of Sony Computer Entertainment

Inc. The “Blu-ray Disc” name and logo are trademarks.
