18. Graphics Hardware (9/10)


18.4. Case Studies 869

... Synergistic Processor Element (SPE) cores. Each SPE contains a memory controller and Synergistic Processor Unit (SPU). The SPE is commonly referred to as an SPU, but this acronym properly refers only to the computational core of the SPE. In the Cell Broadband Engine processors used in the PLAYSTATION 3 system, one of the SPEs has been disabled, so only seven can be used.

The SPEs are different from processors that are designed to run general-purpose code. Each SPE has 256 kilobytes of local storage that is directly addressed by the processor. Main memory, video memory, or the local storage of other SPEs cannot be directly addressed by the SPE. The only way for an SPE to process data stored at such locations is to transfer it into its local storage via DMA (direct memory access). Each SPE is a “pure SIMD” processor. All registers are 16 bytes in width; there are no scalar registers or scalar instruction variants. All memory accesses are 16 bytes wide and must be to addresses that are aligned to 16 bytes. Smaller or unaligned accesses are emulated via multiple instructions.
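The emulation of smaller or unaligned accesses can be illustrated with a minimal Python sketch. It is a model of the idea only, not SPE code: the function names (`load_aligned`, `load_unaligned`, `load_u32`) are hypothetical, local storage is modeled as a `bytes` object, and the "shuffle" is modeled as a slice over two concatenated quadwords.

```python
QUADWORD = 16  # bytes moved by every native load in this model


def load_aligned(local_store: bytes, addr: int) -> bytes:
    """Model the only native access: 16 bytes from a 16-byte-aligned address."""
    if addr % QUADWORD != 0:
        raise ValueError("native loads must be 16-byte aligned")
    return local_store[addr:addr + QUADWORD]


def load_unaligned(local_store: bytes, addr: int) -> bytes:
    """Emulate an unaligned 16-byte load with two aligned loads and a shuffle."""
    base = addr - (addr % QUADWORD)   # aligned address at or below addr
    offset = addr - base              # byte position inside that quadword
    lo = load_aligned(local_store, base)
    if offset == 0:
        return lo                     # already aligned, one load suffices
    hi = load_aligned(local_store, base + QUADWORD)
    # The slice stands in for the shuffle that merges the two quadwords.
    return (lo + hi)[offset:offset + QUADWORD]


def load_u32(local_store: bytes, addr: int) -> int:
    """Emulate a 4-byte scalar load: fetch a quadword, then extract 4 bytes.

    The Cell Broadband Engine is big-endian, hence byteorder="big".
    """
    return int.from_bytes(load_unaligned(local_store, addr)[:4], "big")


if __name__ == "__main__":
    store = bytes(range(64))
    print(load_unaligned(store, 5).hex())
    print(hex(load_u32(store, 4)))
```

The point of the sketch is the instruction count: an access that a general-purpose CPU performs in one instruction becomes two loads plus a merge here, which is why data layouts for the SPEs are usually arranged around aligned 16-byte quantities.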

The SPEs were designed in this way because their role is not to run general-purpose code; that is the role of the PPE. The SPEs are intended to be used for computation-intensive tasks while the PPE performs more general code and orchestrates the activities of the SPEs. An SPE can access all of its local storage at full speed without any cache misses (there is no cache) or other penalties. A new memory access can be initiated at a throughput of one per clock cycle, with its results available five clock cycles later. An SPE can also perform multiple DMA transfers concurrently with processing. This enables an SPE to efficiently perform localized computations on large data sets by working on subsets of data. While the current subset is being processed, the next subset is being transferred in, and the results of processing the previous subset are being transferred out.
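The overlap of transfer-in, processing, and transfer-out described above can be sketched as a three-stage schedule. The following Python model is an illustration under stated assumptions: `stream_process` and its parameters are hypothetical names, the DMA transfers are modeled as plain list copies, and the three activities within one step are executed sequentially here even though on an SPE they would run concurrently.

```python
def stream_process(data, chunk_size, kernel):
    """Process `data` in chunks using a three-stage software pipeline.

    Returns (output, schedule), where each schedule entry is a tuple
    (chunk being loaded, chunk being computed, chunk being stored) and
    None marks an idle stage.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    n = len(chunks)
    output = []
    schedule = []
    in_buffer = None    # chunk that finished arriving during the previous step
    out_buffer = None   # results computed during the previous step

    # Two extra steps drain the pipeline after the last chunk is loaded.
    for step in range(n + 2):
        loading = step if step < n else None
        computing = step - 1 if 1 <= step <= n else None
        storing = step - 2 if step >= 2 else None
        schedule.append((loading, computing, storing))

        # Order matters in this sequential model: consume each buffer
        # before the stage upstream of it overwrites the contents.
        if storing is not None:
            output.extend(out_buffer)                     # "DMA out"
        if computing is not None:
            out_buffer = [kernel(x) for x in in_buffer]   # compute
        if loading is not None:
            in_buffer = list(chunks[loading])             # "DMA in"
    return output, schedule


if __name__ == "__main__":
    result, steps = stream_process(list(range(10)), 4, lambda x: x * x)
    print(result)
    for entry in steps:
        print(entry)
```

In the steady state every schedule entry has all three stages busy, which is the situation the text describes; only the first two and last two steps leave a stage idle.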

The SPE can issue two instructions per clock, one of which is an arithmetic instruction, while the other is a “support” instruction such as load, store, branch, data shuffle, etc. The execution of all instructions (except double-precision floating point operations) is fully pipelined, so a similar instruction can be issued again on the next clock. To enable operation at high frequencies, the execution units are highly pipelined, so that many commonly used instructions have six-clock latencies. However, the large register file and software pipelining can be used extensively to hide these latencies. In practice, highly efficient operation can be achieved for most computation-intensive tasks.

Footnote (on the disabled SPE): Disabling one SPE is a common maneuver to improve yield in high-volume chips. The advantage is that if any one of the eight SPEs has a defect, the chip is still viable. Unfortunately, for compatibility reasons, this means that even a Cell Broadband Engine with no defects must have one of its SPEs disabled to be used in the PLAYSTATION 3 system. It should be noted that one additional SPE is reserved for the use of the operating system, leaving six for application use.

A common use for the SPEs in PLAYSTATION 3 games is to perform rendering-related tasks. For example, Sony Computer Entertainment’s EDGE library performs triangle culling and vertex skinning. Other vertex operations are often offloaded from the RSX onto the SPEs. The SPEs are not restricted by the “one vertex at a time” model of the vertex shader on the RSX, so they can perform other types of processing. This can include computations similar to those performed by a GPU geometry shader, as well as more general processing. For this reason, the SPEs can be seen as implementing part of the geometry stage. The results of these computations will need to be sorted to keep them in the original order, which makes the PLAYSTATION 3 system a hybrid sort-first (for the SPEs) / sort-last fragment (for the RSX) architecture.

The EIB (Element Interconnect Bus) is a ring bus that connects the PPE, the SPEs, and the memory and I/O controllers. It is an extremely fast bus, capable of sustaining over 200 gigabytes per second in bandwidth.

The PLAYSTATION 3 system is an interesting case because of the inclusion of the Cell Broadband Engine. GPUs have been becoming more “CPU-like” for several years, as programmable shaders have become more general. The Cell Broadband Engine is an example of a CPU that is somewhat “GPU-like” due to its inclusion of heterogeneous processing units optimized for localized computations.

The Game Developers Conference 2005 presentation by Mallinson and DeLoura [812] contains additional interesting information about the Cell Broadband Engine processor. Full Cell Broadband Engine documentation is available on IBM’s website [165].

18.4.3 Case Study: Mali 200

In this section, the Mali 200 architecture from ARM will be described. This architecture is different from the Xbox 360 and the PLAYSTATION 3 system in two major ways. First, the target is neither desktop PCs nor game consoles, but rather mobile devices, such as mobile phones or portable game consoles. Since these are powered by batteries, and long battery life is desirable, it is important to design an energy-efficient architecture, rather than just one with high performance. In the mobile context, the focus is therefore both on energy efficiency and on performance. Second, this architecture is what we call a tiling architecture, and this has certain important implications for the mobile context. The mobile phone is now one of the most widespread devices with rendering capabilities [13], and the potential impact of graphics on these is therefore huge. The Mali 200 architecture is fully compliant with the OpenGL ES 2.0 API, which has been designed specifically for handheld devices. This API supports both programmable vertex and pixel shaders using GLSL (OpenGL Shading Language).

[Figure 18.20: block diagram. Labels: Scene data, Geometry, Primitives, Transformed scene data + render state, Tiling, Tile lists, Primitives per tile, Rasterizer, Pixel shader, On-chip buffers, Texture read, Write RGBA/Z, Memory, Frame buffer.]

Figure 18.20. Overview of the Mali 200 tiling architecture, targeted toward mobile devices. The TC block is the texture cache.

A tile in this case is simply a 16 × 16 pixel region of the frame buffer. The major difference with this type of architecture is that a per-pixel processing unit (including rasterizer, pixel shaders, blending, etc.) works on only a single tile at a time, and when this tile is finished, it will never be touched again during the current frame. While this may sound awkward, it actually gives several advantages, as explained later. The first tiling architecture was Pixel-Planes 5 [368], and that system has some high-level similarities to the Mali 200. Other tiling architectures include the PowerVR-based KYRO II and MBX/SGX GPUs from Imagination Technologies.

The core idea of tiling architectures is to first perform all geometry processing, so that the screen-space position of each rendering primitive is found. At the same time, a tile list, containing pointers to all the primitives overlapping a tile, is built for each tile in the frame buffer. When this sorting has been done, the set of primitives overlapping a tile is known, and therefore, one can render all the primitives in a tile and output the result to an external frame buffer. Then the next tile is rasterized, and so on, until the entire frame has been rendered. Conceptually, this is how every tiling architecture works, and hence, these are sort-middle architectures.
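The construction of tile lists can be sketched in a few lines of Python. This is a minimal illustration, not the Mali 200's actual binning logic: `build_tile_lists` is a hypothetical name, triangles are given as three screen-space `(x, y)` vertices, and overlap is decided conservatively with the triangle's bounding box, which may list a triangle in a tile it does not actually touch.

```python
import math
from collections import defaultdict

TILE = 16  # tile edge in pixels, matching the 16 x 16 tiles in the text


def build_tile_lists(triangles, width, height):
    """Return {(tile_x, tile_y): [triangle indices]} for a width x height screen.

    Indices stand in for the "pointers to primitives" stored in each list.
    Submission order is preserved within every tile list.
    """
    tile_lists = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Bounding box in pixel coordinates, clamped to the screen.
        min_x = max(math.floor(min(xs)), 0)
        max_x = min(math.floor(max(xs)), width - 1)
        min_y = max(math.floor(min(ys)), 0)
        max_y = min(math.floor(max(ys)), height - 1)
        if min_x > max_x or min_y > max_y:
            continue  # entirely off-screen, appears in no tile list
        for ty in range(min_y // TILE, max_y // TILE + 1):
            for tx in range(min_x // TILE, max_x // TILE + 1):
                tile_lists[(tx, ty)].append(index)
    return tile_lists


if __name__ == "__main__":
    scene = [
        [(2, 2), (10, 2), (2, 10)],            # fits inside tile (0, 0)
        [(10, 10), (40, 12), (12, 20)],        # spans several tiles
        [(-50, -50), (-10, -50), (-50, -10)],  # off-screen
    ]
    for tile, prims in sorted(build_tile_lists(scene, 64, 64).items()):
        print(tile, prims)
```

Once the lists exist, rendering reduces to a loop over tiles in which only the primitives in that tile's list are rasterized, which is the sort-middle behavior described above.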

An outline of the architecture is presented in Figure 18.20. As can be seen, the rendering primitives are first read from memory, and geometry processing with programmable vertex shaders commences. In most GPUs, the output from the vertex shader is forwarded to the rasterizer stage. With the Mali 200, this output is written back to memory. The vertex processing unit is heavily pipelined, and this pipeline is single-threaded, allowing processing of one vertex to start every cycle.

Footnote (on the tile size): Other tile sizes may be used in other tiling architectures.

The Tiling unit (Figure 18.20) first performs backface and viewport culling, then determines which tiles a primitive overlaps. It stores pointers to these primitives in each corresponding tile list. Processing one tile at a time is possible only because all primitives in the scene are known before rasterization begins. The end of the scene may be signaled by requesting that the front and back buffers be swapped. When this occurs for a tiling architecture, all geometry processing is basically done, and the results are stored in external memory. The next step is to start per-pixel processing of the triangles in each tile list, and while this is done, the geometry processing can commence working on the next frame. This processing model implies that there is more latency in a tiling architecture.

At this point, fragment processing is performed, and this includes triangle traversal (finding which pixels/samples are inside a triangle), pixel shader execution, blending, and other per-pixel operations. The single most important feature of a tiling architecture is that the frame buffer (including color, depth, and stencil, for example) for a single tile can be stored in very fast on-chip memory, here called the on-chip tile buffer. This is affordable because the tiles are small (16 × 16 pixels). Bigger tile sizes make the chip larger, and hence less suitable for mobile phones. When all rendering to a tile has finished, the desired output (usually color, and possibly depth) of the tile is copied to an off-chip frame buffer (in external memory) of the same size as the screen. This means that all accesses to the frame buffer during per-pixel processing are essentially free. Avoiding using the external buses is highly desirable, because this use comes with a high cost in terms of energy [13]. This design also means that buffer compression techniques, such as the ones described in Section 18.3.6, are of no relevance here.

To find which pixels or samples are inside a triangle, the Mali 200 employs a hierarchical testing scheme. Since the triangle is known to overlap with the 16 × 16 pixel tile, testing starts against the four 8 × 8 pixel subtiles. If the triangle is found not to overlap with a subtile, no further testing, nor any processing, is done there. Otherwise, the testing continues down until the size of a subtile is 2 × 2 pixels. At this scale, it is possible to compute approximations of the derivatives of any variable in the fragment shader. This is done by simply subtracting in the x- and y-directions. The Mali 200 architecture also performs hierarchical depth testing, similar to the Z-max culling technique described in Section 18.3.7, during this hierarchical triangle traversal. At the same time, hierarchical stencil culling and alpha culling are done. If a tile survives Z-culling, Mali computes individual fragment z-depths and performs early-Z testing where possible to avoid unnecessary fragment processing.
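The hierarchical descent from a 16 × 16 tile to 2 × 2 quads can be sketched as a recursive test. The Python below is an illustration under assumptions, not the hardware's method: the overlap test uses edge functions evaluated at block corners, triangle vertices are assumed to be ordered so that the edge functions are non-negative inside the triangle, coverage is sampled at pixel centers, and all function names are hypothetical. The depth, stencil, and alpha culling mentioned above are left out.

```python
def edge(a, b, p):
    """Edge function: non-negative when p lies on the inner side of edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def block_may_overlap(tri, x, y, size):
    """Conservative test: reject only if all corners lie outside one edge."""
    corners = [(x, y), (x + size, y), (x, y + size), (x + size, y + size)]
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        if all(edge(a, b, c) < 0 for c in corners):
            return False
    return True


def traverse(tri, x=0, y=0, size=16, covered=None):
    """Collect covered pixels by descending 16 -> 8 -> 4 -> 2 pixel blocks."""
    if covered is None:
        covered = []
    if not block_may_overlap(tri, x, y, size):
        return covered  # no further testing or processing in this block
    if size == 2:
        # Leaf quad: test each of the four pixel centers exactly.
        for py in (y, y + 1):
            for px in (x, x + 1):
                center = (px + 0.5, py + 0.5)
                if all(edge(tri[i], tri[(i + 1) % 3], center) >= 0 for i in range(3)):
                    covered.append((px, py))
        return covered
    half = size // 2
    for sy in (y, y + half):
        for sx in (x, x + half):
            traverse(tri, sx, sy, half, covered)
    return covered


def quad_derivatives(values):
    """Approximate d/dx and d/dy from a 2 x 2 quad, values[row][column]."""
    ddx = values[0][1] - values[0][0]  # subtract horizontally
    ddy = values[1][0] - values[0][0]  # subtract vertically
    return ddx, ddy


if __name__ == "__main__":
    triangle = [(1, 1), (14, 3), (4, 13)]
    print(len(traverse(triangle)), "pixels covered")
    print(quad_derivatives([[1.0, 1.5], [3.0, 3.5]]))
```

Stopping the recursion at 2 × 2 pixels rather than at single pixels is what makes the derivative approximation possible: every shaded pixel has a horizontal and a vertical neighbor in the same quad to subtract from.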

The next step is to start with per-fragment processing, and the Mali 200 can have 128 fragments in flight at the same time. This is a common technique to hide the latency in the system. For example, when fragment 0 requests a texel, it will take a while before that data is available in the texture cache, but in the meantime, another 127 fragments can request access to other texels, as well. When it is time to continue processing fragment 0, the texel data should be available.
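A back-of-the-envelope model shows why many fragments in flight hide latency. The sketch below rests on assumptions that are not from the text: one fragment issues a texel request per cycle in round-robin order, every request has the same fixed latency, and the latency figures in the example are invented for illustration. The function names are hypothetical.

```python
def stall_cycles(fragments_in_flight, memory_latency):
    """Idle cycles before fragment 0 can resume after one round of requests.

    Fragment 0 issues at cycle 0 and the scheduler returns to it at cycle
    `fragments_in_flight`; its data arrives at cycle `memory_latency`.
    """
    return max(0, memory_latency - fragments_in_flight)


def utilization(fragments_in_flight, memory_latency):
    """Fraction of cycles spent issuing work rather than waiting."""
    busy = fragments_in_flight
    return busy / (busy + stall_cycles(fragments_in_flight, memory_latency))


if __name__ == "__main__":
    for latency in (50, 100, 200):  # hypothetical latencies in cycles
        print(latency, stall_cycles(128, latency), round(utilization(128, latency), 2))
```

Under this model, 128 fragments in flight fully cover any latency of up to 128 cycles, whereas a single fragment in flight would stall for almost the entire latency on every texel request.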

To reduce texture bandwidth, there is a texture cache with hardware decompression units for ETC [1227] (see Section 6.2.6), which is a texture compression algorithm. ETC is part of OpenGL ES 2.0. Also, as another cost-efficient technique, compressed textures are actually stored in compressed form in the cache, as opposed to decompressing them and then putting the texels in the cache. This means that when a request for a texel is made, the hardware reads out the block from the cache and then decompresses it on the fly. Most other architectures appear to store the texels in uncompressed form in the cache.

The Mali 200 architecture was designed from the ground up with screen-space antialiasing in mind, and it implements the rotated-grid supersampling (RGSS) scheme described on page 128, using four samples per pixel. This means that the native mode is 4× antialiasing. Another important consequence of a tiling architecture is that screen-space antialiasing is more affordable. This is because filtering is done just before the tile leaves the GPU and is sent out to the external memory. Hence, the frame buffer in external memory needs to store only a single color per pixel. A standard architecture would need a frame buffer to be four times as large (which gives you less memory for textures, etc.). For a tiling architecture, you need to increase only the on-chip tile buffer by four times, or effectively use smaller display tiles (half the width and height of a processing tile).
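The memory argument above can be checked with simple arithmetic, and the filtering step can be sketched alongside it. In the Python below, the 640 × 480 resolution, the 4 bytes per color sample, the equal-weight (box) filter, and the function names are all assumptions made for illustration; the text does not specify the filter or a display size.

```python
SAMPLES_PER_PIXEL = 4  # four samples per pixel, as in the text
TILE = 16              # 16 x 16 pixel tile


def color_buffer_bytes(width, height, bytes_per_sample=4, samples=1):
    """Size of a color buffer holding `samples` samples per pixel."""
    return width * height * bytes_per_sample * samples


def resolve_tile(tile_samples):
    """Filter each pixel's samples down to one color before the tile is written out.

    tile_samples[y][x] is a list of (r, g, b) tuples, one per sample.
    An equal-weight average is assumed here.
    """
    return [
        [tuple(sum(s[c] for s in pixel) / len(pixel) for c in range(3)) for pixel in row]
        for row in tile_samples
    ]


if __name__ == "__main__":
    w, h = 640, 480  # hypothetical display size
    print("external buffer, tiling:  ", color_buffer_bytes(w, h))
    print("external buffer, standard:", color_buffer_bytes(w, h, samples=SAMPLES_PER_PIXEL))
    print("on-chip tile color buffer:", color_buffer_bytes(TILE, TILE, samples=SAMPLES_PER_PIXEL))
```

With these assumed numbers, the standard architecture pays for the extra samples in external memory (about 4.9 MB instead of about 1.2 MB), while the tiling architecture pays only in the on-chip tile buffer, whose color portion is 4 KB.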

The Mali 200 can also selectively choose to use either multisampling or supersampling on a batch of rendering primitives. This means that the more expensive supersampling approach, where you execute the pixel shader for each sample, can be used when it is really needed. An example would be rendering a textured tree with alpha mapping (see Section 6.6), where you need high quality sampling to avoid disturbing artifacts. For these primitives, supersampling could be enabled. When this complex situation ends and simpler objects are to be rendered, one can switch back to using the less expensive multisampling approach. In addition, there is a 16× antialiasing mode as well, where the contents of 2 × 2 pixels, each with four samples, are filtered down into a single color before they are written to the external frame buffer. See Figure 18.21 for an example of an antialiased rendering using the Mali 200 architecture.
