18.4. Case Studies 869
ergistic Processor Element (SPE) cores. Each SPE contains a memory
controller and Synergistic Processor Unit (SPU). The SPE is commonly
referred to as an SPU, but this acronym properly refers only to the com-
putational core of the SPE. In the Cell Broadband Engine processors used
in the PLAYSTATION 3 system, one of the SPEs has been disabled, so
only seven can be used (see footnote 9). The SPEs are different from
processors that
are designed to run general-purpose code. Each SPE has 256 kilobytes of
local storage that is directly addressed by the processor. Main memory,
video memory, or the local storage of other SPEs cannot be directly ad-
dressed by the SPE. The only way for an SPE to process data stored at
such locations is to transfer it into its local storage via DMA (direct mem-
ory access). Each SPE is a “pure SIMD” processor. All registers are 16
bytes in width; there are no scalar registers or scalar instruction variants.
All memory accesses are 16 bytes wide and must be to addresses that are
aligned to 16 bytes. Smaller or unaligned accesses are emulated via multiple
instructions.
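To make the cost of small or unaligned accesses concrete, here is a minimal sketch of how a 4-byte load might be emulated on top of aligned 16-byte loads. It is a model in Python, not SPU code: the names `load_quadword` and `load_u32` are hypothetical, and the big-endian byte order is a modeling choice for the example.

```python
QUADWORD = 16  # bytes per register and per native memory access

def load_quadword(local_store, addr):
    """Model of the only native load: 16 bytes from a 16-byte-aligned address."""
    if addr % QUADWORD != 0:
        raise ValueError("quadword loads must be 16-byte aligned")
    return local_store[addr:addr + QUADWORD]

def load_u32(local_store, addr):
    """Emulate a 4-byte load at any address using aligned quadword loads."""
    base = addr - (addr % QUADWORD)   # aligned quadword containing the first byte
    offset = addr - base
    data = load_quadword(local_store, base)
    if offset + 4 > QUADWORD:
        # The value straddles two quadwords, so a second aligned load is needed.
        data += load_quadword(local_store, base + QUADWORD)
    # Extract the 4 bytes of interest (byte order is an assumption of this model).
    return int.from_bytes(data[offset:offset + 4], "big")

# Hypothetical 64-byte local store holding the byte values 0..63.
local_store = bytes(range(64))
aligned_value = load_u32(local_store, 4)      # one quadword load plus an extract
straddling_value = load_u32(local_store, 14)  # two quadword loads plus an extract
```

The second call shows why unaligned data is more expensive: it touches two quadwords before the requested bytes can be extracted.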
The SPEs were designed in this way because their role is not to run
general-purpose code—that is the role of the PPE. The SPEs are intended
to be used for computation-intensive tasks while the PPE performs more
general code and orchestrates the activities of the SPEs. An SPE can ac-
cess all of its local storage at full speed without any cache misses (there is
no cache) or other penalties. A new memory access can be initiated at a
throughput of one per clock cycle, with its results available five clock cycles
later. An SPE can also perform multiple DMA transfers concurrently with
processing. This enables an SPE to efficiently perform localized computa-
tions on large data sets by working on subsets of data. While the current
subset is being processed, the next subset is being transferred in, and the
results of processing the previous subset are being transferred out.
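The overlap of transfers and computation described above can be sketched as a simple schedule. This is an illustrative model only: the function name `stream_schedule` and the idea of one uniform "time step" per subset are assumptions made for the example, not part of any SPE API.

```python
def stream_schedule(num_chunks):
    """Return, per time step, which data subset is transferred in, processed,
    and transferred out. None means that stage is idle in that step."""
    steps = []
    for t in range(num_chunks + 2):
        dma_in = t if t < num_chunks else None
        compute = t - 1 if 0 <= t - 1 < num_chunks else None
        dma_out = t - 2 if 0 <= t - 2 < num_chunks else None
        steps.append({"dma_in": dma_in, "compute": compute, "dma_out": dma_out})
    return steps

# With three subsets, the middle step has all three activities overlapping.
schedule = stream_schedule(3)
for step in schedule:
    print(step)
```

In the steady state (the middle step here), subset 2 is arriving while subset 1 is processed and the results for subset 0 are leaving, which is the pattern the text describes.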
The SPE can issue two instructions per clock, one of which is an arith-
metic instruction, while the other is a “support” instruction such as load,
store, branch, data shuffle, etc. The execution of all instructions (except
double-precision floating point operations) is fully pipelined, so a similar
instruction can be issued again on the next clock. To enable operation at
high frequencies, the execution units are highly pipelined, so that many
commonly used instructions have six-clock latencies. However, the large
register file (128 registers) means that techniques such as loop unrolling
and software pipelining can be used extensively to hide these latencies. In
practice, highly efficient operation can be achieved for most
computation-intensive tasks.
Footnote 9: This is a common maneuver to improve yield in high-volume
chips. The advantage is that if any one of the eight SPEs has a defect, the
chip is still viable. Unfortunately, for compatibility reasons, this means
that even a Cell Broadband Engine with no defects must have one of its SPEs
disabled to be used in the PLAYSTATION 3 system. It should be noted that
one additional SPE is reserved for the use of the operating system, leaving
six for application use.
870 18. Graphics Hardware
A common use for the SPEs in PLAYSTATION 3 games is to perform
rendering-related tasks. For example, Sony Computer Entertainment’s
EDGE library performs triangle culling and vertex skinning. Other vertex
operations are often offloaded from the RSX® onto the SPEs. The
SPEs are not restricted by the “one vertex at a time” model of the vertex
shader on the RSX, so they can perform other types of processing. This
can include computations similar to those performed by a GPU geometry
shader, as well as more general processing. For this reason, the SPEs can be
seen as implementing part of the geometry stage. The results of these com-
putations will need to be sorted to keep them in the original order, which
makes the PLAYSTATION 3 system a hybrid sort-first (for the SPEs) /
sort-last fragment (for the RSX) architecture.
The EIB (Element Interconnect Bus) is a ring bus that connects the
PPE, the SPEs, and the memory and I/O controllers. It is an extremely
fast bus, capable of sustaining over 200 gigabytes per second in bandwidth.
The PLAYSTATION 3 system is an interesting case because of the in-
clusion of the Cell Broadband Engine. GPUs have been becoming more
“CPU-like” for several years, as programmable shaders have become more
general. The Cell Broadband Engine is an example of a CPU that is somewhat
“GPU-like” due to its inclusion of heterogeneous processing units
optimized for localized computations.
The Game Developers Conference 2005 presentation by Mallinson and
DeLoura [812] contains additional interesting information about the Cell
Broadband Engine processor. Full Cell Broadband Engine documentation
is available on IBM’s website [165].
18.4.3 Case Study: Mali 200
In this section, the Mali 200 architecture from ARM will be described. This
architecture is different from the Xbox 360 and the PLAYSTATION® 3
system in two major ways. First, the target is neither desktop PCs nor game
consoles, but rather mobile devices, such as mobile phones or portable game
consoles. Since these are powered by batteries, and you want long use time
on the battery, it is important to design an energy-efficient architecture,
rather than just one with high performance. In the mobile context, the
focus is therefore both on energy efficiency and on performance. Second,
this architecture is what we call a tiling architecture, and this has certain
important implications for the mobile context. The mobile phone is now
one of the most widespread devices with rendering capabilities [13], and
the potential impact of graphics on these is therefore huge. The Mali 200
architecture is fully compliant with the OpenGL ES 2.0 API, which has
been designed specifically for handheld devices. This API supports both
programmable vertex and pixel shaders using GLSL (OpenGL Shading
Language).
[Figure 18.20: block diagram. Labels: GPU, Geometry, Tiling, Rasterizer,
Pixel shader, TC, On-chip buffers, Memory, Scene data, Transformed scene
data + render state, Tile lists, Frame buffer, Primitives, Primitives per
tile, Texture read, Write RGBA/Z.]
Figure 18.20. Overview of the Mali 200 tiling architecture, targeted toward mobile
devices. The TC block is the texture cache.
A tile in this case is simply a 16 × 16 pixel region (see footnote 10) of
the frame buffer.
The major difference with this type of architecture is that a per-pixel pro-
cessing unit (including rasterizer, pixel shaders, blending, etc.) works on
only a single tile at a time, and when this tile is finished, it will never
be touched again during the current frame. While this may sound awk-
ward, it actually gives several advantages, as explained later. The first
tiling architecture was Pixel-Planes 5 [368], and that system has some
high-level similarities to the Mali 200. Other tiling architectures include
the PowerVR-based KYRO II and MBX/SGX GPUs from Imagination
Technologies.
The core idea of tiling architectures is to first perform all geometry
processing, so that the screen-space position of each rendering primitive is
found. At the same time, a tile list, containing pointers to all the primitives
overlapping a tile, is built for each tile in the frame buffer. When this
sorting has been done, the set of primitives overlapping a tile is known, and
therefore, one can render all the primitives in a tile and output the result
to an external frame buffer. Then the next tile is rasterized, and so on,
until the entire frame has been rendered. Conceptually, this is how every
tiling architecture works, and hence, these are sort-middle architectures.
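The tile-list construction just described can be sketched as follows. This is a minimal illustration that assumes triangles are already in screen space and bins them by their bounding boxes; the function name `build_tile_lists` and the use of triangle indices in place of pointers are choices made for the example, and the text does not say how the Mali 200 decides overlap.

```python
import math

TILE = 16  # tile size in pixels, as in the text

def build_tile_lists(triangles, width, height):
    """Bin screen-space triangles into per-tile lists.

    triangles: list of triangles, each a list of three (x, y) pixel positions.
    Returns a dict mapping (tile_x, tile_y) to a list of triangle indices.
    """
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    tile_lists = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for index, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Range of tiles touched by the bounding box, clamped to the screen.
        x0 = max(math.floor(min(xs)) // TILE, 0)
        x1 = min(math.floor(max(xs)) // TILE, tiles_x - 1)
        y0 = max(math.floor(min(ys)) // TILE, 0)
        y1 = min(math.floor(max(ys)) // TILE, tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                tile_lists[(tx, ty)].append(index)
    return tile_lists

# Hypothetical 64 x 32 pixel frame buffer (4 x 2 tiles) with two triangles.
triangles = [
    [(2, 2), (10, 2), (2, 10)],
    [(10, 10), (40, 10), (10, 20)],
]
tile_lists = build_tile_lists(triangles, 64, 32)
```

Bounding-box binning is conservative: the second triangle lands in the list for tile (2, 1) even though it does not actually cover any pixel there. That costs some wasted work later but never misses a primitive.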
An outline of the architecture is presented in Figure 18.20. As can be
seen, the rendering primitives are first read from memory, and geometry
processing with programmable vertex shaders commences. In most GPUs,
the output from the vertex shader is forwarded to the rasterizer stage.
With the Mali 200, this output is written back to memory. The vertex
processing unit is heavily pipelined, and this pipeline is single-threaded,
allowing the processing of one vertex to start every cycle.
Footnote 10: Other tile sizes may be used in other tiling architectures.
The Tiling unit (Figure 18.20) performs backface and viewport culling
first, then determines which tiles a primitive overlaps. It stores pointers to
these primitives in each corresponding tile list. Processing one tile at a
time requires that all primitives in the scene be known before rasterization
can begin. That point may be signaled by requesting
that the front and back buffers be swapped. When this occurs for a tiling
architecture, all geometry processing is basically done, and the results are
stored in external memory. The next step is to start per-pixel processing of
the triangles in each tile list, and while this is done, the geometry processing
can commence working on the next frame. This processing model implies
that there is more latency in a tiling architecture.
At this point, fragment processing is performed, and this includes tri-
angle traversal (finding which pixels/samples are inside a triangle), and
pixel shader execution, blending, and other per-pixel operations. The sin-
gle most important feature of a tiling architecture is that the frame buffer
(including color, depth, and stencil, for example) for a single tile can be
stored in very fast on-chip memory, here called the on-chip tile buffer. This
is affordable because the tiles are small (16 × 16 pixels). Bigger tile sizes
make the chip larger, and hence less suitable for mobile phones. When
all rendering to a tile has finished, the desired output (usually color, and
possibly depth) of the tile is copied to an off-chip frame buffer (in external
memory) of the same size as the screen. This means that all accesses to
the frame buffer during per-pixel processing are essentially free. Avoiding
using the external buses is highly desirable, because this use comes with
a high cost in terms of energy [13]. This design also means that buffer
compression techniques, such as the ones described in Section 18.3.6, are
of no relevance here.
To find which pixels or samples are inside a triangle, the Mali 200 em-
ploys a hierarchical testing scheme. Since the triangle is known to overlap
with the 16×16 pixel tile, testing starts against the four 8×8 pixel subtiles.
If the triangle is found not to overlap with a subtile, no further testing, nor
any processing, is done there. Otherwise, the testing continues down until
the size of a subtile is 2 × 2 pixels. At this scale, it is possible to compute
approximations of the derivatives of any variable in the fragment shader.
This is done by simply subtracting in the x- and y-directions. The Mali
200 architecture also performs hierarchical depth testing, similar to the
Z-max culling technique described in Section 18.3.7, during this hierarchi-
cal triangle traversal. At the same time, hierarchical stencil culling and
alpha culling are done. If a tile survives Z-culling, Mali computes
individual fragment z-depths and performs early-Z testing where possible to
avoid unnecessary fragment processing.
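The hierarchical traversal and the 2 × 2 derivative computation can be sketched in a few lines. This is an illustration under stated assumptions, not the Mali 200 implementation: the overlap test (reject a box only when all four corners lie outside one triangle edge) is a common conservative choice picked for the example, the depth, stencil, and alpha culling steps are omitted, and the names `edge`, `box_may_overlap`, `traverse`, and `quad_derivatives` are hypothetical. Vertices are assumed to be ordered so that `edge` is non-negative for interior points.

```python
def edge(a, b, p):
    """Edge function: non-negative when p is on the interior side of a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def box_may_overlap(tri, x, y, size):
    """Conservative test: reject only if all corners are outside one edge."""
    corners = [(x, y), (x + size, y), (x, y + size), (x + size, y + size)]
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        if all(edge(a, b, c) < 0 for c in corners):
            return False
    return True

def traverse(tri, x=0, y=0, size=16, quads=None):
    """Collect origins of the 2 x 2 pixel quads that may overlap the triangle."""
    if quads is None:
        quads = []
    if not box_may_overlap(tri, x, y, size):
        return quads              # no further testing or processing here
    if size == 2:
        quads.append((x, y))      # smallest subtile: hand over to shading
        return quads
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            traverse(tri, x + dx, y + dy, half, quads)
    return quads

def quad_derivatives(v00, v10, v01):
    """Approximate d/dx and d/dy of a shader variable inside one 2 x 2 quad
    by subtracting neighboring values in the x- and y-directions."""
    return (v10 - v00, v01 - v00)

# Hypothetical triangle covering the corner of one 16 x 16 tile.
tri = [(0, 0), (8, 0), (0, 8)]
quads = traverse(tri)
ddx, ddy = quad_derivatives(1.0, 1.5, 3.0)
```

For this triangle only a fraction of the 64 quads in the tile survive, and three of the four 8 × 8 subtiles are rejected or mostly rejected after very few tests, which is the point of testing hierarchically.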
The next step is to start with per-fragment processing, and the Mali
200 can have 128 fragments in flight at the same time. This is a common
technique to hide the latency in the system. For example, when fragment
0 requests a texel, it will take a while before that data is available in the
texture cache, but in the meantime, another 127 fragments can request access
to other texels, as well. When it is time to continue processing fragment 0,
the texel data should be available.
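A back-of-the-envelope model shows why keeping many fragments in flight hides latency. It assumes, purely for illustration, that the fragment processor issues one texture request per cycle in round-robin order and that every request takes the same number of cycles; the latency values below are hypothetical and are not Mali 200 figures.

```python
def utilization(fragments_in_flight, texture_latency):
    """Fraction of cycles spent issuing work rather than waiting.

    One round visits every fragment once (one cycle each). If the texel for
    the first fragment is not back by then, the unit stalls until it is.
    """
    round_length = max(fragments_in_flight, texture_latency)
    return fragments_in_flight / round_length

# Hypothetical latencies, in cycles.
hidden = utilization(128, 100)     # latency shorter than one round of fragments
exposed = utilization(1, 100)      # a single fragment waits for every texel
partial = utilization(128, 256)    # latency longer than one round
```

Under this model, 128 fragments in flight fully hide any texel latency of up to 128 cycles, while a single fragment would leave the unit idle almost all of the time.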
To reduce texture bandwidth, there is a texture cache with hardware
decompression units for ETC [1227] (see Section 6.2.6), which is a texture
compression algorithm. ETC is part of OpenGL ES 2.0. Also, as another
cost-efficient technique, compressed textures are actually stored in com-
pressed form in the cache, as opposed to decompressing them and then
putting the texels in the cache. This means that when a request for a texel
is made, the hardware reads out the block from the cache and then decom-
presses it on the fly. Most other architectures appear to store the texels in
uncompressed form in the cache.
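The benefit of caching compressed blocks can be estimated with simple arithmetic. ETC encodes each 4 × 4 texel block in 64 bits; the cache size and the 4 bytes per uncompressed texel used below are assumptions for the example, not Mali 200 specifications.

```python
ETC_BLOCK_BYTES = 8     # one ETC block: 64 bits
TEXELS_PER_BLOCK = 16   # covering 4 x 4 texels

def texels_in_cache(cache_bytes, compressed, bytes_per_texel=4):
    """Number of texels a cache of the given size can hold."""
    if compressed:
        return (cache_bytes // ETC_BLOCK_BYTES) * TEXELS_PER_BLOCK
    return cache_bytes // bytes_per_texel

# Hypothetical 1 KB texture cache.
compressed_capacity = texels_in_cache(1024, compressed=True)
uncompressed_capacity = texels_in_cache(1024, compressed=False)
```

With these assumptions the same cache covers eight times as many texels when blocks stay compressed, at the price of a decompression step on every texel read.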
The Mali 200 architecture was designed from the ground up with screen-
space antialiasing in mind, and it implements the rotated-grid supersam-
pling (RGSS) scheme described on page 128, using four samples per pixel.
This means that the native mode is 4× antialiasing. Another important
consequence of a tiling architecture is that screen-space antialiasing is more
affordable. This is because filtering is done just before the tile leaves the
GPU and is sent out to the external memory. Hence, the frame buffer in
external memory needs to store only a single color per pixel. A standard
architecture would need a frame buffer to be four times as large (which
gives you less memory for textures, etc.). For a tiling architecture, you
need to increase only the on-chip tile buffer by four times, or effectively use
smaller display tiles (half the width and height of a processing tile).
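The memory argument can be checked with a short calculation. The display resolution and the 4 bytes per color sample are assumptions chosen for the example, and only color storage is counted (depth and stencil are left out).

```python
def framebuffer_bytes(width, height, samples_per_pixel, bytes_per_sample=4):
    """Color storage needed for a full-screen buffer in external memory."""
    return width * height * samples_per_pixel * bytes_per_sample

def tile_buffer_bytes(tile=16, samples_per_pixel=4, bytes_per_sample=4):
    """Color storage needed for one on-chip tile buffer."""
    return tile * tile * samples_per_pixel * bytes_per_sample

# Hypothetical 640 x 480 display with four samples per pixel.
standard_external = framebuffer_bytes(640, 480, samples_per_pixel=4)
tiling_external = framebuffer_bytes(640, 480, samples_per_pixel=1)
tiling_on_chip = tile_buffer_bytes()

print(standard_external, tiling_external, tiling_on_chip)
```

The external buffer of the tiling design stays at one color per pixel; the fourfold sample storage is confined to a few kilobytes on chip.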
The Mali 200 can also selectively choose to use either multisampling
or supersampling on a batch of rendering primitives. This means that
the more expensive supersampling approach, where you execute the pixel
shader for each sample, can be used when it is really needed. An example
would be rendering a textured tree with alpha mapping (see Section 6.6),
where you need high quality sampling to avoid disturbing artifacts. For
these primitives, supersampling could be enabled. When this complex sit-
uation ends and simpler objects are to be rendered, one can switch back
to using the less expensive multisampling approach. In addition, there is a
16× antialiasing mode, where the contents of 2 × 2 pixels, each with
four samples, are filtered down into a single color before they are written to
the external frame buffer. See Figure 18.21 for an example of an antialiased
rendering using the Mali 200 architecture.
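The filtering step that runs before a tile is written out can be sketched as a simple resolve. The text does not specify the filter, so an unweighted average (box filter) is assumed here, and single-channel values are used for brevity; the function name `resolve` and its `group` parameter are hypothetical.

```python
def resolve(samples, group=1):
    """Filter per-sample values down to one value per output pixel.

    samples[y][x] is a list of four sample values for that pixel.
    group=1 models the 4x mode: average the four samples of each pixel.
    group=2 models the 16x mode: average all 16 samples of each 2 x 2 block.
    """
    height = len(samples)
    width = len(samples[0])
    out = []
    for y in range(0, height, group):
        row = []
        for x in range(0, width, group):
            values = [s
                      for dy in range(group)
                      for dx in range(group)
                      for s in samples[y + dy][x + dx]]
            row.append(sum(values) / len(values))
        out.append(row)
    return out

# Hypothetical 2 x 2 pixel region, four samples per pixel.
region = [
    [[0, 0, 4, 4], [8, 8, 8, 8]],
    [[0, 0, 0, 0], [4, 4, 4, 4]],
]
resolved_4x = resolve(region)            # one value per pixel
resolved_16x = resolve(region, group=2)  # one value for the whole 2 x 2 block
```

In both modes only the filtered result leaves the chip, so the external frame buffer never has to hold individual samples.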