18. Graphics Hardware (10/10)

Figure 18.21. Two images rendered using the Mali 200 architecture for mobile devices, using the OpenGL ES 2.0 API. Rotated-grid supersampling (with four samples per pixel) is used to increase the image quality, and the antialiasing effect can be seen in the zoomed inset in the left image. The planet was rendered using pixel shader programs implementing a variety of techniques, including parallax mapping, environment mapping, and a full lighting equation. (Images courtesy of ARM.)

When all triangles have been rasterized, and per-pixel processing has finished for a tile, the desired content (color and possibly depth) of the on-chip tile buffer is copied out to the frame buffer in external memory. The Mali 200 has two on-chip tile buffers, so once rendering into on-chip tile buffer 1 has finished, processing of the next tile into on-chip tile buffer 2 can start while the content of tile buffer 1 is written to external memory. Low-level power-saving techniques, such as clock gating, are also heavily used in this architecture. This means that unused or inactive parts of the pipeline are shut down to save energy.
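To make the benefit of the second tile buffer concrete, here is a minimal timing sketch in Python. It is not a model of the actual Mali 200 hardware: the function name `frame_time` and the per-tile costs are hypothetical, and the sketch assumes every tile takes the same time to render and the same time to write back.

```python
def frame_time(num_tiles, render_cost, writeback_cost, num_buffers):
    """Time until the last tile has been written to external memory.

    Assumption: one rendering unit and one write-back unit, each handling
    one tile at a time, with uniform costs per tile (arbitrary time units).
    """
    buffer_free_at = [0.0] * num_buffers  # when each on-chip buffer is reusable
    render_done = 0.0                     # when the renderer finishes its tile
    writer_done = 0.0                     # when the write-back unit is idle again
    for tile in range(num_tiles):
        buf = tile % num_buffers          # alternate between the buffers
        # Rendering waits for the renderer and for a free on-chip buffer.
        start = max(render_done, buffer_free_at[buf])
        render_done = start + render_cost
        # Write-back waits for the tile to be complete and the writer to be idle.
        write_start = max(render_done, writer_done)
        writer_done = write_start + writeback_cost
        buffer_free_at[buf] = writer_done
    return writer_done


single = frame_time(num_tiles=4, render_cost=3, writeback_cost=1, num_buffers=1)
double = frame_time(num_tiles=4, render_cost=3, writeback_cost=1, num_buffers=2)
print(single, double)
```

With these made-up costs, a single buffer serializes rendering and write-back (4 x (3 + 1) = 16 units), while two buffers hide all but the final write-back (4 x 3 + 1 = 13 units).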

As we have seen here, the Mali 200 is not designed as a unified shader architecture. Keeping the size of the hardware down is another important design factor for mobile graphics hardware, and the Mali 200 hardware designers found that their vertex shader unit occupies less than 30% as many gates as a fragment shader unit. Hence, if you can keep the vertex shader and fragment shader units reasonably busy, you have a very cost-effective architecture.

An advantage of the tiling architecture, in general, is that it is inherently designed for rendering in parallel. For example, more fragment-processing units could be added, where each unit is responsible for independently rendering to a single tile at a time. A disadvantage of tiling architectures is that the entire scene data needs to be sent to the graphics hardware and stored in local memory. This places an upper limit on how large a scene can be rendered. Note that more complex scenes can be rendered with a multipass approach. Assume that in the first pass, 30,000 triangles are rendered, and the Z-buffer is saved out to the local memory. Now, in the second pass, another 30,000 triangles are rendered. Before the per-pixel processing of a tile starts, the Z-buffer, and possibly the color buffer (and stencil) for that tile, is read into the on-chip tile buffer from external memory.
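The requirement that the whole scene be stored before per-tile rendering begins can be illustrated with a small binning sketch. This is an assumption-laden illustration rather than the Mali 200's actual method: the function `bin_triangles`, the conservative bounding-box test, and the default tile size are all hypothetical choices, and screen coordinates are assumed to be non-negative.

```python
from collections import defaultdict


def bin_triangles(triangles, tile_size=16):
    """Map each tile (tx, ty) to the indices of triangles that may touch it.

    Each triangle is a list of three (x, y) screen-space vertices.
    The test is conservative: a triangle is listed for every tile that
    its bounding box overlaps, even if the triangle itself misses the tile.
    """
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        tx0, tx1 = int(min(xs)) // tile_size, int(max(xs)) // tile_size
        ty0, ty1 = int(min(ys)) // tile_size, int(max(ys)) // tile_size
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins


scene = [
    [(1, 1), (5, 1), (1, 5)],        # fits inside tile (0, 0)
    [(10, 10), (20, 10), (10, 20)],  # bounding box spans four tiles
]
bins = bin_triangles(scene)
print(sorted(bins.items()))
```

Because any triangle in the scene may land in any tile, no tile list is complete until every triangle has been binned, which is why the scene data must be held in memory first. Once the lists exist, each tile can be processed independently, which is what makes adding more fragment-processing units straightforward.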

This multipass method comes at the cost of more bandwidth usage, and so there is an associated performance hit. Performance could also drop in some pathological cases. For example, suppose that many different, long pixel shader programs are to be executed for small triangles inside a tile. Switching long shader programs often comes with a significant cost. Such situations seldom happen with normal content.
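The extra bandwidth of the multipass approach can be estimated with a rough calculation. The sketch below rests on assumptions that are not from the text: it assumes that between consecutive passes every tile's color and depth are written out once and read back once, it ignores stencil, and the function name, resolution, and byte sizes are hypothetical.

```python
def extra_multipass_bytes(width, height, passes, bytes_color=4, bytes_depth=4):
    """Extra external-memory traffic per frame compared to a single pass.

    Assumption: the single-pass baseline writes color once at the end.
    Each additional pass adds one write-out and one read-back of both
    color and depth for every pixel.
    """
    pixels = width * height
    per_pixel_round_trip = 2 * (bytes_color + bytes_depth)  # write + read
    return (passes - 1) * pixels * per_pixel_round_trip


# Hypothetical 640 x 480 frame buffer rendered in two passes.
print(extra_multipass_bytes(640, 480, passes=2))
```

Under these assumptions the cost grows linearly with the number of passes and is independent of how many triangles each pass contains.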

More and more mobile phones are being equipped with special-purpose hardware for three-dimensional graphics. Energy-efficient architectures such as the one described here will continue to be important, since even with a “miracle” battery (one that lasts a long time), the heat has to dissipate through the cover of the phone. Too much heat will be inconvenient for the user. This suggests that there is much research to be done for mobile graphics, and that fundamental changes in the core architecture should be investigated.

18.4.4 Other Architectures

There are many other three-dimensional graphics architectures that have been proposed and built over the years. Two major systems, Pixel-Planes and PixelFlow, have been designed and built by the graphics group at the University of North Carolina, Chapel Hill. Pixel-Planes was built in the late 1980s [368] and was a sort-middle architecture with tile-based rendering. PixelFlow [328, 895, 897] was an example of a sort-last image architecture with deferred shading and programmable shading. Recently, the group at UNC has also developed the WarpEngine [1024], which is a rendering architecture for image-based rendering. The primitives used are images with depth. Owens et al. describe a stream architecture for polygon rendering, where the pipeline work is divided in time, and each stage is run in sequence using a single processor [980]. This architecture can be software programmed at all pipeline stages, giving high flexibility. The SHARP architecture developed at Stanford uses ray tracing as its rendering algorithm [10].

The REYES software rendering architecture [196] used in RenderMan has used stochastic rasterization for a long time now. This technique is more efficient for motion blur and depth-of-field rendering, among other advantages. Stochastic rasterization is not trivial to move over to the GPU. However, some research on a new architecture to perform this algorithm has been done, with some of the problems solved [14]. See Figure 18.22 for an example of motion blur rendering. One of the conclusions was that new hardware modifications are necessary in order to obtain high performance. In addition, since the pixel shader writes out custom depth, Z-culling (Section 18.3.7) must be disabled, unless the PCU (Section 18.3.8) is implemented on top. Algorithms for stochastic lookups in shadow maps and cube maps were also presented. This gives blurred shadows and reflections for moving objects.

Figure 18.22. The top row shows a slowly rotating and translating wheel, while the bottom row shows a faster-moving wheel. The left column was rendered by accumulating four different images. The middle column was rendered with four samples using a new stochastic rasterizer targeting new GPU hardware, and the right column is a reference rendering with 64 jittered samples.
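The core idea of stochastic sampling for motion blur, giving each sample its own random time within the shutter interval, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the rasterizer of [14]: the moving object is a disk in linear motion, and the function name and all parameters are hypothetical.

```python
import random


def motion_blur_coverage(px, py, center_start, center_end, radius,
                         num_samples, rng):
    """Estimate how much of pixel (px, py) a moving disk covers over time.

    Each sample gets a jittered time (one random time per equal slice of
    the shutter interval [0, 1)) and a random position inside the pixel.
    """
    hits = 0
    for i in range(num_samples):
        t = (i + rng.random()) / num_samples  # jittered time in stratum i
        sx = px + rng.random()                # random position in the pixel
        sy = py + rng.random()
        # Disk center at time t, assuming linear motion.
        cx = center_start[0] + t * (center_end[0] - center_start[0])
        cy = center_start[1] + t * (center_end[1] - center_start[1])
        if (sx - cx) ** 2 + (sy - cy) ** 2 <= radius ** 2:
            hits += 1
    return hits / num_samples


rng = random.Random(1)
# A disk of radius 5 sweeping horizontally across pixel (0, 0).
coverage = motion_blur_coverage(0, 0, (-20.0, 0.5), (20.0, 0.5), 5.0, 64, rng)
print(coverage)
```

In this example the disk overlaps the pixel for roughly a quarter of the shutter interval, so the estimate should come out near 0.25. Accumulating a few complete images at fixed times, by contrast, gives every pixel the same small set of time values, which is what produces the distinct ghost images visible in the left column of Figure 18.22.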

In terms of newer GPU architectures, the GeForce 8800 GTX architecture from NVIDIA [948] is the next generation after that used in the PLAYSTATION 3 system. This is a unified shader architecture with 128 stream processors running at 1.35 GHz, so there are similarities with the Xbox 360 GPU. Each stream processor works on scalars, which was a design choice made to further increase efficiency. For example, not all instructions in the code may need all four components (x, y, z, and w). The efficiency can thus be higher when using stream processors based on scalar execution. The GeForce 8800 can issue both a MAD and a MUL instruction each clock cycle. Note also that a consequence of the unified architecture is that the number of pipeline stages is reduced drastically. The GPU supports thousands of execution threads for better load balancing, and is DirectX 10-compatible.
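The figures quoted above imply a peak arithmetic rate, and the argument for scalar execution can be expressed as a utilization ratio. The sketch below is a back-of-the-envelope calculation, not a vendor specification: it assumes the common convention of counting a MAD as two floating-point operations and a MUL as one, and the function names and the instruction mix are hypothetical.

```python
def peak_flops(num_processors, clock_hz, flops_per_clock=3):
    """Peak rate if every processor issues one MAD (2 flops) and one
    MUL (1 flop) on every clock cycle. The flop-counting convention is
    an assumption."""
    return num_processors * clock_hz * flops_per_clock


def vec4_utilization(components_per_instruction):
    """Fraction of lanes doing useful work on a four-wide vector unit,
    assuming one instruction occupies all four lanes for one cycle."""
    used = sum(components_per_instruction)
    return used / (4 * len(components_per_instruction))


# 128 stream processors at 1.35 GHz, as stated in the text.
print(peak_flops(128, 1.35e9) / 1e9, "GFLOPS peak")

# Hypothetical shader: instructions using 4, 3, 1, and 2 components.
print(vec4_utilization([4, 3, 1, 2]))
```

For the hypothetical instruction mix, a four-wide unit would leave more than a third of its lanes idle, whereas scalar processors can in principle keep every unit busy by spreading the ten component operations over ten scalar issue slots.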

There has also been considerable research done on new hardware specifically for ray tracing, where the most recent work is on a ray processing unit [1378], i.e., an RPU. This work includes, among other things, a fully programmable shader unit, and custom traversal units for quickly accessing a scene described by a k-d tree. Similar to the unified shader architecture for GPUs, the RPU also works on executing threads, and may switch between them quickly to hide latency.
