874 18. Graphics Hardware
Figure 18.21. Two images rendered using the Mali 200 architecture for mobile devices,
using the OpenGL ES 2.0 API. Rotated-grid supersampling (with four samples per
pixel) is used to increase the image quality, and the antialiasing effect can be seen in
the zoomed inset in the left image. The planet was rendered using pixel shader pro-
grams implementing a variety of techniques, including parallax mapping, environment
mapping, and a full lighting equation. (Images courtesy of ARM.)
When all triangles have been rasterized, and per-pixel processing has
finished for a tile, the desired content (color and possibly depth) of the
on-chip tile buffer is copied out to the frame buffer in external memory.
The Mali 200 has two on-chip tile buffers, so once rendering has finished
into on-chip tile buffer 1, processing of the next tile into on-chip tile buffer
2 can start, while the content of tile buffer 1 is written to external memory.
Low-level power saving techniques, such as clock gating, are also heavily
used in this architecture. Clock gating means that the clock signal to unused
or inactive parts of the pipeline is switched off in order to save energy.
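The benefit of the two on-chip tile buffers described above can be illustrated with a small timing model. This is only a sketch under stated assumptions: the per-tile render and write-out times are made-up inputs, a single write-out path to external memory is assumed, and the function names are hypothetical rather than part of any Mali interface.

```python
def single_buffer_time(render, write):
    """Total time when each tile is rendered and then written out, one after another."""
    return sum(r + w for r, w in zip(render, write))


def double_buffer_time(render, write):
    """Total time with two tile buffers (assumed model).

    Tile i renders into buffer i % 2. It may start once tile i-1 has finished
    rendering and the write-out of tile i-2 (same buffer) has completed.
    Write-outs share one path to external memory, so they run in order.
    """
    render_done = []
    write_done = []
    for i, (r, w) in enumerate(zip(render, write)):
        prev_render = render_done[i - 1] if i >= 1 else 0
        buffer_free = write_done[i - 2] if i >= 2 else 0
        render_done.append(max(prev_render, buffer_free) + r)
        prev_write = write_done[i - 1] if i >= 1 else 0
        write_done.append(max(render_done[i], prev_write) + w)
    return write_done[-1] if write_done else 0


# Four tiles, each taking 10 time units to render and 4 to write out.
print(single_buffer_time([10] * 4, [4] * 4))
print(double_buffer_time([10] * 4, [4] * 4))
```

In this model, when rendering dominates, almost all of the write-out time is hidden behind rendering of the next tile; when write-out dominates, the renderer stalls waiting for a free buffer.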
As we have seen here, the Mali 200 is not designed as a unified shader
architecture. Keeping the size of the hardware down is another important
design factor for mobile graphics hardware, and the Mali 200 hardware
designers found that their vertex shader unit uses less than 30% as many
gates as a fragment shader unit. Hence, if you can keep the
vertex shader and fragment shader units reasonably busy, you have a very
cost-effective architecture.
An advantage of the tiling architecture, in general, is that it is inherently
designed for rendering in parallel. For example, more fragment-processing
units could be added, where each unit is responsible for independently
rendering to a single tile at a time. A disadvantage of tiling architectures
is that the entire scene data needs to be sent to the graphics hardware and
stored in local memory. This places an upper limit on how large a scene
can be rendered. Note that more complex scenes can be rendered with
a multipass approach. Assume that in the first pass, 30,000 triangles are
18.4. Case Studies 875
rendered, and the Z-buffer is saved out to the local memory. Now, in the
second pass, another 30,000 triangles are rendered. Before the per-pixel
processing of a tile starts, the Z-buffer, and possibly the color buffer (and
stencil) for that tile, is read into the on-chip tile buffer from external memory.
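To make the storage requirement and the per-tile independence concrete, the following sketch sorts triangles into tiles before any per-pixel work starts. It is a simplified illustration, not the Mali 200's actual binning logic: it assumes conservative bounding-box binning, non-negative screen coordinates, and a hypothetical default tile size of 16 pixels; the function name is made up for this example.

```python
from collections import defaultdict


def bin_triangles(triangles, tile_size=16):
    """Map each tile (tx, ty) to the indices of triangles that may touch it.

    Each triangle is three (x, y) screen-space vertices with non-negative
    coordinates. Binning is conservative: a triangle is added to every tile
    overlapped by its bounding box, even if it does not cover that tile.
    """
    bins = defaultdict(list)
    for index, triangle in enumerate(triangles):
        xs = [x for x, _ in triangle]
        ys = [y for _, y in triangle]
        tx_min, tx_max = int(min(xs)) // tile_size, int(max(xs)) // tile_size
        ty_min, ty_max = int(min(ys)) // tile_size, int(max(ys)) // tile_size
        for ty in range(ty_min, ty_max + 1):
            for tx in range(tx_min, tx_max + 1):
                bins[(tx, ty)].append(index)
    return bins


triangles = [
    ((1, 1), (5, 1), (1, 5)),         # fits inside tile (0, 0)
    ((10, 10), (20, 10), (10, 20)),   # bounding box spans four tiles
]
for tile, indices in sorted(bin_triangles(triangles).items()):
    print(tile, indices)
```

Every tile's list must be complete before that tile can be rendered, which is why the whole scene has to be stored first. Once the lists exist, each tile can be handed to a separate fragment-processing unit without any communication between tiles.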
This multipass method comes at a cost of more bandwidth usage, and
so there is an associated performance hit. Performance could also drop
in some pathological cases. For example, say there are many different and
long pixel shader programs for small triangles that are to be executed inside
a tile. Switching long shader programs often comes with a significant cost.
Such situations seldom happen with normal content.
More and more mobile phones are being equipped with special-purpose
hardware for three-dimensional graphics. Energy-efficient architectures
such as the one described here will continue to be important, since even
with a “miracle” battery (one that lasts a long time), the heat has to dissipate through
the cover of the phone. Too much heat is uncomfortable for the user.
This suggests that there is much research to be done for mobile graphics,
and that fundamental changes in the core architecture should be investi-
gated.
18.4.4 Other Architectures
There are many other three-dimensional graphics architectures that have
been proposed and built over the years. Two major systems, Pixel-Planes
and PixelFlow, have been designed and built by the graphics group at
the University of North Carolina, Chapel Hill. Pixel-Planes was built in
the late 1980s [368] and was a sort-middle architecture with tile-based
rendering. PixelFlow [328, 895, 897] was an example of a sort-last image
architecture with deferred shading and programmable shading. Recently,
the group at UNC has also developed the WarpEngine [1024], which is
a rendering architecture for image-based rendering. The primitives used
are images with depth. Owens et al. describe a stream architecture for
polygon rendering, where the pipeline work is divided in time, and each
stage is run in sequence using a single processor [980]. This architecture
can be software programmed at all pipeline stages, giving high flexibility.
The SHARP architecture developed at Stanford uses ray tracing as its
rendering algorithm [10].
The REYES software rendering architecture [196] used in RenderMan
has used stochastic rasterization for a long time now. This technique is
more efficient for motion blur and depth-of-field rendering, among other
advantages. Porting stochastic rasterization to the GPU is not trivial.
However, some research on a new architecture to perform
this algorithm has been done, with some of the problems solved [14]. See
Figure 18.22 for an example of motion blur rendering.
Figure 18.22. The top row shows a slowly rotating and translating wheel, while the
bottom row shows a faster-moving wheel. The left column was rendered by accumulating
four different images. The middle column was rendered with four samples using a new
stochastic rasterizer targeting new GPU hardware, and the right column is a reference
rendering with 64 jittered samples.
One of the conclusions was that new hardware modifications are necessary in order to
obtain high performance. In addition, since the pixel shader writes out
custom depth, Z-culling (Section 18.3.7) must be disabled, unless the PCU
(Section 18.3.8) is implemented on top. Algorithms for stochastic lookups
in shadow maps and cube maps were also presented. This gives blurred
shadows and reflections for moving objects.
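The difference between accumulating a few images and sampling time stochastically can be seen in a toy one-dimensional model. This is a sketch only and not the rasterizer of [14]: a one-pixel-wide bar moves eight pixels during the shutter interval, and all names and parameter values are chosen for illustration. Accumulation uses the same four times for every pixel, while the stochastic variant draws four jittered times per pixel.

```python
import random


def covered(pixel, t, start=0.0, speed=8.0, width=1.0):
    """True if the moving bar covers the center of `pixel` at time t in [0, 1)."""
    left = start + speed * t
    return left <= pixel + 0.5 < left + width


def accumulate(num_pixels, n):
    """Average n images rendered at the same n fixed times for all pixels."""
    times = [(k + 0.5) / n for k in range(n)]
    return [sum(covered(p, t) for t in times) / n for p in range(num_pixels)]


def stochastic(num_pixels, n, rng):
    """Average n time samples per pixel, jittered independently for each pixel."""
    image = []
    for p in range(num_pixels):
        times = [(k + rng.random()) / n for k in range(n)]
        image.append(sum(covered(p, t) for t in times) / n)
    return image


rng = random.Random(1)
print(accumulate(10, 4))        # distinct copies of the bar (strobing)
print(stochastic(10, 4, rng))   # noise instead of strobing
```

With accumulation, every other pixel receives 0.25 and the rest receive nothing, so the bar appears as four separate copies. With per-pixel jittered times, each pixel's value is noisy, but its expected value in the interior of the motion path equals the true coverage of 0.125, so the error shows up as noise rather than as repeated ghost images.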
In terms of newer GPU architectures, the GeForce 8800 GTX archi-
tecture from NVIDIA [948] is the next generation after that used in the
PLAYSTATION® 3 system. This is a unified shader architecture with
128 stream processors running at 1.35 GHz, so there are similarities with
the Xbox 360 GPU. Each stream processor works on scalars, which was
a design choice made to further increase efficiency. For example, not all
instructions in the code may need all four components (x, y, z, and w).
The efficiency can thus be higher when using stream processors based on
scalar execution. The GeForce 8800 can issue both a MAD and a MUL
instruction each clock cycle. Note also that a consequence of the unified ar-
chitecture is that the number of pipeline stages is reduced drastically. The
GPU supports thousands of execution threads for better load balancing,
and is DirectX 10-compatible.
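The figures above allow a quick back-of-the-envelope calculation. The sketch below assumes the common convention that a MAD counts as two floating-point operations and a MUL as one, and it models a four-wide vector unit in a deliberately simple way (one instruction per issue slot, unused lanes idle). The function names and the example instruction mix are hypothetical.

```python
def peak_gflops(processors, clock_ghz, flops_per_clock):
    """Peak arithmetic rate in GFLOPS for identical processors."""
    return processors * clock_ghz * flops_per_clock


def vector_utilization(component_counts, width=4):
    """Fraction of lanes doing useful work on a `width`-wide vector unit.

    `component_counts` lists how many components each instruction needs.
    Each instruction occupies one issue slot regardless of that count.
    """
    return sum(component_counts) / (width * len(component_counts))


# 128 stream processors at 1.35 GHz, one MAD (2 flops) plus one MUL (1 flop) per clock.
print(peak_gflops(128, 1.35, 3))

# Hypothetical shader snippet whose instructions need 4, 3, 1, and 2 components.
print(vector_utilization([4, 3, 1, 2]))
```

In this simple model the example instruction mix keeps only 10 of 16 vector lanes busy, whereas scalar processors would issue the same 10 operations with no idle lanes. Real hardware and compilers complicate the picture, for instance by packing several short operations into one vector instruction.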
There has also been considerable research done on new hardware specif-
ically for ray tracing, where the most recent work is on a ray processing
unit [1378], i.e., an RPU. This work includes, among other things, a fully
programmable shader unit, and custom traversal units for quickly accessing
a scene described by a k-d tree. Similar to the unified shader architecture
for GPUs, the RPU also works on executing threads, and may switch be-
tween these quickly to hide latency, etc.
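How switching between threads hides latency can be sketched with an idealized model. The assumptions are ours, not taken from the RPU work: every thread alternates between a fixed number of compute cycles and a fixed number of stall cycles (for example, waiting for memory), and switching threads costs nothing.

```python
def utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles the execution unit is busy in the idealized model.

    While one thread stalls, the others can run, so the unit stays busy
    as long as enough threads are available to cover the stall time.
    """
    return min(1.0, threads * compute_cycles / (compute_cycles + stall_cycles))


# One thread computing 10 cycles per 90-cycle stall keeps the unit 10% busy;
# ten such threads are enough to hide the stalls completely in this model.
for n in (1, 5, 10, 20):
    print(n, utilization(n, 10, 90))
```

Beyond the point where the stalls are fully covered, adding more threads does not increase throughput in this model.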
Further Reading and Resources
Great resources are the course notes on computer graphics architectures
by Akeley and Hanrahan [10] and by Hwu and Kirk [580]. Owens [981]
discusses trends in hardware properties and how they affect architecture.
The annual SIGGRAPH/Eurographics Workshop on Graphics Hardware
and SIGGRAPH conference proceedings are good sources for more infor-
mation. The book Advanced Graphics Programming Using OpenGL [849]
discusses ways of creating stereo pairs. The book Mobile 3D Graphics
by Pulli et al. [1037] thoroughly covers programming in OpenGL ES and
M3G, a graphics API for Java ME. A survey on GPUs for handhelds [15]
also describes some trends in the mobile field, as well as differences and
similarities between desktop GPUs and mobile GPUs.
Check this book’s website, http://www.realtimerendering.com, for infor-
mation on benchmarking tests and results, as well as other hardware-related
links.