Figure 18.12. As can be seen, this is a space-filling curve, called a Morton sequence [906], and such curves are known to improve coherency. In this case, the curve is two-dimensional, since textures normally are, too. See
Section 12.4.4 for more on space-filling curves.
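To make the idea concrete, the following sketch (our own illustration; the function names are not from the text) computes a two-dimensional Morton index by interleaving the bits of a texel's (x, y) coordinates, so that texels that are close in 2D tend to end up close together in the one-dimensional memory layout.

#include <cstdint>

// Spread the lower 16 bits of v so that bit i ends up at bit 2i.
static uint32_t PartBits1By1(uint32_t v)
{
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Two-dimensional Morton (Z-order) index for a texel at (x, y):
// the bits of x and y are interleaved, giving the space-filling order.
uint32_t MortonIndex2D(uint32_t x, uint32_t y)
{
    return PartBits1By1(x) | (PartBits1By1(y) << 1);
}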
18.3.2 Memory Architecture
Here, we will mention a few different architectures that have been used
for memory layout. The Xbox, i.e., the first game console from Microsoft,
uses a unified memory architecture (UMA), which means that the graphics
accelerator can use any part of the host memory for textures and different
kinds of buffers [650]. An example of UMA is shown in Figure 18.13. As can
be seen, both the CPU and the graphics accelerator use the same memory,
and thus also the same bus.
The Xbox 360 (Section 18.4.1) has a different memory architecture, as
shown in Figure 18.15 on page 860. The CPU and the GPU share the
same bus and interface to the system memory, which the GPU uses mainly
for textures. However, for the different kinds of buffers (color, stencil, Z,
etc.), the Xbox 360 has separate memory (eDRAM) that only the GPU can
access. This GPU-exclusive memory is often called video memory or local
memory. Access to this memory is usually much faster than letting the
GPU access system memory over a bus. As an example, on a typical PC in
2006 local memory bandwidth was greater than 50 GB/sec, compared to
at most 4 GB/sec between CPU and GPU using PCI Express [123]. Using
a small video memory (e.g., 32 MB) and then accessing system memory
using the GPU is called TurboCache by NVIDIA and HyperMemory by ATI/AMD.

Figure 18.13. The memory architecture of the first Xbox, which is an example of a unified memory architecture (UMA).

Another somewhat less unified layout is to have dedicated memory for
the GPU, which can then be used for textures and buffers in any way
desired, but cannot be used directly by the CPU. This is the approach taken
by the architecture of the PLAYSTATION®3 system (Section 18.4.2),
which uses a local memory for scene data and for textures.
18.3.3 Ports and Buses
A port is a channel for sending data between two devices, and a bus is a
shared channel for sending data among more than two devices. Bandwidth is the term used to describe the throughput of data over a port or bus, and is measured in bytes per second (B/s). Ports and buses are important in
computer graphics architecture because, simply put, they glue together
different building blocks. Also important is that bandwidth is a scarce
resource, and so a careful design and analysis must be done before building
a graphics system. An example of a port is one that connects the CPU
with the graphics accelerator, such as PCI Express (PCIe) used in PCs.
Since ports and buses both provide data transfer capabilities, ports are
often referred to as buses, a convention we will follow here.
Many objects in a scene do not appreciably change shape from frame
to frame. Even a human character is typically rendered with a set of
unchanging meshes that use GPU-side vertex blending at the joints. For
this type of data, animated purely by modeling matrices and vertex shader
programs, it is common to use static vertex buffers, which are placed in
video memory (sometimes called local memory), i.e., dedicated graphics
memory. Doing so makes for fast access by the GPU. For vertices that
are updated by the CPU each frame, dynamic vertex buffers are used, and
these are placed in system memory that can be accessed over a bus, such
as PCI Express. The high bandwidth of this bus is the main reason why such dynamic buffers can still be accessed reasonably quickly.
Another nice property of PCI Express is that queries can be pipelined, so
that several queries can be requested before results return.
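As a concrete illustration (our own sketch, not an API prescribed by the text), the OpenGL usage hints below express exactly this distinction between static and dynamic vertex buffers; the GLEW loader is assumed only to provide the buffer-object entry points.

#include <GL/glew.h>  // assumed loader; any OpenGL 1.5+ header/loader works

// Static buffer: written once, read many times by the GPU, so the driver
// will typically place it in video (local) memory for fast access.
GLuint CreateStaticVertexBuffer(const void* data, GLsizeiptr bytes)
{
    GLuint buffer;
    glGenBuffers(1, &buffer);
    glBindBuffer(GL_ARRAY_BUFFER, buffer);
    glBufferData(GL_ARRAY_BUFFER, bytes, data, GL_STATIC_DRAW);
    return buffer;
}

// Dynamic buffer: updated by the CPU every frame, so it is commonly kept
// in memory reachable over the bus (e.g., PCI Express) and refilled often.
GLuint CreateDynamicVertexBuffer(GLsizeiptr bytes)
{
    GLuint buffer;
    glGenBuffers(1, &buffer);
    glBindBuffer(GL_ARRAY_BUFFER, buffer);
    glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_DYNAMIC_DRAW);
    return buffer;
}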
18.3.4 Memory Bandwidth
Next, we will present a simplified view of memory bandwidth usage between
the CPU and the GPU, and the bandwidth usage for a single fragment. We
say simplified, because there are many factors, such as caching and DMA,
that are hard to take into account.
To begin, let us discuss a theoretical model for the bandwidth used by
a single fragment on its way through the pixel pipeline. Let us assume
that the bandwidth usage consists of three terms, namely, B_c, which is the bandwidth usage to the color buffer, B_z, which is the bandwidth usage to the depth buffer (Z-buffer), and B_t, which is the bandwidth usage to textures. The total bandwidth usage, B, is then

B = B_c + B_z + B_t.    (18.3)
Recall that the average depth complexity, here denoted d, is the number of times a pixel is covered by rendered primitives. The average overdraw, here denoted o(d), is the number of times a pixel has its contents written. Both depth complexity and overdraw are covered in Section 15.4.5. The average overdraw can be modeled as a harmonic series, as shown in Equation 15.2, and this equation is repeated here for clarity:

o(d) = 1 + 1/2 + 1/3 + ··· + 1/d.    (18.4)
The interesting thing is that if there are d triangles covering a pixel, then only o(d) of them actually write to the pixel. In terms of depth buffer bandwidth, B_z, this means that there will be d depth buffer reads (costing Z_r bytes each), but only o(d) depth buffer writes (costing Z_w bytes each). This can be summarized as

B_z = d × Z_r + o(d) × Z_w.    (18.5)
For blending operations, one may also need to read the color buffer (C_r), but we assume this is not part of the common case. However, there will be as many color buffer writes as depth buffer writes, so B_c = o(d) × C_w, where C_w is the cost (in bytes) of writing the color of a single pixel. Since most scenes are textured, one or more texture reads (T_r) may also occur. Some architectures may perform texturing before the depth test, which means B_t = d × T_r. However, in the following, we assume that texture reads are done after the depth test, which means that B_t = o(d) × T_r. Together, the total bandwidth cost is then [13]

B = d × Z_r + o(d) × (Z_w + C_w + T_r).    (18.6)
With trilinear mipmapping, each T_r may cost 8 × 4 = 32 bytes, i.e., eight texel accesses costing four bytes each. If we assume that four textures are accessed per pixel, then T_r = 128 bytes. We also assume that C_w, Z_r, and Z_w each cost four bytes. Assuming a target depth complexity of d = 6 gives o(d) ≈ 2.45, which means a pixel costs

b = 6 × 4 + 2.45 × (4 + 4 + 128) ≈ 357 bytes per pixel.    (18.7)
However, we can also take a texture cache (Section 18.3.1) into account,
and this reduces the B_t cost quite a bit. With a texture cache miss rate m, Equation 18.6 is refined to

b = d × Z_r + o(d) × (Z_w + C_w + m × T_r)
  = d × Z_r + o(d) × Z_w   +   o(d) × C_w   +   o(d) × m × T_r
    (depth buffer, B_z)        (color buffer, B_c)   (texture read, B_t)
  = B_z + B_c + B_t.    (18.8)
This equation is what we call the rasterization equation. Hakura and Gupta use m = 0.25, which means that on every fourth texel access, we get a miss in the cache. Now, let us use this formula for an example where, again, d = 6 and o(d) ≈ 2.45. This gives B_c = 2.45 × 4 = 9.8 bytes, and B_z = 6 × 4 + 2.45 × 4 = 33.8 bytes. The texture bandwidth usage becomes B_t = 2.45 × 4 × 8 × 4 × 0.25 = 78.4 bytes. This sums to b = 33.8 + 9.8 + 78.4 = 122 bytes, which is a drastic reduction compared to the previous example (357 bytes).
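The arithmetic above is easy to mis-transcribe, so here is a small self-contained sketch (ours; the function names are made up) that evaluates Equations 18.4 and 18.8 and reproduces both figures, 357 and 122 bytes per pixel.

#include <cstdio>

// Average overdraw o(d), Equation 18.4: the harmonic series 1 + 1/2 + ... + 1/d.
double Overdraw(int d)
{
    double o = 0.0;
    for (int i = 1; i <= d; i++)
        o += 1.0 / i;
    return o;
}

// Per-pixel bandwidth in bytes, Equation 18.8. Zr, Zw, Cw, Tr are costs in
// bytes; m is the texture cache miss rate (m = 1 means no cache at all).
double PixelBandwidth(int d, double Zr, double Zw, double Cw, double Tr, double m)
{
    double o = Overdraw(d);
    return d * Zr + o * (Zw + Cw + m * Tr);
}

int main()
{
    double Tr = 4 * 8 * 4;  // four trilinear texture reads, 8 texels x 4 bytes each
    printf("no texture cache:   %.0f bytes/pixel\n", PixelBandwidth(6, 4, 4, 4, Tr, 1.0));
    printf("with texture cache: %.0f bytes/pixel\n", PixelBandwidth(6, 4, 4, 4, Tr, 0.25));
    return 0;
}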
A cost of 122 bytes per pixel may seem small, but put in a real context,
it is not. Assume that we render at 72 frames per second at 1920 × 1200
resolution. This gives

72 × 1920 × 1200 × 122 bytes/s ≈ 18.8 Gbytes/s.    (18.9)
Next, assume that the clock of the memory system is running at 500
MHz. Also, assume that a type of memory called DDRAM is used. Now,
256 bits can be accessed per clock from DDRAM, as opposed to 128 bits
for SDRAM. Using DDRAM gives

500 MHz × (256/8) bytes ≈ 15.6 Gbytes/s.    (18.10)
As can be seen here, the available memory bandwidth (15.6 GB/s) is almost sufficient for this case, where the total bandwidth usage is 18.8 GB/s.
However, the figure of 18.8 GB/s is not truly realistic either. The depth
complexity could be higher, and buffers with more bits per component can
be used (i.e., floating-point buffers, etc.). In addition, the screen resolution
could be increased, even more textures could be accessed, better texture
filtering (which costs more memory accesses) could be used, multisampling
or supersampling may be used, etc. Furthermore, we have only looked at
memory bandwidth usage for fragment processing. Reading vertices and
vertex attributes into the GPU also uses up bandwidth resources.
On top of that, the usage of bandwidth is never 100% in a real system.
It should now be clear that memory bandwidth is extremely important in
a computer graphics system, and that care must be taken when designing
the memory subsystem. However, it is not as bad as it sounds. There are
many techniques for reducing the number of memory accesses, including a
texture cache with prefetching, texture compression, and the techniques in
Sections 18.3.6 and 18.3.7. Another common technique is to use several memory banks that can be accessed in parallel. This also increases the bandwidth delivered by the memory system.
Let us take a look at the bus bandwidth from the CPU to the GPU.
Assume that a vertex needs 56 bytes (3×4 for position, 3×4 for normal, and
4 × 2 × 4 for texture coordinates). Then, using an indexed vertex array,
an additional 6 bytes per triangle are needed to index into the vertices.
For large closed triangle meshes, the number of triangles is about twice
the number of vertices (see Equation 12.8 on page 554). This gives (56 +
6 × 2)/2 = 34 bytes per triangle. Assuming a goal of 300 million triangles
per second, a rate of 10.2 Gbytes per second is needed just for sending the
triangles from the CPU to the graphics hardware. Compare this to PCI
Express 1.1 with 16 lanes of data (a commonly used version in 2007), which
can provide a peak (and essentially unreachable) rate of 4.0 GBytes/sec in
one direction.
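The same arithmetic can be redone in a few lines (again our own sketch) to see where the 34 bytes per triangle and the 10.2 Gbytes/s come from.

#include <cstdio>

int main()
{
    double vertexBytes = 3 * 4 + 3 * 4 + 4 * 2 * 4;  // position, normal, four 2D texture coordinate sets
    double indexBytes  = 3 * 2;                      // three 16-bit indices per triangle
    // A large closed mesh has roughly twice as many triangles as vertices,
    // so each triangle carries about half a vertex's data plus its own indices.
    double bytesPerTriangle = vertexBytes / 2 + indexBytes;        // 34 bytes
    double trianglesPerSecond = 300e6;
    printf("bytes per triangle: %.0f\n", bytesPerTriangle);
    printf("required bus bandwidth: %.1f Gbytes/s\n",
           bytesPerTriangle * trianglesPerSecond / 1e9);           // about 10.2
    return 0;
}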
These numbers imply that the memory system of a GPU and the corre-
sponding algorithms should be designed with great care. Furthermore, the
needed bus bandwidth in a graphics system is huge, and one should design
the buses with the target performance in mind.
18.3.5 Latency
In general, the latency is the time between making the query and receiving
the result. As an example, one may ask for the value at a certain address
in memory, and the time it takes from the query to getting the result is
the latency. In a pipelined system with n pipeline stages, it takes at least
n clock cycles to get through the entire pipeline, and the latency is thus
n clock cycles. This type of latency is a relatively minor problem. As an
example, we will examine an older GPU, where variables such as the effect
of shader program length are less irrelevant. The GeForce3 accelerator has
600–800 pipeline stages and is clocked at 233 MHz. For simplicity, assume
that 700 pipeline stages are used on average, and that one can get through
the entire pipeline in 700 clock cycles (which is ideal). This gives 700/(233 · 10^6) ≈ 3 · 10^−6 seconds = 3 microseconds (μs). Now assume that we want to render the scene at 50 Hz. This gives 1/50 seconds = 20 milliseconds (ms) per frame. Since 3 μs is much smaller than 20 ms (about four orders of magnitude),
it is possible to pass through the entire pipeline many times per frame.
More importantly, due to the pipelined design, results will be generated
every clock cycle, that is, 233 million times per second. On top of that,
as we have seen, the architectures are often parallelized. So, in terms of
rendering, this sort of latency is often not much of a problem. There is also