Figure 18.12. As can be seen, this is a space-filling curve, called a Morton sequence [906], and such curves are known to improve coherency. In this case, the curve is two-dimensional, since textures normally are, too. See
Section 12.4.4 for more on space-filling curves.
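To make the idea concrete, the following sketch (our own illustration; the function names are not from the text) computes a two-dimensional Morton index by interleaving the bits of a texel's (x, y) coordinates, so that texels that are close in 2D tend to end up close together in the one-dimensional memory layout.

#include <cstdint>

// Spread the lower 16 bits of v so that bit i ends up at bit 2i.
static uint32_t PartBits1By1(uint32_t v)
{
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Two-dimensional Morton (Z-order) index for a texel at (x, y):
// the bits of x and y are interleaved, giving the space-filling order.
uint32_t MortonIndex2D(uint32_t x, uint32_t y)
{
    return PartBits1By1(x) | (PartBits1By1(y) << 1);
}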
18.3.2 Memory Architecture
Here, we will mention a few different architectures that have been used
for memory layout. The Xbox, i.e., the first game console from Microsoft,
uses a unified memory architecture (UMA), which means that the graphics
accelerator can use any part of the host memory for textures and different
kinds of buffers [650]. An example of UMA is shown in Figure 18.13. As can
be seen, both the CPU and the graphics accelerator use the same memory,
and thus also the same bus.
The Xbox 360 (Section 18.4.1) has a different memory architecture, as
shown in Figure 18.15 on page 860. The CPU and the GPU share the
same bus and interface to the system memory, which the GPU uses mainly
for textures. However, for the different kinds of buffers (color, stencil, Z,
etc.), the Xbox 360 has separate memory (eDRAM) that only the GPU can
access. This GPU-exclusive memory is often called video memory or local
memory. Access to this memory is usually much faster than letting the
GPU access system memory over a bus. As an example, on a typical PC in
2006 local memory bandwidth was greater than 50 GB/sec, compared to
at most 4 GB/sec between CPU and GPU using PCI Express [123]. Using
a small video memory (e.g., 32 MB) and then accessing system memory
using the GPU is called TurboCache by NVIDIA and HyperMemory by ATI/AMD.

Figure 18.13. The memory architecture of the first Xbox, which is an example of a unified memory architecture (UMA).

Another somewhat less unified layout is to have dedicated memory for
the GPU, which can then be used for textures and buffers in any way
desired, but cannot be used directly by the CPU. This is the approach taken
by the architecture of the PLAYSTATION®3 system (Section 18.4.2),
which uses a local memory for scene data and for textures.
18.3.3 Ports and Buses
A port is a channel for sending data between two devices, and a bus is a
shared channel for sending data among more than two devices. Bandwidth is the term used to describe the throughput of data over a port or bus, and is measured in bytes per second (B/s). Ports and buses are important in
computer graphics architecture because, simply put, they glue together
different building blocks. Also important is that bandwidth is a scarce
resource, and so a careful design and analysis must be done before building
a graphics system. An example of a port is one that connects the CPU
with the graphics accelerator, such as PCI Express (PCIe) used in PCs.
Since ports and buses both provide data transfer capabilities, ports are
often referred to as buses, a convention we will follow here.
Many objects in a scene do not appreciably change shape from frame
to frame. Even a human character is typically rendered with a set of
unchanging meshes that use GPU-side vertex blending at the joints. For
this type of data, animated purely by modeling matrices and vertex shader
programs, it is common to use static vertex buffers, which are placed in
video memory (sometimes called local memory), i.e., dedicated graphics
memory. Doing so makes for fast access by the GPU. For vertices that
are updated by the CPU each frame, dynamic vertex buffers are used, and
these are placed in system memory that can be accessed over a bus, such
as PCI Express. The high bandwidth of this bus is the main reason why such dynamic buffers can still be accessed reasonably quickly.
Another nice property of PCI Express is that queries can be pipelined, so
that several queries can be requested before results return.
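As a concrete illustration (our own sketch, not an API prescribed by the text), the OpenGL usage hints below express exactly this distinction between static and dynamic vertex buffers; the GLEW loader is assumed only to provide the buffer-object entry points.

#include <GL/glew.h>  // assumed loader; any OpenGL 1.5+ header/loader works

// Static buffer: written once, read many times by the GPU, so the driver
// will typically place it in video (local) memory for fast access.
GLuint CreateStaticVertexBuffer(const void* data, GLsizeiptr bytes)
{
    GLuint buffer;
    glGenBuffers(1, &buffer);
    glBindBuffer(GL_ARRAY_BUFFER, buffer);
    glBufferData(GL_ARRAY_BUFFER, bytes, data, GL_STATIC_DRAW);
    return buffer;
}

// Dynamic buffer: updated by the CPU every frame, so it is commonly kept
// in memory reachable over the bus (e.g., PCI Express) and refilled often.
GLuint CreateDynamicVertexBuffer(GLsizeiptr bytes)
{
    GLuint buffer;
    glGenBuffers(1, &buffer);
    glBindBuffer(GL_ARRAY_BUFFER, buffer);
    glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_DYNAMIC_DRAW);
    return buffer;
}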
18.3.4 Memory Bandwidth
Next, we will present a simplified view of memory bandwidth usage between
the CPU and the GPU, and the bandwidth usage for a single fragment. We
say simplified, because there are many factors, such as caching and DMA,
that are hard to take into account.
To begin, let us discuss a theoretical model for the bandwidth used by
a single fragment on its way through the pixel pipeline. Let us assume
that the bandwidth usage consists of three terms, namely, B_c, which is the bandwidth usage to the color buffer, B_z, which is the bandwidth usage to the depth buffer (Z-buffer), and B_t, which is the bandwidth usage to textures. The total bandwidth usage, B, is then

B = B_c + B_z + B_t.    (18.3)
Recall that the average depth complexity, here denoted d, is the number of times a pixel is covered by rendered primitives. The average overdraw, here denoted o(d), is the number of times a pixel has its contents written. Both depth complexity and overdraw are covered in Section 15.4.5. The average overdraw can be modeled as a harmonic series, as shown in Equation 15.2, and this equation is repeated here for clarity:

o(d) = 1 + 1/2 + 1/3 + ··· + 1/d.    (18.4)
The interesting thing is that if there are d triangles covering a pixel, then only o(d) of them actually write to the pixel. In terms of depth buffer bandwidth, B_z, this means that there will be d depth buffer reads (costing Z_r bytes each), but only o(d) depth buffer writes (costing Z_w bytes each). This can be summarized as

B_z = d × Z_r + o(d) × Z_w.    (18.5)
For blending operations, one may also need to read the color buffer (C_r), but we assume this is not part of the common case. However, there will be as many color buffer writes as depth buffer writes, so B_c = o(d) × C_w, where C_w is the cost (in bytes) of writing the color of a single pixel. Since most scenes are textured, one or more texture reads (T_r) may also occur. Some architectures may perform texturing before the depth test, which means B_t = d × T_r. However, in the following, we assume that texture reads are done after the depth test, which means that B_t = o(d) × T_r. Together, the total bandwidth cost is then [13]

B = d × Z_r + o(d) × (Z_w + C_w + T_r).    (18.6)
With trilinear mipmapping, each T_r may cost 8 × 4 = 32 bytes, i.e., eight texel accesses costing four bytes each. If we assume that four textures are accessed per pixel, then T_r = 128 bytes. We also assume that C_w, Z_r, and Z_w each cost four bytes. Assuming a target depth complexity of d = 6 gives o(d) ≈ 2.45, which means a pixel costs

b = 6 × 4 + 2.45 × (4 + 4 + 128) ≈ 357 bytes per pixel.    (18.7)
However, we can also take a texture cache (Section 18.3.1) into account,
and this reduces the B_t cost quite a bit. With a texture cache miss rate m, Equation 18.6 is refined to

b = d × Z_r + o(d) × (Z_w + C_w + m × T_r)
  = d × Z_r + o(d) × Z_w   +   o(d) × C_w   +   o(d) × m × T_r
    (depth buffer, B_z)        (color buffer, B_c)   (texture read, B_t)
  = B_z + B_c + B_t.    (18.8)
This equation is what we call the rasterization equation. Hakura and Gupta use m = 0.25, which means that on every fourth texel access, we get a miss in the cache. Now, let us use this formula for an example where, again, d = 6 and o(d) ≈ 2.45. This gives B_c = 2.45 × 4 = 9.8 bytes, and B_z = 6 × 4 + 2.45 × 4 = 33.8 bytes. The texture bandwidth usage becomes B_t = 2.45 × 4 × 8 × 4 × 0.25 = 78.4 bytes. This sums to b = 33.8 + 9.8 + 78.4 = 122 bytes, which is a drastic reduction compared to the previous example (357 bytes).
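The arithmetic above is easy to mis-transcribe, so here is a small self-contained sketch (ours; the function names are made up) that evaluates Equations 18.4 and 18.8 and reproduces both figures, 357 and 122 bytes per pixel.

#include <cstdio>

// Average overdraw o(d), Equation 18.4: the harmonic series 1 + 1/2 + ... + 1/d.
double Overdraw(int d)
{
    double o = 0.0;
    for (int i = 1; i <= d; i++)
        o += 1.0 / i;
    return o;
}

// Per-pixel bandwidth in bytes, Equation 18.8. Zr, Zw, Cw, Tr are costs in
// bytes; m is the texture cache miss rate (m = 1 means no cache at all).
double PixelBandwidth(int d, double Zr, double Zw, double Cw, double Tr, double m)
{
    double o = Overdraw(d);
    return d * Zr + o * (Zw + Cw + m * Tr);
}

int main()
{
    double Tr = 4 * 8 * 4;  // four trilinear texture reads, 8 texels x 4 bytes each
    printf("no texture cache:   %.0f bytes/pixel\n", PixelBandwidth(6, 4, 4, 4, Tr, 1.0));
    printf("with texture cache: %.0f bytes/pixel\n", PixelBandwidth(6, 4, 4, 4, Tr, 0.25));
    return 0;
}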
A cost of 122 bytes per pixel may seem small, but put in a real context,
it is not. Assume that we render at 72 frames per second at 1920 × 1200
resolution. This gives

72 × 1920 × 1200 × 122 bytes/s ≈ 18.8 Gbytes/s.    (18.9)
Next, assume that the clock of the memory system is running at 500
MHz. Also, assume that a type of memory called DDRAM is used. Now,
256 bits can be accessed per clock from DDRAM, as opposed to 128 bits
for SDRAM. Using DDRAM gives

500 MHz × (256/8) bytes ≈ 15.6 Gbytes/s.    (18.10)
As can be seen here, the available memory bandwidth (15.6 GB/s) is almost sufficient for this case, where the total bandwidth usage is 18.8 GB/s.
However, the figure of 18.8 GB/s is not truly realistic either. The depth
complexity could be higher, and buffers with more bits per component can
be used (i.e., floating-point buffers, etc.). In addition, the screen resolution
could be increased, even more textures could be accessed, better texture
filtering (which costs more memory accesses) could be used, multisampling
or supersampling may be used, etc. Furthermore, we have only looked at
memory bandwidth usage for fragment processing. Reading vertices and
vertex attributes into the GPU also uses up bandwidth resources.
On top of that, the usage of bandwidth is never 100% in a real system.
It should now be clear that memory bandwidth is extremely important in
a computer graphics system, and that care must be taken when designing
the memory subsystem. However, it is not as bad as it sounds. There are
many techniques for reducing the number of memory accesses, including a
texture cache with prefetching, texture compression, and the techniques in
Sections 18.3.6 and 18.3.7. Another common technique is to use several memory banks that can be accessed in parallel. This also increases the bandwidth delivered by the memory system.
Let us take a look at the bus bandwidth from the CPU to the GPU.
Assume that a vertex needs 56 bytes (3×4 for position, 3×4 for normal, and
4 × 2 × 4 for texture coordinates). Then, using an indexed vertex array,
an additional 6 bytes per triangle are needed to index into the vertices.
For large closed triangle meshes, the number of triangles is about twice
the number of vertices (see Equation 12.8 on page 554). This gives (56 +
6 × 2)/2 = 34 bytes per triangle. Assuming a goal of 300 million triangles
per second, a rate of 10.2 Gbytes per second is needed just for sending the
triangles from the CPU to the graphics hardware. Compare this to PCI
Express 1.1 with 16 lanes of data (a commonly used version in 2007), which
can provide a peak (and essentially unreachable) rate of 4.0 GBytes/sec in
one direction.
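The same arithmetic can be redone in a few lines (again our own sketch) to see where the 34 bytes per triangle and the 10.2 Gbytes/s come from.

#include <cstdio>

int main()
{
    double vertexBytes = 3 * 4 + 3 * 4 + 4 * 2 * 4;  // position, normal, four 2D texture coordinate sets
    double indexBytes  = 3 * 2;                      // three 16-bit indices per triangle
    // A large closed mesh has roughly twice as many triangles as vertices,
    // so each triangle carries about half a vertex's data plus its own indices.
    double bytesPerTriangle = vertexBytes / 2 + indexBytes;        // 34 bytes
    double trianglesPerSecond = 300e6;
    printf("bytes per triangle: %.0f\n", bytesPerTriangle);
    printf("required bus bandwidth: %.1f Gbytes/s\n",
           bytesPerTriangle * trianglesPerSecond / 1e9);           // about 10.2
    return 0;
}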
These numbers imply that the memory system of a GPU and the corre-
sponding algorithms should be designed with great care. Furthermore, the
needed bus bandwidth in a graphics system is huge, and one should design
the buses with the target performance in mind.
18.3.5 Latency
In general, the latency is the time between making the query and receiving
the result. As an example, one may ask for the value at a certain address
in memory, and the time it takes from the query to getting the result is
the latency. In a pipelined system with n pipeline stages, it takes at least
n clock cycles to get through the entire pipeline, and the latency is thus
n clock cycles. This type of latency is a relatively minor problem. As an
example, we will examine an older GPU, where variables such as the effect
of shader program length are less irrelevant. The GeForce3 accelerator has
600–800 pipeline stages and is clocked at 233 MHz. For simplicity, assume
that 700 pipeline stages are used on average, and that one can get through
the entire pipeline in 700 clock cycles (which is ideal). This gives 700/(233 · 10^6) ≈ 3 · 10^−6 seconds = 3 microseconds (μs). Now assume that we want to render the scene at 50 Hz. This gives 1/50 seconds = 20 milliseconds (ms) per frame. Since 3 μs is much smaller than 20 ms (about four orders of magnitude),
it is possible to pass through the entire pipeline many times per frame.
More importantly, due to the pipelined design, results will be generated
every clock cycle, that is, 233 million times per second. On top of that,
as we have seen, the architectures are often parallelized. So, in terms of
rendering, this sort of latency is often not much of a problem. There is also