18.3. Architecture 853
many techniques for reducing the number of memory accesses, including a
texture cache with prefetching, texture compression, and the techniques in
Sections 18.3.6 and 18.3.7. Another common technique is to use several
memory banks that can be accessed in parallel, which also increases the
bandwidth delivered by the memory system.
Let us take a look at the bus bandwidth from the CPU to the GPU.
Assume that a vertex needs 56 bytes (3×4 for position, 3×4 for normal, and
4 × 2 × 4 for texture coordinates). Then, using an indexed vertex array,
an additional 6 bytes per triangle are needed to index into the vertices.
For large closed triangle meshes, the number of triangles is about twice
the number of vertices (see Equation 12.8 on page 554). This gives (56 +
6 × 2)/2 = 34 bytes per triangle. Assuming a goal of 300 million triangles
per second, a rate of 10.2 Gbytes per second is needed just for sending the
triangles from the CPU to the graphics hardware. Compare this to PCI
Express 1.1 with 16 lanes of data (a commonly used version in 2007), which
can provide a peak (and essentially unreachable) rate of 4.0 GBytes/sec in
one direction.
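The bandwidth arithmetic above can be checked with a short sketch (the variable names are illustrative, not from any API):

```python
# Per-vertex data, as assumed in the text:
# position (3 floats) + normal (3 floats) + four sets of 2D texture coordinates.
BYTES_PER_VERTEX = 3 * 4 + 3 * 4 + 4 * 2 * 4   # = 56 bytes
BYTES_PER_TRI_INDICES = 3 * 2                  # three 16-bit indices = 6 bytes

# For a large closed mesh, triangles ~ 2 x vertices, so each triangle
# amortizes half a vertex's data plus its own three indices.
bytes_per_triangle = BYTES_PER_VERTEX / 2 + BYTES_PER_TRI_INDICES  # = 34 bytes

triangles_per_second = 300e6
required_bandwidth = bytes_per_triangle * triangles_per_second  # = 10.2e9 bytes/s

print(bytes_per_triangle, required_bandwidth)
```

Note how the required 10.2 GBytes/sec exceeds the 4.0 GBytes/sec peak of 16-lane PCI Express 1.1 by more than a factor of two.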
These numbers imply that the memory system of a GPU and the corre-
sponding algorithms should be designed with great care. Furthermore, the
needed bus bandwidth in a graphics system is huge, and one should design
the buses with the target performance in mind.
18.3.5 Latency
In general, the latency is the time between making the query and receiving
the result. As an example, one may ask for the value at a certain address
in memory, and the time it takes from the query to getting the result is
the latency. In a pipelined system with n pipeline stages, it takes at least
n clock cycles to get through the entire pipeline, and the latency is thus
n clock cycles. This type of latency is a relatively minor problem. As an
example, we will examine an older GPU, where variables such as the effect
of shader program length are less relevant. The GeForce3 accelerator has
600–800 pipeline stages and is clocked at 233 MHz. For simplicity, assume
that 700 pipeline stages are used on average, and that one can get through
the entire pipeline in 700 clock cycles (which is ideal). This gives
700/(233 · 10⁶) ≈ 3 · 10⁻⁶ seconds = 3 microseconds (μs). Now assume that we want to
render the scene at 50 Hz. This gives 1/50 seconds = 20 milliseconds (ms)
per frame. Since 3 μs is much smaller than 20 ms (by about four orders of magnitude),
it is possible to pass through the entire pipeline many times per frame.
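This latency-versus-frame-time comparison can likewise be sketched in a few lines (a minimal illustration, assuming the 700-stage, 233 MHz figures above):

```python
# Pipeline latency: cycles to traverse the whole pipeline / clock rate.
stages = 700
clock_hz = 233e6
latency_s = stages / clock_hz   # about 3e-6 s = 3 microseconds

# Frame budget at a 50 Hz refresh rate.
frame_s = 1 / 50                # 0.02 s = 20 milliseconds

# How many full pipeline traversals fit in one frame.
traversals_per_frame = frame_s / latency_s   # on the order of several thousand

print(latency_s, traversals_per_frame)
```

The ratio comes out in the thousands, which is why this kind of pipeline latency is rarely a problem for rendering throughput.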
More importantly, due to the pipelined design, results will be generated
every clock cycle, that is, 233 million times per second. On top of that,
as we have seen, the architectures are often parallelized. So, in terms of
rendering, this sort of latency is not often much of a problem. There is also