18.3. Architecture 847
Texture Access
Raw computational performance has grown exponentially for many years. While processors have continued to increase in speed at a rate in keeping with Moore's Law, and graphics pipelines have actually exceeded this rate [735], memory bandwidth and memory latency have not kept up. Bandwidth is the rate at which data can be transferred, and latency is the time between request and retrieval. While the capability of a processor has been going up 71% per year, DRAM bandwidth is improving about 25% per year, and latency a mere 5% [981]. For NVIDIA's 8800 architecture [948], you can do about 14 floating-point operations per texel access.⁶ Chip density rises faster than available bandwidth, so this ratio will only increase [1400]. In addition, the trend is to use more and more textures per primitive. Reading from texture memory is often the main consumer of bandwidth [10]. Therefore, when reading texels from memory, care must be taken both to reduce the bandwidth used and to hide latency.
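The 14 operations-per-texel ratio follows directly from the peak figures given in the footnote; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the FLOPs-per-texel ratio cited in the text,
# using the peak figures from the footnote (peak, not sustained, numbers).
compute = 520e9        # peak compute throughput, FLOPS
texel_rate = 38.4e9    # peak texel accesses per second

flops_per_texel = compute / texel_rate
print(f"{flops_per_texel:.1f} FLOPs per texel access")  # about 13.5, i.e., roughly 14
```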
To save bandwidth, most architectures use caches at various places in the pipeline, and to hide latency, a technique called prefetching is often used. Caching is implemented with a small on-chip memory (a few kilobytes) where the results of recent texture reads are stored and can be accessed very quickly [489, 582, 583]. This memory is shared among all textures currently in use. If neighboring pixels need to access the same or closely located texels, they are likely to find them in the cache. This is what is done for standard CPUs as well. However, reading texels into the cache takes time, and most often an entire cache block (e.g., 32 bytes) is read in at once.
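A minimal sketch of why this helps, assuming a hypothetical linear texel layout with 4-byte texels and 32-byte cache blocks (real GPU texture caches use tiled layouts and eviction policies omitted here):

```python
# Why neighboring texel reads hit the cache: a miss fetches a whole block,
# so nearby texel addresses that share that block are served for free.
BLOCK_BYTES = 32        # one cache block, as in the text's example
TEXEL_BYTES = 4         # assume 4-byte RGBA8 texels
TEXELS_PER_BLOCK = BLOCK_BYTES // TEXEL_BYTES  # 8 texels per block

cache = set()           # set of cached block IDs (no eviction, for brevity)
hits = misses = 0

# Eight neighboring texels along a scanline -- a common access pattern.
for texel_index in range(8):
    block = texel_index // TEXELS_PER_BLOCK
    if block in cache:
        hits += 1
    else:
        misses += 1
        cache.add(block)  # fetching the missed texel loads the whole block

print(hits, misses)       # 7 hits, 1 miss: one block fetch serves all eight
```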
So, if a texel is not in the cache, it may take a relatively long time before it can be found there. One solution employed by GPUs today to hide this latency is to keep many fragments in flight at a time. Say the shader program is
about to execute a texture lookup instruction. If only one fragment is kept
in flight, we need to wait until the texels are available in the texture cache,
and this will keep the pipeline idle. However, assume that we can keep
100 fragments in flight at a time. First, fragment 0 will request a texture
access. However, since it will take many clock cycles before the requested
data is in the cache, the GPU executes the same texture lookup instruction
for fragment 1, and so on for all 100 fragments. When these 100 texture
lookup instructions have been executed, we are back at processing fragment
0. Say a dot product instruction, using the texture lookup as an argument,
follows. At this time, it is very likely that the requested data is in the
cache, and processing can continue immediately. This is a common way
of hiding latency in GPUs. This is also the reason why a GPU has many
registers. See the Xbox 360 description in Section 18.4.1 for an example.
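The scheme just described can be captured in a toy cycle count, assuming (hypothetically) a 100-cycle memory latency and one cycle to issue a lookup; real GPU scheduling is far more complex:

```python
# Toy model of latency hiding by keeping many fragments in flight.
# Illustrative numbers only: 100-cycle texture latency, 1-cycle issue.
LATENCY = 100      # cycles from texture request until the data is in the cache
ISSUE = 1          # cycles to issue one texture lookup instruction
FRAGMENTS = 100

# One fragment in flight: every lookup stalls for the full latency.
serial = FRAGMENTS * (ISSUE + LATENCY)

# 100 fragments in flight: issue all lookups back to back; by the time the
# last one has been issued, fragment 0's data has already arrived, so the
# dot product that consumes it can start immediately.
overlapped = FRAGMENTS * ISSUE + max(0, LATENCY - (FRAGMENTS - 1) * ISSUE)

print(serial, overlapped)  # 10100 vs. 101 cycles for the same 100 lookups
```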
⁶ This ratio was obtained using 520 GFLOPS as the computing power, and 38.4 billion texel accesses per second. It should be noted that these are peak performance numbers.