866 18. Graphics Hardware
storage, and Post T&L has, in practice, storage for roughly 24 vertices. The
task of the Pre T&L vertex cache is to avoid redundant memory fetches.
When a vertex that is needed for a triangle can be found in the Pre T&L
vertex cache, the memory fetch for that vertex data can be avoided. The
Post T&L vertex cache, on the other hand, is there to avoid processing
the same vertex with the vertex shader more than once. This can happen
because a vertex is, on average, shared by six triangles. Both of these
caches can improve performance tremendously.
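To make the benefit of the Post T&L vertex cache concrete, the following minimal sketch counts how many times the vertex shader must run for an indexed triangle list. The FIFO replacement policy, the function names, and the grid-shaped test mesh are assumptions made for illustration; the text above states only that the cache holds roughly 24 vertices.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size=24):
    """Count vertex shader invocations for an index stream.

    Assumes a FIFO cache of shaded vertices (hypothetical policy);
    a miss means the vertex has to be shaded again.
    """
    cache = deque(maxlen=cache_size)  # the oldest entry is evicted when full
    runs = 0
    for idx in indices:
        if idx not in cache:
            runs += 1          # miss: run the vertex shader
            cache.append(idx)  # store the shaded result
        # hit: the shaded vertex is reused, FIFO order is unchanged
    return runs


def grid_indices(cols, rows):
    """Index list for a grid of cols x rows cells, two triangles per cell."""
    stride = cols + 1  # vertices per row of the grid
    out = []
    for r in range(rows):
        for c in range(cols):
            v0 = r * stride + c  # upper-left vertex of the cell
            v1 = v0 + 1          # upper-right
            v2 = v0 + stride     # lower-left
            v3 = v2 + 1          # lower-right
            out += [v0, v1, v2, v1, v3, v2]
    return out


if __name__ == "__main__":
    indices = grid_indices(4, 4)
    triangles = len(indices) // 3
    runs = count_vertex_shader_runs(indices)
    print(f"{triangles} triangles, {runs} shader runs "
          f"instead of {len(indices)} without a cache")
```

For this small mesh every vertex is shaded exactly once, so the vertex shader runs less than once per triangle on average. Larger meshes need a cache-friendly triangle order to approach that ratio.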
Due to the memory requirements of a shaded vertex located in the
Post T&L vertex cache, it may take quite some time to fetch it from
that cache. Therefore, a Primitive Assembly cache is inserted after the
Post T&L vertex cache. This cache can store only four fully shaded ver-
tices, and its task is to avoid fetches from the Post T&L vertex cache.
For example, when rendering a triangle strip, two vertices from the pre-
vious triangle are used, along with another new vertex, to create a new
triangle. If those two vertices are already in the Primitive Assembly
cache, only one vertex is fetched from the Post T&L vertex cache. It should
be noted, however, that as with all caches, the hit rate is not perfect,
so when the desired data is not in the cache, it needs to be fetched or
recomputed.
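The triangle strip example can be checked with a small simulation of a four-entry cache. This is only a sketch: the FIFO replacement policy and the function names are assumptions, and the alternating winding order of strip triangles is ignored because it does not affect which vertices are fetched.

```python
from collections import deque


def strip_triangles(num_vertices):
    """Triangles of a strip: each new vertex forms a triangle with the previous two."""
    return [(i, i + 1, i + 2) for i in range(num_vertices - 2)]


def post_tnl_fetches(triangles, capacity=4):
    """Count fetches from the Post T&L vertex cache during primitive assembly.

    Assumes a FIFO cache holding `capacity` shaded vertices (hypothetical policy).
    """
    cache = deque(maxlen=capacity)
    fetches = 0
    for tri in triangles:
        for v in tri:
            if v not in cache:
                fetches += 1     # miss: fetch the shaded vertex from Post T&L
                cache.append(v)
    return fetches


if __name__ == "__main__":
    tris = strip_triangles(6)  # 6 vertices form 4 triangles
    print(post_tnl_fetches(tris), "fetches with the cache,",
          3 * len(tris), "without it")
```

After the first triangle, each further triangle in the strip costs a single fetch, which matches the behavior described above.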
When all the vertices for a triangle have been assembled in the Primitive
Assembly cache, they are forwarded to the cull, clip, and setup block. This
block implements clipping, triangle setup, and triangle traversal, so it can
be seen as straddling the boundary between the geometry and rasterization
stages. The cull, clip, and setup block also implements two
types of culling: backface culling (discussed in Section 14.2), and Z-culling
(described in Section 18.3.7). Since this block generates all the fragments
that are inside a triangle, the PLAYSTATION 3 system can be seen as a
sort-last fragment architecture when only the GPU is taken into consider-
ation. As we shall see, usage of the Cell Broadband Engine can introduce
another level of sorting, making the PLAYSTATION 3 system a hybrid of
sort-first and sort-last fragment.
Fragments are always generated in 2 × 2-pixel units called quads. All
the pixels in a quad must belong to the same triangle. Each quad is sent to
one of six quad pixel shader (rasterizer) units. Each of these units processes
all the pixels in a quad simultaneously, so these units can be thought of as
comprising 24 pixel shader units in total. However, in practice, many of the
quads will have some invalid pixels (pixels that are outside the triangle),
especially in scenes composed of many small triangles. These invalid pixels
are still processed by the quad pixel shader units, and their shading results
are discarded. In the extreme case where each quad has one valid pixel,
this inefficiency can decrease pixel shader throughput by a factor of four.
See McCormack et al.’s paper [842] for different rasterization strategies.
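The cost of invalid pixels can be estimated by comparing the number of covered pixels with the number of pixels the quad units actually shade. The sketch below assumes that quads are aligned to even pixel coordinates and that coverage is given as a set of pixel positions; both are illustrative choices, as is the function name.

```python
def quad_utilization(covered):
    """Fraction of shaded pixels whose results are kept.

    `covered` is a set of (x, y) pixel coordinates inside the triangle.
    Every 2 x 2 quad touched by at least one covered pixel is shaded in
    full, that is, four pixels per quad.
    """
    quads = {(x // 2, y // 2) for x, y in covered}  # quads touched by the triangle
    shaded = 4 * len(quads)
    return len(covered) / shaded if shaded else 0.0


if __name__ == "__main__":
    full_quad = {(0, 0), (1, 0), (0, 1), (1, 1)}
    one_pixel_per_quad = {(0, 0), (2, 0), (0, 2)}
    print(quad_utilization(full_quad))           # all four results are kept
    print(quad_utilization(one_pixel_per_quad))  # one result in four is kept
```

A utilization of 0.25 corresponds to the extreme case in the text, where pixel shader throughput drops by a factor of four.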