872 18. Graphics Hardware
the output from the vertex shader is forwarded to the rasterizer stage.
With the Mali 200, this output is written back to memory. The vertex
processing unit is heavily pipelined, and this pipeline is single-threaded,
allowing the processing of one vertex to start every cycle.
The Tiling unit (Figure 18.20) performs backface and viewport culling
first, then determines which tiles a primitive overlaps. It stores pointers to
these primitives in each corresponding tile list. Processing one tile at a
time requires that all primitives in the scene be known before
rasterization can begin. The end of the scene may be signaled by requesting
that the front and back buffers be swapped. When this occurs for a tiling
architecture, all geometry processing is basically done, and the results are
stored in external memory. The next step is to start per-pixel processing of
the triangles in each tile list, and while this is done, the geometry processing
can commence working on the next frame. This processing model implies
that there is more latency in a tiling architecture.
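The binning step described above can be illustrated with a minimal sketch. It is not the Mali 200 implementation: the function name bin_triangles, the counterclockwise front-face convention, and the bounding-box overlap test are all assumptions made for illustration. A bounding box is conservative, so a triangle may be listed in a tile it does not actually touch.

```python
import math

TILE = 16  # tile edge in pixels, as in the text


def bin_triangles(triangles, width, height, tile=TILE):
    """Return {(tx, ty): [triangle indices]} for screen-space triangles.

    Each triangle is three (x, y) vertices. Front faces are assumed to
    have positive signed area (a hypothetical convention).
    """
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    tile_lists = {}
    for index, ((x0, y0), (x1, y1), (x2, y2)) in enumerate(triangles):
        # Backface culling: discard triangles with non-positive signed area.
        area2 = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
        if area2 <= 0:
            continue
        min_x, max_x = min(x0, x1, x2), max(x0, x1, x2)
        min_y, max_y = min(y0, y1, y2), max(y0, y1, y2)
        # Viewport culling: discard triangles entirely off screen.
        if max_x < 0 or max_y < 0 or min_x >= width or min_y >= height:
            continue
        # Range of tiles touched by the bounding box, clamped to the screen.
        tx0 = max(math.floor(min_x) // tile, 0)
        tx1 = min(math.floor(max_x) // tile, tiles_x - 1)
        ty0 = max(math.floor(min_y) // tile, 0)
        ty1 = min(math.floor(max_y) // tile, tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                # Store a reference (here, an index) in the tile's list.
                tile_lists.setdefault((tx, ty), []).append(index)
    return tile_lists
```

Only once every triangle of the frame has passed through such a step are the tile lists complete, which is why per-pixel processing of a frame cannot start before its geometry processing has finished.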
At this point, fragment processing is performed, and this includes tri-
angle traversal (finding which pixels/samples are inside a triangle), and
pixel shader execution, blending, and other per-pixel operations. The sin-
gle most important feature of a tiling architecture is that the frame buffer
(including color, depth, and stencil, for example) for a single tile can be
stored in very fast on-chip memory, here called the on-chip tile buffer. This
is affordable because the tiles are small (16 × 16 pixels). Bigger tile sizes
make the chip larger, and hence less suitable for mobile phones. When
all rendering has finished to a tile, the desired output (usually color, and
possibly depth) of the tile is copied to an off-chip frame buffer (in external
memory) of the same size as the screen. This means that all accesses to
the frame buffer during per-pixel processing are essentially free. Avoiding
use of the external buses is highly desirable, because such use comes with
a high cost in terms of energy [13]. This design also means that buffer
compression techniques, such as the ones described in Section 18.3.6, are
of no relevance here.
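The benefit of the on-chip tile buffer can be made concrete with a small sketch. It is a simplified software model, not the hardware design: the names render_tile and resolve_tile, the less-than depth test, and the access counters are assumptions introduced for illustration. All depth and color traffic during shading stays in the small local buffers, and external memory is written exactly once per pixel when the tile is finished.

```python
TILE = 16  # tile edge in pixels, as in the text


def render_tile(fragments, tile=TILE):
    """Depth-test fragments into local (on-chip) buffers.

    fragments: iterable of (x, y, depth, color) in tile-local coordinates.
    Returns the tile's color buffer and the number of on-chip accesses.
    """
    color = [[0] * tile for _ in range(tile)]
    depth = [[float("inf")] * tile for _ in range(tile)]
    on_chip_accesses = 0
    for x, y, z, c in fragments:
        on_chip_accesses += 1          # depth read
        if z < depth[y][x]:            # assumed depth test: nearer wins
            depth[y][x] = z
            color[y][x] = c
            on_chip_accesses += 2      # depth write and color write
    return color, on_chip_accesses


def resolve_tile(framebuffer, color, tx, ty, tile=TILE):
    """Copy a finished tile to the off-chip frame buffer.

    Returns the number of external memory writes (one per pixel).
    """
    writes = 0
    for y in range(tile):
        for x in range(tile):
            framebuffer[ty * tile + y][tx * tile + x] = color[y][x]
            writes += 1
    return writes
```

In this model, three overlapping full-tile layers cause well over a thousand buffer accesses, yet the external bus still sees only 256 writes for the tile. The depth buffer never leaves the chip unless it is explicitly requested as output.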
To find which pixels or samples are inside a triangle, the Mali 200 em-
ploys a hierarchical testing scheme. Since the triangle is known to overlap
with the 16 × 16 pixel tile, testing starts against the four 8 × 8 pixel subtiles.
If the triangle is found not to overlap with a subtile, no further testing, nor
any processing, is done there. Otherwise, the testing continues down until
the size of a subtile is 2 × 2 pixels. At this scale, it is possible to compute
approximations of the derivatives of any variable in the fragment shader.
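A hierarchical traversal of this kind can be sketched as follows. The text does not say how the Mali 200 tests a triangle against a subtile, so the edge-function box test used here is an assumption, as are all function names and the counterclockwise vertex order. The test is conservative: a subtile that merely touches the triangle is kept. The last function shows one plausible way to form derivative estimates from the four values of a surviving 2 × 2 quad.

```python
def edge_functions(tri):
    """Return (a, b, c) per edge; a*x + b*y + c >= 0 is the inside half-plane.

    Assumes the vertices are in counterclockwise order (positive signed area).
    """
    edges = []
    for i in range(3):
        (x0, y0), (x1, y1) = tri[i], tri[(i + 1) % 3]
        a, b = y0 - y1, x1 - x0
        edges.append((a, b, -(a * x0 + b * y0)))
    return edges


def box_may_overlap(edges, x, y, size):
    """Conservative test of the square [x, x+size] x [y, y+size]."""
    for a, b, c in edges:
        # Evaluate the edge at the box corner that lies farthest inside.
        cx = x + size if a > 0 else x
        cy = y + size if b > 0 else y
        if a * cx + b * cy + c < 0:
            return False  # the whole box is outside this edge
    return True


def traverse(edges, x, y, size, quads):
    """Recursively subdivide until 2 x 2 quads are reached."""
    if not box_may_overlap(edges, x, y, size):
        return  # no further testing or processing for this subtile
    if size == 2:
        quads.append((x, y))
        return
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            traverse(edges, x + dx, y + dy, half, quads)


def covered_quads(tri, tile=16):
    """Start against the four subtiles of a tile known to overlap tri."""
    edges = edge_functions(tri)
    quads = []
    half = tile // 2
    for dy in (0, half):
        for dx in (0, half):
            traverse(edges, dx, dy, half, quads)
    return quads


def quad_derivatives(values):
    """Finite differences in a 2 x 2 quad; values[row][column], row along y."""
    ddx = values[0][1] - values[0][0]
    ddy = values[1][0] - values[0][0]
    return ddx, ddy
```

For a small triangle in one corner of the tile, most of the 8 × 8 subtiles are rejected after a single test, so the cost of traversal tracks the covered area rather than the tile size.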
Such derivatives are obtained by simply subtracting the values at
neighboring pixels in the x- and y-directions. The Mali
200 architecture also performs hierarchical depth testing, similar to the
Z-max culling technique described in Section 18.3.7, during this hierarchi-
cal triangle traversal. At the same time, hierarchical stencil culling and
alpha culling are done. If a tile survives Z-culling, Mali computes individ-