15.5. Multiprocessing 717
Figure 15.6. Two different ways of utilizing multiple processors. At the top we show
how three processors (CPUs) are used in a multiprocessor pipeline, and at the bottom
we show parallel execution on three CPUs. One of the differences between these two
implementations is that lower latency can be achieved if the configuration at the bottom
is used. On the other hand, it may be easier to use a multiprocessor pipeline. The ideal
speedup for both of these configurations is linear, i.e., using n CPUs would give a speedup
of n times.
between the processors to communicate results. Shared memory multipro-
cessors are just as they sound; all processors share a logical address space
of memory among themselves. Most popular multiprocessor systems use
shared memory, and most of these have a symmetric multiprocessing (SMP)
design. SMP means that all the processors are identical. A multicore PC
system is an example of a symmetric multiprocessing architecture.
Here, we will present two methods for utilizing multiple processors for
real-time graphics. The first method—multiprocessor pipelining, also called
temporal parallelism—will be covered in more detail than the second—
parallel processing, also called spatial parallelism. These two methods are
illustrated in Figure 15.6. Multiprocessor pipelining and parallel processing
can also be combined, but little research has been done in this area. For
this discussion, we assume that a multiprocessor computer is available, but
do not go into details about different types of multiprocessors.
15.5.1 Multiprocessor Pipelining
As we have seen, pipelining is a method for speeding up execution by di-
viding a job into certain pipeline stages that are executed in parallel. The
result from one pipeline stage is passed on to the next. The ideal speedup
is n times for n pipeline stages, and the slowest stage (the bottleneck)
determines the actual speedup. Up to this point, we have seen pipelin-
ing used to run the application, geometry processing, and rasterization in
parallel for a single-CPU system. This technique can also be used when
718 15. Pipeline Optimization
multiple processors are available on the host, and in these cases, it is called
multiprocess pipelining or software pipelining.
We assume there is only one graphics accelerator available to the system,
and that only one processor thread can access it. This leaves the applica-
tion stage to be pipelined. Therefore, the application stage is divided into
three stages [1079]: APP, CULL, and DRAW. This is very coarse-grained
pipelining, which means that each stage is relatively long. The APP stage
is the first stage in the pipeline and therefore controls the others. It is
in this stage that the application programmer can put in additional code
that does, for example, collision detection. This stage also updates the
viewpoint. The CULL stage can perform:
- Traversal and hierarchical view frustum culling on a scene graph (see Section 14.3).
- Level of detail selection (see Section 14.7).
- State sorting—geometry with similar state is sorted into bins in order to minimize state changes. For example, all transparent objects are sorted into one bin, which minimizes the need to change the blend state. These objects could also be sorted in a rough back-to-front order for correct rendering (see Section 5.7).
- Finally (and always performed), generation of a simple list of all objects that should be rendered.
Figure 15.7. Different configurations for a multiprocessor pipeline. The thick lines
represent synchronization between the stages, and the subscripts represent the frame
number. At the top, a single CPU pipeline is shown. In the middle and at the bottom
are shown two different pipeline subdivisions using two CPUs. The middle has one
pipeline stage for APP and CULL and one pipeline stage for DRAW. This is a suitable
subdivision if DRAW has much more work to do than the others. At the bottom, APP
has one pipeline stage and the other two have another. This is suitable if APP has much
more work than the others. Note that the bottom two configurations have more time
for the APP, CULL, and DRAW stages.
The DRAW stage takes the list from the CULL stage and issues all
graphics calls in this list. This means that it simply walks through the
list and feeds the geometry stage, which in turn feeds the rasterizer stage.
Figure 15.7 shows some examples of how this pipeline can be used.
If one processor is available, then all three stages are run on that CPU.
If two CPUs are available, then APP and CULL can be executed on one
CPU and DRAW on the other. Another configuration would be to execute
APP on one processor and CULL and DRAW on the other. Which is the
best depends on the workloads for the different stages. Finally, if the host
has three CPUs, then each stage can be executed on a separate CPU. This
possibility is shown in Figure 15.8.
The advantage of this technique is that the throughput, i.e., the render-
ing speed, increases. The downside is that, compared to parallel processing,
the latency is greater. Latency, or temporal delay, is the time it takes from
the polling of the user’s actions to the final image [1329]. This should not
be confused with frame rate, which is the number of frames displayed per
second. For example, say the user is using a head-mounted display. The
determination of the head’s position may take 50 milliseconds to reach the
CPU, then it takes 15 milliseconds to render the frame. The latency is then
65 milliseconds from initial input to display. Even though the frame rate
is 66.7 Hz (1/0.015 seconds), interactivity will feel sluggish because of the
delay in sending the position changes to the CPU. Ignoring any delay due
to user interaction (which is a constant under both systems), multiprocess-
ing has more latency than parallel processing because it uses a pipeline.
As is discussed in detail in the next section, parallel processing breaks up
the frame’s work into pieces that are run concurrently.
In comparison to using a single CPU on the host, multiprocessor pipelin-
ing gives a higher frame rate and the latency is about the same or a little
greater due to the cost of synchronization. The latency increases with
Figure 15.8. At the top, a three-stage pipeline is shown. In comparison to the config-
urations in Figure 15.7, this configuration has more time for each pipeline stage. The
bottom illustration shows a way to reduce the latency: The CULL and the DRAW are
overlapped with FIFO buffering in between.
the number of stages in the pipeline. For a well-balanced application the
speedup is n times for n CPUs.
One technique for reducing the latency is to update the viewpoint and
other latency-critical parameters at the end of the APP stage [1079]. This
reduces the latency by (approximately) one frame. Another way to reduce
latency is to execute CULL and DRAW overlapped. This means that the
result from CULL is sent over to DRAW as soon as anything is ready for
rendering. For this to work, there has to be some buffering, typically a
FIFO, between those stages. The stages are stalled on empty and full
conditions; i.e., when the buffer is full, then CULL has to stall, and when
the buffer is empty, DRAW has to stall. The disadvantage is that techniques
such as state sorting cannot be used to the same extent, since primitives
have to be rendered as soon as they have been processed by CULL. This
latency reduction technique is visualized in Figure 15.8.
The pipeline in this figure uses a maximum of three CPUs, and the
stages have certain tasks. However, this technique is in no way limited to
this configuration—rather, you can use any number of CPUs and divide
the work in any way you want. The key is to make a smart division of the
entire job to be done so that the pipeline tends to be balanced. For example,
multiprocessor pipelining has been used to speed up the occlusion culling
algorithm in Section 14.6 using a different pipeline stage division than the
one presented here [1403]. Multiprocessor pipelining has also been used by
Funkhouser [372] to speed up LOD management (see Section 14.7.3) and
portal culling (Section 14.4).
The multiprocessor pipelining technique requires a minimum of syn-
chronization in that it needs to synchronize only when switching frames.
In all kinds of multiprocessing systems, there are many factors to consider
when it comes to data sharing, data exclusion, multibuffering, and more.
Details on these topics can be found in Rohlf and Helman’s paper [1079].
15.5.2 Parallel Processing
The most apparent disadvantage of using a multiprocessor pipeline tech-
nique is that the latency tends to increase. For some applications, such as
flight simulators, first person shooters, etc., this is not acceptable. When
moving the viewpoint, you usually want instant (next-frame) response, but
when the latency is long this will not happen. That said, it all depends:
If multiprocessing raised the frame rate from 30 fps with 1 frame of latency
to 60 fps with 2 frames of latency, the extra frame delay would cause no
additional latency: one frame at 30 fps lasts 1/30 ≈ 33 ms, the same as two
frames at 60 fps (2/60 ≈ 33 ms).
If multiple processors are available, one can try to parallelize the code
instead, which may result in shorter latency. To do this, the program
must possess the characteristics of parallelism. This means that, for a
program with no or a small amount of parallelism, there is no gain in
parallelizing the program—a parallel version may even become slower due
to the overhead involved in extra synchronization, etc. However, many
programs and algorithms do have a large amount of parallelism and can
therefore benefit.
There are several different methods for parallelizing an algorithm. As-
sume that n processors are available. Using static assignment [212], the
total work package, such as the traversal of an acceleration structure, is
divided into n work packages. Each processor then takes care of a work
package, and all processors execute their work packages in parallel. When
all processors have completed their work packages, it may be necessary to
merge the results from the processors. For this to work, the workload must
be highly predictable. When that is not the case, dynamic assignment al-
gorithms that adapt to different workloads may be used [212]. These use
one or more work pools. When jobs are generated, they are put into the
work pools. CPUs can then fetch one or more jobs from the queue when
they have finished their current job. Care must be taken so that only one
CPU can fetch a particular job, and so that the overhead in maintaining
the queue does not damage performance. Larger jobs mean that the over-
head for maintaining the queue becomes less of a problem, but, on the
other hand, if the jobs are too big, then performance may degrade due to
imbalance in the system—i.e., one or more CPUs may starve.
As for the multiprocessor pipeline, the ideal speedup for a parallel pro-
gram running on n processors would be n times. This is called linear
speedup. Even though linear speedup rarely happens, actual results can
sometimes be very close to it.
In Figure 15.6 on page 717, both a multiprocessor pipeline and a parallel
processing system with three CPUs are shown. Temporarily assume that
these should do the same amount of work for each frame and that both
configurations achieve linear speedup. This means that the execution will
run three times faster in comparison to serial execution (i.e., on a single
CPU). Furthermore, we assume that the total amount of work per frame
takes 30 ms, which would mean that the maximum frame rate on a single
CPU would be 1/0.03 ≈ 33 frames per second.
The multiprocessor pipeline would (ideally) divide the work into three
equal-sized work packages and let each of the CPUs be responsible for one
work package. Each work package should then take 10 ms to complete.
If we follow the work flow through the pipeline, we will see that the first
CPU in the pipeline does work for 10 ms (i.e., one-third of the job) and
then sends it on to the next CPU. The first CPU then starts working on
the first part of the next frame. When a frame is finally finished, it has
taken 30 ms for it to complete, but since the work has been done in parallel
in the pipeline, one frame will be finished every 10 ms. So the latency is
30 ms per frame.