program with little or no parallelism, there is no gain in parallelizing
it; a parallel version may even become slower due to the overhead
involved in extra synchronization. However, many programs and
algorithms do have a large amount of parallelism and can therefore
benefit.
There are several different methods for parallelizing an algorithm. As-
sume that n processors are available. Using static assignment [212], the
total work package, such as the traversal of an acceleration structure, is
divided into n work packages. Each processor then takes care of a work
package, and all processors execute their work packages in parallel. When
all processors have completed their work packages, it may be necessary to
merge the results from the processors. For this to work, the workload must
be highly predictable. When that is not the case, dynamic assignment al-
gorithms that adapt to different workloads may be used [212]. These use
one or more work pools. When jobs are generated, they are put into the
work pools. CPUs can then fetch one or more jobs from the queue when
they have finished their current job. Care must be taken so that only one
CPU can fetch a particular job, and so that the overhead in maintaining
the queue does not hurt performance. Larger jobs mean that the overhead
of maintaining the queue becomes less of a problem, but if the jobs are
too big, performance may degrade due to imbalance in the system, i.e.,
one or more CPUs may starve.
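As an illustration of dynamic assignment, the following sketch (in C++) shows one possible work pool, where a set of worker threads repeatedly fetch jobs from a shared queue protected by a mutex. The class and function names (WorkPool, Job, RunWorkers) are purely illustrative and are not taken from any particular system.

    // Minimal sketch of a dynamic-assignment work pool.
    #include <cstdio>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    using Job = std::function<void()>;

    class WorkPool {
    public:
        // Put a job into the work pool.
        void Add(Job job) {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_jobs.push(std::move(job));
        }
        // Start numThreads workers that fetch jobs until the pool is empty,
        // then wait for all of them to finish.
        void RunWorkers(unsigned numThreads) {
            std::vector<std::thread> workers;
            for (unsigned i = 0; i < numThreads; ++i)
                workers.emplace_back([this] { WorkerLoop(); });
            for (auto& w : workers)
                w.join();
        }
    private:
        void WorkerLoop() {
            for (;;) {
                Job job;
                {
                    // The mutex guarantees that only one thread
                    // can fetch a particular job.
                    std::lock_guard<std::mutex> lock(m_mutex);
                    if (m_jobs.empty())
                        return;              // no work left
                    job = std::move(m_jobs.front());
                    m_jobs.pop();
                }
                job();                       // execute outside the lock
            }
        }
        std::mutex m_mutex;
        std::queue<Job> m_jobs;
    };

    int main() {
        WorkPool pool;
        for (int i = 0; i < 8; ++i)
            pool.Add([i] { std::printf("processing job %d\n", i); });
        pool.RunWorkers(3);   // e.g., three CPUs fetching jobs dynamically
    }
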
As for the multiprocessor pipeline, the ideal speedup for a parallel pro-
gram running on n processors would be n times. This is called linear
speedup. Even though linear speedup rarely happens, actual results can
sometimes be very close to it.
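Expressed as a formula, the speedup on n processors is the ratio T(1)/T(n), where T(1) is the execution time on a single processor and T(n) is the execution time on n processors; linear speedup is the case T(1)/T(n) = n.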
In Figure 15.6 on page 717, both a multiprocessor pipeline and a parallel
processing system with three CPUs are shown. Assume for the moment that
both do the same amount of work for each frame and that both
configurations achieve linear speedup. This means that the execution will
run three times faster in comparison to serial execution (i.e., on a single
CPU). Furthermore, we assume that the total amount of work per frame
takes 30 ms, which would mean that the maximum frame rate on a single
CPU would be 1/0.03 ≈ 33 frames per second.
The multiprocessor pipeline would (ideally) divide the work into three
equal-sized work packages and let each of the CPUs be responsible for one
work package. Each work package should then take 10 ms to complete.
If we follow the work flow through the pipeline, we will see that the first
CPU in the pipeline does work for 10 ms (i.e., one-third of the job) and
then sends it on to the next CPU. The first CPU then starts working on
the first part of the next frame. When a frame is finally finished, it has
taken 30 ms to complete, but since the work has been done in parallel
in the pipeline, one frame will be finished every 10 ms. So the latency is