15.5. Multiprocessing 717
Figure 15.6. Two different ways of utilizing multiple processors. At the top we show
how three processors (CPUs) are used in a multiprocessor pipeline, and at the bottom
we show parallel execution on three CPUs. One of the differences between these two
implementations is that lower latency can be achieved if the configuration at the bottom
is used. On the other hand, it may be easier to use a multiprocessor pipeline. The ideal
speedup for both of these configurations is linear, i.e., using n CPUs would give a speedup
of n times.
between the processors to communicate results. Shared memory multipro-
cessors are just as they sound; all processors share a logical address space
of memory among themselves. Most popular multiprocessor systems use
shared memory, and most of these have a symmetric multiprocessing (SMP)
design. SMP means that all the processors are identical. A multicore PC
system is an example of a symmetric multiprocessing architecture.
Here, we will present two methods for utilizing multiple processors for
real-time graphics. The first method—multiprocessor pipelining, also called
temporal parallelism—will be covered in more detail than the second—
parallel processing, also called spatial parallelism. These two methods are
illustrated in Figure 15.6. Multiprocessor pipelining and parallel processing
can also be combined, but little research has been done in this area. For
this discussion, we assume that a multiprocessor computer is available, but
do not go into details about different types of multiprocessors.
15.5.1 Multiprocessor Pipelining
As we have seen, pipelining is a method for speeding up execution by di-
viding a job into certain pipeline stages that are executed in parallel. The
result from one pipeline stage is passed on to the next. The ideal speedup
is n times for n pipeline stages, and the slowest stage (the bottleneck)
determines the actual speedup. Up to this point, we have seen pipelin-
ing used to run the application, geometry processing, and rasterization in
parallel for a single-CPU system. This technique can also be used when
718 15. Pipeline Optimization
multiple processors are available on the host, and in these cases, it is called
multiprocess pipelining or software pipelining.
We assume there is only one graphics accelerator available to the system,
and that only one processor thread can access it. This leaves the applica-
tion stage to be pipelined. Therefore, the application stage is divided into
three stages [1079]: APP, CULL, and DRAW. This is very coarse-grained
pipelining, which means that each stage is relatively long. The APP stage
is the first stage in the pipeline and therefore controls the others. It is
in this stage that the application programmer can put in additional code
that does, for example, collision detection. This stage also updates the
viewpoint. The CULL stage can perform:
- Traversal and hierarchical view frustum culling on a scene graph (see Section 14.3).
- Level of detail selection (see Section 14.7).
- State sorting—geometry with similar state is sorted into bins in order to minimize state changes. For example, all transparent objects are sorted into one bin, which minimizes the need to change the blend state. These objects could also be sorted in a rough back-to-front order for correct rendering (see Section 5.7).
- Finally (and always performed), generation of a simple list of all objects that should be rendered.
Figure 15.7. Different configurations for a multiprocessor pipeline. The thick lines
represent synchronization between the stages, and the subscripts represent the frame
number. At the top, a single CPU pipeline is shown. In the middle and at the bottom
are shown two different pipeline subdivisions using two CPUs. The middle has one
pipeline stage for APP and CULL and one pipeline stage for DRAW. This is a suitable
subdivision if DRAW has much more work to do than the others. At the bottom, APP
has one pipeline stage and the other two have another. This is suitable if APP has much
more work than the others. Note that the bottom two configurations have more time
for the APP, CULL, and DRAW stages.
The DRAW stage takes the list from the CULL stage and issues all
graphics calls in this list. This means that it simply walks through the
list and feeds the geometry stage, which in turn feeds the rasterizer stage.
Figure 15.7 shows some examples of how this pipeline can be used.
If one processor is available, then all three stages are run on that CPU.
If two CPUs are available, then APP and CULL can be executed on one
CPU and DRAW on the other. Another configuration would be to execute
APP on one processor and CULL and DRAW on the other. Which is the
best depends on the workloads for the different stages. Finally, if the host
has three CPUs, then each stage can be executed on a separate CPU. This
possibility is shown in Figure 15.8.
The advantage of this technique is that the throughput, i.e., the render-
ing speed, increases. The downside is that, compared to parallel processing,
the latency is greater. Latency, or temporal delay, is the time it takes from
the polling of the user’s actions to the final image [1329]. This should not
be confused with frame rate, which is the number of frames displayed per
second. For example, say the user is using a head-mounted display. The
determination of the head’s position may take 50 milliseconds to reach the
CPU, then it takes 15 milliseconds to render the frame. The latency is then
65 milliseconds from initial input to display. Even though the frame rate
is 66.7 Hz (1/0.015 seconds), interactivity will feel sluggish because of the
delay in sending the position changes to the CPU. Ignoring any delay due
to user interaction (which is a constant under both systems), multiprocess-
ing has more latency than parallel processing because it uses a pipeline.
As is discussed in detail in the next section, parallel processing breaks up
the frame’s work into pieces that are run concurrently.
In comparison to using a single CPU on the host, multiprocessor pipelin-
ing gives a higher frame rate and the latency is about the same or a little
greater due to the cost of synchronization. The latency increases with
Figure 15.8. At the top, a three-stage pipeline is shown. In comparison to the config-
urations in Figure 15.7, this configuration has more time for each pipeline stage. The
bottom illustration shows a way to reduce the latency: The CULL and the DRAW are
overlapped with FIFO buffering in between.
the number of stages in the pipeline. For a well-balanced application the
speedup is n times for n CPUs.
One technique for reducing the latency is to update the viewpoint and
other latency-critical parameters at the end of the APP stage [1079]. This
reduces the latency by (approximately) one frame. Another way to reduce
latency is to execute CULL and DRAW overlapped. This means that the
result from CULL is sent over to DRAW as soon as anything is ready for
rendering. For this to work, there has to be some buffering, typically a
FIFO, between those stages. The stages are stalled on empty and full
conditions; i.e., when the buffer is full, then CULL has to stall, and when
the buffer is empty, DRAW has to stall. The disadvantage is that techniques
such as state sorting cannot be used to the same extent, since primitives
have to be rendered as soon as they have been processed by CULL. This
latency reduction technique is visualized in Figure 15.8.
The pipeline in this figure uses a maximum of three CPUs, and the
stages have certain tasks. However, this technique is in no way limited to
this configuration—rather, you can use any number of CPUs and divide
the work in any way you want. The key is to make a smart division of the
entire job to be done so that the pipeline tends to be balanced. For example,
multiprocessor pipelining has been used to speed up the occlusion culling
algorithm in Section 14.6 using a different pipeline stage division than the
one presented here [1403]. Multiprocessor pipelining has also been used by
Funkhouser [372] to speed up LOD management (see Section 14.7.3) and
portal culling (Section 14.4).
The multiprocessor pipelining technique requires a minimum of syn-
chronization in that it needs to synchronize only when switching frames.
In all kinds of multiprocessing systems, there are many factors to consider
when it comes to data sharing, data exclusion, multibuffering, and more.
Details on these topics can be found in Rohlf and Helman’s paper [1079].
15.5.2 Parallel Processing
The most apparent disadvantage of using a multiprocessor pipeline tech-
nique is that the latency tends to increase. For some applications, such as
flight simulators, first person shooters, etc., this is not acceptable. When
moving the viewpoint, you usually want instant (next-frame) response, but
when the latency is long this will not happen. That said, it all depends:
If multiprocessing raised the frame rate from 30 fps with 1 frame of latency
to 60 fps with 2 frames of latency, the extra frame delay would cause no
additional latency: one frame at 30 fps lasts 1/30 ≈ 33 ms, the same as two
frames at 60 fps (2/60 ≈ 33 ms).
If multiple processors are available, one can try to parallelize the code
instead, which may result in shorter latency. To do this, the program
must possess the characteristics of parallelism. This means that, for a
program with no or a small amount of parallelism, there is no gain in
parallelizing the program—a parallel version may even become slower due
to the overhead involved in extra synchronization, etc. However, many
programs and algorithms do have a large amount of parallelism and can
therefore benefit.
There are several different methods for parallelizing an algorithm. As-
sume that n processors are available. Using static assignment [212], the
total work package, such as the traversal of an acceleration structure, is
divided into n work packages. Each processor then takes care of a work
package, and all processors execute their work packages in parallel. When
all processors have completed their work packages, it may be necessary to
merge the results from the processors. For this to work, the workload must
be highly predictable. When that is not the case, dynamic assignment al-
gorithms that adapt to different workloads may be used [212]. These use
one or more work pools. When jobs are generated, they are put into the
work pools. CPUs can then fetch one or more jobs from the queue when
they have finished their current job. Care must be taken so that only one
CPU can fetch a particular job, and so that the overhead in maintaining
the queue does not damage performance. Larger jobs mean that the over-
head for maintaining the queue becomes less of a problem, but, on the
other hand, if the jobs are too big, then performance may degrade due to
imbalance in the system—i.e., one or more CPUs may starve.
As for the multiprocessor pipeline, the ideal speedup for a parallel pro-
gram running on n processors would be n times. This is called linear
speedup. Even though linear speedup rarely happens, actual results can
sometimes be very close to it.
In Figure 15.6 on page 717, both a multiprocessor pipeline and a parallel
processing system with three CPUs are shown. Temporarily assume that
these should do the same amount of work for each frame and that both
configurations achieve linear speedup. This means that the execution will
run three times faster in comparison to serial execution (i.e., on a single
CPU). Furthermore, we assume that the total amount of work per frame
takes 30 ms, which would mean that the maximum frame rate on a single
CPU would be 1/0.03 ≈ 33 frames per second.
The multiprocessor pipeline would (ideally) divide the work into three
equal-sized work packages and let each of the CPUs be responsible for one
work package. Each work package should then take 10 ms to complete.
If we follow the work flow through the pipeline, we will see that the first
CPU in the pipeline does work for 10 ms (i.e., one-third of the job) and
then sends it on to the next CPU. The first CPU then starts working on
the first part of the next frame. When a frame is finally finished, it has
taken 30 ms for it to complete, but since the work has been done in parallel
in the pipeline, one frame will be finished every 10 ms. So the latency is
30 ms per frame.