So far in this book, we’ve examined threading as a programming technique that allows us to simplify programming: we have used threading to achieve asynchronous behavior or to perform independent tasks. Although we discussed how threads are scheduled on machines with multiple processors, by and large the techniques that we’ve shown so far are not affected by a machine with multiple processors, nor do they exploit the number of processors on a machine to make the program run faster.
Multithreaded applications have a special bond with multiprocessor systems. The separation of threads provides a clear and simple separation for the multiprocessor machine. Since the operating system can place different threads on different processors, the application will run faster.
In this chapter, we’ll look at how to parallelize Java programs so that they will run faster on a machine with multiple CPUs. The processes that we’ll examine are beneficial not only to newly developed Java programs, but also to existing Java programs that have a CPU-intensive loop, allowing us to improve the performance of those programs on a multiprocessor system.
How does the Java threading system behave in a multiprocessor system? There are no conceptual differences between a program running on a machine with one processor and a machine with two or more processors; the threads behave exactly the same in either case. However, as we discussed in Chapter 6, the key difference between a multiprocessor and a single-processor system is that there may be one currently running thread for each CPU on the host platform. The impact of this is that when our Java program runs on a machine with multiple processors, the following assumptions become very important:
We can no longer assume that a currently running thread has the highest priority. A higher-priority thread may be running on a different processor.
We can no longer assume that a low-priority thread will not run. There may be enough processors to give it execution time.
We can no longer assume that threads of different priorities will not be running at the same time.
We can no longer assume that certain race conditions can be ignored because it is “unreasonable” for a particular case to occur. Race conditions in a multiprocessor system are real, whereas race conditions in a single-processor system are more dependent on the scheduling engine of the Java virtual machine.
The point to understand here is that these assumptions were never guaranteed in the first place. However, on a single-processor machine (especially under the green-thread model), violation of these assumptions was rare. On a multiprocessor system, these assumptions are violated quite often.
Without redesigning a program, the best area to parallelize—that is, the area in which to introduce multiple threads to increase the program’s performance—is where the application is CPU bound. After all, there is no reason to bring in more processors if the first processor cannot stay busy. In many of the cases where the process is CPU bound—that is, the process is using all of the computer processors’ cycles, while not using the disks or the network at full capacity—the speed of the application can increase with the addition of more processors. The process could be involved in a long mathematical calculation or, more likely, in large iterations of shorter mathematical calculations. Furthermore, these calculations probably involve a large control loop or even a large number of loops inside loops. These are the types of common algorithms that we will examine here. Consider the following calculation:
public class SinTable { private float lookupValues[] = null; public synchronized float[] getValues() { if (lookupValues == null) { lookupValues = new float [360 * 100]; for (int i = 0; i < (360*100); i++) { float sinValue = (float)Math.sin( (i % 360)*Math.PI/180.0); lookupValues[i] = sinValue * (float)i / 180.0f; } } return lookupValues; } }
We’ll use this code as the basis of our example for the rest of
this chapter. A single thread, and hence a single processor, will
execute the loop as specified in the code and store the results in
the lookupValues
array. Assuming that the
calculation of the sinValue
variable is
time-consuming, the whole loop may take a long time to execute. For
some cases, this is acceptable. However, on a twelve-processor
computer without any other application running, only one CPU will be
working while the other eleven would be sitting idle. Considering the
cost of a twelve-way machine, this is not acceptable.
Before we get started, let’s define some terminology.[12] The variable
sinValue
has a few special properties.
Obviously, it exists only during the duration of the loop. It is a
temporary variable used to aid the calculation of the lookup table.
It does not carry a value in one iteration of the loop that is used
in another iteration of the loop, and the value of the variable is
reassigned in the next iteration. We will define
sinValue
as a loop-private
variable, that is, a variable that is initialized,
calculated, and used entirely in a single iteration of the loop.
Examining further, we can state that the index variable
i
is also a loop-private variable: it is also
used completely in an iteration of the loop. It can be considered as
a special type of loop-private variable. Since it is never changed in
the iteration and is directly tied to the iteration index, we can
actually treat it as a constant during the iteration of a loop.
However, for now, simply considering it as a loop-private variable is
good enough.
We may try to break the parts of this loop among many threads as follows:
public class SinTable implements Runnable { private class SinTableRange { public int start, end; } private float lookupValues[]; private Thread lookupThreads[]; private int startLoop, endLoop, curLoop, numThreads; public SinTable() { lookupValues = new float [360 * 100]; lookupThreads = new Thread[12]; startLoop = curLoop = 0; endLoop = (360 * 100); numThreads = 12; } private synchronized SinTableRange loopGetRange() { if (curLoop >= endLoop) return null; SinTableRange ret = new SinTableRange(); ret.start = curLoop; curLoop += (endLoop-startLoop)/numThreads+1; ret.end = (curLoop<endLoop)?curLoop:endLoop; return ret; } private void loopDoRange(int start, int end) { for (int i = start; i < end; i += 1) { float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0); lookupValues[i] = sinValue * (float)i / 180.0f; } } public void run() { SinTableRange str; while ((str = loopGetRange()) != null) { loopDoRange(str.start, str.end); } } public float[] getValues() { for (int i = 0; i < numThreads; i++) { lookupThreads[i] = new Thread(this); lookupThreads[i].start(); } for (int i = 0; i < numThreads; i++) { try { lookupThreads[i].join(); } catch (InterruptedException iex) {} } return lookupValues; } }
The code in this new version is functionally the same as the previous
version, albeit with many modifications to its logic. First, instead
of a loop that does the calculation, we now have a loop that starts
off 12 (numThreads
) different worker threads and
provides each worker thread with different parts of the mathematical
loop to calculate. The original mathematical calculation is moved to
a new method, loopDoRange()
. In this method, the
loop has been modified to work on only part of the lookup table
instead of the whole table. Each different thread is responsible for
calculating only its portion of the table. Each thread must call the
loopGetRange()
method to determine which portion
they must calculate. The original thread that started the 12 worker
threads then simply waits for all 12 worker threads to finish. Since
the long calculation is now accomplished by 12 threads instead of by
a single thread, it is now possible for a multiprocessor-based
operating system to place the different threads on different
processors.
The calculation works for a number of reasons. First, the loop index
variable i
and the sinValue
variable, which were originally classified as loop private, are now
stack variables in each worker thread. The
loopDoRange()
method uses different copies of
these two variables in each thread executing the loop. This means
that each of the 12 worker threads has its own copy of these
variables while completing its portion of the calculation.
Second, although the lookupTable
array is not
loop private, the individual members of the array can be considered
loop private. Each individual member of the array is only accessed in
a particular iteration. There is no race condition because each
iteration affects one and only one member of the array, and although
the different worker threads handle many iterations of the loop, no
single iteration is handled by more than one thread.
The only synchronization we need is in the assignment of the
different ranges. To prevent the worker threads from stepping on each
other during this assignment, the loopGetRange()
method is synchronized. In this example, since the loop is
partitioned only in 12 ranges, the execution time for this method is
insignificant when compared with the loop calculation itself.
The code for this new version is more complicated than our first version. This new code now has to start and track 12 separate threads. The worker threads had to be modified to handle parts of the loop whose ranges they have to determine. Although there is very little synchronization in this case, we could easily have had a complicated requirement for synchronization depending on the algorithm used in the mathematical calculation.
Given the complexity we introduced to handle this simple loop, it may become too hard to handle more complex loops. To help with this complexity, we’ll move all the logic related to loop management into a separate class. We can then implement the loop by simply using the services provided by this class:
public class LoopHandler implements Runnable { protected class LoopRange { public int start, end; } protected Thread lookupThreads[]; protected int startLoop, endLoop, curLoop, numThreads; public LoopHandler(int start, int end, int threads) { startLoop = curLoop = start; endLoop = end; numThreads = threads; lookupThreads = new Thread[numThreads]; } protected synchronized LoopRange loopGetRange() { if (curLoop >= endLoop) return null; LoopRange ret = new LoopRange(); ret.start = curLoop; curLoop += (endLoop-startLoop)/numThreads+1; ret.end = (curLoop<endLoop) ? curLoop : endLoop; return ret; } public void loopDoRange(int start, int end) { } public void loopProcess() { for (int i = 0; i < numThreads; i++) { lookupThreads[i] = new Thread(this); lookupThreads[i].start(); } for (int i = 0; i < numThreads; i++) { try { lookupThreads[i].join(); } catch (InterruptedException iex) {} } } public void run() { LoopRange str; while ((str = loopGetRange()) != null) { loopDoRange(str.start, str.end); } } }
In
our new LoopHandler class, we have implemented the logic that we
applied on our SinTable class. The logic of creating, tracking, and
joining back with the original thread has been moved to the newly
created loopProcess()
method. The logic of
determining the ranges and processing the loop—originally coded
in the run()
and
loopGetRange()
methods of the SinTable
class—remains nearly unchanged. The loop handler has also been
modified to handle more generic loops and has a constructor that will
assign the start of the loop, the end of the loop, and the number of
threads. Just as in our earlier example, the algorithm will call the
loopDoRange()
method to handle the processing.
However, in this case, the LoopHandler class has an empty
implementation for this method.
Now our implementation of the SinTable class is much simpler:
public class SinTable extends LoopHandler { private float lookupValues[]; public SinTable() { super(0, 360*100, 12); lookupValues = new float [360 * 100]; } public void loopDoRange(int start, int end) { for (int i = start; i < end; i++) { float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0); lookupValues[i] = sinValue * (float)i / 180.0f; } } public float[] getValues() { loopProcess(); return lookupValues; } }
In this case, we simply need to configure the ranges needed by the
loop handler, provide the logic of the loop in the
loopDoRange()
method, and call the
loopProcess()
method to process the loop in a
multithreaded fashion. While this is still more complicated than the
first SinTable class implementation, it is now much more manageable
and less complex than the previous implementation.
We
define the
process of distributing the
iterations of the loop to the individual threads as loop
scheduling. In our LoopHandler class, this is handled by
the loop-GetRange()
method. To maximize
processor usage, we should distribute the work to the threads as
evenly as possible, with the least amount of overhead in determining
this distribution. This is defined as load
balancing.
Here are the basic loop-scheduling types at our disposal:
Under static scheduling, each thread is assigned an equal number of iterations that depends on the number of threads available. If there are 1000 iterations in the loop that are to be distributed and 10 threads that are assigned to the task, then each thread will be assigned 100 iterations of the loop. This is the algorithm that is used by the LoopHandler class. The algorithm also adds 1 to the size to make sure that the distribution is rounded up. Otherwise, there might be an iteration left over and a worker thread would have to perform that single iteration after already performing the original chunk.
The problem with this algorithm is that it assumes that each iteration of the loop takes the same amount of time. If this is not true, then one of the threads will take more time than the other threads to complete. Since all the work is divided up at the beginning of the loop, the other threads will be idle while the final iterations are completed by the last remaining thread.
In self-scheduling, each worker thread grabs a small chunk of the iterations to execute. After completion of its assigned range, it grabs another small chunk. If there are 1000 iterations in the loop that are to be distributed and 10 threads are assigned to the task, then each worker thread will work on a small chunk—say 20—until all 1000 iterations are completed.
As with static scheduling, the different worker threads may not complete at the same time. However, since the chunks are small in the self-scheduling model, the idle time of the threads at the end of the process is also small. To make this idle time even smaller, we can make the individual chunks smaller. However, there is an overhead in obtaining the ranges to execute; this overhead will increase as the chunks get smaller.
Here’s an implementation of this model:
public class SelfLoopHandler extends LoopHandler { protected int groupSize; public SelfLoopHandler(int start, int end, int size, int threads) { super(start, end, threads); groupSize = size; } protected synchronized LoopRange loopGetRange() { if (curLoop >= endLoop) return null; LoopRange ret = new LoopRange(); ret.start = curLoop; curLoop += groupSize; ret.end = (curLoop<endLoop)?curLoop:endLoop; return ret; } }
Implementation of a self-scheduling loop handler is straightforward.
Our current LoopHandler class already has the logic of working until
the loop has been completed. We simply need to modify the constructor
to handle the chunk size requested, and modify the
loopGetRange()
method to return this fixed chunk
size. In our implementation of the self-scheduler, we simply subclass
from the original loop handler and implement only the changes.
Guided self-scheduling is a compromise between the static scheduler and the self-scheduler. In the beginning, the guided scheduler grabs a large number of iterations of the loop, which becomes progressively smaller near the end of the loop. There is also a minimum chunk size that the guided self-scheduler uses. Thus, it basically behaves like a static scheduler that slowly becomes a self-scheduler.
If 1000 iterations in the loop are to be distributed and 10 threads are assigned to the task, then the first worker thread gets one-tenth of the work—100 iterations. The second thread gets one-tenth of the remaining work—90 iterations. This slowly gets smaller and smaller until the minimum—say 10—is assigned; the minimum is assigned until all 1000 iterations are completed.
This algorithm seems to have the fewest problems. Unlike the self-scheduler, the extra overhead only appears at the end of the loop. And unless the individual iterations have drastically different execution periods from the longer-term iterations at the beginning, it doesn’t have the problems that the static scheduler has.
Here’s how to implement guided self-scheduling:
public class GuidedLoopHandler extends LoopHandler { protected int minSize; public GuidedLoopHandler(int start, int end, int min, int threads){ super(start, end, threads); minSize = min; } protected synchronized LoopRange loopGetRange() { if (curLoop >= endLoop) return null; LoopRange ret = new LoopRange(); ret.start = curLoop; int sizeLoop = (endLoop-curLoop)/numThreads; curLoop += (sizeLoop>minSize)?sizeLoop:minSize; ret.end = (curLoop<endLoop)?curLoop:endLoop; return ret; } }
Implementation of a guided self-scheduling loop handler is also
straightforward. We simply need to modify the constructor to handle
the minimum size required, and modify the
loopGetRange()
method to return a portion of the
remaining loop. In our implementation of the guided self-scheduler,
we also subclass the original loop handler and implement only the
changes.
The implementation of the self-scheduler
and the guided self-scheduler is simple for a reason: it was designed
to be so. The original loop handler was designed to be subclassed so
that the scheduler algorithm could be modified. As good as the
implementation of the guided self-scheduler may be, it is still
designed for a generic loop. There will be cases where each of the
different schedulers will work better than others. However, if enough
information concerning the loop is known, and the effort is large
enough, it may justify the implementation of yet another scheduler.
This entails figuring out the appropriate logic and coding a new
loopGetRange()
method.
Here’s how our original example can be modified to use one of the scheduling techniques we’ve just seen:
public class SinTable extends GuidedLoopHandler { private float lookupValues[]; public SinTable() { super(0, 360*100, 100, 12); lookupValues = new float [360 * 100]; } public void loopDoRange(int start, int end) { for (int i = start; i < end; i++) { float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0); lookupValues[i] = sinValue * (float)i / 180.0f; } } public float[] getValues() { loopProcess(); return lookupValues; } }
To use the guided self-scheduler algorithm in our SinTable class, we simply subclass from the GuidedLoopHandler class and modify our constructor to pass the minimum chunk size. We could also have written the GuidedLoopHandler class to have an overloaded constructor that picks a default minimum. This would allow it to have a constructor with the same signature as the static loop handler.
In the implementation of the SinTable class, we classified the variables used in the original nonthreaded loop as loop-private variables, but other variable classifications exist. The reason for classifying variables at all is that different types of variables require different types of handling within and between threads, because many loops have a data dependency that occurs between different iterations. By classifying the variables, we are able to correctly update and modify them without any race conditions. Different types of variable classifications can be determined by their usage, and these classifications will determine how they are to be implemented or treated in the multithreaded loop handler.
As
mentioned, a loop-private variable is a variable that does not pass
its value from one iteration of the loop to another iteration of the
loop. It can actually be a variable that is declared in the loop
itself, and it can also be an instance or publicly accessed variable
that is accessed by only one iteration of the loop. This was the case
with the lookupValues
array variable, where each
member of the array was only accessed by one iteration of the loop.
Although the whole array was not loop private to any iteration,
specific members were loop private to specific iterations.
As shown with the SinTable class, treatment of loop-private variables
is often done with a local copy of the variable in each thread. Since
each thread has a copy, no interference between the threads is
possible. In the case of the lookupValues
array,
there is an understanding that the threads will respect the privacy
of the other threads by only accessing their loop-private portions of
the array.
Read-only variables are variables that do not change in value during the duration of the loop. They can be true constants or simply variables that have been initialized and will not change until after the loop has been processed.
No special treatment of read-only variables is necessary. The worker threads do not need to have their own copies of the variables, and access to them does not require synchronization of any type.
Storeback variables are basically
loop-private variables that are needed after the loop has been
completed. For example, say that the processing of the
lookupValues
array required some extra work to
be done after the loop was finished:
public float[] getValues() {
if (lookupValues == null) {
float sinValue = 0;
lookupValues = new float [360 * 100];
for (int i = 0; i < (360*100); i++) {
sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
lookupValues[i] = sinValue * (float)i / 180.0f;
}
lookupValues[0] += sinValue;
}
return lookupValues;
}
In this slightly modified version of the SinTable loop, both the
sinValue
variable and the individual members of
the lookupValues
array are still loop-private
variables. There is no data dependency between these two variables in
different iterations of the loop. However, in this case the
sinValue
variable is also a storeback variable.
Since the variable is important after the loop has completed, it must
be set to the value as if the loop had run in the correct order. The
members of the lookupValues
array were always
considered as storeback variables, but since no individual copies
were kept, there was little need to make this extra distinction.
Here’s how we can handle the storeback variable:
public class SinTable extends GuidedLoopHandler { private float lookupValues[]; private float sinValue; public SinTable() { super(0, 360*100, 100, 12); lookupValues = new float [360 * 100]; } public void loopDoRange(int start, int end) { float sinValue = 0; for (int i = start; i < end; i++) { sinValue = (float)Math.sin((i % 360)*Math.PI/180.0); lookupValues[i] = sinValue * (float)i / 180.0f; } if (end == endLoop) this.sinValue = sinValue; } public float[] getValues() { loopProcess(); lookupValues[0] += sinValue; return lookupValues; } }
The sinValue
variable is still treated as a
loop-private variable. However, since this variable is really a
storeback variable, we need to store the “last” value of
this variable. Since the algorithm is now executed in a multithreaded
manner, the last iteration is not necessarily the last value assigned
to the variable by a thread.
A thread must check that it has executed the last chunk of the loop before copying the value of its loop-private copy to the global copy. Also note that no synchronization is necessary. Since only the last iteration will be copied, only one thread will be executing the code, and no race condition is possible.
Obviously, it is not possible to make every variable a loop-private variable, since there are cases where there are real data dependencies between different iterations of the loop. Because of these data dependencies, different threads executing different iterations might interfere with each other during execution. We will define these types of variables as shared variables , since they are shared between iterations of the loop.
Shared variables have many problems. The first is the race conditions that exist when different threads access the variable simultaneously. The second is that the value of a variable may depend on the order in which it is processed. In the first case, we can simply use synchronization techniques to prevent the race conditions from existing. The second case poses a much greater problem.
However, what if the order did not matter? We would be able to process the loop in any order and would simply have to synchronize access to the shared variable. For example, let us assume that we also need to calculate the sum of our SinTable:
public float[] getValues() {
for (int i = 0; i < (360*100); i++) {
sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
lookupValues[i] = sinValue * (float)i / 180.0f;
sumValue += lookupValues[i];
}
return lookupValues;
}
In this case, the sumValue
variable is clearly
not a loop-private variable. The value of
sumValue
is passed from one iteration to
another, and the correct result requires this dependency to exist.
However, the sumValue
variable is only useful
after the loop has completed. The iterations simply add to the
running total—subtotals or other order-based requirements are
not necessary. Furthermore, addition itself is order independent: it
is possible to add a bunch of numbers in any order, and the final
result will be the same.
The sumValue
variable is a reduction variable.
It must still be shared among the threads, but since order does not
matter, this sharing only requires synchronization to prevent race
conditions:
public class SinTable extends GuidedLoopHandler { private float lookupValues[]; public float sumValue; public SinTable() { super(0, 360*100, 100, 12); lookupValues = new float [360 * 100]; } public void loopDoRange(int start, int end) { float sinValue = 0; for (int i = start; i < end; i++) { sinValue = (float)Math.sin((i % 360)*Math.PI/180.0); lookupValues[i] = sinValue * (float)i / 180.0f; synchronized (this) { sumValue += lookupValues[i]; } } } public float[] getValues() { loopProcess(); return lookupValues; } }
Race conditions in this example are prevented by using the synchronization lock of the SinTable instance. If we have many reduction variables that are not dependent on each other and we cannot store them all at the same time, it might be a better idea to have separate synchronization locks—or BusyFlags—for each reduction variable.
Furthermore, we are synchronizing with each iteration of the loop. This is not very efficient. It is better to assign the value to loop-private variables and only synchronize the final summed value of the range to the reduction variable. By doing this, we are removing most of the need for synchronization, which can drastically add to the parallelization of the threads:
public class SinTable extends GuidedLoopHandler { private float lookupValues[]; public float sumValue; public SinTable() { super(0, 360*100, 100, 12); lookupValues = new float [360 * 100]; } public void loopDoRange(int start, int end) { float sinValue = 0.0f; float sumValue = 0.0f; for (int i = start; i < end; i++) { sinValue = (float)Math.sin((i % 360)*Math.PI/180.0); lookupValues[i] = sinValue * (float)i / 180.0f; sumValue += lookupValues[i]; } synchronized (this) { this.sumValue += sumValue; } } public float[] getValues() { loopProcess(); System.out.println(sumValue); return lookupValues; } }
In this new example, we are doing a two-stage reduction of the
values. We are reducing the value of each iteration to the local copy
of the sumValue
variable, and then we are
reducing this local copy to the actual reduction variable. Since the
local copy of the sumValue
variable is loop
private, synchronization is not necessary. Synchronization is still
necessary when adding to the reduction variable. However, this is now
done once per range instead of once per iteration.
Finally, all reduction variables are storeback variables. There is no need to have special storeback handling logic for reduction variables.
Originally, all variables in the loop are shared variables, since all variables can be accessed by all the threads that are executing the loop. As we parallelize the loop, we can quickly classify the shared variables that are also read-only variables. We can also reclassify those variables that are loop-private variables. Of the remaining shared variables, it may be possible either to convert them to loop-private variables or to classify them as reduction variables.
Unfortunately, there will be cases where a shared variable cannot be classified as anything but a shared variable. This is where our technique fails to work. As much as we would like to convert any loop to run in a multithreaded environment, not all algorithms can be redesigned to run in a parallel environment.
The other problem with shared variables is the side effect. For
example, if we needed to save each of the subtotals of the
sumValue
variable, it could not be treated as a
reduction variable since the changes in the variable are also
important. If we had to print the subtotals during the loop, not only
would the intermediate results be out of order, but the intermediate
results would be different.
When variable classification is not enough for parallelization, we have other techniques that can help. They may not solve every case, but with experience, more and more loops can be converted to run in a multithreaded environment.
To assist our parallelizing techniques, we can analyze the algorithms of the loop itself instead of just analyzing the variables in the loop. In the majority of the cases, there is very little that we can do without redesigning the algorithm, but there are a few situations where we can quickly modify the code without a complete redesign. By implementing simple transformations on the original code, we may be able to use the techniques discussed so far to thread the loop.
In many cases, only a small portion of a large complex loop contains code that must be executed sequentially. It may be possible to separate the large complex loop into two separate loops. Once the complex loop is separated into two loops—one loop containing the code that can be parallelized, the other containing the sequential code—we can then parallelize a portion of the original loop. We may even be able to run the sequential loop in parallel with the loop that can be threaded.
Returning to our SinTable example, let’s assume that we need not only a total but also a running subtotal of the table that is to be generated:
public float[] getValues() { for (int i = 0; i < (360*100); i++) { sinValue = (float)Math.sin((i % 360)*Math.PI/180.0); lookupValues[i] = sinValue * (float)i / 180.0f; if (i == 0) { sumValues[0] = lookupValues[0]; } else { sumValues[i] = lookupValues[i] + lookupValues[i-1]; } } return lookupValues; }
The sumValues
array variable is definitely a
shared variable. The members of the sumValues
variable are also shared in that some of them are accessed by two
different threads. Furthermore, the order matters. It is not possible
for one thread to start a chunk before the thread that is working on
the previous chunk is finished.
We can solve that problem like this:
public class SinTable extends GuidedLoopHandler { private float lookupValues[]; public float sumValues[]; public SinTable() { super(0, 360*100, 100, 12); lookupValues = new float [360 * 100]; sumValues = new float [360 * 100]; } public void loopDoRange(int start, int end) { float sinValue = 0.0f; for (int i = start; i < end; i++) { sinValue = (float)Math.sin((i % 360)*Math.PI/180.0); lookupValues[i] = sinValue * (float)i / 180.0f; } } public float[] getValues() { loopProcess(); sumValues[0] = lookupValues[0]; for (int i = 1; i < (360*100); i++) { sumValues[i] = lookupValues[i] + lookupValues[i-1]; } return lookupValues; } }
While it is not possible to parallelize the running subtotal without drastically changing the algorithm, we can quickly convert the loop into two separate loops. The first loop contains the threadable code, and the second processes the subtotal. Once this is accomplished, we can then thread the first loop without changing the second. In the new SinTable class, we have moved the running subtotal code to a separate loop. This separate loop runs on a single thread, and only after the first loop is processed.
Some comparisons should be taken when using this technique. Since a large portion of the loop may be running single threaded, the performance gain may not justify the effort involved. In most cases, calculations of the subtotal are small considering the effort of the main calculation, and the performance penalty may be small in comparison.
Many applications do not contain a single large loop. Even if a particular loop is determined to be unparallelizable, there may be other loops in the application. Even if these other loops also cannot be parallelized, we may be able to run each separate loop in a different thread.
Although the many loops may be very complex, with large data dependencies between iterations, there may be few data dependencies between the different loops. It may be possible to isolate the individual loops themselves and run them each in a separate thread. With this technique, load balancing is no longer possible. After all, if the application contains four major loops and you were able to isolate them all, it is still impossible to distribute these four loops among twelve processors.
Multilayered loops are a prime cause of CPU-bound applications that run for a large period of time. This could be loops that are directly inside of other loops or, more likely, loops that call methods that contain loops. This scenario is so common that we will examine inner-loop threading later in this chapter. For now, there is a simple case to look for:
public float[][] getValues() { for (int i = 0; i < 360; i++) { lookupValues[0][i] = 0; } for (int j = 1; j < 1000; j++) { for (int i = 0; i < 360; i++) { float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0); lookupValues[j][i] = sinValue * (float)i / 180.0f; lookupValues[j][i] += lookupValues[j-1][i]*(float)j/180.0f; } } return lookupValues; }
For multilayered loops, it is generally more profitable to thread the outer loop instead of the inner one. It is not necessary to thread both the inner and outer loop because threading either one should use all the processors. If the outer loop is threaded, threading the inner loop will not provide any further speedup since there are no more processors to run the extra threads (and vice versa). The reason we prefer to thread the outer loop is that there is an overhead in creating, destroying, and synchronizing among the many threads. By threading the outer loop, we create and destroy the threads once and only synchronize at a coarse level—less synchronization should be necessary.
In this new version of the table calculation, we are now working on a two-dimensional table. There are three loops used during this calculation. However, the first loop is merely setting the first row of values to zero. The next two loops are actually a pair of multilayered loops. The algorithm is looping the processing from row to row, executing the inner loop that is processing the values to be stored in the different columns.
The problem in this case is that there is a data dependency between
the rows themselves. Because the calculation at any row is dependent
on the calculation of the previous row, the members of any column in
the lookupValues
array cannot be considered loop
private—or made loop private. The inner loop can be
parallelized with no problems since there are no data dependencies
between the iterations. The only requirement is that the inner loop
must assume that the outer loop ran in the correct order; this
requirement is fine since we are not threading the outer loop.
However, we can also rewrite our original code as follows:
public float[][] getValues() { for (int i = 0; i < 360; i++) { lookupValues[0][i] = 0; } for (int i = 0; i < 360; i++) { for (int j = 1; j < 1000; j++) { float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0); lookupValues[j][i] = sinValue * (float)i / 180.0f; lookupValues[j][i] += lookupValues[j-1][i]*(float)j/180.0f; } } return lookupValues; }
In this example, the loops are interchanged. Instead of working from
row to row, we can work from column to column. The inner loop can
then process the data from row to row. By interchanging the loops,
the inner loop is now no longer threadable, since there is data
dependency between the members of the columns in the
lookupValues
array. However, the outer loop is
now threadable. Once the outer loop has been threaded, there will no
longer be a reason to thread the inner loop. Since it is more
profitable to thread an outer loop than an inner loop, this simple
change prior to multithreading gives us a better return on our
development time investment.
Unfortunately, although loops within loops are common, this example may not be. There is generally setup code for an inner loop, and there may be multiple loops that are run sequentially within the outer loop, or the inner loop may be inside another method that is called from the outer loop. And the data dependencies may be such that a loop interchange will not solve the problem.
Having an inner loop that is threadable in an outer loop that is not threadable is common. We will be examining inner-loop threading in more detail later in this chapter.
As you may have noticed, the loop
handler that we have developed is fairly restrictive. It only applies
to for
loops, the range of the loop must be
known prior to execution, it only works with integers as its index,
and it has an interval of only 1 between iterations. While some of
these restrictions are caused by the fact that we have not
implemented support for certain features in the loop handler, the
main cause is that it is difficult, if not impossible, to implement
an algorithm that can handle all generic loops.
If all else fails during loop transformation, programming experience
is still very useful. A while
loop or a
do
loop may be converted to a
for
loop. The start and end iterations may be
calculated prior to loop execution. Code may be moved from or into a
loop, or between loops, to allow other loop transformations to occur.
Code changes can also cause variable classifications to change. A
shared variable may be reclassified as loop private or as a reduction
variable because of how it is used in a loop.
Unfortunately, success is never guaranteed. The goal is to balance the effort of development with the acceleration that may be gained. It may take days to implement a change that can only achieve another one or two percent acceleration. After all, if unlimited effort were allowed, we would redesign the whole application from scratch.
[12] The terminology that we will be using in this section is somewhat based on the autothreading MP C compiler available for the Solaris operating system.