Chapter 9. Parallelizing for Multiprocessor Machines

So far in this book, we’ve examined threading as a programming technique that allows us to simplify programming: we have used threading to achieve asynchronous behavior or to perform independent tasks. Although we discussed how threads are scheduled on machines with multiple processors, by and large the techniques that we’ve shown so far are not affected by a machine with multiple processors, nor do they exploit the number of processors on a machine to make the program run faster.

Multithreaded applications have a special bond with multiprocessor systems. The separation of threads provides a clear and simple separation for the multiprocessor machine. Since the operating system can place different threads on different processors, the application will run faster.

In this chapter, we’ll look at how to parallelize Java programs so that they will run faster on a machine with multiple CPUs. The processes that we’ll examine are beneficial not only to newly developed Java programs, but also to existing Java programs that have a CPU-intensive loop, allowing us to improve the performance of those programs on a multiprocessor system.

How does the Java threading system behave in a multiprocessor system? There are no conceptual differences between a program running on a machine with one processor and a machine with two or more processors; the threads behave exactly the same in either case. However, as we discussed in Chapter 6, the key difference between a multiprocessor and a single-processor system is that there may be one currently running thread for each CPU on the host platform. The impact of this is that when our Java program runs on a machine with multiple processors, the following assumptions become very important:

  • We can no longer assume that a currently running thread has the highest priority. A higher-priority thread may be running on a different processor.

  • We can no longer assume that a low-priority thread will not run. There may be enough processors to give it execution time.

  • We can no longer assume that threads of different priorities will not be running at the same time.

  • We can no longer assume that certain race conditions can be ignored because it is “unreasonable” for a particular case to occur. Race conditions in a multiprocessor system are real, whereas race conditions in a single-processor system are more dependent on the scheduling engine of the Java virtual machine.

The point to understand here is that these assumptions were never guaranteed in the first place. However, on a single-processor machine (especially under the green-thread model), violation of these assumptions was rare. On a multiprocessor system, these assumptions are violated quite often.

Parallelizing a Single-Threaded Program

Without redesigning a program, the best area to parallelize—that is, the area in which to introduce multiple threads to increase the program’s performance—is where the application is CPU bound. After all, there is no reason to bring in more processors if the first processor cannot stay busy. In many of the cases where the process is CPU bound—that is, the process is using all of the computer processors’ cycles, while not using the disks or the network at full capacity—the speed of the application can increase with the addition of more processors. The process could be involved in a long mathematical calculation or, more likely, in large iterations of shorter mathematical calculations. Furthermore, these calculations probably involve a large control loop or even a large number of loops inside loops. These are the types of common algorithms that we will examine here. Consider the following calculation:

public class SinTable {
    private float lookupValues[] = null;

    public synchronized float[] getValues() {
        if (lookupValues == null) {
            lookupValues = new float [360 * 100];
            for (int i = 0; i < (360*100); i++) {
                float sinValue = (float)Math.sin(
                                    (i % 360)*Math.PI/180.0);
                lookupValues[i] = sinValue * (float)i / 180.0f;
            }    
        }    
        return lookupValues;
    }
}

We’ll use this code as the basis of our example for the rest of this chapter. A single thread, and hence a single processor, will execute the loop as specified in the code and store the results in the lookupValues array. Assuming that the calculation of the sinValue variable is time-consuming, the whole loop may take a long time to execute. For some cases, this is acceptable. However, on a twelve-processor computer without any other application running, only one CPU will be working while the other eleven would be sitting idle. Considering the cost of a twelve-way machine, this is not acceptable.

Before we get started, let’s define some terminology.[12] The variable sinValue has a few special properties. Obviously, it exists only during the duration of the loop. It is a temporary variable used to aid the calculation of the lookup table. It does not carry a value in one iteration of the loop that is used in another iteration of the loop, and the value of the variable is reassigned in the next iteration. We will define sinValue as a loop-private variable, that is, a variable that is initialized, calculated, and used entirely in a single iteration of the loop.

Examining further, we can state that the index variable i is also a loop-private variable: it is also used completely in an iteration of the loop. It can be considered as a special type of loop-private variable. Since it is never changed in the iteration and is directly tied to the iteration index, we can actually treat it as a constant during the iteration of a loop. However, for now, simply considering it as a loop-private variable is good enough.

We may try to break the parts of this loop among many threads as follows:

public class SinTable implements Runnable {
    private class SinTableRange {
        public int start, end;
    }

    private float lookupValues[];
    private Thread lookupThreads[];
    private int startLoop, endLoop, curLoop, numThreads;
 
    public SinTable() {
        lookupValues = new float [360 * 100];
        lookupThreads = new Thread[12];
        startLoop = curLoop = 0;
        endLoop = (360 * 100);
        numThreads = 12;
    }
 
    private synchronized SinTableRange loopGetRange() {
        if (curLoop >= endLoop)
            return null;
        SinTableRange ret = new SinTableRange();
        ret.start = curLoop;
        curLoop += (endLoop-startLoop)/numThreads+1;
        ret.end = (curLoop<endLoop)?curLoop:endLoop;
        return ret;
    }
 
    private void loopDoRange(int start, int end) {
        for (int i = start; i < end; i += 1) {
                        float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
                        lookupValues[i] = sinValue * (float)i / 180.0f;
                    }
    }
 
    public void run() {
        SinTableRange str;
        while ((str = loopGetRange()) != null) {
            loopDoRange(str.start, str.end);
        }
    }
 
    public float[] getValues() {
        for (int i = 0; i < numThreads; i++) {
                        lookupThreads[i] = new Thread(this);
                        lookupThreads[i].start();
                    }
                    for (int i = 0; i < numThreads; i++) {
                        try {
                            lookupThreads[i].join();
                        } catch (InterruptedException iex) {}
                    }
        return lookupValues;
    }
}

The code in this new version is functionally the same as the previous version, albeit with many modifications to its logic. First, instead of a loop that does the calculation, we now have a loop that starts off 12 (numThreads) different worker threads and provides each worker thread with different parts of the mathematical loop to calculate. The original mathematical calculation is moved to a new method, loopDoRange(). In this method, the loop has been modified to work on only part of the lookup table instead of the whole table. Each different thread is responsible for calculating only its portion of the table. Each thread must call the loopGetRange() method to determine which portion they must calculate. The original thread that started the 12 worker threads then simply waits for all 12 worker threads to finish. Since the long calculation is now accomplished by 12 threads instead of by a single thread, it is now possible for a multiprocessor-based operating system to place the different threads on different processors.

The calculation works for a number of reasons. First, the loop index variable i and the sinValue variable, which were originally classified as loop private, are now stack variables in each worker thread. The loopDoRange() method uses different copies of these two variables in each thread executing the loop. This means that each of the 12 worker threads has its own copy of these variables while completing its portion of the calculation.

Second, although the lookupTable array is not loop private, the individual members of the array can be considered loop private. Each individual member of the array is only accessed in a particular iteration. There is no race condition because each iteration affects one and only one member of the array, and although the different worker threads handle many iterations of the loop, no single iteration is handled by more than one thread.

The only synchronization we need is in the assignment of the different ranges. To prevent the worker threads from stepping on each other during this assignment, the loopGetRange() method is synchronized. In this example, since the loop is partitioned only in 12 ranges, the execution time for this method is insignificant when compared with the loop calculation itself.

The code for this new version is more complicated than our first version. This new code now has to start and track 12 separate threads. The worker threads had to be modified to handle parts of the loop whose ranges they have to determine. Although there is very little synchronization in this case, we could easily have had a complicated requirement for synchronization depending on the algorithm used in the mathematical calculation.

Given the complexity we introduced to handle this simple loop, it may become too hard to handle more complex loops. To help with this complexity, we’ll move all the logic related to loop management into a separate class. We can then implement the loop by simply using the services provided by this class:

public class LoopHandler implements Runnable {
    protected class LoopRange {
        public int start, end;
    }
    protected Thread lookupThreads[];
    protected int startLoop, endLoop, curLoop, numThreads;

    public LoopHandler(int start, int end, int threads) {
        startLoop = curLoop = start;
        endLoop = end;
        numThreads = threads;
        lookupThreads = new Thread[numThreads];
    }

    protected synchronized LoopRange loopGetRange() {
        if (curLoop >= endLoop)
            return null;
        LoopRange ret = new LoopRange();
        ret.start = curLoop;
        curLoop += (endLoop-startLoop)/numThreads+1;
        ret.end = (curLoop<endLoop) ? curLoop : endLoop;
        return ret;
    }

    public void loopDoRange(int start, int end) {
    }

    public void loopProcess() {
        for (int i = 0; i < numThreads; i++) {
            lookupThreads[i] = new Thread(this);
            lookupThreads[i].start();
        }
        for (int i = 0; i < numThreads; i++) {
            try {
                lookupThreads[i].join();
            } catch (InterruptedException iex) {}
        }
    }
 
    public void run() {
        LoopRange str;
        while ((str = loopGetRange()) != null) {
            loopDoRange(str.start, str.end);
        }
    }
}

In our new LoopHandler class, we have implemented the logic that we applied on our SinTable class. The logic of creating, tracking, and joining back with the original thread has been moved to the newly created loopProcess() method. The logic of determining the ranges and processing the loop—originally coded in the run() and loopGetRange() methods of the SinTable class—remains nearly unchanged. The loop handler has also been modified to handle more generic loops and has a constructor that will assign the start of the loop, the end of the loop, and the number of threads. Just as in our earlier example, the algorithm will call the loopDoRange() method to handle the processing. However, in this case, the LoopHandler class has an empty implementation for this method.

Now our implementation of the SinTable class is much simpler:

public class SinTable extends LoopHandler {
    private float lookupValues[];

    public SinTable() {
        super(0, 360*100, 12);
        lookupValues = new float [360 * 100];
    }

    public void loopDoRange(int start, int end) {
        for (int i = start; i < end; i++) {
            float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[i] = sinValue * (float)i / 180.0f;
        }
    }    

    public float[] getValues() {
        loopProcess();
        return lookupValues;
    }
}

In this case, we simply need to configure the ranges needed by the loop handler, provide the logic of the loop in the loopDoRange() method, and call the loopProcess() method to process the loop in a multithreaded fashion. While this is still more complicated than the first SinTable class implementation, it is now much more manageable and less complex than the previous implementation.

Loop Scheduling and Load Balancing

We define the process of distributing the iterations of the loop to the individual threads as loop scheduling. In our LoopHandler class, this is handled by the loop-GetRange() method. To maximize processor usage, we should distribute the work to the threads as evenly as possible, with the least amount of overhead in determining this distribution. This is defined as load balancing.

Here are the basic loop-scheduling types at our disposal:

Static or chunk scheduling

Under static scheduling, each thread is assigned an equal number of iterations that depends on the number of threads available. If there are 1000 iterations in the loop that are to be distributed and 10 threads that are assigned to the task, then each thread will be assigned 100 iterations of the loop. This is the algorithm that is used by the LoopHandler class. The algorithm also adds 1 to the size to make sure that the distribution is rounded up. Otherwise, there might be an iteration left over and a worker thread would have to perform that single iteration after already performing the original chunk.

The problem with this algorithm is that it assumes that each iteration of the loop takes the same amount of time. If this is not true, then one of the threads will take more time than the other threads to complete. Since all the work is divided up at the beginning of the loop, the other threads will be idle while the final iterations are completed by the last remaining thread.

Self-scheduling

In self-scheduling, each worker thread grabs a small chunk of the iterations to execute. After completion of its assigned range, it grabs another small chunk. If there are 1000 iterations in the loop that are to be distributed and 10 threads are assigned to the task, then each worker thread will work on a small chunk—say 20—until all 1000 iterations are completed.

As with static scheduling, the different worker threads may not complete at the same time. However, since the chunks are small in the self-scheduling model, the idle time of the threads at the end of the process is also small. To make this idle time even smaller, we can make the individual chunks smaller. However, there is an overhead in obtaining the ranges to execute; this overhead will increase as the chunks get smaller.

Here’s an implementation of this model:

public class SelfLoopHandler extends LoopHandler {
    protected int groupSize;

    public SelfLoopHandler(int start, int end, int size, int threads) {
        super(start, end, threads);
        groupSize = size;
    }

    protected synchronized LoopRange loopGetRange() {
        if (curLoop >= endLoop)
            return null;
        LoopRange ret = new LoopRange();
        ret.start = curLoop;
        curLoop += groupSize;
        ret.end = (curLoop<endLoop)?curLoop:endLoop;
        return ret;
    }
}

Implementation of a self-scheduling loop handler is straightforward. Our current LoopHandler class already has the logic of working until the loop has been completed. We simply need to modify the constructor to handle the chunk size requested, and modify the loopGetRange() method to return this fixed chunk size. In our implementation of the self-scheduler, we simply subclass from the original loop handler and implement only the changes.

Guided self-scheduling

Guided self-scheduling is a compromise between the static scheduler and the self-scheduler. In the beginning, the guided scheduler grabs a large number of iterations of the loop, which becomes progressively smaller near the end of the loop. There is also a minimum chunk size that the guided self-scheduler uses. Thus, it basically behaves like a static scheduler that slowly becomes a self-scheduler.

If 1000 iterations in the loop are to be distributed and 10 threads are assigned to the task, then the first worker thread gets one-tenth of the work—100 iterations. The second thread gets one-tenth of the remaining work—90 iterations. This slowly gets smaller and smaller until the minimum—say 10—is assigned; the minimum is assigned until all 1000 iterations are completed.

This algorithm seems to have the fewest problems. Unlike the self-scheduler, the extra overhead only appears at the end of the loop. And unless the individual iterations have drastically different execution periods from the longer-term iterations at the beginning, it doesn’t have the problems that the static scheduler has.

Here’s how to implement guided self-scheduling:

public class GuidedLoopHandler extends LoopHandler {
    protected int minSize;

    public GuidedLoopHandler(int start, int end, int min, int threads){
        super(start, end, threads);
        minSize = min;
    }

    protected synchronized LoopRange loopGetRange() {
        if (curLoop >= endLoop)
            return null;
        LoopRange ret = new LoopRange();
        ret.start = curLoop;
        int sizeLoop = (endLoop-curLoop)/numThreads;
        curLoop += (sizeLoop>minSize)?sizeLoop:minSize;
        ret.end = (curLoop<endLoop)?curLoop:endLoop;
        return ret;
    }
}

Implementation of a guided self-scheduling loop handler is also straightforward. We simply need to modify the constructor to handle the minimum size required, and modify the loopGetRange() method to return a portion of the remaining loop. In our implementation of the guided self-scheduler, we also subclass the original loop handler and implement only the changes.

User-defined scheduler

The implementation of the self-scheduler and the guided self-scheduler is simple for a reason: it was designed to be so. The original loop handler was designed to be subclassed so that the scheduler algorithm could be modified. As good as the implementation of the guided self-scheduler may be, it is still designed for a generic loop. There will be cases where each of the different schedulers will work better than others. However, if enough information concerning the loop is known, and the effort is large enough, it may justify the implementation of yet another scheduler. This entails figuring out the appropriate logic and coding a new loopGetRange() method.

Here’s how our original example can be modified to use one of the scheduling techniques we’ve just seen:

public class SinTable extends GuidedLoopHandler {
    private float lookupValues[];

    public SinTable() {
        super(0, 360*100, 100, 12);
        lookupValues = new float [360 * 100];
    }

    public void loopDoRange(int start, int end) {
        for (int i = start; i < end; i++) {
            float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[i] = sinValue * (float)i / 180.0f;
        }
    }    

    public float[] getValues() {
        loopProcess();
        return lookupValues;
    }
}

To use the guided self-scheduler algorithm in our SinTable class, we simply subclass from the GuidedLoopHandler class and modify our constructor to pass the minimum chunk size. We could also have written the GuidedLoopHandler class to have an overloaded constructor that picks a default minimum. This would allow it to have a constructor with the same signature as the static loop handler.

Variable Classifications

In the implementation of the SinTable class, we classified the variables used in the original nonthreaded loop as loop-private variables, but other variable classifications exist. The reason for classifying variables at all is that different types of variables require different types of handling within and between threads, because many loops have a data dependency that occurs between different iterations. By classifying the variables, we are able to correctly update and modify them without any race conditions. Different types of variable classifications can be determined by their usage, and these classifications will determine how they are to be implemented or treated in the multithreaded loop handler.

Loop-private variables

As mentioned, a loop-private variable is a variable that does not pass its value from one iteration of the loop to another iteration of the loop. It can actually be a variable that is declared in the loop itself, and it can also be an instance or publicly accessed variable that is accessed by only one iteration of the loop. This was the case with the lookupValues array variable, where each member of the array was only accessed by one iteration of the loop. Although the whole array was not loop private to any iteration, specific members were loop private to specific iterations.

As shown with the SinTable class, treatment of loop-private variables is often done with a local copy of the variable in each thread. Since each thread has a copy, no interference between the threads is possible. In the case of the lookupValues array, there is an understanding that the threads will respect the privacy of the other threads by only accessing their loop-private portions of the array.

Read-only variables

Read-only variables are variables that do not change in value during the duration of the loop. They can be true constants or simply variables that have been initialized and will not change until after the loop has been processed.

No special treatment of read-only variables is necessary. The worker threads do not need to have their own copies of the variables, and access to them does not require synchronization of any type.

Storeback variables

Storeback variables are basically loop-private variables that are needed after the loop has been completed. For example, say that the processing of the lookupValues array required some extra work to be done after the loop was finished:

public float[] getValues() {
    if (lookupValues == null) {
        float sinValue = 0;
        lookupValues = new float [360 * 100];
        for (int i = 0; i < (360*100); i++) {
            sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[i] = sinValue * (float)i / 180.0f;
        }
        lookupValues[0] += sinValue;
    }    
    return lookupValues;
}

In this slightly modified version of the SinTable loop, both the sinValue variable and the individual members of the lookupValues array are still loop-private variables. There is no data dependency between these two variables in different iterations of the loop. However, in this case the sinValue variable is also a storeback variable. Since the variable is important after the loop has completed, it must be set to the value as if the loop had run in the correct order. The members of the lookupValues array were always considered as storeback variables, but since no individual copies were kept, there was little need to make this extra distinction.

Here’s how we can handle the storeback variable:

public class SinTable extends GuidedLoopHandler {
    private float lookupValues[];
    private float sinValue;
 
    public SinTable() {
        super(0, 360*100, 100, 12);
        lookupValues = new float [360 * 100];
    }
 
    public void loopDoRange(int start, int end) {
        float sinValue = 0;
        for (int i = start; i < end; i++) {
            sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[i] = sinValue * (float)i / 180.0f;
        }
        if (end == endLoop)
                              this.sinValue = sinValue;
    }
 
    public float[] getValues() {
        loopProcess();
        lookupValues[0] += sinValue;
        return lookupValues;
    }
}

The sinValue variable is still treated as a loop-private variable. However, since this variable is really a storeback variable, we need to store the “last” value of this variable. Since the algorithm is now executed in a multithreaded manner, the last iteration is not necessarily the last value assigned to the variable by a thread.

A thread must check that it has executed the last chunk of the loop before copying the value of its loop-private copy to the global copy. Also note that no synchronization is necessary. Since only the last iteration will be copied, only one thread will be executing the code, and no race condition is possible.

Reduction variables

Obviously, it is not possible to make every variable a loop-private variable, since there are cases where there are real data dependencies between different iterations of the loop. Because of these data dependencies, different threads executing different iterations might interfere with each other during execution. We will define these types of variables as shared variables , since they are shared between iterations of the loop.

Shared variables have many problems. The first is the race conditions that exist when different threads access the variable simultaneously. The second is that the value of a variable may depend on the order in which it is processed. In the first case, we can simply use synchronization techniques to prevent the race conditions from existing. The second case poses a much greater problem.

However, what if the order did not matter? We would be able to process the loop in any order and would simply have to synchronize access to the shared variable. For example, let us assume that we also need to calculate the sum of our SinTable:

public float[] getValues() {
    for (int i = 0; i < (360*100); i++) {
        sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
        lookupValues[i] = sinValue * (float)i / 180.0f;
        sumValue += lookupValues[i];
    }
    return lookupValues;
  }

In this case, the sumValue variable is clearly not a loop-private variable. The value of sumValue is passed from one iteration to another, and the correct result requires this dependency to exist. However, the sumValue variable is only useful after the loop has completed. The iterations simply add to the running total—subtotals or other order-based requirements are not necessary. Furthermore, addition itself is order independent: it is possible to add a bunch of numbers in any order, and the final result will be the same.

The sumValue variable is a reduction variable. It must still be shared among the threads, but since order does not matter, this sharing only requires synchronization to prevent race conditions:

public class SinTable extends GuidedLoopHandler {
    private float lookupValues[];
    public float sumValue;

    public SinTable() {
        super(0, 360*100, 100, 12);
        lookupValues = new float [360 * 100];
    }

    public void loopDoRange(int start, int end) {
        float sinValue = 0;
        for (int i = start; i < end; i++) {
            sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[i] = sinValue * (float)i / 180.0f;
            synchronized (this) {
                                  sumValue += lookupValues[i];
                              }    
        }    
    }    

    public float[] getValues() {
        loopProcess();
        return lookupValues;
    }
}

Race conditions in this example are prevented by using the synchronization lock of the SinTable instance. If we have many reduction variables that are not dependent on each other and we cannot store them all at the same time, it might be a better idea to have separate synchronization locks—or BusyFlags—for each reduction variable.

Furthermore, we are synchronizing with each iteration of the loop. This is not very efficient. It is better to assign the value to loop-private variables and only synchronize the final summed value of the range to the reduction variable. By doing this, we are removing most of the need for synchronization, which can drastically add to the parallelization of the threads:

public class SinTable extends GuidedLoopHandler {
    private float lookupValues[];
    public float sumValue;

    public SinTable() {
        super(0, 360*100, 100, 12);
        lookupValues = new float [360 * 100];
    }

    public void loopDoRange(int start, int end) {
        float sinValue = 0.0f;
        float sumValue = 0.0f;
        for (int i = start; i < end; i++) {
            sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[i] = sinValue * (float)i / 180.0f;
            sumValue += lookupValues[i];
        }
        synchronized (this) {
                              this.sumValue += sumValue;
                          }
    }

    public float[] getValues() {
        loopProcess();
        System.out.println(sumValue);
        return lookupValues;
    }
}

In this new example, we are doing a two-stage reduction of the values. We are reducing the value of each iteration to the local copy of the sumValue variable, and then we are reducing this local copy to the actual reduction variable. Since the local copy of the sumValue variable is loop private, synchronization is not necessary. Synchronization is still necessary when adding to the reduction variable. However, this is now done once per range instead of once per iteration.

Finally, all reduction variables are storeback variables. There is no need to have special storeback handling logic for reduction variables.

Shared variables

Originally, all variables in the loop are shared variables, since all variables can be accessed by all the threads that are executing the loop. As we parallelize the loop, we can quickly classify the shared variables that are also read-only variables. We can also reclassify those variables that are loop-private variables. Of the remaining shared variables, it may be possible either to convert them to loop-private variables or to classify them as reduction variables.

Unfortunately, there will be cases where a shared variable cannot be classified as anything but a shared variable. This is where our technique fails to work. As much as we would like to convert any loop to run in a multithreaded environment, not all algorithms can be redesigned to run in a parallel environment.

The other problem with shared variables is the side effect. For example, if we needed to save each of the subtotals of the sumValue variable, it could not be treated as a reduction variable since the changes in the variable are also important. If we had to print the subtotals during the loop, not only would the intermediate results be out of order, but the intermediate results would be different.

When variable classification is not enough for parallelization, we have other techniques that can help. They may not solve every case, but with experience, more and more loops can be converted to run in a multithreaded environment.

Loop Analysis and Transformations

To assist our parallelizing techniques, we can analyze the algorithms of the loop itself instead of just analyzing the variables in the loop. In the majority of the cases, there is very little that we can do without redesigning the algorithm, but there are a few situations where we can quickly modify the code without a complete redesign. By implementing simple transformations on the original code, we may be able to use the techniques discussed so far to thread the loop.

Loop distribution

In many cases, only a small portion of a large complex loop contains code that must be executed sequentially. It may be possible to separate the large complex loop into two separate loops. Once the complex loop is separated into two loops—one loop containing the code that can be parallelized, the other containing the sequential code—we can then parallelize a portion of the original loop. We may even be able to run the sequential loop in parallel with the loop that can be threaded.

Returning to our SinTable example, let’s assume that we need not only a total but also a running subtotal of the table that is to be generated:

public float[] getValues() {
    for (int i = 0; i < (360*100); i++) {
        sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
        lookupValues[i] = sinValue * (float)i / 180.0f;
        if (i == 0) {
                              sumValues[0] = lookupValues[0];
                          } else {
                              sumValues[i] = lookupValues[i] + lookupValues[i-1];
                          }
    }
    return lookupValues;
}

The sumValues array variable is definitely a shared variable. The members of the sumValues variable are also shared in that some of them are accessed by two different threads. Furthermore, the order matters. It is not possible for one thread to start a chunk before the thread that is working on the previous chunk is finished.

We can solve that problem like this:

public class SinTable extends GuidedLoopHandler {
    private float lookupValues[];
    public float sumValues[];

    public SinTable() {
        super(0, 360*100, 100, 12);
        lookupValues = new float [360 * 100];
        sumValues = new float [360 * 100];
    }

    public void loopDoRange(int start, int end) {
        float sinValue = 0.0f;
        for (int i = start; i < end; i++) {
            sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[i] = sinValue * (float)i / 180.0f;
        }
    }    

    public float[] getValues() {
        loopProcess();
        sumValues[0] = lookupValues[0];
                          for (int i = 1; i < (360*100); i++) {
                              sumValues[i] = lookupValues[i] + lookupValues[i-1];
                          }
        return lookupValues;
    }
}

While it is not possible to parallelize the running subtotal without drastically changing the algorithm, we can quickly convert the loop into two separate loops. The first loop contains the threadable code, and the second processes the subtotal. Once this is accomplished, we can then thread the first loop without changing the second. In the new SinTable class, we have moved the running subtotal code to a separate loop. This separate loop runs on a single thread, and only after the first loop is processed.

Some comparisons should be taken when using this technique. Since a large portion of the loop may be running single threaded, the performance gain may not justify the effort involved. In most cases, calculations of the subtotal are small considering the effort of the main calculation, and the performance penalty may be small in comparison.

Loop isolation

Many applications do not contain a single large loop. Even if a particular loop is determined to be unparallelizable, there may be other loops in the application. Even if these other loops also cannot be parallelized, we may be able to run each separate loop in a different thread.

Although the many loops may be very complex, with large data dependencies between iterations, there may be few data dependencies between the different loops. It may be possible to isolate the individual loops themselves and run them each in a separate thread. With this technique, load balancing is no longer possible. After all, if the application contains four major loops and you were able to isolate them all, it is still impossible to distribute these four loops among twelve processors.

Loop interchange

Multilayered loops are a prime cause of CPU-bound applications that run for a large period of time. This could be loops that are directly inside of other loops or, more likely, loops that call methods that contain loops. This scenario is so common that we will examine inner-loop threading later in this chapter. For now, there is a simple case to look for:

public float[][] getValues() {
    for (int i = 0; i < 360; i++) {
        lookupValues[0][i] = 0;
    }
    for (int j = 1; j < 1000; j++) {
                          for (int i = 0; i < 360; i++) {
            float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[j][i] = sinValue * (float)i / 180.0f;
            lookupValues[j][i] += lookupValues[j-1][i]*(float)j/180.0f;
        }  
    }
    return lookupValues;
  }

For multilayered loops, it is generally more profitable to thread the outer loop instead of the inner one. It is not necessary to thread both the inner and outer loop because threading either one should use all the processors. If the outer loop is threaded, threading the inner loop will not provide any further speedup since there are no more processors to run the extra threads (and vice versa). The reason we prefer to thread the outer loop is that there is an overhead in creating, destroying, and synchronizing among the many threads. By threading the outer loop, we create and destroy the threads once and only synchronize at a coarse level—less synchronization should be necessary.

In this new version of the table calculation, we are now working on a two-dimensional table. There are three loops used during this calculation. However, the first loop is merely setting the first row of values to zero. The next two loops are actually a pair of multilayered loops. The algorithm is looping the processing from row to row, executing the inner loop that is processing the values to be stored in the different columns.

The problem in this case is that there is a data dependency between the rows themselves. Because the calculation at any row is dependent on the calculation of the previous row, the members of any column in the lookupValues array cannot be considered loop private—or made loop private. The inner loop can be parallelized with no problems since there are no data dependencies between the iterations. The only requirement is that the inner loop must assume that the outer loop ran in the correct order; this requirement is fine since we are not threading the outer loop.

However, we can also rewrite our original code as follows:

public float[][] getValues() {
    for (int i = 0; i < 360; i++) {
        lookupValues[0][i] = 0;
    }
    for (int i = 0; i < 360; i++) {
                          for (int j = 1; j < 1000; j++) {
            float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[j][i] = sinValue * (float)i / 180.0f;
            lookupValues[j][i] += lookupValues[j-1][i]*(float)j/180.0f;
        }  
    }
    return lookupValues;
}

In this example, the loops are interchanged. Instead of working from row to row, we can work from column to column. The inner loop can then process the data from row to row. By interchanging the loops, the inner loop is now no longer threadable, since there is data dependency between the members of the columns in the lookupValues array. However, the outer loop is now threadable. Once the outer loop has been threaded, there will no longer be a reason to thread the inner loop. Since it is more profitable to thread an outer loop than an inner loop, this simple change prior to multithreading gives us a better return on our development time investment.

Unfortunately, although loops within loops are common, this example may not be. There is generally setup code for an inner loop, and there may be multiple loops that are run sequentially within the outer loop, or the inner loop may be inside another method that is called from the outer loop. And the data dependencies may be such that a loop interchange will not solve the problem.

Having an inner loop that is threadable in an outer loop that is not threadable is common. We will be examining inner-loop threading in more detail later in this chapter.

Loop reimplementation

As you may have noticed, the loop handler that we have developed is fairly restrictive. It only applies to for loops, the range of the loop must be known prior to execution, it only works with integers as its index, and it has an interval of only 1 between iterations. While some of these restrictions are caused by the fact that we have not implemented support for certain features in the loop handler, the main cause is that it is difficult, if not impossible, to implement an algorithm that can handle all generic loops.

If all else fails during loop transformation, programming experience is still very useful. A while loop or a do loop may be converted to a for loop. The start and end iterations may be calculated prior to loop execution. Code may be moved from or into a loop, or between loops, to allow other loop transformations to occur. Code changes can also cause variable classifications to change. A shared variable may be reclassified as loop private or as a reduction variable because of how it is used in a loop.

Unfortunately, success is never guaranteed. The goal is to balance the effort of development with the acceleration that may be gained. It may take days to implement a change that can only achieve another one or two percent acceleration. After all, if unlimited effort were allowed, we would redesign the whole application from scratch.



[12] The terminology that we will be using in this section is somewhat based on the autothreading MP C compiler available for the Solaris operating system.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset