Multiprocessor Scaling

Scaling is a term that is sometimes overused. It can apply to how many applications a computer can execute simultaneously, how many disks can be written to simultaneously, or how many cream cheese bagel orders can be processed by the local bagel shop’s crew. When the output can no longer be increased no matter how many resources are added, that limit is generally the value used to specify how far something scales. If the oven cannot produce more bagels per hour, it does not matter how many people are added to the assembly line: the rate of bagels cannot exceed the rate produced by the oven. The scaling limit can also be determined by many other factors, such as the rate at which the cream cheese can be produced, the size of the refrigerators, or even the bagel shop’s suppliers.

In this chapter, when we refer to the scalability of a multithreaded application, we are referring to the limit on the number of processors we can add and still obtain an acceleration: adding processors beyond this limit will not make the application run faster. Obviously, how an application scales depends on many factors: the operating system, the Java virtual machine implementation, the browser or application server, and the Java application itself. How well an application can scale is bounded by the scalability limits of each of these factors.

For perfect CPU-bound programs in a perfect world, we could expect perfect scaling: adding a second CPU would halve the amount of time that it takes the program to run, adding another CPU would reduce the time by another third, and so on. Even for the loop-based programs we’ve examined in this chapter, however, the amount of scaling that we will see is also limited by these important constraints:

Setup time

A certain amount of time is required to execute the code outside of the loop that is being parallelized. This amount of time is independent of the number of threads and processors that are available, because only a single thread will execute that code.

New synchronization requirements

In parallelizing the loops of this chapter, we’ve introduced some additional bookkeeping code, some of which is synchronized. Because obtaining a synchronization lock is expensive, this increases the time required to execute the code.

Serialization of methods

Some methods in our parallelized code must run sequentially because they are synchronized. Contention for the lock associated with these methods will also affect the scalability of our parallelized programs.

If we view the setup time, synchronization time, and time required to execute the serialized methods as a percentage of the total running time, the remaining time is the amount of code that is parallelized. The maximum amount of scaling that we’ll see is given by Amdahl’s law:

    S = (1 - F) + F / N

Here, S is the scaling we’ll see: the fraction of the original running time required when F percent of the code is parallelized over N processors. If 95% of the code is parallelized and we have eight processors available, the code will run in 16.9% of the original time required (.05 + .95/8). However, when we introduce code to calculate loop ranges (or any other bookkeeping code), we’ve actually increased the amount of serialized code, so the net savings can be negative. In that case, our parallelized code will take longer to run than our original code.
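To make the arithmetic concrete, here is a minimal sketch of this calculation in Java; the Amdahl class and amdahlFraction method are our own names for illustration, not part of the test code:

public class Amdahl {
    // Fraction of the original running time that remains when a
    // fraction f of the code is parallelized over n processors:
    // (1 - f) + f / n
    static double amdahlFraction(double f, int n) {
        return (1.0 - f) + f / n;
    }

    public static void main(String[] args) {
        // 95% parallelized over 8 processors: prints roughly 0.169
        System.out.println(amdahlFraction(0.95, 8));
        // 96% parallelized over 8 processors: prints roughly 0.16
        System.out.println(amdahlFraction(0.96, 8));
    }
}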

So what sort of scaling can we expect from the techniques of this chapter? In order to answer this question, we will test several implementations of our sample double loop:

public float[][] getValues() {
    for (int i = 0; i < 360; i++) {
        lookupValues[0][i] = 0;
    }
    for (int j = 1; j < 1000; j++) {
        for (int i = 0; i < 360; i++) {
            float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[j][i] = sinValue * (float)i / 180.0f;
            lookupValues[j][i] += lookupValues[j-1][i]*(float)j/180.0f;
        }  
    }
    return lookupValues;
}

To make testing easier, we will use the following interface and class to build a framework for testing various loop handlers. Since we’re working with CPU-intensive threads, we’ve included a call to the Solaris-specific code that sets the number of LWPs, but this code will run on any operating system:

public interface ScaleTester {
    public void init(int nRows, int nCols, int nThreads);
    public float[][] doCalc();
}

import java.util.*;
import java.text.*;
import java.io.*;

public class ScaleTest {
    private int nIter = 200;
    private int nRows = 2000;
    private int nCols = 200;
    private int nThreads = 8;
    Class target;

    ScaleTest(int nIter, int nRows, int nCols, int nThreads,
                String className) {
        this.nIter = nIter;
        this.nRows = nRows;
        this.nCols = nCols;
        this.nThreads = nThreads;
        try {
            target = Class.forName(className);
        } catch (ClassNotFoundException cnfe) {
            System.out.println(cnfe);
            System.exit(-1);
        }
    }

    void chart() {
        long sumTime = 0;
        long startLoop = System.currentTimeMillis();
        try {
            ScaleTester st = (ScaleTester) target.newInstance();
            for (int i = 0; i < nIter; i++) {
                st.init(nRows, nCols, nThreads);
                long then = System.currentTimeMillis();
                float ans[][] = st.doCalc();
                long now = System.currentTimeMillis();
                sumTime += (now - then);
            }
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(-1);
        }
        long endLoop = System.currentTimeMillis();
        long calcTime = endLoop - startLoop;
        System.err.println("Loop time " + sumTime +
                            " (" + ((sumTime * 100) / calcTime) + "%)");
        System.err.println("Calculation time  " + calcTime);
    }

    public static void main(String args[]) {
        if (args.length != 5) {
            System.out.println(
    "Usage: java ScaleTest nIter nRows nCols nThreads className");
            System.exit(-1);
        }
        ScaleTest sc = new ScaleTest(Integer.parseInt(args[0]),
                                     Integer.parseInt(args[1]),
                                     Integer.parseInt(args[2]),
                                     Integer.parseInt(args[3]),
                                     args[4]);
        CPUSupport.setConcurrency(Integer.parseInt(args[3]) + 5);
        sc.chart();
    }
}
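
The CPUSupport class referenced in main() is not shown in this section. A minimal stand-in like the following sketch (ours, not the original native-code version) is enough to compile and run the tests on platforms where setting the number of LWPs is unnecessary:

// A stand-in for the CPUSupport class used by ScaleTest. The real
// class calls Solaris-specific native code to set the number of LWPs;
// on other operating systems, doing nothing is the correct behavior.
public class CPUSupport {
    public static void setConcurrency(int n) {
        // No-op: assume the platform schedules threads on CPUs directly.
    }
}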

When we use the ScaleTest class, we get two numbers: the number of milliseconds required to run the entire program (including initialization, which is single-threaded) and the number of milliseconds required to run just the loop calculation. We can then compare these numbers to determine the scalability of various implementations of our loop handling classes.
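For example, the first column of numbers in Table 9-1 below (200 iterations, 500 rows, 1000 columns) corresponds to an invocation along these lines, with the thread count varied from run to run:

    % java ScaleTest 200 500 1000 8 GuidedLoopInterchanged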

As a baseline, we’ll take the measurement of this class:

public class Basic implements ScaleTester {
    private float lookupValues[][];
    int nCols, nRows;

    public void init(int nRows, int nCols, int nThreads) {
        this.nCols = nCols;
        this.nRows = nRows;
        lookupValues = new float[nRows][];
        for (int j = 0; j < nRows; j++) {
            lookupValues[j] = new float[nCols];
        }
    }

    public float[][] doCalc() {
        for (int i = 0; i < nCols; i++) {
            lookupValues[0][i] = 0;
        }
        for (int j = 1; j < nRows; j++) {
            for (int i = 0; i < nCols; i++) {
                float sinValue =
                                (float)Math.sin((i % 360)*Math.PI/180.0);
                lookupValues[j][i] = sinValue * (float)i / 180.0f;
                lookupValues[j][i] +=
                                lookupValues[j-1][i]*(float)j/180.0f;
            }
        }
        return lookupValues;
    }
}

This class contains no threading; it is the way that we would normally implement the basic calculation we’re interested in testing. One of the implementations that we’ll compare this class against is the following loop handler class:

public class GuidedLoopInterchanged implements ScaleTester {
    private float lookupValues[][];
    private int nRows, nCols, nThreads;

    private class GuidedLoopInterchangedHandler
                                    extends GuidedLoopHandler {
        GuidedLoopInterchangedHandler(int nc, int nt) {
            super(0, nc, 10, nt);
        }

        public void loopDoRange(int start, int end) {
            for (int i = start; i < end; i++) {
                lookupValues[0][i] = 0;
            }
            for (int i = start; i < end; i++) {
                for (int j = 1; j < nRows; j++) {
                    float sinValue =
                                (float)Math.sin((i % 360)*Math.PI/180.0);
                    lookupValues[j][i] = sinValue * (float)i / 180.0f;
                    lookupValues[j][i] +=
                                lookupValues[j-1][i]*(float)j/180.0f;
                }  
            }
        }
    }

    public void init(int nRows, int nCols, int nThreads) {
        this.nRows = nRows;
        this.nCols = nCols;
        this.nThreads = nThreads;
        lookupValues = new float[nRows][];
        for (int j = 0; j < nRows; j++) {
            lookupValues[j] = new float[nCols];
        }
    }

    public float[][] doCalc() {
        GuidedLoopInterchangedHandler loop =
                    new GuidedLoopInterchangedHandler(nCols, nThreads);
        loop.loopProcess();
        return lookupValues;
    }
}

This class uses our guided loop handler to process the loop; notice, however, that we’ve interchanged the loops in order to make the outer loop threadable.
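The GuidedLoopHandler base class that this code extends was developed earlier in the book and is not repeated here. For readers following along, the sketch below approximates it; treat the details (the chunk-size formula and the int[] range representation) as our assumptions rather than the original code:

// A sketch of a guided self-scheduling loop handler: each worker
// repeatedly asks for a range of iterations, and the ranges handed
// out get progressively smaller as the loop nears completion.
public abstract class GuidedLoopHandler {
    protected final int endLoop, minSize, numThreads;
    protected int curLoop;

    public GuidedLoopHandler(int start, int end, int min, int threads) {
        curLoop = start;
        endLoop = end;
        minSize = min;
        numThreads = threads;
    }

    // Hands out successively smaller chunks of the loop range. This
    // method is synchronized: it is the serialization point whose
    // contention is discussed throughout this section.
    protected synchronized int[] loopGetRange() {
        if (curLoop >= endLoop)
            return null;
        int start = curLoop;
        int size = Math.max((endLoop - curLoop) / numThreads, minSize);
        curLoop = Math.min(curLoop + size, endLoop);
        return new int[] { start, curLoop };
    }

    // Subclasses supply the loop body for the half-open range [start, end).
    public abstract void loopDoRange(int start, int end);

    // Starts numThreads worker threads and waits for all of them.
    public void loopProcess() {
        Thread t[] = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            t[i] = new Thread() {
                public void run() {
                    int range[];
                    while ((range = loopGetRange()) != null)
                        loopDoRange(range[0], range[1]);
                }
            };
            t[i].start();
        }
        for (int i = 0; i < numThreads; i++) {
            try {
                t[i].join();
            } catch (InterruptedException ie) {
                // Ignore and keep waiting for the remaining workers.
            }
        }
    }
}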

Table 9-1 lists the results of the ScaleTest program when run with different implementations of the interchanged loop: we’ve used chunk, self-scheduling, and guided self-scheduling loop handlers in conjunction with the code we showed earlier. These tests were run on a machine with eight CPUs, using an iteration count of 200. We’ve normalized the running time for the baseline run to 100 so that the other numbers can be read as percentages: the best that we do is run in 20.6% of the time required for the original run. The first number in each cell represents a run with 500 rows and 1000 columns; the second number represents a run with 1000 rows and 500 columns.

Table 9-1. Scalability of Simple Loop Handlers

| Implementation | Number of Threads | Total Time | Loop Time |
| --- | --- | --- | --- |
| Basic | 1 | 100/100 | 96/96 |
| Chunk scheduling | 1 | 124.6/123.4 | 120.8/119.7 |
| | 2 | 64.5/63.1 | 61.2/59.3 |
| | 4 | 34.7/35.3 | 31.5/31.8 |
| | 8 | 23.7/23.0 | 20.3/19.3 |
| | 12 | 24.0/24.0 | 20.6/20.2 |
| Self-scheduling | 1 | 129.7/127.6 | 125.8/123.8 |
| | 2 | 71.9/70.3 | 69.0/66.8 |
| | 4 | 39.3/39.6 | 36.1/36.1 |
| | 8 | 23.1/24.1 | 19.8/20.5 |
| | 12 | 22.7/23.5 | 19.2/19.8 |
| Guided self-scheduling | 1 | 124.7/122.5 | 120.9/118.9 |
| | 2 | 64.0/63.6 | 60.8/60.5 |
| | 4 | 34.4/34.2 | 31.3/30.8 |
| | 8 | 20.6/21.8 | 17.3/18.1 |
| | 12 | 22.3/23.1 | 18.9/19.1 |

There are a few conclusions that we can draw from this table:

  • The overhead of setting up the threads and the loop-handling class itself is significant: it requires 22% to 29% more time to execute that code when only a single thread is available. We would therefore never use this technique on a machine with only one CPU.

  • The scaling of the loop calculation itself is good. Since the original loop accounted for 96% of the code, with eight CPUs the best that we can hope for (using Amdahl’s law) is to run in 16% of the original time (.04 + .96/8). We’ve achieved 20.6%, which implies that about 90% of the code is now parallelized: the 6% difference is accounted for by the serialized calls to the loopGetRange() method and by the fact that each thread is probably not doing exactly the same amount of work.

  • Going past eight threads—that is, the number of CPUs available—yields a penalty. This is partially because we now have threads competing for a CPU, but it is also because of the synchronization around the additional calls to the loopGetRange() method.

  • The guided self-scheduler is the best choice in this example. This is not surprising: calculations based on sine values do not always require the same amount of time, so the chunk scheduler can be penalized by having one particular thread that requires too much time. That contributes to a loss of scaling, since the threads do not end up performing equal amounts of work.

All in all, though, we’ve achieved very good scalability.

What effect does a storeback variable have on our testing? We can rewrite our tests so that every time we calculate a lookup value, we add that value to a sumValue instance variable. Using the reduction technique we showed earlier (sketched below), the modified test generates the numbers given in Table 9-2.
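As a reminder of what that pattern looks like, here is a hedged sketch of the modified loopDoRange() method; the sumValue, lookupValues, and nRows fields are assumed to live on the enclosing handler or test class:

// A sketch of the reduction technique for the sumValue storeback
// variable: each thread accumulates into a local variable and
// synchronizes only once per range, keeping the serialized code small.
public void loopDoRange(int start, int end) {
    float localSum = 0.0f;
    for (int i = start; i < end; i++) {
        for (int j = 1; j < nRows; j++) {
            float sinValue = (float)Math.sin((i % 360)*Math.PI/180.0);
            lookupValues[j][i] = sinValue * (float)i / 180.0f;
            lookupValues[j][i] += lookupValues[j-1][i]*(float)j/180.0f;
            localSum += lookupValues[j][i];
        }
    }
    synchronized (this) {
        sumValue += localSum;    // single storeback per range
    }
}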

Table 9-2. Scalability of Loop Handlers with Storeback Variables

| Implementation | Number of Threads | Total Time | Loop Time |
| --- | --- | --- | --- |
| Basic | 1 | 100/100 | 97/96 |
| Chunk scheduling | 1 | 123.3/121.9 | 119.6/118.3 |
| | 2 | 64.1/62.7 | 61.5/59.5 |
| | 4 | 36.4/35.2 | 33.4/32.0 |
| | 8 | 22.5/22.7 | 19.3/19.3 |
| | 12 | 24.1/23.7 | 20.9/20.1 |
| Guided self-scheduling | 1 | 123.3/121.6 | 119.6/117.9 |
| | 2 | 64.6/63.2 | 62.0/60.0 |
| | 4 | 36.0/34.3 | 33.1/31.2 |
| | 8 | 20.2/21.5 | 17.1/18.0 |
| | 12 | 22.1/22.3 | 19.0/18.7 |

Because there’s only one storeback variable, the effect on the scaling is minor. In fact, in some cases we did better because the baseline itself now takes longer to execute. The effect of many storeback variables, however, could accumulate into something more noticeable.

What if we had threaded only the inner loop? This question is interesting because it demonstrates the trade-off between synchronization overhead and the savings available when the inner loop is small. Rewriting our first test (with no storeback variable) so that no loop interchange is performed and the inner loop is threaded instead produces the results in Table 9-3.

Table 9-3. Scalability of Inner Loop Handlers

| Implementation | Number of Threads | Total Time | Loop Time |
| --- | --- | --- | --- |
| Basic | 1 | 100/100 | 97/96 |
| Guided self-scheduling | 1 | 138.0/159.7 | 133.8/155.0 |
| | 2 | 82.2/138.3 | 77.2/131.4 |
| | 4 | 66.7/164.1 | 60.0/154.2 |
| | 8 | 104.3/515.3 | 92.8/499.9 |
| | 12 | 1318.9/4466.3 | 1292.5/4421.7 |

So what has happened here? First, we’ve slightly modified our test parameters: the first number was produced with a run of 100 rows and 5000 columns, and the second number was produced with a run of 500 rows and 1000 columns. In the first case, we achieve some scalability up to four CPUs, which allows us to run inner loops of about 250 calculations per CPU. By the time we get to eight CPUs, however, the inner loop has only 125 calculations, and the additional overhead of repeatedly calling the synchronized loopGetRange() method has overcome any advantage we gained by running the small loops in parallel. Things get drastically worse as we add additional threads.

In the second case, the inner loop is so small that we end up calling the loopGetRange() method so many times that there is never any scalability. In the best case (with two threads), we’ve added the equivalent of 43% more code than we’ve parallelized.

As we mentioned, threading of small loops—and particularly of small inner loops—is not necessarily worthwhile.
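For reference, the inner-loop version we tested is structured roughly like the following sketch of its doCalc() method, built on the GuidedLoopHandler sketch shown earlier; note that a new handler, and a new set of synchronized range requests, is needed for every row:

// A sketch of the inner-loop (non-interchanged) test: the handler is
// recreated for every row, so the synchronized loopGetRange() method
// is contended for nRows times per calculation.
public float[][] doCalc() {
    for (int i = 0; i < nCols; i++) {
        lookupValues[0][i] = 0;
    }
    for (int j = 1; j < nRows; j++) {
        final int row = j;
        GuidedLoopHandler loop =
                    new GuidedLoopHandler(0, nCols, 10, nThreads) {
            public void loopDoRange(int start, int end) {
                for (int i = start; i < end; i++) {
                    float sinValue =
                            (float)Math.sin((i % 360)*Math.PI/180.0);
                    lookupValues[row][i] = sinValue * (float)i / 180.0f;
                    lookupValues[row][i] +=
                            lookupValues[row-1][i]*(float)row/180.0f;
                }
            }
        };
        loop.loopProcess();
    }
    return lookupValues;
}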

Finally, what if we add code to the loop that prints out the result of some calculations? We can still thread such a case using the LoopPrinter class that we developed earlier. Remember, however, that we ended our section on the LoopPrinter class by discussing how the synchronization could be removed from it. Because in this particular test we always know the size of the output array, and we can ensure that the same index is never used by two different threads, we can rewrite the LoopPrinter class like this:

import java.util.*;
import java.io.*;

// Non-thread-safe version of a loop printer
public class LoopPrinter {
    private Vector pStorage[];

    public LoopPrinter(int size) {
        pStorage = new Vector[size];
    }

    public void print(int index, Object obj) {
        if (pStorage[index] == null) {
            pStorage[index] = new Vector();
        }
        pStorage[index].addElement(obj.toString());
    }

    public void println(int index, Object obj) {
        print(index, obj);
        print(index, "\n");
    }

    public void send2stream(PrintStream ps) {
        for (int i = 0; i < pStorage.length; i++) {
            if (pStorage[i] != null) {
                Enumeration e = pStorage[i].elements();
                while (e.hasMoreElements()) {
                    ps.print(e.nextElement());
                }
            }
        }
    }
}

With this new version of the loop printer, there is no longer any synchronized code, and hence it should have fewer problems scaling. However, with all the calls to the Vector class, even this version of our loop printer adds a significant amount of overhead to our multithreaded program. In addition, it still takes longer to add strings to these vectors and then dump them out than to simply call the System.out.println() method. Nevertheless, the difference between our thread-safe and thread-unsafe versions of this class is important. Table 9-4 lists the results that we obtained for both cases.

Table 9-4. Scalability of the LoopPrinter Classes

| Implementation | Number of Threads | Total Time | Loop Time |
| --- | --- | --- | --- |
| Basic | 1 | 100/100 | 96/98 |
| Thread-safe loop printer | 1 | 125.4/126.0 | 116.7/119.7 |
| | 2 | 79.0/97.8 | 70.3/91.9 |
| | 4 | 55.5/82.5 | 47.2/76.7 |
| | 8 | 46.6/84.2 | 38.2/78.3 |
| | 12 | 48.2/86.9 | 39.5/80.0 |
| Thread-unsafe loop printer | 1 | 125.1/121.0 | 116.3/111.3 |
| | 2 | 77.9/92.7 | 69.4/85.3 |
| | 4 | 55.3/79.0 | 47.0/67.1 |
| | 8 | 45.6/78.2 | 37.0/64.9 |
| | 12 | 47.7/78.2 | 39.1/64.9 |

The first set of numbers in this table results from running 200 iterations with 200 rows and 1000 columns and printing out every 100th row. The second set shows what happens when we print out every 20th row instead: the amount of extra code prevents any reasonable scaling at all. This is clearly a case where careful design and use of an unsynchronized class can have a big benefit.
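The safety of the unsynchronized version rests entirely on its indexing contract: no two threads may ever pass the same index to print(). In the interchanged-loop tests this is easy to guarantee, because each thread owns its column range exclusively. Here is a hedged usage sketch, in which lp is assumed to be a field created with new LoopPrinter(nCols):

// A sketch of how loopDoRange() can use the unsynchronized LoopPrinter
// safely: the printer is indexed by column, and each thread owns its
// column range exclusively, so no index is ever shared between threads.
public void loopDoRange(int start, int end) {
    for (int i = start; i < end; i++) {
        for (int j = 1; j < nRows; j++) {
            // ... calculate lookupValues[j][i] as before ...
            if (j % 20 == 0) {
                lp.println(i, "col " + i + " row " + j +
                              " = " + lookupValues[j][i]);
            }
        }
    }
}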

We realize that this technique is at odds with our previous admonishments to produce thread-safe code, and we still recommend that you always start with thread-safe code. In cases like this one, however, when you take the extra care necessary to use the thread-unsafe code correctly, the scaling benefits may outweigh the effort required to code carefully enough to prevent any race conditions.
