Appendix D: An MPI Tutorial

In this appendix we present a tutorial on the use of MPI on a small Beowulf cluster composed of Unix or Linux computers.1 This follows our philosophy of “learning while doing.” Our presentation is meant to help the user from the ground up, something that might not be needed if you were working at a central computing center with a reasonable level of support. Although your problem is still to take the program you have written to generate the bifurcation plot for bug populations and run different ranges of µ values simultaneously on several CPUs, in a more immediate sense your task is to get the experience of running MPI, to understand some of the MPI commands within the programs, and then to run a timing experiment. In §D.9 at the end of the appendix we give a listing and a brief description of the MPI commands and data types. General information about MPI is given in [MPI], detailed information about the syntax of MPI commands appears in [MPI2], and other useful material can be found in [MPImis]. The standard reference on the C language is [K&R 88], although we prefer [OR]. MPI is very much the standard software protocol for parallel computing and is at a higher level than its predecessor PVM [PVM] (which has its own tutorial on the CD (available online: http://press.princeton.edu/landau_survey/)).

WHILE IN THE PAST we have run Java programs with a version of MPI, the differences in communication protocols used by MPI and Java have led to poor performance or to additional complications needed to improve performance [Fox 03]. In addition, you usually would not bother parallelizing a program unless it requires very large amounts of computing time, and those types of programs are usually written in Fortran or C (both for historical reasons and because Java is slower). So it makes sense for us to use Fortran or C for our MPI examples. We will use C because it is similar to Java.

D.1 Running on a Beowulf

A Beowulf cluster is a collection of independent computers each with its own memory and operating system that are connected to each other by a fast communication network over which messages are exchanged among processors. MPI is a library of commands that make communication between programs running on the different computers possible. The messages are sent as data contained in arrays. Because different processors do not directly access the memory on some other computer, when a variable is changed on one computer, it is not changed automatically in the other copies of the program running on other processors. This is an example of where MPI comes into play.

image

Figure D.1 A schematic view of a cluster (cloud) connected to front-end machines (box).

In Figure D.1 we show a typical, but not universal, configuration for a Beowulf cluster. Almost all clusters have the common feature of using MPI for communication among computers and Unix/Linux for the operating system. The cluster in Figure D.1 is shown within a cloud. The cloud symbolizes the grouping and connection of what are still independent computers communicating via MPI (the lines). The MPI_COMM_WORLD within the cloud is an MPI data type containing all the processors that are allowed to communicate with each other (in this case six). The box in Figure D.1 represents the front end or submit hosts. These are the computers from which users submit their jobs to the Beowulf and later work with the output from the Beowulf. We have placed the front-end computers outside the Beowulf cloud, although they could be within. This type of configuration frees the Beowulf cluster from administrative chores so that it can concentrate on number crunching, and is useful when there are multiple users on the Beowulf.

Finally, note that we have placed the letters “Sched” within the front-end machines. This represents a configuration in which these computers are also running some type of a scheduler, grid engine, or queueing system that oversees the running of jobs submitted to MPI by a number of users. For instance, if we have a cluster of 20 computers and user A requests 10 machines and user B requests 8 machines, then the grid engine will permit both users to run simultaneously and assign their jobs to different computers. However, if user A has requested 16 machines and user B 8, then the grid engine will make one of the users wait until the other finishes their work.

Some setup is required before you can run MPI on several computers at once. If someone has already done this for you, then you may skip the rest of this section and move on to §D.3. Our instructions have been run on a cluster of Sun computers running Solaris Unix (in a later section we discuss how to do this using the Torque scheduler on a Linux system). You will have to change the computer names and such for your purposes, but the steps should remain the same.

•  First you need to have an active account on each of the computers in the Beowulf cluster.

•  Open a shell on one of the machines designated for users to sign onto (the front end). You do not have to be sitting at the front end but instead can use ssh or telnet. Make the directory mpi in your home directory:

image
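  The elided commands amount to something like the following (assuming a standard Unix shell, with > standing for your shell prompt):

  > cd
  > mkdir mpi

  Here cd with no argument returns you to your home directory, and mkdir creates the mpi subdirectory there.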

•  You need to have your Beowulf account configured so that Unix can find the MPI commands that you issue from the command line or from your programs. When you log onto the computer, the operating system reads a configuration file .cshrc residing in your home directory. It contains the places where the operating system looks for commands. (We are assuming here that you are using either the C shell (csh) or tcsh; if not, then modify your .login file, which should work regardless of the shell.) When a file name begins with a dot, it is usually hidden from view when you list the files, but it can be seen with the command ls -la. The list of places where Unix looks for commands is an environmental variable called PATH, and it should include the current version of the mpich-n.m/bin directory where the scripts for MPI reside. For us this is

  /usr/local/cluster/mpich-1.2.6/bin      (this directory should be in your PATH)

  Here the directory name cluster and 1.2.6 may need to be replaced by the name and number on your local system.

•  Because the .cshrc file controls your environment, having an error in this file can lead to a nonfunctional computer. And since the format is rather detailed and unforgiving, it is easy to make mistakes. So before you work on your existing .cshrc file, make a backup copy of it:

  > cp .cshrc .cshrc_bk

  You can use this backup file as a reference or copy it back to .cshrc if things get to be too much of a mess. If you have really messed things up, your system administrator may have to copy the file back for you.

•  Edit your .cshrc file so that it contains a line in which setenv PATH includes /usr/local/cluster/mpich-1.2.6/bin. If you do not have a .cshrc file, just create one. Find a line containing setenv PATH and add this in after one of the colons, making sure to separate the path names with colons. As an example, the .cshrc file for user rubin is

image

•  If you are editing your .login file, enter as the last line in the file:

  set path = ($path /usr/local/cluster/mpich-1.2.6/bin)

•  Because dot files are read by the system when you first log on, you will have to log off and back on for your changes to take effect. (Alternatively, you can use the source command to avoid logging off and on.) Once you have logged back on, check the values of the PATH environmental variable:

image
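  One way to do the check (the exact output will depend on your system) is

  > echo $PATH

  The colon-separated list that is printed should now include the mpich-1.2.6/bin directory.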

•  Let us now take a look at what is done to the computers to have them run as a Beowulf cluster. On Unix systems the “slash” directory / is the root or top directory. Change the directory to /

image

You should see files there, such as the kernel and the devices, that are part of the operating system. You should not modify these files, as that could cause real problems (it is the sort of thing that hackers and system administrators do).

•  MPI is a local addition to the operating system. We have MPI and the Sun Grid Engine (SGE) in the /usr/local/cluster directory. Here the first / indicates the root directory and usr is the directory name under the root. Change the directory to /usr/local/cluster, or wherever MPI is kept on your system, and notice the directories scripts and mpich-1.2.6 (or maybe just a link to mpich). Feel free to explore these directories. The directory scripts contains various scripts designed to make running your MPI programs easier. (Scripts are small programs containing shell commands that are executed in order when the file is run.)

•  In the mpich-1.2.6 directory you will notice that there are examples in C, C++, and Fortran. Feel free to copy these to your home directory and try them:

  > cp -r examples /home/userid/mpi

where userid is your name. We encourage you to try out the examples, although some may need modification to work on your local system.

•  Further documentation can be found in

image

•  Copy the script run_mpi.sh from the Codes/MPIcodes directory on the CD (available online: http://press.princeton.edu/landau_survey/) to your personal mpi directory. This script contains the commands needed to run a program on the cluster.

•  Copy the file /usr/local/cluster/mpich/share/machines.solaris to your home directory and examine it. (The solaris extender is there because we are using the Solaris version of the Unix operating system on our Beowulf; you may need to change this for your local system.) This file contains a list of all the computers that are on the Beowulf cluster and available to MPI for use (though there is no guarantee that all the machines are operative):

image

D.2 Running MPI

If you are the only one working on a Beowulf cluster, then it may make sense to submit your jobs directly to MPI. However, if there is the possibility that a number of people may be using the cluster, or that you may be submitting a number of jobs to the cluster, then it is a good idea to use some kind of a queue management system to look after your jobs. This can avoid the inefficiency of having different jobs compete with each other for time and memory or having the entire cluster “hang” because a job has requested a processor that is not available. In this section we describe the use of the Sun Grid Engine [SGE]. In a later section we describe the use of the Torque/Portable Batch System (PBS) scheduler on a Linux system; the two are similar in purpose and commands, work under many operating systems, and are free.

On the left in Figure D.2 is a schematic view of how a C program containing MPI commands is executed. On the right in this figure is a schematic view of how a scheduling system takes an executable program and runs it under MPI on several systems. When a program is submitted to a cluster via a management system, the system installs a copy of the same program on each computer assigned to run the program.

There are a number of scripts that interpret the MPI commands you give within your programs (the commands are not part of the standard Fortran or C language), and then call the standard compilers. These scripts are called wrappers because they surround the standard compilers as a way of extending them to include MPI commands:

mpicc       C compiler
mpicxx      C++ compiler
mpif77      Fortran 77 compiler
mpif90      Fortran 90 compiler
mpiifort    Intel Fortran compilers
mpiicc      Intel C compiler

Typically you compile your programs on the front end of the Beowulf, or the master machines, but not on the execution nodes. You use these commands just as you use regular compiler commands, only now you may include MPI commands in your source program:

image

Figure D.2 Left: A schematic view of the MPI command MPI_COMM contained within the C program MPI.c. On the outer wrapper, the program is compiled with the shell command mpicc, which expands the MPI commands and invokes the C compiler cc. Right: A schematic view of how a scheduler runs the executable program MPIpi.o under MPI and on several Linux CPUs.

image

D.2.1 MPI under the SGE Queueing System

Table D.1 lists some of the key Sun Grid Engine commands used to execute compiled programs. Other queueing systems have similar commands. The usual method of executing a program at the prompt runs only on the local computer. In order to run on a number of machines, the program needs to be submitted to the queue management system. We show this in Listing D.1, which uses the run_mpi.sh script and the qsub command to submit jobs to run in batch mode:

image

This command returns a job ID number, which you should record to keep track of your program. Note that in order for this script to work, both runMPI.sh and the program name must be in the current directory. If you need to pass parameters to your program, place the program name as well as the parameters in quotes:

image
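For illustration only, such a submission might look like

  > qsub run_mpi.sh "TuneMPI 200"

where the program name TuneMPI and the parameter 200 are hypothetical, and the exact interface of the submission script may differ on your system.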

TABLE D.1
Some Common SGE Commands

Command               Action
qsub myscript         Submit batch script or job myscript
qhost                 Show job/host status
qalter <job_id>       Change parameters for job in queue
qdel <job_id>         Remove job_id
qstat                 Display status of batch jobs
qstat -f              Full listing for qstat
qstat -u <username>   User only for qstat
qmon                  X-Window front end (integrated functionality)

image

Listing D.1 The script runMPI.sh used to run an MPI program.

D.2.1.1 STATUS OF SUBMITTED PROGRAMS

After your program is successfully submitted, SGE places it in a queue where it waits for the requested number of processors to become available. SGE then executes the program on the cluster and directs the output to a file in the output subdirectory within your home directory. The program itself uses MPI and C/Fortran commands. In order to check the status of your submitted program, use qstat along with your job ID number:

image

job-ID  prior  name         user     state  submit/start at       queue  master  ja-task-ID
1263    0      Test_MPI_J   dhertel  qw     07/20/2005 12:13:51

This is a typical qstat output. The qw in the state column indicates that the program is in the queue and waiting to be executed.

image

Here the program has been assigned a set of nodes (the computers are named eigenN), with the last column indicating whether each node is a master, host, slave, or guest (to be discussed further in §D.3.1). At this point the state column will have either a t, indicating transfer, or an r, indicating running.

The output from your run is sent to the file Test_MPI.<jobID>.out in the output subdirectory within your home directory. Error messages are sent to a corresponding file in the error subdirectory. Of course you can still output to a file in the current working directory, as well as input from a file.

D.2.2 MPI Under the Torque/PBS Queueing System

Most Beowulf clusters use Linux, a version of the Unix operating system that runs on PC architecture. A popular, commercially supported version of Linux that runs well on CPUs using 64-bit words is SUSE [SUSE]. We have used this setup with MPI libraries, Intel compilers, and the cluster edition of the Math Kernel Library [Intel]. Although we could run the SGE scheduler and resource manager on this system, the compilers come with the Torque open source resource manager [Torque], which works quite well. Torque is based on and uses the same commands as the Portable Batch System [PBS]. In this section we give a tutorial on the use of Torque for submitting jobs to MPI.2 As we shall see, the steps and commands are very similar to those for SGE and follow the system outlined in Figure D.2.

D.2.2.1 RUNNING SERIAL JOBS WITH TORQUE

Sometimes you may have a serial job that runs too long for you to wait for its completion, or maybe you want to submit a number of long jobs and not have them compete with each other for resources. Either case can be handled by using the queueing system usually used for parallel processing, only now with multiple jobs on just one computer. In this case there are no MPI commands to deal with, but just three Torque commands and a shell script that initiates your program:

qsub      Submit jobs to queue via Torque
qstat     Check status of jobs in queue
qdel      Delete a submitted job
script    Shell script that initiates program

Note that you cannot give Torque a compiled binary program to run but rather just a shell script3 that calls your executable. This is probably for security and reliability. For example:

image

Here is a simple script for initiating the serial job in the file SerProg (you should copy this into a file so you can use it):

image

Observe the #!/bin/bash statement at the beginning of the script. A statement of this form is required to tell the operating system which shell (command-line interpreter) to use. This line is for the bash shell, with other choices including tcsh and ksh. The next command, cd $PBS_O_WORKDIR (which you should not modify), changes to the directory from which the job was submitted, whose name PBS stores in the environment variable PBS_O_WORKDIR; this is what makes the script look for files in your working directory.

The last command, ./SerProg, contains the name of the file you wish to execute and is the only line you should modify. (The ./ means the current directory, but you can also use ../SerProg for a file one directory up.)
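Putting the lines just described together, a minimal version of such a script looks like the following (a sketch only; the script on the CD may contain additional lines):

  #!/bin/bash
  # Change to the directory from which the job was submitted
  cd $PBS_O_WORKDIR
  # Run the serial executable; change this line to the name of your program
  ./SerProg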

As we indicated before, once you have submitted a job to Torque, you can log off (to get a mug of your favorite beverage) and the job will remain in the compute queue or continue to run. Alternatively, you may want to submit the job to a cluster machine different from the one on which you are working, log off from that one, and then continue working on your present machine. The job will run remotely with other users still able to run on that remote machine. (Other than a possible slowdown in speed, they may not even realize that they are sharing it with you.) Before you log off, it is probably a good idea to determine the status of your jobs in the queue. This is done with the qstat command:

image

This output indicates that Justin’s job has id 170 (the .phy indicates the server phy), that it was submitted via the script ScriptName, that it is in the state R for running (Q if queued, E if exiting), and that it is in the batch queue (default). You can delete your job by issuing the qdel command with your JobID.

If your program normally outputs to a file, then it will still output to that file under Torque. However, if your program outputs to stdout (the screen), the output will be redirected to the file ScriptName.oJobID and any errors will be redirected to the file ScriptName.eJobID. Here ScriptName is the shell script you used to submit the job, and JobID is the ID given to it by Torque. For example, here is the output from the long list command ll:

image

D.2.3 Running Parallel Jobs with Torque

The basic Torque steps for submitting a script and ascertaining its status are the same for parallel jobs as for serial jobs, only now the script must have more commands in it and must call MPI. The first thing that must be done is to create a multipurpose daemon (MPD) secretword (not your password) so that Torque can keep multiple jobs separate from all the others:

image

Note that you do not really need to remember your secretword because Torque uses it “behind the scenes”; you just need to have it stored in the file .mpd.conf in your home directory. (This file is not normally visible because it begins with a dot.)

image

Listing D.2 The script TorqueScript.sh used to submit an MPI program.

In Listing D.2 and on the CD (available online: http://press.princeton.edu/landau_survey/) we give the script TorqueScript.sh used to submit MPI jobs to the Torque scheduler. The script is submitted to Torque from a shell via the qsub command:

image

> qsub TorqueScript.sh

Observe again that the script must start with the line #!/bin/bash, to indicate which shell it should run under, and that the lines beginning with #PBS are commands to the Torque scheduler, not comments! The lines

myprog=MPIpi

mpirun -r ssh -n $NP $myprog

in the script run the compiled version of the program MPIpi.c, which is also on the CD (available online: http://press.princeton.edu/landau_survey/). Here NP is the total number of processors you want to run on, and its value is written into the script. You will need to change MPIpi to the name of the compiled program you wish to run. (As long as the line cd $PBS_O_WORKDIR precedes this one, Torque knows that your program is in the working directory.) The line

#PBS -l nodes=2:ppn=2

tells Torque to reserve two computers (nodes) with two processors per node (ppn) for a total of four processors. This value of ppn is appropriate for dual-core computers with two CPUs on their chips. If you have computers with four cores on each chip, then you should set ppn=4. If you want to use only one processor per machine, for example, to gauge the speedup from multiple cores, then set ppn=1. Even though we have reserved nodes*ppn processors for Torque to use, the actual number of processors used by MPI is given by the variable NP in the call

mpirun -r ssh -n $NP $myprog

Accordingly, we must set the value for NP as nodes*ppn within the script. The maximum wall clock time that your job can run is set to 15 min via

#PBS -l walltime=00:15:00

This is actual run time, not the total time in the queue, and so this clock does not start ticking until the job starts executing. In general it is a good idea to use a walltime command with about twice the time you expect your job to run just in case something goes wrong. (Not only is this a nice thing to do for others, but it can also keep you from wasting a finite resource or waiting around forever.) Next observe that the mpirun command, which starts MPI, has the argument -r ssh. This is required on our installation for the machines to be able to communicate with each other using ssh and scp rather than the default rsh and rcp. The latter are less secure. Finally, the script ends with exit 0. This gives the script exit status 0 and thus provides a graceful ending. Graceless endings may not give the operating system enough time to clear your output files from buffers, which means that you may not see all your output!
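Collecting the lines discussed in this section, a skeleton of such a submission script looks like the following (a sketch based on the discussion above; the actual TorqueScript.sh on the CD may contain additional #PBS directives):

  #!/bin/bash
  # Reserve 2 nodes with 2 processors per node, for at most 15 min of wall time
  #PBS -l nodes=2:ppn=2
  #PBS -l walltime=00:15:00
  # Run from the directory the job was submitted from
  cd $PBS_O_WORKDIR
  # Number of MPI processes = nodes*ppn
  NP=4
  myprog=MPIpi
  # Start MPI over ssh on NP processors
  mpirun -r ssh -n $NP $myprog
  # Exit gracefully so that output buffers are flushed
  exit 0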

D.3 Your First MPI Program: MPIhello.c

Listing D.3 gives the simple MPI program MPIhello.c. It has each of the processors print Hello World, followed by the processor’s rank. Compile MPIhello.c using:

image
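A typical invocation is the following (the exact flags may differ on your system):

  > mpicc MPIhello.c -o hello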

After successful compilation, an executable file hello should be in the directory in which you did the compilation. The program is executed via the script run_mpi.sh either directly or by use of the management command qsub:

image

This script sets up the running of the program on the 10 processors, with processor 0 the host and processors 1–9 the guests.

image

Listing D.3 MPIhello.c gets each processor to say hello via MPI.
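Because the full listing resides on the CD, here is a minimal sketch of a program of this kind; the actual MPIhello.c may differ in its details:

  #include "mpi.h"
  #include <stdio.h>

  int main(int argc, char *argv[]) {
     int myrank;                                /* this processor's rank */
     MPI_Init(&argc, &argv);                    /* start MPI */
     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);    /* MPI assigns each processor a rank */
     printf("Hello World from processor %d\n", myrank);
     MPI_Finalize();                            /* stop MPI */
     return 0;
  }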

D.3.1 MPIhello.c Explained

Here is what is contained in MPIhello.c:

•  The inclusion of MPI headers via the #include “mpi.h” statement on lines 2–3. These are short files that assist the C compiler by telling it the type of arguments that MPI functions use for input and output without giving any details about the functions. (In Fortran we used include “/usr/local/cluster/mpich-1.2.6/include/mpif.h” after the program line.)

•  The main method is declared with an int main(int argc, char *argv[]) statement, where argc is a pointer to the number of arguments and argv is a pointer to the argument vector passed to main when you run the program from a shell. (Pointers are variable types that give the locations in memory where the values of the variables reside rather than the variables’ actual values.) These arguments are passed to MPI to tell it how many processors you desire.

•  The int myrank statement declares the variable myrank, which stands for the rank of the computer. Each processor running the program is assigned a unique number called its rank by MPI. This is how you tell the difference among identical programs running on different CPUs.

•  The processor that executes the highest level of the program is called the host or master, and all other machines are called guests or slaves. The host always has myrank = 0, while all the other processors, based on who responds first, have their processor numbers assigned to myrank. This means that myrank = 1 for the first guest to respond, 2 for the second, and so on. Giving each processor a unique value for myrank is a critical element in parallel processing.

•  The MPI_Init() and MPI_Finalize() commands in MPIhello.c initialize and terminate MPI, respectively. All MPI programs must have these lines, with the MPI commands always placed between them. The MPI_Init(&argc, &argv) function call takes two arguments, both beginning with a & that indicates a pointer. These arguments are used for communication between the operating system and MPI.

•  The MPI_Comm_rank(MPI_COMM_WORLD, &myrank) call returns a different value for rank for each processor running the program. The first argument is a predefined constant telling MPI which grouping of processors to communicate with. Unless you have set up groups of processors, just use the default communicator MPI_COMM_WORLD. The second argument is an integer that is returned with the rank of the individual program.

When MPIhello.c is executed, each processor prints its rank to the screen. Notice that it does not print the ranks in order and that the order will probably be different each time you run the program. Take a look at the output (in the file output/MPI_job-xxxx). It should look something like this:

image

If the processing order matters for proper execution, call MPI_Barrier(MPI_COMM_WORLD) to synchronize the processors. It is similar to inserting a starting line at a relay race; a processor stops and waits at this line until all the other processors reach it, and then they all set off at the same time. However, modern programming practice suggests that you try to design programs so that the processors do not have to synchronize often. Having a processor stop and wait obviously slows down the number crunching and consequently removes some of the advantage of parallel computing. However, as a scientist it is more important to have correct results than fast ones, and so do not hesitate to insert barriers if needed.

Exercise: Modify MPIhello.c so that only the guest processors say hello. Hint: What do the guest processors all have in common?

image

D.3.2 Send/Receive Messages: MPImessage2.c

Sending and receiving data constitute the heart of parallel computing. Guest processors need to transmit the data they have processed back to the host, and the host has to assemble the data and then assign new work to the guests. An important aspect of MPI communication is that if one processor sends data, another processor must receive those data. Otherwise, the sending processor may wait indefinitely for a signal that its data have been received or the receiving processor may wait indefinitely until it receives the data it is programmed to expect.

Argument Name    Use in MPI_Send and MPI_Recv
msg              Pointer (& in front) to array to be sent/received
msg_size         Size of array sent; may be bigger than actual size
MPI_TYPE         Predefined constant indicating variable type within array; other possible constants: MPI_INTEGER, MPI_DOUBLE
dest             Rank of processor receiving message
tag              Number that uniquely identifies a message
comm             A communicator, for example, predefined constant MPI_COMM_WORLD
source           Rank of processor sending message; if receiving messages from any source, use predefined constant MPI_ANY_SOURCE
status           Pointer to variable type MPI_Status containing status info

image

Listing D.4 MPImessage2.c uses MPI commands to both send and receive messages. Note the possibility of blocking, in which the program waits for a message.

There is a basic MPI command MPI_Send to send a message from a source node, and another basic command MPI_Recv is needed for a destination node to receive it. The message itself must be an array even if there is only one element in the array. We see these commands in use in MPImessage2.c in Listing D.4. This program accomplishes the same thing as MPIhello.c but with send and receive commands.

The host sends the message and prints out a message, while the guests print out when they receive a message. The forms of the commands are

image
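For reference, the standard C prototypes of these two commands are as follows (these are the generic MPI bindings, with argument names chosen to match the table above):

  int MPI_Send(void *msg, int msg_size, MPI_Datatype MPI_TYPE, int dest,
               int tag, MPI_Comm comm);
  int MPI_Recv(void *msg, int msg_size, MPI_Datatype MPI_TYPE, int source,
               int tag, MPI_Comm comm, MPI_Status *status);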

The arguments and their descriptions are given in the table at the start of this section. The criteria for successfully sending and receiving a message are

1.  The sender must specify a valid destination rank, and the processor with that rank must call MPI_Recv.

2.  The receiver must specify a valid source rank or MPI_ANY_SOURCE.

3.  The send and receive communicators must be the same.

4.  The tags must match.

5.  The receiver’s message array must be large enough to hold the array being sent.

Exercise: Modify MPImessage2.c so that all processors say hello.

image

D.3.3 Receive More Messages: MPImessage3.c

image

Listing D.5 MPImessage3.c contains MPI commands that have each guest processor send a message to the host processor, which then prints out the rank of that guest.

A bit more advanced use of message passing is shown by MPImessage3.c in Listing D.5. Here each guest sends a message to the host, which then prints out the rank of the guest that sent the message. The host loops through all the guests since otherwise it would stop looking for more messages after the first one arrives. The host calls MPI_Comm_size to determine the number of processors.
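Here is a sketch of the logic just described (illustrative only; the tag value and variable names are ours, and the actual listing on the CD differs in detail):

  #include "mpi.h"
  #include <stdio.h>

  int main(int argc, char *argv[]) {
     int myrank, numprocs, i, sender;
     MPI_Status status;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);            /* how many processors */
     if (myrank == 0)                                     /* host loops over guests */
        for (i = 1; i < numprocs; i++) {
           MPI_Recv(&sender, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
           printf("Message received from guest %d\n", sender);
        }
     else                                                 /* each guest sends its rank */
        MPI_Send(&myrank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
     MPI_Finalize();
     return 0;
  }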

D.3.4 Broadcast Messages

If we used the same technique to send a message from one node to several other nodes, we would have to loop over calls to MPI_Send. In MPIpi.c in Listing D.6, we see an easy way to send a message to all the other nodes.

image

Listing D.6 MPIpi.c uses a number of processors to compute π by a Monte Carlo rejection (stone throwing).

This simple program computes π in parallel using the Monte Carlo “stone throwing” technique discussed in Chapter 5, “Monte Carlo Simulation.” Notice the new MPI commands:

•  MPI_Wtime is used to return the wall time in seconds (the time as given by a clock on the wall). This is useful when computing speedup curves (Figure D.3).

image

Figure D.3 Execution time versus number of processors. Left: For the calculation of π with MPIpi.c. Right: For the solution of an eigenvalue problem with TuneMPI.c. Note that the single-processor result here does not include the overhead for running MPI.

•  MPI_Bcast sends out data from one processor to all the others. In our case the host broadcasts the number of iterations to the guests, which in turn replace their current values of n with the one received from the host.

•  MPI_Allreduce is a glorified broadcast command. It collects the values of the variable mypi from each of the processors, performs an operation on them with MPI_SUM, and then broadcasts the result via the variable pi.
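To show how these commands fit together, here is a skeleton of the stone-throwing calculation (a sketch only; the random-number handling and variable names in the actual MPIpi.c may differ):

  #include "mpi.h"
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char *argv[]) {
     int myrank, numprocs, i, n = 1000000, hits = 0;
     double x, y, mypi, pi, time0;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
     MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
     time0 = MPI_Wtime();                                /* wall time at start */
     MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);       /* host broadcasts n to all */
     srand48(myrank + 1);                                /* different seed on each rank */
     for (i = 0; i < n/numprocs; i++) {                  /* this rank's share of stones */
        x = drand48();  y = drand48();
        if (x*x + y*y <= 1.0) hits++;
     }
     mypi = 4.0*hits/(double)n;                          /* partial contribution to pi */
     /* sum the partial values of mypi from all ranks; the result pi goes to every rank */
     MPI_Allreduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
     if (myrank == 0)
        printf("pi = %f, wall time = %f s\n", pi, MPI_Wtime() - time0);
     MPI_Finalize();
     return 0;
  }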

D.3.5 Exercise

On the left in Figure D.3 we show our results for the speedup obtained by calculating π in parallel with MPIpi.c. This exercise leads you through the steps required to obtain your own speedup curve:

1.  Two versions of a parallel program are possible. In the active host version the host acts just like a guest and does some work. In the lazy host version the host does no work but instead just controls the action. Does MPIpi.c contain an active or a lazy host? Change MPIpi.c to the other version and record the difference in execution times.

2.  Make a plot of the time versus the number of processors required for the calculation of π.

3.  Make a speedup plot, that is, a graph of the computation time divided by the time for one processor versus the number of processors.

4.  Record how long each of your runs takes and how accurate the answers are. Does round-off error enter in? What could you do to get a more accurate value for π?

image

D.4 Parallel Tuning

Recall the Tune program with which we experimented in Chapter 14, “High-Performance Computing Hardware, Tuning, and Parallel Computing,” to determine how memory access for a large matrix affects the running time of programs. You may also recall that as the size of the matrix was made larger, the execution time increased more rapidly than the number of operations the program had to perform, with the increase coming from the time it took to transfer the needed matrix elements in and out of central memory.

Because parallel programming on a multiprocessor also involves a good deal of data transfer, the Tune program is also a good teaching tool for seeing how communication costs affect parallel computations. Listing D.7 gives the program TuneMPI.c, which is a modified version of the Tune program in which each row of the large-matrix multiplication is performed on a different processor using MPI:

image

Here the arrows indicate how each row of H is multiplied by the single column of Ψ, with the multiplication of each row performed on a different processor (rank). The assignment of rows to processors continues until we run out of processors, and then it starts all over again. Since this multiplication is repeated for a number of iterations, this is the most computationally intensive part of the program, and so it makes sense to parallelize it.

On the right in Figure D.3 is the speedup curve we obtained by running TuneMPI.c. However, even if the matrix is large, the Tune program is not computationally intensive enough to overcome the cost of communication among nodes inherent in parallel computing. Consequently, to increase computing time we have inserted an inner for loop over k that takes up time but accomplishes nothing (we’ve all had days like that). Slowing down the program should help make the speedup curve more realistic.

image

image

image

Listing D.7 The C program TuneMPI.c is a parallel version of Tune.java, which we used to test the effects of various optimization modifications.

D.4.0.1 TUNEMPI.C EXERCISE

1.  Compile TuneMPI.c:

image
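A typical command line is the following (details may vary on your system):

  > mpicc TuneMPI.c -lm -o TuneMPI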

Here -lm loads the math library and -o places the executable in the file TuneMPI. This is the base program. It will use one processor as the host and another one to do the work.

2.  To determine the speedup with multiple processors, you need to change the run_mpi.sh script. Open it with an editor and find a line of the form

image

The last number on this line tells the cluster the maximum number of processors to use. Change this to the number of processors you want to use. Use a number from 2 to 10; starting with one processor leads to an error message, as that leaves no processor to do the work. After changing run_mpi.sh, run the program on the cluster. With the SGE management system this is done via

image

3.  You are already familiar with the scalar version of the Tune program. Find the scalar version of Tune.c (and add the extra lines to slow the program down) or modify the present one so that it runs on only one processor. Run the scalar version of TuneMPI and record the time it takes. Because there is overhead associated with running MPI, we expect the scalar program to be faster than an MPI program running on a single processor.

4.  Open another window and watch the processing of your MPI jobs on the host computer. Check that all the temporary files are removed.

5.  You now need to collect data for a plot of running time versus number of machines. Make sure your matrix size is large, say, with N=200 and up. Run TuneMPI on a variable number of machines, starting at 2, until you find no appreciable speedup (or an actual slowdown) with an increasing number of machines.

6.  Warning: While you will do no harm running on the Beowulf when others are also running on it, in order to get meaningful, repeatable speedup graphs, you should have the cluster all to yourself. Otherwise, the time it takes to switch jobs around and to set up and drop communications may slow down your runs significantly. A management system should help with this. If you are permitted to log in directly to the Beowulf machines, you can check what is happening via who:

image

7.  Increase the matrix size in steps and record how this affects the speedups. Remember, once the code is communications-bound, distributing it over many processors probably will make it run slower, not faster!

D.5 A String Vibrating in Parallel

The program MPIstring.c given in Listing D.8 is a parallel version of the solution of the wave equation (eqstring.c) discussed in Chapter 18, “PDE Waves: String, Wave Packet, and Electromagnetic.” The algorithm calculates the future ([2]) displacement of a given string element from the present ([1]) displacements immediately to the left and right of that section, as well as the present and past ([0]) displacements of that element. The program is parallelized by assigning different sections of the string to different nodes.

image

image

Listing D.8 MPIstring.c solves the wave equation for a string using several processors via MPI commands.

Notice how the MPI_Recv() and MPI_Send() commands require a pointer (for example, the address of an array element) as their first argument. When sending more than one element of an array to MPI_Send(), send a pointer to the first element of the array as the first argument, and then the number of elements as the second argument. Observe near the end of the program how the MPI_Send() call is used to send len + 1 elements of the 2-D array x[][], starting with the element x[2][0]. MPI sends these elements in the order in which they are stored in memory. In C, arrays are stored in row-major order, with the first element of the second row immediately following the last element of the first row, and so on. Accordingly, this MPI_Send() call sends len + 1 elements of row 2, starting with column 0, which means all of row 2. If we had specified len + 5 elements instead, MPI would have sent all of row 2 plus the first four elements of row 3.
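As a concrete illustration (a sketch, assuming the displacements are stored as doubles; the tag value 3 and the destination rank 0 are those used for the slave-to-master transfer discussed below):

  MPI_Send(&x[2][0], len + 1, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD);   /* send all of row 2 to the master */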

In MPIstring.c, the calculated future position of the string is stored in row 2 of the array x[3][len + 1], with different sections of the string stored in different columns. Row 1 stores the present positions, and row 0 stores the past positions. This differs from the serial program eqstring.c which, following the original Fortran program, was column- rather than row-based; the change was made because MPI reads data by rows. The initial displacement of the string is given by the user-supplied function init_string(). Because the first time step requires only the initial displacement, no message passing is necessary. For later times, each node sends the displacement of its rightmost point to the node with the next higher rank. This means that the rightmost node (rank = numprocs - 1) does not send data and that the master (rank = 0) does not receive data. Communication is established via the MPI_Sendrecv() command, with the different sends and receives using tags to ensure proper delivery of the messages.

Next in the program, the nodes (representing string segments) send to and receive data from the segment to their right. All these sends and receives have a tag of 2. After every 50 iterations, the master collects the displacement of each segment of the string and outputs it. Each slave sends the data for the future time with a tag of 3. The master first outputs its own data and then calls MPI_Recv() for each node, one at a time, printing the data it receives.

D.5.1 MPIstring.c Exercise

1.  Ensure that the input data (maxt, L, scale, skip) in MPIstring.c are the same as those in eqstring.c.

2.  Ensure that init_string() returns the initial configuration used in eqstring.c.

3.  Compile and run eqstring.c via

image

4.  Run both programs to ensure that they produce the same output. (In Unix this is easy to check with the diff command.)

image

image

Listing D.9 The MPI program MPIdeadlock.c illustrates deadlock (waiting to receive). The code MPIdeadlock-fixed.c on the CD removes the block.

D.6 Deadlock

It is important to avoid deadlock when using the MPI_Send() and MPI_Recv() commands. Deadlock occurs when one or more nodes wait for a nonoccurring event to take place. This can arise if each node waits to receive a message that is not sent. Compile and execute MPIdeadlock.c:

image

Take note of the job ID returned, which we will call xxxx. Wait a few seconds and then look at the output of the program:

image

The output should list how many nodes (slots) were assigned to the job. Because all these nodes are now deadlocked, we have to cancel this job:

image

There are a number of ways to avoid deadlock. The program MPIstring.c used the function MPI_Sendrecv() to handle much of the message passing, and this does not cause deadlock. It is possible to use MPI_Send() and MPI_Recv(), but you should be careful to avoid deadlock, as we do in MPIdeadlock-fixed.c on the CD.

image

D.6.1 Nonblocking Communication

MPI_Send() and MPI_Recv(), as we have said, are susceptible to deadlock because they block the program from continuing until the send or receive is finished. This method of message passing is called blocking communication. One way to avoid deadlock is to use nonblocking communication such as MPI_Isend(), which returns before the send is complete and thus frees up the node. Likewise, a node can call MPI_Irecv(), which does not wait for the receive to be completed. Note that a node can receive a message with MPI_Recv() even if it was sent using MPI_Isend(), and similarly, receive a message using MPI_Irecv() even if it was sent with MPI_Send().

There are two ways to determine whether a nonblocking send or receive has finished. One is to call MPI_Test(), which reports whether the communication is complete without making the program wait. The other is to call MPI_Wait(), which blocks execution until the communication is complete; use it when you must be sure the data have arrived (for example, to ensure that all string segments are at the current time and not a past time). When you start a nonblocking send or receive, you get a request handle of data type MPI_Request to identify the communication you may need to test or wait for. The price of nonblocking communication is that you must not use the data being communicated until the communication has completed, which you can check with MPI_Test() or guarantee with MPI_Wait().
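A minimal sketch of the nonblocking pattern looks like this (illustrative only; the buffer size, tag, and source are hypothetical):

  MPI_Request request;
  MPI_Status status;
  double buffer[100];
  /* post a receive that returns immediately */
  MPI_Irecv(buffer, 100, MPI_DOUBLE, MPI_ANY_SOURCE, 2, MPI_COMM_WORLD, &request);
  /* ... do other useful work here that does not touch buffer ... */
  MPI_Wait(&request, &status);      /* block only when the data are actually needed */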

Exercise: Rewrite MPIdeadlock.c so that it avoids deadlock by using nonblocking communication. Hint: Replace MPI_Recv() by MPI_Irecv().

image

D.6.2 Collective Communication

MPI contains several commands that automatically and simultaneously exchange messages among multiple nodes. This is called collective communication, in contrast to point-to-point communication between two nodes. The program MPIpi.c has already introduced the MPI_Reduce() command. It receives data from multiple nodes, performs an operation on the data, and stores the result on one node. The program TuneMPI.c used a similar function MPI_Allreduce() that does the same thing but stores the result on every node. The latter program also used MPI_Bcast(), which allows a node to send the same message to multiple nodes.

Collective commands communicate among a group of nodes specified by a communicator, such as MPI_COMM_WORLD. For example, in MPIpi.c we called MPI_Reduce() to receive results from every node, including itself. Other collective communication functions include MPI_Scatter(), which has one node send messages to every other node. This is similar to MPI_Bcast(), but the former sends a different message to each node by breaking up an array into pieces of specified lengths and sending the pieces to nodes. Likewise, MPI_Gather() gathers data from every node (including the root node) and places it in an array, with data from node 0 placed first, followed by node 1, and so on. A similar function, MPI_Allgather(), stores the data on every node rather than just the root node.
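As an illustration of the gather pattern (a sketch with hypothetical variable names; it assumes no more than 64 nodes):

  int myrank, numprocs;
  double mydata, alldata[64];
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  mydata = 2.0*myrank;                       /* each node's contribution */
  /* collect one double from every node into alldata on node 0,
     ordered by rank: node 0 first, then node 1, and so on */
  MPI_Gather(&mydata, 1, MPI_DOUBLE, alldata, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);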

D.7 Bootable Cluster CD

One of the difficulties in learning how to parallel compute is the need for a parallel computer. Even though there may be many computers around that you may be able to use, knitting them all together into a parallel machine takes time and effort. However, if your interest is in learning about and experiencing distributed parallel computing, and not in setting up one of the fastest research machines in the world, then there is an easy way. It is called a bootable cluster CD (BCCD) and is a file on a CD. When you start your computer with the CD in place, you are given the option of having the computer ignore your regular operating system and instead boot from the CD into a preconfigured distributed computing environment. The new system does not change your system but rather is a nondestructive overlay on top of the existing hardware that runs a full-fledged parallel computing environment on just about any workstation-class system, including Macs. You boot up every machine you wish to have on your cluster this way, and if needed, set up domain name system (DNS) and dynamic host configuration protocol (DHCP) servers, which are also included. Did we mention that the system is free? [BCCD]

D.8 Parallel Computing Exercises

1.  Bifurcation plot: If you have not yet done so, take the program you wrote to generate the bifurcation plot for bug populations and run different ranges of μ values simultaneously on several CPUs.

2.  Processor ring: Write a program in which

a.  a set of processors are arranged in a ring.

b.  each processor stores its rank in MPI_COMM_WORLD in an integer.

c.  each processor passes this rank on to its neighbor on the right.

d.  each processor keeps passing until it receives its rank back.

3.  Ping pong: Write a program in which two processors repeatedly pass a message back and forth. Insert timing calls to measure the time taken for one message and determine how the time taken varies with the size of the message.

4.  Broadcast: Have processor 1 send the same message to all the other processors and then receive messages of the same length from all the other processors. How does the time taken vary with the size of the messages and the number of processors?

D.9 List of MPI Commands

MPI Data Types and Operators

MPI defines some of its own data types. The following are data types used as arguments to MPI commands.

•  MPI_Comm: A communicator, used to specify a group of nodes, most commonly MPI_COMM_WORLD for all the nodes.

•  MPI_Status: A variable holding status information returned by functions such as MPI_Recv().

•  MPI_Datatype: A predefined constant indicating the type of data being passed in a function such as MPI_Send() (see below).

•  MPI_Op: A predefined constant indicating the operation to be performed on data in functions such as MPI_Reduce() (see below).

•  MPI_Request: A request handle to identify a nonblocking send or receive, for example, when using MPI_Wait() or MPI_Test().

Predefined Constants: MPI_Op and MPI_Datatype

For a more complete list of constants used in MPI, see http://www-unix.mcs.anl.gov/mpi/www/www3/Constants.html.

MPI_Op        Description                MPI_Datatype          C Data Type
MPI_MAX       Maximum                    MPI_CHAR              char
MPI_MIN       Minimum                    MPI_SHORT             short
MPI_SUM       Sum                        MPI_INT               int
MPI_PROD      Product                    MPI_LONG              long
MPI_LAND      Logical and                MPI_FLOAT             float
MPI_BAND      Bitwise and                MPI_DOUBLE            double
MPI_LOR       Logical or                 MPI_UNSIGNED_CHAR     unsigned char
MPI_BOR       Bitwise or                 MPI_UNSIGNED_SHORT    unsigned short
MPI_LXOR      Logical exclusive or       MPI_UNSIGNED          unsigned int
MPI_BXOR      Bitwise exclusive or       MPI_UNSIGNED_LONG     unsigned long
MPI_MINLOC    Find node’s min and rank
MPI_MAXLOC    Find node’s max and rank

MPI Commands

Below we list and identify the MPI commands used in this appendix. For the syntax of each command, along with many more MPI commands, see http://www-unix.mcs.anl.gov/mpi/www/, where each command is a hyperlink to a full description.

Basic MPI Commands

•  MPI_Send: Sends a message.

•  MPI_Recv: Receives a message.

•  MPI_Sendrecv: Sends and receives a message.

•  MPI_Init: Starts MPI at the beginning of the program.

•  MPI_Finalize: Stops MPI at the end of the program.

•  MPI_Comm_rank: Determines a node’s rank.

•  MPI_Comm_size: Determines the number of nodes in the communicator.

•  MPI_Get_processor_name: Determines the name of the processor.

•  MPI_Wtime: Returns the wall time in seconds since an arbitrary time in the past.

•  MPI_Barrier: Blocks until all the nodes have called this function.

Collective Communication

•  MPI_Reduce: Performs an operation on all copies of a variable and stores the result on a single node.

•  MPI_Allreduce: Like MPI_Reduce, but stores the result on all the nodes.

•  MPI_Gather: Gathers data from a group of nodes and stores them on one node.

•  MPI_Allgather: Like MPI_Gather but stores the result on all the nodes.

•  MPI_Scatter: Sends different data to all the other nodes (opposite of MPI_Gather).

•  MPI_Bcast: Sends the same message to all the other processors.

Nonblocking Communication

•  MPI_Isend: Begins a nonblocking send.

•  MPI_Irecv: Begins a nonblocking receive.

•  MPI_Wait: Waits for an MPI send or receive to be completed.

•  MPI_Test: Tests for the completion of a send or receive.

1 This material was developed with the help of Kristopher Wieland, Kevin Kyle, Dona Hertel, and Phil Carter. Some of the other materials derive from class notes from the Ohio Supercomputer Center, which were written in part by Steve Gordon.

2 We thank Justin Elser for setting up this system and for preparing the original form of this section.

3 Recall that a shell script is just a file containing commands that would otherwise be entered at a shell’s prompt. When the file name is given execution permission and entered as a command, all the commands within the file are executed.
