Chapter 8. Dense Matrix Algorithms

Algorithms involving matrices and vectors are applied in several numerical and non-numerical contexts. This chapter discusses some key algorithms for dense or full matrices that have no or few known usable zero entries. We deal specifically with square matrices for pedagogical reasons, but the algorithms in this chapter, wherever applicable, can easily be adapted for rectangular matrices as well.

Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data-decomposition (Section 3.2.2). Depending on the computation at hand, the decomposition may be induced by partitioning the input, the output, or the intermediate data. Section 3.4.1 describes in detail the various schemes of partitioning matrices for parallel computation. The algorithms discussed in this chapter use one- and two-dimensional block, cyclic, and block-cyclic partitionings. For the sake of brevity, we will henceforth refer to one- and two-dimensional partitionings as 1-D and 2-D partitionings, respectively.

Another characteristic of most of the algorithms described in this chapter is that they use one task per process. As a result of a one-to-one mapping of tasks to processes, we do not usually refer to the tasks explicitly and decompose or partition the problem directly into processes.

Matrix-Vector Multiplication

This section addresses the problem of multiplying a dense n × n matrix A with an n × 1 vector x to yield the n × 1 result vector y. Algorithm 8.1 shows a serial algorithm for this problem. The sequential algorithm requires n2 multiplications and additions. Assuming that a multiplication and addition pair takes unit time, the sequential run time is

Equation 8.1. 

W = n^2

At least three distinct parallel formulations of matrix-vector multiplication are possible, depending on whether rowwise 1-D, columnwise 1-D, or a 2-D partitioning is used.

Algorithm 8.1. A serial algorithm for multiplying an n × n matrix A with an n × 1 vector x to yield an n × 1 product vector y.

1.   procedure MAT_VECT ( A, x, y) 
2.   begin 
3.      for i := 0 to n − 1 do 
4.      begin 
5.         y[i]:=0; 
6.         for j := 0 to n − 1 do 
7.             y[i] := y[i] + A[i, j] × x[j]; 
8.      endfor; 
9.   end MAT_VECT 
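For concreteness, Algorithm 8.1 translates directly into the following C routine (a sketch; the function and variable names are ours, and the matrix is assumed to be stored in row-major order):

/* Serial matrix-vector multiplication y = A*x for an n x n matrix A
 * stored in row-major order; a direct transcription of Algorithm 8.1. */
void mat_vect(const double *A, const double *x, double *y, int n)
{
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];   /* one multiply-add pair per element */
    }
}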

Rowwise 1-D Partitioning

This section details the parallel algorithm for matrix-vector multiplication using rowwise block 1-D partitioning. The parallel algorithm for columnwise block 1-D partitioning is similar (Problem 8.2) and has a similar expression for parallel run time. Figure 8.1 describes the distribution and movement of data for matrix-vector multiplication with block 1-D partitioning.


Figure 8.1. Multiplication of an n × n matrix with an n × 1 vector using rowwise block 1-D partitioning. For the one-row-per-process case, p = n.

One Row Per Process

First, consider the case in which the n × n matrix is partitioned among n processes so that each process stores one complete row of the matrix. The n × 1 vector x is distributed such that each process owns one of its elements. The initial distribution of the matrix and the vector for rowwise block 1-D partitioning is shown in Figure 8.1(a). Process Pi initially owns x[i] and A[i, 0], A[i, 1], ..., A[i, n−1] and is responsible for computing y[i]. Vector x is multiplied with each row of the matrix (Algorithm 8.1); hence, every process needs the entire vector. Since each process starts with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processes. Figure 8.1(b) illustrates this communication step. After the vector x is distributed among the processes (Figure 8.1(c)), process Pi computes y[i] = Σ_{j=0}^{n−1} A[i, j] × x[j] (lines 6 and 7 of Algorithm 8.1). As Figure 8.1(d) shows, the result vector y is stored exactly the way the starting vector x was stored.

Parallel Run Time

Starting with one vector element per process, the all-to-all broadcast of the vector elements among n processes requires time Θ(n) on any architecture (Table 4.1). The multiplication of a single row of A with x is also performed by each process in time Θ(n). Thus, the entire procedure is completed by n processes in time Θ(n), resulting in a process-time product of Θ(n2). The parallel algorithm is cost-optimal because the complexity of the serial algorithm is Θ(n2).

Using Fewer than n Processes

Consider the case in which p processes are used such that p < n, and the matrix is partitioned among the processes by using block 1-D partitioning. Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p. Since the vector x must be multiplied with each row of the matrix, every process needs the entire vector (that is, all the portions belonging to separate processes). This again requires an all-to-all broadcast as shown in Figure 8.1(b) and (c). The all-to-all broadcast takes place among p processes and involves messages of size n/p. After this communication step, each process multiplies its n/p rows with the vector x to produce n/p elements of the result vector. Figure 8.1(d) shows that the result vector y is distributed in the same format as that of the starting vector x.
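A minimal MPI sketch of this formulation is shown below. It assumes that n is divisible by p and expresses the all-to-all broadcast of the vector portions with MPI_Allgather; the function and buffer names are illustrative, not part of the original text.

#include <mpi.h>
#include <stdlib.h>

/* Rowwise block 1-D matrix-vector multiplication: each process holds
 * n/p consecutive rows of A and n/p elements of x, and produces the
 * corresponding n/p elements of y.  Sketch only; assumes p divides n. */
void mat_vect_1d(const double *A_local,   /* (n/p) x n rows of A        */
                 const double *x_local,   /* n/p elements of x          */
                 double *y_local,         /* n/p elements of y (output) */
                 int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nlocal = n / p;

    /* All-to-all broadcast of the vector: afterwards every process
     * holds the entire n-element vector x. */
    double *x_full = malloc(n * sizeof(double));
    MPI_Allgather(x_local, nlocal, MPI_DOUBLE,
                  x_full, nlocal, MPI_DOUBLE, comm);

    /* Each process multiplies its n/p rows with the full vector. */
    for (int i = 0; i < nlocal; i++) {
        y_local[i] = 0.0;
        for (int j = 0; j < n; j++)
            y_local[i] += A_local[i * n + j] * x_full[j];
    }
    free(x_full);
}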

Parallel Run Time

According to Table 4.1, an all-to-all broadcast of messages of size n/p among p processes takes time ts log p + tw(n/p)(p − 1). For large p, this can be approximated by ts log p + twn. After the communication, each process spends time n2/p multiplying its n/p rows with the vector. Thus, the parallel run time of this procedure is

Equation 8.2. 

T_P = n^2/p + t_s log p + t_w n

The process-time product for this parallel formulation is n2 + ts p log p + twnp. The algorithm is cost-optimal for p = O(n).

Scalability Analysis

We now derive the isoefficiency function for matrix-vector multiplication along the lines of the analysis in Section 5.4.2 by considering the terms of the overhead function one at a time. Consider the parallel run time given by Equation 8.2 for the hypercube architecture. The relation To = pTP − W gives the following expression for the overhead function of matrix-vector multiplication on a hypercube with block 1-D partitioning:

Equation 8.3. 

T_o = t_s p log p + t_w np

Recall from Chapter 5 that the central relation that determines the isoefficiency function of a parallel algorithm is W = KTo (Equation 5.14), where K = E/(1 − E) and E is the desired efficiency. Rewriting this relation for matrix-vector multiplication, first with only the ts term of To,

Equation 8.4. 

W = K t_s p log p

Equation 8.4 gives the isoefficiency term with respect to message startup time. Similarly, for the tw term of the overhead function,

W = K t_w np

Since W = n2 (Equation 8.1), we derive an expression for W in terms of p, K, and tw (that is, the isoefficiency function due to tw) as follows:

Equation 8.5. 

n^2 = K t_w np
n = K t_w p
W = n^2 = K^2 t_w^2 p^2

Now consider the degree of concurrency of this parallel algorithm. Using 1-D partitioning, a maximum of n processes can be used to multiply an n × n matrix with an n × 1 vector. In other words, p is O(n), which yields the following condition:

Equation 8.6. 

p = O(n)
W = n^2 = Ω(p^2)

The overall asymptotic isoefficiency function can be determined by comparing Equations 8.4, 8.5, and 8.6. Among the three, Equations 8.5 and 8.6 give the highest asymptotic rate at which the problem size must increase with the number of processes to maintain a fixed efficiency. This rate of Θ(p2) is the asymptotic isoefficiency function of the parallel matrix-vector multiplication algorithm with 1-D partitioning.
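The isoefficiency relation can also be checked numerically. The short C program below evaluates the efficiency E = W/(pTP) implied by Equation 8.2 while growing the problem as n = p, so that W = n2 grows as Θ(p2). The values chosen for ts and tw are arbitrary assumptions used only for illustration.

#include <stdio.h>
#include <math.h>

/* Efficiency implied by Equation 8.2: E = W / (p * T_P), where W = n^2
 * and T_P = n^2/p + t_s log p + t_w n.  t_s and t_w are assumed values. */
int main(void)
{
    const double ts = 10.0, tw = 1.0;          /* illustrative constants */
    for (int p = 4; p <= 1024; p *= 4) {
        double n  = p;                         /* grow W = n^2 as Theta(p^2) */
        double W  = n * n;
        double Tp = n * n / p + ts * log2((double)p) + tw * n;
        printf("p = %4d   n = %6.0f   E = %.3f\n", p, n, W / (p * Tp));
    }
    return 0;
}

As p grows, the printed efficiency approaches 1/(1 + tw) instead of decaying, illustrating that increasing W in proportion to p2 is enough to keep the efficiency bounded away from zero.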

2-D Partitioning

This section discusses parallel matrix-vector multiplication for the case in which the matrix is distributed among the processes using a block 2-D partitioning. Figure 8.2 shows the distribution of the matrix and the distribution and movement of vectors among the processes.


Figure 8.2. Matrix-vector multiplication with block 2-D partitioning. For the one-element-per-process case, p = n2 if the matrix size is n × n.

One Element Per Process

We start with the simple case in which an n × n matrix is partitioned among n2 processes such that each process owns a single element. The n × 1 vector x is distributed only in the last column of n processes, each of which owns one element of the vector. Since the algorithm multiplies the elements of the vector x with the corresponding elements in each row of the matrix, the vector must be distributed such that the ith element of the vector is available to the ith element of each row of the matrix. The communication steps for this are shown in Figure 8.2(a) and (b). Notice the similarity of Figure 8.2 to Figure 8.1. Before the multiplication, the elements of the matrix and the vector must be in the same relative locations as in Figure 8.1(c). However, the vector communication steps differ between various partitioning strategies. With 1-D partitioning, the elements of the vector cross only the horizontal partition-boundaries (Figure 8.1), but for 2-D partitioning, the vector elements cross both horizontal and vertical partition-boundaries (Figure 8.2).

As Figure 8.2(a) shows, the first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix. Often, the vector is stored along the diagonal instead of the last column, in which case this step is not required. The second step copies the vector elements from each diagonal process to all the processes in the corresponding column. As Figure 8.2(b) shows, this step consists of n simultaneous one-to-all broadcast operations, one in each column of processes. After these two communication steps, each process multiplies its matrix element with the corresponding element of x. To obtain the result vector y, the products computed for each row must be added, leaving the sums in the last column of processes. Figure 8.2(c) shows this step, which requires an all-to-one reduction (Section 4.1) in each row with the last process of the row as the destination. The parallel matrix-vector multiplication is complete after the reduction step.

Parallel Run Time

Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column, and all-to-one reduction in each row. Each of these operations takes time Θ(log n). Since each process performs a single multiplication in constant time, the overall parallel run time of this algorithm is Θ(log n). The cost (process-time product) is Θ(n2 log n); hence, the algorithm is not cost-optimal.

Using Fewer than n2 Processes

A cost-optimal parallel implementation of matrix-vector multiplication with block 2-D partitioning of the matrix can be obtained if the granularity of computation at each process is increased by using fewer than n2 processes.

Consider a logical two-dimensional mesh of p processes in which each process owns an (n/√p) × (n/√p) block of the matrix. The vector is distributed in portions of n/√p elements in the last process-column only. Figure 8.2 also illustrates the initial data-mapping and the various communication steps for this case. The entire vector must be distributed on each row of processes before the multiplication can be performed. First, the vector is aligned along the main diagonal. For this, each process in the rightmost column sends its n/√p vector elements to the diagonal process in its row. Then a columnwise one-to-all broadcast of these n/√p elements takes place. Each process then performs n2/p multiplications and locally adds the n/√p sets of products. At the end of this step, as shown in Figure 8.2(c), each process has n/√p partial sums that must be accumulated along each row to obtain the result vector. Hence, the last step of the algorithm is an all-to-one reduction of the n/√p values in each row, with the rightmost process of the row as the destination.
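The following MPI sketch implements these steps for the block 2-D case: alignment of the vector along the diagonal, a columnwise one-to-all broadcast, a local block-vector product, and a rowwise all-to-one reduction. It assumes that p is a perfect square and that √p divides n; the communicators are built with MPI_Comm_split, and all names are illustrative.

#include <mpi.h>
#include <stdlib.h>
#include <math.h>

/* Matrix-vector multiplication with block 2-D partitioning on a
 * sqrt(p) x sqrt(p) process grid.  Each process owns an nb x nb block of A
 * (nb = n/sqrt(p)); the input vector lives in the last process column and
 * the result vector is accumulated there as well.  Sketch only. */
void mat_vect_2d(const double *A_blk, const double *x_seg_in,
                 double *y_seg_out, int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);       /* grid dimension sqrt(p) */
    int row = rank / q, col = rank % q;
    int nb = n / q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm);  /* rank in row_comm == col */
    MPI_Comm_split(comm, col, row, &col_comm);  /* rank in col_comm == row */

    double *x_seg = malloc(nb * sizeof(double));

    /* Step 1: align the vector along the main diagonal.  The rightmost
     * process of each row sends its segment to the diagonal process. */
    if (col == q - 1 && row != q - 1)
        MPI_Send(x_seg_in, nb, MPI_DOUBLE, row, 0, row_comm);
    if (row == col) {
        if (row == q - 1)
            for (int i = 0; i < nb; i++) x_seg[i] = x_seg_in[i];
        else
            MPI_Recv(x_seg, nb, MPI_DOUBLE, q - 1, 0, row_comm,
                     MPI_STATUS_IGNORE);
    }

    /* Step 2: columnwise one-to-all broadcast from the diagonal process. */
    MPI_Bcast(x_seg, nb, MPI_DOUBLE, col, col_comm);

    /* Step 3: local block-vector product (n^2/p multiply-add pairs). */
    double *y_part = malloc(nb * sizeof(double));
    for (int i = 0; i < nb; i++) {
        y_part[i] = 0.0;
        for (int j = 0; j < nb; j++)
            y_part[i] += A_blk[i * nb + j] * x_seg[j];
    }

    /* Step 4: all-to-one reduction in each row, accumulating the n/sqrt(p)
     * result elements on the rightmost process of the row. */
    MPI_Reduce(y_part, y_seg_out, nb, MPI_DOUBLE, MPI_SUM, q - 1, row_comm);

    free(x_seg); free(y_part);
    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
}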

Parallel Run Time

The first step of sending a message of size n/√p from the rightmost process of a row to the diagonal process (Figure 8.2(a)) takes time ts + twn/√p. We can perform the columnwise one-to-all broadcast in at most time (ts + twn/√p) log √p by using the procedure described in Section 4.1.3. Ignoring the time to perform additions, the final rowwise all-to-one reduction also takes the same amount of time. Assuming that a multiplication and addition pair takes unit time, each process spends approximately n2/p time in computation. Thus, the parallel run time for this procedure is as follows:

Equation 8.7. 

T_P = n^2/p + t_s log p + t_w (n/√p) log p

Scalability Analysis

By using Equations 8.1 and 8.7, and applying the relation To = pTP − W (Equation 5.1), we get the following expression for the overhead function of this parallel algorithm:

Equation 8.8. 

T_o = t_s p log p + t_w n √p log p

We now perform an approximate isoefficiency analysis along the lines of Section 5.4.2 by considering the terms of the overhead function one at a time (see Problem 8.4 for a more precise isoefficiency analysis). For the ts term of the overhead function, Equation 5.14 yields

Equation 8.9. 

W = K t_s p log p

Equation 8.9 gives the isoefficiency term with respect to the message startup time. We can obtain the isoefficiency function due to tw by balancing the term tw n√p log p of the overhead function with the problem size n2. Using the isoefficiency relation of Equation 5.14, we get the following:

Equation 8.10. 

n^2 = K t_w n √p log p
n = K t_w √p log p
W = n^2 = K^2 t_w^2 p log^2 p

Finally, considering that the degree of concurrency of 2-D partitioning is n2 (that is, a maximum of n2 processes can be used), we arrive at the following relation:

Equation 8.11. 

p = O(n^2)
W = n^2 = Ω(p)

Among Equations 8.9, 8.10, and 8.11, the one with the largest right-hand side expression determines the overall isoefficiency function of this parallel algorithm. To simplify the analysis, we ignore the impact of the constants and consider only the asymptotic rate of the growth of problem size that is necessary to maintain constant efficiency. The asymptotic isoefficiency term due to tw (Equation 8.10) clearly dominates the ones due to ts (Equation 8.9) and due to concurrency (Equation 8.11). Therefore, the overall asymptotic isoefficiency function is given by Θ(p log2 p).

The isoefficiency function also determines the criterion for cost-optimality (Section 5.4.3). With an isoefficiency function of Θ(p log2 p), the maximum number of processes that can be used cost-optimally for a given problem size W is determined by the following relations:

Equation 8.12. 

p log^2 p = O(n^2)

Ignoring the lower-order terms,

log p = O(log n)

Substituting log n for log p in Equation 8.12,

Equation 8.13. 

p log^2 n = O(n^2)
p = O(n^2 / log^2 n)

The right-hand side of Equation 8.13 gives an asymptotic upper bound on the number of processes that can be used cost-optimally for an n × n matrix-vector multiplication with a 2-D partitioning of the matrix.

Comparison of 1-D and 2-D Partitionings

A comparison of Equations 8.2 and 8.7 shows that matrix-vector multiplication is faster with block 2-D partitioning of the matrix than with block 1-D partitioning for the same number of processes. If the number of processes is greater than n, then the 1-D partitioning cannot be used. However, even if the number of processes is less than or equal to n, the analysis in this section suggests that 2-D partitioning is preferable.

Among the two partitioning schemes, 2-D partitioning has a better (smaller) asymptotic isoefficiency function. Thus, matrix-vector multiplication is more scalable with 2-D partitioning; that is, it can deliver the same efficiency on more processes with 2-D partitioning than with 1-D partitioning.

Matrix-Matrix Multiplication

This section discusses parallel algorithms for multiplying two n × n dense, square matrices A and B to yield the product matrix C = A × B. All parallel matrix multiplication algorithms in this chapter are based on the conventional serial algorithm shown in Algorithm 8.2. If we assume that an addition and multiplication pair (line 8) takes unit time, then the sequential run time of this algorithm is n3. Matrix multiplication algorithms with better asymptotic sequential complexities are available, for example Strassen’s algorithm. However, for the sake of simplicity, in this book we assume that the conventional algorithm is the best available serial algorithm. Problem 8.5 explores the performance of parallel matrix multiplication regarding Strassen’s method as the base algorithm.

Algorithm 8.2. The conventional serial algorithm for multiplication of two n × n matrices.

1.   procedure MAT_MULT (A, B, C) 
2.   begin 
3.      for i := 0 to n − 1 do 
4.          for j := 0 to n − 1 do 
5.              begin 
6.                 C[i, j] := 0; 
7.                 for k := 0 to n − 1 do 
8.                     C[i, j] := C[i, j] + A[i, k] × B[k, j]; 
9.              endfor; 
10.  end MAT_MULT 

Algorithm 8.3. The block matrix multiplication algorithm for n × n matrices with a block size of (n/q) × (n/q).

1.   procedure BLOCK_MAT_MULT (A, B, C) 
2.   begin 
3.      for i := 0 to q − 1 do 
4.          for j := 0 to q − 1 do 
5.              begin 
6.                 Initialize all elements of Ci,j to zero; 
7.                 for k := 0 to q − 1 do 
8.                     Ci,j := Ci,j + Ai,k × Bk,j; 
9.              endfor; 
10.  end BLOCK_MAT_MULT 

A concept that is useful in matrix multiplication as well as in a variety of other matrix algorithms is that of block matrix operations. We can often express a matrix computation involving scalar algebraic operations on all its elements in terms of identical matrix algebraic operations on blocks or submatrices of the original matrix. Such algebraic operations on the submatrices are called block matrix operations. For example, an n × n matrix A can be regarded as a q × q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) × (n/q) submatrix. The matrix multiplication algorithm in Algorithm 8.2 can then be rewritten as Algorithm 8.3, in which the multiplication and addition operations on line 8 are matrix multiplication and matrix addition, respectively. Not only are the final results of Algorithms 8.2 and 8.3 identical, but so are the total numbers of scalar additions and multiplications performed by each. Algorithm 8.2 performs n3 additions and multiplications, and Algorithm 8.3 performs q3 matrix multiplications, each involving (n/q) × (n/q) matrices and requiring (n/q)3 additions and multiplications. We can use p processes to implement the block version of matrix multiplication in parallel by choosing q = √p and computing a distinct Ci,j block at each process.
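As a concrete (serial) illustration of block matrix operations, the following C routine carries out Algorithm 8.3 on row-major matrices with a block size of b = n/q. It is a sketch that assumes q divides n and that the caller has zero-initialized C; it performs exactly the same n3 scalar multiply-add pairs as the unblocked algorithm.

/* Blocked matrix multiplication C = C + A*B for n x n row-major matrices,
 * using a q x q grid of b x b blocks (b = n/q), as in Algorithm 8.3. */
void block_mat_mult(const double *A, const double *B, double *C, int n, int q)
{
    int b = n / q;                                /* block size */
    for (int i = 0; i < q; i++)
        for (int j = 0; j < q; j++)
            for (int k = 0; k < q; k++)
                /* C[i,j] := C[i,j] + A[i,k] x B[k,j] on b x b blocks (line 8) */
                for (int ii = i * b; ii < (i + 1) * b; ii++)
                    for (int jj = j * b; jj < (j + 1) * b; jj++)
                        for (int kk = k * b; kk < (k + 1) * b; kk++)
                            C[ii * n + jj] += A[ii * n + kk] * B[kk * n + jj];
}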

In the following sections, we describe a few ways of parallelizing Algorithm 8.3. Each of the following parallel matrix multiplication algorithms uses a block 2-D partitioning of the matrices.

A Simple Parallel Algorithm

Consider two n × n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each. These blocks are mapped onto a √p × √p logical mesh of processes. The processes are labeled from P0,0 to P√p−1,√p−1. Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p. To acquire all the required blocks, an all-to-all broadcast of matrix A’s blocks is performed in each row of processes, and an all-to-all broadcast of matrix B’s blocks is performed in each column. After Pi,j acquires Ai,0, Ai,1, ..., Ai,√p−1 and B0,j, B1,j, ..., B√p−1,j, it performs the submatrix multiplication and addition step of lines 7 and 8 in Algorithm 8.3.
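A possible MPI realization of this simple algorithm is sketched below: the two all-to-all broadcast steps are expressed as MPI_Allgather calls on row and column communicators, after which each process accumulates its block of C locally. It assumes that p is a perfect square and that √p divides n; all names are illustrative.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

/* "Simple" parallel matrix multiplication on a sqrt(p) x sqrt(p) process
 * grid.  Each process starts with one b x b block of A and of B
 * (b = n/sqrt(p)) and computes the corresponding block of C.  Sketch only. */
void simple_mat_mult(const double *A_blk, const double *B_blk, double *C_blk,
                     int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int row = rank / q, col = rank % q;
    int b = n / q, bsz = b * b;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm);   /* rank == column index */
    MPI_Comm_split(comm, col, row, &col_comm);   /* rank == row index    */

    /* All-to-all broadcast of A blocks in each row and of B blocks in each
     * column: afterwards this process holds A[row][0..q-1], B[0..q-1][col]. */
    double *A_row = malloc((size_t)q * bsz * sizeof(double));
    double *B_col = malloc((size_t)q * bsz * sizeof(double));
    MPI_Allgather(A_blk, bsz, MPI_DOUBLE, A_row, bsz, MPI_DOUBLE, row_comm);
    MPI_Allgather(B_blk, bsz, MPI_DOUBLE, B_col, bsz, MPI_DOUBLE, col_comm);

    /* C[row][col] = sum over k of A[row][k] * B[k][col] (lines 7 and 8 of
     * Algorithm 8.3). */
    memset(C_blk, 0, (size_t)bsz * sizeof(double));
    for (int k = 0; k < q; k++) {
        const double *Ak = A_row + (size_t)k * bsz;    /* block A[row][k] */
        const double *Bk = B_col + (size_t)k * bsz;    /* block B[k][col] */
        for (int i = 0; i < b; i++)
            for (int j = 0; j < b; j++)
                for (int kk = 0; kk < b; kk++)
                    C_blk[i * b + j] += Ak[i * b + kk] * Bk[kk * b + j];
    }
    free(A_row); free(B_col);
    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
}

Note that, after the two MPI_Allgather calls, each process holds √p blocks of each input matrix; this is the memory overhead discussed at the end of this section.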

Performance and Scalability Analysis

The algorithm requires two all-to-all broadcast steps (each consisting of √p concurrent broadcasts in all rows and columns of the process mesh) among groups of √p processes. The messages consist of submatrices of n2/p elements. From Table 4.1, the total communication time is 2(ts log √p + tw(n2/p)(√p − 1)). After the communication step, each process computes a submatrix Ci,j, which requires √p multiplications of (n/√p) × (n/√p) submatrices (lines 7 and 8 of Algorithm 8.3 with q = √p). This takes a total of time √p × (n/√p)3 = n3/p. Thus, the parallel run time is approximately

Equation 8.14. 

T_P = n^3/p + t_s log p + 2 t_w n^2/√p

The process-time product is n3 + ts p log p + 2twn2√p, and the parallel algorithm is cost-optimal for p = O(n2).

The isoefficiency functions due to ts and tw are ts p log p and 8(tw)3 p3/2, respectively. Hence, the overall isoefficiency function due to the communication overhead is Θ(p3/2). This algorithm can use a maximum of n2 processes; hence, p ≤ n2, or n3 ≥ p3/2. Therefore, the isoefficiency function due to concurrency is also Θ(p3/2).

A notable drawback of this algorithm is its excessive memory requirements. At the end of the communication phase, each process has √p blocks of both matrices A and B. Since each block requires Θ(n2/p) memory, each process requires Θ(n2/√p) memory. The total memory requirement over all the processes is Θ(n2√p), which is √p times the memory requirement of the sequential algorithm.

Cannon’s Algorithm

Cannon’s algorithm is a memory-efficient version of the simple algorithm presented in Section 8.2.1. To study this algorithm, we again partition matrices A and B into p square blocks. We label the processes from P0,0 to P√p−1,√p−1, and initially assign submatrices Ai,j and Bi,j to process Pi,j. Although every process in the ith row requires all √p submatrices Ai,k (0 ≤ k < √p), it is possible to schedule the computations of the √p processes of the ith row such that, at any given time, each process is using a different Ai,k. These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation. If an identical schedule is applied to the columns, then no process holds more than one block of each matrix at any time, and the total memory requirement of the algorithm over all the processes is Θ(n2). Cannon’s algorithm is based on this idea. The scheduling for the multiplication of submatrices on separate processes in Cannon’s algorithm is illustrated in Figure 8.3 for 16 processes.


Figure 8.3. The communication steps in Cannon’s algorithm on 16 processes.

The first communication step of the algorithm aligns the blocks of A and B in such a way that each process multiplies its local submatrices. As Figure 8.3(a) shows, this alignment is achieved for matrix A by shifting all submatrices Ai,j to the left (with wraparound) by i steps. Similarly, as shown in Figure 8.3(b), all submatrices Bi,j are shifted up (with wraparound) by j steps. These are circular shift operations (Section 4.6) in each row and column of processes, which leave process Pi,j with submatrices Ai,(j+i) mod √p and B(i+j) mod √p,j. Figure 8.3(c) shows the blocks of A and B after the initial alignment, when each process is ready for the first submatrix multiplication. After a submatrix multiplication step, each block of A moves one step left and each block of B moves one step up (again with wraparound), as shown in Figure 8.3(d). A sequence of √p such submatrix multiplications and single-step shifts pairs up each Ai,k and Bk,j for k (0 ≤ k < √p) at Pi,j. This completes the multiplication of matrices A and B.
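A compact MPI sketch of Cannon's algorithm is given below. A periodic Cartesian topology makes the initial alignment and the single-step rotations simple wraparound exchanges with MPI_Sendrecv_replace. It assumes that p is a perfect square and that √p divides n; the names are illustrative.

#include <mpi.h>
#include <string.h>
#include <math.h>

/* Cannon's algorithm on a sqrt(p) x sqrt(p) process grid.  Each process
 * holds one b x b block of A, B, and C (b = n/sqrt(p)); A_blk and B_blk are
 * overwritten as the blocks rotate.  Sketch only. */
void cannon_mat_mult(double *A_blk, double *B_blk, double *C_blk,
                     int n, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int b = n / q, bsz = b * b;

    /* Periodic 2-D Cartesian topology: wraparound shifts become
     * neighbor exchanges. */
    int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];
    MPI_Comm cart;
    MPI_Cart_create(comm, 2, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);
    int row = coords[0], col = coords[1];

    int left, right, up, down, src, dst;
    MPI_Cart_shift(cart, 1, -1, &right, &left);   /* shift left along a row  */
    MPI_Cart_shift(cart, 0, -1, &down, &up);      /* shift up along a column */

    /* Initial alignment: A[i][j] moves left by i steps, B[i][j] up by j. */
    MPI_Cart_shift(cart, 1, -row, &src, &dst);
    MPI_Sendrecv_replace(A_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                         cart, MPI_STATUS_IGNORE);
    MPI_Cart_shift(cart, 0, -col, &src, &dst);
    MPI_Sendrecv_replace(B_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                         cart, MPI_STATUS_IGNORE);

    /* sqrt(p) compute-and-shift steps. */
    memset(C_blk, 0, (size_t)bsz * sizeof(double));
    for (int step = 0; step < q; step++) {
        for (int i = 0; i < b; i++)               /* C_blk += A_blk * B_blk */
            for (int j = 0; j < b; j++)
                for (int k = 0; k < b; k++)
                    C_blk[i * b + j] += A_blk[i * b + k] * B_blk[k * b + j];
        /* Single-step rotations: A one step left, B one step up. */
        MPI_Sendrecv_replace(A_blk, bsz, MPI_DOUBLE, left, 1, right, 1,
                             cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B_blk, bsz, MPI_DOUBLE, up, 2, down, 2,
                             cart, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&cart);
}

Unlike the simple algorithm, each process holds only one block of A and one block of B at any time, which is what makes this formulation memory-efficient.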

Performance Analysis

The initial alignment of the two matrices (Figure 8.3(a) and (b)) involves a rowwise and a columnwise circular shift. In any of these shifts, the maximum distance over which a block shifts is √p − 1. The two shift operations require a total of time 2(ts + twn2/p) (Table 4.1). Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes time ts + twn2/p. Thus, the total communication time (for both matrices) during this phase of the algorithm is 2√p(ts + twn2/p). For large enough p on a network with sufficient bandwidth, the communication time for the initial alignment can be disregarded in comparison with the time spent in communication during the compute-and-shift phase.

Each process performs √p multiplications of (n/√p) × (n/√p) submatrices. Assuming that a multiplication and addition pair takes unit time, the total time that each process spends in computation is n3/p. Thus, the approximate overall parallel run time of this algorithm is

Equation 8.15. 

T_P = n^3/p + 2 √p t_s + 2 t_w n^2/√p

The cost-optimality condition for Cannon’s algorithm is identical to that for the simple algorithm presented in Section 8.2.1. As in the simple algorithm, the isoefficiency function of Cannon’s algorithm is Θ(p3/2).

The DNS Algorithm

The matrix multiplication algorithms presented so far use block 2-D partitioning of the input and the output matrices and use a maximum of n2 processes for n × n matrices. As a result, these algorithms have a parallel run time of Ω(n) because there are Θ(n3) operations in the serial algorithm. We now present a parallel algorithm based on partitioning intermediate data that can use up to n3 processes and that performs matrix multiplication in time Θ(log n) by using Ω(n3/log n) processes. This algorithm is known as the DNS algorithm because it is due to Dekel, Nassimi, and Sahni.

We first introduce the basic idea, without concern for inter-process communication. Assume that n3 processes are available for multiplying two n × n matrices. These processes are arranged in a three-dimensional n × n × n logical array. Since the matrix multiplication algorithm performs n3 scalar multiplications, each of the n3 processes is assigned a single scalar multiplication. The processes are labeled according to their location in the array, and the multiplication A[i, k] × B[k, j] is assigned to process Pi,j,k (0 ≤ i, j, k < n). After each process performs a single multiplication, the contents of Pi,j,0, Pi,j,1, ..., Pi,j,n−1 are added to obtain C [i, j]. The additions for all C [i, j] can be carried out simultaneously in log n steps each. Thus, it takes one step to multiply and log n steps to add; that is, it takes time Θ(log n) to multiply the n × n matrices by this algorithm.

We now describe a practical parallel implementation of matrix multiplication based on this idea. As Figure 8.4 shows, the process arrangement can be visualized as n planes of n × n processes each. Each plane corresponds to a different value of k. Initially, as shown in Figure 8.4(a), the matrices are distributed among the n2 processes of the plane corresponding to k = 0 at the base of the three-dimensional process array. Process Pi,j,0 initially owns A[i, j] and B[i, j].


Figure 8.4. The communication steps in the DNS algorithm while multiplying 4 × 4 matrices A and B on 64 processes. The shaded processes in part (c) store elements of the first row of A and the shaded processes in part (d) store elements of the first column of B.

The vertical column of processes Pi,j,∗ computes the dot product of row A[i, ∗] and column B[∗, j]. Therefore, rows of A and columns of B need to be moved appropriately so that each vertical column of processes Pi,j,∗ has row A[i, ∗] and column B[∗, j]. More precisely, process Pi,j,k should have A[i, k] and B[k, j].

The communication pattern for distributing the elements of matrix A among the processes is shown in Figure 8.4(a)–(c). First, each column of A moves to a different plane such that the jth column occupies the same position in the plane corresponding to k = j as it initially did in the plane corresponding to k = 0. The distribution of A after moving A[i, j] from Pi,j,0 to Pi,j,j is shown in Figure 8.4(b). Now all the columns of A are replicated n times in their respective planes by a parallel one-to-all broadcast along the j axis. The result of this step is shown in Figure 8.4(c), in which the n processes Pi,0,j, Pi,1,j, ..., Pi,n−1,j receive a copy of A[i, j] from Pi,j,j. At this point, each vertical column of processes Pi,j,∗ has row A[i, ∗]. More precisely, process Pi,j,k has A[i, k].

For matrix B, the communication steps are similar, but the roles of i and j in process subscripts are switched. In the first one-to-one communication step, B[i, j] is moved from Pi,j,0 to Pi,j,i. Then it is broadcast from Pi,j,i among P0,j,i, P1,j,i, ..., Pn−1,j,i. The distribution of B after this one-to-all broadcast along the i axis is shown in Figure 8.4(d). At this point, each vertical column of processes Pi,j,∗ has column B[∗, j]. Now process Pi,j,k has B[k, j], in addition to A[i, k].

After these communication steps, A[i, k] and B[k, j] are multiplied at Pi,j,k. Now each element C[i, j] of the product matrix is obtained by an all-to-one reduction along the k axis. During this step, process Pi,j,0 accumulates the results of the multiplication from processes Pi,j,1, ..., Pi,j,n−1. Figure 8.4 shows this step for C[0, 0].

The DNS algorithm has three main communication steps: (1) moving the columns of A and the rows of B to their respective planes, (2) performing one-to-all broadcast along the j axis for A and along the i axis for B, and (3) all-to-one reduction along the k axis. All these operations are performed within groups of n processes and take time Θ(log n). Thus, the parallel run time for multiplying two n × n matrices using the DNS algorithm on n3 processes is Θ(log n).

DNS Algorithm with Fewer than n3 Processes

The DNS algorithm is not cost-optimal for n3 processes, since its process-time product of Θ(n3 log n) exceeds the Θ(n3) sequential complexity of matrix multiplication. We now present a cost-optimal version of this algorithm that uses fewer than n3 processes. Another variant of the DNS algorithm that uses fewer than n3 processes is described in Problem 8.6.

Assume that the number of processes p is equal to q3 for some q < n. To implement the DNS algorithm, the two matrices are partitioned into blocks of size (n/q) × (n/q). Each matrix can thus be regarded as a q × q two-dimensional square array of blocks. The implementation of this algorithm on q3 processes is very similar to that on n3 processes. The only difference is that now we operate on blocks rather than on individual elements. Since 1 ≤ q ≤ n, the number of processes can vary between 1 and n3.
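The block formulation just described can be sketched in MPI as follows. The three axes of the logical q × q × q process array are obtained with MPI_Comm_split; the input blocks of A and B are assumed to start on the k = 0 plane (the buffers on the other planes serve as scratch space), and the block of C is produced on that plane. The assumptions (p a perfect cube, q divides n) and all names are ours.

#include <mpi.h>
#include <stdlib.h>
#include <math.h>

/* The DNS algorithm on p = q^3 processes, each handling one (n/q) x (n/q)
 * block.  Process (i,j,k) contributes A[i][k]*B[k][j] to C[i][j]; the
 * result block C[i][j] is accumulated on the k = 0 plane.  Sketch only;
 * on planes k > 0 the A_blk and B_blk buffers are used as scratch space. */
void dns_mat_mult(double *A_blk, double *B_blk, double *C_blk,
                  int n, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = (int)(cbrt((double)p) + 0.5);
    int b = n / q, bsz = b * b;
    int i = rank / (q * q), j = (rank / q) % q, k = rank % q;

    /* Communicators along the three axes of the logical q x q x q array. */
    MPI_Comm kline, jline, iline;
    MPI_Comm_split(comm, i * q + j, k, &kline);   /* fixed (i,j), varying k */
    MPI_Comm_split(comm, i * q + k, j, &jline);   /* fixed (i,k), varying j */
    MPI_Comm_split(comm, j * q + k, i, &iline);   /* fixed (j,k), varying i */

    /* Step 1: move A[i][j] from plane k = 0 to plane k = j, and B[i][j]
     * from plane k = 0 to plane k = i. */
    if (k == 0 && j != 0) MPI_Send(A_blk, bsz, MPI_DOUBLE, j, 0, kline);
    if (k == j && j != 0) MPI_Recv(A_blk, bsz, MPI_DOUBLE, 0, 0, kline,
                                   MPI_STATUS_IGNORE);
    if (k == 0 && i != 0) MPI_Send(B_blk, bsz, MPI_DOUBLE, i, 1, kline);
    if (k == i && i != 0) MPI_Recv(B_blk, bsz, MPI_DOUBLE, 0, 1, kline,
                                   MPI_STATUS_IGNORE);

    /* Step 2: one-to-all broadcasts along the j axis (for A) and the
     * i axis (for B).  Afterwards process (i,j,k) holds A[i][k], B[k][j]. */
    MPI_Bcast(A_blk, bsz, MPI_DOUBLE, k, jline);
    MPI_Bcast(B_blk, bsz, MPI_DOUBLE, k, iline);

    /* Local block product. */
    double *T = calloc((size_t)bsz, sizeof(double));
    for (int ii = 0; ii < b; ii++)
        for (int jj = 0; jj < b; jj++)
            for (int kk = 0; kk < b; kk++)
                T[ii * b + jj] += A_blk[ii * b + kk] * B_blk[kk * b + jj];

    /* Step 3: all-to-one reduction along the k axis onto the k = 0 plane. */
    MPI_Reduce(T, C_blk, bsz, MPI_DOUBLE, MPI_SUM, 0, kline);
    free(T);
    MPI_Comm_free(&kline); MPI_Comm_free(&jline); MPI_Comm_free(&iline);
}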

Performance Analysis

The first one-to-one communication step is performed for both A and B, and takes time ts + tw(n/q)2 for each matrix. The second step of one-to-all broadcast is also performed for both matrices and takes time ts log q + tw(n/q)2 log q for each matrix. The final all-to-one reduction is performed only once (for matrix C) and takes time ts log q + tw(n/q)2 log q. The multiplication of (n/q) × (n/q) submatrices by each process takes time (n/q)3. We can ignore the communication time for the first one-to-one communication step because it is much smaller than the communication time of one-to-all broadcasts and all-to-one reduction. We can also ignore the computation time for addition in the final reduction phase because it is of a smaller order of magnitude than the computation time for multiplying the submatrices. With these assumptions, we get the following approximate expression for the parallel run time of the DNS algorithm:

T_P = (n/q)^3 + 3 t_s log q + 3 t_w (n/q)^2 log q

Since q = p1/3, we get

Equation 8.16. 

T_P = n^3/p + t_s log p + t_w (n^2/p^{2/3}) log p

The total cost of this parallel algorithm is n3 + ts p log p + twn2 p1/3 log p. The isoefficiency function is Θ(p(log p)3). The algorithm is cost-optimal for n3 = Ω(p(log p)3), or p = O(n3/(log n)3).

Solving a System of Linear Equations

This section discusses the problem of solving a system of linear equations of the form

a_{0,0} x_0 + a_{0,1} x_1 + ... + a_{0,n−1} x_{n−1} = b_0,
a_{1,0} x_0 + a_{1,1} x_1 + ... + a_{1,n−1} x_{n−1} = b_1,
    ⋮
a_{n−1,0} x_0 + a_{n−1,1} x_1 + ... + a_{n−1,n−1} x_{n−1} = b_{n−1}.

In matrix notation, this system is written as Ax = b. Here A is a dense n × n matrix of coefficients such that A[i, j] = ai,j, b is an n × 1 vector [b0, b1, ..., bn−1]T, and x is the desired solution vector [x0, x1, ..., xn−1]T. We will make all subsequent references to ai,j by A[i, j] and xi by x [i].

A system of equations Ax = b is usually solved in two stages. First, through a series of algebraic manipulations, the original system of equations is reduced to an upper-triangular system of the form

x_0 + u_{0,1} x_1 + u_{0,2} x_2 + ... + u_{0,n−1} x_{n−1} = y_0,
      x_1 + u_{1,2} x_2 + ... + u_{1,n−1} x_{n−1} = y_1,
          ⋮
                                          x_{n−1} = y_{n−1}.

We write this as Ux = y, where U is a unit upper-triangular matrix – one in which all subdiagonal entries are zero and all principal diagonal entries are equal to one. Formally, U[i, j] = 0 if i > j, otherwise U[i, j] = ui,j. Furthermore, U[i, i] = 1 for 0 ≤ i < n. In the second stage of solving a system of linear equations, the upper-triangular system is solved for the variables in reverse order from x[n − 1] to x[0] by a procedure known as back-substitution (Section 8.3.3).

We discuss parallel formulations of the classical Gaussian elimination method for upper-triangularization in Sections 8.3.1 and 8.3.2. In Section 8.3.1, we describe a straightforward Gaussian elimination algorithm assuming that the coefficient matrix is nonsingular, and its rows and columns are permuted in a way that the algorithm is numerically stable. Section 8.3.2 discusses the case in which a numerically stable solution of the system of equations requires permuting the columns of the matrix during the execution of the Gaussian elimination algorithm.

Although we discuss Gaussian elimination in the context of upper-triangularization, a similar procedure can be used to factorize matrix A as the product of a lower-triangular matrix L and a unit upper-triangular matrix U so that A = L × U. This factorization is commonly referred to as LU factorization. Performing LU factorization (rather than upper-triangularization) is particularly useful if multiple systems of equations with the same left-hand side Ax need to be solved. Algorithm 3.3 gives a procedure for column-oriented LU factorization.

A Simple Gaussian Elimination Algorithm

The serial Gaussian elimination algorithm has three nested loops. Several variations of the algorithm exist, depending on the order in which the loops are arranged. Algorithm 8.4 shows one variation of Gaussian elimination, which we will adopt for parallel implementation in the remainder of this section. This program converts a system of linear equations Ax = b to a unit upper-triangular system Ux = y. We assume that the matrix U shares storage with A and overwrites the upper-triangular portion of A. The element A[k, j] computed on line 6 of Algorithm 8.4 is actually U [k, j]. Similarly, the element A[k, k] equated to 1 on line 8 is U [k, k]. Algorithm 8.4 assumes that A[k, k] ≠ 0 when it is used as a divisor on lines 6 and 7.

Algorithm 8.4. A serial Gaussian elimination algorithm that converts the system of linear equations Ax = b to a unit upper-triangular system Ux = y. The matrix U occupies the upper-triangular locations of A. This algorithm assumes that A[k, k] ≠ 0 when it is used as a divisor on lines 6 and 7.

1.   procedure GAUSSIAN_ELIMINATION (A, b, y) 
2.   begin 
3.      for k := 0 to n − 1 do           /* Outer loop */ 
4.      begin 
5.         for j := k + 1 to n − 1 do 
6.             A[k, j] := A[k, j]/A[k, k];  /* Division step */ 
7.         y[k] := b[k]/A[k, k]; 
8.         A[k, k] := 1; 
9.         for i := k + 1 to n − 1 do 
10.        begin 
11.           for j := k + 1 to n − 1 do 
12.               A[i, j] := A[i, j] − A[i, k] × A[k, j]; /* Elimination step */ 
13.           b[i] := b[i] − A[i, k] × y[k]; 
14.           A[i, k] := 0; 
15.        endfor;          /* Line 9 */ 
16.     endfor;             /* Line 3 */ 
17.  end GAUSSIAN_ELIMINATION 

In this section, we will concentrate only on the operations on matrix A in Algorithm 8.4. The operations on vector b on lines 7 and 13 of the program are straightforward to implement. Hence, in the rest of the section, we will ignore these steps. If the steps on lines 7, 8, 13, and 14 are not performed, then Algorithm 8.4 leads to the LU factorization of A as a product L × U. After the termination of the procedure, L is stored in the lower-triangular part of A, and U occupies the locations above the principal diagonal.
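For reference, Algorithm 8.4 translates directly into the following C routine (the names are ours; the matrix is stored in row-major order and, as in the text, no pivoting is performed):

/* Serial Gaussian elimination, a transcription of Algorithm 8.4: converts
 * Ax = b into a unit upper-triangular system Ux = y, overwriting the upper
 * triangle of A with U.  Assumes A[k][k] != 0 at every step (no pivoting). */
void gaussian_elimination(double *A, double *b, double *y, int n)
{
    for (int k = 0; k < n; k++) {
        for (int j = k + 1; j < n; j++)
            A[k * n + j] /= A[k * n + k];          /* division step (line 6)    */
        y[k] = b[k] / A[k * n + k];
        A[k * n + k] = 1.0;
        for (int i = k + 1; i < n; i++) {
            for (int j = k + 1; j < n; j++)        /* elimination step (line 12) */
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
            b[i] -= A[i * n + k] * y[k];
            A[i * n + k] = 0.0;
        }
    }
}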

For k varying from 0 to n − 1, the Gaussian elimination procedure systematically eliminates variable x[k] from equations k + 1 to n − 1 so that the matrix of coefficients becomes upper-triangular. As shown in Algorithm 8.4, in the kth iteration of the outer loop (starting on line 3), an appropriate multiple of the kth equation is subtracted from each of the equations k + 1 to n − 1 (loop starting on line 9). The multiples of the kth equation (or the kth row of matrix A) are chosen such that the kth coefficient becomes zero in equations k + 1 to n − 1, eliminating x[k] from these equations. A typical computation of the Gaussian elimination procedure in the kth iteration of the outer loop is shown in Figure 8.5. The kth iteration of the outer loop does not involve any computation on rows 0 to k − 1 or columns 0 to k − 1. Thus, at this stage, only the lower-right (n − k) × (n − k) submatrix of A (the shaded portion in Figure 8.5) is computationally active.


Figure 8.5. A typical computation in Gaussian elimination.

Gaussian elimination involves approximately n2/2 divisions (line 6) and approximately (n3/3) − (n2/2) subtractions and multiplications (line 12). In this section, we assume that each scalar arithmetic operation takes unit time. With this assumption, the sequential run time of the procedure is approximately 2n3/3 (for large n); that is,

Equation 8.17. 

W = 2n^3/3

Parallel Implementation with 1-D Partitioning

We now consider a parallel implementation of Algorithm 8.4, in which the coefficient matrix is rowwise 1-D partitioned among the processes. A parallel implementation of this algorithm with columnwise 1-D partitioning is very similar, and its details can be worked out based on the implementation using rowwise 1-D partitioning (Problems 8.8 and 8.9).

We first consider the case in which one row is assigned to each process, and the n × n coefficient matrix A is partitioned along the rows among n processes labeled from P0 to Pn−1. In this mapping, process Pi initially stores elements A[i, j] for 0 ≤ j < n. Figure 8.6 illustrates this mapping of the matrix onto the processes for n = 8. The figure also illustrates the computation and communication that take place in the iteration of the outer loop when k = 3.


Figure 8.6. Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 × 8 matrix partitioned rowwise among eight processes.

Algorithm 8.4 and Figure 8.5 show that A[k, k + 1], A[k, k + 2], ..., A[k, n − 1] are divided by A[k, k] (line 6) at the beginning of the kth iteration. All matrix elements participating in this operation (shown by the shaded portion of the matrix in Figure 8.6(a)) belong to the same process. So this step does not require any communication. In the second computation step of the algorithm (the elimination step of line 12), the modified (after division) elements of the kth row are used by all other rows of the active part of the matrix. As Figure 8.6(b) shows, this requires a one-to-all broadcast of the active part of the kth row to the processes storing rows k + 1 to n − 1. Finally, the computation A[i, j] := A[i, j] − A[i, k] × A[k, j] takes place in the remaining active portion of the matrix, which is shown shaded in Figure 8.6(c).

The computation step corresponding to Figure 8.6(a) in the kth iteration requires n − k − 1 divisions at process Pk. Similarly, the computation step of Figure 8.6(c) involves n − k − 1 multiplications and subtractions in the kth iteration at all processes Pi such that k < i < n. Assuming a single arithmetic operation takes unit time, the total time spent in computation in the kth iteration is 3(n − k − 1). Note that when Pk is performing the divisions, the remaining p − 1 processes are idle, and while processes Pk+1, ..., Pn−1 are performing the elimination step, processes P0, ..., Pk are idle. Thus, the total time spent during the computation steps shown in Figures 8.6(a) and (c) in this parallel implementation of Gaussian elimination is 3 Σ_{k=0}^{n−1} (n − k − 1), which is equal to 3n(n − 1)/2.

The communication step of Figure 8.6(b) takes time (ts + tw(n − k − 1)) log n (Table 4.1). Hence, the total communication time over all iterations is Σ_{k=0}^{n−1} (ts + tw(n − k − 1)) log n, which is equal to ts n log n + tw(n(n − 1)/2) log n. The overall parallel run time of this algorithm is

Equation 8.18. 

T_P = (3/2) n(n − 1) + t_s n log n + (1/2) t_w n(n − 1) log n

Since the number of processes is n, the cost, or the process-time product, is Θ(n3 log n) due to the term associated with tw in Equation 8.18. This cost is asymptotically higher than the sequential run time of this algorithm (Equation 8.17). Hence, this parallel implementation is not cost-optimal.

Pipelined Communication and Computation

We now present a parallel implementation of Gaussian elimination that is cost-optimal on n processes.

In the parallel Gaussian elimination algorithm just presented, the n iterations of the outer loop of Algorithm 8.4 execute sequentially. At any given time, all processes work on the same iteration. The (k + 1)th iteration starts only after all the computation and communication for the kth iteration is complete. The performance of the algorithm can be improved substantially if the processes work asynchronously; that is, no process waits for the others to finish an iteration before starting the next one. We call this the asynchronous or pipelined version of Gaussian elimination. Figure 8.7 illustrates the pipelined Algorithm 8.4 for a 5 × 5 matrix partitioned along the rows onto a logical linear array of five processes.


Figure 8.7. Pipelined Gaussian elimination on a 5 × 5 matrix partitioned with one row per process.

During the kth iteration of Algorithm 8.4, process Pk broadcasts part of the kth row of the matrix to processes Pk+1, ..., Pn−1 (Figure 8.6(b)). Assume that the processes form a logical linear array and that Pk+1 is the first process to receive the kth row from process Pk. Then process Pk+1 must forward this data to Pk+2. However, after forwarding the kth row to Pk+2, process Pk+1 need not wait to perform the elimination step (line 12) until all the processes up to Pn−1 have received the kth row. Similarly, Pk+2 can start its computation as soon as it has forwarded the kth row to Pk+3, and so on. Meanwhile, after completing the computation for the kth iteration, Pk+1 can perform the division step (line 6), and start the broadcast of the (k + 1)th row by sending it to Pk+2.

In pipelined Gaussian elimination, each process independently performs the following sequence of actions repeatedly until all n iterations are complete. For the sake of simplicity, we assume that steps (1) and (2) take the same amount of time (this assumption does not affect the analysis):

  1. If a process has any data destined for other processes, it sends those data to the appropriate process.

  2. If the process can perform some computation using the data it has, it does so.

  3. Otherwise, the process waits to receive data to be used for one of the above actions.

Figure 8.7 shows the 16 steps in the pipelined parallel execution of Gaussian elimination for a 5 × 5 matrix partitioned along the rows among five processes. As Figure 8.7(a) shows, the first step is to perform the division on row 0 at process P0. The modified row 0 is then sent to P1 (Figure 8.7(b)), which forwards it to P2 (Figure 8.7(c)). Now P1 is free to perform the elimination step using row 0 (Figure 8.7(d)). In the next step (Figure 8.7(e)), P2 performs the elimination step using row 0. In the same step, P1, having finished its computation for iteration 0, starts the division step of iteration 1. At any given time, different stages of the same iteration can be active on different processes. For instance, in Figure 8.7(h), process P2 performs the elimination step of iteration 1 while processes P3 and P4 are engaged in communication for the same iteration. Furthermore, more than one iteration may be active simultaneously on different processes. For instance, in Figure 8.7(i), process P2 is performing the division step of iteration 2 while process P3 is performing the elimination step of iteration 1.

We now show that, unlike the synchronous algorithm in which all processes work on the same iteration at a time, the pipelined or the asynchronous version of Gaussian elimination is cost-optimal. As Figure 8.7 shows, the initiation of consecutive iterations of the outer loop of Algorithm 8.4 is separated by a constant number of steps. A total of n such iterations are initiated. The last iteration modifies only the bottom-right corner element of the coefficient matrix; hence, it completes in a constant time after its initiation. Thus, the total number of steps in the entire pipelined procedure is Θ(n) (Problem 8.7). In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row. Each of these operations takes O(n) time. Hence, the entire procedure consists of Θ(n) steps of O(n) complexity each, and its parallel run time is O(n2). Since n processes are used, the cost is O(n3), which is of the same order as the sequential complexity of Gaussian elimination. Hence, the pipelined version of parallel Gaussian elimination with 1-D partitioning of the coefficient matrix is cost-optimal.
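The pipelined scheme can be sketched in MPI for the one-row-per-process case (p = n) as follows. Each process receives the current pivot row from its predecessor on the logical linear array, forwards it immediately, and only then performs its own elimination step; this is what creates the pipeline. For simplicity the sketch sends the whole row (plus the corresponding y value) rather than only its active part, and all names are our own.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Pipelined Gaussian elimination with one row per process (p = n) on a
 * logical linear array, following the scheme of Figure 8.7.  Process `rank`
 * owns row `rank` of A (length n) and the scalar b[rank]; on return it holds
 * the corresponding row of U and the element y[rank].  Sketch; no pivoting. */
void pipelined_ge_1d(double *my_row, double *my_b, double *my_y,
                     int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);                  /* p == n assumed */
    double *row_k = malloc((n + 1) * sizeof(double));  /* row k plus y[k] */

    for (int k = 0; k < n; k++) {
        if (rank < k) break;                  /* this process is finished */
        if (rank == k) {
            /* Division step (line 6 of Algorithm 8.4). */
            for (int j = k + 1; j < n; j++) my_row[j] /= my_row[k];
            *my_y = *my_b / my_row[k];
            my_row[k] = 1.0;
            /* Start the broadcast by sending row k to the next process. */
            memcpy(row_k, my_row, n * sizeof(double));
            row_k[n] = *my_y;
            if (rank + 1 < p)
                MPI_Send(row_k, n + 1, MPI_DOUBLE, rank + 1, k, comm);
        } else {                              /* rank > k */
            /* Receive row k, forward it at once, then eliminate (line 12). */
            MPI_Recv(row_k, n + 1, MPI_DOUBLE, rank - 1, k, comm,
                     MPI_STATUS_IGNORE);
            if (rank + 1 < p)
                MPI_Send(row_k, n + 1, MPI_DOUBLE, rank + 1, k, comm);
            for (int j = k + 1; j < n; j++)
                my_row[j] -= my_row[k] * row_k[j];
            *my_b -= my_row[k] * row_k[n];
            my_row[k] = 0.0;
        }
    }
    free(row_k);
}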

Block 1-D Partitioning with Fewer than n Processes

The preceding pipelined implementation of parallel Gaussian elimination can be easily adapted for the case in which n > p. Consider an n × n matrix partitioned among p processes (p < n) such that each process is assigned n/p contiguous rows of the matrix. Figure 8.8 illustrates the communication steps in a typical iteration of Gaussian elimination with such a mapping. As the figure shows, the kth iteration of the algorithm requires that the active part of the kth row be sent to the processes storing rows k + 1, k + 2, ..., n − 1.


Figure 8.8. The communication in the Gaussian elimination iteration corresponding to k = 3 for an 8 × 8 matrix distributed among four processes using block 1-D partitioning.

Figure 8.9(a) shows that, with block 1-D partitioning, a process with all rows belonging to the active part of the matrix performs (n − k − 1)n/p multiplications and subtractions during the elimination step of the kth iteration. Note that in the last (n/p) − 1 iterations, no process has all active rows, but we ignore this anomaly. If the pipelined version of the algorithm is used, then the number of arithmetic operations on a maximally-loaded process in the kth iteration (2(n − k − 1)n/p) is much higher than the number of words communicated (n − k − 1) by a process in the same iteration. Thus, for sufficiently large values of n with respect to p, computation dominates communication in each iteration. Assuming that each scalar multiplication and subtraction pair takes unit time, the total parallel run time of this algorithm (ignoring communication overhead) is (2n/p) Σ_{k=0}^{n−1} (n − k − 1), which is approximately equal to n3/p.


Figure 8.9. Computation load on different processes in block and cyclic 1-D partitioning of an 8 × 8 matrix on four processes during the Gaussian elimination iteration corresponding to k = 3.

The process-time product of this algorithm is n3, even if the communication costs are ignored. Thus, the cost of the parallel algorithm is higher than the sequential run time (Equation 8.17) by a factor of 3/2. This inefficiency of Gaussian elimination with block 1-D partitioning is due to process idling resulting from an uneven load distribution. As Figure 8.9(a) shows for an 8 × 8 matrix and four processes, during the iteration corresponding to k = 3 (in the outer loop of Algorithm 8.4), one process is completely idle, one is partially loaded, and only two processes are fully active. By the time half of the iterations of the outer loop are over, only half the processes are active. The remaining idle processes make the parallel algorithm costlier than the sequential algorithm.

This problem can be alleviated if the matrix is partitioned among the processes using cyclic 1-D mapping as shown in Figure 8.9(b). With the cyclic 1-D partitioning, the difference between the computational loads of a maximally loaded process and the least loaded process in any iteration is at most one row (that is, O(n) arithmetic operations). Since there are n iterations, the cumulative overhead due to process idling is only O(n2 p) with a cyclic mapping, compared to Θ(n3) with a block mapping (Problem 8.12).

Parallel Implementation with 2-D Partitioning

We now describe a parallel implementation of Algorithm 8.4 in which the n × n matrix A is mapped onto an n × n mesh of processes such that process Pi,j initially stores A[i, j]. The communication and computation steps in the iteration of the outer loop corresponding to k = 3 are illustrated in Figure 8.10 for n = 8. Algorithm 8.4 and Figures 8.5 and 8.10 show that in the kth iteration of the outer loop, A[k, k] is required by processes Pk,k+1, Pk,k+2, ..., Pk,n−1 to divide A[k, k + 1], A[k, k + 2], ..., A[k, n − 1], respectively. After the division on line 6, the modified elements of the kth row are used to perform the elimination step by all the other rows in the active part of the matrix. Similarly, the elements of the kth column are used by all other columns of the active part of the matrix for the elimination step. As Figure 8.10 shows, the communication in the kth iteration requires a one-to-all broadcast of A[i, k] along the ith row (Figure 8.10(a)) for k ≤ i < n, and a one-to-all broadcast of A[k, j] along the jth column (Figure 8.10(c)) for k < j < n. Just like the 1-D partitioning case, a non-cost-optimal parallel formulation results if these broadcasts are performed synchronously on all processes (Problem 8.11).


Figure 8.10. Various steps in the Gaussian elimination iteration corresponding to k = 3 for an 8 × 8 matrix on 64 processes arranged in a logical two-dimensional mesh.

Pipelined Communication and Computation

Based on our experience with Gaussian elimination using 1-D partitioning of the coefficient matrix, we develop a pipelined version of the algorithm using 2-D partitioning.

As Figure 8.10 shows, in the kth iteration of the outer loop (lines 3–16 of Algorithm 8.4), A[k, k] is sent to the right from Pk,k to Pk,k+1 to Pk,k+2, and so on, until it reaches Pk,n−1. Process Pk,k+1 performs the division A[k, k + 1]/A[k, k] as soon as it receives A[k, k] from Pk,k. It does not have to wait for A[k, k] to reach all the way up to Pk,n−1 before performing its local computation. Similarly, any subsequent process Pk,j of the kth row can perform its division as soon as it receives A[k, k]. After performing the division, A[k, j] is ready to be communicated downward in the j th column. As A[k, j] moves down, each process it passes is free to use it for computation. Processes in the j th column need not wait until A[k, j] reaches the last process of the column. Thus, Pi,j performs the elimination step A[i, j] := A[i, j] − A[i, k] × A[k, j] as soon as A[i, k] and A[k, j] are available. Since some processes perform the computation for a given iteration earlier than other processes, they start working on subsequent iterations sooner.

The communication and computation can be pipelined in several ways. We present one such scheme in Figure 8.11. In Figure 8.11(a), the iteration of the outer loop for k = 0 starts at process P0,0, when P0,0 sends A[0, 0] to P0,1. Upon receiving A[0, 0], P0,1 computes A[0, 1] := A[0, 1]/A[0, 0] (Figure 8.11(b)). Now P0,1 forwards A[0, 0] to P0,2 and also sends the updated A[0, 1] down to P1,1 (Figure 8.11(c)). At the same time, P1,0 sends A[1, 0] to P1,1. Having received A[0, 1] and A[1, 0], P1,1 performs the elimination step A[1, 1] := A[1, 1] − A[1, 0] × A[0, 1], and having received A[0, 0], P0,2 performs the division step A[0, 2] := A[0, 2]/A[0, 0] (Figure 8.11(d)). After this computation step, another set of processes (that is, processes P0,2, P1,1, and P2,0) is ready to initiate communication (Figure 8.11(e)).


Figure 8.11. Pipelined Gaussian elimination for a 5 × 5 matrix with 25 processes.

All processes performing communication or computation during a particular iteration lie along a diagonal in the bottom-left to top-right direction (for example, P0,2, P1,1, and P2,0 performing communication in Figure 8.11(e) and P0,3, P1,2, and P2,1 performing computation in Figure 8.11(f)). As the parallel algorithm progresses, this diagonal moves toward the bottom-right corner of the logical 2-D mesh. Thus, the computation and communication for each iteration moves through the mesh from top-left to bottom-right as a “front.” After the front corresponding to a certain iteration passes through a process, the process is free to perform subsequent iterations. For instance, in Figure 8.11(g), after the front for k = 0 has passed P1,1, it initiates the iteration for k = 1 by sending A[1, 1] to P1,2. This initiates a front for k = 1, which closely follows the front for k = 0. Similarly, a third front for k = 2 starts at P2,2 (Figure 8.11(m)). Thus, multiple fronts that correspond to different iterations are active simultaneously.

Every step of an iteration, such as division, elimination, or transmitting a value to a neighboring process, is a constant-time operation. Therefore, a front moves a single step closer to the bottom-right corner of the matrix in constant time (equivalent to two steps of Figure 8.11). The front for k = 0 takes time Θ(n) to reach Pn−1,n−1 after its initiation at P0,0. The algorithm initiates n fronts for the n iterations of the outer loop. Each front lags behind the previous one by a single step. Thus, the last front passes the bottom-right corner of the matrix Θ(n) steps after the first one. The total time elapsed between the first front starting at P0,0 and the last one finishing is Θ(n). The procedure is complete after the last front passes the bottom-right corner of the matrix; hence, the total parallel run time is Θ(n). Since n2 processes are used, the cost of the pipelined version of Gaussian elimination is Θ(n3), which is the same as the sequential run time of the algorithm. Hence, the pipelined version of Gaussian elimination with 2-D partitioning is cost-optimal.

2-D Partitioning with Fewer than n2 Processes

Consider the case in which p processes are used such that p < n2 and the matrix is mapped onto a √p × √p mesh by using block 2-D partitioning. Figure 8.12 illustrates that a typical parallel Gaussian elimination iteration involves a rowwise and a columnwise communication of n/√p values. Figure 8.13(a) illustrates the load distribution in block 2-D mapping for n = 8 and p = 16.


Figure 8.12. The communication steps in the Gaussian elimination iteration corresponding to k = 3 for an 8 × 8 matrix on 16 processes of a two-dimensional mesh.


Figure 8.13. Computational load on different processes in block and cyclic 2-D mappings of an 8 × 8 matrix onto 16 processes during the Gaussian elimination iteration corresponding to k = 3.

Figures 8.12 and 8.13(a) show that a process containing a completely active part of the matrix performs n2/p multiplications and subtractions, and communicates n/√p words along its row and its column (ignoring the fact that in the last n/√p iterations, the active part of the matrix becomes smaller than the size of a block, and no process contains a completely active part of the matrix). If the pipelined version of the algorithm is used, the number of arithmetic operations per process (2n2/p) is an order of magnitude higher than the number of words communicated per process (Θ(n/√p)) in each iteration. Thus, for sufficiently large values of n2 with respect to p, the communication in each iteration is dominated by computation. Ignoring the communication cost and assuming that each scalar arithmetic operation takes unit time, the total parallel run time of this algorithm is (2n2/p) × n, which is equal to 2n3/p. The process-time product is 2n3, which is three times the cost of the serial algorithm (Equation 8.17). As a result, there is an upper bound of 1/3 on the efficiency of the parallel algorithm.

As in the case of a block 1-D mapping, the inefficiency of Gaussian elimination with a block 2-D partitioning of the matrix is due to process idling resulting from an uneven load distribution. Figure 8.13(a) shows the active part of an 8 × 8 matrix of coefficients in the iteration of the outer loop for k = 3 when the matrix is block 2-D partitioned among 16 processes. As shown in the figure, seven out of 16 processes are fully idle, five are partially loaded, and only four are fully active. By the time half of the iterations of the outer loop have been completed, only one-fourth of the processes are active. The remaining idle processes make the parallel algorithm much costlier than the sequential algorithm.

This problem can be alleviated if the matrix is partitioned in a 2-D cyclic fashion as shown in Figure 8.13(b). With the cyclic 2-D partitioning, the maximum difference in computational load between any two processes in any iteration is that of one row and one column update. For example, in Figure 8.13(b), n2/p matrix elements are active in the bottom-right process, and (n − √p)2/p elements are active in the top-left process. The difference in workload between any two processes is at most Θ(n/√p) in any iteration, which contributes Θ(n√p) to the overhead function. Since there are n iterations, the cumulative overhead due to process idling is only Θ(n2√p) with cyclic mapping, in contrast to Θ(n3) with block mapping (Problem 8.12). In practical parallel implementations of Gaussian elimination and LU factorization, a block-cyclic mapping is used to reduce the overhead due to message startup time associated with a pure cyclic mapping and to obtain better serial CPU utilization by performing block-matrix operations (Problem 8.15).
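
The contrast between the two mappings can be checked directly by counting, for each process, how many elements of the active part of the matrix (rows and columns k through n − 1, that is, the pivot row and column plus the update region) it owns. The short Python sketch below does this for the situation of Figure 8.13 (n = 8, p = 16, k = 3); the function names and counting convention are our own. The block mapping reproduces the seven idle processes mentioned above, while the cyclic mapping leaves no process idle and keeps the load difference small.

def owner_block(i, j, n, q):
    b = n // q                      # q = sqrt(p) processes per mesh dimension
    return (i // b, j // b)

def owner_cyclic(i, j, n, q):
    return (i % q, j % q)

def active_counts(owner, n, q, k):
    counts = {(r, c): 0 for r in range(q) for c in range(q)}
    for i in range(k, n):           # active rows and columns k .. n-1
        for j in range(k, n):
            counts[owner(i, j, n, q)] += 1
    return counts

n, q, k = 8, 4, 3
for name, owner in (("block", owner_block), ("cyclic", owner_cyclic)):
    counts = active_counts(owner, n, q, k)
    idle = sum(1 for c in counts.values() if c == 0)
    print(name, "idle processes:", idle,
          "max - min load:", max(counts.values()) - min(counts.values()))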

From the discussion in this section, we conclude that pipelined parallel Gaussian elimination for an n × n matrix takes time Θ(n3/p) on p processes with both 1-D and 2-D partitioning schemes. 2-D partitioning can use more processes (O(n2)) than 1-D partitioning (O(n)) for an n × n coefficient matrix. Hence, an implementation with 2-D partitioning is more scalable.

Gaussian Elimination with Partial Pivoting

The Gaussian elimination algorithm in Algorithm 8.4 fails if any diagonal entry A[k, k] of the matrix of coefficients is close to or equal to zero. To avoid this problem and to ensure the numerical stability of the algorithm, a technique called partial pivoting is used. At the beginning of the outer loop in the kth iteration, this method selects a column i (called the pivot column) such that A[k, i] is the largest in magnitude among all A[k, j] such that k ≤ j < n. It then exchanges the kth and the ith columns before starting the iteration. These columns can either be exchanged explicitly by physically moving them into each other’s locations, or they can be exchanged implicitly by simply maintaining an n × 1 permutation vector to keep track of the new indices of the columns of A. If partial pivoting is performed with an implicit exchange of column indices, then the factors L and U are not exactly triangular matrices, but columnwise permutations of triangular matrices.
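
A minimal serial sketch of this scheme is shown below, assuming the row-oriented elimination of Algorithm 8.4 with the pivot column chosen by magnitude and exchanged implicitly through a permutation vector. The function name is hypothetical, and the right-hand-side updates of Algorithm 8.4 are omitted; this is an illustration of the pivoting idea, not the book's code.

import numpy as np

def ge_column_pivoting(A):
    # Row-oriented Gaussian elimination with partial pivoting over columns.
    # Columns are exchanged implicitly via perm, so the unit upper-triangular
    # factor appears in the rows of A at the permuted column positions.
    A = A.astype(float).copy()
    n = A.shape[0]
    perm = list(range(n))            # perm[j] = original index of logical column j
    for k in range(n):
        # Pivot search: largest-magnitude element in the active part of row k.
        i = max(range(k, n), key=lambda j: abs(A[k, perm[j]]))
        perm[k], perm[i] = perm[i], perm[k]          # implicit column exchange
        pivot = A[k, perm[k]]
        for j in range(k + 1, n):                    # division step
            A[k, perm[j]] /= pivot
        for r in range(k + 1, n):                    # elimination step
            for j in range(k + 1, n):
                A[r, perm[j]] -= A[r, perm[k]] * A[k, perm[j]]
    return A, perm

rng = np.random.default_rng(4)
R, perm = ge_column_pivoting(rng.standard_normal((5, 5)))
# After pivoting, every entry of U above the diagonal has magnitude at most one.
assert all(abs(R[k, perm[j]]) <= 1 + 1e-12 for k in range(5) for j in range(k + 1, 5))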

Assuming that columns are exchanged explicitly, the value of A[k, k] used as the divisor on line 6 of Algorithm 8.4 (after exchanging columns k and i) is greater than or equal in magnitude to any A[k, j] that it divides in the kth iteration. Partial pivoting in Algorithm 8.4 results in a unit upper-triangular matrix in which all elements above the principal diagonal have an absolute value of at most one.

1-D Partitioning

Performing partial pivoting is straightforward with rowwise partitioning as discussed in Section 8.3.1. Before performing the divide operation in the kth iteration, the process storing the kth row makes a comparison pass over the active portion of this row, and selects the element with the largest absolute value as the divisor. This element determines the pivot column, and all processes must know the index of this column. This information can be passed on to the rest of the processes along with the modified (after the division) elements of the kth row. The combined pivot-search and division step takes time Θ(n − k − 1) in the kth iteration, as in the case of Gaussian elimination without pivoting. Thus, partial pivoting has no significant effect on the performance of Algorithm 8.4 if the coefficient matrix is partitioned along the rows.

Now consider a columnwise 1-D partitioning of the coefficient matrix. In the absence of pivoting, parallel implementations of Gaussian elimination with rowwise and columnwise 1-D partitioning are almost identical (Problem 8.9). However, the two are significantly different if partial pivoting is performed.

The first difference is that, unlike rowwise partitioning, the pivot search is distributed in columnwise partitioning. If the matrix size is n × n and the number of processes is p, then the pivot search in columnwise partitioning involves two steps. During pivot search for the kth iteration, first each process determines the maximum of the n/p (or fewer) elements of the kth row that it stores. The next step is to find the maximum of the resulting p (or fewer) values, and to distribute the maximum among all processes. Each pivot search takes time Θ(n/p) + Θ(log p). For sufficiently large values of n with respect to p, this is less than the time Θ(n) it takes to perform a pivot search with rowwise partitioning. This seems to suggest that a columnwise partitioning is better for partial pivoting than a rowwise partitioning. However, the following factors favor rowwise partitioning.
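
The two-step search is easy to see in a serial simulation. In the sketch below (our own, with made-up data), each of p hypothetical processes owns a contiguous block of n/p columns; step one finds the local maximum of row k in each block, and step two takes the maximum of the p candidate (magnitude, index) pairs, which in a real implementation would be a reduction over the processes.

import numpy as np

def distributed_pivot_search(row_k, k, p):
    n = len(row_k)
    chunk = n // p                                 # assume p divides n
    candidates = []
    for proc in range(p):                          # step 1: local maxima
        lo = max(k, proc * chunk)                  # only the active part j >= k
        hi = (proc + 1) * chunk
        if lo < hi:
            j = max(range(lo, hi), key=lambda j: abs(row_k[j]))
            candidates.append((abs(row_k[j]), j))
    return max(candidates)[1]                      # step 2: global maximum

rng = np.random.default_rng(3)
row = rng.standard_normal(16)
k = 5
assert distributed_pivot_search(row, k, p=4) == k + int(np.argmax(np.abs(row[k:])))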

Figure 8.7 shows how communication and computation “fronts” move from top to bottom in the pipelined version of Gaussian elimination with rowwise 1-D partitioning. Similarly, the communication and computation fronts move from left to right in the case of columnwise 1-D partitioning. This means that the (k + 1)th row is not ready for pivot search for the (k + 1)th iteration (that is, it is not fully updated) until the front corresponding to the kth iteration reaches the rightmost process. As a result, the (k + 1)th iteration cannot start until the entire kth iteration is complete. This effectively eliminates pipelining, and we are therefore forced to use the synchronous version with poor efficiency.

While performing partial pivoting, columns of the coefficient matrix may or may not be explicitly exchanged. In either case, the performance of Algorithm 8.4 is adversely affected with columnwise 1-D partitioning. Recall that cyclic or block-cyclic mappings result in a better load balance in Gaussian elimination than a block mapping. A cyclic mapping ensures that the active portion of the matrix is almost uniformly distributed among the processes at every stage of Gaussian elimination. If pivot columns are not exchanged explicitly, then this condition may cease to hold. After a pivot column is used, it no longer stays in the active portion of the matrix. As a result of pivoting without explicit exchange, columns are arbitrarily removed from the different processes’ active portions of the matrix. This randomness may disturb the uniform distribution of the active portion. On the other hand, if columns belonging to different processes are exchanged explicitly, then this exchange requires communication between the processes. A rowwise 1-D partitioning neither requires communication for exchanging columns, nor does it lose the load-balance if columns are not exchanged explicitly.

2-D Partitioning

In the case of 2-D partitioning of the coefficient matrix, partial pivoting seriously restricts pipelining, although it does not completely eliminate it. Recall that in the pipelined version of Gaussian elimination with 2-D partitioning, fronts corresponding to various iterations move from top-left to bottom-right. The pivot search for the (k + 1)th iteration can commence as soon as the front corresponding to the kth iteration has moved past the diagonal of the active matrix joining its top-right and bottom-left corners.

Thus, partial pivoting may lead to considerable performance degradation in parallel Gaussian elimination with 2-D partitioning. If numerical considerations allow, it may be possible to reduce the performance loss due to partial pivoting. We can restrict the search for the pivot in the kth iteration to a band of q columns (instead of all n − k columns). In this case, the ith column is selected as the pivot in the kth iteration if A[k, i] is the largest element in a band of q elements of the active part of the kth row. This restricted partial pivoting not only reduces the communication cost, but also permits limited pipelining. By restricting the number of columns for pivot search to q, an iteration can start as soon as the previous iteration has updated the first q + 1 columns.
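
Restricting the search is a one-line change to the pivot-selection step. The helper below (a hypothetical Python function of our own) picks the pivot column for iteration k from a band of at most q columns of the active part of the row.

def restricted_pivot_column(row_k, k, q):
    # Search only columns k .. min(k + q, n) - 1 instead of the whole active row.
    band_end = min(k + q, len(row_k))
    return max(range(k, band_end), key=lambda j: abs(row_k[j]))

# With q = 2, only columns 1 and 2 are candidates, so column 2 is chosen even
# though column 3 holds the overall largest element.
print(restricted_pivot_column([0.1, 3.0, -7.0, 9.0], k=1, q=2))    # prints 2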

Another way to get around the loss of pipelining due to partial pivoting in Gaussian elimination with 2-D partitioning is to use fast algorithms for one-to-all broadcast, such as those described in Section 4.7.1. With 2-D partitioning of the n × n coefficient matrix on p processes, a process spends time Θ(n/√p) in communication in each iteration of the pipelined version of Gaussian elimination. Disregarding the message startup time ts, a non-pipelined version that performs explicit one-to-all broadcasts using the algorithm of Section 4.1 spends time Θ((n/√p) log p) communicating in each iteration. This communication time is higher than that of the pipelined version. The one-to-all broadcast algorithms described in Section 4.7.1 take time Θ(n/√p) in each iteration (disregarding the startup time). This time is asymptotically equal to the per-iteration communication time of the pipelined algorithm. Hence, using a smart algorithm to perform one-to-all broadcast, even non-pipelined parallel Gaussian elimination can attain performance comparable to that of the pipelined algorithm. However, the one-to-all broadcast algorithms described in Section 4.7.1 split a message into smaller parts and route them separately. For these algorithms to be effective, the sizes of the messages should be large enough; that is, n should be large compared to p.

Although pipelining and pivoting do not go together in Gaussian elimination with 2-D partitioning, the discussion of 2-D partitioning in this section is still useful. With some modification, it applies to the Cholesky factorization algorithm (Algorithm 8.6 in Problem 8.16), which does not require pivoting. Cholesky factorization applies only to symmetric, positive definite matrices. A real n × n matrix A is positive definite if xT Ax > 0 for any n × 1 nonzero, real vector x. The communication pattern in Cholesky factorization is quite similar to that of Gaussian elimination (Problem 8.16), except that, due to symmetric lower and upper-triangular halves in the matrix, Cholesky factorization uses only one triangular half of the matrix.

Solving a Triangular System: Back-Substitution

We now briefly discuss the second stage of solving a system of linear equations. After the full matrix A has been reduced to an upper-triangular matrix U with ones along the principal diagonal, we perform back-substitution to determine the vector x. A sequential back-substitution algorithm for solving an upper-triangular system of equations Ux = y is shown in Algorithm 8.5.

Starting with the last equation, each iteration of the main loop (lines 3–8) of Algorithm 8.5 computes the value of a variable and substitutes the variable’s value back into the remaining equations. The program performs approximately n2/2 multiplications and subtractions. Note that the number of arithmetic operations in back-substitution is less than that in Gaussian elimination by a factor of Θ(n). Hence, if back-substitution is used in conjunction with Gaussian elimination, it is best to use the matrix partitioning scheme that is the most efficient for parallel Gaussian elimination.

Example 8.5. A serial algorithm for back-substitution. U is an upper-triangular matrix with all entries of the principal diagonal equal to one, and all subdiagonal entries equal to zero.

1.   procedure BACK_SUBSTITUTION (U, x, y) 
2.   begin 
3.      for k := n − 1 downto 0 do /* Main loop */ 
4.          begin 
5.             x[k] := y[k]; 
6.             for i := k − 1 downto 0 do 
7.                 y[i] := y[i] − x[k] × U[i, k]; 
8.          endfor; 
9.   end BACK_SUBSTITUTION 
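
The pseudocode translates directly into a runnable form. The Python sketch below is our own transcription of Algorithm 8.5, together with a quick check against a randomly generated unit upper-triangular system; like the pseudocode, it overwrites y, so the function works on a copy.

import numpy as np

def back_substitution(U, y):
    # Solve U x = y for a unit upper-triangular U (ones on the diagonal),
    # following the structure of Algorithm 8.5.
    n = len(y)
    y = np.array(y, dtype=float)
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):          # main loop, last equation first
        x[k] = y[k]
        for i in range(k - 1, -1, -1):      # substitute x[k] back
            y[i] -= x[k] * U[i][k]
    return x

rng = np.random.default_rng(0)
U = np.triu(rng.standard_normal((6, 6)), 1) + np.eye(6)     # unit upper triangular
xs = rng.standard_normal(6)
assert np.allclose(back_substitution(U, U @ xs), xs)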

Consider a rowwise block 1-D mapping of the n × n matrix U onto p processes. Let the vector y be distributed uniformly among all the processes. The value of the variable solved in a typical iteration of the main loop (line 3) must be sent to all the processes with equations involving that variable. This communication can be pipelined (Problem 8.22). If so, the time to perform the computations of an iteration dominates the time that a process spends in communication in an iteration. In every iteration of a pipelined implementation, a process receives (or generates) the value of a variable and sends that value to another process. Using the value of the variable solved in the current iteration, a process also performs up to n/p multiplications and subtractions (lines 6 and 7). Hence, each step of a pipelined implementation requires a constant amount of time for communication and time Θ(n/p) for computation. The algorithm terminates in Θ(n) steps (Problem 8.22), and the parallel run time of the entire algorithm is Θ(n2/p).

If the matrix is partitioned by using 2-D partitioning on a √p × √p logical mesh of processes, and the elements of the vector are distributed along one of the columns of the process mesh, then only the √p processes containing the vector perform any computation. Using pipelining to communicate the appropriate elements of U to the process containing the corresponding elements of y for the substitution step (line 7), the algorithm can be executed in time Θ(n2/√p) (Problem 8.22). Thus, the cost of parallel back-substitution with 2-D mapping is Θ(n2√p). The algorithm is not cost-optimal because its sequential cost is only Θ(n2). However, the entire process of solving the linear system, including upper-triangularization using Gaussian elimination, is still cost-optimal for p = O(n2) because the sequential complexity of the entire process is Θ(n3).

Numerical Considerations in Solving Systems of Linear Equations

A system of linear equations of the form Ax = b can be solved by using a factorization algorithm to express A as the product of a lower-triangular matrix L, and a unit upper-triangular matrix U. The system of equations is then rewritten as LU x = b, and is solved in two steps. First, the lower-triangular system Ly = b is solved for y. Second, the upper-triangular system Ux = y is solved for x.
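
The two-stage procedure can be sketched with standard library routines. Note that scipy's lu follows the opposite convention from the text (its L is unit lower triangular and its U is a general upper triangular matrix), but the structure of two triangular solves is the same; the example below is illustrative only.

import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)

P, L, U = lu(A)                                                      # A = P L U
y = solve_triangular(L, P.T @ b, lower=True, unit_diagonal=True)     # L y = P^T b
x = solve_triangular(U, y, lower=False)                              # U x = y
assert np.allclose(A @ x, b)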

The Gaussian elimination algorithm given in Algorithm 8.4 effectively factorizes A into L and U. However, it also solves the lower-triangular system Ly = b on the fly by means of steps on lines 7 and 13. Algorithm 8.4 gives what is called a row-oriented Gaussian elimination algorithm. In this algorithm, multiples of rows are subtracted from other rows. If partial pivoting, as described in Section 8.3.2, is incorporated into this algorithm, then the resulting upper-triangular matrix U has all its elements less than or equal to one in magnitude. The lower-triangular matrix L, whether implicit or explicit, may have elements with larger numerical values. While solving the system Ax = b, the triangular system Ly = b is solved first. If L contains large elements, then rounding errors can occur while solving for y due to the finite precision of floating-point numbers in the computer. These errors in y are propagated through the solution of Ux = y.

An alternate form of Gaussian elimination is the column-oriented form that can be obtained from Algorithm 8.4 by reversing the roles of rows and columns. In the column-oriented algorithm, multiples of columns are subtracted from other columns, pivot search is also performed along the columns, and numerical stability is guaranteed by row interchanges, if needed. All elements of the lower-triangular matrix L generated by the column-oriented algorithm have a magnitude less than or equal to one. This minimizes numerical error while solving Ly = b, and results in a significantly smaller error in the overall solution than the row-oriented algorithm. Algorithm 3.3 gives a procedure for column-oriented LU factorization.
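
A compact way to realize the column-oriented scheme with row interchanges is ordinary LU factorization with partial pivoting, sketched below in Python. This is our own code, not Algorithm 3.3, and the trailing update is written as a rank-1 update rather than explicit column-by-column subtractions, but the resulting factors are the same: L is unit lower triangular and every multiplier stored in it has magnitude at most one.

import numpy as np

def lu_row_pivoting(A):
    # Column-oriented LU factorization with partial pivoting by rows.
    # Returns perm, L, U such that A[perm] = L U.
    A = A.astype(float).copy()
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(n):
        # Pivot search down column k; exchange rows so |A[k, k]| is maximal.
        r = k + np.argmax(np.abs(A[k:, k]))
        A[[k, r]] = A[[r, k]]
        perm[[k, r]] = perm[[r, k]]
        # Multipliers (entries of L) have magnitude <= 1 after the exchange.
        A[k+1:, k] /= A[k, k]
        # Rank-1 update of the trailing submatrix.
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return perm, L, U

A = np.random.default_rng(1).standard_normal((5, 5))
perm, L, U = lu_row_pivoting(A)
assert np.allclose(A[perm], L @ U)
assert np.all(np.abs(L) <= 1 + 1e-12)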

From a practical point of view, the column-oriented Gaussian elimination algorithm is more useful than the row-oriented algorithm. We have chosen to present the row-oriented algorithm in detail in this chapter because it is more intuitive. It is easy to see that the system of linear equations resulting from the subtraction of a multiple of an equation from other equations is equivalent to the original system. The entire discussion on the row-oriented algorithm of Algorithm 8.4 presented in this section applies to the column-oriented algorithm with the roles of rows and columns reversed. For example, columnwise 1-D partitioning is more suitable than rowwise 1-D partitioning for the column-oriented algorithm with partial pivoting.

Bibliographic Remarks

Matrix transposition with 1-D partitioning is essentially an all-to-all personalized communication problem [Ede89]. Hence, all the references in Chapter 4 for all-to-all personalized communication apply directly to matrix transposition. The recursive transposition algorithm, popularly known as RTA, was first reported by Eklundh [Ekl72]. Its adaptations for hypercubes have been described by Bertsekas and Tsitsiklis [BT97], Fox and Furmanski [FF86], Johnsson [Joh87], and McBryan and Van de Velde [MdV87] for one-port communication on each process. Johnsson [Joh87] also discusses parallel RTA for hypercubes that permit simultaneous communication on all channels. Further improvements on the hypercube RTA have been suggested by Ho and Raghunath [HR91], Johnsson and Ho [JH88], Johnsson [Joh90], and Stout and Wagar [SW87].

A number of sources of parallel dense linear algebra algorithms, including those for matrix-vector multiplication and matrix multiplication, are available [CAHH91, GPS90, GL96a, Joh87, Mod88, OS85]. Since dense matrix multiplication is highly computationally intensive, there has been a great deal of interest in developing parallel formulations of this algorithm and in testing its performance on various parallel architectures [Akl89, Ber89, CAHH91, Can69, Cha79, CS88, DNS81, dV89, FJL+88, FOH87, GK91, GL96a, Hip89, HJE91, Joh87, PV80, Tic88]. Some of the early parallel formulations of matrix multiplication were developed by Cannon [Can69], Dekel, Nassimi, and Sahni [DNS81], and Fox et al. [FOH87]. Variants and improvements of these algorithms have been presented by Berntsen [Ber89], and by Ho, Johnsson, and Edelman [HJE91]. In particular, Berntsen [Ber89] presents an algorithm that has strictly smaller communication overhead than Cannon’s algorithm, but has a smaller degree of concurrency. Ho, Johnsson, and Edelman [HJE91] present another variant of Cannon’s algorithm for a hypercube that permits communication on all channels simultaneously. This algorithm, while reducing communication, also reduces the degree of concurrency. Gupta and Kumar [GK91] present a detailed scalability analysis of several matrix multiplication algorithms. They present an analysis to determine the best algorithm to multiply two n × n matrices on a p-process hypercube for different ranges of n, p and the hardware-related constants. They also show that the improvements suggested by Berntsen and Ho et al. do not improve the overall scalability of matrix multiplication on a hypercube.

Parallel algorithms for LU factorization and solving dense systems of linear equations have been discussed by several researchers [Ber84, BT97, CG87, Cha87, Dav86, DHvdV93, FJL+88, Gei85, GH86, GPS90, GR88, Joh87, LD90, Lei92, Mod88, Mol86, OR88, Ort88, OS86, PR85, Rob90, Saa86, Vav89]. Geist and Heath [GH85, GH86], and Heath [Hea85] specifically concentrate on parallel dense Cholesky factorization. Parallel algorithms for solving triangular systems have also been studied in detail [EHHR88, HR88, LC88, LC89, RO88, Rom87]. Demmel, Heath, and van der Vorst [DHvdV93] present a comprehensive survey of parallel matrix computations considering numerical implications in detail.

A portable software implementation of all matrix and vector operations discussed in this chapter, and many more, is available as PBLAS (parallel basic linear algebra subroutines) [C+95]. The ScaLAPACK library [B+97] uses PBLAS to implement a variety of linear algebra routines of practical importance, including procedures for various methods of matrix factorizations and solving systems of linear equations.

Problems

  • 8.1 Consider the two algorithms for all-to-all personalized communication in Section 4.5.3. Which method would you use on a 64-node parallel computer with Θ(p) bisection width for transposing a 1024 × 1024 matrix with the 1-D partitioning if ts = 100µs and tw = 1µs? Why?

  • 8.2 Describe a parallel formulation of matrix-vector multiplication in which the matrix is 1-D block-partitioned along the columns and the vector is equally partitioned among all the processes. Show that the parallel run time is the same as in case of rowwise 1-D block partitioning.

    Hint: The basic communication operation used in the case of columnwise 1-D partitioning is all-to-all reduction, as opposed to all-to-all broadcast in the case of rowwise 1-D partitioning. Problem 4.8 describes all-to-all reduction.

  • 8.3 Section 8.1.2 describes and analyzes matrix-vector multiplication with 2-D partitioning. For the case in which fewer than n2 processes are used, suggest ways of improving the parallel run time. Is the improved method more scalable than the one used in Section 8.1.2?

  • 8.4 The overhead function for multiplying an n × n 2-D partitioned matrix with an n × 1 vector using p processes is ts p log p + tw n√p log p (Equation 8.8). Substituting this expression in Equation 5.14 yields a quadratic equation in n. Using this equation, determine the precise isoefficiency function for the parallel algorithm and compare it with Equations 8.9 and 8.10. Does this comparison alter the conclusion that the term associated with tw is responsible for the overall isoefficiency function of this parallel algorithm?

  • 8.5 Strassen’s method [AHU74, CLR90] for matrix multiplication is an algorithm based on the divide-and-conquer technique. The sequential complexity of multiplying two n × n matrices using Strassen’s algorithm is Θ(n2.81). Consider the simple matrix multiplication algorithm (Section 8.2.1) for multiplying two n × n matrices using p processes. Assume that the (n/√p) × (n/√p) submatrices are multiplied using Strassen’s algorithm at each process. Derive an expression for the parallel run time of this algorithm. Is the parallel algorithm cost-optimal?

  • 8.6 (DNS algorithm with fewer than n3 processes [DNS81]) Section 8.2.3 describes a parallel formulation of the DNS algorithm that uses fewer than n3 processes. Another variation of this algorithm works with p = n2q processes, where 1 ≤ q ≤ n. Here the process arrangement is regarded as a q × q × q logical three-dimensional array of “superprocesses,” in which each superprocess is an (n/q) × (n/q) mesh of processes. This variant can be viewed as identical to the block variant described in Section 8.2.3, except that the role of each process is now assumed by an (n/q) × (n/q) logical mesh of processes. This means that each block multiplication of (n/q) × (n/q) submatrices is performed in parallel by (n/q)2 processes rather than by a single process. Any of the algorithms described in Sections 8.2.1 or 8.2.2 can be used to perform this multiplication.

    Derive an expression for the parallel run time for this variant of the DNS algorithm in terms of n, p, ts, and tw. Compare the expression with Equation 8.16. Discuss the relative merits and drawbacks of the two variations of the DNS algorithm for fewer than n3 processes.

  • 8.7 Figure 8.7 shows that the pipelined version of Gaussian elimination requires 16 steps for a 5 × 5 matrix partitioned rowwise on five processes. Show that, in general, the algorithm illustrated in this figure completes in 4(n − 1) steps for an n × n matrix partitioned rowwise with one row assigned to each process.

  • 8.8 Describe in detail a parallel implementation of the Gaussian elimination algorithm of Algorithm 8.4 without pivoting if the n × n coefficient matrix is partitioned columnwise among p processes. Consider both pipelined and non-pipelined implementations. Also consider the cases p = n and p < n.

    Hint: The parallel implementation of Gaussian elimination described in Section 8.3.1 shows horizontal and vertical communication on a logical two-dimensional mesh of processes (Figure 8.12). A rowwise partitioning requires only the vertical part of this communication. Similarly, columnwise partitioning performs only the horizontal part of this communication.

  • 8.9 Derive expressions for the parallel run times of all the implementations in Problem 8.8. Is the run time of any of these parallel implementations significantly different from the corresponding implementation with rowwise 1-D partitioning?

  • 8.10 Rework Problem 8.9 with partial pivoting. In which implementations are the parallel run times significantly different for rowwise and columnwise partitioning?

  • 8.11 Show that Gaussian elimination on an n × n matrix 2-D partitioned on an n × n logical mesh of processes is not cost-optimal if the 2n one-to-all broadcasts are performed synchronously.

  • 8.12 Show that the cumulative idle time over all the processes in the Gaussian elimination algorithm is Θ(n3) for a block mapping, whether the n × n matrix is partitioned along one or both dimensions. Show that this idle time is reduced to Θ(n2 p) for cyclic 1-D mapping and Θ(n2√p) for cyclic 2-D mapping.

  • 8.13 Prove that the isoefficiency function of the asynchronous version of the Gaussian elimination with 2-D mapping is Θ(p3/2) if pivoting is not performed.

  • 8.14 Derive precise expressions for the parallel run time of Gaussian elimination with and without partial pivoting if the n × n matrix of coefficients is partitioned among p processes of a logical square two-dimensional mesh in the following formats:

    1. Rowwise block 1-D partitioning.

    2. Rowwise cyclic 1-D partitioning.

    3. Columnwise block 1-D partitioning.

    4. Columnwise cyclic 1-D partitioning.

  • 8.15 Rewrite Algorithm 8.4 in terms of block matrix operations as discussed at the beginning of Section 8.2. Consider Gaussian elimination of an n × n matrix partitioned into a q × q array of submatrices, where each submatrix is of size n/q × n/q. This array of blocks is mapped onto a logical √p × √p mesh of processes in a cyclic manner, resulting in a 2-D block-cyclic mapping of the original matrix onto the mesh. Assume that √p ≤ q ≤ n and that n is divisible by q, which in turn is divisible by √p. Derive expressions for the parallel run time for both synchronous and pipelined versions of Gaussian elimination.

    Hint: The division step A[k, j] := A[k, j]/A[k, k] is replaced by the submatrix operation Ak,j := (Ak,k)−1 Ak,j, where Ak,k is the lower triangular part of the kth diagonal submatrix.

  • 8.16 (Cholesky factorization) Algorithm 8.6 describes a row-oriented version of the Cholesky factorization algorithm for factorizing a symmetric positive definite matrix into the form A = UTU. Cholesky factorization does not require pivoting. Describe a pipelined parallel formulation of this algorithm that uses 2-D partitioning of the matrix on a square mesh of processes. Draw a picture similar to Figure 8.11.

    Example 8.6. A row-oriented Cholesky factorization algorithm.

    1.   procedure CHOLESKY (A) 
    2.   begin 
    3.      for k := 0 to n − 1 do 
    4.          begin 
    5.             A[k, k] := √A[k, k]; 
    6.             for j := k + 1 to n − 1 do 
    7.                 A[k, j] := A[k, j]/A[k, k]; 
    8.             for i := k + 1 to n − 1 do 
    9.                 for j := i to n − 1 do 
    10.                    A[i, j] := A[i, j] − A[k, i] × A[k, j]; 
    11.         endfor;       /* Line 3 */ 
    12.  end CHOLESKY 
  • 8.17 (Scaled speedup) Scaled speedup (Section 5.7) is defined as the speedup obtained when the problem size is increased linearly with the number of processes; that is, if W is chosen as a base problem size for a single process, then

    Equation 8.19. 

    scaled speedup = TS(p · W)/TP(p · W, p), where TS(p · W) and TP(p · W, p) denote the serial run time and the parallel run time on p processes, respectively, for a problem of size p · W.

    For the simple matrix multiplication algorithm described in Section 8.2.1, plot the standard and scaled speedup curves for the base problem of multiplying 16 × 16 matrices. Use p = 1, 4, 16, 64, and 256. Assume that ts = 10 and tw = 1 in Equation 8.14.

  • 8.18 Plot a third speedup curve for Problem 8.17, in which the problem size is scaled up according to the isoefficiency function, which is Θ(p3/2). Use the same values of ts and tw.

    Hint: Under this method of scaling, the problem size n3 grows in proportion to p3/2 (for the base problem of Problem 8.17, this corresponds to n = 16√p); substitute this value of n into the expression for the parallel run time (Equation 8.14) to obtain the scaled speedup.
  • 8.19 Plot the efficiency curves for the simple matrix multiplication algorithm corresponding to the standard speedup curve (Problem 8.17), the scaled speedup curve (Problem 8.17), and the speedup curve when the problem size is increased according to the isoefficiency function (Problem 8.18).

  • 8.20 A drawback of increasing the number of processes without increasing the total workload is that the speedup does not increase linearly with the number of processes, and the efficiency drops monotonically. Based on your experience with Problems 8.17 and 8.19, discuss whether using scaled speedup instead of standard speedup solves the problem in general. What can you say about the isoefficiency function of a parallel algorithm whose scaled speedup curve matches the speedup curve determined by increasing the problem size according to the isoefficiency function?

  • 8.21 (Time-constrained scaling) Assume that ts = 10 and tw = 1 in the expression of parallel execution time (Equation 8.14) of the matrix-multiplication algorithm discussed in Section 8.2.1. For p = 1, 4, 16, 64, 256, 1024, and 4096, what is the largest problem that can be solved if the total run time is not to exceed 512 time units? In general, is it possible to solve an arbitrarily large problem in a fixed amount of time, provided that an unlimited number of processes is available? Give a brief explanation.

  • 8.22 Describe a pipelined algorithm for performing back-substitution to solve a triangular system of equations of the form Ux = y, where the n × n unit upper-triangular matrix U is 2-D partitioned onto an n × n mesh of processes. Give an expression for the parallel run time of the algorithm. Modify the algorithm to work on fewer than n2 processes, and derive an expression for the parallel execution time of the modified algorithm.

  • 8.23 Consider the parallel algorithm given in Algorithm 8.7 for multiplying two n × n matrices A and B to obtain the product matrix C. Assume that it takes time tlocal for a memory read or write operation on a matrix element and time tc to add and multiply two numbers. Determine the parallel run time for this algorithm on an n2-processor CREW PRAM. Is this parallel algorithm cost-optimal?

  • 8.24 Assuming that concurrent read accesses to a memory location are serialized on an EREW PRAM, derive the parallel run time of the algorithm given in Algorithm 8.7 on an n2-processor EREW PRAM. Is this algorithm cost-optimal on an EREW PRAM?

    Example 8.7. An algorithm for multiplying two n × n matrices A and B on a CREW PRAM, yielding matrix C = A × B.

    1.   procedure MAT_MULT_CREW_PRAM (A, B, C, n) 
    2.   begin 
    3.      Organize the n2 processes into a logical mesh of n × n; 
    4.      for each process Pi,j do 
    5.      begin 
    6.         C[i, j] := 0; 
    7.         for k := 0 to n − 1 do 
    8.             C[i, j] := C[i, j] + A[i, k] × B[k, j]; 
    9.      endfor; 
    10.  end MAT_MULT_CREW_PRAM 
  • 8.25 Consider a shared-address-space parallel computer with n2 processors. Assume that each processor has some local memory, and A[i, j] and B[i, j] are stored in the local memory of processor Pi,j. Furthermore, processor Pi,j computes C[i, j] in its local memory. Assume that it takes time tlocal + tw to perform a read or write operation on nonlocal memory and time tlocal on local memory. Derive an expression for the parallel run time of the algorithm in Algorithm 8.7 on this parallel computer.

  • 8.26 Algorithm 8.7 can be modified so that the parallel run time on an EREW PRAM is less than that in Problem 8.24. The modified program is shown in Algorithm 8.8. What is the parallel run time of Algorithm 8.8 on an EREW PRAM and a shared-address-space parallel computer with memory access times as described in Problems 8.24 and 8.25? Is the algorithm cost-optimal on these architectures?

  • 8.27 Consider an implementation of Algorithm 8.8 on a shared-address-space parallel computer with fewer than n2 (say, p) processors and with memory access times as described in Problem 8.25. What is the parallel run time?

    Example 8.8. An algorithm for multiplying two n × n matrices A and B on an EREW PRAM, yielding matrix C = A × B.

    1.   procedure MAT_MULT_EREW_PRAM (A, B, C, n) 
    2.   begin 
    3.      Organize the n2 processes into a logical mesh of n × n; 
    4.      for each process Pi,j do 
    5.      begin 
    6.         C[i, j] := 0; 
    7.         for k := 0 to n − 1 do 
    8.             C[i, j] := C[i, j] + 
                              A[i, (i + j + k) mod n] × B[(i + j + k) mod n, j]; 
    9.      endfor; 
    10.  end MAT_MULT_EREW_PRAM 
  • 8.28 Consider the implementation of the parallel matrix multiplication algorithm presented in Section 8.2.1 on a shared-address-space computer with memory access times as given in Problem 8.25. In this algorithm, each processor first receives all the data it needs into its local memory, and then performs the computation. Derive the parallel run time of this algorithm. Compare the performance of this algorithm with that in Problem 8.27.

  • 8.29 Use the results of Problems 8.23–8.28 to comment on the viability of the PRAM model as a platform for parallel algorithm design. Also comment on the relevance of the message-passing model for shared-address-space computers.
