Matrix multiplication is commonly used to demonstrate program optimization in high-performance computing. In its most basic form, matrix multiplication is a toy program consisting of a single multiply-accumulate statement buried inside a triply nested loop. Unlike most toy programs, however, matrix multiplication has a more serious side. First, this operation is the single most important building block for dense linear algebra computations. Second, if you work at it hard enough, matrix multiplication can run near the peak performance of most modern processors. Therefore, it presents many opportunities for performance optimization, and making it fast has practical value.
In this discussion, we will start with a simple matrix multiplication function written for a CPU, convert the code into OpenCL, and then explore a series of code transformations that take advantage of the OpenCL memory model to optimize the program. Along the way, we will report performance results for the multiplication of square matrices of order 1000. All the results in this chapter were generated on an Intel Core 2 Duo CPU (T8300) running at 2.4GHz and an NVIDIA GeForce 8600M GT GPU. The OpenCL release used in the benchmarks was provided by Apple as part of the Snow Leopard release of OS X.
A dense matrix is a rectangular array of numbers. It is called “dense” because the numbers are almost always non-zero; hence there is no reason to keep track of zeros to reduce storage or accelerate computations. Throughout this discussion, we will consider three matrices: A (N by P), B (P by M), and C (N by M). Multiplication of A times B with the result added into C is mathematically defined as
C(i, j) = C(i, j) + Σ (k = 0 … P−1) A(i, k) × B(k, j),   for 0 ≤ i < N and 0 ≤ j < M
Matrix multiplication, as shown in Figure 21.1 for the (i, j)th element of C, consists of a dot product of the ith row of A with the jth column of B. A program to carry out this computation is shown in Listing 21.1. It consists of a doubly nested loop that runs over the elements of the C matrix, with an inner loop to compute the appropriate dot product.
In the listing we represent the matrices as one-dimensional arrays, with explicit index algebra to map each pair of indices onto a single index. This is a common technique when dealing with multidimensional arrays. Compiled with GCC using default settings, the program ran at 167 MFLOPS.
void seq_mat_mul(
    int Mdim, int Ndim, int Pdim,  // matrix dimensions
    float *A, float *B, float *C)  // C = A * B
{
    int i, j, k;
    float tmp;
    for (i = 0; i < Ndim; i++) {
        for (j = 0; j < Mdim; j++) {
            tmp = 0.0f;
            for (k = 0; k < Pdim; k++) {
                // C[i][j] += A[i][k] * B[k][j]
                tmp += A[i*Pdim + k] * B[k*Mdim + j];
            }
            C[i*Mdim + j] = tmp;
        }
    }
}
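The explicit index algebra deserves a quick sanity check. The helpers below are illustrative only (`flat_index` and `mapping_is_bijective` are not part of the chapter's code); they confirm that the row-major mapping i*ncols + j touches each element of an array exactly once:

```c
#include <assert.h>

/* Row-major mapping: element (i, j) of a matrix with ncols columns
   lives at flat index i*ncols + j. */
static int flat_index(int i, int j, int ncols)
{
    return i * ncols + j;
}

/* Returns 1 if the mapping visits every slot of a small
   nrows-by-ncols array exactly once, 0 otherwise. */
static int mapping_is_bijective(int nrows, int ncols)
{
    int hits[64] = {0};
    int i, j, idx;
    if (nrows * ncols > 64)
        return 0;
    for (i = 0; i < nrows; i++)
        for (j = 0; j < ncols; j++)
            hits[flat_index(i, j, ncols)]++;
    for (idx = 0; idx < nrows * ncols; idx++)
        if (hits[idx] != 1)
            return 0;
    return 1;
}
```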
The matrix multiplication program from Listing 21.1 can be directly converted into an OpenCL kernel. The result is shown in Listing 21.2. Each work-item is assigned an element of the product matrix to compute. The outer two loops over i and j are deleted and replaced with calls to a function to find the global ID for the work-item in each of the two dimensions. Care is taken to make sure the resulting work-item IDs fit within the C matrix, at which point the dot product is carried out to compute the matrix product for element i, j of the C matrix. Note that all three matrices are left in global memory. The resulting performance on the GPU was 511 MFLOPS, and the identical code on the CPU was 744 MFLOPS.
const char *C_elem_KernelSource =
"__kernel void mmul(                          \n"
"    const int Mdim,                          \n"
"    const int Ndim,                          \n"
"    const int Pdim,                          \n"
"    __global float* A,                       \n"
"    __global float* B,                       \n"
"    __global float* C)                       \n"
"{                                            \n"
"    int k;                                   \n"
"    int i = get_global_id(0);                \n"
"    int j = get_global_id(1);                \n"
"    float tmp;                               \n"
"    if ((i < Ndim) && (j < Mdim)) {          \n"
"        tmp = 0.0f;                          \n"
"        for (k = 0; k < Pdim; k++)           \n"
"            tmp += A[i*Pdim+k] * B[k*Mdim+j];\n"
"        C[i*Mdim+j] = tmp;                   \n"
"    }                                        \n"
"}                                            \n";
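The bounds guard in the kernel matters because a host program may round the global size up to a multiple of the work-group size (OpenCL requires this whenever local dimensions are specified), launching extra work-items beyond the matrix edge. The plain-C sketch below simulates this; the helper names `round_up` and `count_active` and the work-group size of 16 are hypothetical, chosen only to illustrate the guard:

```c
#include <assert.h>
#include <stddef.h>

/* Round a global size up to a multiple of the work-group size, as a
   host program must when it specifies local dimensions. */
static size_t round_up(size_t global, size_t local)
{
    size_t r = global % local;
    return r == 0 ? global : global + (local - r);
}

/* Count the simulated work-items that pass the kernel's bounds guard
   (i < Ndim && j < Mdim) when the launch grid is padded. */
static size_t count_active(size_t Ndim, size_t Mdim, size_t wg)
{
    size_t gx = round_up(Ndim, wg);
    size_t gy = round_up(Mdim, wg);
    size_t i, j, active = 0;
    for (i = 0; i < gx; i++)
        for (j = 0; j < gy; j++)
            if (i < Ndim && j < Mdim)  /* same test as the kernel */
                active++;
    return active;
}
```

Exactly Ndim×Mdim work-items do useful work; the padded remainder returns immediately.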
The host code for the OpenCL matrix multiplication program is provided in Listing 21.3. The first available platform and a device are selected. While the host code in the listing selects a CPU device, the same host program can be used with a GPU by replacing the constant CL_DEVICE_TYPE_CPU in the call to clGetDeviceIDs() with CL_DEVICE_TYPE_GPU. A command-queue is established with profiling enabled through the CL_QUEUE_PROFILING_ENABLE property, followed by the definition of the matrices on the host. In this program, two basic matrix manipulation functions are used to initialize the matrices and later test the results. Because these functions are simple and not illustrative of the concepts discussed in this chapter, they are not included in the listing.
The memory objects are set up by calls to clCreateBuffer(). The program is then constructed from a string literal (the one in Listing 21.2). Throughout this listing, the error testing has been deleted in order to save space. An exception is made, however, for the case of building a program, because it is important to understand how to display the output from the OpenCL compiler using a call to clGetProgramBuildInfo().
The kernel and the kernel arguments are then set up, after which the buffers associated with the matrices are submitted to the command-queue. The command-queue is in-order, so it is safe to assume that the memory objects are present in global memory when the kernel is enqueued. The global dimensions for a two-dimensional NDRange are set, while in this case the local dimensions are not set (i.e., they are passed as NULL). This was done so that the OpenCL implementation could choose the local dimensions, thereby adjusting to the needs of the CPU or the GPU compute device without programmer intervention.
The host waits for the kernel to complete, at which point timing data is extracted from the enqueued kernel's event with calls to clGetEventProfilingInfo(), once with CL_PROFILING_COMMAND_START and then again with CL_PROFILING_COMMAND_END. The memory object associated with the C matrix is then copied onto the host and the results tested. The host program finishes by cleaning up the environment and releasing the various OpenCL objects.
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <stdio.h>
#include <stdlib.h>
#define ORDER 1000
int main(int argc, char **argv)
{
float *A; // A matrix
float *B; // B matrix
float *C; // C matrix (C = A*B)
int Mdim, Ndim, Pdim; // A[N][P], B[P][M], C[N][M]
int err; // error code from OpenCL
int szA, szB, szC; // number of matrix elements
size_t global[2]; // global domain size
size_t local[2]; // local domain size
cl_device_id device_id; // compute device id
cl_context context; // compute context
cl_command_queue commands; // compute command queue
cl_program program; // compute program
cl_kernel kernel; // compute kernel
cl_uint nd; // Number of dims in NDRange
cl_mem a_in; // Memory object for A matrix
cl_mem b_in; // Memory object for B matrix
cl_mem c_out; // Memory Object for C matrix
int i;
Ndim = ORDER; Pdim = ORDER; Mdim = ORDER;
//-------------------------------------------------------------------
// Set up the OpenCL platform using whichever platform is "first"
//-------------------------------------------------------------------
cl_uint numPlatforms;
cl_platform_id firstPlatformId;
err = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);
err = clGetDeviceIDs(firstPlatformId, CL_DEVICE_TYPE_CPU, 1,
&device_id, NULL);
cl_context_properties properties [] =
{
CL_CONTEXT_PLATFORM, (cl_context_properties)firstPlatformId,0
};
context = clCreateContext(properties, 1, &device_id, NULL, NULL,
&err);
commands = clCreateCommandQueue(context, device_id,
CL_QUEUE_PROFILING_ENABLE, &err);
// Set up matrices
szA = Ndim*Pdim; szB = Pdim*Mdim; szC = Ndim*Mdim;
A = (float *)malloc(szA*sizeof(float));
B = (float *)malloc(szB*sizeof(float));
C = (float *)malloc(szC*sizeof(float));
initmat(Mdim, Ndim, Pdim, A, B, C); // function to set matrices
// to known values.
//-------------------------------------------------------------------
// Set up the buffers, initialize matrices, and write them
// into global memory
//-------------------------------------------------------------------
a_in = clCreateBuffer(context, CL_MEM_READ_ONLY,
sizeof(float) * szA, NULL, NULL);
b_in = clCreateBuffer(context, CL_MEM_READ_ONLY,
sizeof(float) * szB, NULL, NULL);
c_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
sizeof(float) * szC, NULL, NULL);
// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1,
(const char **) &C_elem_KernelSource, NULL, &err);
// Build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
if (err != CL_SUCCESS)
{
size_t len;
char buffer[2048];
printf("Error: Failed to build program executable!\n");
clGetProgramBuildInfo(program, device_id,
CL_PROGRAM_BUILD_LOG,
sizeof(buffer), buffer, &len);
printf("%s\n", buffer);
return EXIT_FAILURE;
}
// Create the compute kernel from the program
kernel = clCreateKernel(program, "mmul", &err);
// Set the arguments to our compute kernel
err = 0;
err = clSetKernelArg(kernel, 0, sizeof(int), &Mdim);
err |= clSetKernelArg(kernel, 1, sizeof(int), &Ndim);
err |= clSetKernelArg(kernel, 2, sizeof(int), &Pdim);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &a_in);
err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &b_in);
err |= clSetKernelArg(kernel, 5, sizeof(cl_mem), &c_out);
// Write the A and B matrices into compute device memory
err = clEnqueueWriteBuffer(commands, a_in, CL_TRUE, 0,
sizeof(float) * szA, A, 0, NULL, NULL);
err = clEnqueueWriteBuffer(commands, b_in, CL_TRUE, 0,
sizeof(float) * szB, B, 0, NULL, NULL);
cl_event prof_event;
// Execute the kernel over the entire range of C matrix elements
global[0] = (size_t)Ndim; global[1] = (size_t)Mdim; nd = 2;
err = clEnqueueNDRangeKernel(commands, kernel, nd, NULL,
global, NULL, 0, NULL, &prof_event);
// Wait for the commands to complete before reading back results
clFinish(commands);
cl_ulong ev_start_time=(cl_ulong)0;
cl_ulong ev_end_time=(cl_ulong)0;
size_t ret_size;
err = clGetEventProfilingInfo(prof_event,
CL_PROFILING_COMMAND_START,
sizeof(cl_ulong),
&ev_start_time,
NULL);
err = clGetEventProfilingInfo(prof_event,
CL_PROFILING_COMMAND_END,
sizeof(cl_ulong),
&ev_end_time,
NULL);
// Read back the results from the compute device
err = clEnqueueReadBuffer( commands, c_out, CL_TRUE, 0,
sizeof(float) * szC, C, 0, NULL, NULL );
cl_ulong run_time = ev_end_time - ev_start_time;
results(Mdim, Ndim, Pdim, C, run_time);
clReleaseProgram(program);
clReleaseKernel(kernel);
clReleaseMemObject(a_in);
clReleaseMemObject(b_in);
clReleaseMemObject(c_out);
clReleaseCommandQueue(commands);
clReleaseContext(context);
}
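The results() function presumably converts run_time into the MFLOPS figures quoted throughout this chapter. A minimal sketch of that conversion follows; the helper name and the `cl_ulong_t` stand-in type are ours, not from the listing (OpenCL event-profiling timestamps are reported in nanoseconds):

```c
#include <assert.h>

/* Stand-in for cl_ulong so this sketch compiles without CL headers. */
typedef unsigned long long cl_ulong_t;

/* A square matrix multiply of order n performs 2*n^3 floating-point
   operations (one multiply and one add per inner-loop iteration). */
static double mflops(cl_ulong_t start_ns, cl_ulong_t end_ns, int n)
{
    double seconds = (double)(end_ns - start_ns) * 1.0e-9;
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / (seconds * 1.0e6);
}
```

For order-1000 matrices, 2 billion floating-point operations are performed per multiply, so a one-second kernel corresponds to 2000 MFLOPS.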
In parallel programming, it is important to manage a computation to minimize parallel overhead. With order-1000 matrices, one work-item per matrix element results in a million work-items. This seems a bit excessive, so in the next version of the program, each work-item will compute a row of the matrix. The modifications required to support this change are illustrated in Figure 21.2. An entire row of the C matrix is computed using one row from the A matrix repeatedly while sweeping through the columns of the B matrix. The NDRange is changed from a 2D range set to match the dimensions of the C matrix to a 1D range set to the number of rows in the C matrix. At this point we switch our focus to the GPU, which in this case (as shown by a call to clGetDeviceInfo() with CL_DEVICE_MAX_COMPUTE_UNITS) has four compute units. Hence we set the work-group size to 250 and create four work-groups to cover the full size of the problem.
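The work-group sizing can be sketched as a small helper. This is an illustrative heuristic only (`pick_wg_size` is not from the chapter): take the largest work-group size that divides the row count evenly while being small enough to give at least one work-group per compute unit:

```c
#include <assert.h>

/* Illustrative heuristic: the largest wg <= nrows/compute_units that
   divides nrows evenly. For nrows = 1000 and 4 compute units this
   picks 250, matching the text. */
static int pick_wg_size(int nrows, int compute_units)
{
    int wg;
    for (wg = nrows / compute_units; wg >= 1; wg--)
        if (nrows % wg == 0)
            return wg;
    return 1;
}
```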
The kernel and changes to the host program required for this case are shown in Listing 21.4. The kernel is similar to the previous version except the code now includes a loop over the j index so the kernel marches across a row of C for each work-item. Note that as before the computation works with the memory objects directly from global memory. The resulting performance on the GPU is 258 MFLOPS, which is a considerable drop from our value of 511 MFLOPS. In a real project (as opposed to this simple example) different work-group sizes would be tried to improve performance and better align with the properties of the GPU. In this case, however, we will settle for 258 MFLOPS and proceed with further optimizations.
const char *C_row_KernelSource =
"__kernel void mmul(                              \n"
"    const int Mdim,                              \n"
"    const int Ndim,                              \n"
"    const int Pdim,                              \n"
"    __global float* A,                           \n"
"    __global float* B,                           \n"
"    __global float* C)                           \n"
"{                                                \n"
"    int k, j;                                    \n"
"    int i = get_global_id(0);                    \n"
"    float tmp;                                   \n"
"    if (i < Ndim) {                              \n"
"        for (j = 0; j < Mdim; j++) {             \n"
"            tmp = 0.0f;                          \n"
"            for (k = 0; k < Pdim; k++)           \n"
"                tmp += A[i*Pdim+k] * B[k*Mdim+j];\n"
"            C[i*Mdim+j] = tmp;                   \n"
"        }                                        \n"
"    }                                            \n"
"}                                                \n";
//------------------------------------------------------------------
// Host code changes . . . change to a 1D NDRange and set local
// work-group size
//------------------------------------------------------------------
global[0] = (size_t) Ndim;
local[0] = (size_t) 250;
nd = 1;
err = clEnqueueNDRangeKernel(commands, kernel, nd, NULL, global,
local, 0, NULL, NULL);
The core of matrix multiplication is a multiply-accumulate computation. Most processors have sufficient bandwidth into the ALU to keep this computation running near peak performance, but only if the data movement costs can be hidden. Hence, the essence of optimizing matrix multiplication is to minimize data movement. Our matrix multiplication kernels up to this point have left all three matrices in global memory. This means the computation streams rows and columns through the memory hierarchy (global to private) repeatedly for each dot product.
We can reduce this memory traffic by recognizing that each work-item reuses the same row of A for every element of the row of C that it updates. This is shown in Figure 21.3.
This version of the matrix multiplication program uses the same host code as the prior case. Even the kernel code is only slightly changed. In Listing 21.5, a float array of size 1000 is defined. Because this occurs inside the kernel, the memory is allocated in the private memory of the processing element that will run the kernel. Then, prior to computing the dot products for the elements of the C matrix, the required row of the A matrix is copied from global memory into private memory. The use of private memory for the row of A had a dramatic impact on performance, from our prior result of 258 MFLOPS to 873 MFLOPS.
const char *C_row_priv_KernelSource =
"__kernel void mmul(                              \n"
"    const int Mdim,                              \n"
"    const int Ndim,                              \n"
"    const int Pdim,                              \n"
"    __global float* A,                           \n"
"    __global float* B,                           \n"
"    __global float* C)                           \n"
"{                                                \n"
"    int k, j;                                    \n"
"    int i = get_global_id(0);                    \n"
"    float Awrk[1000];                            \n"
"    float tmp;                                   \n"
"    if (i < Ndim) {                              \n"
"        for (k = 0; k < Pdim; k++)               \n"
"            Awrk[k] = A[i*Pdim+k];               \n"
"                                                 \n"
"        for (j = 0; j < Mdim; j++) {             \n"
"            tmp = 0.0f;                          \n"
"            for (k = 0; k < Pdim; k++)           \n"
"                tmp += Awrk[k] * B[k*Mdim+j];    \n"
"            C[i*Mdim+j] = tmp;                   \n"
"        }                                        \n"
"    }                                            \n"
"}                                                \n";
A careful consideration of the dot products in the matrix multiplication shows that while each work-item reuses its own unique row of A, all the work-items in a group repeatedly stream the same columns of B through the compute device in the course of updating a row of C. This is shown in Figure 21.4. We can reduce the overhead of moving data from global memory if the work-items comprising a work-group copy the columns of the matrix B into local memory before they start updating their rows of C.
To support this optimization, changes were needed in both the kernel and the host program. These changes are shown in Listing 21.6. The key change was to set up a local memory region for each work-group. Because dynamic memory allocation within a kernel is not supported in OpenCL, a local buffer of the appropriate size is declared as a kernel argument and its size is specified on the host by an extra call to clSetKernelArg(). Inside the kernel, the work-items copy a column of B into local memory. This is done in parallel using a cyclic distribution of the loop iterations. Before any work-item can proceed, all the copying must complete; hence, the loop that carries out the copies is followed by a barrier(CLK_LOCAL_MEM_FENCE), with the memory fence selected to ensure that each work-item in the group sees a consistent state for the local memory. Following the barrier, the kernel code is essentially the same as in the previous case. This modification had a dramatic impact on performance, taking the matrix multiplication from 873 MFLOPS to 2472 MFLOPS.
const char *C_row_priv_bloc_KernelSource =
"__kernel void mmul(                                  \n"
"    const int Mdim,                                  \n"
"    const int Ndim,                                  \n"
"    const int Pdim,                                  \n"
"    __global float* A,                               \n"
"    __global float* B,                               \n"
"    __global float* C,                               \n"
"    __local float* Bwrk)                             \n"
"{                                                    \n"
"    int k, j;                                        \n"
"    int i = get_global_id(0);                        \n"
"    int iloc = get_local_id(0);                      \n"
"    int nloc = get_local_size(0);                    \n"
"    float Awrk[1000];                                \n"
"    float tmp;                                       \n"
"    if (i < Ndim) {                                  \n"
"        for (k = 0; k < Pdim; k++)                   \n"
"            Awrk[k] = A[i*Pdim+k];                   \n"
"                                                     \n"
"        for (j = 0; j < Mdim; j++) {                 \n"
"            for (k = iloc; k < Pdim; k += nloc)      \n"
"                Bwrk[k] = B[k*Mdim+j];               \n"
"            barrier(CLK_LOCAL_MEM_FENCE);            \n"
"            tmp = 0.0f;                              \n"
"            for (k = 0; k < Pdim; k++)               \n"
"                tmp += Awrk[k] * Bwrk[k];            \n"
"            C[i*Mdim+j] = tmp;                       \n"
"            barrier(CLK_LOCAL_MEM_FENCE);            \n"
"        }                                            \n"
"    }                                                \n"
"}                                                    \n";
//------------------------------------------------------------------
// Host code modifications . . . added one more argument (index 6) to
// the kernel's argument list. Note that no OpenCL memory object is
// provided, since this argument names a buffer in local memory.
//------------------------------------------------------------------
err = 0;
err = clSetKernelArg(kernel, 0, sizeof(int), &Mdim);
err |= clSetKernelArg(kernel, 1, sizeof(int), &Ndim);
err |= clSetKernelArg(kernel, 2, sizeof(int), &Pdim);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &a_in);
err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &b_in);
err |= clSetKernelArg(kernel, 5, sizeof(cl_mem), &c_out);
err |= clSetKernelArg(kernel, 6, sizeof(float)*Pdim, NULL);
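The cyclic distribution of the copy loop is worth verifying: with nloc work-items each striding by nloc, the group as a whole must touch every element of the column exactly once. The plain-C simulation below checks this; the helper `cyclic_copy_covers` is ours, for illustration only:

```c
#include <assert.h>
#include <string.h>

/* Simulate the kernel's cyclic copy loop: work-item iloc of a group
   of size nloc copies elements iloc, iloc+nloc, iloc+2*nloc, ...
   Returns 1 if the group as a whole copies each of the pdim column
   elements exactly once, 0 otherwise. */
static int cyclic_copy_covers(int pdim, int nloc)
{
    int hits[2048];
    int iloc, k;
    if (pdim > 2048)
        return 0;
    memset(hits, 0, sizeof(int) * (size_t)pdim);
    for (iloc = 0; iloc < nloc; iloc++)
        for (k = iloc; k < pdim; k += nloc)  /* same loop as the kernel */
            hits[k]++;
    for (k = 0; k < pdim; k++)
        if (hits[k] != 1)
            return 0;
    return 1;
}
```

Note that the distribution covers the column for any group size, even when nloc does not divide Pdim evenly.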
Performance results from this example are summarized in Table 21.1. Directly translating the program into OpenCL for the CPU resulted in a significant performance improvement. Most of this chapter, however, focused on the GPU. In particular, these results show the impact of optimizing how data is mapped onto global, private, and local memory.
There is an impressive 15-fold speedup for the optimized GPU code versus the CPU code. This speedup value, however, is misleading. Considerable effort was expended to optimize the code on the GPU, but no effort whatsoever was expended to optimize the sequential code on the CPU. This situation, inflating speedups by comparing to an unoptimized sequential code, happens all too often and has led to unrealistic expectations of the benefits of moving from a CPU to a GPU.
Consider the functions in Listing 21.7. These simple optimizations permute the order of the three nested loops. This changes the memory access patterns, and hence the reuse of data from cache, as the contents of the three matrices are streamed through the CPU. Following common practice in the linear algebra literature, we call the dot-product algorithm ijk and the two permutations ikj and kij. We also changed the data type from float to double and compiled with GCC's -O3 optimization flag. Performance varied considerably: 272, 1130, and 481 MFLOPS for the ijk, ikj, and kij orderings, respectively. Note that these are the most trivial of optimizations, and many additional optimizations are well known (cache blocking, TLB blocking, SSE). Still, with these optimizations, our more honest sequential CPU reference point becomes 1130 MFLOPS, which reduces the maximum OpenCL GPU speedup to a more realistic value of 2.2.
void mat_mul_ijk(int Mdim, int Ndim, int Pdim, double *A,
                 double *B, double *C)
{
    int i, j, k;
    for (i = 0; i < Ndim; i++) {
        for (j = 0; j < Mdim; j++) {
            for (k = 0; k < Pdim; k++) {
                /* C[i][j] += A[i][k] * B[k][j] */
                C[i*Mdim + j] += A[i*Pdim + k] * B[k*Mdim + j];
            }
        }
    }
}
void mat_mul_ikj(int Mdim, int Ndim, int Pdim, double *A,
                 double *B, double *C)
{
    int i, j, k;
    for (i = 0; i < Ndim; i++) {
        for (k = 0; k < Pdim; k++) {
            for (j = 0; j < Mdim; j++) {
                /* C[i][j] += A[i][k] * B[k][j] */
                C[i*Mdim + j] += A[i*Pdim + k] * B[k*Mdim + j];
            }
        }
    }
}
void mat_mul_kij(int Mdim, int Ndim, int Pdim, double *A,
                 double *B, double *C)
{
    int i, j, k;
    for (k = 0; k < Pdim; k++) {
        for (i = 0; i < Ndim; i++) {
            for (j = 0; j < Mdim; j++) {
                /* C[i][j] += A[i][k] * B[k][j] */
                C[i*Mdim + j] += A[i*Pdim + k] * B[k*Mdim + j];
            }
        }
    }
}
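Since all three permutations accumulate exactly the same terms into C(i, j), only in a different order, they must produce identical results for identical inputs; only the memory access pattern, and hence the performance, differs. A small self-check (an illustrative sketch specialized to square matrices, not part of the chapter's benchmark code):

```c
#include <assert.h>
#include <string.h>

/* Two of the loop orderings from Listing 21.7, specialized to square
   matrices of order n for brevity. */
static void mul_ijk(int n, const double *A, const double *B, double *C)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

static void mul_ikj(int n, const double *A, const double *B, double *C)
{
    int i, k, j;
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
            for (j = 0; j < n; j++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

/* Returns 1 if both orderings agree on a small 2x2 example. */
static int orders_agree(void)
{
    const double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8};
    double C1[4] = {0}, C2[4] = {0};
    mul_ijk(2, A, B, C1);
    mul_ikj(2, A, B, C2);
    return memcmp(C1, C2, sizeof(C1)) == 0;
}
```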