In this section, we discuss the implementation of some example OpenCL applications. The examples covered here include image rotation, matrix multiplication, and image convolution.
Simple Matrix Multiplication Example
A simple serial C implementation of matrix multiplication is shown here (remember that OpenCL host programs can be written either in C or with the OpenCL C++ Wrapper API). The code uses three nested for loops, multiplying Matrix A by Matrix B and storing the result in Matrix C. The two outer loops iterate over each element of the output matrix. The innermost loop iterates over the individual elements of the input matrices to calculate the result for each output location.
// Iterate over the rows of Matrix A
for(int i = 0; i < heightA; i++) {
   // Iterate over the columns of Matrix B
   for(int j = 0; j < widthB; j++) {
      C[i][j] = 0;
      // Multiply and accumulate the values in the current row
      // and column
      for(int k = 0; k < widthA; k++) {
         C[i][j] += A[i][k] * B[k][j];
      }
   }
}
It is straightforward to map the serial implementation to OpenCL, as the iterations of the two outer for-loops are independent of each other. This means that a separate work-item can be created for each output element of the matrix. The two outer for-loops are mapped to the two-dimensional range of work-items for the kernel.
The independence of output values inherent in matrix multiplication is shown in
Figure 4.1. Each work-item reads in its own row of Matrix A and its column of Matrix B. The data being read is multiplied and written at the appropriate location of the output Matrix C.
// widthA = heightB for valid matrix multiplication
__kernel void simpleMultiply(
   __global float* outputC,
   int widthA, int heightA,
   int widthB, int heightB,
   __global float* inputA,
   __global float* inputB) {
   //Get global position in Y direction
   int row = get_global_id(1);
   //Get global position in X direction
   int col = get_global_id(0);
   float sum = 0.0f;
   //Calculate result of one element of Matrix C
   for (int i = 0; i < widthA; i++) {
      sum += inputA[row*widthA+i] * inputB[i*widthB+col];
   }
   outputC[row*widthB+col] = sum;
}
Now that we understand the implementation of the data-parallel kernel, we need to write the OpenCL API calls that move the data to the device and run the kernel. The implementation steps for the rest of the matrix multiplication application are summarized in
Figure 4.2. We need to create a context for the device we wish to use. Using the context, we create the command queue, which is used to send commands to the device. Once the command queue is created, we can send the input data to the device, run the parallel kernel, and read the resultant output data back from the device.
Step 1: Set Up Environment
In this step, we declare a context, choose a device type, and create the context and a command queue. Throughout this example, the ciErrNum variable should always be checked to see if an error code is returned by the implementation.
cl_int ciErrNum;
// Use the first platform
cl_platform_id platform;
ciErrNum = clGetPlatformIDs(1, &platform, NULL);
// Use the first GPU device of the platform
cl_device_id device;
ciErrNum = clGetDeviceIDs(
   platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
// Create the context
cl_context_properties cps[3] = {
   CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0};
cl_context ctx = clCreateContext(
   cps, 1, &device, NULL, NULL, &ciErrNum);
// Create the command queue
cl_command_queue myqueue = clCreateCommandQueue(
   ctx, device, 0, &ciErrNum);
Step 2: Declare Buffers and Move Data
Declare buffers on the device and enqueue copies of the input matrices to the device. Also declare the output buffer.
// We assume that A, B, and C are float arrays that have been
// declared and initialized on the host
// Allocate space for Matrix A on the device
cl_mem bufferA = clCreateBuffer(
   ctx, CL_MEM_READ_ONLY, wA*hA*sizeof(float), NULL, &ciErrNum);
// Copy Matrix A to the device
ciErrNum = clEnqueueWriteBuffer(
   myqueue, bufferA, CL_TRUE, 0, wA*hA*sizeof(float),
   (void *)A, 0, NULL, NULL);
// Allocate space for Matrix B on the device
cl_mem bufferB = clCreateBuffer(
   ctx, CL_MEM_READ_ONLY, wB*hB*sizeof(float), NULL, &ciErrNum);
// Copy Matrix B to the device
ciErrNum = clEnqueueWriteBuffer(
   myqueue, bufferB, CL_TRUE, 0, wB*hB*sizeof(float),
   (void *)B, 0, NULL, NULL);
// Allocate space for Matrix C on the device
cl_mem bufferC = clCreateBuffer(
   ctx, CL_MEM_WRITE_ONLY, wB*hA*sizeof(float), NULL, &ciErrNum);
Step 3: Runtime Kernel Compilation
Compile the program from the kernel array, build the program, and define the kernel.
// We assume that the program source is stored in the variable
// 'source' and is NULL terminated
cl_program myprog = clCreateProgramWithSource(
   ctx, 1, (const char**)&source, NULL, &ciErrNum);
// Compile the program. Passing NULL for the 'device_list'
// argument targets all devices in the context
ciErrNum = clBuildProgram(myprog, 0, NULL, NULL, NULL, NULL);
// Create the kernel
cl_kernel mykernel = clCreateKernel(
   myprog, "simpleMultiply", &ciErrNum);
Step 4: Run the Program
Set the kernel arguments and the workgroup size. We can then enqueue the kernel onto the command queue to execute on the device.
// Set the kernel arguments
clSetKernelArg(mykernel, 0, sizeof(cl_mem), (void *)&bufferC);
clSetKernelArg(mykernel, 1, sizeof(cl_int), (void *)&wA);
clSetKernelArg(mykernel, 2, sizeof(cl_int), (void *)&hA);
clSetKernelArg(mykernel, 3, sizeof(cl_int), (void *)&wB);
clSetKernelArg(mykernel, 4, sizeof(cl_int), (void *)&hB);
clSetKernelArg(mykernel, 5, sizeof(cl_mem), (void *)&bufferA);
clSetKernelArg(mykernel, 6, sizeof(cl_mem), (void *)&bufferB);
// Set local and global workgroup sizes
// We assume the matrix dimensions are divisible by 16
size_t localws[2] = {16, 16};
size_t globalws[2] = {wB, hA}; // Dimensions of Matrix C
// Execute the kernel
ciErrNum = clEnqueueNDRangeKernel(
   myqueue, mykernel, 2, NULL, globalws, localws, 0, NULL, NULL);
Step 5: Read the Result Back to the Host
After the program has run, we enqueue a read back of the result matrix from the device buffer to host memory.
// Read the output data back to the host
ciErrNum = clEnqueueReadBuffer(
   myqueue, bufferC, CL_TRUE, 0, wB*hA*sizeof(float),
   (void *)C, 0, NULL, NULL);
The steps outlined here show an OpenCL implementation of matrix multiplication that can be used as a baseline. In later chapters, we use our understanding of data-parallel architectures to improve the performance of particular data-parallel algorithms.
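Before optimizing a baseline like this, it is worth validating the device output against the serial loop. The helper below is a sketch of such a check; the name verify_result and the tolerance are our own choices, not part of the original listing.

```c
#include <math.h>

// Compare a device-computed matrix C against a serial reference.
// A is heightA x widthA, B is widthA x widthB, C is heightA x widthB,
// all stored in row-major order as flat arrays.
int verify_result(const float *A, const float *B, const float *C,
                  int heightA, int widthA, int widthB)
{
    for (int i = 0; i < heightA; i++) {
        for (int j = 0; j < widthB; j++) {
            float ref = 0.0f;
            for (int k = 0; k < widthA; k++)
                ref += A[i*widthA + k] * B[k*widthB + j];
            // Allow for floating-point rounding differences
            if (fabsf(C[i*widthB + j] - ref) > 1e-4f)
                return 0; // mismatch found
        }
    }
    return 1; // all elements match
}
```

A check like this is most useful during development; for large matrices, the tolerance may need to grow with widthA because rounding error accumulates across the inner loop.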
Image Rotation Example
Image rotation is a common image processing routine with applications in matching, alignment, and other image-based algorithms. The inputs to an image rotation routine are an image, the rotation angle θ, and the point about which the rotation is performed. The aim is to achieve the result shown in Figure 4.3. For the image rotation example, we use OpenCL's C++ Wrapper API.
The coordinates of a point (x1, y1), when rotated by an angle θ around (x0, y0), become (x2, y2), as shown by the following equations:

x2 = cos(θ)·(x1 − x0) + sin(θ)·(y1 − y0) + x0
y2 = −sin(θ)·(x1 − x0) + cos(θ)·(y1 − y0) + y0

By rotating the image about the origin (0, 0), these reduce to

x2 = x1·cos(θ) + y1·sin(θ)
y2 = −x1·sin(θ) + y1·cos(θ)
To implement image rotation with OpenCL, note that the new (x, y) coordinates of each pixel in the input can be calculated independently. Each work-item will calculate the new position of a single pixel. In a manner similar to matrix multiplication, a work-item can obtain the location of its respective pixel using its global ID (as shown in Figure 4.4).
The image rotation example is a good example of input decomposition, meaning that each element of the input (in this case, each pixel of the input image) is mapped to a work-item. When an image is rotated, the new locations of some pixels may fall outside the image if the input and output images are the same size (see Figure 4.3, in which the corners of the input do not fit within the resultant image). For this reason, we need to check the bounds of the calculated output coordinates.
__kernel void img_rotate(
__global float* dest_data, __global float* src_data,
int W, int H,//Image Dimensions
float sinTheta, float cosTheta ) //Rotation Parameters
{
//Work-item gets its index within index space
const int ix = get_global_id(0);
const int iy = get_global_id(1);
   //Calculate the destination location (xpos,ypos) for the
   //input pixel (ix,iy) -- the input decomposition mentioned earlier
   float xpos = ((float)ix)*cosTheta + ((float)iy)*sinTheta;
   float ypos = -1.0f*((float)ix)*sinTheta + ((float)iy)*cosTheta;
   //Bound check the output coordinates
   if(((int)xpos >= 0) && ((int)xpos < W) &&
      ((int)ypos >= 0) && ((int)ypos < H))
   {
      // Read (ix,iy) from src_data and store it at (xpos,ypos)
      // in dest_data. In this case, because we are rotating
      // about the origin and there is no translation, we know
      // that (xpos,ypos) will be unique for each input (ix,iy)
      // and so each work-item can write its result independently
      dest_data[(int)ypos*W+(int)xpos] = src_data[iy*W+ix];
   }
}
As seen in the previous kernel code, image rotation is an
embarrassingly parallel problem, in which each resulting pixel value is computed independently. The main steps for the host code are similar to those in
Figure 4.2. For this example's host code, we can reuse a substantial amount of code from the previous matrix multiplication example, including the code that will create the context and the command queue.
To give the developer wider exposure to OpenCL, we write the host code for the image rotation example with the C++ bindings for OpenCL 1.1. The C++ bindings provide access to the low-level features of the original OpenCL C API; they are compatible with standard C++ compilers, are carefully designed to perform no extra memory allocation, and offer full access to the features of OpenCL without unnecessarily masking functionality. The bindings are obtained by including the header cl.hpp. The steps are shown in a manner similar to the matrix multiplication example in order to illustrate the close correspondence between the C API and the more concise C++ bindings.
Step 1: Set Up Environment
cl::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
// Use the first platform
cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM,
   (cl_context_properties)(platforms[0])(), 0};
// Create a context using this platform for any device type
cl::Context context(CL_DEVICE_TYPE_ALL, cps);
// Get the device list from the context
cl::vector<cl::Device> devices =
   context.getInfo<CL_CONTEXT_DEVICES>();
// Create a command queue on the first device
cl::CommandQueue queue = cl::CommandQueue(context, devices[0], 0);
Step 2: Declare Buffers and Move Data
// Create buffers for the input and output data ("W" and "H"
// are the width and height of the image, respectively)
cl::Buffer d_ip = cl::Buffer(context, CL_MEM_READ_ONLY,
   W*H*sizeof(float));
cl::Buffer d_op = cl::Buffer(context, CL_MEM_WRITE_ONLY,
   W*H*sizeof(float));
// Copy the input data to the device (assume that the input
// image is the array "ip")
queue.enqueueWriteBuffer(d_ip, CL_TRUE, 0, W*H*sizeof(float), ip);
Step 3: Runtime Kernel Compilation
// Read in the program source
std::ifstream sourceFileName("img_rotate_kernel.cl");
std::string sourceFile(
std::istreambuf_iterator<char>(sourceFileName),
(std::istreambuf_iterator<char>()));
cl::Program::Sources rotn_source(1,
   std::make_pair(sourceFile.c_str(), sourceFile.length()+1));
cl::Program rotn_program(context, rotn_source);
rotn_program.build(devices);
cl::Kernel rotn_kernel(rotn_program, "img_rotate");
Step 4: Run the Program
// The angle of rotation is theta
float cos_theta = cos(theta);
float sin_theta = sin(theta);
// Set the kernel arguments, matching the order of the
// parameters in the kernel signature
rotn_kernel.setArg(0, d_op);
rotn_kernel.setArg(1, d_ip);
rotn_kernel.setArg(2, W);
rotn_kernel.setArg(3, H);
rotn_kernel.setArg(4, sin_theta);
rotn_kernel.setArg(5, cos_theta);
// Set the size of the NDRange and workgroups
cl::NDRange globalws(W,H);
cl::NDRange localws(16,16);
// Run the kernel
queue.enqueueNDRangeKernel(rotn_kernel, cl::NullRange,
   globalws, localws);
Step 5: Read Result Back to Host
// Read the output buffer back to the host
queue.enqueueReadBuffer(d_op, CL_TRUE, 0, W*H*sizeof(float), op);
As seen from the previous code, the C++ bindings maintain a close correspondence to the C API.
Image Convolution Example
In image processing, convolution is a commonly used algorithm that modifies the value of each pixel in an image by using information from neighboring pixels. A convolution kernel, or filter, describes how each pixel will be influenced by its neighbors. For example, a blurring kernel will take the weighted average of neighboring pixels so that large differences between pixel values are reduced. By using the same source image and changing only the filter, effects such as sharpening, blurring, edge enhancing, and embossing can be produced.
A convolution kernel works by iterating over each pixel in the source image. For each source pixel, the filter is centered over the pixel and the values of the filter
multiply the pixel values that they overlay. A sum of the products is then taken to produce a new pixel value.
Figure 4.5 provides a visual for this algorithm.
Figure 4.6B shows the effect of a blurring filter and
Figure 4.6C shows the effect of an edge-detection filter on the same source image seen in
Figure 4.6A.
The following code performs a convolution in C. The outer two loops iterate over the source image, selecting the next source pixel. At each source pixel, the filter is applied to the neighboring pixels.
// Iterate over the rows of the source image
for(int i = halfFilterWidth; i < rows - halfFilterWidth; i++) {
   // Iterate over the columns of the source image
   for(int j = halfFilterWidth; j < cols - halfFilterWidth; j++) {
      sum = 0; // Reset sum for new source pixel
      // Apply the filter to the neighborhood
      for(int k = -halfFilterWidth; k <= halfFilterWidth; k++) {
         for(int l = -halfFilterWidth; l <= halfFilterWidth; l++) {
            sum += Image[i+k][j+l] *
                   Filter[k+halfFilterWidth][l+halfFilterWidth];
         }
      }
      // Store the new pixel value
      outputImage[i][j] = sum;
   }
}
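As a concrete check of the loops above, applying a 3x3 averaging (blur) filter at the center of a 3x3 image should simply average all nine pixels. The helper below re-implements the two inner loops for a single output pixel over flat row-major arrays; the function name and layout are our own, chosen for illustration.

```c
// Apply a square filter centered at (i, j) of an image with 'cols'
// columns, both stored as flat row-major float arrays. filterWidth
// must be odd, and (i, j) must be at least filterWidth/2 away from
// every edge, since this sketch does no bounds handling.
float convolve_pixel(const float *image, int cols,
                     const float *filter, int filterWidth,
                     int i, int j)
{
    int halfFilterWidth = filterWidth / 2;
    float sum = 0.0f;
    for (int k = -halfFilterWidth; k <= halfFilterWidth; k++)
        for (int l = -halfFilterWidth; l <= halfFilterWidth; l++)
            sum += image[(i + k) * cols + (j + l)] *
                   filter[(k + halfFilterWidth) * filterWidth +
                          (l + halfFilterWidth)];
    return sum;
}
```

With the image {1, 2, ..., 9} and a filter of nine 1/9 weights, the value at the center pixel (1, 1) is (1 + 2 + ... + 9) / 9 = 5.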
Step 1: Create Image and Buffer Objects
This example implements convolution using OpenCL images for the data type of the source and output images. Using images to represent the data has a number of advantages. For the convolution, work-items representing border pixels may read
out-of-bounds. Images supply a mechanism to automatically handle these accesses and return meaningful data.
The code begins by assuming that a context (context) and command queue (queue) have already been created, and that the source image (sourceImage), output image (outputImage), and filter (filter) have already been initialized on the host. The images both have dimensions width by height.
The first task is to allocate space for the source and output images and the filter on the device. Images require a format descriptor,
cl_image_format, to define the size and type of data that they store and the channel layout that they use to store it. The
image_channel_order field of the descriptor is where the channel layout is specified. Recall from Chapter 2 that every element of an image stores data in up to four channels, referred to as R, G, B, and A. An image that should hold four values in every element should use CL_RGBA for the channel order. However, if each work-item will only access a single value (e.g., a pixel from a grayscale image or an element of a matrix), the data can be stored using a single channel by specifying CL_R. This example assumes grayscale data and so uses only a single channel. The type of the data is specified in the
image_channel_data_type field of the descriptor. Integers are specified by a combination of signedness and size. For example,
CL_SIGNED_INT32 is a 32-bit signed integer, and
CL_UNSIGNED_INT8 is the equivalent of an unsigned character in C. Floating point data is specified by
CL_FLOAT, and this is the type of data used in the example.
After creating the image format descriptor, memory objects are created to represent the images using clCreateImage2D(). A buffer is created for the filter and will eventually be used as constant memory.
// The convolution filter is 7x7
int filterWidth = 7;
int filterSize = filterWidth*filterWidth; // Assume a square kernel
// The image format describes how the data will be stored in memory
cl_image_format format;
format.image_channel_order = CL_R;         // single channel
format.image_channel_data_type = CL_FLOAT; // float data type
// Create space for the source image on the device
cl_mem bufferSourceImage = clCreateImage2D(
   context, 0, &format, width, height, 0, NULL, NULL);
// Create space for the output image on the device
cl_mem bufferOutputImage = clCreateImage2D(
   context, 0, &format, width, height, 0, NULL, NULL);
// Create space for the 7x7 filter on the device
cl_mem bufferFilter = clCreateBuffer(
   context, 0, filterSize*sizeof(float), NULL, NULL);
Step 2: Write the Input Data
The call to
clEnqueueWriteImage() copies an image to a device. Unlike buffers, copying an image requires supplying a three-dimensional offset and region, which define the coordinates where the copy should begin and how far it should span, respectively.
The filter is copied using clEnqueueWriteBuffer(), as seen in previous examples.
// Copy the source image to the device
size_t origin[3] = {0, 0, 0};          // Offset within the image to copy from
size_t region[3] = {width, height, 1}; // Elements to copy per dimension
clEnqueueWriteImage(queue, bufferSourceImage, CL_TRUE,
   origin, region, 0, 0, sourceImage, 0, NULL, NULL);
// Copy the 7x7 filter to the device
clEnqueueWriteBuffer(queue, bufferFilter, CL_TRUE, 0,
   filterSize*sizeof(float), filter, 0, NULL, NULL);
Step 3: Create Sampler Object
In OpenCL, samplers are objects that describe how to access an image. Samplers specify the type of coordinate system, what to do when out-of-bounds accesses occur, and whether or not to interpolate if an access lies between multiple indices. The format of the
clCreateSampler() API call is as follows:
cl_sampler clCreateSampler(
   cl_context context,
   cl_bool normalized_coords,
   cl_addressing_mode addressing_mode,
   cl_filter_mode filter_mode,
   cl_int *errcode_ret)
The coordinate system can either be normalized (i.e., coordinates range from 0 to 1) or use standard pixel indices. Setting the second argument to CL_TRUE enables normalized coordinates. Convolution does not use normalized coordinates, so the argument is set to CL_FALSE.
OpenCL also supplies a number of addressing modes for handling out-of-bounds accesses. This convolution example uses CL_ADDRESS_CLAMP_TO_EDGE, which makes any out-of-bounds access return the value on the nearest border of the image. If CL_ADDRESS_CLAMP is used instead, an out-of-bounds access returns 0 for the R, G, and B channels, and either 0 or 1 for the A channel (based on the image format). Other options are available when normalized coordinates are used.
The filter mode can be set either to return the pixel closest to a coordinate or to interpolate between multiple pixel values if the coordinate lies somewhere in between.
// Create the image sampler
cl_sampler sampler = clCreateSampler(context, CL_FALSE,
   CL_ADDRESS_CLAMP_TO_EDGE, CL_FILTER_NEAREST, NULL);
Step 4: Compile and Execute the Kernel
The steps to create and build a program, create a kernel, set the kernel arguments, and enqueue the kernel for execution are identical to those in the previous example. Unlike the reference C version, the OpenCL code using images should create as many work-items as there are pixels in the image. Any out-of-bounds accesses due to the filter size will be handled automatically, based on the sampler object.
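The argument setup and launch for this example might look as follows. This is a sketch only: it assumes a program object named program has already been built from the kernel source, and it reuses queue, the memory objects, the image dimensions, filterWidth, and sampler from the earlier steps. The argument order matches the convolution kernel signature shown at the end of this section.

```c
// Create the kernel from the built program
cl_kernel kernel = clCreateKernel(program, "convolution", NULL);

// Set the kernel arguments in the order of the kernel signature
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufferSourceImage);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufferOutputImage);
clSetKernelArg(kernel, 2, sizeof(cl_int), &height);      // rows
clSetKernelArg(kernel, 3, sizeof(cl_int), &width);       // cols
clSetKernelArg(kernel, 4, sizeof(cl_mem), &bufferFilter);
clSetKernelArg(kernel, 5, sizeof(cl_int), &filterWidth);
clSetKernelArg(kernel, 6, sizeof(cl_sampler), &sampler);

// Create one work-item per pixel of the image. Passing NULL for
// the local size lets the implementation choose a workgroup size
size_t globalSize[2] = {width, height};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize,
   NULL, 0, NULL, NULL);
```

Because out-of-bounds reads are handled by the sampler, no extra padding of the NDRange is needed here.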
Step 5: Read the Result
Reading the result back to the host is very similar to writing the image, except that a pointer to the location to store the output data on the host is supplied.
// Read the output image back to the host, reusing the origin
// and region from the earlier write
clEnqueueReadImage(queue, bufferOutputImage, CL_TRUE,
   origin, region, 0, 0, outputImage, 0, NULL, NULL);
The Convolution Kernel
The kernel is fairly straightforward if the reference C code is understood—each work-item executes the two innermost loops. Data reads from the source image must be performed using an OpenCL construct that is specific to the data type. For this example, read_imagef() is used, where f signifies that the data to be read is of type single precision floating point. Accesses to an image always return a four-element vector (one per channel), so pixel (the value returned by the image access) and sum (resultant data that gets copied to the output image) must both be declared as a float4. Writing to the output image uses a similar function, write_imagef(), and requires that the data be formatted correctly (as a float4). Writing does not support out-of-bounds accesses. If there is any chance that there are more work-items in either dimension of the NDRange than there are pixels, bounds checking should be done before writing the output data.
The filter is a perfect candidate for constant memory in this example because all work-items access the same element each iteration. Simply adding the keyword __constant in the signature of the function places the filter in constant memory.
__kernel
void convolution(
   __read_only image2d_t sourceImage,
   __write_only image2d_t outputImage,
   int rows, int cols,
   __constant float* filter,
   int filterWidth,
   sampler_t sampler)
{
   // Store each work-item's unique row and column
   int column = get_global_id(0);
   int row = get_global_id(1);
   // Half the width of the filter is needed for indexing
   int halfWidth = (int)(filterWidth/2);
   // All accesses to images return data as four-element vectors
   // (i.e., float4), although only the 'x' component will contain
   // meaningful data in this code
   float4 sum = {0.0f, 0.0f, 0.0f, 0.0f};
   // Iterator for the filter
   int filterIdx = 0;
   // Each work-item iterates around its local area based on the
   // size of the filter
   int2 coords; // Coordinates for accessing the image
   // Iterate over the filter rows
   for(int i = -halfWidth; i <= halfWidth; i++) {
      coords.y = row + i;
      // Iterate over the filter columns
      for(int j = -halfWidth; j <= halfWidth; j++) {
         coords.x = column + j;
         // Read a pixel from the image. A single-channel image
         // stores the pixel value in the 'x' component of the
         // returned vector
         float4 pixel = read_imagef(sourceImage, sampler, coords);
         sum.x += pixel.x * filter[filterIdx++];
      }
   }
   // Copy the data to the output image if the
   // work-item is in bounds
   if(row < rows && column < cols) {
      coords.x = column;
      coords.y = row;
      write_imagef(outputImage, coords, sum);
   }
}