In this section, we discuss the implementation of some example OpenCL applications. The examples covered here include image rotation, matrix multiplication, and image convolution.
Simple Matrix Multiplication Example
A simple serial C implementation of matrix multiplication is shown here (remember that OpenCL host programs can be written either in C or with the OpenCL C++ Wrapper API). The code uses three nested for loops, multiplying Matrix A by Matrix B and storing the result in Matrix C. The two outer loops iterate over each element of the output matrix. The innermost loop iterates over the individual elements of the input matrices to calculate the result for each output location.
// Iterate over the rows of Matrix A
for(int i = 0; i < heightA; i++) {
   // Iterate over the columns of Matrix B
   for(int j = 0; j < widthB; j++) {
      C[i][j] = 0;
      // Multiply and accumulate the values in the current row
      // and column
      for(int k = 0; k < widthA; k++) {
         C[i][j] += A[i][k] * B[k][j];
      }
   }
}
It is straightforward to map the serial implementation to OpenCL, as the iterations of the two outer for-loops are independent of each other. This means that a separate work-item can be created for each output element of the matrix. The two outer for-loops are mapped to the two-dimensional range of work-items for the kernel.
The independence of output values inherent in matrix multiplication is shown in
Figure 4.1. Each work-item reads in its own row of Matrix A and its column of Matrix B. The data being read is multiplied and written at the appropriate location of the output Matrix C.
// widthA = heightB for valid matrix multiplication
__kernel void simpleMultiply(
   __global float* outputC,
   int widthA, int heightA,
   int widthB, int heightB,
   __global float* inputA,
   __global float* inputB) {
   //Get global position in Y direction
   int row = get_global_id(1);
   //Get global position in X direction
   int col = get_global_id(0);
   float sum = 0.0f;
   //Calculate result of one element of Matrix C
   for (int i = 0; i < widthA; i++) {
      sum += inputA[row*widthA+i] * inputB[i*widthB+col];
   }
   outputC[row*widthB+col] = sum;
}
Now that we understand the implementation of the data-parallel kernel, we need to write the OpenCL API calls that move the data to the device and run the kernel. The implementation steps for the rest of the matrix multiplication application are summarized in
Figure 4.2. We need to create a context for the device we wish to use. Using the context, we create the command queue, which is used to send commands to the device. Once the command queue is created, we can send the input data to the device, run the parallel kernel, and read the resultant output data back from the device.
Step 1: Set Up Environment
In this step, we declare a context, choose a device type, and create the context and a command queue. Throughout this example, the ciErrNum variable should always be checked to see if an error code is returned by the implementation.
cl_int ciErrNum;
// Use the first platform
cl_platform_id platform;
ciErrNum = clGetPlatformIDs(1, &platform, NULL);
// Use the first GPU device of the platform
cl_device_id device;
ciErrNum = clGetDeviceIDs(
   platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
// Create the context
cl_context_properties cps[3] = {
   CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0};
cl_context ctx = clCreateContext(
   cps, 1, &device, NULL, NULL, &ciErrNum);
// Create the command queue
cl_command_queue myqueue = clCreateCommandQueue(
   ctx, device, 0, &ciErrNum);
Step 2: Declare Buffers and Move Data
Declare buffers on the device and enqueue copies of the input matrices to the device. Also declare the output buffer.
// We assume that A, B, and C are float arrays that have been
// declared and initialized on the host
// Allocate space for Matrix A on the device
cl_mem bufferA = clCreateBuffer(
   ctx, CL_MEM_READ_ONLY, wA*hA*sizeof(float), NULL, &ciErrNum);
// Copy Matrix A to the device
ciErrNum = clEnqueueWriteBuffer(
   myqueue, bufferA, CL_TRUE, 0, wA*hA*sizeof(float),
   (void *)A, 0, NULL, NULL);
// Allocate space for Matrix B on the device
cl_mem bufferB = clCreateBuffer(
   ctx, CL_MEM_READ_ONLY, wB*hB*sizeof(float), NULL, &ciErrNum);
// Copy Matrix B to the device
ciErrNum = clEnqueueWriteBuffer(
   myqueue, bufferB, CL_TRUE, 0, wB*hB*sizeof(float),
   (void *)B, 0, NULL, NULL);
// Allocate space for Matrix C on the device
cl_mem bufferC = clCreateBuffer(
   ctx, CL_MEM_WRITE_ONLY, wB*hA*sizeof(float), NULL, &ciErrNum);
Step 3: Runtime Kernel Compilation
Compile the program from the kernel array, build the program, and define the kernel.
// We assume that the program source is stored in the variable
// 'source' and is NULL terminated
cl_program myprog = clCreateProgramWithSource(
   ctx, 1, (const char**)&source, NULL, &ciErrNum);
// Compile the program. Passing NULL for the 'device_list'
// argument targets all devices in the context
ciErrNum = clBuildProgram(myprog, 0, NULL, NULL, NULL, NULL);
// Create the kernel
cl_kernel mykernel = clCreateKernel(
   myprog, "simpleMultiply", &ciErrNum);
Step 4: Run the Program
Set the kernel arguments and the workgroup size. We can then enqueue the kernel onto the command queue to execute on the device.
// Set the kernel arguments
clSetKernelArg(mykernel, 0, sizeof(cl_mem), (void *)&bufferC);
clSetKernelArg(mykernel, 1, sizeof(cl_int), (void *)&wA);
clSetKernelArg(mykernel, 2, sizeof(cl_int), (void *)&hA);
clSetKernelArg(mykernel, 3, sizeof(cl_int), (void *)&wB);
clSetKernelArg(mykernel, 4, sizeof(cl_int), (void *)&hB);
clSetKernelArg(mykernel, 5, sizeof(cl_mem), (void *)&bufferA);
clSetKernelArg(mykernel, 6, sizeof(cl_mem), (void *)&bufferB);
// Set local and global workgroup sizes
// We assume the matrix dimensions are divisible by 16
size_t localws[2] = {16, 16};
size_t globalws[2] = {wB, hA}; // Dimensions of Matrix C
// Execute the kernel
ciErrNum = clEnqueueNDRangeKernel(
   myqueue, mykernel, 2, NULL, globalws, localws, 0, NULL, NULL);
Step 5: Read the Result Back to the Host
After the program has run, we enqueue a read back of the result matrix from the device buffer to host memory.
// Read the output data back to the host
ciErrNum = clEnqueueReadBuffer(
   myqueue, bufferC, CL_TRUE, 0, wB*hA*sizeof(float),
   (void *)C, 0, NULL, NULL);
The steps outlined here show an OpenCL implementation of matrix multiplication that can be used as a baseline. In later chapters, we use our understanding of data-parallel architectures to improve the performance of particular data-parallel algorithms.
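Before optimizing a baseline like this, it is worth validating the device output against the serial loop. The helper below is a sketch of such a check; the name verify_result and the tolerance are our own choices, not part of the original listing.

```c
#include <math.h>

// Compare a device-computed matrix C against a serial reference.
// A is heightA x widthA, B is widthA x widthB, C is heightA x widthB,
// all stored in row-major order as flat arrays.
int verify_result(const float *A, const float *B, const float *C,
                  int heightA, int widthA, int widthB)
{
    for (int i = 0; i < heightA; i++) {
        for (int j = 0; j < widthB; j++) {
            float ref = 0.0f;
            for (int k = 0; k < widthA; k++)
                ref += A[i*widthA + k] * B[k*widthB + j];
            // Allow for floating-point rounding differences
            if (fabsf(C[i*widthB + j] - ref) > 1e-4f)
                return 0; // mismatch found
        }
    }
    return 1; // all elements match
}
```

A check like this is most useful during development; for large matrices, the tolerance may need to grow with widthA because rounding error accumulates across the inner loop.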
Image Rotation Example
Image rotation is a common image processing routine with applications in matching, alignment, and other image-based algorithms. The inputs to an image rotation routine are an image, the rotation angle θ, and the point about which the rotation is performed. The aim is to achieve the result shown in Figure 4.3. For the image rotation example, we use OpenCL's C++ Wrapper API.
The coordinates of a point (x1, y1), when rotated by an angle θ around (x0, y0), become (x2, y2), as shown by the following equations:

x2 = cos(θ)·(x1 − x0) + sin(θ)·(y1 − y0) + x0
y2 = −sin(θ)·(x1 − x0) + cos(θ)·(y1 − y0) + y0

By rotating the image about the origin (0, 0), these reduce to

x2 = x1·cos(θ) + y1·sin(θ)
y2 = −x1·sin(θ) + y1·cos(θ)
To implement image rotation with OpenCL, note that the new (x, y) coordinates of each pixel in the input can be calculated independently. Each work-item will calculate the new position of a single pixel. In a manner similar to matrix multiplication, a work-item can obtain the location of its respective pixel using its global ID (as shown in Figure 4.4).
The image rotation example is a good example of input decomposition, meaning that each element of the input (in this case, each pixel of the input image) is mapped to a work-item. When an image is rotated, the new locations of some pixels may fall outside the image if the input and output images are the same size (see Figure 4.3, in which the corners of the input do not fit within the resultant image). For this reason, we need to check the bounds of the calculated output coordinates.
__kernel void img_rotate(
__global float* dest_data, __global float* src_data,
int W, int H,//Image Dimensions
float sinTheta, float cosTheta ) //Rotation Parameters
{
//Work-item gets its index within index space
const int ix = get_global_id(0);
const int iy = get_global_id(1);
   //Calculate the destination location (xpos,ypos) for the
   //input pixel (ix,iy) -- the input decomposition mentioned earlier
   float xpos = ((float)ix)*cosTheta + ((float)iy)*sinTheta;
   float ypos = -1.0f*((float)ix)*sinTheta + ((float)iy)*cosTheta;
   //Bound check the output coordinates
   if(((int)xpos >= 0) && ((int)xpos < W) &&
      ((int)ypos >= 0) && ((int)ypos < H))
   {
      // Read (ix,iy) from src_data and store it at (xpos,ypos)
      // in dest_data. In this case, because we are rotating
      // about the origin and there is no translation, we know
      // that (xpos,ypos) will be unique for each input (ix,iy)
      // and so each work-item can write its result independently
      dest_data[(int)ypos*W+(int)xpos] = src_data[iy*W+ix];
   }
}
As seen in the previous kernel code, image rotation is an
embarrassingly parallel problem, in which each resulting pixel value is computed independently. The main steps for the host code are similar to those in
Figure 4.2. For this example's host code, we can reuse a substantial amount of code from the previous matrix multiplication example, including the code that will create the context and the command queue.
To give the developer wider exposure to OpenCL, we write the host code for the image rotation example with the C++ bindings for OpenCL 1.1. The C++ bindings provide access to the low-level features of the original OpenCL C API; they are compatible with standard C++ compilers, are carefully designed to perform no extra memory allocation, and offer full access to the features of OpenCL without unnecessarily masking functionality. The bindings are obtained by including the header cl.hpp. The steps are shown in a manner similar to the matrix multiplication example in order to illustrate the close correspondence between the C API and the more concise C++ bindings.
Step 1: Set Up Environment
cl::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
// Use the first platform
cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM,
   (cl_context_properties)(platforms[0])(), 0};
// Create a context using this platform for any device type
cl::Context context(CL_DEVICE_TYPE_ALL, cps);
// Get the device list from the context
cl::vector<cl::Device> devices =
   context.getInfo<CL_CONTEXT_DEVICES>();
// Create a command queue on the first device
cl::CommandQueue queue = cl::CommandQueue(context, devices[0], 0);
Step 2: Declare Buffers and Move Data
// Create buffers for the input and output data ("W" and "H"
// are the width and height of the image, respectively)
cl::Buffer d_ip = cl::Buffer(context, CL_MEM_READ_ONLY,
   W*H*sizeof(float));
cl::Buffer d_op = cl::Buffer(context, CL_MEM_WRITE_ONLY,
   W*H*sizeof(float));
// Copy the input data to the device (assume that the input
// image is the array "ip")
queue.enqueueWriteBuffer(d_ip, CL_TRUE, 0, W*H*sizeof(float), ip);
Step 3: Runtime Kernel Compilation
// Read in the program source
std::ifstream sourceFileName("img_rotate_kernel.cl");
std::string sourceFile(
std::istreambuf_iterator<char>(sourceFileName),
(std::istreambuf_iterator<char>()));
cl::Program::Sources rotn_source(1,
   std::make_pair(sourceFile.c_str(), sourceFile.length()+1));
cl::Program rotn_program(context, rotn_source);
rotn_program.build(devices);
cl::Kernel rotn_kernel(rotn_program, "img_rotate");
Step 4: Run the Program
// The angle of rotation is theta
float cos_theta = cos(theta);
float sin_theta = sin(theta);
// Set the kernel arguments, matching the order of the
// parameters in the kernel signature
rotn_kernel.setArg(0, d_op);
rotn_kernel.setArg(1, d_ip);
rotn_kernel.setArg(2, W);
rotn_kernel.setArg(3, H);
rotn_kernel.setArg(4, sin_theta);
rotn_kernel.setArg(5, cos_theta);
// Set the size of the NDRange and workgroups
cl::NDRange globalws(W,H);
cl::NDRange localws(16,16);
// Run the kernel
queue.enqueueNDRangeKernel(rotn_kernel, cl::NullRange,
   globalws, localws);
Step 5: Read Result Back to Host
// Read the output buffer back to the host
queue.enqueueReadBuffer(d_op, CL_TRUE, 0, W*H*sizeof(float), op);
As seen from the previous code, the C++ bindings maintain a close correspondence to the C API.
Image Convolution Example
In image processing, convolution is a commonly used algorithm that modifies the value of each pixel in an image by using information from neighboring pixels. A convolution kernel, or filter, describes how each pixel will be influenced by its neighbors. For example, a blurring kernel will take the weighted average of neighboring pixels so that large differences between pixel values are reduced. By using the same source image and changing only the filter, effects such as sharpening, blurring, edge enhancing, and embossing can be produced.
A convolution kernel works by iterating over each pixel in the source image. For each source pixel, the filter is centered over the pixel and the values of the filter
multiply the pixel values that they overlay. A sum of the products is then taken to produce a new pixel value.
Figure 4.5 provides a visual for this algorithm.
Figure 4.6B shows the effect of a blurring filter and
Figure 4.6C shows the effect of an edge-detection filter on the same source image seen in
Figure 4.6A.
The following code performs a convolution in C. The outer two loops iterate over the source image, selecting the next source pixel. At each source pixel, the filter is applied to the neighboring pixels.
// Iterate over the rows of the source image
for(int i = halfFilterWidth; i < rows - halfFilterWidth; i++) {
   // Iterate over the columns of the source image
   for(int j = halfFilterWidth; j < cols - halfFilterWidth; j++) {
      sum = 0; // Reset sum for new source pixel
      // Apply the filter to the neighborhood
      for(int k = -halfFilterWidth; k <= halfFilterWidth; k++) {
         for(int l = -halfFilterWidth; l <= halfFilterWidth; l++) {
            sum += Image[i+k][j+l] *
                   Filter[k+halfFilterWidth][l+halfFilterWidth];
         }
      }
      // Store the new pixel value
      outputImage[i][j] = sum;
   }
}
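As a concrete check of the loops above, applying a 3x3 averaging (blur) filter at the center of a 3x3 image should simply average all nine pixels. The helper below re-implements the two inner loops for a single output pixel over flat row-major arrays; the function name and layout are our own, chosen for illustration.

```c
// Apply a square filter centered at (i, j) of an image with 'cols'
// columns, both stored as flat row-major float arrays. filterWidth
// must be odd, and (i, j) must be at least filterWidth/2 away from
// every edge, since this sketch does no bounds handling.
float convolve_pixel(const float *image, int cols,
                     const float *filter, int filterWidth,
                     int i, int j)
{
    int halfFilterWidth = filterWidth / 2;
    float sum = 0.0f;
    for (int k = -halfFilterWidth; k <= halfFilterWidth; k++)
        for (int l = -halfFilterWidth; l <= halfFilterWidth; l++)
            sum += image[(i + k) * cols + (j + l)] *
                   filter[(k + halfFilterWidth) * filterWidth +
                          (l + halfFilterWidth)];
    return sum;
}
```

With the image {1, 2, ..., 9} and a filter of nine 1/9 weights, the value at the center pixel (1, 1) is (1 + 2 + ... + 9) / 9 = 5.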
Step 1: Create Image and Buffer Objects
This example implements convolution using OpenCL images for the data type of the source and output images. Using images to represent the data has a number of advantages. For the convolution, work-items representing border pixels may read
out-of-bounds. Images supply a mechanism to automatically handle these accesses and return meaningful data.
The code begins by assuming that a context (context) and command queue (queue) have already been created, and that the source image (sourceImage), output image (outputImage), and filter (filter) have already been initialized on the host. The images both have dimensions width by height.
The first task is to allocate space for the source and output images and the filter on the device. Images require a format descriptor,
cl_image_format, to define the size and type of data that they store and the channel layout that they use to store it. The
image_channel_order field of the descriptor is where the channel layout is specified. Recall from Chapter 2 that every element of an image stores data in up to four channels, referred to as R, G, B, and A. An image that should hold four values in every element should use CL_RGBA for the channel order. However, if each work-item will only access a single value (e.g., a pixel from a grayscale image or an element of a matrix), the data can be stored using a single channel by specifying CL_R. This example assumes grayscale data and so uses only a single channel. The type of the data is specified in the
image_channel_data_type field of the descriptor. Integers are specified by a combination of signedness and size. For example,
CL_SIGNED_INT32 is a 32-bit signed integer, and
CL_UNSIGNED_INT8 is the equivalent of an unsigned character in C. Floating point data is specified by
CL_FLOAT, and this is the type of data used in the example.
After creating the image format descriptor, memory objects are created to represent the images using clCreateImage2D(). A buffer is created for the filter and will eventually be used as constant memory.
// The convolution filter is 7x7
int filterWidth = 7;
int filterSize = filterWidth*filterWidth; // Assume a square kernel
// The image format describes how the data will be stored in memory
cl_image_format format;
format.image_channel_order = CL_R;         // single channel
format.image_channel_data_type = CL_FLOAT; // float data type
// Create space for the source image on the device
cl_mem bufferSourceImage = clCreateImage2D(
   context, 0, &format, width, height, 0, NULL, NULL);
// Create space for the output image on the device
cl_mem bufferOutputImage = clCreateImage2D(
   context, 0, &format, width, height, 0, NULL, NULL);
// Create space for the 7x7 filter on the device
cl_mem bufferFilter = clCreateBuffer(
   context, 0, filterSize*sizeof(float), NULL, NULL);
Step 2: Write the Input Data
The call to
clEnqueueWriteImage() copies an image to a device. Unlike buffers, copying an image requires supplying a three-dimensional offset and region, which define the coordinates where the copy should begin and how far it should span, respectively.
The filter is copied using clEnqueueWriteBuffer(), as seen in previous examples.
// Copy the source image to the device
size_t origin[3] = {0, 0, 0};          // Offset within the image to copy from
size_t region[3] = {width, height, 1}; // Elements to copy per dimension
clEnqueueWriteImage(queue, bufferSourceImage, CL_TRUE,
   origin, region, 0, 0, sourceImage, 0, NULL, NULL);
// Copy the 7x7 filter to the device
clEnqueueWriteBuffer(queue, bufferFilter, CL_TRUE, 0,
   filterSize*sizeof(float), filter, 0, NULL, NULL);
Step 3: Create Sampler Object
In OpenCL, samplers are objects that describe how to access an image. Samplers specify the type of coordinate system, what to do when out-of-bounds accesses occur, and whether or not to interpolate if an access lies between multiple indices. The format of the
clCreateSampler() API call is as follows:
cl_sampler clCreateSampler(
   cl_context context,
   cl_bool normalized_coords,
   cl_addressing_mode addressing_mode,
   cl_filter_mode filter_mode,
   cl_int *errcode_ret)
The coordinate system can either be normalized (i.e., coordinates range from 0 to 1) or use standard pixel indices. Setting the second argument to CL_TRUE enables normalized coordinates. Convolution does not use normalized coordinates, so the argument is set to CL_FALSE.
OpenCL also supplies a number of addressing modes for handling out-of-bounds accesses. This convolution example uses CL_ADDRESS_CLAMP_TO_EDGE, which makes any out-of-bounds access return the value on the nearest border of the image. If CL_ADDRESS_CLAMP is used instead, an out-of-bounds access returns 0 for the R, G, and B channels, and either 0 or 1 for the A channel (based on the image format). Other options are available when normalized coordinates are used.
The filter mode can be set either to return the pixel closest to a coordinate or to interpolate between multiple pixel values if the coordinate lies somewhere in between.
// Create the image sampler
cl_sampler sampler = clCreateSampler(context, CL_FALSE,
   CL_ADDRESS_CLAMP_TO_EDGE, CL_FILTER_NEAREST, NULL);
Step 4: Compile and Execute the Kernel
The steps to create and build a program, create a kernel, set the kernel arguments, and enqueue the kernel for execution are identical to those in the previous example. Unlike the reference C version, the OpenCL code using images should create as many work-items as there are pixels in the image. Any out-of-bounds accesses due to the filter size will be handled automatically, based on the sampler object.
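The argument setup and launch for this example might look as follows. This is a sketch only: it assumes a program object named program has already been built from the kernel source, and it reuses queue, the memory objects, the image dimensions, filterWidth, and sampler from the earlier steps. The argument order matches the convolution kernel signature shown at the end of this section.

```c
// Create the kernel from the built program
cl_kernel kernel = clCreateKernel(program, "convolution", NULL);

// Set the kernel arguments in the order of the kernel signature
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufferSourceImage);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufferOutputImage);
clSetKernelArg(kernel, 2, sizeof(cl_int), &height);      // rows
clSetKernelArg(kernel, 3, sizeof(cl_int), &width);       // cols
clSetKernelArg(kernel, 4, sizeof(cl_mem), &bufferFilter);
clSetKernelArg(kernel, 5, sizeof(cl_int), &filterWidth);
clSetKernelArg(kernel, 6, sizeof(cl_sampler), &sampler);

// Create one work-item per pixel of the image. Passing NULL for
// the local size lets the implementation choose a workgroup size
size_t globalSize[2] = {width, height};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalSize,
   NULL, 0, NULL, NULL);
```

Because out-of-bounds reads are handled by the sampler, no extra padding of the NDRange is needed here.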
Step 5: Read the Result
Reading the result back to the host is very similar to writing the image, except that a pointer to the location to store the output data on the host is supplied.
// Read the output image back to the host, reusing the origin
// and region from the earlier write
clEnqueueReadImage(queue, bufferOutputImage, CL_TRUE,
   origin, region, 0, 0, outputImage, 0, NULL, NULL);
The Convolution Kernel
The kernel is fairly straightforward if the reference C code is understood—each work-item executes the two innermost loops. Data reads from the source image must be performed using an OpenCL construct that is specific to the data type. For this example, read_imagef() is used, where f signifies that the data to be read is of type single precision floating point. Accesses to an image always return a four-element vector (one per channel), so pixel (the value returned by the image access) and sum (resultant data that gets copied to the output image) must both be declared as a float4. Writing to the output image uses a similar function, write_imagef(), and requires that the data be formatted correctly (as a float4). Writing does not support out-of-bounds accesses. If there is any chance that there are more work-items in either dimension of the NDRange than there are pixels, bounds checking should be done before writing the output data.
The filter is a perfect candidate for constant memory in this example because all work-items access the same element each iteration. Simply adding the keyword __constant in the signature of the function places the filter in constant memory.
__kernel
void convolution(
   __read_only image2d_t sourceImage,
   __write_only image2d_t outputImage,
   int rows, int cols,
   __constant float* filter,
   int filterWidth,
   sampler_t sampler)
{
   // Store each work-item's unique row and column
   int column = get_global_id(0);
   int row = get_global_id(1);
   // Half the width of the filter is needed for indexing
   int halfWidth = (int)(filterWidth/2);
   // All accesses to images return data as four-element vectors
   // (i.e., float4), although only the 'x' component will contain
   // meaningful data in this code
   float4 sum = {0.0f, 0.0f, 0.0f, 0.0f};
   // Iterator for the filter
   int filterIdx = 0;
   // Each work-item iterates around its local area based on the
   // size of the filter
   int2 coords; // Coordinates for accessing the image
   // Iterate over the filter rows
   for(int i = -halfWidth; i <= halfWidth; i++) {
      coords.y = row + i;
      // Iterate over the filter columns
      for(int j = -halfWidth; j <= halfWidth; j++) {
         coords.x = column + j;
         // Read a pixel from the image. A single-channel image
         // stores the pixel value in the 'x' component of the
         // returned vector
         float4 pixel = read_imagef(sourceImage, sampler, coords);
         sum.x += pixel.x * filter[filterIdx++];
      }
   }
   // Copy the data to the output image if the
   // work-item is in bounds
   if(row < rows && column < cols) {
      coords.x = column;
      coords.y = row;
      write_imagef(outputImage, coords, sum);
   }
}