Overview of Extension Mechanism
OpenCL defines three types of extensions:
• KHR extension: Formally ratified by the OpenCL working group and comes with a set of conformance tests that any application claiming to support the extension must have passed. All KHR extensions are included as part of the OpenCL specification. In general, the goal is to keep the set of KHR extensions small to avoid differences between implementations. Unfortunately, in some cases, a feature must be an extension because it is not possible to support on all platforms. DirectX interoperability is a good example because it is only relevant to OpenCL implementations that support Microsoft Windows. A KHR extension is assigned a unique string name of the form cl_khr_<name>, where name is the name given to the particular extension.
• EXT extension: Developed by one or more of the OpenCL working group members, but it has not been formally approved. There is no requirement for an EXT feature to pass conformance tests. An EXT extension is often used to provide early access to features being worked on by a subgroup of the OpenCL working group, many of which will appear in the specification core or as a KHR at a later date. An EXT extension is assigned a unique string name of the form cl_ext_<name>, where name is the name given to the particular extension.
• Vendor extension: Developed by a single vendor, most likely to expose some feature that is only accessible on the vendor's hardware or platform (and thus would not be of general application). An example is AMD's Device Attribute Query, which allows the developer to query additional information specific to AMD devices. A vendor extension is assigned a unique string name of the form cl_<vendor>_<name>, where vendor is a vendor-defined string (e.g., amd is used by AMD) and name is the name given to the particular extension.
Extensions can be associated with a given OpenCL platform and in those cases are likely to be always enabled. For example, AMD's Event Callback extension (cl_amd_event_callback) is always enabled. Other extensions can be specific to a particular set of devices (e.g., double precision support (cl_khr_fp64)).
In general, the application must query either the platform or a particular device to determine what extensions are supported. Using the C++ Wrapper API, the following code demonstrates how to query both the platform and a device for the extension information:
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
std::string platformExts = platforms[0].getInfo<CL_PLATFORM_EXTENSIONS>();
cl_context_properties cprops[] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0 };
cl::Context context(devType, cprops);
std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
// Query the device for its supported extensions
std::string deviceExts = devices[0].getInfo<CL_DEVICE_EXTENSIONS>();
Of course, because we now have a value of type std::string for platform and device extensions, we can simply use the method find(). For example, to check that a device supports 64-bit Atomics (defined with the name cl_khr_int64_base_atomics), you could use the following code:
bool has64Atomics =
    deviceExts.find("cl_khr_int64_base_atomics") != std::string::npos;
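Note that a plain substring search can return false positives: searching for "cl_khr_int64" would also succeed on a device that only reports cl_khr_int64_base_atomics. Because the extension string is a space-separated list, a whole-token comparison is safer. The following helper is our own sketch (hasExtension is not part of the OpenCL API):

```cpp
#include <sstream>
#include <string>

// Sketch of a safer extension query: returns true only if `ext` appears
// as a whole, space-delimited token in the extensions string, rather
// than as a prefix of a longer extension name.
bool hasExtension(const std::string& extList, const std::string& ext)
{
    std::istringstream tokens(extList);
    std::string token;
    while (tokens >> token) {
        if (token == ext) {
            return true; // exact token match
        }
    }
    return false; // absent (a prefix match alone does not count)
}
```

With this helper, hasExtension(deviceExts, "cl_khr_int64_base_atomics") cannot be fooled by a partial name.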
Because extensions are optional, there is no requirement for an implementation to provide any externally defined symbols that the host application can call. Accordingly, a host program cannot be statically linked against any extension APIs. Instead, we can query the address of a particular extension function using the following API:
void * clGetExtensionFunctionAddress(const char *funcname);
This returns the address of the extension function named by funcname. The pointer returned can be cast to a function pointer of the expected type. Typedefs for each function pointer are declared for all extensions that add API entry points. For example, the EXT extension Atomic Counters, exposed in the OpenCL extension header cl_ext.h, provides low-latency atomic append/consume counters and defines the following:
typedef CL_API_ENTRY cl_counter_ext (CL_API_CALL * clCreateCounterEXT_fn)(
    cl_counter_flags_amd /* flags */,
    cl_int * /* error_ret */) CL_API_SUFFIX__VERSION_1_0;
Then, assuming that a particular OpenCL device supports the extension cl_ext_atomic_counters, the following call will set clCreateCounterEXT_pfn to a non-NULL value:
clCreateCounterEXT_fn clCreateCounterEXT_pfn = (clCreateCounterEXT_fn)
    clGetExtensionFunctionAddress("clCreateCounterEXT");
In practice, it is often useful to use the following macro to simplify initializing these function pointers:
#define __INIT_CL_EXT_FCN_PTR(name) \
    if (!name##_pfn) { \
        name##_pfn = (name##_fn) \
            clGetExtensionFunctionAddress(#name); \
    }
To get the same behavior as the function pointer definition previously, we can write the following:
__INIT_CL_EXT_FCN_PTR(clCreateCounterEXT)
The examples presented later in this chapter go a step further and use the OpenCL C++ Wrapper API, which, if a particular extension is enabled, directly exposes a class interface and implicitly takes care of allocating and querying function pointers (as required).
Once an application has verified that a particular extension is supported by a device and has obtained any required API function addresses, it can then use any corresponding OpenCL C source extensions. However, these extensions must also be enabled, this time in the OpenCL C source itself. These features are controlled using a pragma directive. The pragma is of the form
#pragma OPENCL EXTENSION extension_name : behavior
where extension_name is the corresponding name of the extension being enabled or disabled, such as cl_khr_int64_base_atomics, and behavior can be either enable or disable. The impact of enabling or disabling an extension follows from the position in the source code of the last seen pragma for a given extension. For example, to enable 64-bit atomics in an OpenCL C source program, the following code would do the job:
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
Images are the one exception to the rules described previously. To check if a device supports this feature (which is optional), a separate device query is required:
bool hasImages = devices[0].getInfo<CL_DEVICE_IMAGE_SUPPORT>();
If the result of this call is true, then images are supported by the device in question and the OpenCL C source is not required to explicitly enable this feature.
There are many KHR, EXT, and vendor extensions—too many to discuss in detail in this chapter. The remainder of this chapter discusses two important extensions in wide use today:
• Device fission—the ability for some devices to be divided into smaller subdevices
• Double precision—the ability for some devices to support double as well as float data types
Device Fission
The EXT extension Device Fission (Bellows et al., 2010) provides an interface for subdividing an OpenCL device into multiple subdevices. As an example, consider an AMD six-core Istanbul x86 CPU, shown symbolically in Figure 11.1A. By default, we view all six cores as a single OpenCL device, as shown in Figure 11.1B. Using Device Fission, the six cores can be subdivided into six OpenCL devices, as shown in Figure 11.1C, each capable of supporting one or more command queues. Because these command queues are asynchronous and run in their own threads, it is possible to use Device Fission to build a portable and powerful threading application based on task parallelism. To date, Device Fission is supported only on CPU-like devices, but in the future this functionality will spread across all devices in the platform, including GPUs. Next, we motivate and describe the basic use of Device Fission through an example.
An interesting application of Device Fission is to use it to build and study concurrent runtimes, similar to Microsoft's Concurrent Runtime (ConcRT) (Microsoft, 2010) and Intel's Threading Building Blocks (TBB) (Intel, 2011). We present a case study of building a simple C++ class that supports a parallel for construct that distributes work across a set of devices created using Device Fission. Although this example has little in common with the kind of production functionality provided by runtimes such as ConcRT and TBB, it provides enough to show that with only a small amount of effort we can build portable, scalable, parallel applications that go beyond basic data-parallel computations. For this example, the goal is to implement a simple class Parallel that provides a parallelFor method to distribute work across x86 cores, each having a corresponding OpenCL device generated by Device Fission. The class Parallel has the following public interface:
public:
    static unsigned int atomicAdd(
        unsigned int inc,
        volatile unsigned int *dest);
    bool parallelFor(int range, std::function<void (int i)> f);
We demonstrate its use with a simple application that counts the number of prime numbers in an input array of randomly generated integers. The implementation uses the OpenCL C++ Wrapper API (Gaster, 2010). To enable Device Fission, we simply need to define USE_CL_DEVICE_FISSION before including cl.hpp:
#define USE_CL_DEVICE_FISSION 1
#include <CL/cl.hpp>
As described previously, we encapsulate the parallel features of our simple runtime in the class Parallel. All OpenCL state for the program is held in the private part of this class definition. The default constructor initializes OpenCL and divides the CPU device into single-core subdevices. As is standard with all OpenCL initialization, we need to first query a platform, then create a context (in this case, using CL_DEVICE_TYPE_CPU), and finally inspect the list of devices (as shown in Figure 11.1B, this will always be presented as a single element):
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
cl_context_properties properties[] =
{
    CL_CONTEXT_PLATFORM,
    (cl_context_properties)(platforms[0])(),
    0
};
context_ = cl::Context(CL_DEVICE_TYPE_CPU, properties);
std::vector<cl::Device> devices =
context_.getInfo<CL_CONTEXT_DEVICES>();
Before we can use the Device Fission extension, the program must check that it is an exported extension for the device:
if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find(
    "cl_ext_device_fission") == std::string::npos) {
    std::cout << "Required that device support "
              << "cl_ext_device_fission"
              << std::endl;
    exit(-1);
}
Given an OpenCL cl::Device, in this case devices[0], the method createSubDevices creates subdevices,
cl_int createSubDevices(
const cl_device_partition_property_ext * properties,
VECTOR_CLASS<Device>* devices)
which, given a list of partition properties (as defined in Table 11.1), creates a set of subdevices, returned in the parameter devices.
Table 11.1 Subdevice Partition Properties
cl_device_partition_property_ext | Description |
---|---|
CL_DEVICE_PARTITION_EQUALLY_EXT | Split the aggregate device into as many smaller devices as can be created, each containing N compute units. The value N is passed as the value accompanying this property. If N does not divide evenly into CL_DEVICE_MAX_COMPUTE_UNITS, then the remaining compute units are not used. |
CL_DEVICE_PARTITION_BY_COUNTS_EXT | This property is followed by a CL_PARTITION_BY_COUNTS_LIST_END_EXT terminated list of compute unit counts. For each non-zero count M in the list, a subdevice is created with M compute units in it. CL_PARTITION_BY_COUNTS_LIST_END_EXT is defined to be 0. |
CL_DEVICE_PARTITION_BY_NAMES_EXT | This property is followed by a list of compute unit names. Each list starts with a CL_PARTITION_BY_NAMES_LIST_END_EXT terminated list of compute unit names. Compute unit names are integers that count up from zero to the number of compute units less 1. CL_PARTITION_BY_NAMES_LIST_END_EXT is defined to be −1. Only one subdevice may be created at a time with this selector. An individual compute unit name may not appear more than once in the subdevice description. |
CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT | Split the device into smaller aggregate devices containing one or more compute units that all share part of a cache hierarchy. The value accompanying this property may be drawn from the following CL_AFFINITY_DOMAIN list:CL_AFFINITY_DOMAIN_NUMA_EXT—Split the device into subdevices composed of compute units that share a NUMA band. CL_AFFINITY_DOMAIN_L4_CACHE_EXT—Split the device into subdevices composed of compute units that share a level 4 data cache. CL_AFFINITY_DOMAIN_L3_CACHE_EXT—Split the device into subdevices composed of compute units that share a level 3 data cache. CL_AFFINITY_DOMAIN_L2_CACHE_EXT—Split the device into subdevices composed of compute units that share a level 2 data cache. CL_AFFINITY_DOMAIN_L1_CACHE_EXT—Split the device into subdevices composed of compute units that share a level 1 data cache. CL_AFFINITY_DOMAIN_NEXT_FISSIONABLE_EXT—Split the device along the next fissionable CL_AFFINITY_DOMAIN. The implementation shall find the first level along which the device or subdevice may be further subdivided in the order NUMA, L4, L3, L2, L1, and fission the device into subdevices composed of compute units that share memory subsystems at this level. The user may determine what happened by calling clGetDeviceInfo(CL_DEVICE_PARTITION_STYLE_EXT) on the subdevices. |
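The arithmetic of CL_DEVICE_PARTITION_EQUALLY_EXT is easy to get wrong: compute units that do not divide evenly are silently left unused. The following sketch is our own model of the rule in Table 11.1 (it is not OpenCL code; subdevicesCreatedEqually is a hypothetical name):

```cpp
// Models CL_DEVICE_PARTITION_EQUALLY_EXT from Table 11.1: partitioning a
// device that reports `maxComputeUnits` into subdevices of `n` compute
// units each yields floor(max/n) subdevices; the remainder is unused.
unsigned subdevicesCreatedEqually(unsigned maxComputeUnits, unsigned n)
{
    if (n == 0) {
        return 0; // invalid request: no subdevices
    }
    return maxComputeUnits / n; // integer division discards the remainder
}
```

For the six-core Istanbul CPU of Figure 11.1, requesting one compute unit per subdevice yields six subdevices, whereas requesting four yields a single subdevice and leaves two cores idle.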
The following code then creates the subdevices, using the partition CL_DEVICE_PARTITION_EQUALLY_EXT to equally subdivide (in this case), producing a one-to-one mapping between subdevice and core:
cl_device_partition_property_ext subDeviceProperties[] =
{
    CL_DEVICE_PARTITION_EQUALLY_EXT,
    1,
    CL_PROPERTIES_LIST_END_EXT,
    0
};
devices[0].createSubDevices(subDeviceProperties, &subDevices_);
if (subDevices_.size() <= 0) {
    std::cout << "Failed to allocate sub-devices" << std::endl;
    exit(-1);
}
The following code concludes the default constructor definition by iterating through the list of subdevices, creating a command queue for each one:
for (auto i = subDevices_.begin(); i != subDevices_.end(); i++) {
    queues_.push_back(cl::CommandQueue(context_, *i));
}
Figure 11.2 shows each CPU core paired with its corresponding command queue. It is important to note that commands submitted via each queue run asynchronously and concurrently with each other.
The implementation for atomicAdd is straightforward and is provided later. Here, we focus on the definition of parallelFor, which takes two arguments. The first argument represents the bounds of the iteration space, and the second argument is a function object that will be applied to each index across this space. This second argument is a native C++ function object, but OpenCL clEnqueueNDRangeKernel operations are valid only for OpenCL C kernels, and so we need another approach to support native C++ functions. Fortunately, OpenCL has just such a function, clEnqueueNativeKernel, which can enqueue a C function. With some careful marshaling, this can also be made to work for a C++ function. Native kernels were described in Chapter 5.
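For reference, the underlying C API entry point has the following signature in the OpenCL 1.1 headers; the C++ Wrapper's enqueueNativeKernel method used below is a thin veneer over it:

```c
cl_int clEnqueueNativeKernel(cl_command_queue command_queue,
                             void (CL_CALLBACK *user_func)(void *),
                             void *args,     /* copied by the runtime */
                             size_t cb_args, /* size of args in bytes */
                             cl_uint num_mem_objects,
                             const cl_mem *mem_list,
                             const void **args_mem_loc,
                             cl_uint num_events_in_wait_list,
                             const cl_event *event_wait_list,
                             cl_event *event);
```

Because the runtime copies cb_args bytes of the args buffer at enqueue time, the same buffer can safely be reused for subsequent submissions, as the parallelFor implementation below relies on.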
Using clEnqueueNativeKernel, the definition of parallelFor is straightforward:
bool parallelFor(int range, std::function<void (int i)> f)
{
    size_t args[2];
    args[0] = reinterpret_cast<size_t>(&f);
    int index = 0;
    std::vector<cl::Event> events;
    for (int x = 0; x < range;) {
        int numQueues = (range - x) > static_cast<int>(queues_.size())
            ? static_cast<int>(queues_.size()) : (range - x);
        while (numQueues > 0) {
            cl::Event event;
            args[1] = static_cast<size_t>(index++);
            queues_[numQueues-1].enqueueNativeKernel(
                funcWrapper,
                std::make_pair(
                    static_cast<void *>(args),
                    sizeof(size_t) * 2),
                NULL, NULL, NULL, &event);
            events.push_back(event);
            numQueues--;
            x++;
        }
        cl::Event::waitForEvents(events);
        events.clear();
    }
    return true;
}
The first part of the function sets up the argument, which in this case is the actual function we want to run. This is required because it is a C++ function object (often called a functor), and OpenCL is expecting a C function. Thus, we provide a wrapper function called funcWrapper that takes as an argument a pointer to the functor, "unboxes" it, and calls the function object when executed. The main body of the function is the loop executing the functor over the range 0 … range, with the inner loop mapping some subset of this range to the number of actual subdevices. Note that these are all submitted asynchronously, and each call returns an event that we wait on at the end of each set of submissions. A more optimized version might wait until a large number of submissions have happened and exercise finer-grained control using queue.flush() and so on. We leave this as an exercise for the reader.
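The batching scheme above reduces to simple arithmetic: with Q command queues, the range is processed in ceil(range/Q) waves of submissions, each ending with a waitForEvents(). A small sketch of that count (our own illustration; numberOfWaves is a hypothetical name):

```cpp
// Number of submit-then-wait rounds performed by parallelFor when
// distributing `range` iterations over `numQueues` command queues.
int numberOfWaves(int range, int numQueues)
{
    return (range + numQueues - 1) / numQueues; // ceiling division
}
```

For example, 1024 iterations over six single-core subdevices take 171 waves, which is why batching more submissions between waits is suggested as an optimization.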
Finally, we put this all together in Listing 11.1, which shows the complete implementation of our primes checking example, including the main function. One interesting and important aspect of this example is that no OpenCL C device code is used; it simply uses the OpenCL runtime as a portable threading library.
// Enable OpenCL C++ exceptions
#define __CL_ENABLE_EXCEPTIONS
#define USE_CL_DEVICE_FISSION 1
#include <CL/cl.hpp>
class Parallel
{
private:
    cl::Context context_;
    std::vector<cl::Device> subDevices_;
    std::vector<cl::CommandQueue> queues_;
static void CL_CALLBACK funcWrapper(void * a)
{
size_t * args = static_cast<size_t *>(a);
std::function<void (int i)> * f =
reinterpret_cast<std::function<void (int i)>*>(args[0]);
        (*f)(static_cast<int>(args[1]));
    }
public:
    Parallel()
{
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
    cl_context_properties properties[] =
    {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties)(platforms[0])(),
        0
    };
    context_ = cl::Context(CL_DEVICE_TYPE_CPU, properties);
std::vector<cl::Device> devices =
context_.getInfo<CL_CONTEXT_DEVICES>();
// Check that device fission is supported
    if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find(
        "cl_ext_device_fission") == std::string::npos) {
        std::cout << "Required that device support "
                  << "cl_ext_device_fission"
                  << std::endl;
        exit(-1);
    }
    cl_device_partition_property_ext subDeviceProperties[] =
    {
        CL_DEVICE_PARTITION_EQUALLY_EXT,
        1,
        CL_PROPERTIES_LIST_END_EXT,
        0
    };
    devices[0].createSubDevices(subDeviceProperties, &subDevices_);
    if (subDevices_.size() <= 0) {
        std::cout << "Failed to allocate sub-devices" << std::endl;
        exit(-1);
    }
    for (auto i = subDevices_.begin(); i != subDevices_.end(); i++) {
        queues_.push_back(cl::CommandQueue(context_, *i));
    }
    std::cout << "Number of sub-devices "
              << subDevices_.size() << std::endl;
}
static unsigned int atomicAdd(
    unsigned int inc,
    volatile unsigned int *dest)
{
#if defined(_MSC_VER)
    return (unsigned int)_InterlockedExchangeAdd(
        (volatile long *)dest, inc);
#else
    return __sync_fetch_and_add(dest, inc);
#endif
}
bool parallelFor(int range, std::function<void (int i)> f)
{
    size_t args[2];
    args[0] = reinterpret_cast<size_t>(&f);
    int index = 0;
    std::vector<cl::Event> events;
    for (int x = 0; x < range;) {
        int numQueues = (range - x) > static_cast<int>(queues_.size())
            ? static_cast<int>(queues_.size()) : (range - x);
        while (numQueues > 0) {
            cl::Event event;
            args[1] = static_cast<size_t>(index++);
            queues_[numQueues-1].enqueueNativeKernel(
                funcWrapper,
                std::make_pair(
                    static_cast<void *>(args),
                    sizeof(size_t) * 2),
                NULL, NULL, NULL, &event);
            events.push_back(event);
            numQueues--;
            x++;
        }
        cl::Event::waitForEvents(events);
        events.clear();
    }
    return true;
}
};
const unsigned int numNumbers = 1024;
int main(int argc, char** argv)
{
    volatile unsigned int numPrimes = 0;
    int * numbers = new int[numNumbers];
    for (size_t i = 0; i < numNumbers; ++i) { numbers[i] = ::rand(); }
    Parallel parallel;
    try {
        parallel.parallelFor(numNumbers, [numbers, &numPrimes] (int x) {
            auto isPrime = [] (unsigned int n) -> bool {
                if (n < 2) { return false; }
                if (n % 2 == 0) { return (n == 2); }
                for (unsigned int odd = 3;
                     odd <= static_cast<unsigned int>(
                         sqrtf(static_cast<float>(n)));
                     odd += 2) {
                    if (n % odd == 0) { return false; }
                }
                return true;
            };
            if (isPrime(numbers[x])) {
                Parallel::atomicAdd(1, &numPrimes);
            }
        });
    }
    catch (cl::Error err) {
        std::cerr << err.what() << " (" << err.err() << ")" << std::endl;
    }
    std::cout << "Number of primes found = " << numPrimes << std::endl;
    delete [] numbers;
    return 0;
}
Double Precision
Floating point formats were created to allow programmers to work with very large and very small non-integral data values. For many applications, single precision floating point does not provide enough range for the targeted application. Many applications (particularly in science and engineering) require part or all of a particular computation to use double precision. OpenCL does not require that a particular compute device support double precision. For critical applications in which double precision is required, OpenCL provides the optional cl_khr_fp64 extension. This is enabled by including the following directive before any double precision use in an OpenCL C program:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
Once enabled, the double precision support provides access to the following data types:
Type | Description |
---|---|
double | Double precision floating point number |
double2 | 2-component double vector |
double3 | 3-component double vector |
double4 | 4-component double vector |
double8 | 8-component double vector |
double16 | 16-component double vector |
The double type conforms to the IEEE 754 double precision storage format.
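Because double follows the IEEE 754 double precision storage format (1 sign bit, 11 exponent bits, and 52 fraction bits in 64 bits), host and device can exchange values bit-for-bit through a buffer. A quick host-side sanity check, assuming only standard C++ (bitsOf is our own helper name):

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a double's 64-bit IEEE 754 storage representation.
uint64_t bitsOf(double d)
{
    uint64_t u;
    std::memcpy(&u, &d, sizeof u); // well-defined type pun
    return u;
}
```

For example, bitsOf(1.0) yields 0x3FF0000000000000: a zero sign bit, a biased exponent of 1023, and an all-zero fraction.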
There is a one-to-one mapping with the corresponding single precision float types. On an AMD CPU, when the OpenCL device is CL_DEVICE_TYPE_CPU, vector types are mapped directly to SSE and AVX packed registers. In the case in which a vector type is larger than the underlying hardware vector, the resulting implementation is expanded into multiples of the hardware width. All of the conversion rules defined by the OpenCL specification for float types are also defined for doubles. The built-in math functions provided for float have been extended to include appropriate versions that work on double, double2, double3, double4, double8, and double16 as arguments and return values.
As a simple example, Listing 11.2 shows OpenCL C code to implement a block matrix multiply using double precision. Note that the first line of the OpenCL source file enables the extension.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#define AS(i, j) As[(j) * BLOCK_SIZE + (i)]
#define BS(i, j) Bs[(j) * BLOCK_SIZE + (i)]
__kernel void
matrixMul(__global double* C,
          __global double* A,
          __global double* B,
          int wA,
          int wB,
          __local double* As,
          __local double* Bs)
{
int bx = get_group_id(0);
int by = get_group_id(1);
int tx = get_local_id(0);
int ty = get_local_id(1);
// Index of the first sub-matrix of A processed by the block
int aBegin = wA * BLOCK_SIZE * by;
// Index of the last sub-matrix of A processed by the block
int aEnd = aBegin + wA - 1;
// Step size used to iterate through the sub-matrices of A
int aStep = BLOCK_SIZE;
// Index of the first sub-matrix of B processed by the block
int bBegin = BLOCK_SIZE * bx;
// Step size used to iterate through the sub-matrices of B
int bStep = BLOCK_SIZE * wB;
// Csub is used to store the element of the block sub-matrix
// that is computed by the thread
double Csub = 0;
// Loop over all the sub-matrices of A and B
// required to compute the block sub-matrix
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {
// Load the matrices from device memory
// to local memory; each work-item loads
// one element of each matrix
AS(ty, tx) = A[a + wA * ty + tx];
BS(ty, tx) = B[b + wB * ty + tx];
// Synchronize to make sure the matrices are loaded
barrier(CLK_LOCAL_MEM_FENCE);
// Multiply the two matrices together;
// each work-item computes one element
// of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += AS(ty, k) * BS(k, tx);
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
barrier(CLK_LOCAL_MEM_FENCE);
}
// Write the block sub-matrix to device memory;
// each work-item writes one element
int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
C[c + wB * ty + tx] = Csub;
}
The host OpenCL program, written using the OpenCL C++ Wrapper API, is given in Listing 11.3, with the shared header, matrixmul.h, given in Listing 11.4. The example is straightforward, but two points are worth discussing further. First, to avoid unexpected runtime errors, the application must check that the device supports the extension cl_khr_fp64, and this is achieved with the following code:
if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find("cl_khr_fp64")
    == std::string::npos) {
    std::cout << "Required that device support cl_khr_fp64" << std::endl;
    exit(-1);
}
In this case, if the device does not support the extension, then the application simply exits. A more robust solution might fall back to the host to perform the computation if necessary. Second, this example uses the profiling API for command queues to collect information on the time taken to execute the matrix multiply.
#include <CL/cl.hpp> // C++ Wrapper API (no need to include cl.h)
template <size_t X, size_t Y>
class Matrix
{
private:
    cl_double data_[X * Y];
public:
    template <typename Functor>
    Matrix(Functor init) { init(data_, X, Y); }
    Matrix() { ::memset(data_, 0, X * Y * sizeof(cl_double)); }
Matrix<X,Y>& operator -= (const Matrix<X,Y>& rhs)
{
    for (size_t i = 0; i < (X * Y); ++i) {
        data_[i] -= rhs.data_[i];
    }
    return *this;
}
Matrix<X,Y> operator - (const Matrix<X,Y>& rhs)
{
    Matrix<X,Y> result = *this;
    result -= rhs;
    return result;
}
bool operator == (const Matrix<X,Y>& rhs)
{
    for (size_t i = 0; i < (X * Y); ++i) {
        if (data_[i] != rhs.data_[i]) { return false; }
    }
    return true;
}
bool operator != (const Matrix<X,Y>& rhs) { return !(*this == rhs); }
bool compareL2fe(const Matrix<X,Y>& reference, cl_double epsilon) {
    cl_double error = 0.0, ref = 0.0;
    for (size_t i = 0; i < (X * Y); ++i) {
        cl_double diff = reference.data_[i] - data_[i];
        error += diff * diff;
        ref += reference.data_[i] * reference.data_[i];
    }
    cl_double normRef = ::sqrt((double) ref);
    if (::fabs((double) ref) < 1e-7) {
        return false;
    }
    cl_double normError = ::sqrt((double) error);
    error = normError / normRef;
    return error < epsilon;
}
void printOn(std::ostream& out)
{
    for (size_t y = 0; y < Y; ++y) {
        for (size_t x = 0; x < X; ++x) {
            out << data_[y * X + x] << " ";
        }
        out << std::endl;
    }
}
cl_double* data() { return data_; }
size_t size() const { return X * Y * sizeof(cl_double); }
};
void randomInit(cl_double* data, size_t width, size_t height)
{
    for (size_t i = 0; i < width*height; ++i) {
        data[i] = ::rand() / (double) RAND_MAX;
    }
}
void nop(cl_double*, size_t, size_t) {}
void computeGold(
    cl_double* C, const cl_double* A, const cl_double* B,
    cl_uint hA, cl_uint wA, cl_uint wB)
{
    for (cl_uint i = 0; i < hA; ++i) {
        for (cl_uint j = 0; j < wB; ++j) {
            cl_double sum = 0;
            for (cl_uint k = 0; k < wA; ++k) {
                cl_double a = A[i * wA + k];
                cl_double b = B[k * wB + j];
                sum += a * b;
            }
            C[i * wB + j] = (cl_double)sum;
        }
    }
}
int main(int argc, char** argv)
{
cl_device_type devType = CL_DEVICE_TYPE_CPU;
cl_int err, err1, err2, err3;
Matrix<WA,HA> A(&randomInit);
Matrix<WB,HB> B(&randomInit);
Matrix<WC,HC> C(&nop);
// Initialize the OpenCL runtime.
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
cl_context_properties cprops[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0](), 0 };
cl::Context context(devType, cprops);
std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
if (devices.size() == 0) {
    std::cerr << "Device not available\n";
    exit(-1);
}
// Check that double is supported
if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find("cl_khr_fp64")
    == std::string::npos) {
    std::cout << "Required that device support cl_khr_fp64" << std::endl;
    exit(-1);
}
std::ifstream file("matrixmul_kernels.cl");
if (!file.is_open()) {
    std::cerr << "We couldn't load CL source code\n";
    exit(-1);
}
std::string prog(
std::istreambuf_iterator<char>(file), (std::istreambuf_iterator<char>()));
cl::Program::Sources source(
std::make_pair(prog.c_str(),prog.length()+1));
cl::Program program(context, source);
err = program.build(devices, "-I.");
if (err != CL_SUCCESS) {
    std::string str =
        program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
    std::cout << "Program Info: " << str;
    exit(-1);
}
cl::Kernel matrixMul(program, "matrixMul", &err);
if (err != CL_SUCCESS) {
    std::cerr << "Could not create kernel \"matrixMul\"\n";
    exit(-1);
}
cl::Buffer in0(context, CL_MEM_USE_HOST_PTR, A.size(), A.data(), &err1);
cl::Buffer in1(context, CL_MEM_USE_HOST_PTR, B.size(), B.data(), &err2);
cl::Buffer out(context, CL_MEM_ALLOC_HOST_PTR,
WC * HC * sizeof(cl_double), NULL, &err3);
if (err1 != CL_SUCCESS || err2 != CL_SUCCESS || err3 != CL_SUCCESS) {
    std::cerr << "Could not create memory objects\n";
    exit(-1);
}
err = matrixMul.setArg(0, out);
err |= matrixMul.setArg(1, in0);
err |= matrixMul.setArg(2, in1);
err |= matrixMul.setArg(3, WA);
err |= matrixMul.setArg(4, WB);
const size_t localSize = sizeof(double[BLOCK_SIZE][BLOCK_SIZE]);
err |= matrixMul.setArg(5, cl::__local(localSize));
err |= matrixMul.setArg(6, cl::__local(localSize));
if (err != CL_SUCCESS) {
    std::cerr << "Could not set matrixMul's args\n";
    exit(-1);
}
cl_command_queue_properties properties = CL_QUEUE_PROFILING_ENABLE;
cl::CommandQueue queue(context, devices[0], properties, &err);
if (err != CL_SUCCESS) {
    std::cerr << "Could not create the command queue\n";
    exit(-1);
}
cl::Event event;
err1 = queue.enqueueNDRangeKernel(
    matrixMul, cl::NullRange,
    cl::NDRange(WC, HC),
    cl::NDRange(BLOCK_SIZE, BLOCK_SIZE),
    NULL, &event);
err2 = queue.enqueueReadBuffer(out, CL_TRUE, 0, C.size(), C.data());
if (err1 != CL_SUCCESS || err2 != CL_SUCCESS) {
    std::cerr << "matrixMul execution failed\n";
    exit(-1);
}
Matrix<WC, HC> reference(&nop);
computeGold(reference.data(), A.data(), B.data(), HA, WA, WB);
if (!C.compareL2fe(reference, 1e-6f)) {
    Matrix<WC, HC> difference = reference - C;
    difference.printOn(std::cout);
}
if ((properties & CL_QUEUE_PROFILING_ENABLE) != 0) {
    cl_long start = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_long end = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
    std::cout << "Elapsed time: "
              << (double)(end - start) / 1e6
              << " ms" << std::endl;
}
return 0;
}
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA (3 * BLOCK_SIZE) // Matrix A width
#define HA (5 * BLOCK_SIZE) // Matrix A height
#define WB (8 * BLOCK_SIZE) // Matrix B width
#define HB WA // Matrix B height
#define WC WB // Matrix C width
#define HC HA // Matrix C height