Chapter 8. OpenCL Case Study

Video Processing
This chapter describes a video processor: an application that decodes video, processes it with effects using OpenCL, and displays the result on screen. We cover performance aspects such as how to efficiently load balance, transfer data, and manage the multiple parts of the operation, along with how to measure performance and profile the OpenCL components.
Keywords DMA, event list chaining, OpenGL/OpenCL interop, performance, profiling, video decoding

Introduction

In this chapter, we discuss an interesting application that allows us to explore a range of aspects of OpenCL. The application is a video processor that takes a video stream and performs image processing operations (effects) on its content using OpenCL. The resulting output is a video stream that can be presented to the display in real time. The application can scale to applying multiple effects to a single video stream, or to enhancing and combining multiple input video streams to produce a single output stream.
To accomplish this task, we consider various important techniques, including specialty hardware decoding with the OpenVideo Decode API, different methods to transfer decoded video frames to the compute device, advanced compute dependencies handled with event lists, gathering different performance metrics, and the sharing of buffers between OpenCL and OpenGL to display the video. This final sharing of buffers between APIs we term interoperability.
Performance is a core objective in this example as we examine the different ways to accomplish the task at hand. Tuning the performance for this application can be challenging because there are many aspects that can be made faster. However, the output rate of our video stream is our benchmark, with the goal of playing it back in real time.
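One simple way to gather such timings is OpenCL's built-in event profiling. The following sketch shows the general pattern; the queue, kernel, and frame-dimension names are illustrative placeholders rather than code from the application itself.
// A minimal timing sketch: "context", "device", "kernel", and the frame
// dimensions are assumed to exist already; names are illustrative only.
cl_int err;
cl::CommandQueue queue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

cl::Event evt;
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    cl::NDRange(frameWidth, frameHeight),
    cl::NullRange,
    NULL,       // no explicit event dependencies
    &evt);
evt.wait();

cl_ulong start = evt.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong end   = evt.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << "Effect took " << (end - start) * 1e-6 << " ms" << std::endl;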

Getting Video Frames

Before we discuss the OpenCL example, we first need to fulfill the input requirements of our video processor by providing an input video to it. Here, we address the case in which the input video exists as a compressed file on disk. Each effect takes one frame of video in RGB format as input and produces a processed output frame. To go from the compressed video to our desired RGB format, the file must be decoded.

Decoding on the CPU

Many decoding libraries exist across the various platforms, supporting video decode using the power of the CPU. Assuming one is using a GPU as the OpenCL compute device, which is common in a system in which video display is the final goal, we can choose to decode the video on the CPU and perform the processing of video effects on the GPU. This choice lends itself to a very simple form of heterogeneous computing, shown in Figure 8.1, in which the partitioning of the work falls on the task boundary. One limitation inherent in adopting this style of heterogeneous computing is that performance will be limited by either the slower of the two tasks or the cost of transferring results from one device to another.
Figure 8.1
Processing video by decoding on the CPU and then processing and displaying on the GPU
For this example, we choose to use the libvlc library from the VLC project. This library is not only well established, widely used, and high performance, but it is also available on Linux, Windows, and OS X, which removes dependencies on a particular platform. Working in OpenCL also has the benefit that the open standard has been implemented on Linux, Windows, and OS X. Later, we will see that the OpenCL code for this example is the same on all three platforms (with a minor exception in the display code). The last major component of this example is the display of the output video to the user, and we use OpenGL for this purpose, continuing the use of open standards.
To decode a frame of video, we must first initialize the decoding framework and request that it open the file. The decode framework will then decode a frame of video in the designated format. The framework may have different methods for requesting a frame, the most basic of which is a straightforward blocking call approach that returns control to the caller once the frame is ready. Alternatively, the decoding framework may rely on callback functions that are issued after the decoded frame is ready and support asynchronous decode requests.
Using a callback methodology can provide benefits compared to a blocking call. However, regardless of the method selected, it is important to think about performance and how decoding can affect the time left for processing the video. If a programmer were to write this application as a single thread with a main loop, then the callback methodology would allow for some overlapping of video decoding and other work, only alerting our program when the data was ready. The main loop would follow by performing a minimal amount of work, optimally just copying the decoded frame. If, however, the application uses a blocking method, then the main loop must issue a call to the decoding library and block until the next frame is decoded. This blocking reduces the amount of work that can be carried out in parallel with decoding—a problem particularly acute if disk input or output is required to obtain the video frames.
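As an illustration of the callback approach, the following sketch uses libvlc's video callbacks to receive decoded frames asynchronously. The frameBuffer storage and the onFrameReady() handler are application-side placeholders, not part of libvlc.
#include <vlc/vlc.h>

static unsigned char *frameBuffer;       // storage for one decoded RGB frame
void onFrameReady(unsigned char *frame); // application hook (placeholder)

// libvlc asks for a buffer to decode the next frame into.
static void *lock_cb(void *opaque, void **planes)
{
    *planes = frameBuffer;
    return NULL;                          // picture identifier (unused here)
}

// The frame in frameBuffer is now complete.
static void unlock_cb(void *opaque, void *picture, void *const *planes)
{
    onFrameReady(frameBuffer);            // e.g., hand the frame to the main loop
}

static void display_cb(void *opaque, void *picture) { /* nothing to do */ }

void startDecoding(unsigned width, unsigned height)
{
    frameBuffer = new unsigned char[width * height * 4];

    libvlc_instance_t *vlc = libvlc_new(0, NULL);
    libvlc_media_t *media = libvlc_media_new_path(vlc, "input.mp4");
    libvlc_media_player_t *player = libvlc_media_player_new_from_media(media);

    libvlc_video_set_callbacks(player, lock_cb, unlock_cb, display_cb, NULL);
    libvlc_video_set_format(player, "RV32", width, height, width * 4);
    libvlc_media_player_play(player);     // frames now arrive via the callbacks
}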

Multithreading Decoding

Given that the goal is to maximize performance, even doing minimal work for decoding in the main loop is more expensive than doing none. To remove all decoding work from the main loop, we create worker threads to do it. A worker thread is spawned for each input video to be decoded, and the asynchronous decoding and execution of other work are achieved through application-level threads rather than API callbacks. Instead of blocking, the application focuses on having everything ready for the main loop. To keep the main loop running and not stalled waiting on any worker threads, our example code uses a simple first in, first out (FIFO) buffer of decoded frames to feed the main loop. The decoding helper threads monitor the FIFO and attempt to keep it full, and the size of the FIFO is chosen to be large enough to account for the maximum latency on the decoding side (Figure 8.2); a minimal sketch of such a FIFO follows the figure.
Figure 8.2
Decoding a video using helper threads in addition to the main application thread
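A minimal sketch of such a frame FIFO, built with C++11 threading primitives, is shown below; the DecodedFrame type and the capacity policy are illustrative.
// Bounded FIFO of decoded frames shared between a decoder thread
// (producer) and the main loop (consumer).
#include <condition_variable>
#include <mutex>
#include <queue>

struct DecodedFrame { /* pixel data, dimensions, timestamp, ... */ };

class FrameFifo {
public:
    explicit FrameFifo(size_t capacity) : capacity_(capacity) {}

    void push(DecodedFrame frame)            // called by a decoder thread
    {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this]{ return queue_.size() < capacity_; });
        queue_.push(std::move(frame));
        notEmpty_.notify_one();
    }

    DecodedFrame pop()                        // called by the main loop
    {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this]{ return !queue_.empty(); });
        DecodedFrame frame = std::move(queue_.front());
        queue_.pop();
        notFull_.notify_one();
        return frame;
    }

private:
    std::queue<DecodedFrame> queue_;
    std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
    size_t capacity_;
};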

Decoding Video on the GPU

Many modern GPUs include fixed function hardware that can decode video in certain formats. This feature is designed to extend battery life while watching video by reducing the power consumption of the CPU (or GPU), because specialty decoding hardware is more efficient than general-purpose computation. This hardware is known by many names: AMD refers to this feature as the Unified Video Decoder (UVD), NVIDIA uses the term PureVideo, and similar hardware can also be found in some Intel graphics devices. On Windows, the Microsoft DXVA API is commonly used to access this hardware.
The mechanics of using the video decode hardware are as follows: the compressed video data stream is transferred to the GPU device, where it is decoded, and the decompressed video frame is written into GPU memory (Figure 8.3). To use the decompressed video frame in OpenCL, we need to map the frame into OpenCL's memory space. This is true even if it is already in GPU memory, because different APIs manage memory in different ways. The alternative method would be to copy the decoded frame back from GPU memory to system memory and then upload that frame to the OpenCL device. This latter method is often more expensive due to the multiple copies across system buses that it requires; thus, it is preferable to have the decoded video data directly accessible to OpenCL.
Figure 8.3
Using the GPU to perform processing in parallel with further decoding of frames

OpenVideo Decode

To benefit from the fixed function decoding hardware in this example, while avoiding jumping through hoops to access the memory of the decoded frame, we use the OpenVideo Decode API. This API supports access to the decoded video frames from OpenCL. The API is written in such a way that it operates on a platform-independent context that has been acquired from a separate API domain such as OpenCL.
Before using the OpenVideo Decode API, we need to check the capabilities of the device and the video format to ensure that they are compatible. In OpenCL, this is achieved by checking that the device capabilities include cl_amd_open_video to interoperate with OpenVideo. In OpenVideo, we can check the results of OVDecodeGetDeviceInfo() and OVDecodeGetDeviceCap() to determine what video accelerators are available and what they are capable of handling.
if (!(ret = OVDecodeGetDeviceInfo(&numDevices, 0))) {
    // handle getting device info not succeeded
}
if (numDevices == 0) {
    // handle no devices being available
}
ovdecode_device_info *deviceInfo =
    new ovdecode_device_info[numDevices];
ret = OVDecodeGetDeviceInfo(&numDevices, deviceInfo);
for (unsigned int i = 0; i < numDevices; i++) {
    ovdecode_cap *caps =
        new ovdecode_cap[deviceInfo[i].decode_cap_size];
    ret = OVDecodeGetDeviceCap(
        deviceInfo[i].device_id,
        deviceInfo[i].decode_cap_size,
        caps);
    // examine caps to find a decoder that supports the video's format
}
Next, we start the coordination between OpenCL and OpenVideo by first creating an OpenCL context and then creating a decode session in OpenVideo passing the OpenCL context to the OVDecodeCreateSession() function.
We must create a decode session for each video stream to be decoded. Furthermore, the application must coordinate the memory that is used and shared by OpenCL and OpenVideo. This is achieved by creating memory objects in OpenCL and then turning them into platform-independent handles using clGetPlatformMemHandle(). From now on, both OpenCL and OpenVideo will have handles to this memory and will need to synchronize accesses.
OPContextHandle ovdContext = clCreateContext(
    properties,
    1,
    &clDeviceID,
    0,
    0,
    &err);

ov_session session = OVDecodeCreateSession(
    ovdContext,
    ovDeviceID,
    profile,
    output_Format,
    output_Width,
    output_Height);
With the basic setup described and a link forged between the OpenCL and OpenVideo APIs, we next consider the main decoding loop using OpenVideo's decode functionality. To synchronize the memory accesses, the application first acquires the memory object from OpenVideo Decode using OVDecodeAquireObject(), which enables access such that the decoded video can be written into that memory. For the decode to take place, OVDecodePicture() is used to pass in the compressed video stream, which is fed in from a parsed bitstream, along with the parameters for the decode. Creating the parsed bitstream takes intimate knowledge of the compressed video format; therefore, a third-party library is used for this task. After decoding is complete, the memory object is released from OpenVideo so that the OpenCL runtime can acquire it in its domain and perform memory accesses on the decoded frame. Finally, the memory object can be released when this is complete. This loop of synchronizing memory access, decoding, and accessing the decoded video continues until the video is finished.
The following is the sequence of required commands:
OVresult OVAcquireObject(
    session,
    unsigned int num_handle,
    ov_handle *decode_handle,
    1,
    &pre_event,
    &acquire_event);

res = OVDecodePicture(
    session,
    &picture_parameter,
    &pic_parameter_2,
    pic_parameter_2_size,
    &bitstream_data,
    bitstream_data_read_size,
    slice_data_control,
    slice_data_control_size,
    output_surface,
    1,
    &acquire_event,
    &eventRunVideoProgram,
    0);

ret = OVReleaseObject(
    session,
    unsigned int num_handle,
    ov_handle *decode_handle,
    1, /* num_event_in_wait_list */
    &decode_event,
    &release_event);
status = clEnqueueNDRangeKernel(…);
Although overhead is introduced due to synchronization of the memory object accesses, using the fixed function hardware for video decoding frees up compute resources and reduces the need to pass the decompressed video back and forth. This also reduces the amount of traffic on the bus linking the GPU to the system because only the compressed video needs to be transferred. However, there are limitations to doing decode this way due to the nature of the fixed function decode hardware. The number of video streams that the hardware is designed to decode simultaneously is usually limited. Most high-end systems decode two streams simultaneously (covering picture-in-picture scenarios). Because of this limitation, we must provide a CPU decoding fallback path.
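One possible, purely illustrative, way to structure this fallback is to query at startup how many hardware decode sessions are available and route only that many streams to the fixed-function path; HardwareDecoder and CpuDecoder here are hypothetical application classes.
// hwSessionsSupported would be derived from the OVDecodeGetDeviceCap queries above.
for (size_t i = 0; i < videoStreams.size(); ++i) {
    if (i < hwSessionsSupported)
        decoders.push_back(new HardwareDecoder(videoStreams[i]));  // UVD path
    else
        decoders.push_back(new CpuDecoder(videoStreams[i]));       // CPU (libvlc) path
}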

Processing a Video in OpenCL

Assuming we can access the input data, we next discuss how to process the data in OpenCL. The video processing performed in this example uses a single frame of video as input and a single frame as output. It allows for different parameters to be set to control the effect. The video processing parameters are communicated through setting the arguments for the kernel.
A few features of OpenCL make accessing image data easier. First, vector types such as <type>4 (e.g., char4, int4, and float4) map well to a four-color (red, green, blue, alpha (RGBA)) pixel. There is also support for vectors of length three that map even better to RGB, YUV, etc. pixel data.
Another useful way to access image data from OpenCL is to use OpenCL Images. OpenCL Buffers are intended for general-purpose data stored contiguously in memory, in the same way as arrays in C. Images not only make access to two- and three-dimensional data easier but also provide support for the data to be stored in various formats and add the ability to perform sampling on that data. For example, the image access functions can perform nearest-neighbor or linear filtering when sampling. The format of an image can be queried and set, but once the data is in the image, a call to read_image{f|i|ui}() in an OpenCL kernel can produce the data in the requested output format.
The video processing example uses images as the data format to store the video frames, and we demonstrate this using a simple effect. The chosen effect darkens or lightens the frame in a linear manner (versus using a gamma correction) with a single parameter. The OpenCL kernel has three parameters: the input video frame, the output video frame, and a floating point scalar representing brightness. When the brightness value is set to 1.0f, there is no change in the output frame. A value greater than 1.0f causes the output to brighten, and a value less than 1.0f darkens the output.
__kernel void
brighten(
    read_only  image2d_t in,
    write_only image2d_t out,
    float brightness)
{
    float4 val;
    int2 coordinate = (int2)(get_global_id(0), get_global_id(1));
    const sampler_t sampler =
        CLK_NORMALIZED_COORDS_FALSE |
        CLK_ADDRESS_NONE |
        CLK_FILTER_NEAREST;
    val = read_imagef(in, sampler, coordinate);
    val *= (float4)(brightness, brightness, brightness, 1.0f);
    write_imagef(out, coordinate, val);
}
Although the video frame may be in RGBA format with 8 bits per channel, the brightness parameter is a float to allow for fine adjustments. Because we use OpenCL Images in this example, it is sufficient to call the read_imagef() function to return a float4 for each pixel and apply the brightness correction before writing the output pixel back with write_imagef().
After the input video frame is uploaded to an image, we create the special effect OpenCL kernel and apply it to the input data. To apply the special effect, the application compiles the kernel and sets the kernel's arguments accordingly. In the base case of applying just one filter, the output frame is written to a location that is most convenient for display.
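A host-side sketch of these steps using the C++ Wrapper API might look as follows; the kernel source string, the input and output images, the queue, and the frame dimensions are assumed to already exist.
// Build the effect kernel and apply it to one frame.
cl::Program program(context, cl::Program::Sources(
    1, std::make_pair(source.c_str(), source.length())));
program.build(devices);

cl::Kernel brighten(program, "brighten", &err);
brighten.setArg(0, inputImage);     // read_only  image2d_t in
brighten.setArg(1, outputImage);    // write_only image2d_t out
brighten.setArg(2, 1.2f);           // brightness > 1.0f lightens the frame

queue.enqueueNDRangeKernel(
    brighten,
    cl::NullRange,
    cl::NDRange(frameWidth, frameHeight),   // one work-item per pixel
    cl::NullRange);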

Processing Multiple Videos with Multiple Special Effects

To process multiple special effects, we follow the same set of steps that we followed for processing a single video and single special effect, except that now we chain effects together. One could imagine an intermediate image, output by one effect and used as input to the next. Because each effect must wait for the previous effect to finish before it can start executing, the intermediate buffer can be removed and the video frame buffer used directly instead. There are some issues with this approach if the algorithm depends on the values of other pixels to compute the current pixel (so-called gather algorithms) or when the output pixel is used to alter other output pixels (so-called scatter algorithms). In such cases, intermediate buffers may be necessary. However, to work as efficiently as possible and avoid unneeded copies, the goal is to eliminate intermediate buffers whenever possible.

Event Chaining

The OpenCL command queue model used for scheduling work uses events to capture information about enqueued work. These events allow chains of dependencies to be created such that completion of one command is ensured before the start of the next. In addition, these events can be used by an application to query the status of the queue. Each event is normally created by a call to clEnqueue{*}, and a list of event dependencies can be given to the clEnqueue{*} operations. The runtime will check that each dependency is fulfilled before starting (or, specifically, submitting) the enqueued work. However, because clEnqueue{*} is what creates the event, we cannot place a dependency on a future clEnqueue{*} call, because that event has not yet been created.
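The following sketch illustrates such chaining for two effects applied back to back on one frame; the second kernel (blurKernel) is purely illustrative.
// Chain two effects using event dependencies.
cl::Event effect1Done, effect2Done;
std::vector<cl::Event> waitList;

// First effect: no explicit dependencies.
queue.enqueueNDRangeKernel(
    brightenKernel, cl::NullRange,
    cl::NDRange(frameWidth, frameHeight), cl::NullRange,
    NULL, &effect1Done);

// Second effect: must not start until the first has completed.
waitList.push_back(effect1Done);
queue.enqueueNDRangeKernel(
    blurKernel, cl::NullRange,
    cl::NDRange(frameWidth, frameHeight), cl::NullRange,
    &waitList, &effect2Done);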
Following the example of the single video, in which we enqueued multiple effects with dependencies between them, the same method can be applied to multiple video streams. Because the video processor can have an effect that takes two or more video streams as input, one must ensure that all previously enqueued effects on those streams have finished before the combining effect runs. The OpenCL runtime can handle these ordering dependencies by using the generated events.
The other consideration necessary when working with multiple streams is having intermediate buffers into which processed video frames can be handed off. As in the case of a single video containing multiple effects, we render the output to the video frame buffer of the target stream, thus removing the requirement for intermediate buffers.

Display to Screen of Final Output

At the end of the processing chain, the goal is to present the processed video to the user on the screen. Conceptually, displaying a processed video frame is simply a matter of copying the data to a frame buffer or other data structure that interacts with the display. For portability, our example uses OpenGL: the output video is written into a double-buffered texture, which is rendered by drawing it onto a quad (rectangle) that covers the output window.
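As a sketch, using the fixed-function OpenGL pipeline for brevity, the draw can look as follows; texture is assumed to hold the current output frame, and the final swap call depends on the windowing toolkit in use.
// Draw the processed frame as a full-window textured quad.
glClear(GL_COLOR_BUFFER_BIT);
glBindTexture(GL_TEXTURE_2D, texture);
glBegin(GL_QUADS);
    glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f, -1.0f);
    glTexCoord2f(1.0f, 1.0f); glVertex2f( 1.0f, -1.0f);
    glTexCoord2f(1.0f, 0.0f); glVertex2f( 1.0f,  1.0f);
    glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f,  1.0f);
glEnd();
glutSwapBuffers();   // or the swap call provided by the windowing toolkit in use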
Before OpenGL can draw the textured quad, the output data must be copied (at least logically) from OpenCL into the texture in OpenGL.

OpenCL/OpenGL Interoperability

OpenCL provides an optional extension to share memory objects between OpenCL and OpenGL. OpenCL extensions, described in detail in Chapter 10, allow the addition of features that are beyond the scope of the core specification. Because these features are optional, an application must test for their availability before assuming it can use any particular extension. Using the OpenCL C++ Wrapper API, the available extensions can be queried with the getInfo() method of the cl::Device class, passing the CL_DEVICE_EXTENSIONS parameter. The resulting string can be searched to determine whether OpenGL sharing is supported.
// check that the device supports OpenCL/OpenGL interop
if (devices[0].getInfo<CL_DEVICE_EXTENSIONS>().find(
        "cl_khr_gl_sharing") == std::string::npos) {
    // OpenCL/GL interop is not available
}
Because the sharing model for memory objects is defined in a context, we create a context with properties describing which OpenGL context should be used to support sharing. The method for obtaining an OpenGL context differs across operating systems, and this is where the slight loss of portability mentioned previously arises: the property names and handle types change with the window system. The following code creates an OpenCL context with OpenGL sharing under Windows:
cl_context_properties props[7] =
{
    CL_CONTEXT_PLATFORM,
    (cl_context_properties) amdplatform(),
    CL_GL_CONTEXT_KHR,
    (intptr_t) wglGetCurrentContext(),
    CL_WGL_HDC_KHR,
    (intptr_t) wglGetCurrentDC(),   // HDC handle of the display
    0
};
context = cl::Context(devices, props, NULL, NULL, &err);
if (CL_SUCCESS != err) {
    std::cerr << "Creating context failed with value "
              << err << std::endl;
}
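For comparison, a sketch of the equivalent properties under Linux with GLX (assuming an OpenGL context is current on the calling thread) would be:
// OpenCL context properties for OpenGL sharing under Linux/GLX.
cl_context_properties props[7] =
{
    CL_CONTEXT_PLATFORM,
    (cl_context_properties) amdplatform(),
    CL_GL_CONTEXT_KHR,
    (cl_context_properties) glXGetCurrentContext(),
    CL_GLX_DISPLAY_KHR,
    (cl_context_properties) glXGetCurrentDisplay(),
    0
};
context = cl::Context(devices, props, NULL, NULL, &err);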
Once the OpenCL context with OpenGL sharing is created, the application must create a texture or buffer in OpenGL that will be used to create an image or buffer in OpenCL. Because the display path requires an OpenGL texture, the video processing application uses OpenCL Images.
glGenTextures(1, &texture);
glBindTexture(GL_TEXTURE_2D, texture);
glTexParameteri(
GL_TEXTURE_2D,
GL_TEXTURE_MIN_FILTER,
GL_LINEAR);
glTexParameteri(
GL_TEXTURE_2D,
GL_TEXTURE_MAG_FILTER,
GL_LINEAR);
glTexParameteri(
GL_TEXTURE_2D,
GL_TEXTURE_WRAP_S,
GL_CLAMP);
glTexParameteri(
GL_TEXTURE_2D,
GL_TEXTURE_WRAP_T,
GL_CLAMP);
glTexEnvi(
GL_TEXTURE_ENV,
GL_TEXTURE_ENV_MODE,
GL_DECAL);
glEnable(GL_TEXTURE_2D);
Using the cl::Image2DGL class, we create the OpenCL Image2D memory object based on this texture.
glImageBuffer = cl::Image2DGL(
ctx->getContext(),
CL_MEM_READ_WRITE,
GL_TEXTURE_2D,
0,
texture,
&err);
The resulting image can be used just like any other image, with the exception that synchronization must be performed so that neither OpenCL nor OpenGL accesses stale data (as with the OpenVideo Decode API). This synchronization is performed with the enqueueAcquireGLObjects() and enqueueReleaseGLObjects() methods called on the queue object. These commands have access to the full OpenCL event architecture because they are enqueued in the same way as any other command. When making these synchronization calls, the application needs to ensure that all writes to the memory objects have completed before the call. On the OpenGL side, this class of barrier synchronization is enforced using glFinish(); on the OpenCL side, it is enforced by waiting on the event returned by the command that writes to the memory object. The following code omits checking of return values and the use of events for brevity (instead assuming an in-order queue):
std::vector<cl::Memory> mem_objs;
mem_objs.push_back(glImageBuffer[buffer_idx]);

ret = cq.enqueueAcquireGLObjects(&mem_objs, NULL, NULL);
ret = cq.enqueueWriteImage(
    glImageBuffer[buffer_idx],
    CL_FALSE,
    origin,
    size,
    0,
    0,
    data,
    NULL,
    NULL);
ret = cq.enqueueReleaseGLObjects(&mem_objs, NULL, NULL);

Summary

In this chapter, we discussed the use of the OpenVideo Decode API, as well as interoperability of OpenCL and OpenGL image structures. The OpenVideo Decode API supports efficient decoding of video streams without significantly impacting power usage. Interoperability between OpenGL and OpenCL enables efficient data sharing, avoiding duplication of data structures when they are already present in a given device's memory. This chapter provides yet another opportunity to leverage the power of the OpenCL framework.