A histogram is a statistic that shows the frequency of a certain occurrence within a data set. The histogram of an image provides a frequency distribution of pixel values in the image. If the image is a color image, the pixel value can be the luminosity value of each pixel or the individual R, G, and B color channels. We have either a single histogram if the luminosity is used as the pixel value or three individual histograms, one for each channel, if the R, G, and B color channel values are used. Both types of histograms are useful; the luminosity histogram is more accurate at describing the perceived brightness distribution in an image, whereas the R, G, B color histogram can be a better choice in determining if individual colors are clipped.
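To make the luminosity case concrete, here is a minimal host-side sketch of a luminosity histogram for an RGBA 8-bit image. The Rec. 601 luma weights (0.299 R + 0.587 G + 0.114 B) are an assumption for illustration; other luma definitions are equally valid, and the function name is hypothetical.

```c
#include <stdlib.h>

// Sketch: luminosity histogram for an RGBA 8-bit-per-channel image.
// The Rec. 601 luma weights used here are an assumption; other
// weightings produce slightly different perceived-brightness curves.
unsigned int *
histogram_luminosity_unorm8(const unsigned char *img, int w, int h)
{
    unsigned int *hist = (unsigned int *)calloc(256, sizeof(unsigned int));
    if (!hist)
        return NULL;

    for (int i = 0; i < w * h * 4; i += 4)
    {
        // weighted sum of R, G, B, rounded to the nearest integer
        int luma = (int)(0.299f * img[i] + 0.587f * img[i + 1]
                         + 0.114f * img[i + 2] + 0.5f);
        if (luma > 255)
            luma = 255;
        hist[luma]++;
    }
    return hist;
}
```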
In this chapter, we look at how to implement a histogram for color images with OpenCL.
We look at how to compute the histogram for R, G, and B channel values of a color image. For an RGB or RGBA image with 8 bits per channel, the R, G, and B color channels can have values from 0 to 255.
Listing 14.1 shows how to compute the histogram of R, G, and B channels of an image. This code implements a sequential algorithm; that is, the algorithm loops through the pixels of the image serially to generate the histogram results.
#include <stdlib.h>
#include <string.h>

// This function computes the histogram for R, G, and B.
//
// image_data is a pointer to an RGBA image with 8 bits per channel
// w is the width of the image in pixels
// h is the height of the image in pixels
//
// The histogram is an array of 256 bins for R, G, and B.
// Each bin entry is a 32-bit unsigned integer value.
//
unsigned int *
histogram_rgba_unorm8(void *image_data, int w, int h)
{
    unsigned char *img = (unsigned char *)image_data;
    unsigned int *ref_histogram_results;
    unsigned int *ptr;
    int i;

    // clear the histogram results buffer to zeros.
    //
    // the histogram buffer stores the histogram values for R
    // followed by the histogram values for G and then B.
    // Since there are 256 bins for an 8-bit color channel,
    // the histogram buffer is 256 * 3 entries in size.
    // Each entry is a 32-bit unsigned integer value.
    //
    ref_histogram_results = (unsigned int *)malloc(256 * 3 *
                                                   sizeof(unsigned int));
    ptr = ref_histogram_results;
    memset(ref_histogram_results, 0x0, 256 * 3 * sizeof(unsigned int));

    // compute histogram for R
    for (i = 0; i < w * h * 4; i += 4)
    {
        int indx = img[i];
        ptr[indx]++;
    }
    ptr += 256;

    // compute histogram for G
    for (i = 1; i < w * h * 4; i += 4)
    {
        int indx = img[i];
        ptr[indx]++;
    }
    ptr += 256;

    // compute histogram for B
    for (i = 2; i < w * h * 4; i += 4)
    {
        int indx = img[i];
        ptr[indx]++;
    }

    return ref_histogram_results;
}
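Because each pixel contributes independently to the three channel histograms, the three loops above can also be fused into a single pass over the image, which touches each pixel's cache line only once. A minimal sketch of that variant (the function name is hypothetical):

```c
#include <stdlib.h>

// Sketch: single-pass equivalent of the three-loop histogram above.
// One traversal of the image updates all three channel histograms.
unsigned int *
histogram_rgba_unorm8_fused(const void *image_data, int w, int h)
{
    const unsigned char *img = (const unsigned char *)image_data;
    unsigned int *hist = (unsigned int *)calloc(256 * 3,
                                                sizeof(unsigned int));
    if (!hist)
        return NULL;

    for (int i = 0; i < w * h * 4; i += 4)
    {
        hist[img[i]]++;            // R bins:   0..255
        hist[256 + img[i + 1]]++;  // G bins: 256..511
        hist[512 + img[i + 2]]++;  // B bins: 512..767
    }
    return hist;
}
```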
Let us now look at how to write a parallel implementation of the histogram algorithm. An obvious approach to parallelizing the histogram computation is to break the image into tiles, compute the histogram for each tile, and then combine the partial histograms for the tiles into the final histogram of the image. Listing 14.2 describes the OpenCL kernel that computes the partial histogram for a tile. The partial histogram computed per tile is stored in local memory for performance reasons. The kernel uses the built-in atomic functions described by the OpenCL 1.1 specification to update the per-tile histogram values, so it requires either an OpenCL 1.1 device or an OpenCL 1.0 device that implements the cl_khr_local_int32_base_atomics extension.
//******************************************************************
// This kernel takes an RGBA 8-bit-per-channel input image and
// produces a partial histogram for R, G, and B. Each work-group
// represents an image tile and computes the histogram for that
// tile.
//
// histogram is an array of num_groups * (256 * 3) entries.
// Each entry is a 32-bit unsigned integer value.
//
// We store 256 R bins, followed by 256 G bins, and then the 256
// B bins.
//******************************************************************
kernel void
histogram_partial_image_rgba_unorm8(image2d_t img,
                                    global uint *histogram)
{
    int local_size = (int)get_local_size(0) *
                     (int)get_local_size(1);
    int image_width = get_image_width(img);
    int image_height = get_image_height(img);
    int group_indx = (get_group_id(1) * get_num_groups(0)
                      + get_group_id(0)) * 256 * 3;
    int x = get_global_id(0);
    int y = get_global_id(1);

    local uint tmp_histogram[256 * 3];

    int tid = get_local_id(1) * get_local_size(0)
              + get_local_id(0);
    int j = 256 * 3;
    int indx = 0;

    // clear the local buffer that will generate the partial
    // histogram
    do
    {
        if (tid < j)
            tmp_histogram[indx + tid] = 0;
        j -= local_size;
        indx += local_size;
    } while (j > 0);

    barrier(CLK_LOCAL_MEM_FENCE);

    if ((x < image_width) && (y < image_height))
    {
        float4 clr = read_imagef(img,
                                 CLK_NORMALIZED_COORDS_FALSE |
                                 CLK_ADDRESS_CLAMP_TO_EDGE |
                                 CLK_FILTER_NEAREST,
                                 (float2)(x, y));

        uchar indx_x, indx_y, indx_z;
        indx_x = convert_uchar_sat(clr.x * 255.0f);
        indx_y = convert_uchar_sat(clr.y * 255.0f);
        indx_z = convert_uchar_sat(clr.z * 255.0f);
        atomic_inc(&tmp_histogram[indx_x]);
        atomic_inc(&tmp_histogram[256 + (uint)indx_y]);
        atomic_inc(&tmp_histogram[512 + (uint)indx_z]);
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    // copy the partial histogram to the appropriate location in
    // histogram given by group_indx
    if (local_size >= (256 * 3))
    {
        if (tid < (256 * 3))
            histogram[group_indx + tid] = tmp_histogram[tid];
    }
    else
    {
        j = 256 * 3;
        indx = 0;
        do
        {
            if (tid < j)
                histogram[group_indx + indx + tid] =
                    tmp_histogram[indx + tid];
            j -= local_size;
            indx += local_size;
        } while (j > 0);
    }
}
The histogram_partial_image_rgba_unorm8 kernel produces num_groups partial histograms. We now need to sum these partial histograms to generate the final histogram for the image. Listing 14.3 describes the OpenCL kernel that is used to sum the partial histogram results into the final histogram of the image.
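For reference, the reduction that the summation kernel performs on the device is, on the host, just the following double loop (a sketch with the same bin and layout conventions as the text; the function name is hypothetical):

```c
// Sketch: host-side equivalent of the partial-histogram summation.
// Each of the 256 * 3 final bins is the sum of that bin across all
// num_groups partial histograms, which are stored back to back.
void sum_partial_histograms(const unsigned int *partial_histogram,
                            int num_groups,
                            unsigned int *histogram)
{
    for (int bin = 0; bin < 256 * 3; bin++)
    {
        unsigned int sum = 0;
        for (int g = 0; g < num_groups; g++)
            sum += partial_histogram[g * 256 * 3 + bin];
        histogram[bin] = sum;
    }
}
```

In the OpenCL kernel of Listing 14.3, the outer loop over bins disappears: each of the 256 * 3 work-items owns one bin and performs only the inner loop over groups.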
//******************************************************************
// This kernel sums partial histogram results into a final
// histogram result.
//
// num_groups is the number of work-groups used to compute partial
// histograms.
//
// partial_histogram is an array of num_groups * (256 * 3) entries.
// We store 256 R bins, followed by 256 G bins, and then the 256 B
// bins.
//
// The final summed results are returned in histogram.
//******************************************************************
kernel void
histogram_sum_partial_results_unorm8(
    global uint *partial_histogram,
    int num_groups,
    global uint *histogram)
{
    int tid = (int)get_global_id(0);
    int group_indx;
    int n = num_groups;
    local uint tmp_histogram[256 * 3];

    tmp_histogram[tid] = partial_histogram[tid];

    group_indx = 256 * 3;
    while (--n > 0)
    {
        tmp_histogram[tid] += partial_histogram[group_indx + tid];
        group_indx += 256 * 3;
    }

    histogram[tid] = tmp_histogram[tid];
}
The host-side code that issues the OpenCL API calls used to enqueue the two kernels in Listings 14.2 and 14.3 is provided in Listing 14.4.
int image_width = 1920;
int image_height = 1080;

size_t global_work_size[2];
size_t local_work_size[2];
size_t partial_global_work_size[2];
size_t partial_local_work_size[2];
size_t workgroup_size;
size_t num_groups;
cl_kernel histogram_rgba_unorm8;
cl_kernel histogram_sum_partial_results_unorm8;
size_t gsize[2];

// create kernels
histogram_rgba_unorm8 = clCreateKernel(program,
                            "histogram_partial_image_rgba_unorm8",
                            &err);
histogram_sum_partial_results_unorm8 = clCreateKernel(program,
                            "histogram_sum_partial_results_unorm8",
                            &err);

// get max. work-group size that can be used for the
// histogram_partial_image_rgba_unorm8 kernel
clGetKernelWorkGroupInfo(histogram_rgba_unorm8, device,
                         CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(size_t), &workgroup_size, NULL);
if (workgroup_size <= 256)
{
    gsize[0] = 16;
    gsize[1] = workgroup_size / 16;
}
else if (workgroup_size <= 1024)
{
    gsize[0] = workgroup_size / 16;
    gsize[1] = 16;
}
else
{
    gsize[0] = workgroup_size / 32;
    gsize[1] = 32;
}

local_work_size[0] = gsize[0];
local_work_size[1] = gsize[1];

global_work_size[0] = ((image_width + gsize[0] - 1) / gsize[0]);
global_work_size[1] = ((image_height + gsize[1] - 1) / gsize[1]);
num_groups = global_work_size[0] * global_work_size[1];
global_work_size[0] *= gsize[0];
global_work_size[1] *= gsize[1];

err = clEnqueueNDRangeKernel(queue,
                             histogram_rgba_unorm8,
                             2, NULL, global_work_size, local_work_size,
                             0, NULL, NULL);

// get max. work-group size that can be used for the
// histogram_sum_partial_results_unorm8 kernel
clGetKernelWorkGroupInfo(histogram_sum_partial_results_unorm8,
                         device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(size_t), &workgroup_size, NULL);
if (workgroup_size < 256)
{
    printf("A min. of 256 work-items in a work-group is needed for "
           "the histogram_sum_partial_results_unorm8 kernel. (%d)\n",
           (int)workgroup_size);
    return EXIT_FAILURE;
}

partial_global_work_size[0] = 256 * 3;
partial_local_work_size[0] =
    (workgroup_size > 256) ? 256 : workgroup_size;

err = clEnqueueNDRangeKernel(queue,
                             histogram_sum_partial_results_unorm8,
                             1, NULL, partial_global_work_size,
                             partial_local_work_size, 0, NULL, NULL);
if (err)
{
    printf("clEnqueueNDRangeKernel() failed for the "
           "histogram_sum_partial_results_unorm8 kernel. (%d)\n",
           err);
    return EXIT_FAILURE;
}
Let’s see if additional optimizations are possible with the kernels. One thing we notice is that the histogram_sum_partial_results_unorm8 kernel is bound by memory operations. GPUs hide memory latency by switching to other work-items or work-groups to perform compute operations. In this case there is not much compute, and the total number of work-items (i.e., global_work_size) specified to clEnqueueNDRangeKernel is only (256*3, 1, 1), so it may be hard to hide memory latency. One thing we can do is reduce the amount of data that must be fetched from memory to sum the partial histograms. We can do this by reducing the number of partial histograms and performing more work per work-item in the partial-histogram kernel. It turns out that reducing the number of partial histograms makes the overall histogram computation significantly faster. This optimized version, histogram_partial_rgba_unorm8, is described in Listing 14.5.
//
// This kernel takes an RGBA 8-bit-per-channel input image and
// produces a partial histogram for R, G, and B. Each work-group
// represents an image tile and computes the histogram for that
// tile.
//
// num_pixels_per_workitem is the number of pixels for which the
// histogram is computed by each work-item. In the implementation
// described in Listing 14.2, num_pixels_per_workitem = 1.
//
// partial_histogram is an array of
// num_groups * (256 * 3) entries.
// Each entry is a 32-bit unsigned integer value.
// num_groups is affected by the value of num_pixels_per_workitem.
//
// We store 256 R bins, followed by 256 G bins, and then the 256
// B bins.
//
kernel void
histogram_partial_rgba_unorm8(image2d_t img,
                              int num_pixels_per_workitem,
                              global uint *partial_histogram)
{
    int local_size = (int)get_local_size(0) *
                     (int)get_local_size(1);
    int image_width = get_image_width(img);
    int image_height = get_image_height(img);
    int group_indx = (get_group_id(1) * get_num_groups(0) +
                      get_group_id(0)) * 256 * 3;
    int x = get_global_id(0);
    int y = get_global_id(1);

    local uint tmp_histogram[256 * 3];

    int tid = get_local_id(1) * get_local_size(0) + get_local_id(0);
    int j = 256 * 3;
    int indx = 0;

    // clear the local buffer that will generate the partial
    // histogram
    do
    {
        if (tid < j)
            tmp_histogram[indx + tid] = 0;
        j -= local_size;
        indx += local_size;
    } while (j > 0);

    barrier(CLK_LOCAL_MEM_FENCE);

    int i, idx;
    for (i = 0, idx = x; i < num_pixels_per_workitem;
         i++, idx += get_global_size(0))
    {
        if ((idx < image_width) && (y < image_height))
        {
            float4 clr = read_imagef(img,
                                     (CLK_NORMALIZED_COORDS_FALSE |
                                      CLK_ADDRESS_CLAMP_TO_EDGE |
                                      CLK_FILTER_NEAREST),
                                     (float2)(idx, y));

            uchar indx_x = convert_uchar_sat(clr.x * 255.0f);
            uchar indx_y = convert_uchar_sat(clr.y * 255.0f);
            uchar indx_z = convert_uchar_sat(clr.z * 255.0f);
            atomic_inc(&tmp_histogram[indx_x]);
            atomic_inc(&tmp_histogram[256 + (uint)indx_y]);
            atomic_inc(&tmp_histogram[512 + (uint)indx_z]);
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    // copy the partial histogram to the appropriate location in
    // histogram given by group_indx
    if (local_size >= (256 * 3))
    {
        if (tid < (256 * 3))
            partial_histogram[group_indx + tid] =
                tmp_histogram[tid];
    }
    else
    {
        j = 256 * 3;
        indx = 0;
        do
        {
            if (tid < j)
                partial_histogram[group_indx + indx + tid] =
                    tmp_histogram[indx + tid];
            j -= local_size;
            indx += local_size;
        } while (j > 0);
    }
}
The histogram_sum_partial_results_unorm8 kernel requires no changes and is as described in Listing 14.3.
Listing 14.6 describes how to compute the histogram for an RGBA image with a half-float or float channel. The major difference between computing a histogram for an image with 8 bits per channel versus a half-float or float channel is that the number of bins for a half-float or float channel is 257 instead of 256. This is because floating-point pixel values go from 0.0 to 1.0 inclusive.
//******************************************************************
// This kernel takes an RGBA 32-bit or 16-bit FP-per-channel input
// image and produces a partial histogram for R, G, and B. Each
// work-group represents an image tile and computes the histogram for
// that tile.
//
// partial_histogram is an array of num_groups * (257 * 3) entries.
// Each entry is a 32-bit unsigned integer value.
//
// We store 257 R bins, followed by 257 G bins, and then the 257 B
// bins.
//
//******************************************************************
kernel void
histogram_image_rgba_fp(image2d_t img,
                        int num_pixels_per_workitem,
                        global uint *histogram)
{
    int local_size = (int)get_local_size(0) *
                     (int)get_local_size(1);
    int image_width = get_image_width(img);
    int image_height = get_image_height(img);
    int group_indx = (get_group_id(1) * get_num_groups(0)
                      + get_group_id(0)) * 257 * 3;
    int x = get_global_id(0);
    int y = get_global_id(1);

    local uint tmp_histogram[257 * 3];

    int tid = get_local_id(1) * get_local_size(0)
              + get_local_id(0);
    int j = 257 * 3;
    int indx = 0;

    // clear the local buffer that will generate the partial
    // histogram
    do
    {
        if (tid < j)
            tmp_histogram[indx + tid] = 0;
        j -= local_size;
        indx += local_size;
    } while (j > 0);

    barrier(CLK_LOCAL_MEM_FENCE);

    int i, idx;
    for (i = 0, idx = x; i < num_pixels_per_workitem;
         i++, idx += get_global_size(0))
    {
        if ((idx < image_width) && (y < image_height))
        {
            float4 clr = read_imagef(img,
                                     CLK_NORMALIZED_COORDS_FALSE |
                                     CLK_ADDRESS_CLAMP_TO_EDGE |
                                     CLK_FILTER_NEAREST,
                                     (float2)(idx, y));

            ushort indx;
            indx = convert_ushort_sat(min(clr.x, 1.0f) * 256.0f);
            atomic_inc(&tmp_histogram[indx]);
            indx = convert_ushort_sat(min(clr.y, 1.0f) * 256.0f);
            atomic_inc(&tmp_histogram[257 + indx]);
            indx = convert_ushort_sat(min(clr.z, 1.0f) * 256.0f);
            atomic_inc(&tmp_histogram[514 + indx]);
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    // copy the partial histogram to the appropriate location in
    // histogram given by group_indx
    if (local_size >= (257 * 3))
    {
        if (tid < (257 * 3))
            histogram[group_indx + tid] = tmp_histogram[tid];
    }
    else
    {
        j = 257 * 3;
        indx = 0;
        do
        {
            if (tid < j)
                histogram[group_indx + indx + tid] =
                    tmp_histogram[indx + tid];
            j -= local_size;
            indx += local_size;
        } while (j > 0);
    }
}

//******************************************************************
// This kernel sums partial histogram results into a final histogram
// result.
//
// num_groups is the number of work-groups used to compute partial
// histograms.
//
// partial_histogram is an array of num_groups * (257 * 3) entries.
// We store 257 R bins, followed by 257 G bins, and then the 257 B
// bins.
//
// The final summed results are returned in histogram.
//******************************************************************
kernel void
histogram_sum_partial_results_fp(global uint *partial_histogram,
                                 int num_groups,
                                 global uint *histogram)
{
    int tid = (int)get_global_id(0);
    int group_id = (int)get_group_id(0);
    int group_indx;
    int n = num_groups;
    uint tmp_histogram, tmp_histogram_first;

    int first_workitem_not_in_first_group =
        ((get_local_id(0) == 0) && group_id);

    tid += group_id;
    int tid_first = tid - 1;
    if (first_workitem_not_in_first_group)
        tmp_histogram_first = partial_histogram[tid_first];

    tmp_histogram = partial_histogram[tid];

    group_indx = 257 * 3;
    while (--n > 0)
    {
        if (first_workitem_not_in_first_group)
            tmp_histogram_first += partial_histogram[tid_first];
        tmp_histogram += partial_histogram[group_indx + tid];
        group_indx += 257 * 3;
    }

    if (first_workitem_not_in_first_group)
        histogram[tid_first] = tmp_histogram_first;
    histogram[tid] = tmp_histogram;
}
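The 257-bin mapping used in Listing 14.6 can be sketched in plain C as follows. The clamping and truncation mirror the min(v, 1.0f) and convert_ushort_sat (round-toward-zero) computation in the kernel; the function name is hypothetical. Note that a channel value of exactly 1.0 lands in bin 256, which is why 257 bins are needed.

```c
// Sketch: map a floating-point channel value to one of 257 bins,
// mirroring convert_ushort_sat(min(v, 1.0f) * 256.0f) in Listing 14.6.
unsigned int fp_channel_to_bin(float v)
{
    if (v < 0.0f)
        v = 0.0f;   // saturate below zero, as convert_ushort_sat does
    if (v > 1.0f)
        v = 1.0f;   // min(v, 1.0f)
    return (unsigned int)(v * 256.0f);  // truncate toward zero
}
```

With this mapping, 0.0 falls in bin 0, 0.5 in bin 128, and 1.0 in bin 256, so the full inclusive range [0.0, 1.0] is covered by 257 bins per channel.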
The full source (kernels and host source code) for the histogram is provided in the Chapter_14/histogram directory of the book’s source code examples.