A histogram is a statistic that shows the frequency of a certain occurrence within a data set. The histogram of an image provides a frequency distribution of pixel values in the image. If the image is a color image, the pixel value can be the luminosity value of each pixel or the individual R, G, and B color channels. We have either a single histogram if the luminosity is used as the pixel value or three individual histograms, one for each channel, if the R, G, and B color channel values are used. Both types of histograms are useful; the luminosity histogram is more accurate at describing the perceived brightness distribution in an image, whereas the R, G, B color histogram can be a better choice in determining if individual colors are clipped.
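To make the luminosity case concrete, here is a minimal host-side sketch of a luminosity histogram for an RGBA 8-bit image. The Rec. 601 luma weights (0.299 R + 0.587 G + 0.114 B) are an assumption for illustration; other luma definitions are equally valid, and the function name is hypothetical.

```c
#include <stdlib.h>

// Sketch: luminosity histogram for an RGBA 8-bit-per-channel image.
// The Rec. 601 luma weights used here are an assumption; other
// weightings produce slightly different perceived-brightness curves.
unsigned int *
histogram_luminosity_unorm8(const unsigned char *img, int w, int h)
{
    unsigned int *hist = (unsigned int *)calloc(256, sizeof(unsigned int));
    if (!hist)
        return NULL;

    for (int i = 0; i < w * h * 4; i += 4)
    {
        // weighted sum of R, G, B, rounded to the nearest integer
        int luma = (int)(0.299f * img[i] + 0.587f * img[i + 1]
                         + 0.114f * img[i + 2] + 0.5f);
        if (luma > 255)
            luma = 255;
        hist[luma]++;
    }
    return hist;
}
```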
In this chapter, we look at how to implement a histogram for color images with OpenCL.
We look at how to compute the histogram for R, G, and B channel values of a color image. For an RGB or RGBA image with 8 bits per channel, the R, G, and B color channels can have values from 0 to 255.
Listing 14.1 shows how to compute the histogram of R, G, and B channels of an image. This code implements a sequential algorithm; that is, the algorithm loops through the pixels of the image serially to generate the histogram results.
#include <stdlib.h>
#include <string.h>

// This function computes the histogram for R, G, and B.
//
// image_data is a pointer to an RGBA image with 8 bits per channel
// w is the width of the image in pixels
// h is the height of the image in pixels
//
// The histogram is an array of 256 bins for R, G, and B.
// Each bin entry is a 32-bit unsigned integer value.
//
unsigned int *
histogram_rgba_unorm8(void *image_data, int w, int h)
{
    unsigned char *img = (unsigned char *)image_data;
    unsigned int *ref_histogram_results;
    unsigned int *ptr;
    int i;

    // clear the histogram results buffer to zeros.
    //
    // the histogram buffer stores the histogram values for R
    // followed by the histogram values for G and then B.
    // Since there are 256 bins for an 8-bit color channel,
    // the histogram buffer is 256 * 3 entries in size.
    // Each entry is a 32-bit unsigned integer value.
    //
    ref_histogram_results = (unsigned int *)malloc(256 * 3 *
                                                   sizeof(unsigned int));
    ptr = ref_histogram_results;
    memset(ref_histogram_results, 0x0, 256 * 3 * sizeof(unsigned int));

    // compute histogram for R
    for (i = 0; i < w * h * 4; i += 4)
    {
        int indx = img[i];
        ptr[indx]++;
    }
    ptr += 256;

    // compute histogram for G
    for (i = 1; i < w * h * 4; i += 4)
    {
        int indx = img[i];
        ptr[indx]++;
    }
    ptr += 256;

    // compute histogram for B
    for (i = 2; i < w * h * 4; i += 4)
    {
        int indx = img[i];
        ptr[indx]++;
    }

    return ref_histogram_results;
}
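Because each pixel contributes independently to the three channel histograms, the three loops above can also be fused into a single pass over the image, which touches each pixel's cache line only once. A minimal sketch of that variant (the function name is hypothetical):

```c
#include <stdlib.h>

// Sketch: single-pass equivalent of the three-loop histogram above.
// One traversal of the image updates all three channel histograms.
unsigned int *
histogram_rgba_unorm8_fused(const void *image_data, int w, int h)
{
    const unsigned char *img = (const unsigned char *)image_data;
    unsigned int *hist = (unsigned int *)calloc(256 * 3,
                                                sizeof(unsigned int));
    if (!hist)
        return NULL;

    for (int i = 0; i < w * h * 4; i += 4)
    {
        hist[img[i]]++;            // R bins:   0..255
        hist[256 + img[i + 1]]++;  // G bins: 256..511
        hist[512 + img[i + 2]]++;  // B bins: 512..767
    }
    return hist;
}
```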
Let us now look at how to write a parallel implementation of the histogram algorithm. An obvious approach to parallelizing the histogram computation is to break the image into tiles, compute the histogram for each tile, and then combine the partial histograms for the tiles into the final histogram of the image. Listing 14.2 describes the OpenCL kernel that computes the partial histogram for a tile. The partial histogram computed per tile is stored in local memory for performance reasons. The kernel uses the built-in atomic functions described by the OpenCL 1.1 specification to update the per-tile histogram values, so it requires either an OpenCL 1.1 device or an OpenCL 1.0 device that implements the cl_khr_local_int32_base_atomics extension.
//******************************************************************
// This kernel takes an RGBA 8-bit-per-channel input image and
// produces a partial histogram for R, G, and B. Each work-group
// represents an image tile and computes the histogram for that
// tile.
//
// histogram is an array of num_groups * (256 * 3) entries.
// Each entry is a 32-bit unsigned integer value.
//
// We store 256 R bins, followed by 256 G bins, and then the 256
// B bins.
//******************************************************************
kernel void
histogram_partial_image_rgba_unorm8(image2d_t img,
                                    global uint *histogram)
{
    int local_size = (int)get_local_size(0) *
                     (int)get_local_size(1);
    int image_width = get_image_width(img);
    int image_height = get_image_height(img);
    int group_indx = (get_group_id(1) * get_num_groups(0)
                      + get_group_id(0)) * 256 * 3;
    int x = get_global_id(0);
    int y = get_global_id(1);

    local uint tmp_histogram[256 * 3];

    int tid = get_local_id(1) * get_local_size(0)
              + get_local_id(0);
    int j = 256 * 3;
    int indx = 0;

    // clear the local buffer that will generate the partial
    // histogram
    do
    {
        if (tid < j)
            tmp_histogram[indx + tid] = 0;
        j -= local_size;
        indx += local_size;
    } while (j > 0);

    barrier(CLK_LOCAL_MEM_FENCE);

    if ((x < image_width) && (y < image_height))
    {
        float4 clr = read_imagef(img,
                                 CLK_NORMALIZED_COORDS_FALSE |
                                 CLK_ADDRESS_CLAMP_TO_EDGE |
                                 CLK_FILTER_NEAREST,
                                 (float2)(x, y));

        uchar indx_x, indx_y, indx_z;
        indx_x = convert_uchar_sat(clr.x * 255.0f);
        indx_y = convert_uchar_sat(clr.y * 255.0f);
        indx_z = convert_uchar_sat(clr.z * 255.0f);
        atomic_inc(&tmp_histogram[indx_x]);
        atomic_inc(&tmp_histogram[256 + (uint)indx_y]);
        atomic_inc(&tmp_histogram[512 + (uint)indx_z]);
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    // copy the partial histogram to the appropriate location in
    // histogram given by group_indx
    if (local_size >= (256 * 3))
    {
        if (tid < (256 * 3))
            histogram[group_indx + tid] = tmp_histogram[tid];
    }
    else
    {
        j = 256 * 3;
        indx = 0;
        do
        {
            if (tid < j)
                histogram[group_indx + indx + tid] =
                    tmp_histogram[indx + tid];
            j -= local_size;
            indx += local_size;
        } while (j > 0);
    }
}
The histogram_partial_image_rgba_unorm8 kernel produces num_groups partial histograms. We now need to sum these partial histograms to generate the final histogram for the image. Listing 14.3 describes the OpenCL kernel that is used to sum the partial histogram results into the final histogram of the image.
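For reference, the reduction that the summation kernel performs on the device is, on the host, just the following double loop (a sketch with the same bin and layout conventions as the text; the function name is hypothetical):

```c
// Sketch: host-side equivalent of the partial-histogram summation.
// Each of the 256 * 3 final bins is the sum of that bin across all
// num_groups partial histograms, which are stored back to back.
void sum_partial_histograms(const unsigned int *partial_histogram,
                            int num_groups,
                            unsigned int *histogram)
{
    for (int bin = 0; bin < 256 * 3; bin++)
    {
        unsigned int sum = 0;
        for (int g = 0; g < num_groups; g++)
            sum += partial_histogram[g * 256 * 3 + bin];
        histogram[bin] = sum;
    }
}
```

In the OpenCL kernel of Listing 14.3, the outer loop over bins disappears: each of the 256 * 3 work-items owns one bin and performs only the inner loop over groups.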
//******************************************************************
// This kernel sums partial histogram results into a final
// histogram result.
//
// num_groups is the number of work-groups used to compute partial
// histograms.
//
// partial_histogram is an array of num_groups * (256 * 3) entries.
// We store 256 R bins, followed by 256 G bins, and then the 256 B
// bins.
//
// The final summed results are returned in histogram.
//******************************************************************
kernel void
histogram_sum_partial_results_unorm8(
    global uint *partial_histogram,
    int num_groups,
    global uint *histogram)
{
    int tid = (int)get_global_id(0);
    int group_indx;
    int n = num_groups;
    local uint tmp_histogram[256 * 3];

    tmp_histogram[tid] = partial_histogram[tid];

    group_indx = 256 * 3;
    while (--n > 0)
    {
        tmp_histogram[tid] += partial_histogram[group_indx + tid];
        group_indx += 256 * 3;
    }

    histogram[tid] = tmp_histogram[tid];
}
The host-side code that issues the OpenCL API calls used to enqueue the two kernels in Listings 14.2 and 14.3 is provided in Listing 14.4.
int image_width = 1920;
int image_height = 1080;

size_t global_work_size[2];
size_t local_work_size[2];
size_t partial_global_work_size[2];
size_t partial_local_work_size[2];
size_t workgroup_size;
size_t num_groups;
cl_kernel histogram_rgba_unorm8;
cl_kernel histogram_sum_partial_results_unorm8;
size_t gsize[2];

// create kernels
histogram_rgba_unorm8 = clCreateKernel(program,
                            "histogram_partial_image_rgba_unorm8",
                            &err);
histogram_sum_partial_results_unorm8 = clCreateKernel(program,
                            "histogram_sum_partial_results_unorm8",
                            &err);

// get max. work-group size that can be used for the
// histogram_partial_image_rgba_unorm8 kernel
clGetKernelWorkGroupInfo(histogram_rgba_unorm8, device,
                         CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(size_t), &workgroup_size, NULL);
if (workgroup_size <= 256)
{
    gsize[0] = 16;
    gsize[1] = workgroup_size / 16;
}
else if (workgroup_size <= 1024)
{
    gsize[0] = workgroup_size / 16;
    gsize[1] = 16;
}
else
{
    gsize[0] = workgroup_size / 32;
    gsize[1] = 32;
}

local_work_size[0] = gsize[0];
local_work_size[1] = gsize[1];

global_work_size[0] = ((image_width + gsize[0] - 1) / gsize[0]);
global_work_size[1] = ((image_height + gsize[1] - 1) / gsize[1]);
num_groups = global_work_size[0] * global_work_size[1];
global_work_size[0] *= gsize[0];
global_work_size[1] *= gsize[1];

err = clEnqueueNDRangeKernel(queue,
                             histogram_rgba_unorm8,
                             2, NULL, global_work_size, local_work_size,
                             0, NULL, NULL);

// get max. work-group size that can be used for the
// histogram_sum_partial_results_unorm8 kernel
clGetKernelWorkGroupInfo(histogram_sum_partial_results_unorm8,
                         device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(size_t), &workgroup_size, NULL);
if (workgroup_size < 256)
{
    printf("A min. of 256 work-items in a work-group is needed for "
           "the histogram_sum_partial_results_unorm8 kernel. (%d)\n",
           (int)workgroup_size);
    return EXIT_FAILURE;
}

partial_global_work_size[0] = 256 * 3;
partial_local_work_size[0] =
    (workgroup_size > 256) ? 256 : workgroup_size;

err = clEnqueueNDRangeKernel(queue,
                             histogram_sum_partial_results_unorm8,
                             1, NULL, partial_global_work_size,
                             partial_local_work_size, 0, NULL, NULL);
if (err)
{
    printf("clEnqueueNDRangeKernel() failed for the "
           "histogram_sum_partial_results_unorm8 kernel. (%d)\n",
           err);
    return EXIT_FAILURE;
}
Let’s see if additional optimizations are possible with the kernels. One thing we notice is that the histogram_sum_partial_results_unorm8 kernel is bound by memory operations. GPUs hide memory latency by switching to other work-items or work-groups to perform compute operations. In this case there is not much compute, and the total number of work-items (i.e., global_work_size) specified to clEnqueueNDRangeKernel is only (256*3, 1, 1), so it may be hard to hide memory latency. One thing we can do is reduce the amount of data that must be fetched from memory to sum the partial histograms. We can do this by reducing the number of partial histograms and performing more work per work-item in the partial-histogram kernel. It turns out that reducing the number of partial histograms makes the overall histogram computation significantly faster. This optimized version, histogram_partial_rgba_unorm8, is described in Listing 14.5.
//
// This kernel takes an RGBA 8-bit-per-channel input image and
// produces a partial histogram for R, G, and B. Each work-group
// represents an image tile and computes the histogram for that
// tile.
//
// num_pixels_per_workitem is the number of pixels for which the
// histogram is computed by each work-item. In the implementation
// described in Listing 14.2, num_pixels_per_workitem = 1.
//
// partial_histogram is an array of
// num_groups * (256 * 3) entries.
// Each entry is a 32-bit unsigned integer value.
// num_groups is affected by the value of num_pixels_per_workitem.
//
// We store 256 R bins, followed by 256 G bins, and then the 256
// B bins.
//
kernel void
histogram_partial_rgba_unorm8(image2d_t img,
                              int num_pixels_per_workitem,
                              global uint *partial_histogram)
{
    int local_size = (int)get_local_size(0) *
                     (int)get_local_size(1);
    int image_width = get_image_width(img);
    int image_height = get_image_height(img);
    int group_indx = (get_group_id(1) * get_num_groups(0) +
                      get_group_id(0)) * 256 * 3;
    int x = get_global_id(0);
    int y = get_global_id(1);

    local uint tmp_histogram[256 * 3];

    int tid = get_local_id(1) * get_local_size(0) + get_local_id(0);
    int j = 256 * 3;
    int indx = 0;

    // clear the local buffer that will generate the partial
    // histogram
    do
    {
        if (tid < j)
            tmp_histogram[indx + tid] = 0;
        j -= local_size;
        indx += local_size;
    } while (j > 0);

    barrier(CLK_LOCAL_MEM_FENCE);

    int i, idx;
    for (i = 0, idx = x; i < num_pixels_per_workitem;
         i++, idx += get_global_size(0))
    {
        if ((idx < image_width) && (y < image_height))
        {
            float4 clr = read_imagef(img,
                                     (CLK_NORMALIZED_COORDS_FALSE |
                                      CLK_ADDRESS_CLAMP_TO_EDGE |
                                      CLK_FILTER_NEAREST),
                                     (float2)(idx, y));

            uchar indx_x = convert_uchar_sat(clr.x * 255.0f);
            uchar indx_y = convert_uchar_sat(clr.y * 255.0f);
            uchar indx_z = convert_uchar_sat(clr.z * 255.0f);
            atomic_inc(&tmp_histogram[indx_x]);
            atomic_inc(&tmp_histogram[256 + (uint)indx_y]);
            atomic_inc(&tmp_histogram[512 + (uint)indx_z]);
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    // copy the partial histogram to the appropriate location in
    // histogram given by group_indx
    if (local_size >= (256 * 3))
    {
        if (tid < (256 * 3))
            partial_histogram[group_indx + tid] =
                tmp_histogram[tid];
    }
    else
    {
        j = 256 * 3;
        indx = 0;
        do
        {
            if (tid < j)
                partial_histogram[group_indx + indx + tid] =
                    tmp_histogram[indx + tid];
            j -= local_size;
            indx += local_size;
        } while (j > 0);
    }
}
The histogram_sum_partial_results_unorm8 kernel requires no changes and is as described in Listing 14.3.
Listing 14.6 describes how to compute the histogram for an RGBA image with a half-float or float channel. The major difference between computing a histogram for an image with 8 bits per channel versus a half-float or float channel is that the number of bins for a half-float or float channel is 257 instead of 256. This is because floating-point pixel values go from 0.0 to 1.0 inclusive.
//******************************************************************
// This kernel takes an RGBA 32-bit or 16-bit FP-per-channel input
// image and produces a partial histogram for R, G, and B. Each
// work-group represents an image tile and computes the histogram for
// that tile.
//
// partial_histogram is an array of num_groups * (257 * 3) entries.
// Each entry is a 32-bit unsigned integer value.
//
// We store 257 R bins, followed by 257 G bins, and then the 257 B
// bins.
//
//******************************************************************
kernel void
histogram_image_rgba_fp(image2d_t img,
                        int num_pixels_per_workitem,
                        global uint *histogram)
{
    int local_size = (int)get_local_size(0) *
                     (int)get_local_size(1);
    int image_width = get_image_width(img);
    int image_height = get_image_height(img);
    int group_indx = (get_group_id(1) * get_num_groups(0)
                      + get_group_id(0)) * 257 * 3;
    int x = get_global_id(0);
    int y = get_global_id(1);

    local uint tmp_histogram[257 * 3];

    int tid = get_local_id(1) * get_local_size(0)
              + get_local_id(0);
    int j = 257 * 3;
    int indx = 0;

    // clear the local buffer that will generate the partial
    // histogram
    do
    {
        if (tid < j)
            tmp_histogram[indx + tid] = 0;
        j -= local_size;
        indx += local_size;
    } while (j > 0);

    barrier(CLK_LOCAL_MEM_FENCE);

    int i, idx;
    for (i = 0, idx = x; i < num_pixels_per_workitem;
         i++, idx += get_global_size(0))
    {
        if ((idx < image_width) && (y < image_height))
        {
            float4 clr = read_imagef(img,
                                     CLK_NORMALIZED_COORDS_FALSE |
                                     CLK_ADDRESS_CLAMP_TO_EDGE |
                                     CLK_FILTER_NEAREST,
                                     (float2)(idx, y));

            ushort indx;
            indx = convert_ushort_sat(min(clr.x, 1.0f) * 256.0f);
            atomic_inc(&tmp_histogram[indx]);
            indx = convert_ushort_sat(min(clr.y, 1.0f) * 256.0f);
            atomic_inc(&tmp_histogram[257 + indx]);
            indx = convert_ushort_sat(min(clr.z, 1.0f) * 256.0f);
            atomic_inc(&tmp_histogram[514 + indx]);
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    // copy the partial histogram to the appropriate location in
    // histogram given by group_indx
    if (local_size >= (257 * 3))
    {
        if (tid < (257 * 3))
            histogram[group_indx + tid] = tmp_histogram[tid];
    }
    else
    {
        j = 257 * 3;
        indx = 0;
        do
        {
            if (tid < j)
                histogram[group_indx + indx + tid] =
                    tmp_histogram[indx + tid];
            j -= local_size;
            indx += local_size;
        } while (j > 0);
    }
}

//******************************************************************
// This kernel sums partial histogram results into a final histogram
// result.
//
// num_groups is the number of work-groups used to compute partial
// histograms.
//
// partial_histogram is an array of num_groups * (257 * 3) entries.
// We store 257 R bins, followed by 257 G bins, and then the 257 B
// bins.
//
// The final summed results are returned in histogram.
//******************************************************************
kernel void
histogram_sum_partial_results_fp(global uint *partial_histogram,
                                 int num_groups,
                                 global uint *histogram)
{
    int tid = (int)get_global_id(0);
    int group_id = (int)get_group_id(0);
    int group_indx;
    int n = num_groups;
    uint tmp_histogram, tmp_histogram_first;

    int first_workitem_not_in_first_group =
        ((get_local_id(0) == 0) && group_id);

    tid += group_id;
    int tid_first = tid - 1;
    if (first_workitem_not_in_first_group)
        tmp_histogram_first = partial_histogram[tid_first];

    tmp_histogram = partial_histogram[tid];

    group_indx = 257 * 3;
    while (--n > 0)
    {
        if (first_workitem_not_in_first_group)
            tmp_histogram_first += partial_histogram[tid_first];
        tmp_histogram += partial_histogram[group_indx + tid];
        group_indx += 257 * 3;
    }

    if (first_workitem_not_in_first_group)
        histogram[tid_first] = tmp_histogram_first;
    histogram[tid] = tmp_histogram;
}
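The 257-bin mapping used in Listing 14.6 can be sketched in plain C as follows. The clamping and truncation mirror the min(v, 1.0f) and convert_ushort_sat (round-toward-zero) computation in the kernel; the function name is hypothetical. Note that a channel value of exactly 1.0 lands in bin 256, which is why 257 bins are needed.

```c
// Sketch: map a floating-point channel value to one of 257 bins,
// mirroring convert_ushort_sat(min(v, 1.0f) * 256.0f) in Listing 14.6.
unsigned int fp_channel_to_bin(float v)
{
    if (v < 0.0f)
        v = 0.0f;   // saturate below zero, as convert_ushort_sat does
    if (v > 1.0f)
        v = 1.0f;   // min(v, 1.0f)
    return (unsigned int)(v * 256.0f);  // truncate toward zero
}
```

With this mapping, 0.0 falls in bin 0, 0.5 in bin 128, and 1.0 in bin 256, so the full inclusive range [0.0, 1.0] is covered by 257 bins per channel.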
The full source (kernels and host source code) for the histogram is provided in the Chapter_14/histogram directory of the book’s source code examples.