Chapter 12. Image Analysis

Overview

In the preceding chapter, we learned about the image transformations that OpenCV makes available to us. These transformations were essentially mappings that converted an input image into an output image such that the output remained, essentially, a picture, just like the input. In this chapter we will consider operations that render images into some potentially entirely different representation.

These new representations will usually still be arrays of values, but those values might be quite different in meaning than the intensity values in the input image. For example, the first function we will consider is the discrete Fourier transform, in which the output “image,” though still an array, contains a frequency representation of the input image. In a few cases, the result of the transformation will be something like a list of components, and not an array at all, as would be the case for the Hough line transform.

Finally, we will learn about image segmentation methods that can be used to represent an image in terms of meaningfully connected regions.

Discrete Fourier Transform

For any set of values that are indexed by a discrete (integer) parameter, it is possible to define a discrete Fourier transform (DFT)1 in a manner analogous to the Fourier transform of a continuous function. For N complex numbers x0, x1, x2, ..., xN–1, the one-dimensional DFT is defined by the following formula (where $i = \sqrt{-1}$):

$$f_k = \sum_{n=0}^{N-1} x_n \, e^{-\frac{2\pi i}{N} k n}, \qquad k = 0, 1, \ldots, N-1$$

A similar transform can be defined for a two-dimensional array of numbers (of course, higher-dimensional analogs exist also):

$$f_{k_x, k_y} = \sum_{n_x=0}^{N_x-1} \sum_{n_y=0}^{N_y-1} x_{n_x, n_y} \, e^{-2\pi i \left( \frac{k_x n_x}{N_x} + \frac{k_y n_y}{N_y} \right)}$$

In general, one might expect that the computation of the N different terms fk would require O(N²) operations. In fact, there are several fast Fourier transform (FFT) algorithms capable of computing these values in O(N log N) time.

cv::dft(): The Discrete Fourier Transform

The OpenCV function cv::dft() implements one such FFT algorithm. The cv::dft() function can compute FFTs for one- and two-dimensional arrays of inputs. In the latter case, the two-dimensional transform can be computed or, if desired, only the one-dimensional transforms of each individual row can be computed (this operation is much faster than calling cv::dft() several times):

void cv::dft(
  cv::InputArray    src,                        // Input array (real or complex)
  cv::OutputArray   dst,                        // Output array
  int               flags       = 0,            // for inverse, or other options
  int               nonzeroRows = 0             // number of rows to not ignore
);

The input array must be of floating-point type and may be single- or double-channel. In the single-channel case, the entries are assumed to be real numbers, and the output will be packed in a special space-saving format called complex conjugate symmetrical (CCS).2 If the source array is a two-channel matrix or image, then the two channels will be interpreted as the real and imaginary components of the input data. In this case, there will be no special packing of the results, and some space will be wasted with a lot of 0s in both the input and output arrays.3

The special packing of result values that is used with single-channel CCS output is as follows.

For a one-dimensional array of N real inputs, the N output values are packed as:

$$\left[\, \operatorname{Re} Y_0,\ \operatorname{Re} Y_1,\ \operatorname{Im} Y_1,\ \operatorname{Re} Y_2,\ \operatorname{Im} Y_2,\ \ldots,\ \operatorname{Re} Y_{N/2-1},\ \operatorname{Im} Y_{N/2-1},\ \operatorname{Re} Y_{N/2} \,\right]$$

For a two-dimensional array, the packing follows the same idea: most of the array stores interleaved real and imaginary parts of the spectrum, while the first column, and the last column when Nx is even, are themselves packed in the one-dimensional CCS manner; the complete layout table is given in the OpenCV reference documentation for cv::dft().

It is worth taking a moment to look closely at the indices of this packing. Certain values in the array are guaranteed to be 0 (more accurately, certain values of fk are guaranteed to be real). Also note that the last row of the two-dimensional packing will be present only if Ny is even, and that the last column will be present only if Nx is even. In the case of the two-dimensional array being treated as Ny separate one-dimensional arrays rather than a full two-dimensional transform (we’ll take a look at how to do this), all of the result rows will be packed in the same way as the single one-dimensional output just shown.

The third argument, called flags, indicates exactly what operation is to be done. As usual, flags is treated as a bit array, so you can combine any flags you need with Boolean OR. The transformation we started with is known as a forward transform and is selected by default. The inverse transform4 is defined in exactly the same way except for a change of sign in the exponential and a scale factor. To perform the inverse transform without the scale factor, use the flag cv::DFT_INVERSE. The flag for the scale factor is cv::DFT_SCALE, which results in all of the output being scaled by a factor of 1/N (or 1/(NxNy) for a two-dimensional transform). This scaling is necessary if the sequential application of the forward transform and the inverse transform is to bring us back to where we started. Because one often wants to combine cv::DFT_INVERSE with cv::DFT_SCALE, there are several shorthand notations for this kind of operation. In addition to just combining the two operations, you can use cv::DFT_INV_SCALE (or cv::DFT_INVERSE_SCALE if you’re not into brevity). The last flag you may want to have handy is cv::DFT_ROWS, which allows you to tell cv::dft() to treat a two-dimensional array as a collection of one-dimensional arrays that should each be transformed separately as if they were Ny distinct vectors of length Nx. This can significantly reduce overhead when you’re doing many transformations at a time. By using cv::DFT_ROWS, you can also implement three-dimensional (and higher) DFT.

Though the default behavior of the forward transform is to produce results in CCS format (which results in an output array exactly the same size as the input array), you can explicitly ask OpenCV to not do this with the flag cv::DFT_COMPLEX_OUTPUT. The result will be the full complex array (with all of the zeros in it). Conversely, when you’re performing an inverse transformation on a complex array, the result is normally also a complex array. If the source array had complex conjugate symmetry,5 you can ask OpenCV to produce a purely real array (which will be smaller than the input array) by passing the cv::DFT_REAL_OUTPUT flag.
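As a quick illustration (a sketch of ours, not taken from the OpenCV reference), the following assumes a single-channel CV_32F image and shows how these flags combine: a forward transform with full complex output, followed by an inverse transform that requests a scaled, purely real result.

#include <opencv2/opencv.hpp>

void dft_round_trip( const cv::Mat& img ) {      // img: single-channel CV_32F
  cv::Mat spectrum, restored;

  // Forward transform; two-channel complex output instead of CCS packing
  cv::dft( img, spectrum, cv::DFT_COMPLEX_OUTPUT );

  // Inverse transform; request a real result and apply the 1/N scale factor,
  // so 'restored' matches 'img' up to floating-point error
  cv::dft( spectrum, restored,
           cv::DFT_INVERSE | cv::DFT_SCALE | cv::DFT_REAL_OUTPUT );
}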

In order to understand the last argument, nonzero_rows, we must digress for a moment to explain that, in general, DFT algorithms strongly prefer input vectors of some lengths over input vectors of other lengths; similarly, for arrays of some sizes over arrays of other sizes. In most DFT algorithms, the preferred sizes are powers of 2 (i.e., 2^n for some integer n). In the case of the algorithm used by OpenCV, the preference is that the vector lengths, or array dimensions, be 2^p·3^q·5^r for some integers p, q, and r. Hence the usual procedure is to create a somewhat larger array and then to copy your array into that somewhat roomier zero-padded array. For convenience, there is a handy utility function, cv::getOptimalDFTSize(), which takes the (integer) length of your vector and returns the first equal or larger size that can be expressed in the form given (i.e., 2^p·3^q·5^r). Despite the need for this padding, it is possible to indicate to cv::dft() that you really do not care about the transform of those rows that you had to add down below your actual data (or, if you are doing an inverse transform, which rows in the result you do not care about). In either case, you can use nonzero_rows to indicate how many rows contain meaningful data. This will provide some savings in computation time.
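The padding procedure just described might look like the following sketch (ours; the array name A and the use of cv::copyMakeBorder() are illustrative). Note how the nonzeroRows argument tells cv::dft() that only the first A.rows rows carry meaningful data.

#include <opencv2/opencv.hpp>

cv::Mat padded_dft( const cv::Mat& A ) {          // A: single-channel CV_32F
  int M = cv::getOptimalDFTSize( A.rows );        // next "good" 2^p3^q5^r sizes
  int N = cv::getOptimalDFTSize( A.cols );

  cv::Mat padded;
  cv::copyMakeBorder( A, padded, 0, M - A.rows, 0, N - A.cols,
                      cv::BORDER_CONSTANT, cv::Scalar::all(0) );

  cv::Mat spectrum;
  cv::dft( padded, spectrum, 0, A.rows );         // only A.rows rows are nonzero
  return spectrum;                                // CCS-packed, same size as padded
}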

cv::idft(): The Inverse Discrete Fourier Transform

As we saw earlier, the function cv::dft() can be made to implement not only the discrete Fourier transform, but also the inverse operation (with the provision of the correct flags argument). It is often preferable, if only for code readability, to have a separate function that does this inverse operation by default.

void cv::idft(
  cv::InputArray  src,                          // Input array (real or complex)
  cv::OutputArray dst,                          // Output array
  int             flags       = 0,              // for variations
  int             nonzeroRows = 0               // number of rows to not ignore
);

Calling cv::idft() is exactly equivalent to calling cv::dft() with the cv::DFT_INVERSE flag (in addition to whatever flags you supply to cv::idft(), of course).

cv::mulSpectrums(): Spectrum Multiplication

In many applications that involve computing DFTs, one must also compute the per-element multiplication of the two resulting spectra. Because such results are complex numbers, typically packed in their special high-density CCS format, it would be tedious to unpack them and handle the multiplication via the “usual” matrix operations. Fortunately, OpenCV provides the handy cv::mulSpectrums() routine, which performs exactly this function for us:

void cv::mulSpectrums(
  cv::InputArray  src1,                         // Input array (ccs or complex)
  cv::InputArray  src2,                         // Input array (ccs or complex)
  cv::OutputArray dst,                          // Result array
  int             flags,                        // for row-by-row computation
  bool            conj = false                  // true to conjugate src2
);

Note that the first two arguments are arrays, which must be either CCS-packed single-channel spectra or two-channel complex spectra—as you would get from calls to cv::dft(). The third argument is the destination array, which will be of the same size and type as the source arrays. The final argument, conj, tells cv::mulSpectrums() exactly what you want done. In particular, it may be set to false for a straightforward per-element multiplication of the two spectra, or set to true if the element from the first array is to be multiplied by the complex conjugate of the corresponding element of the second array.6

Convolution Using Discrete Fourier Transforms

It is possible to greatly increase the speed of a convolution by using DFT via the convolution theorem [Titchmarsh26] that relates convolution in the spatial domain to multiplication in the Fourier domain [Morse53; Bracewell65; Arfken85].7 To accomplish this, we first compute the Fourier transform of the image and then the Fourier transform of the convolution filter. Once this is done, we can perform the convolution in the transform space in linear time with respect to the number of pixels in the image. It is worthwhile to look at the source code for computing such a convolution, as it will also provide us with many good examples of using cv::dft(). The code is shown in Example 12-1, which is taken directly from the OpenCV reference.

Example 12-1. Using cv::dft() and cv::idft() to accelerate the computation of convolutions
#include <opencv2/opencv.hpp>
#include <iostream>

using namespace std;

int main(int argc, char** argv) {

  if(argc != 2) {
    cout << "Fourier Transform
Usage: " <<argv[0] <<" <imagename>" << endl;
    return -1;
  }

  cv::Mat A = cv::imread(argv[1],0);

  if( A.empty() ) { cout << "Cannot load " << argv[1] << endl; return -1; }

  cv::Size patchSize( 100, 100 );
  cv::Point topleft( A.cols/2, A.rows/2 );
  cv::Rect roi( topleft.x, topleft.y, patchSize.width, patchSize.height );
  cv::Mat B = A( roi );

  int dft_M = cv::getOptimalDFTSize( A.rows+B.rows-1 );
  int dft_N = cv::getOptimalDFTSize( A.cols+B.cols-1 );

  cv::Mat dft_A = cv::Mat::zeros( dft_M, dft_N, CV_32F );
  cv::Mat dft_B = cv::Mat::zeros( dft_M, dft_N, CV_32F );

  cv::Mat dft_A_part = dft_A( cv::Rect(0, 0, A.cols, A.rows) );
  cv::Mat dft_B_part = dft_B( cv::Rect(0, 0, B.cols, B.rows) );

  A.convertTo( dft_A_part, dft_A_part.type(), 1, -cv::mean(A)[0] );
  B.convertTo( dft_B_part, dft_B_part.type(), 1, -cv::mean(B)[0] );

  cv::dft( dft_A, dft_A, 0, A.rows );
  cv::dft( dft_B, dft_B, 0, B.rows );

  // set the last parameter to false to compute convolution instead of correlation
  //
  cv::mulSpectrums( dft_A, dft_B, dft_A, 0, true );
  cv::idft( dft_A, dft_A, cv::DFT_SCALE, A.rows + B.rows - 1 );

  cv::Mat corr = dft_A( cv::Rect(0, 0, A.cols + B.cols - 1, A.rows + B.rows - 1) );
  cv::normalize( corr, corr, 0, 1, cv::NORM_MINMAX, corr.type() );
  cv::pow( corr, 3., corr );

  B ^= cv::Scalar::all( 255 );

  cv::imshow( "Image", A );
  cv::imshow( "Correlation", corr );
  cv::waitKey();

  return 0;

}

In Example 12-1, we can see that the input arrays are first created and then initialized. Next, two new arrays are created whose dimensions are optimal for the DFT algorithm. The original arrays are copied into these new arrays and the transforms are then computed. Finally, the spectra are multiplied together and the inverse transform is applied to the product. The transforms are the slowest8 part of this operation; an N × N image takes O(N² log N) time and so the entire process is also completed in that time (assuming that N > M for an M × M convolution kernel). This time is much faster than O(N²M²), the non-DFT convolution time required by the more naïve method.

cv::dct(): The Discrete Cosine Transform

For real-valued data, it is often sufficient to compute what is, in effect, only half of the discrete Fourier transform. The discrete cosine transform (DCT) [Ahmed74; Jain77] is defined analogously to the full DFT by the following formula:

$$c_k = \sqrt{\frac{\alpha_k}{N}} \sum_{n=0}^{N-1} x_n \cos\!\left( \frac{\pi (2n+1) k}{2N} \right), \qquad \alpha_0 = 1,\quad \alpha_k = 2 \ \text{for}\ k > 0$$

Of course, there is a similar transform for higher dimensions. Note that, by convention, the normalization factor is applied to both the cosine transform and its inverse (which is not the convention for the discrete Fourier transform).

The basic ideas of the DFT apply also to the DCT, but now all the coefficients are real-valued.9 The actual OpenCV call is:

void cv::dct(
  cv::InputArray  src,                          // Input array (even size)
  cv::OutputArray dst,                          // Output array
  int             flags = 0                     // for row-by-row or inverse
);

The cv::dct() function expects arguments like those for cv::dft() except that, because the results are real-valued, there is no need for any special packing of the result array (or of the input array in the case of an inverse transform). Unlike cv::dft(), however, the input array must have an even number of elements (you can pad the last element with a zero if necessary to achieve this). The flags argument can be set to cv::DCT_INVERSE to generate the inverse transformation, and may be combined with cv::DCT_ROWS with the same effect as with cv::dft(). Because of the different normalization convention, both the forward and inverse cosine transforms always contain their respective contribution to the overall normalization of the transform; hence, there is no analog to cv::DFT_SCALE for cv::dct().

As with cv::dft(), performance strongly depends on array size. In fact, deep down, the implementation of cv::dct() actually calls cv::dft() on an array exactly half the size of your input array. For this reason, the optimal size of an array to pass to cv::dct() is exactly double the size of the optimal array you would pass to cv::dft(). Putting everything together, the best way to get an optimal size for cv::dct() is to compute:

size_t optimal_dct_size = 2 * cv::getOptimalDFTSize( (N+1)/2 );

where N is the actual size of your data that you want to transform.
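Putting this together, a small sketch (ours; the function name and the single-row layout are illustrative) of padding a signal to a DCT-friendly size and then running a forward and inverse transform might look like this:

#include <opencv2/opencv.hpp>

void dct_round_trip( const cv::Mat& signal ) {    // signal: 1-by-N row, CV_32F
  int N = signal.cols;
  int optimal = 2 * cv::getOptimalDFTSize( (N + 1) / 2 );

  cv::Mat padded;
  cv::copyMakeBorder( signal, padded, 0, 0, 0, optimal - N,
                      cv::BORDER_CONSTANT, cv::Scalar::all(0) );

  cv::Mat coeffs, restored;
  cv::dct( padded, coeffs );                      // forward transform
  cv::dct( coeffs, restored, cv::DCT_INVERSE );   // inverse carries its own scaling
}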

cv::idct(): The Inverse Discrete Cosine Transform

Just as with cv::idft() and cv::dft(), cv::dct() can be asked to compute the inverse cosine transform using the flags argument. As before, code readability is often improved with the use of a separate function that does this inverse operation by default:

void cv::idct(
  cv::InputArray  src,                          // Input array
  cv::OutputArray dst,                          // Output array
  int             flags = 0                       // for row-by-row computation
);

Calling cv::idct() is exactly equivalent to calling cv::dct() with the cv::DCT_INVERSE flag (in addition to any other flags you supply to cv::idct()). 

Integral Images

OpenCV allows you to easily calculate an integral image with the appropriately named cv::integral() function. An integral image [Viola04] is a data structure that allows rapid summing of subregions.10 Such summations are useful in many applications; a notable one is the computation of Haar wavelets, which are used in some face recognition and similar algorithms.

OpenCV supports three variations of the integral image. They are the sum, the square-sum, and the tilted-sum. In each case the resulting image is the same size as the original image, plus one in each direction.

A standard integral image sum has the form:

$$\mathrm{sum}(x, y) = \sum_{x' < x} \sum_{y' < y} I(x', y')$$

The square-sum image is the sum of squares:

$$\mathrm{sum}_{\mathrm{square}}(x, y) = \sum_{x' < x} \sum_{y' < y} \bigl[ I(x', y') \bigr]^2$$

The tilted-sum is like the sum except that it is for the image rotated by 45 degrees:

$$\mathrm{sum}_{\mathrm{tilted}}(x, y) = \sum_{y' < y} \; \sum_{\lvert x' - x \rvert < y - y'} I(x', y')$$

Using these integral images, you may calculate sums, means, and standard deviations over arbitrary upright or “tilted” rectangular regions of the image. As a simple example, to sum over a simple rectangular region described by the corner points (x1, y1) and (x2, y2), where x2 > x1 and y2 > y1, we’d compute:

$$\sum_{x_1 \le x < x_2} \; \sum_{y_1 \le y < y_2} I(x, y) = \mathrm{sum}(x_2, y_2) - \mathrm{sum}(x_1, y_2) - \mathrm{sum}(x_2, y_1) + \mathrm{sum}(x_1, y_1)$$

In this way, it is possible to do fast blurring, approximate gradients, compute means and standard deviations, and perform fast block correlations even for variable window sizes.

To make this all a little clearer, consider the 7 × 5 image shown in Figure 12-1; the region is shown as a bar chart in which the height associated with the pixels represents the brightness of those pixel values. The same information is shown in Figure 12-2, numerically on the left and in integral form on the right. We compute standard summation integral images I′(x, y) by going across rows, proceeding row by row using the previously computed integral image values together with the current raw image pixel value I(x, y) to calculate the next integral image value as follows:

$$I'(x, y) = I(x, y) + I'(x - 1, y) + I'(x, y - 1) - I'(x - 1, y - 1)$$

The last term is subtracted because this value is double-counted when the second and third terms are added. You can verify that this works by testing some values in Figure 12-2.

Figure 12-1. A simple 7 × 5 image shown as a bar chart with x, y, and height equal to pixel value

When using the integral image to compute a region, we can see in Figure 12-2 that, in order to compute the central rectangular area bounded by the 20s in the original image, we’d calculate 398 – 9 – 10 + 1 = 380. Thus, a rectangle of any size can be computed using four measurements, resulting in O(1) computational complexity.

Figure 12-2. The 7 × 5 image of Figure 12-1 shown numerically at left (with the origin assumed to be the upper left) and converted to an 8 × 6 integral image at right

cv::integral() for Standard Summation Integral

The different forms of integral are (somewhat confusingly) distinguished in the C++ API only by their arguments. The form that computes the basic sum takes only three arguments.

void cv::integral(
  cv::InputArray  image,                        // Input array
  cv::OutputArray sum,                          // Output sum results
  int             sdepth = -1                   // Results depth (e.g., CV_32F)
);

The first and second are the input and result images. If the input image is of size W × H, then the output image will be of size (W + 1) × (H + 1).11 The third argument, sdepth, specifies the desired depth of the sum (destination) image. sdepth can be any of CV_32S, CV_32F, or CV_64F.12
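As a brief sketch of the rectangle-sum trick described earlier (our example; the helper name rect_sum() is illustrative), a single integral image lets us sum any upright rectangle with just four lookups:

#include <opencv2/opencv.hpp>

double rect_sum( const cv::Mat& img, cv::Rect r ) {  // img: 8-bit, single channel
  cv::Mat sum;
  cv::integral( img, sum, CV_64F );               // sum is (W+1)-by-(H+1), CV_64F

  // Corners of r expressed in integral-image coordinates
  int x1 = r.x, y1 = r.y, x2 = r.x + r.width, y2 = r.y + r.height;

  return sum.at<double>(y2, x2) - sum.at<double>(y1, x2)
       - sum.at<double>(y2, x1) + sum.at<double>(y1, x1);
}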

cv::integral() for Squared Summation Integral

The squared sum is computed with the same function as the regular sum, except for the provision of an additional output argument for the squared sum.

void cv::integral(
  cv::InputArray  image,                        // Input array
  cv::OutputArray sum,                          // Output sum results
  cv::OutputArray sqsum,                        // Output sum of squares results
  int             sdepth = -1                   // Results depth (e.g., CV_32F)
);

The cv::OutputArray argument sqsum indicates to cv::integral() that the square sum should be computed in addition to the regular sum. As before, sdepth specifies the desired depth of the resulting images. sdepth can be any of CV_32S, CV_32F, or CV_64F.

cv::integral() for Tilted Summation Integral

Similar to the squared sum, the tilted sum integral is essentially the same function, with an additional argument for the additional result.

void cv::integral(
  cv::InputArray  image,                        // Input array
  cv::OutputArray sum,                          // Output sum results
  cv::OutputArray sqsum,                        // Output sum of squares results
  cv::OutputArray tilted,                       // Output tilted sum results
  int             sdepth = -1                   // Results depth (e.g., CV_32F)
);

The additional cv::OutputArray argument tilted is computed by this form of cv::integral(), in addition to the other sums; thus, all of the other arguments are the same.

The Canny Edge Detector

Though it is possible to expose edges in images with simple filters such as the Laplace filter, it is possible to improve on this method substantially. The simple Laplace filter method was refined by J. Canny in 1986 into what is now commonly called the Canny edge detector [Canny86]. One of the differences between the Canny algorithm and the simpler, Laplace-based algorithm described in the previous chapter is that, in the Canny algorithm, the first derivatives are computed in x and y and then combined into four directional derivatives. The points where these directional derivatives are local maxima are then candidates for assembling into edges. The most significant new dimension to the Canny algorithm is this phase in which it assembles the individual edge-candidate pixels into contours.13

The algorithm forms these contours by applying a hysteresis threshold to the pixels. This means that there are two thresholds, an upper and a lower. If a pixel has a gradient larger than the upper threshold, then it is accepted as an edge pixel; if a pixel is below the lower threshold, it is rejected. If the pixel’s gradient is between the thresholds, then it will be accepted only if it is connected to a pixel that is above the high threshold. Canny recommended a ratio of high:low threshold between 2:1 and 3:1. Figures 12-3 and 12-4 show the results of applying cv::Canny() to a test pattern and a photograph using high:low hysteresis threshold ratios of 5:1 and 3:2, respectively.

Figure 12-3. Results of Canny edge detection for two different images when the high and low thresholds are set to 50 and 10, respectively
Figure 12-4. Results of Canny edge detection for two different images when the high and low thresholds are set to 150 and 100, respectively

cv::Canny()

The OpenCV implementation of the Canny edge detection algorithm converts an input image into an “edge image.”

void cv::Canny(
  cv::InputArray  image,                        // Input single channel image
  cv::OutputArray edges,                        // Output edge image
  double          threshold1,                   // "lower" threshold
  double          threshold2,                   // "upper" threshold
  int             apertureSize = 3,             // Sobel aperture
  bool            L2gradient   = false          // true=L2-norm (more accurate)
);

The cv::Canny() function expects an input image, which must be single-channel, and an output image, which will also be grayscale (but which will actually be a Boolean image). The next two arguments are the low and high thresholds. The next-to-last argument, apertureSize, is the aperture used by the Sobel derivative operators that are called inside of the implementation of cv::Canny(). The final argument, L2gradient, is used to select between computing the directional gradient “correctly” using the proper L2-norm, or using a faster, less accurate L1-norm-based method. If the argument L2gradient is set to true, the more accurate form is used:

$$\lvert \nabla I \rvert = \sqrt{ \left( \frac{\partial I}{\partial x} \right)^{\!2} + \left( \frac{\partial I}{\partial y} \right)^{\!2} }$$

If L2gradient is set to false, the faster form is used:

$$\lvert \nabla I \rvert = \left| \frac{\partial I}{\partial x} \right| + \left| \frac{\partial I}{\partial y} \right|$$
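A typical call might look like the following sketch (ours; the threshold values are illustrative), using roughly the 3:1 high-to-low ratio that Canny recommended together with the more accurate L2-norm gradient:

#include <opencv2/opencv.hpp>

cv::Mat canny_edges( const cv::Mat& gray ) {      // gray: 8-bit, single channel
  cv::Mat edges;
  cv::Canny( gray, edges, 50, 150, 3, true );     // low=50, high=150, L2 gradient
  return edges;                                   // nonzero on edge pixels
}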

Hough Transforms

The Hough transform14 is a method for finding lines, circles, or other simple forms in an image. The original Hough transform was a line transform, which is a relatively fast way of searching a binary image for straight lines. The transform can be further generalized to cases other than just simple lines.

Hough Line Transform

The basic theory of the Hough line transform is that any point in a binary image could be part of some set of possible lines. If we parameterize each line by, for example, a slope a and an intercept b, then a point in the original image is transformed to a locus of points in the (a, b) plane corresponding to all of the lines passing through that point, of which it could potentially be a part (see Figure 12-5). If we convert every nonzero pixel in the input image into such a set of points in the output image and sum over all such contributions, then lines that appear in the input (i.e., (x, y) plane) image will appear as local maxima in the output (i.e., (a, b) plane) image. Because we are summing the contributions from each point, the (a, b) plane is commonly called the accumulator plane.

Figure 12-5. The Hough line transform finds many lines in each image; some of the lines found are expected, but others may not be

It might occur to you that the slope-intercept form is not really the best way to represent all the lines passing through a point (i.e., because of the considerably different density of lines as a function of the slope, and the related fact that the interval of possible slopes goes from –∞ to +∞). This is why the actual parameterization of the transform image used in numerical computation is somewhat different. The preferred parameterization represents each line as a point in polar coordinates (ρ, θ), with the implied line being the line passing through the indicated point but perpendicular to the radial from the origin to that point (see Figure 12-6). The equation for such a line is:

$$\rho = x \cos\theta + y \sin\theta$$

Figure 12-6. A point (x0, y0) in the image plane (a) implies many lines, each parameterized by a different ρ and θ (b); these lines in the (ρ, θ) plane, taken together, form a curve of characteristic shape (c)

The OpenCV Hough transform algorithm does not make this computation explicit to the user. Instead, it simply returns the local maxima in the (ρ, θ) plane. However, you will need to understand this process in order to understand the arguments to the OpenCV Hough line transform function.

OpenCV supports three different kinds of Hough line transform: the standard Hough transform (SHT) [Duda72], the multiscale Hough transform (MHT), and the progressive probabilistic Hough transform (PPHT).15 The SHT is the algorithm we just covered. The MHT algorithm is a slight refinement that gives more accurate values for the matched lines. The PPHT is a variation of this algorithm that, among other things, computes an extent for individual lines in addition to the orientation (as shown in Figure 12-7). It is called “probabilistic” because, rather than accumulating every possible point in the accumulator plane, it accumulates only a fraction of them. The idea is that if the peak is going to be high enough anyhow, then hitting it only a fraction of the time will be enough to find it; the result of this conjecture can be a substantial reduction in computation time.

Figure 12-7. The Canny edge detector (param1=50, param2=150) is run first, with the results shown in gray, and the progressive probabilistic Hough transform (param1=50, param2=10) is run next, with the results overlaid in white; you can see that the strong lines are generally picked up by the Hough transform

cv::HoughLines(): The standard and multiscale Hough transforms

The standard and multiscale Hough transforms are both implemented in a single function—cv::HoughLines()—with the distinction being in the use (or nonuse) of two optional parameters.

void cv::HoughLines(
  cv::InputArray  image,                  // Input single channel image
  cv::OutputArray lines,                  // N-by-1 two-channel array
  double          rho,                    // rho resolution (pixels)
  double          theta,                  // theta resolution (radians)
  int             threshold,              // Unnormalized accumulator threshold
  double          srn      = 0,           // rho refinement (for MHT)
  double          stn      = 0            // theta refinement (for MHT)
);

The first argument is the input image. It must be an 8-bit image, but the input is treated as binary information (i.e., all nonzero pixels are considered to be equivalent). The second argument is the place where the found lines will be stored. It will be an N × 1 two-channel array of floating-point type (the number of rows, N, will be the number of lines returned).16 The two channels will contain the rho (ρ) and theta (θ) values for each found line.

The next two arguments, rho and theta, set the resolution desired for the lines (i.e., the resolution of the accumulator plane). The units of rho are pixels and the units of theta are radians; thus, the accumulator plane can be thought of as a two-dimensional histogram with cells of dimension rho pixels by theta radians. The threshold value is the value in the accumulator plane that must be reached for the algorithm to report a line. This last argument is a bit tricky in practice; it is not normalized, so you should expect to scale it up with the image size for SHT. Remember that this argument is, in effect, indicating the number of points (in the edge image) that must support the line for the line to be returned.

The parameters srn and stn are not used by the standard Hough transform; they control an extension of the SHT algorithm called the multiscale Hough transform (MHT). For MHT, these two parameters indicate higher resolutions to which the parameters for the lines should be computed. MHT first computes the locations of the lines to the accuracy given by the rho and theta parameters, and then goes on to refine those results by a factor of srn and stn, respectively (i.e., the final resolution in rho is rho divided by srn, and the final resolution in theta is theta divided by stn). Leaving these parameters set to 0 causes the SHT algorithm to be run.
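The following sketch (ours; the bin sizes, accumulator threshold, and drawing logic are all illustrative) runs the SHT on an edge image and draws each returned (ρ, θ) pair back onto a color copy for inspection:

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

void draw_hough_lines( const cv::Mat& edges, cv::Mat& color ) {
  std::vector<cv::Vec2f> lines;
  cv::HoughLines( edges, lines, 1, CV_PI/180, 150 );   // 1-pixel, 1-degree bins

  for( size_t i = 0; i < lines.size(); ++i ) {
    float rho = lines[i][0], theta = lines[i][1];
    double a = std::cos( theta ), b = std::sin( theta );
    cv::Point2d p0( a*rho, b*rho );                    // closest point to origin
    cv::Point p1( cvRound( p0.x + 1000*(-b) ), cvRound( p0.y + 1000*( a) ) );
    cv::Point p2( cvRound( p0.x - 1000*(-b) ), cvRound( p0.y - 1000*( a) ) );
    cv::line( color, p1, p2, cv::Scalar(0,0,255), 2 ); // draw the infinite line
  }
}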

cv::HoughLinesP(): The progressive probabilistic Hough transform

void cv::HoughLinesP(
  cv::InputArray  image,                  // Input single channel image
  cv::OutputArray lines,                  // N-by-1 4-channel array
  double          rho,                    // rho resolution (pixels)
  double          theta,                  // theta resolution (radians)
  int             threshold,              // Unnormalized accumulator threshold
  double          minLineLength = 0,      // required line length
  double          maxLineGap    = 0       // required line separation
);

The cv::HoughLinesP() function works very much like cv::HoughLines(), with two important differences. The first is that the lines argument will be a four-channel array (or a vector of objects all of type Vec4i). The four channels will be (x0, y0) and (x1, y1), in that order: the (x, y) locations of the two endpoints of the found line segment. The second important difference is the meaning of the two final parameters. For the PPHT, the minLineLength and maxLineGap arguments set, respectively, the minimum length of a line segment that will be returned, and the separation between collinear segments required for the algorithm not to join them into a single longer segment.
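A companion sketch (ours, with illustrative parameter values) for the probabilistic variant; each returned cv::Vec4i holds the two endpoints of one detected segment:

#include <opencv2/opencv.hpp>
#include <vector>

void draw_hough_segments( const cv::Mat& edges, cv::Mat& color ) {
  std::vector<cv::Vec4i> segments;
  cv::HoughLinesP( edges, segments, 1, CV_PI/180, 50,
                   30,     // minLineLength: drop segments shorter than this
                   10 );   // maxLineGap: join collinear segments closer than this

  for( size_t i = 0; i < segments.size(); ++i ) {
    const cv::Vec4i& s = segments[i];
    cv::line( color, cv::Point( s[0], s[1] ), cv::Point( s[2], s[3] ),
              cv::Scalar(0,255,0), 2 );
  }
}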

Hough Circle Transform

The Hough circle transform [Kimme75] (see Figure 12-8) works in a manner roughly analogous to the Hough line transforms just described. The reason it is only “roughly” is that—if you were to try doing the exactly analogous thing—the accumulator plane would have to be replaced with an accumulator volume with three dimensions: one for x and one for y (the location of the circle center), and another for the circle radius r. This would mean far greater memory requirements and much slower speed. The implementation of the circle transform in OpenCV avoids this problem by using a somewhat trickier method called the Hough gradient method.

Figure 12-8. The Hough circle transform finds some of the circles in the test pattern and (correctly) finds none in the photograph

The Hough gradient method works as follows. First, the image is passed through an edge-detection phase (in this case, cv::Canny()). Next, for every nonzero point in the edge image, the local gradient is considered (we compute the gradient by first computing the first-order Sobel x- and y-derivatives via cv::Sobel()). Using this gradient, we increment every point along the line indicated by this slope—from a specified minimum to a specified maximum distance—in the accumulator. At the same time, the location of every one of these nonzero pixels in the edge image is noted. The candidate centers are then selected from those points in this (two-dimensional) accumulator that are both above some given threshold and larger than all of their immediate neighbors. These candidate centers are sorted in descending order of their accumulator values, so that the centers with the most supporting pixels appear first. Next, for each center, all of the nonzero pixels (recall that this list was built earlier) are considered. These pixels are sorted according to their distance from the center. Working out from the smallest distances to the maximum radius, we select a single radius that is best supported by the nonzero pixels. A center is kept if it has sufficient support from the nonzero pixels in the edge image and if it is a sufficient distance from any previously selected center.

This implementation enables the algorithm to run much faster and, perhaps more importantly, helps overcome the problem of the otherwise sparse population of a three-dimensional accumulator, which would lead to a lot of noise and render the results unstable. On the other hand, this algorithm has several shortcomings that you should be aware of.

First, the use of the Sobel derivatives to compute the local gradient—and the attendant assumption that this can be considered equivalent to a local tangent—is not a numerically stable proposition. It might be true “most of the time,” but you should expect this to generate some noise in the output.

Second, the entire set of nonzero pixels in the edge image is considered for every candidate center; hence, if you make the accumulator threshold too low, the algorithm will take a long time to run. Third, because only one circle is selected for every center, if there are concentric circles then you will get only one of them.

Finally, because centers are considered in descending order of their associated accumulator value and because new centers are not kept if they are too close to previously accepted centers, there is a bias toward keeping the larger circles when multiple circles are concentric or approximately concentric. (It is only a “bias” because of the noise arising from the Sobel derivatives; in a smooth image at infinite resolution, it would be a certainty.)

cv::HoughCircles(): the Hough circle transform

The Hough circle transform function cv::HoughCircles() has similar arguments to the line transform.

void cv::HoughCircles(
  cv::InputArray  image,                  // Input single channel image
  cv::OutputArray circles,                // N-by-1 3-channel or vector of Vec3f
  int             method,                 // Always cv::HOUGH_GRADIENT
  double          dp,                     // Accumulator resolution (ratio)
  double          minDist,                // Required separation (between lines)
  double          param1    = 100,        // Upper Canny threshold
  double          param2    = 100,        // Unnormalized accumulator threshold
  int             minRadius = 0,          // Smallest radius to consider
  int             maxRadius = 0           // Largest radius to consider
);

The input image is again an 8-bit image. One significant difference between cv::HoughCircles() and cv::HoughLines() is that the latter requires a binary image. The cv::HoughCircles() function will internally (automatically) call cv::Sobel()17 for you, so you can provide a more general grayscale image.

The result array, circles, will be either a matrix-array or a vector, depending on what you pass to cv::HoughCircles(). If a matrix is used, it will be a one-dimensional array of type CV_32FC3; the three channels will be used to encode the location of the circle and its radius. If a vector is used, it must be of type std::vector<Vec3f>. The method argument must always be set to cv::HOUGH_GRADIENT.

The parameter dp is the resolution of the accumulator image used. This parameter allows us to create an accumulator of a lower resolution than the input image. It makes sense to do this because there is no reason to expect the circles that exist in the image to fall naturally into the same number of bins as the width or height of the image itself. If dp is set to 1, then the resolutions will be the same; if set to a larger number (e.g., 2), then the accumulator resolution will be smaller by that factor (in this case, half). The value of dp cannot be less than 1.

The parameter minDist is the minimum distance that must exist between two circles in order for the algorithm to consider them distinct circles.

For the method set to cv::HOUGH_GRADIENT, the next two arguments, param1 and param2, are the edge (Canny) threshold and the accumulator threshold, respectively. You may recall that the Canny edge detector actually takes two different thresholds itself. When cv::Canny() is called internally, the first (higher) threshold is set to the value of param1 passed into cv::HoughCircles(), and the second (lower) threshold is set to exactly half that value. The parameter param2 is the one used to threshold the accumulator and is exactly analogous to the threshold argument of cv::HoughLines().

The final two parameters are the minimum and maximum radius of circles that can be found. This means that these are the radii of circles for which the accumulator has a representation. Example 12-2 shows an example program using cv::HoughCircles().

Example 12-2. Using cv::HoughCircles() to return a sequence of circles found in a grayscale image
#include <opencv2/opencv.hpp>
#include <iostream>
#include <math.h>

using namespace cv;
using namespace std;

int main(int argc, char** argv) {

  if(argc != 2) {
    cout << "Hough Circle detect
Usage: " <<argv[0] <<" <imagename>
" << endl;
    return -1;
  }

  cv::Mat src, image;

  src  = cv::imread( argv[1], 1 );
  if( src.empty() ) { cout << "Cannot load " << argv[1] << endl; return -1; }

  cv::cvtColor(src, image, cv::COLOR_BGR2GRAY);
  cv::GaussianBlur(image, image, Size(5,5), 0, 0);

  vector<cv::Vec3f> circles;
  cv::HoughCircles(image, circles, cv::HOUGH_GRADIENT, 2, image.cols/10);

  for( size_t i = 0; i < circles.size(); ++i ) {
    cv::circle(
      src,
      cv::Point(cvRound(circles[i][0]), cvRound(circles[i][1])),
      cvRound(circles[i][2]),
      cv::Scalar(0,0,255),
      2,
      cv::LINE_AA
    );
  }

  cv::imshow( "Hough Circles", src);
  cv::waitKey(0);

  return 0;

}

It is worth reflecting momentarily on the fact that, no matter what tricks we employ, there is no getting around the requirement that circles be described by three degrees of freedom (x, y, and r), in contrast to only two degrees of freedom (ρ and θ) for lines. The result will invariably be that any circle-finding algorithm requires more memory and computation time than the line-finding algorithms we looked at previously. With this in mind, it’s a good idea to bound the radius parameter as tightly as circumstances allow in order to keep these costs under control.18 The Hough transform was extended to arbitrary shapes by Ballard in 1981 [Ballard81] basically by considering objects as collections of gradient edges.

Distance Transformation

The distance transform of an image is defined as a new image in which every output pixel is set to a value equal to the distance to the nearest zero pixel in the input image—according to some specific distance metric. It should be immediately obvious that the typical input to a distance transform should be some kind of edge image. In most applications the input to the distance transform is an output of an edge detector such as the Canny edge detector that has been inverted (so that the edges have value 0 and the nonedges are nonzero).

There are two methods available to compute the distance transform. The first method uses a mask that is typically a 3 × 3 or 5 × 5 array. Each point in the array defines the “distance” to be associated with a point in that particular position relative to the center of the mask. Larger distances are built up (and thus approximated) as sequences of “moves” defined by the entries in the mask. This means that using a larger mask will yield more accurate distances. When we use this method, given a specific distance metric, the appropriate mask is automatically selected from a set known to OpenCV. This is the “original” method developed by Borgefors (1986) [Borgefors86]. The second method computes exact distances, and is due to Felzenszwalb [Felzenszwalb04]. Both methods run in time linear in the total number of pixels, but the exact algorithm is a bit slower.

The distance metric can be any of several different types including the classic L2 (Cartesian) distance metric. Figure 12-9 shows two examples of using the distance transform on a pattern and an image.

Figure 12-9. First a Canny edge detector was run with param1=100 and param2=200; then the distance transform was run with the output scaled by a factor of 5 to increase visibility

cv::distanceTransform() for Unlabeled Distance Transform

When you call the OpenCV distance transform function, the output image will be a 32-bit floating-point image (i.e., CV_32F).

void cv::distanceTransform(
  cv::InputArray  src,                       // Input image
  cv::OutputArray dst,                       // Result image
  int             distanceType,              // Distance metric to use
  int             maskSize                   // Mask to use (3, 5, or see below)
);

Beyond the source and destination arrays, cv::distanceTransform() has two parameters. The first is distanceType, which indicates the distance metric to be used. Your choices here are cv::DIST_C, cv::DIST_L1, and cv::DIST_L2. These methods compute the distance to the nearest zero based on integer steps along a grid. The difference between the methods is that cv::DIST_C is the distance when the steps are counted on an eight-connected grid (i.e., diagonal moves are allowed), and cv::DIST_L1 gives the number of steps on a four-connected grid (i.e., diagonal moves are not allowed). When distanceType is set to cv::DIST_L2, cv::distanceTransform() attempts to compute the exact Euclidean distances.

After the distance type is the maskSize, which may be 3, 5, or cv::DIST_MASK_PRECISE. In the case of 3 or 5, this argument indicates that a 3 × 3 or 5 × 5 mask should be used with the Borgefors method. If you are using cv::DIST_L1 or cv::DIST_C, you can always use a 3 × 3 mask and you will get exact results. If you are using cv::DIST_L2, the Borgefors method is always approximate, and using the larger 5 × 5 mask will result in a better approximation to the L2 distance, at the cost of a slightly slower computation. Alternatively, cv::DIST_MASK_PRECISE can be used to indicate the Felzenszwalb algorithm (when used with cv::DIST_L2).
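A sketch (ours; the Canny thresholds are illustrative) of the typical pipeline described above: detect edges, invert them so that edge pixels become zeros, and then compute exact L2 distances with the Felzenszwalb method:

#include <opencv2/opencv.hpp>

cv::Mat distance_from_edges( const cv::Mat& gray ) {  // gray: 8-bit, single channel
  cv::Mat edges, dist;
  cv::Canny( gray, edges, 100, 200 );
  cv::bitwise_not( edges, edges );                    // edges -> 0, nonedges -> 255

  cv::distanceTransform( edges, dist, cv::DIST_L2, cv::DIST_MASK_PRECISE );
  return dist;                                        // CV_32F distances
}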

cv::distanceTransform() for Labeled Distance Transform

It is also possible to ask the distance transform algorithm to not only calculate the distances, but to also report which object that minimum distance is to. These “objects” are called connected components. We will have a lot more to say about connected components in Chapter 14 but, for now, you can think of them as exactly what they sound like: structures made of continuously connected groups of zeros in the source image.

void cv::distanceTransform(
  cv::InputArray  src,                             // Input image
  cv::OutputArray dst,                             // Result image
  cv::OutputArray labels,                          // Connected component ids
  int             distanceType,                    // Distance metric to use
  int             maskSize,                        // (3, 5, or see below)
  int             labelType = cv::DIST_LABEL_CCOMP // How to label
);

If a labels array is provided, then it will be computed by cv::distanceTransform() and will be of the same size as dst. In this case, connected components will be computed automatically, and the label associated with the nearest such component will be placed in each pixel of labels. The output “labels” array will basically be the discrete Voronoi diagram.

Note

If you are wondering how to differentiate labels, consider that for any pixel that is 0 in src, the corresponding distance must also be 0. In addition, the label for that pixel must be the label of the connected component it is part of. As a result, if you want to know what label was given to any particular zero pixel, you need only look up that pixel in labels.

The argument labelType can be set either to cv::DIST_LABEL_CCOMP or cv::DIST_LABEL_PIXEL. In the former case, the function automatically finds connected components of zero pixels in the input image and gives each one a unique label. In the latter case, all zero pixels are given distinct labels.
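A minimal sketch (ours) of the labeled form; the returned labels array amounts to a discrete Voronoi diagram over the zero pixels of the input:

#include <opencv2/opencv.hpp>

void voronoi_from_zeros( const cv::Mat& binary ) {   // binary: 8-bit, zeros = sites
  cv::Mat dist, labels;
  cv::distanceTransform( binary, dist, labels,
                         cv::DIST_L2, cv::DIST_MASK_5,
                         cv::DIST_LABEL_CCOMP );      // one label per zero component
  // dist   : CV_32F distance to the nearest zero pixel
  // labels : CV_32S id of the connected component that nearest zero belongs to
}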

Segmentation

The topic of image segmentation is a large one, which we have touched on in several places already, and will return to in more sophisticated contexts later in the book. Here, we will focus on several methods of the library that specifically implement techniques that are either segmentation methods in themselves, or primitives that will be used later by more sophisticated tactics. Note that, at this time, there is no general “magic” solution for image segmentation, and it remains a very active area in computer vision research. Despite this, many good techniques have been developed that are reliable at least in some specific domain, and in practice can yield very good results.

Flood Fill

Flood fill [Heckbert90; Shaw04; Vandevenne04] is an extremely useful function that is often used to mark or isolate portions of an image for further processing or analysis. Flood fill can also be used to derive, from an input image, masks that can be used by subsequent routines to speed or restrict processing to only those pixels indicated by the mask. The function cv::floodFill() itself takes an optional mask that can be further used to control where filling is done (e.g., for multiple fills of the same image).

In OpenCV, flood fill is a more general version of the sort of fill functionality that you probably already associate with typical computer painting programs. For both, a seed point is selected from an image and then all similar neighboring points are colored with a uniform color. The difference is that the neighboring pixels need not all be identical in color.19 The result of a flood fill operation will always be a single contiguous region. The cv::floodFill() function will color a neighboring pixel if it is within a specified range (loDiff to upDiff) of the current pixel or, depending on the setting of flags, if it is within that range of the original seed value. Flood filling can also be constrained by an optional mask argument. There are two different prototypes for the cv::floodFill() routine, one that accepts an explicit mask parameter, and one that does not.

int cv::floodFill(
  cv::InputOutputArray image,                  // Input image, 1 or 3 channels
  cv::Point            seed,                   // Start point for flood
  cv::Scalar           newVal,                 // Value for painted pixels
  cv::Rect*            rect,                   // Output bounds painted domain
  cv::Scalar           loDiff   = cv::Scalar(),// Maximum down color distance
  cv::Scalar           upDiff   = cv::Scalar(),// Maximum up color distance
  int                  flags                   // Local/global, and mask-only
);

int cv::floodFill(
  cv::InputOutputArray image,                   // Input w-by-h, 1 or 3 channels
  cv::InputOutputArray mask,                    // 8-bit, w+2-by-h+2 (Nc=1)
  cv::Point            seed,                    // Start point for flood
  cv::Scalar           newVal,                  // Value for painted pixels
  cv::Rect*            rect,                    // Output bounds painted domain
  cv::Scalar           loDiff   = cv::Scalar(),  // Maximum down color distance
  cv::Scalar           upDiff   = cv::Scalar(),  // Maximum up color distance
  int                  flags                    // Local/global, and mask-only
);

The parameter image is the input image, which can be 8-bit or a floating-point type, and must either have one or three channels. In general, this image array will be modified by cv::floodFill(). The flood-fill process begins at the location seed. The seed will be set to value newVal, as will all subsequent pixels colored by the algorithm. A pixel will be colorized if its intensity is not less than a colorized neighbor’s intensity minus loDiff and not greater than the colorized neighbor’s intensity plus upDiff. If the flags argument includes cv::FLOODFILL_FIXED_RANGE, then a pixel will be compared to the original seed point rather than to its neighbors. Generally, the flags argument controls the connectivity of the fill, what the fill is relative to, whether we are filling only a mask, and what values are used to fill the mask. Our first example of flood fill is shown in Figure 12-10.

Figure 12-10. Results of flood fill (top image is filled with gray, bottom image with white) from the dark circle located just off center in both images; in this case, the upDiff and loDiff parameters were each set to 7.0

The mask argument indicates a mask that can function both as an input to cv::floodFill() (in which case, it constrains the regions that can be filled) and as an output from cv::floodFill() (in which case, it will indicate the regions that actually were filled). mask must be a single-channel, 8-bit image whose size is exactly two pixels larger in width and height than the source image.20

In the sense that mask is an input to cv::floodFill(), the algorithm will not flood across nonzero pixels in the mask. As a result, you should zero it before use if you don’t want masking to block the flooding operation.

When the mask is present, it will also be used as an output. When the algorithm runs, every “filled” pixel will be set to a nonzero value in the mask. You have the option of adding the value cv::FLOODFILL_MASK_ONLY to flags (using the usual Boolean OR operator). In this case, the input image will not be modified at all. Instead, only mask will be modified.

Note

If the flood-fill mask is used, then the mask pixels corresponding to the repainted image pixels are set to 1. Don’t be confused if you fill the mask and see nothing but black upon display; the filled values are there, but the mask image needs to be rescaled if you want to display it so you can actually see it on the screen. After all, the difference between 0 and 1 is pretty small on an intensity scale of 0 to 255.

Two possible values for flags have already been mentioned: cv::FLOODFILL_FIXED_RANGE and cv::FLOODFILL_MASK_ONLY. In addition to these, you can also add the numerical values 4 or 8.21 In this case, you are specifying whether the flood-fill algorithm should consider the pixel array to be four-connected or eight-connected. In the former case, a four-connected array is one in which pixels are only connected to their four nearest neighbors (left, right, above, and below). In the eight-connected case, pixels are considered to be connected to diagonally neighboring pixels as well.

The flags argument is slightly tricky because it has three parts that are possibly less intuitive than they could be. The low 8 bits (0–7) can be set to 4 or 8, as we saw before, to control the connectivity considered by the filling algorithm. The high 8 bits (16–23) are the ones that contain the flags such as cv::FLOODFILL_FIXED_RANGE and cv::FLOODFILL_MASK_ONLY. The middle bits (8–15), however, are used a little bit differently to actually represent a numerical value: the value with which you want the mask to be filled. If the middle bits of flags are 0s, the mask will be filled with 1s (the default), but any other value will be interpreted as an 8-bit unsigned integer. All these flags may be linked together via OR. For example, if you want an eight-way connectivity fill, filling only a fixed range, filling the mask (not the image), and filling using a value of 47, then the parameter to pass in would be:

flags = 8
      | cv::FLOODFILL_MASK_ONLY
      | cv::FLOODFILL_FIXED_RANGE
      | (47<<8);

Figure 12-11 shows flood fill in action on a sample image. Using cv::FLOODFILL_FIXED_RANGE with a wide range resulted in most of the image being filled (starting at the center). We should note that newVal, loDiff, and upDiff are prototyped as type cv::Scalar so they can be set for three channels at once. For example, loDiff = cv::Scalar(20,30,40) will set loDiff thresholds of 20 for red, 30 for green, and 40 for blue.

Figure 12-11. Results of flood fill (top image is filled with gray, bottom image with white) from the dark circle located just off center in both images; in this case, flood fill was done with a fixed range and with a high and low difference of 25.0
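Putting the pieces together, here is a sketch (ours; the seed point, difference thresholds, and mask fill value are all illustrative) that fills only the mask, comparing each candidate pixel against the seed value:

#include <opencv2/opencv.hpp>

cv::Mat flood_mask( cv::Mat& img, cv::Point seed ) {  // img: 8-bit, 3 channels
  cv::Mat mask = cv::Mat::zeros( img.rows + 2, img.cols + 2, CV_8UC1 );

  int flags = 8                                  // eight-way connectivity
            | cv::FLOODFILL_MASK_ONLY            // paint the mask, leave img alone
            | cv::FLOODFILL_FIXED_RANGE          // compare against the seed value
            | (255 << 8);                        // fill the mask with 255

  cv::floodFill(
    img, mask, seed,
    cv::Scalar(),                                // newVal (unused with MASK_ONLY)
    0,                                           // no bounding rectangle needed
    cv::Scalar( 25, 25, 25 ),                    // loDiff
    cv::Scalar( 25, 25, 25 ),                    // upDiff
    flags
  );

  return mask;                                   // nonzero wherever the fill reached
}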

Watershed Algorithm

In many practical contexts, we would like to segment an image but do not have the benefit of any kind of separate background mask. One technique that is often effective in this context is the watershed algorithm [Meyer92], which converts lines in an image into “mountains” and uniform regions into “valleys” that can be used to help segment objects. The watershed algorithm first takes the gradient of the intensity image; this has the effect of forming valleys or basins (the low points) where there is no texture, and of forming mountains or ranges (high ridges corresponding to edges) where there are dominant lines in the image. It then successively floods basins starting from caller-specified points until these regions meet. Regions that merge across the marks so generated are segmented as belonging together as the image “fills up.” In this way, the basins connected to the marker point become “owned” by that marker. We then segment the image into the corresponding marked regions.

More specifically, the watershed algorithm allows a user (or another algorithm) to mark parts of an object or background that are known to be part of the object or background. Alternatively, the caller can draw a simple line or collection of lines that effectively tells the watershed algorithm to “group points like these together.” The watershed algorithm then segments the image by allowing marked regions to “own” the edge-defined valleys in the gradient image that are connected with the segments. Figure 12-12 clarifies this process.

Figure 12-12. Watershed algorithm: after a user has marked objects that belong together (left), the algorithm merges the marked area into segments (right)

The function specification of the watershed segmentation algorithm is:

void cv::watershed(
  cv::InputArray       image,                 // Input 8-bit, three channels
  cv::InputOutputArray markers                // 32-bit signed integer (CV_32S), single channel
);

Here, image must be an 8-bit, three-channel (color) image, and markers must be a single-channel, 32-bit signed integer (CV_32S) image of the same (x, y) dimensions. On input, the value of markers is 0 everywhere except where the caller has used positive numbers to indicate that certain regions belong together. For example, in the left panel of Figure 12-12, the orange might have been marked with a “1,” the lemon with a “2,” the lime with a “3,” the upper background with a “4,” and so on.

After the algorithm has run, all of the former zero-value pixels in markers will be set to one of the given markers (i.e., all of the pixels of the orange are hoped to come out with a “1” on them, the pixels of the lemon with a “2,” etc.), except the boundary pixels between regions, which will be set to –1. Figure 12-12 (right) shows an example of such a segmentation.

Warning

It is tempting to think that all regions will be separated by pixels with marker value –1 at their boundaries. However, this is not actually the case. Notably, if two neighboring pixels were input originally with nonzero but distinct values, they will remain touching and not separated by a –1 pixel on output.
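
As a concrete illustration, here is a minimal sketch of preparing a marker image and calling cv::watershed(). The file name and seed locations are hypothetical; in a real application the markers would more likely come from user strokes or from another algorithm:

#include <opencv2/opencv.hpp>

int main() {
  cv::Mat img = cv::imread( "fruit.jpg" );          // 8-bit, 3-channel input
  CV_Assert( !img.empty() );

  // Markers start at 0 everywhere; seed a few small patches with the labels
  // 1, 2, and 3 at locations we "know" belong to distinct regions.
  cv::Mat markers = cv::Mat::zeros( img.size(), CV_32SC1 );
  markers( cv::Rect(  95, 115, 10, 10 ) ).setTo( 1 );   // first object seed
  markers( cv::Rect( 295, 195, 10, 10 ) ).setTo( 2 );   // second object seed
  markers( cv::Rect(  15,  15, 10, 10 ) ).setTo( 3 );   // background seed

  cv::watershed( img, markers );    // on return, each pixel is 1, 2, 3, or -1

  // Visualize the labeling: each label becomes a gray level; the -1
  // boundary pixels stay black.
  cv::Mat vis( markers.size(), CV_8UC1, cv::Scalar(0) );
  for( int y = 0; y < markers.rows; y++ )
    for( int x = 0; x < markers.cols; x++ ) {
      int m = markers.at<int>(y,x);
      if( m > 0 ) vis.at<uchar>(y,x) = (uchar)( m * 80 );
    }
  cv::imwrite( "watershed_labels.png", vis );
  return 0;
}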

Grabcuts

The Grabcuts algorithm, introduced by Rother, Kolmogorov, and Blake [Rother04], extends the Graphcuts algorithm [Boykov01] for use in user-directed image segmentation. The Grabcuts algorithm is capable of obtaining excellent segmentations, often with no more than a bounding rectangle around the foreground object to be segmented.

The original Graphcuts algorithm used user-labeled foreground and user-labeled background regions to establish distribution histograms for those two classes of image regions. It then combined the assertion that unlabeled foreground or background should conform to similar distributions with the idea that these regions tend to be smooth and connected (i.e., a bunch of blobs). These assertions were then combined into an energy functional that gave a low energy (i.e., cost) to solutions that conformed to these assertions and a high energy to solutions that violated them. The algorithm obtained the final result by minimizing this energy function.22
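
Schematically (this is a paraphrase of the general formulation in the cited papers, not a formula taken from the OpenCV documentation), the energy being minimized has the form

E(\alpha, \theta, \mathbf{z}) \;=\; U(\alpha, \theta, \mathbf{z}) \;+\; V(\alpha, \mathbf{z})

where \alpha denotes the per-pixel foreground/background labels, \theta the parameters of the foreground and background color models, and \mathbf{z} the observed pixel values. The data term U measures how poorly each pixel's color fits the model for its assigned label, while the smoothness term V penalizes neighboring pixels that receive different labels, with the penalty reduced across strong color edges.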

The Grabcuts algorithm extends Graphcuts in several important ways. The first is that it replaces the histogram models with a different (Gaussian mixture) model, enabling the algorithm to work on color images. In addition, it solves the energy functional minimization problem in an iterative manner, which provides better results overall, and allows much greater flexibility in the labeling provided by the user. Notably, this latter point makes possible even one-sided labelings, which identify either only background or only foreground pixels (where Graphcuts required both).

The implementation in OpenCV allows the caller either to provide just a rectangle around the object to be segmented, in which case everything outside the rectangle’s boundary is taken to be background and no foreground is specified, or to supply an overall mask in which pixels are categorized as definitely foreground, definitely background, probably foreground, or probably background.23 In the latter case, the “definite” regions are used to build the models, and the “probable” regions are then classified into the definite categories by the algorithm.

Grabcuts is implemented in OpenCV by the cv::grabCut() function:

void cv::grabCut(
  cv::InputArray       img,
  cv::InputOutputArray mask,
  cv::Rect             rect,
  cv::InputOutputArray bgdModel,
  cv::InputOutputArray fgdModel,
  int                  iterCount,
  int                  mode     = cv::GC_EVAL
);

Given an input image img, the resulting labeling will be computed by cv::grabCut() and placed in the output array mask. The mask array can also be used as an input; whether it is depends on the mode argument. If mode contains the flag cv::GC_INIT_WITH_MASK,24 then the values in mask at the time of the call will be used to initialize the labeling of the image. The mask is expected to be a single-channel, 8-bit unsigned (CV_8U) image in which every value is one of the following enumerations.

Enumerated value   Numerical value   Meaning
cv::GC_BGD         0                 Definitely background
cv::GC_FGD         1                 Definitely foreground
cv::GC_PR_BGD      2                 Probably background
cv::GC_PR_FGD      3                 Probably foreground

The argument rect is used only when you are not using mask initialization. When the mode contains the flag cv::GC_INIT_WITH_RECT, the entire region outside of the provided rectangle is taken to be “definitely background,” while the rest is automatically set to “probably foreground.”

The next two arrays are essentially temporary buffers. When you first call cv::grabCut(), they can be empty arrays. However, if for some reason you should run the Grabcuts algorithm for some number of iterations and then want to restart the algorithm for more iterations (possibly after allowing a user to provide additional “definite” pixels to guide the algorithm), you will need to pass in the same (unmodified) buffers that were filled by the previous run (in addition to using the mask you got back from the previous run as input for the next run).

Internally, the Grabcuts algorithm essentially runs the Graphcuts algorithm some number of times (with the minor extensions mentioned previously). In between each such run, the mixture models are recomputed. The iterCount argument determines how many such iterations will be applied. Typical values for iterCount are 10 or 12, though the number required may depend on the size and nature of the image being processed.
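
Putting the pieces together, here is a minimal sketch (the file names, rectangle coordinates, and iteration count are hypothetical) of a rectangle-initialized call followed by extraction of the foreground pixels:

#include <opencv2/opencv.hpp>

int main() {
  cv::Mat img = cv::imread( "object.jpg" );
  CV_Assert( !img.empty() );

  cv::Mat mask, bgdModel, fgdModel;              // empty on the first call
  cv::Rect rect( 50, 50, 200, 300 );             // rough box around the object

  cv::grabCut( img, mask, rect, bgdModel, fgdModel,
               10,                               // iterCount
               cv::GC_INIT_WITH_RECT );          // initialize labels from rect

  // Keep the pixels labeled definitely or probably foreground.
  cv::Mat fg = (mask == cv::GC_FGD) | (mask == cv::GC_PR_FGD);
  cv::Mat result;
  img.copyTo( result, fg );
  cv::imwrite( "foreground.png", result );
  return 0;
}

If you later want to refine the result (for example, after letting a user mark a few more pixels in mask as cv::GC_FGD or cv::GC_BGD), you can call the function again with the same model buffers and the cv::GC_EVAL mode; the rectangle is ignored in that case:

cv::grabCut( img, mask, cv::Rect(), bgdModel, fgdModel,
             5,                                  // a few more iterations
             cv::GC_EVAL );                      // reuse existing models and mask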

Mean-Shift Segmentation

Mean-shift segmentation finds the peaks of color distributions over space [Comaniciu99]. It is related to the mean-shift algorithm, which we will discuss later when we talk about tracking and motion in Chapter 17. The main difference between the two is that the former looks at spatial distributions of color (and is thus related to our current topic of segmentation), while the latter tracks those distributions through time in successive frames. The function that performs this segmentation based on the color distribution peaks is cv::pyrMeanShiftFiltering().

Given a set of multidimensional data points whose dimensions are (x, y, blue, green, red), mean-shift can find the highest density “clumps” of data in this space by scanning a window over the space. Notice, however, that the spatial variables (x, y) can have very different ranges from the color-magnitude ranges (blue, green, red). Therefore, mean-shift needs to allow for different window radii in different dimensions. In this case we should have, at a minimum, one radius for the spatial variables (spatialRadius) and one radius for the color magnitudes (colorRadius). As mean-shift windows move, all the points traversed by the windows that converge at a peak in the data become connected to or “owned” by that peak. This ownership, radiating out from the densest peaks, forms the segmentation of the image. The segmentation is actually done over a scale pyramid—cv::pyrUp(), cv::pyrDown()—as described in Chapter 11, so that color clusters at a high level in the pyramid (shrunken image) have their boundaries refined at lower levels in the pyramid.

The output of the mean-shift segmentation algorithm is a new image that has been “posterized,” meaning that the fine texture is removed and the gradients in color are mostly flattened. You can then further segment such images using whatever algorithm is appropriate for your needs (e.g., cv::Canny() combined with cv::findContours(), if in fact a contour segmentation is what you ultimately want).

The function prototype for cv::pyrMeanShiftFiltering() looks like this:

void cv::pyrMeanShiftFiltering(
  cv::InputArray   src,                       // 8-bit, Nc=3 image
  cv::OutputArray  dst,                       // 8-bit, Nc=3, same size as src
  double           sp,                        // Spatial window radius
  double           sr,                        // Color window radius
  int              maxLevel = 1,              // Max pyramid level
  cv::TermCriteria termcrit = cv::TermCriteria(
    cv::TermCriteria::MAX_ITER | cv::TermCriteria::EPS,
    5,
    1
  )
);

In cv::pyrMeanShiftFiltering() we have an input image src and an output image dst. Both must be 8-bit, three-channel color images of the same width and height. The spatial window radius sp (spatialRadius) and the color window radius sr (colorRadius) define how the mean-shift algorithm averages color and space together to form a segmentation. For a 640 × 480 color image, it works well to set spatialRadius equal to 2 and colorRadius equal to 40. The next parameter of this algorithm is maxLevel, which describes how many levels of scale pyramid you want used for segmentation. A maxLevel of 2 or 3 works well for a 640 × 480 color image.

The final parameter is cv::TermCriteria, which we have seen in some previous algorithms; it is used by many of OpenCV's iterative algorithms to control when they stop. The mean-shift segmentation function provides reasonable defaults if you simply leave this parameter at its default value.

Figure 12-13 shows an example of mean-shift segmentation using the following values:

cv::pyrMeanShiftFiltering( src, dst, 20, 40, 2);
Figure 12-13. Mean-shift segmentation over scale using cv::pyrMeanShiftFiltering() with parameters max_level=2, spatialRadius=20, and colorRadius=40; similar areas now have similar values and so can be treated as super pixels (larger statistically similar areas), which can speed up subsequent processing significantly
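
For reference, a self-contained sketch of that call (the file names are hypothetical) might look like this:

#include <opencv2/opencv.hpp>

int main() {
  cv::Mat src = cv::imread( "scene.jpg" );          // 8-bit, 3-channel input
  CV_Assert( !src.empty() );

  cv::Mat dst;
  cv::pyrMeanShiftFiltering( src, dst, 20, 40, 2 ); // sp=20, sr=40, maxLevel=2

  cv::imwrite( "posterized.png", dst );             // the "posterized" result
  return 0;
}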

Summary

In this chapter, we expanded our repertoire of techniques for image analysis. Building on the general image transforms from the previous chapter, we learned new methods that we can use to better understand the images we are working with and which, as we will see, will form the foundation for many more complex algorithms. Tools such as the distance transform, integral images, and the segmentation techniques will turn out to be important building blocks for other algorithms in OpenCV as well as for your own image analysis.

Exercises

  1. In this exercise, we learn to experiment with parameters by setting good lowThresh and highThresh values in cv::Canny(). Load an image with suitably interesting line structures. We’ll use three different high:low threshold settings of 1.5:1, 2.75:1, and 4:1.

    1. Report what you see with a high setting of less than 50.

    2. Report what you see with high settings between 50 and 100.

    3. Report what you see with high settings between 100 and 150.

    4. Report what you see with high settings between 150 and 200.

    5. Report what you see with high settings between 200 and 250.

    6. Summarize your results and explain what happens as best you can.

  2. Load an image containing clear lines and circles such as a side view of a bicycle. Use the Hough line and Hough circle calls and see how they respond to your image.

  3. Can you think of a way to use the Hough transform to identify any kind of shape with a distinct perimeter? Explain how.

  4. Take the Fourier transform of a small Gaussian distribution and the Fourier transform of an image. Multiply them and take the inverse Fourier transform of the results. What have you achieved? As the filters get bigger, you will find that working in the Fourier space is much faster than in the normal space.

  5. Load an interesting image, convert it to grayscale, and then take an integral image of it. Now find vertical and horizontal edges in the image by using the properties of an integral image.

    Note

    Use long skinny rectangles; subtract and add them in place.

  6. Write a function to compute an integral image that is rotated 45 degrees; you can then use that image to find the sum of a 45-degree rotated rectangle from four points.

  7. Explain how you could use the distance transform to automatically align a known shape with a test shape when the scale is known and held fixed. How would this be done over multiple scales?

  8. Write a function to blur an image using cv::GaussianBlur() with a kernel size of (50,50). Time it. Use the DFT of a 50 × 50 Gaussian kernel to do the same kind of blur much faster.

  9. Write an image-processing function to interactively remove people from an image. Use cv::grabCut() to segment the person, and then use cv::inpaint() to fill in the resulting hole (recall that we learned about cv::inpaint() in the previous chapter).

  10. Take a sufficiently interesting image. Use cv::pyrMeanShiftFiltering() to segment it. Use cv::floodFill() to mask off two resulting segments, and then use that mask to blur all but those segments in the image.

  11. Create an image with a 20 × 20 square in it. Rotate it to an arbitrary angle. Take a distance transform of this image. Create a 20 × 20 square shape. Use the distance transform image to algorithmically overlay the shape onto the rotated square in the image you made.

  12. In the 2005 DARPA Grand Challenge robot race, the authors on the Stanford team used a kind of color clustering algorithm to separate road from nonroad. The colors were sampled from a laser-defined trapezoid of road patch in front of the car. Other colors in the scene that were close in color to this patch—and whose connected component connected to the original trapezoid—were labeled as road. See Figure 12-14 where the watershed algorithm was used to segment the road after a trapezoid mark was used inside the road and an inverted “U” mark outside the road. Suppose we could automatically generate these marks. What could go wrong with this method of segmenting the road?

    Figure 12-14. Using the watershed algorithm to identify a road: markers are put in the original image (left), and the algorithm yields the segmented road (right)

1 Joseph Fourier [Fourier] was the first to find that some functions can be decomposed into an infinite series of other functions, which became a field known as Fourier analysis. Some key texts on methods of decomposing functions into their Fourier series are Morse for physics [Morse53] and Papoulis in general [Papoulis62]. The fast Fourier transform was invented by Cooley and Tukey in 1965 [Cooley65], though Carl Gauss worked out the key steps as early as 1805 [Johnson84]. Early use in computer vision is described by Ballard and Brown [Ballard82].

2 As a result of this compact representation, the size of the output array for a single-channel image is the same as the size of the input array because the elements that are provably zero are omitted. In the case of the two-channel (complex) array, the output size will, of course, also be equal to the input size.

3 When using this method, you must be sure to explicitly set the imaginary components to 0 in the two-channel representation. An easy way to do this is to create a matrix full of 0s using cv::Mat::zeros() for the imaginary part and then call cv::merge() with a real-valued matrix to form a temporary complex array on which to run cv::dft() (possibly in place). This procedure will result in a full-size, unpacked, complex matrix of the spectrum.

4 With the inverse transform, the input is packed in the special format described previously. This makes sense because, if we first called the forward DFT and then ran the inverse DFT on the results, we would expect to wind up with the original data—that is, of course, if we remember to use the cv::DFT_SCALE flag!

5 This is not to say that it is in CCS format, only that it possesses the symmetry, as it would if (for example) it were the result of a forward transform of a purely real array in the first place. Also, note that you are telling OpenCV that the input array has this symmetry—it will trust you. It does not actually check to verify that this symmetry is present.

6 The primary usage of this argument is the implementation of a correlation in Fourier space. It turns out that the only difference between convolution (which we will discuss in the next section) and correlation is the conjugation of the second array in the spectrum multiplication.

7 Recall that OpenCV’s DFT algorithm implements the FFT whenever the data size makes the FFT faster.

8 By “slowest” we mean “asymptotically slowest”—in other words, that this portion of the algorithm takes the most time for very large N. This is an important distinction. In practice, as we saw in the earlier section on convolutions, it is not always optimal to pay the overhead for conversion to Fourier space. In general, when you are convolving with a small kernel it is not worth the trouble to make this transformation.

9 Astute readers might object that the cosine transform is being applied to a vector that is not a manifestly even function. However, with cv::dct(), the algorithm simply treats the vector as if it were extended to negative indices in a mirrored manner.

10 The citation given is the best for more details on the method, but it was actually introduced in computer vision in 2001 in a paper titled “Robust Real-Time Object Detection” by the same authors. The method was previously used as early as 1984 in computer graphics, where the integral image is known as a summed area table.

11 This allows for the rows of zeros, which are implied by the fact that summing zero terms results in a sum of zero.

12 It is worth noting that even though sum and tilted_sum allow 32-bit float as output for input images of 32-bit float type, it is recommended to use 64-bit float, particularly for larger images. After all, a modern large image can be many millions of pixels.

13 We’ll have much more to say about contours later. As you await those revelations, keep in mind that the cv::Canny() routine does not actually return objects of a contour type; if we want them, we will have to build those from the output of cv::Canny() by using cv::findContours(). Everything you ever wanted to know about contours will be covered in Chapter 14.

14 Hough developed the transform for use in physics experiments [Hough59]; its use in vision was introduced by Duda and Hart [Duda72].

15 The probabilistic Hough transform (PHT) was introduced by Kiryati, Eldar, and Bruckshtein in 1991 [Kiryati91]; the PPHT was introduced by Matas, Galambos, and Kittler in 1999 [Matas00].

16 As usual, depending on the object type you pass to lines, this could be either a 1 × N array with two channels, or if you like, a std::vector<> with N entries, with each entry being of type Vec2f.

17 The function cv::Sobel(), not cv::Canny(), is called internally. The reason is that cv::HoughCircles() needs to estimate the orientation of a gradient at each pixel, and this is difficult to do with a binary edge map.

18 Although cv::HoughCircles() catches centers of the circles quite well, it sometimes fails to find the correct radius. Therefore, in an application where only a center must be found (or where some different technique can be used to find the actual radius), the radius returned by cv::HoughCircles() can be ignored.

19 Users of contemporary painting and drawing programs should note that most of them now employ a filling algorithm very much like cv::floodFill().

20 This is done to make processing easier and faster for the internal algorithm. Note that since the mask is larger than the original image, pixel (x,y) in image corresponds to pixel (x+1,y+1) in mask. Therefore, you may find this an excellent opportunity to use cv::Mat::getSubRect().

21 The text here reads “add,” but recall that flags is really a bit-field argument. Conveniently, however, 4 and 8 are single bits. So you can use “add” or “OR,” whichever you prefer (e.g., flags = 8 | cv::FLOODFILL_MASK_ONLY).

22 This minimization is a nontrivial problem. In practice it is performed through a technique called Mincut, which is how both the Graphcuts and Grabcuts algorithms get their respective names.

23 Perhaps unintuitively, the algorithm, as implemented, does not allow for a “don’t know” prior labeling.

24 Actually, the way that cv::grabCut() is implemented, you do not need to explicitly provide the cv::GC_INIT_WITH_MASK flag. This is because mask initialization is actually the default behavior. So, as long as you do not provide the cv::GC_INIT_WITH_RECT flag, you will get mask initialization. However, this is not implemented as a default argument, but rather as a default in the procedural logic of the function, and is therefore not guaranteed to remain unchanged in future releases of the library. It is best to use either the cv::GC_INIT_WITH_MASK flag or the cv::GC_INIT_WITH_RECT flag explicitly; doing so not only adds future-proofing, but also enhances general clarity.
