Chapter 16. Keypoints and Descriptors

Keypoints and the Basics of Tracking

This chapter is all about informative feature points in images. We will begin by describing what are called corners and exploring their definition in the subpixel domain. We will then learn how to track such corners with optical flow. Historically, the tracking of corners evolved into the theory of keypoints, to which we will devote the remainder of this chapter, including extensive discussion of keypoint feature detectors and descriptors implemented in the OpenCV library for you to use.1

The concept of corners, as well as that of keypoints, is based on the intuition that it would be useful in many applications to be able to represent an image or object in an invariant form that will be the same, or at least very similar, in other similar images of the same scene or object. Corner and keypoint representations are powerful methods for doing this. A corner is a small patch of an image that is rich in local information and therefore likely to be recognized in another image. A keypoint is an extension of this concept that encodes information from a small local patch of an image such that the keypoint is highly recognizable and, at least in principle, largely unique. The descriptive information about a keypoint is summarized in the form of its descriptor, which is typically much lower-dimensional than the pixel patch that formed the keypoint. The descriptor represents that patch so as to make it much easier to recognize that patch when it appears in another, different image.

From an intuitive point of view, you can think of a keypoint like a piece from a jigsaw puzzle. When you begin the puzzle, some pieces are easily recognized: the handle of a door, a face, the steeple of a church. When you go to assemble the puzzle, you can immediately relate these keypoints to the image on the puzzle box and know immediately where to place them. In addition, if you and a friend had two unassembled puzzles and you wanted to know if they were, for example, both different images of the beautiful Neuschwanstein Castle, you could assemble both puzzles and compare, but you could also just pick out the most salient pieces from each puzzle and compare them. In this latter case, it would take only a few matches before you were convinced that they were either literally the same puzzle, or two puzzles made from two different images of the same castle.

In this chapter, we will start by building up the basics of the theory of keypoints by discussing the earliest ancestor of the modern keypoint: the Harris corner. From there we will discuss the concept of optical flow for such corners, which captures the basic idea of tracking such features from one frame to another in a video sequence. After that we will move on to more modern keypoints and their descriptors and discuss how OpenCV helps us find them as well as how the library will help us match them between frames. Finally, we will look at a convenient method that allows us to visualize keypoints overlaid on top of the images in which they were detected.

Corner Finding

There are many kinds of local features that you can track. It is worth taking a moment to consider what exactly constitutes such a feature. Obviously, if we pick a point on a large blank wall, then it won’t be easy to find that same point in the next frame of a video.

If all points on the wall are identical or even very similar, then we won’t have much luck tracking that point in subsequent frames. On the other hand, if we choose a point that is unique, then we have a pretty good chance of finding that point again. In practice, the point or feature we select should be unique, or nearly unique, and should be parameterizable such that it can be compared to other points in another image (see Figure 16-1).

Returning to our intuition from the large blank wall, we might be tempted to look for points that have some significant change in them—for example, a strong derivative. It turns out that this is not quite enough, but it’s a start. A point to which a strong derivative is associated may be on an edge of some kind, but it still may look like all of the other points along that same edge (see the aperture problem diagrammed in Figure 16-8 and discussed in the section “Introduction to Optical Flow”).

Figure 16-1. The points marked with circles here are good points to track, whereas those marked with boxes—even the ones that are sharply defined edges—are poor choices

However, if strong derivatives are observed nearby in two different directions, then we can hope that this point is more likely to be unique. For this reason, many trackable features are called corners. Intuitively, corners—not edges—are the points that contain enough information to be picked out from one frame to the next.

The most commonly used definition of a corner was provided by Harris [Harris88]. This definition captures the intuition of the previous paragraph in a mathematically specific form. We will look at the details of this method shortly, but for the moment, what is important to know is that you can ask OpenCV to simply find the points in the image that are good candidates for being tracked, and it will use Harris’s method to identify them for you.

Finding corners using cv::goodFeaturesToTrack()

The cv::goodFeaturesToTrack() routine implements Harris’s method and a slight improvement credited to Shi and Tomasi [Shi94]. This function conveniently computes the necessary derivative operators, analyzes them, and returns a list of the points that meet our definition of being good for tracking:

void cv::goodFeaturesToTrack(
  cv::InputArray  image,                         // Input, CV_8UC1 or CV_32FC1
  cv::OutputArray corners,                       // Output vector of corners
  int             maxCorners,                    // Keep this many corners
  double          qualityLevel,                  // (fraction) rel to best
  double          minDistance,                   // Discard corner this close 
  cv::InputArray  mask              = noArray(), // Ignore corners where mask=0
  int             blockSize         = 3,         // Neighborhood used
  bool            useHarrisDetector = false,     // false='Shi Tomasi metric'
  double          k                 = 0.04       // Used for Harris metric
);

The input image can be any 8-bit or 32-bit (i.e., 8U or 32F), single-channel image. The output corners will be a vector or array (depending on what you provide) containing all of the corners that were found. If it is a vector<>, it should be a vector of cv::Point2f objects. If it is a cv::Mat, it will have one row for every corner and two columns for the x and y locations of the points. You can limit the number of corners that will be found with maxCorners, the quality of the returned points with qualityLevel (typically between 0.10 and 0.01, and never greater than 1.0), and the minimum separation between adjacent corners with minDistance.

If the argument mask is supplied, it must be the same dimension as image, and corners will not be generated anywhere mask is 0. The blockSize argument indicates how large an area is considered when a corner is computed; a typical value is 3 but, for high-resolution images, you may want to make this slightly larger. The useHarrisDetector argument, if set to true, will cause cv::goodFeaturesToTrack() to use the exact corner-strength formula from Harris's original algorithm; if set to false, Shi and Tomasi's metric will be used. The parameter k is used only by Harris's algorithm, and is best left at its default value.2
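As a quick sketch of a typical call (the filename and parameter values here are only illustrative placeholders, not recommendations), detecting up to 500 corners in a grayscale image might look like this:

// A minimal sketch of calling cv::goodFeaturesToTrack(); "scene.jpg" and the
// parameter values below are placeholders, not recommendations.
cv::Mat img = cv::imread( "scene.jpg", cv::IMREAD_GRAYSCALE );
std::vector< cv::Point2f > corners;
cv::goodFeaturesToTrack(
  img,        // single-channel 8-bit input
  corners,    // output vector of corner locations
  500,        // maxCorners
  0.01,       // qualityLevel
  10          // minDistance, in pixels
);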

Subpixel corners

If you are processing images for the purpose of extracting geometric measurements, as opposed to extracting features for recognition, then you will normally need more resolution than the simple pixel values supplied by cv::goodFeaturesToTrack(). Another way of saying this is that such pixels come with integer coordinates whereas we sometimes require real-valued coordinates—for example, a pixel location of (8.25, 117.16).

If you imagine looking for a particular small object in a camera image, such as a distant star, you would invariably be frustrated by the fact that the point’s location will almost never be in the exact center of a camera pixel element. Of course, in this circumstance, some of the light from the object will appear in the neighboring pixels as well. To overcome this, you might try to fit a curve to the image values and then use a little math to find where the peak occurred between the pixels. Subpixel corner detection techniques all rely on approaches of this kind (for a review and newer techniques, see Lucchese [Lucchese02] and Chen [Chen05]). Common uses of such measurements include tracking for three-dimensional reconstruction, calibrating a camera, warping partially overlapping views of a scene to stitch them together in the most natural way, and finding an external signal such as the precise location of a building in a satellite image.

One of the most common tricks for subpixel refinement is based on the mathematical observation that the dot product between a vector and an orthogonal vector is 0; this situation occurs at corner locations, as shown in Figure 16-2.

Figure 16-2. Finding corners to subpixel accuracy: (a) the image area around the point p is uniform and so its gradient is 0; (b) the gradient at the edge is orthogonal to the vector q-p along the edge; in either case, the dot product between the gradient at p and the vector q-p is 0 (see text)

In Figure 16-2, we assume a starting corner location q that is near the actual subpixel corner location. We examine vectors starting at point q and ending at p. When p is in a nearby uniform or “flat” region, the gradient there is 0. On the other hand, if the vector q-p aligns with an edge, then the gradient at p on that edge is orthogonal to the vector q-p. In either case, the dot product between the gradient at p and the vector q-p is 0. We can assemble many such pairs of the gradient at a nearby point p and the associated vector q-p, set their dot product to 0, and solve this assemblage as a system of equations; the solution will yield a more accurate subpixel location for q, the exact location of the corner.
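To make the preceding paragraph a bit more explicit, here is a sketch of the system being solved (the library does all of this for you inside cv::cornerSubPix(), described next). For each point p_i in a window around the initial guess q, the constraint is

∇I(p_i)^T (q - p_i) = 0

and collecting these constraints and solving for q in the least-squares sense gives

q = ( Σ_i ∇I(p_i) ∇I(p_i)^T )^-1 ( Σ_i ∇I(p_i) ∇I(p_i)^T p_i )

which is the inversion of a single autocorrelation matrix mentioned in the discussion of cv::cornerSubPix() that follows.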

The function that does subpixel corner finding is cv::cornerSubPix():

void cv::cornerSubPix(
  cv::InputArray       image,                    // Input image
  cv::InputOutputArray corners,                  // Guesses in, and results out
  cv::Size             winSize,                  // Area is NXN; N=(winSize*2+1)
  cv::Size             zeroZone,                 // Size(-1,-1) to ignore
  cv::TermCriteria     criteria                  // When to stop refinement
);

The input image is the original image from which your corners were computed. The corners array contains the integer pixel locations, such as those obtained from routines like cv::goodFeaturesToTrack(), which are taken as the initial guesses for the corner locations.

As described earlier, the actual computation of the subpixel location uses a system of dot-product expressions that express the combinations that should sum to zero (see Figure 16-2). Each of these equations arises from considering a single pixel in the region around p. The parameter winSize specifies the size of window from which these equations will be generated. This window is centered on the original integer corner location and extends outward in each direction by the number of pixels specified in winSize (e.g., if winSize.width = 4, then the search area is actually 4 + 1 + 4 = 9 pixels wide). These equations form a linear system that can be solved by the inversion of a single autocorrelation matrix.3 In practice, this matrix is not always invertible owing to small eigenvalues arising from the pixels very close to p. To protect against this, it is common to simply reject from consideration those pixels in the immediate neighborhood of p. The parameter zeroZone defines a window (analogously to winSize, but always with a smaller extent) that will not be considered in the system of constraining equations and thus the autocorrelation matrix. If no such zero zone is desired, then this parameter should be set to cv::Size(-1,-1).

Once a new location is found for q, the algorithm will iterate using that value as a starting point and will continue until the user-specified termination criterion is reached. Recall that this criterion can be of type cv::TermCriteria::MAX_ITER or of type cv::TermCriteria::EPS (or both) and is usually constructed with the cv::TermCriteria() function. Using cv::TermCriteria::EPS will effectively indicate the accuracy you require of the subpixel values. Thus, if you specify 0.10, then you are asking for subpixel accuracy down to one-tenth of a pixel.
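As a sketch of how this looks in practice (the window size and termination values here are arbitrary illustrative choices, and img and corners are assumed to hold the image and the corner locations found earlier):

cv::cornerSubPix(
  img,                                   // the image the corners came from
  corners,                               // guesses in, refined locations out
  cv::Size( 5, 5 ),                      // search window will be 11 x 11 pixels
  cv::Size( -1, -1 ),                    // no zero zone
  cv::TermCriteria(
    cv::TermCriteria::MAX_ITER | cv::TermCriteria::EPS,
    30,                                  // give up after 30 iterations...
    0.01                                 // ...or when a corner moves < 0.01 pixel
  )
);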

Introduction to Optical Flow

The optical flow problem involves attempting to figure out where many (and possibly all) points in one image have moved to in a second image—typically this is done in sequences of video, for which it is reasonable to assume that most points in the first frame can be found somewhere in the second. Optical flow can be used for motion estimation of an object in the scene, or even for ego-motion of the camera relative to the scene as a whole. In many applications, such as video security, it is motion itself that indicates that a portion of the scene is of specific interest, or that something interesting is going on. Optical flow is illustrated in Figure 16-3.

Figure 16-3. Optical flow: target features (left) are tracked over time and their movement is converted into velocity vectors (right); original images courtesy of Jean-Yves Bouguet

The ideal output of an optical flow algorithm would be the association of some estimate of velocity for each  and every pixel in a frame pair or, equivalently, a displacement vector for every pixel in one image that indicates the relative location of that pixel in the other image. Such a construction, when it applies to every pixel in the image, is usually referred to as dense optical flow.  There is an alternative class of algorithms, called sparse optical flow algorithms, that track only some subset of the points in the image. These algorithms are often fast and reliable because they restrict their attention to specific points in the image that will be easier to track. OpenCV has many ways of helping us identify points that are well suited for tracking, with the corners introduced earlier being only one among a long list. For many practical applications, the computational cost of sparse tracking is so much less than dense tracking that the latter is relegated to only academic interest.4 In this section, we will look at one sparse optical flow technique. Later, we will look at more powerful tools for sparse optical flow, and then finally move on to dense optical flow. 

Lucas-Kanade Method for Sparse Optical Flow

The Lucas-Kanade (LK) algorithm [Lucas81], as originally proposed in 1981, was an attempt to produce dense optical flow (i.e., flow for every pixel). Yet, because the method is easily applied to a subset of the points in the input image, it has become an important technique for sparse optical flow. The algorithm can be applied in a sparse context because it relies only on local information that is derived from some small window surrounding each point of interest. The disadvantage of using small local windows in Lucas-Kanade is that large motions can move points outside of the local window and thus become impossible for the algorithm to find. This problem led to the development of the “pyramidal” LK algorithm, which tracks starting from the highest level of an image pyramid (lowest detail) and working down to lower levels (finer detail). Tracking over image pyramids allows large motions to be caught by local windows.5

Because this is an important and effective technique, we will go into some mathematical detail; readers who prefer to forgo such details can skip to the function description and code. However, it is recommended that you at least scan the intervening text and figures, which describe the assumptions behind Lucas-Kanade optical flow, so that you’ll have some intuition about what to do if tracking isn’t working well.

How Lucas-Kanade works

The basic idea of the LK algorithm rests on three assumptions:

Brightness constancy
A pixel from the image of an object in the scene does not change in appearance as it (possibly) moves from frame to frame. For grayscale images (LK can also be done in color), this means we assume that the brightness of a pixel does not change as it is tracked from frame to frame.
Temporal persistence, or “small movements”
The image motion of a surface patch changes slowly in time. In practice, this means the temporal increments are fast enough relative to the scale of motion in the image that the object does not move much from frame to frame.
Spatial coherence
Neighboring points in a scene belong to the same surface, have similar motion, and project to nearby points on the image plane.

We now look at how these assumptions, illustrated in Figure 16-4, lead us to an effective tracking algorithm. The first requirement, brightness constancy, is just the requirement that pixels in one tracked patch look the same over time, defining:

f(x, t) ≡ I(x(t), t) = I(x(t + dt), t + dt)

Figure 16-4. Assumptions behind Lucas-Kanade optical flow: for a patch being tracked on an object in a scene, the patch’s brightness doesn’t change (left); motion is slow relative to the frame rate (center); and neighboring points stay neighbors (right); component images courtesy of Michael Black [Black92]

The requirement that our tracked pixel intensity exhibits no change over time can simply be expressed as:

∂f(x)/∂t = 0

The second assumption, temporal persistence, essentially means that motions are small from frame to frame. In other words, we can view this change as approximating a derivative of the intensity with respect to time (i.e., we assert that the change between one frame and the next in a sequence is differentially small). To understand the implications of this assumption, first consider the case of a single spatial dimension.

In this case, we can start with our brightness consistency equation, substitute the definition of the brightness f(x, t) while taking into account the implicit dependence of x on t, I(x(t), t), and then apply the chain rule for partial differentiation. This yields:

(∂I/∂x)(∂x/∂t) + ∂I/∂t = I_x v + I_t = 0

where I_x is the spatial derivative across the first image, I_t is the derivative between images over time, and v is the velocity we are looking for. We thus arrive at the equation for optical flow velocity in the simple one-dimensional case:

v = -I_t / I_x

Let’s now try to develop some intuition for this one-dimensional tracking problem. Consider Figure 16-5, which shows an “edge”—consisting of a high value on the left and a low value on the right—that is moving to the right along the x-axis. Our goal is to identify the velocity v at which the edge is moving, as plotted in the upper part of Figure 16-5. In the lower part of the figure, we can see that our measurement of this velocity is just “rise over run,” where the rise is over time and the run is the slope (spatial derivative). The negative sign corrects for the slope of x.
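To put some (made-up) numbers on this: if the intensity at our pixel rises by 10 units between the two frames (I_t = +10) and the spatial slope of the edge there is I_x = -20 units per pixel, then v = -I_t / I_x = -(10)/(-20) = 0.5; the edge is moving to the right at half a pixel per frame.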

Figure 16-5. Lucas-Kanade optical flow in one dimension: we can estimate the velocity of the moving edge (upper panel) by measuring the ratio of the derivative of the intensity over time divided by the derivative of the intensity over space

Figure 16-5 reveals another aspect to our optical flow formulation: our assumptions are probably not quite true. That is, image brightness is not really stable; and our time steps (which are set by the camera) are often not as fast relative to the motion as we’d like. Thus, our solution for the velocity is not exact. However, if we are “close enough,” then we can iterate to a solution. Iteration is shown in Figure 16-6 where we use our first (inaccurate) estimate of velocity as the starting point for our next iteration and then repeat. Note that we can keep the same spatial derivative in x as computed on the first frame because of the brightness constancy assumption—pixels moving in x do not change. This reuse of the spatial derivative already calculated yields significant computational savings. The time derivative must still be recomputed each iteration and each frame, but if we are close enough to start with, then these iterations will converge to near exactitude within about five iterations. This is known as Newton’s method. If our first estimate was not close enough, then Newton’s method will actually diverge.

Figure 16-6. Iterating to refine the optical flow solution (Newton’s method): using the same two images and the same spatial derivative (slope) we solve again for the time derivative; convergence to a stable solution usually occurs within a few iterations

Now that we’ve seen the one-dimensional solution, let’s generalize it to images in two dimensions. At first glance, this seems simple: just add in the y-coordinate. Slightly changing notation, we’ll call the y-component of velocity v and the x-component of velocity u; then we have:

I_x u + I_y v + I_t = 0

This is often written as a single vector equation:

∇I^T u = -I_t

where:

∇I = [I_x, I_y]^T and u = [u, v]^T

Unfortunately, for this single equation there are two unknowns for any given pixel. This means that measurements at the single-pixel level are underconstrained and cannot be used to obtain a unique solution for the two-dimensional motion at that point. Instead, we can solve only for the motion component that is perpendicular or “normal” to the line described by our flow equation. Figure 16-7 illustrates the geometry.

Figure 16-7. Two-dimensional optical flow at a single pixel: optical flow at one pixel is underdetermined and so can yield at most motion, which is perpendicular (“normal”) to the line described by the flow equation (figure courtesy of Michael Black)

Normal optical flow results from the aperture problem, which arises when you have a small aperture or window in which to measure motion. When motion is detected with a small aperture, you often see only an edge, not a corner. But an edge alone is insufficient to determine exactly how (i.e., in what direction) the entire object is moving; see Figure 16-8.

Figure 16-8. Aperture problem: (a) an object is moving to the right and down; (b) through a small aperture, we see an edge moving to the right but cannot detect the downward part of the motion

So then how do we get around this problem that, at one pixel, we cannot resolve the full motion? We turn to the last optical flow assumption for help. If a local patch of pixels moves coherently, then we can easily solve for the motion of the central pixel by using the surrounding pixels to set up a system of equations. For example, if we use a 5 × 5 window6 of brightness values (you can simply triple this for color-based optical flow) around the current pixel to compute its motion, we can then set up 25 equations as follows:

I_x(p_i) u + I_y(p_i) v = -I_t(p_i),   for i = 1, …, 25

or, in matrix form, A u = b, where A is the 25 × 2 matrix whose ith row is [I_x(p_i)  I_y(p_i)], u = [u, v]^T, and b is the 25 × 1 vector whose ith entry is -I_t(p_i).

We now have an overconstrained system that we can solve, provided it contains more than just an edge in that 5 × 5 window. To solve this system, we set up a least-squares minimization of the error, min_u ||A u - b||², which is solved in standard form as:

(A^T A) u = A^T b

From this relation we obtain our u and v motion components. Writing this out in more detail yields:

(Σ I_x I_x) u + (Σ I_x I_y) v = -Σ I_x I_t   and   (Σ I_x I_y) u + (Σ I_y I_y) v = -Σ I_y I_t

where each sum runs over the 25 pixels in the window.

The solution to this equation is then:

u = (A^T A)^-1 A^T b

When can this be solved? When (A^T A) is invertible. And (A^T A) is invertible when it has full rank (2), which occurs when it has two large eigenvalues. This will happen in image regions that include texture running in at least two directions. In this case, (A^T A) will have the best properties when the tracking window is centered over a corner region in an image. This ties us back to our earlier discussion of the Harris corner detector. In fact, those corners were “good features to track” (see our previous remarks concerning cv::goodFeaturesToTrack()) for precisely the reason that (A^T A) had two large eigenvalues there! We’ll see shortly how all this computation is done for us by the cv::calcOpticalFlowPyrLK() function.
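If you are curious to see this corner-strength quantity directly, OpenCV can compute the smaller eigenvalue of this 2 × 2 gradient matrix at every pixel with cv::cornerMinEigenVal(); the following sketch (img is assumed to hold a grayscale image) simply visualizes it:

// Sketch: visualize the minimum eigenvalue of the 2 x 2 gradient matrix at
// every pixel; bright areas are where (A^T A) is well conditioned.
cv::Mat eig;
cv::cornerMinEigenVal( img, eig, 3 );                        // 3 x 3 neighborhood
cv::normalize( eig, eig, 0, 255, cv::NORM_MINMAX, CV_8U );   // rescale for display
cv::imshow( "Corner strength (min eigenvalue)", eig );
cv::waitKey( 0 );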

The reader who understands the implications of our assuming small and coherent motions will now be bothered by the fact that, for most video cameras running at 30 Hz, large and noncoherent motions are commonplace. In fact, Lucas-Kanade optical flow by itself does not work very well for exactly this reason: we want a large window to catch large motions, but a large window too often breaks the coherent motion assumption! To circumvent this problem, we can track first over larger spatial scales using an image pyramid and then refine the initial motion velocity assumptions by working our way down the levels of the image pyramid until we arrive at the raw image pixels.

Hence, the recommended technique is first to solve for optical flow at the top layer and then to use the resulting motion estimates as the starting point for the next layer down. We continue going down the pyramid in this manner until we reach the lowest level. Thus we minimize the violations of our motion assumptions and so can track faster and longer motions. This more elaborate function is known as pyramid Lucas-Kanade optical flow and is illustrated in Figure 16-9. The OpenCV function that implements Pyramid Lucas-Kanade optical flow is cv::calcOpticalFlowPyrLK(), which we examine next.

Figure 16-9. Pyramid Lucas-Kanade optical flow: running optical flow at the top of the pyramid first mitigates the problems caused by violating our assumptions of small and coherent motion; the motion estimate from the preceding level is taken as the starting point for estimating motion at the next layer down

Pyramid Lucas-Kanade code: cv::calcOpticalFlowPyrLK()

We come now to OpenCV’s algorithm that computes Lucas-Kanade optical flow in a pyramid, cv::calcOpticalFlowPyrLK(). As we will see, this optical flow function makes use of “good features to track” and also returns indications of how well the tracking of each point is proceeding.

void cv::calcOpticalFlowPyrLK(
  cv::InputArray       prevImg,            // Prior image (t-1), CV_8UC1
  cv::InputArray       nextImg,            // Next image (t), CV_8UC1
  cv::InputArray       prevPts,            // Vector of 2d start points (CV_32F)
  cv::InputOutputArray nextPts,            // Results: 2d end points (CV_32F)
  cv::OutputArray      status,             // For each point, found=1, else=0
  cv::OutputArray      err,                // Error measure for found points
  cv::Size             winSize         = Size(15,15),   // size of search window
  int                  maxLevel        = 3,             // Pyramid layers to add
  cv::TermCriteria     criteria        = TermCriteria(  // How to end search
                         cv::TermCriteria::COUNT | cv::TermCriteria::EPS,
                         30,
                         0.01
                       ),
  int                  flags           = 0,    // use guesses, and/or eigenvalues
  double               minEigThreshold = 1e-4  // for spatial gradient matrix
);

This function has a lot of inputs, so let’s take a moment to figure out what they all do. Once we have a handle on this routine, we can move on to the problem of which points to track and how to compute them. The basic plan is simple, however: you supply the images, list the points you want to track in prevPts, and call the routine. When the routine returns, you check the status array to see which points were successfully tracked and then check nextPts to find the new locations of those points. Now let’s proceed into the details.

The first two arguments of cv::calcOpticalFlowPyrLK(), prevImg and nextImg, are the initial and final images. Both should be the same size and have the same number of channels.7 The next two arguments, prevPts and nextPts, are the input list of features from the first image, and the output list to which matched points in the second image will be written. These can be either N × 2 arrays or vectors of points. The arrays status and err will be filled with information to tell you how successful the matching was. In particular, each entry in status will tell whether the corresponding feature in prevPts was found at all (status[i] will be nonzero if and only if prevPts[i] was found in nextImg). Similarly, err[i] will indicate an error measure for any point prevPts[i] that was found in nextImg (if point i was not found, then err[i] is not defined).

The window used for computing the local coherent motion is given by winSize. Because we are constructing an image pyramid, the argument maxLevel is used to set the depth of the stack of images. If maxLevel is set to 0, then the pyramids are not used. The argument criteria is used to tell the algorithm when to quit searching for matches; recall that cv::TermCriteria is the structure used by many OpenCV algorithms that iterate to a solution:

class cv::TermCriteria {

public:
  enum {
    COUNT    = 1,
    MAX_ITER = COUNT,
    EPS      = 2
  };

  TermCriteria();
  TermCriteria( int _type, int _maxCount, double _epsilon );

  int    type;      // one of the enum types above (possibly OR'ed together)
  int    maxCount;  // maximum number of iterations
  double epsilon;   // required accuracy
};

The default values will be satisfactory for most situations. As is often the case, however, if your image is unusually large, you may want to slightly increase the maximum allowed number of iterations.

The argument flags can have one or both of the following values:

cv::OPTFLOW_LK_GET_MIN_EIGENVALS
Set this flag for a somewhat more detailed error measure. The default error measure for the error output is the average per-pixel change in intensity between the window around the previous corner and the window around the new corner. With this flag set to true, that error is replaced with the minimum eigenvalue of the Harris matrix associated with the corner.8
cv::OPTFLOW_USE_INITIAL_FLOW
Use when the array nextPts already contains an initial guess for the feature’s coordinates when the routine is called. (If this flag is not set, then the initial guesses will just be the point locations in prevPts.)

The final argument, minEigThreshold, is used as a filter for removing points that are, in fact, not such good choices to track after all. In effect, it is somewhat analogous to the qualityLevel argument to cv::goodFeaturesToTrack(), except its exact method of computation is different. The default value of 1e-4 is a good choice; it can be increased in order to throw away more points.
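For example, here is a sketch of a call that uses both flags; prevImg, nextImg, prevPts, and guessPts are assumed to exist already (the guesses might come from a motion model or a previous tracking step):

// Sketch: seed the search with our own guesses and request eigenvalue errors.
std::vector<cv::Point2f> nextPts = guessPts;     // our own initial guesses
std::vector<uchar>       status;
std::vector<float>       err;
cv::calcOpticalFlowPyrLK(
  prevImg, nextImg, prevPts, nextPts, status, err,
  cv::Size( 21, 21 ),                            // search window
  3,                                             // pyramid levels to add
  cv::TermCriteria( cv::TermCriteria::COUNT | cv::TermCriteria::EPS, 30, 0.01 ),
  cv::OPTFLOW_USE_INITIAL_FLOW | cv::OPTFLOW_LK_GET_MIN_EIGENVALS
);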

A worked example

Putting this all together, we now know how to find features that are good ones to track, and we know how to track them. We can obtain good results by using the combination of cv::goodFeaturesToTrack() and cv::calcOpticalFlowPyrLK(). Of course, you can also use your own criteria to determine which points to track.

Let’s now look at a simple example (Example 16-1) that uses both cv::goodFeaturesToTrack() and cv::calcOpticalFlowPyrLK(); see also Figure 16-10.

Example 16-1. Pyramid Lucas-Kanade optical flow code
// Pyramid L-K optical flow example
//
#include <opencv2/opencv.hpp>
#include <iostream>

using namespace std;

static const int MAX_CORNERS = 1000;

void help( char** argv ) {
    cout << "Call: " << argv[0] << " [image1] [image2]" << endl;
    cout << "Demonstrates Pyramid Lucas-Kanade optical flow." << endl;
}

int main(int argc, char** argv) {

  if( argc != 3 ) { help( argv ); return -1; }

  // Initialize, load two images from the file system, and
  // allocate the images and other structures we will need for
  // results.
  //
  cv::Mat imgA = cv::imread(argv[1], cv::IMREAD_GRAYSCALE);
  cv::Mat imgB = cv::imread(argv[2], cv::IMREAD_GRAYSCALE);
  cv::Size img_sz = imgA.size();
  int win_size = 10;
  cv::Mat imgC = cv::imread(argv[2], cv::IMREAD_UNCHANGED);

  // The first thing we need to do is get the features
  // we want to track.
  //
  vector< cv::Point2f > cornersA, cornersB;
  
  cv::goodFeaturesToTrack(
    imgA,                              // Image to track
    cornersA,                          // Vector of detected corners (output)
    MAX_CORNERS,                       // Keep up to this many corners
    0.01,                              // Quality level (percent of maximum)
    5,                                 // Min distance between corners
    cv::noArray(),                     // Mask
    3,                                 // Block size
    false,                             // true: Harris, false: Shi-Tomasi
    0.04                               // method specific parameter
  );
  cv::cornerSubPix(
    imgA,                              // Input image
    cornersA,                          // Vector of corners (input and output)
    cv::Size(win_size, win_size),      // Half side length of search window
    cv::Size(-1,-1),                   // Half side length of dead zone (-1=none)
    cv::TermCriteria(
      cv::TermCriteria::MAX_ITER | cv::TermCriteria::EPS,
      20,                              // Maximum number of iterations
      0.03                             // Minimum change per iteration
    )
  );

  // Call the Lucas Kanade algorithm
  //
  vector<uchar> features_found;
  cv::calcOpticalFlowPyrLK(
    imgA,                              // Previous image
    imgB,                              // Next image
    cornersA,                          // Previous set of corners (from imgA)
    cornersB,                          // Next set of corners (from imgB)
    features_found,                    // Output vector, elements are 1 for tracked
    cv::noArray(),                     // Output vector, lists errors (optional)
    cv::Size( win_size*2+1, win_size*2+1 ), // Search window size
    5,                                 // Maximum pyramid level to construct
    cv::TermCriteria(
      cv::TermCriteria::MAX_ITER | cv::TermCriteria::EPS,
      20,                              // Maximum number of iterations
      0.3                              // Minimum change per iteration
    )
  );

  // Now make some image of what we are looking at:
  // Note that if you want to track cornersB further, i.e.
  // pass them as input to the next calcOpticalFlowPyrLK,
  // you would need to "compress" the vector, i.e., exclude points for which
  // features_found[i] == false.
  for( int i = 0; i < (int)cornersA.size(); i++ ) {
    if( !features_found[i] )
      continue;
    cv::line( imgC, cornersA[i], cornersB[i], cv::Scalar(0,255,0), 2, cv::LINE_AA );
  }
  cv::imshow( "ImageA", imgA );
  cv::imshow( "ImageB", imgB );
  cv::imshow( "LK Optical Flow Example", imgC );
  cv::waitKey(0);

  return 0;
}

Generalized Keypoints and Descriptors

Two of the essential concepts that we will need to understand tracking, object detection, and a number of related topics are keypoints and descriptors. Our first task is to understand what these two things are and how they differ from one another.

At the highest level of abstraction, a keypoint is a small portion of  an image that, for one reason or another, is unusually distinctive, and which we believe we might be able to locate in another related image. A descriptor is some mathematical construction, typically (but not always) a vector of floating-point values, which somehow describes an individual keypoint, and which can be used to determine whether—in some context—two keypoints are “the same.”9

Historically, one of the first important keypoint types was the Harris corner, which we encountered at the beginning of this chapter. Recall that the basic concept behind the Harris corner was that any point in an image that seemed to have a strong change in intensity along two different axes was a good candidate for matching another related image (for example, an image in a subsequent frame of a video stream).

Take a look at the images in Figure 16-10; the image is of text from a book.10 You can see that the Harris corners are found at the locations where lines that make up the individual text characters begin or end, or where there are intersections of lines (such as the middle of an h or b). You will notice that “corners” do not appear along the edges of long lines in the characters, only at the ends. This is because a feature found on such an edge looks very much like any other feature found anywhere on that edge. It stands to reason that if something is not unique in the current image, it will not be unique in another image, so such pixels do not qualify as good features.

Figure 16-10. Two images containing text, with the Harris corners shown on the right as white circles; note that in the portion of the image that is slightly out of focus, no corners are detected at all

To our human eyes, each feature will look somewhat different than any other. The question of feature descriptors addresses the problem of how to have the computer make such associations. We mentioned that the three-way intersections that appear in an h or a b make good “features,” but how would the computer tell the difference between them? This is what feature descriptors do.

We could construct a feature descriptor any way we would like—for example, we might make a vector from the intensity values of the 3 × 3 area around the keypoint. A problem with such a prescription is that the descriptor can have different values if the keypoint is seen from even a slightly different angle. In general, rotational invariance11 is a desirable property for a feature descriptor. Of course, whether or not you need a rotationally invariant descriptor depends on your application. When detecting and tracking people, gravity plays a strong role in creating an asymmetric world in which people’s heads are usually at the top and their feet are usually at the bottom. In such applications, a descriptor lacking rotational symmetry may not be a problem. In contrast, aerial imagery of the ground “rotates” when the aircraft travels in a different direction, and so the imagery may appear in what seems to be a random orientation.
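To make the naive patch descriptor mentioned at the start of this discussion concrete, here is a sketch (img is assumed to be a grayscale image and p the location of some keypoint); as noted, such a descriptor is not rotation invariant and is shown only for illustration:

// Naive descriptor sketch: the raw 3 x 3 intensity patch around a point,
// flattened into a 1 x 9 row vector. Not rotation invariant!
cv::Point2f p( 120.5f, 84.25f );                 // some keypoint location
cv::Mat patch;
cv::getRectSubPix( img, cv::Size( 3, 3 ), p, patch );
cv::Mat naiveDescriptor = patch.reshape( 1, 1 ); // one row, nine columns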

Optical Flow, Tracking, and Recognition

In the previous section, we discussed optical flow in the context of the Lucas-Kanade algorithm. In that case, OpenCV provided us with a single high-level tool that would take a list of keypoints and try to locate the best matches for those points in a new frame. These points did not have a lot of structure or identity, just enough to make them locatable from one frame to the next in two very similar frames. Generalized keypoints can have associated descriptors that are very powerful and will allow those points to be matched not only in sequential frames of video, but even in completely different images. This can allow us to locate known objects in new environments, or to track an object through complex changing scenery, as well as many other applications.

The three major categories of tasks for which keypoints (and their descriptors) are useful are tracking, object recognition, and stereoscopy.

Tracking is the task of following the motion of something as the scene evolves in a sequential image stream. Tracking comes up in two major subcategories: the first is tracking objects in a stationary scene, and the other is tracking the scene itself for the purpose of estimating the motion of the camera. Usually, the term tracking alone refers to the former, and the latter is referred to as visual odometry. Of course, it is very common to want to do both of these things at once.

The second task category is object recognition. In this case, one is looking at a scene and attempting to recognize the presence of one or more known objects. The idea here is to associate certain keypoint descriptors with each object, with the reasoning that if you were to see enough keypoints associated with a particular object, you could reasonably conclude that this object was present in the scene.

Finally, there is stereoscopy. In this case, we are interested in locating corresponding points in two or more camera views of the same scene or object. Combining these locations with information about the camera locations and the optical properties of the cameras themselves, we can compute the location, in three dimensions, of the individual points we were able to match.

OpenCV provides methods for handling many types of keypoints and many kinds of keypoint descriptors. It also provides methods for matching them, either between pairs of frames (in the manner of sparse optical flow, tracking and visual odometry, or stereoscopy) or between a frame and a database of images (for object recognition).

How OpenCV Handles Keypoints and Descriptors, the General Case

When you are doing tracking, as well as many other kinds of analysis for which keypoints and their descriptors are useful, there are typically three things you would like to do. The first is to search an image and find all of the keypoints that are in that image, according to some keypoint definition. The second is to create a descriptor for every keypoint you found. The third is to compare those keypoints you found, by means of their descriptors, to some existing set of descriptors, and see whether you can find any matches. In tracking applications, this last step involves looking for features in one frame of a sequence and trying to match those with features in the previous frame. In object-detection applications, one is often searching for features in some (potentially vast) database of “known” features that are “known” to be associated with either individual objects or object classes.

For each of these layers, OpenCV provides a generalized mechanism that follows the “classes that do stuff” (functor) model. Thus, for each of these stages, there is an abstract base class that defines a common interface for a family of objects that are derived from it; each derived class implements a particular algorithm.

The cv::KeyPoint object

Of course, if we are going to find keypoints, we will need some way to represent them. Recall that the keypoint is not the feature descriptor, so most of what we need to know when we have a keypoint is where it is located. After that, there are some secondary features that some keypoints have and some do not. Here is the actual definition of the cv::KeyPoint class:

class cv::KeyPoint {

public:

  cv::Point2f pt;       // coordinates of the keypoint
  float       size;     // diameter of the meaningful keypoint neighborhood
  float       angle;    // computed orientation of the keypoint (-1 if none)
  float       response; // response for which the keypoint was selected
  int         octave;   // octave (pyramid layer) keypoint was extracted from
  int         class_id; // object id, can be used to cluster keypoints by object

  cv::KeyPoint(
    cv::Point2f _pt,
    float       _size,
    float       _angle=-1,
    float       _response=0,
    int         _octave=0,
    int         _class_id=-1
  );
  cv::KeyPoint(
    float       x,
    float       y,
    float       _size,
    float       _angle=-1,
    float       _response=0,
    int         _octave=0,
    int         _class_id=-1
  );
...
};

As you can see, every keypoint has a cv::Point2f member that just tells us where it is located.12 The concept of size tells us something about the region around the keypoint that either was somehow included in the determination that the keypoint existed in the first place, or is going to play a role in that keypoint’s descriptor. The angle of a keypoint is meaningful only for some keypoints. Many keypoints achieve rotational symmetry not by actually being invariant in the strictest sense, but by having some kind of natural orientation that you can take into account when comparing two descriptors. (This is not a complicated idea; if you were looking at two images of pencils, rotation obviously matters, but if you wanted to compare them, you could easily visualize them both in the same orientation before making the comparison.)

The response is used for detectors that can respond “more strongly” to one keypoint than another. In some cases, this can be interpreted as a probability that the feature is in fact present. The octave is used when the keypoint was found in an image pyramid. In these cases, it is important to know which scale a keypoint was found at because, in most cases, we would expect to find matches at the same or similar scale in new images. Finally, there is the class ID. You use class_id when constructing keypoint databases to distinguish the keypoints that are associated with one object from those that are associated with another (we will return to this point when we discuss the keypoint matching interface in the section “The (abstract) keypoint matching class: cv::DescriptorMatcher”).

The cv::KeyPoint object has two constructors, which are essentially the same; their only difference is that you can set the location of the keypoint either with two floating-point numbers or with a single cv::Point2f object. In fact, though, unless you are writing your own keypoint finder, you will not tend to use these functions. If you are using the detection, descriptor construction, and comparison functions available to you from the library, you will typically never even look inside the keypoint objects.

The (abstract) class that finds keypoints and/or computes descriptors for them: cv::Feature2D

For finding keypoints or computing descriptors (or performing both these tasks simultaneously), OpenCV provides the cv::Feature2D class. There are classes called cv::FeatureDetector and cv::DescriptorExtractor that used to be separate classes for pure feature detection or descriptor extraction algorithms, but since OpenCV 3.x they are all synonyms of cv::Feature2D. This abstract class has just a few methods, described next. Derived classes add some more methods to set and retrieve various properties, as well as static methods that construct the algorithm instance and return a smart pointer to it. There are two methods for actually asking the cv::Feature2D class to detect keypoints, functions that allow you to save and restore from disk, and a handy (static) function that allows you to create a feature detector–derived class by providing the name of the detector type (as a string). The two provided detection methods (as well as the two provided “compute” methods) differ only in that one operates on a single image, while the other operates on a set of images. There is some amount of efficiency to be gained by operating on many images at once (this is more true for some detectors than for others). Here is the relevant excerpt from the class description of cv::Feature2D:

class cv::Feature2D : public cv::Algorithm {

public:

  virtual void detect(
    cv::InputArray                   image,      // Image on which to detect
    vector< cv::KeyPoint >&          keypoints,  // Array of found keypoints
    cv::InputArray                   mask     = cv::noArray()
  ) const;

  virtual void detect(
    cv::InputArrayOfArrays           images,     // Images on which to detect
    vector<vector< cv::KeyPoint > >& keypoints,  // keypoints for each image
    cv::InputArrayOfArrays           masks    = cv::noArray()
  ) const;

  virtual void compute(
    cv::InputArray image,                 // Image where keypoints are located
    std::vector<cv::KeyPoint>& keypoints, // input/output vector of keypoints
    cv::OutputArray descriptors );        // computed descriptors, M x N matrix,
                                          // where M is the number of keypoints
                                          // and N is the descriptor size
  virtual void compute(
    cv::InputArrayOfArrays image,          // Images where keypoints are located
    std::vector<std::vector<cv::KeyPoint> >& keypoints, //I/O vec of keypnts
    cv::OutputArrayOfArrays descriptors ); // computed descriptors,
                                           // vector of (Mi x N) matrices, where
                                           // Mi is the number of keypoints in
                                           // the i-th image and N is the 
                                           // descriptor size
  virtual void detectAndCompute(
    cv::InputArray image,                 // Image on which to detect
    cv::InputArray mask,                  // Optional region of interest mask 
    std::vector<cv::KeyPoint>& keypoints, // found or provided keypoints
    cv::OutputArray descriptors,          // computed descriptors
    bool useProvidedKeypoints=false );    // if true,
                                          // the provided keypoints are used,
                                          // otherwise they are detected

  virtual int descriptorSize() const;     // size of each descriptor in elements
  virtual int descriptorType() const;     // type of descriptor elements
  virtual int defaultNorm() const;        // the recommended norm to be used
                                          // for comparing descriptors.
                                          // Usually, it's NORM_HAMMING for 
                                          // binary descriptors and NORM_L2 
                                          // for all others.

  virtual void read( const cv::FileNode& );
  virtual void write( cv::FileStorage& ) const;

  ...
};

The actual implementation may implement just cv::Feature2D::detect(), if it’s a pure keypoint detection algorithm (like FAST); just cv::Feature2D::compute(), if it’s a pure feature description algorithm (like FREAK); or cv::Feature2D::detectAndCompute() in the case of a “complete solution” algorithm, like SIFT, SURF, ORB, BRISK, and so on, in which case detect() and compute() will call it implicitly:

  • detect(image, keypoints, mask) ~ detectAndCompute(image, mask, keypoints, noArray(), false)

  • compute(image, keypoints, descriptors) ~ detectAndCompute(image, noArray(), keypoints, descriptors, true)

The cv::Feature2D::detect() methods are the ones that do the basic work of finding keypoints for you—directly or through a call to detectAndCompute(). The first takes an image, a vector of keypoints, and an optional mask. It then searches the image (or the portion corresponding to the mask, if you provided one) for keypoints, and places whatever it finds in the vector you provided. The second variant does exactly the same thing, except that it expects a vector of images, a vector of masks (or none at all), and a vector of vectors of keypoints. The number of images, the number of masks (if not zero), and the number of keypoint vectors must all be equal. Then every image will be searched, and the corresponding keypoint vector in keypoints will be filled with the found keypoints.

The actual method used to find the keypoints is, of course, different for each of the many available derived classes from cv::Feature2D. We will get to the details of these shortly. In the meantime, it is important to keep in mind that the actual keypoints detected (as well as whether and how some of the internal parameters are used) may differ from one detector to another. This will mean that keypoints detected by one method may not be universally usable by every kind of feature descriptor.

Once you have found your keypoints, the next thing to do is to compute descriptors for them. As was mentioned earlier, these descriptors will allow you to compare keypoints to one another (based on their appearance, as opposed to by their location). This capability can then serve as the basis for tracking or object recognition.

The descriptors are computed by the compute() methods (which may also use detectAndCompute()). The first compute() method requires an image, a list of keypoints (presumably produced by the detect() method of the same class, but possibly a different one), and the output cv::Mat (passed as cv::OutputArray) that is used by compute() as a place to put the computed features. All descriptors generated by objects derived from cv::Feature2D are representable as vectors of fixed length. As a result, it’s possible to arrange them all into an array. The convention used is that each row of descriptors is a separate descriptor, and the number of such rows is equal to the number of elements in keypoints.

The second compute() method is used when you have several images you would like to process at once. In this case the images argument expects an STL vector containing all the images you would like to process. The keypoints argument expects a vector containing vectors of keypoints, and the descriptors argument expects a vector containing arrays that can be used to store all of the resulting descriptors. This method is a companion to the multiple-image detect() method and expects the images as you would have given them to that detect() method, and the keypoints as you would have received them from that method.

Depending on when the algorithm was implemented, many of the keypoint feature detectors and extractors are bundled together into a single object. In this case the method cv::Feature2D::detectAndCompute() is implemented. Whenever a certain algorithm provides detectAndCompute(), it is strongly recommended that you use it instead of subsequent calls to detect() and then to compute(). The obvious reason for this is better performance: normally such algorithms require a special image representation (called a scale-space representation), and it may be quite expensive to compute. When you find keypoints and compute descriptors in two separate steps, the scale-space representation is basically computed twice.
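As a concrete sketch of this one-call pattern, here is what it looks like with ORB (chosen purely as an example of a “complete solution” algorithm; img is assumed to hold an 8-bit grayscale image):

cv::Ptr<cv::ORB> orb = cv::ORB::create();        // default parameters
std::vector<cv::KeyPoint> keypoints;
cv::Mat descriptors;
orb->detectAndCompute( img, cv::noArray(), keypoints, descriptors );
// 'descriptors' now has one row per keypoint; for ORB each row is a
// 32-byte binary descriptor, so descriptorType() is CV_8U and
// defaultNorm() is cv::NORM_HAMMING.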

In addition to the main compute methods, there are the methods descriptorSize(), descriptorType(), and defaultNorm(). The first of these, descriptorSize(), tells us the length of the vector representation of the descriptor.13 The descriptorType() method returns information about the specific type of the elements of the vector descriptor (e.g., CV_32FC1 for 32-bit floating-point numbers, CV_8UC1 for 8-bit or binary descriptors, etc.).14 The defaultNorm() method basically tells you how to compare the descriptors. In the case of binary descriptors, it is NORM_HAMMING, and that is what you should use. In the case of descriptors like SIFT or SURF, it is NORM_L2 (i.e., Euclidean distance), though you can obtain equally good or even better results by using NORM_L1.

The cv::DMatch object

Before we get deep into the topic of how matchers work, we need to know how matches are expressed. In general, a matcher will be an object that tries to match keypoints in one image with either a single other image or a  collection of other images called a dictionary. When matches are found, OpenCV describes them by generating lists (STL vectors) of cv::DMatch objects. Here is the class definition for the cv::DMatch object:

class cv::DMatch {

public:

  DMatch();              // sets this->distance 
                         // to std::numeric_limits<float>::max()

  DMatch( int _queryIdx, int _trainIdx, float _distance );
  DMatch( int _queryIdx, int _trainIdx, int _imgIdx, float _distance );

  int   queryIdx;        // query descriptor index
  int   trainIdx;        // train descriptor index
  int   imgIdx;          // train image index
  float distance;

  bool operator<( const DMatch &m ) const; // Comparison operator 
                                           // based on 'distance'

};

The data members of cv::DMatch are queryIdx, trainIdx, imgIdx, and distance. The first two identify the keypoints that were matched relative to the keypoint lists in each image. The convention used by the library is to always refer to these two images as the query image (the “new” image) and the training image (the “old” image), respectively.15 imgIdx identifies which training image the matched descriptor came from, in cases where a match was sought between an image and a dictionary. The last member, distance, indicates the quality of the match. In many cases, this is something like a Euclidean distance between the two keypoints in the many-dimensional vector space in which they live. Though this is not always the metric, it is guaranteed that if two different matches have different distance values, the one with the lower distance is the better match. To facilitate such comparisons (particularly for the purpose of sorting), the operator cv::DMatch::operator<() is defined; it compares two cv::DMatch objects directly by comparing their distance members.
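
For example, a common pattern is to sort a list of matches by quality and keep only the best few; a minimal sketch:

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

// Keep only the n best matches. cv::DMatch::operator<() compares 'distance',
// so an ordinary sort puts the best (smallest-distance) matches first.
void keepBestMatches( std::vector<cv::DMatch>& matches, size_t n ) {
  std::sort( matches.begin(), matches.end() );
  if( matches.size() > n ) matches.resize( n );
}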

The (abstract) keypoint matching class: cv::DescriptorMatcher

The third stage in the process of detect-describe-match is also implemented through a family of objects that are all derived from a common abstract base class whose interface they all share. This third layer is based on the cv::DescriptorMatcher class.

Before we get deep into how the interface works, it is important to understand that there are two basic situations in which you would want to use a matcher: object recognition and tracking, as described earlier in the chapter. In the object recognition case, one first compiles keypoints associated with a variety of objects into a kind of database called a dictionary. Thereafter, when a new scene is presented, the keypoints in that image are extracted and compared to the dictionary to estimate what objects from the dictionary might be present in the new scene. In the tracking case, the goal is to find all the keypoints in some image, typically from a video stream, and then to look for all of those keypoints in another image, typically the prior or next image in that video stream.

Because there are these two different situations, the matching methods in the class have two corresponding variations. For the object recognition case, we need to first train the matcher with a dictionary of descriptors, and then be able to present a single descriptor list to the matcher and have it tell us which (if any) of the keypoints it has stored are matches with the ones in the list we provided. For the tracking case, we would like to provide two lists of descriptors and have the matcher tell us where the matches are between them. The cv::DescriptorMatcher class interface provides three functions, match(), knnMatch(), and radiusMatch(); and for each, there are two different prototypes—one for recognition (takes one list of features and uses the trained dictionary) and another for tracking (takes two lists of features).

Here is the part of the class definition relevant to us for the generic descriptor matcher base class:

class cv::DescriptorMatcher {

public:

  virtual void add( InputArrayOfArrays descriptors );  // Add train descriptors
  virtual void clear();                                // Clear train descriptors
  virtual bool empty() const;                          // true if no descriptors
  void train();                                        // Train matcher
  virtual bool isMaskSupported() const = 0;            // true if supports masks
  const vector<cv::Mat>& getTrainDescriptors() const;  // Get train descriptors

  // methods to match descriptors from one list vs. "trained" set (recognition)
  //
  void match(
    InputArray                    queryDescriptors,
    vector<cv::DMatch>&           matches,
    InputArrayOfArrays            masks           = noArray()
  );
  void knnMatch(
    InputArray                    queryDescriptors,
    vector< vector<cv::DMatch> >& matches,
    int                           k,
    InputArrayOfArrays            masks           = noArray(),
    bool                          compactResult   = false
  );
  void radiusMatch(
    InputArray                    queryDescriptors,
    vector< vector<cv::DMatch> >& matches,
    float                         maxDistance,
    InputArrayOfArrays            masks           = noArray(),
    bool                          compactResult   = false
  );

  // methods to match descriptors from two lists (tracking)
  //
  // Find one best match for each query descriptor
  void match(
    InputArray                    queryDescriptors,
    InputArray                    trainDescriptors,
    vector<cv::DMatch>&           matches,
    InputArray                    mask            = noArray()
  ) const;
  // Find k best matches for each query descriptor (in increasing 
  // order of distances)
  void knnMatch(
    InputArray                    queryDescriptors,
    InputArray                    trainDescriptors,
    vector< vector<cv::DMatch> >& matches,
    int                           k,
    InputArray                    mask            = noArray(),
    bool                          compactResult   = false
  ) const;
  // Find best matches for each query descriptor with distance less 
  // than maxDistance
  void radiusMatch(
    InputArray                    queryDescriptors,
    InputArray                    trainDescriptors,
    vector< vector<cv::DMatch> >& matches,
    float                         maxDistance,
    InputArray                    mask            = noArray(),
    bool                          compactResult   = false
  ) const;

  virtual void read( const FileNode& );     // Reads matcher from a file node
  virtual void write( FileStorage& ) const; // Writes matcher to a file storage

  virtual cv::Ptr<cv::DescriptorMatcher> clone( 
         bool emptyTrainData=false 
  ) const = 0;
  static cv::Ptr<cv::DescriptorMatcher> create( 
         const string& descriptorMatcherType 
  );
...
};

The first set of methods is used to manage a prestored set of descriptors, one array per image. The purpose is to build up a keypoint dictionary that can be referenced when novel keypoints are provided. The first of these is the add() method, which expects an STL vector of sets of descriptors, each of which is in the form of a cv::Mat object. Each cv::Mat object should have N rows and D columns, where N is the number of descriptors in the set, and D is the dimensionality of each descriptor (i.e., each “row” is a separate descriptor of dimension D). The reason that add() accepts an array of arrays (usually represented as std::vector<cv::Mat>) is that in practice, one often computes a set of descriptors from each image in a set of images.16 That set of images was probably presented to a cv::Feature2D-based class as a vector of images, and returned as a vector of sets of keypoint descriptors.

Once you have added some number of keypoint descriptor sets, if you would like to access them, you may do so with the (constant) method getTrainDescriptors(), which will return the descriptors to you in the same way you first provided them (i.e., as a vector of descriptor sets, each of type cv::Mat, with each row of each cv::Mat being a single descriptor). If you would like to clear the added descriptors, you may do so with the clear() method, and if you would like to test whether a matcher has any descriptors stored in it at all, you may do so with the empty() method (which returns true when it does not).

Once you have loaded all of the keypoint descriptors you would like to load, you may need to call train(). Only some implementations require the train() operation (i.e., some classes derived from cv::DescriptorMatcher). This method’s purpose is to tell the matcher that you are done loading images, and that it can proceed to precompute any internal information that it will need in order to perform matches on the provided keypoints. By way of example, if a matcher performs matches using only the Euclidean distance between a provided new keypoint and those in the existing dictionary, it would be prudent to construct a quad-tree or similar data structure to greatly accelerate the task of finding the closest dictionary keypoint to the provided keypoint. Such data structures can require substantial effort to compute, and are computed only once after all of the dictionary keypoints are loaded. The train() method tells the matcher to take the time to compute these sorts of adjunct internal data structures. Typically, if a train() method is provided, you must call it before calling any matching method that uses the internal dictionary.

The next set of methods is the set of matching methods used in object recognition. They each take a list of descriptors, called a query list, which they compare with the descriptors in the trained dictionary. Within this set, there are three methods: match(), knnMatch(), and radiusMatch(). Each of these methods computes matches in a slightly different way.

The match() method expects a single list of keypoint descriptors, queryDescriptors, in the usual cv::Mat form. In this case, recall that each row represents a single descriptor, and each column is one dimension of that descriptor’s vector representation. match() also expects an STL vector of cv::DMatch objects that it can fill with the individual detected matches. In the case of the match() method, each keypoint on the query list will be matched to the “best match” from the train list.

The match() method also supports an optional mask argument. Unlike most mask arguments in OpenCV, this mask does not operate in the space of pixels, but rather in the space of descriptors. The type of mask, however, should still be CV_8U. The mask argument is an STL-vector of cv::Mat objects. Each entire matrix in that vector corresponds to one of the training images in the dictionary.17 Each row in a particular mask corresponds to a row in queryDescriptors (i.e., one descriptor). Each column in the mask corresponds to one descriptor associated with the dictionary image. Thus, masks[k].at<uchar>(i,j) should be nonzero if descriptor j from image (object) k should be compared with descriptor i from the query image.
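
Putting the pieces so far together, here is a minimal sketch of building a small dictionary and querying it with match(); cv::ORB is used as the feature class (so a Hamming-distance matcher is appropriate), and the filenames are hypothetical:

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
  // Hypothetical filenames for a tiny two-object "dictionary" and a query scene
  std::vector<cv::Mat> objectImages = {
    cv::imread( "object0.jpg", cv::IMREAD_GRAYSCALE ),
    cv::imread( "object1.jpg", cv::IMREAD_GRAYSCALE )
  };
  cv::Mat scene = cv::imread( "scene.jpg", cv::IMREAD_GRAYSCALE );
  for( const auto& im : objectImages ) if( im.empty() ) return 1;
  if( scene.empty() ) return 1;

  cv::Ptr<cv::ORB> orb = cv::ORB::create();

  // One descriptor matrix per dictionary image
  std::vector< std::vector<cv::KeyPoint> > objectKeypoints( objectImages.size() );
  std::vector< cv::Mat >                   objectDescriptors( objectImages.size() );
  for( size_t i = 0; i < objectImages.size(); ++i )
    orb->detectAndCompute( objectImages[i], cv::noArray(),
                           objectKeypoints[i], objectDescriptors[i] );

  std::vector<cv::KeyPoint> sceneKeypoints;
  cv::Mat                   sceneDescriptors;
  orb->detectAndCompute( scene, cv::noArray(), sceneKeypoints, sceneDescriptors );

  // ORB produces binary descriptors, so use a Hamming-distance matcher
  cv::Ptr<cv::DescriptorMatcher> matcher =
      cv::DescriptorMatcher::create( "BruteForce-Hamming" );
  matcher->add( objectDescriptors );   // load the dictionary
  matcher->train();                    // no-op for brute force; required for FLANN

  // Recognition-style match: query descriptors against the trained dictionary
  std::vector<cv::DMatch> matches;
  matcher->match( sceneDescriptors, matches );

  // matches[i].imgIdx tells us which dictionary image each match came from
  return 0;
}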

The next method is the knnMatch() function, which expects the same list of descriptors as match(). In this case, however, for each descriptor in the query list, it will find a specific number of best matches from the dictionary. That number is given by the k integer argument. (The “knn” in the function name stands for k-nearest neighbors.) The vector of cv::DMatch objects from match() is replaced by a vector of vectors of cv::DMatch objects called matches in the knnMatch() method. Each element of the top-level vector (e.g., matches[i]) is associated with one descriptor from queryDescriptors. For each such element, the next-level element (e.g., matches[i][j]) is the jth best match from the descriptors in trainDescriptors.18 The mask argument to knnMatch() has the same meaning as for match(). The final argument for knnMatch() is the Boolean compactResult. If compactResult is set to the default value of false, the matches vector of vectors will contain one vector entry for every entry in queryDescriptors—even those entries that have no matches (for which the corresponding vector of cv::DMatch objects is empty). If, however, compactResult is set to true, then such noninformative entries will simply be removed from matches.

The third matching method is radiusMatch(). Unlike k-nearest neighbor matching, which searches for the k best matches, radius matching returns all of the matches within a particular distance of the query descriptor.19 Other than the substitution of the maximum distance maxDistance for the integer k, the arguments and their meanings for radiusMatch() are the same as those for knnMatch().
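
One common use of knnMatch() is the so-called ratio test, in which a match is kept only if its best dictionary match is substantially better than its second-best. A minimal sketch (the 0.75 ratio is a conventional but arbitrary choice, and the matcher is assumed to have already been add()-ed to and train()-ed):

#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::DMatch> ratioFilteredMatches(
  const cv::Ptr<cv::DescriptorMatcher>& matcher,          // already trained
  const cv::Mat&                        queryDescriptors
) {
  // Two best dictionary matches for each query descriptor
  std::vector< std::vector<cv::DMatch> > knn;
  matcher->knnMatch( queryDescriptors, knn, 2 );

  std::vector<cv::DMatch> good;
  for( const auto& m : knn )
    if( m.size() == 2 && m[0].distance < 0.75f * m[1].distance )
      good.push_back( m[0] );
  return good;
}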

Note

Don’t forget that in the case of matching, “best” is determined by the individually derived class that implements the cv::DescriptorMatcher interface, so the exact meaning of “best” may vary from matcher to matcher. Also, keep in mind that there is typically no “optimal assignment” being done, so one descriptor on the query list could match several on the train list, or vice versa.

The next three methods—the alternate forms of match(), knnMatch(), and radiusMatch()—support two lists of descriptors. These are typically used for tracking. Each of these has the same inputs as their aforementioned counterparts, with the addition of the trainDescriptors argument. These methods ignore any descriptors in the internal dictionary, and instead compare the descriptors in the queryDescriptors list only with the provided trainDescriptors.20
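
Here is a minimal sketch of the tracking-style form, matching descriptors from a current frame against those from a previous frame; a Hamming-norm brute-force matcher is assumed because the descriptors are taken to be binary (e.g., ORB):

#include <opencv2/opencv.hpp>
#include <vector>

// Match descriptors from the current ("query") frame against those from the
// previous ("train") frame; any internal dictionary is ignored by this form.
std::vector<cv::DMatch> matchFrames(
  const cv::Mat& currDescriptors,     // query
  const cv::Mat& prevDescriptors      // train
) {
  cv::BFMatcher matcher( cv::NORM_HAMMING );   // assumes binary descriptors
  std::vector<cv::DMatch> matches;
  matcher.match( currDescriptors, prevDescriptors, matches );
  return matches;
}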

After these six matching methods, there are some methods that you will need for general handling of matcher objects. The read() and write() methods require a cv::FileNode and cv::FileStorage object,  respectively, and allow you to read and write a matcher from or to disk. This is particularly important when you are dealing with recognition problems in which you have “trained” the matcher by loading information in from what might be a very large database of files. This saves you from needing to keep the actual images around and reconstruct the keypoints and their descriptors from every image every time you run your code.

Finally, the clone() and create() methods allow you to make a copy of a matcher or create a new one by name, respectively. The first method, clone(), takes a single Boolean, emptyTrainData, which, if true, will create a copy with the same parameter values (for any parameters accepted by the particular matcher implementation) but without copying the internal dictionary. Setting emptyTrainData to false gives you what is essentially a deep copy, which copies the dictionary in addition to the parameters. The create() method is a static method that accepts a single string from which a particular derived class can be constructed. The currently available values for the descriptorMatcherType argument to create() are given in Table 16-1, and a short example follows the table. (The meaning of the individual cases will be described in the next section.)

Table 16-1. Available options for descriptorMatcherType argument to cv::DescriptorMatcher::create() method
descriptorMatcherType string    Matcher type
"FlannBased"                    FLANN (Fast Library for Approximate Nearest Neighbors) method; the L2 norm is used by default
"BruteForce"                    Element-wise direct comparison using the L2 norm
"BruteForce-SL2"                Element-wise direct comparison using the squared L2 norm
"BruteForce-L1"                 Element-wise direct comparison using the L1 norm
"BruteForce-Hamming"            Element-wise direct comparison using the Hamming distancea
"BruteForce-Hamming(2)"         Element-wise direct comparison using a multilevel Hamming distance (two levels)

a All the Hamming distance methods can be applied only to binary descriptors that are encoded with the CV_8UC1 type (i.e., eight components of the descriptor per each descriptor byte).
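
As a quick illustration, here is a sketch of creating matchers by name and of cloning one without its dictionary:

#include <opencv2/opencv.hpp>

int main() {
  // Any of the strings in Table 16-1 will work here; the Hamming variants are
  // the natural choice for binary descriptors such as ORB, BRISK, or BRIEF
  cv::Ptr<cv::DescriptorMatcher> flann =
      cv::DescriptorMatcher::create( "FlannBased" );
  cv::Ptr<cv::DescriptorMatcher> hamming =
      cv::DescriptorMatcher::create( "BruteForce-Hamming" );

  // clone(true) copies the matcher's parameters but not its dictionary
  cv::Ptr<cv::DescriptorMatcher> fresh = hamming->clone( true );

  return 0;
}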

Core Keypoint Detection Methods

In the last 10 years, there has been tremendous progress in tracking and image recognition. Within this space, one very important theme has been the development of keypoints that, as you now know, are small fragments of an image that contain the highest density of information about the image and its contents. One of the most important features of the keypoint concept is that it allows an image to be “digested” into a finite number of essential elements, even as the resolution of the image becomes very high. In this sense, keypoints offer a way to get from an image in a potentially very high dimensionality pixel representation into a more compact representation whose quality increases with image size, but whose actual size does not. It is thought that the human visual cortex “chunks” individual retinal responses (essentially pixels) up into higher-level blocks of information, at least some of which are analogous to the kind of information contained in a keypoint.

Early work that focused on concepts like corner detection (which we saw earlier) gave way to increasingly sophisticated keypoints with increasingly expressive descriptors (the latter we will visit in the next section), which exhibit a variety of desirable characteristics—such as rotational or scale invariance, or invariance to small affine transformations—that were not present in the earlier keypoint detectors.

The current state of the art, however, is that there is a large number of keypoint-detection algorithms (and keypoint descriptors), none of which is “clearly better” than the others. As a result, the approach of the OpenCV library has been to provide a common interface to all of the detectors, with the hope of encouraging and facilitating experimentation and exploration of their relative merits within your individual context. Some are fast, and some are comparatively quite slow. Some find features for which very rich descriptors can be extracted, and some do not. Some exhibit one or more useful invariance properties, of which some might be quite necessary in your application, and some might actually work against you.

In this section, we will look at each keypoint detector in turn, discussing its relative merits and delving into the actual science of each detector, at least deeply enough that you will get a feel for what each one is for and what it offers that may be different than the others. As we have learned, for each descriptor type, there will be a detector that locates the keypoints and a descriptor extractor. We will cover each of these as we discuss each detection algorithm.

Note

In general, it is not absolutely necessary to use the feature extractor that is historically associated with a particular keypoint detector. In most cases, it is meaningful to find keypoints using any detector and then proceed to characterize those keypoints with any feature extractor. In practice, however, these two layers are usually developed and published together, and so the OpenCV library uses this pattern as well.

The Harris-Shi-Tomasi feature detector and cv::GoodFeaturesToTrackDetector

The most commonly used definition of a corner was provided by Harris [Harris88], though others proposed similar definitions even earlier. Such corners, known as Harris corners, can be thought of as the prototypical keypoint.21 Figure 16-11 shows the Harris corners on a pair of images that we will continue to use for other keypoints in this section (for convenient visual comparison). Their definition relies on a notion of autocorrelation between the pixels in a small neighborhood. In plain terms, this means “if the image is shifted a small amount (Δx, Δy), how similar is it to its original self?”

Figure 16-11. Two images of the same vehicle; in each image are the 1,000 strongest Harris-Shi-Tomasi corners. Notice that, in the image on the right, the corners in the background are stronger than those on the car, and so most of the corners in that image originate from there

Harris began with the following autocorrelation function, for image intensity pixels I(x, y):

E(x, y; Δx, Δy) = Σ(i,j)∈W(x,y) wi,j [ I(i + Δx, j + Δy) − I(i, j) ]²

This is just a weighted sum over a small window around a point (x,y) of the squared difference between the image at some point (i,j) in the window and some other point displaced by (Δx, Δy). (The weighting factor wi, j is a Gaussian weighting that makes the differences near the center of the window contribute more strongly than those farther away from the center.)

What follows in Harris’s derivation is a small amount of algebra, and the approximation that because Δx and Δy are assumed to be small, the term I(i + Δx, j + Δy) can be estimated by I(i, j) + Ix (i, jx + Iy (i, jy) (here Ix and Iy are the first-order partial derivatives of I(x,y) in x and y, respectively).22 The result is the re-expression of the autocorrelation in the form:

where M(x, y) is the symmetric autocorrelation matrix defined by:

M(x, y) = Σ(i,j)∈W(x,y) wi,j [ Ix(i, j)²           Ix(i, j) Iy(i, j) ]
                             [ Ix(i, j) Iy(i, j)   Iy(i, j)²         ]

Corners, by Harris’s definition, are places in the image where the autocorrelation matrix has two large eigenvalues. In essence, this means that moving a small distance in any direction will change the image.23 This way of looking at things has the advantage that, when we consider only the eigenvalues of the autocorrelation matrix, we are considering quantities that are invariant also to rotation, which is important because objects that we are tracking might rotate as well as translate.

Note

In this case these two eigenvalues of the Harris corner do more than determine whether a point is a good feature to track (i.e., a keypoint); they also provide an identifying signature for the point (i.e., a keypoint descriptor). It is a common, though by no means universal, feature of keypoints that they are intimately tied to their descriptors in this way. In many cases, the keypoint is, in essence, any point for which the associated descriptor (in this case the two eigenvalues of M(x, y)) meets some threshold criteria. At the same time, it is also noteworthy that Harris’s original threshold criterion was not the same as that later proposed by Shi and Tomasi; the latter turns out to be superior for most tracking applications.

Harris’s original definition involved taking the determinant of M(x, y) and subtracting its squared trace (with some weighting coefficient):

One then found the “corners” (what we now call keypoints) by searching for local maxima of this function (and often also comparing this function to a predetermined threshold). This function H, known as the Harris measure, effectively compares the eigenvalues of M (which we refer to as λ1 and λ2 in our definition of H) without requiring their explicit computation. This comparison implicitly contains the parameter κ, termed the sensitivity, which can be set meaningfully to any value between 0 and 0.24, but is typically set to about 0.04.24 Figure 16-12 shows an image in which the regions around some individual keypoint candidates are shown enlarged.

Figure 16-12. In a classic image (a), keypoints found by the Shi-Tomasi method are shown as black dots. Below that are three images that are enlargements of a small subsection of the original. On the left (b) are shown (as Xs) points that are not keypoints. These points have small eigenvalues in both dimensions. In the center (c) are shown (as Xs) points that are also not keypoints; these are edges, and have one small eigenvalue and one large eigenvalue associated with them. On the right (d) are actual found keypoints; for these points, both eigenvalues are large. The ovals visualize the inverse of these eigenvalues

It was later found by Shi and Tomasi [Shi94] that good corners resulted as long as the smaller of the two eigenvalues was greater than a minimum threshold. Shi and Tomasi's method was not only sufficient, but in many cases gave more satisfactory results than Harris's method. The OpenCV implementation, cv::GFTTDetector, uses Shi and Tomasi's measure by default, but other keypoint detectors we will discuss later often use either Harris's original measure or a variation of it.

Keypoint finder

The Harris-Shi-Tomasi corner detector is also the simplest implementation of the detector part of the cv::Feature2D interface:

class cv::GFTTDetector : public cv::Feature2D {
public:
  static Ptr<GFTTDetector> create(
    int    maxCorners        = 1000,  // Keep this many corners
    double qualityLevel      = 0.01,  // fraction of largest eigenvalue
    double minDistance       = 1,     // Discard corners if this close
    int    blockSize         = 3,     // Neighborhood used
    bool   useHarrisDetector = false, // If false, use Shi Tomasi
    double k                 = 0.04   // Used for Harris metric
  );
...
};

The constructor for cv::GFTTDetector takes arguments that set all of the basic runtime parameters for the algorithm. The maxCorners parameter indicates the maximum number of points that you would like returned.25 The parameter qualityLevel indicates the minimal acceptable lower eigenvalue for a point to be included as a corner. The actual minimal eigenvalue used for the cutoff is the product of the qualityLevel and the largest lower eigenvalue observed in the image. Hence, the qualityLevel should not exceed 1 (a typical value might be 0.10 or 0.01). Once these candidates are selected, a further culling is applied so that multiple points within a small region need not be included in the response. In particular, the minDistance guarantees that no two returned points are within the indicated number of pixels.

The blockSize is the region around a given pixel that is considered when you are computing the autocorrelation matrix of derivatives. It turns out that in almost all cases you will get superior results if you sum these derivatives over a small window than if you simply compute their value at only a single point (i.e., at a blockSize of 1).

If useHarrisDetector is true, then the Harris corner definition is used rather than the Shi-Tomasi definition, and the value k is the weighting coefficient used to set the relative weight given to the squared trace of the autocorrelation matrix compared to the determinant of that same matrix.

Of course, when you want to actually compute keypoints, you do that with the detect() method, which cv::GFTTDetector inherits from the cv::Feature2D base class.
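
A minimal sketch of this usage (the filenames are hypothetical):

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
  cv::Mat img = cv::imread( "scene.jpg", cv::IMREAD_GRAYSCALE );  // hypothetical file
  if( img.empty() ) return 1;

  // Keep up to 500 Shi-Tomasi corners, at least 10 pixels apart
  cv::Ptr<cv::GFTTDetector> gftt = cv::GFTTDetector::create( 500, 0.01, 10 );

  std::vector<cv::KeyPoint> corners;
  gftt->detect( img, corners );            // inherited from cv::Feature2D

  cv::Mat out;
  cv::drawKeypoints( img, corners, out );  // visualize the result
  cv::imwrite( "corners.jpg", out );

  return 0;
}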

Additional functions

cv::GFTTDetector also supports setting and retrieving various properties using set/get methods; for example, you can turn on the Harris detector instead of the default minimum-eigenvalue-based (Shi-Tomasi) GFTT algorithm by calling gfttdetector->setHarrisDetector(true).

A brief look under the hood

Internally, cv::goodFeaturesToTrack() and cv::GFTTDetector have a few specific phases: the computation of the autocorrelation matrix M(x, y), the analysis of this matrix, and the application of some kind of threshold. The critical steps are accomplished with the functions cv::cornerHarris() and cv::cornerMinEigenVal():

void cv::cornerHarris(
  cv::InputArray  src,                            // Input array CV_8UC1
  cv::OutputArray dst,                            // Result array CV_32FC1
  int             blockSize,                      // Autocorrelation block sz
  int             ksize,                          // Sobel operator size
  double          k,                              // Harris's trace weight
  int             borderType = cv::BORDER_DEFAULT // handle border pix
);
void cv::cornerMinEigenVal(
  cv::InputArray  src,                            // Input array CV_8UC1
  cv::OutputArray dst,                            // Result array CV_32FC1
  int             blockSize,                      // Autocorrelation block sz
  int             ksize      = 3,                 // Sobel operator size
  int             borderType = cv::BORDER_DEFAULT // handle border pix
);

The arguments to these two functions are exactly analogous to the corresponding arguments of cv::goodFeaturesToTrack(), with the first filling dst with the characteristic values used by Harris:

dst(x, y) = det( M(x, y) ) − k · [ trace( M(x, y) ) ]²

and the second filling dst with the characteristic values used by Shi and Tomasi—that is, the minimal eigenvalue of the autocorrelation matrix M(x, y).

If you would like to implement your own variation of the GFTT algorithm, a final function is provided for you that computes and gives to you the eigenvalues and eigenvectors of the autocorrelation matrix for every point on your image. This function is called cv::cornerEigenValsAndVecs():

void cv::cornerEigenValsAndVecs(
  cv::InputArray  src,                            // Input array CV_8UC1
  cv::OutputArray dst,                            // Result array CV_32FC6
  int             blockSize,                      // Autocorrelation block sz
  int             ksize,                          // Sobel operator size
  int             borderType = cv::BORDER_DEFAULT // handle border pix
);

The only significant difference between this function and cv::cornerMinEigenVal() is the nature of the output. In this case, the results array dst will be of type CV_32FC6. The six channels will contain the two eigenvalues, the two components of the eigenvector for the first eigenvalue, and the two components of the eigenvector for the second eigenvalue (in that order).
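
A brief sketch of how the six channels might be read back (the filename and pixel location are hypothetical):

#include <opencv2/opencv.hpp>
#include <algorithm>

int main() {
  cv::Mat gray = cv::imread( "scene.jpg", cv::IMREAD_GRAYSCALE );  // hypothetical file
  if( gray.empty() ) return 1;

  cv::Mat eig;                                      // will be CV_32FC6
  cv::cornerEigenValsAndVecs( gray, eig, 3, 3 );    // blockSize = 3, ksize = 3

  // Per-pixel channel layout: lambda1, lambda2, x1, y1, x2, y2
  cv::Vec6f v       = eig.at<cv::Vec6f>( 100, 100 );
  float     lambda1 = v[0], lambda2 = v[1];

  // The Shi-Tomasi quality at this pixel would be the smaller eigenvalue
  float shiTomasi = std::min( lambda1, lambda2 );

  return 0;
}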

The simple blob detector and cv::SimpleBlobDetector

The corner detection concept embodied by the work of Harris, and later Shi and Tomasi, represents what will turn out to be one major approach to the keypoint concept. From this point of view, keypoints are highly localized structures that exist at points in an image where a larger-than-normal amount of information is present. An alternative point of view is the concept of a blob (see Figure 16-13). Blobs are, by nature, not so clearly localized, but represent regions of interest that might be expected to have some stability over time (see Figure 16-14).

Figure 16-13. The simple blob detector run on two similar images of the same vehicle. There is little consistency between the blobs found in the two images. Blob detection works best in simple environments where there are expected to be a few very well-defined objects to locate
Figure 16-14. Starting with an image of a countryside scene (left), six thresholded images are generated (center). One set of overlapping blob candidates, corresponding to the building in the bottom center of the original image, is shown (right). These candidates will be combined to produce a final estimation of the associated blob (not shown). The contours contributing to these blob candidates are highlighted (in black) in the center thresholded images

There are many algorithms for blob detection. The cv::SimpleBlobDetector class implements just one of them.26 The simple blob detector works by first converting the input image to grayscale, and then computing a series of thresholded (binary) images from that grayscale image. The number of binary images is determined by parameters to the algorithm: a minimum threshold, a maximum threshold, and a threshold step. Once converted to binary, connected components are extracted—for example, by cv::findContours()—and the center of each such contour is computed; these centers are the candidate blob centers. Next, candidate blob centers near one another in space (controlled by a minimum distance parameter, minDistBetweenBlobs) and from images with adjacent thresholds (differing by one step in the list of applied thresholds) are grouped together. Once these groups have been determined, each group is assigned a radius and a center, which is computed from all of the contours that form the group. The resulting objects are the keypoints.

Once the blobs have been located, some built-in filtering can be turned on to reduce the number of blobs. Blobs may be filtered by color (which really means intensity, since this is a grayscale image), by size (area), by circularity (ratio of the area of the actual blob to a circle of the blob’s computed effective radius), by what is called the inertia ratio (the ratio of the eigenvalues of the second moment matrix), or by the convexity (the ratio of the blob’s area to the area of its convex hull).

Keypoint finder

Start by taking a look at (a somewhat simplified version of) the blob detector’s declaration:

class SimpleBlobDetector : public Feature2D {

public:
  struct Params {
      Params();
      float  minThreshold;              // First threshold to use
      float  maxThreshold;              // Highest threshold to use
      float  thresholdStep;             // Step between thresholds

      size_t minRepeatability;          // Blob must appear
                                        // in this many images
      float  minDistBetweenBlobs;       // Blob must be this far
                                        // from others

      bool   filterByColor;             // True to use color filter
      uchar  blobColor;                 // always 0 or 255

      bool   filterByArea;              // True to use area filter
      float  minArea, maxArea;          // min and max area to accept

      // True to filter on "circularity", and min/max
      // ratio to circle area
      bool   filterByCircularity;
      float  minCircularity, maxCircularity;

      // True to filter on "inertia", and min/max eigenvalue ratio
      bool   filterByInertia;
      float  minInertiaRatio, maxInertiaRatio;

      // True to filter on convexity, and min/max ratio to hull area
      bool   filterByConvexity;
      float  minConvexity, maxConvexity;

      void read( const FileNode& fn );
      void write( FileStorage& fs ) const;
  };

  static Ptr<SimpleBlobDetector> create(
    const SimpleBlobDetector::Params &parameters
      = SimpleBlobDetector::Params()
  );

  virtual void read( const FileNode& fn );
  virtual void write( FileStorage& fs ) const;

  ...
};

As you can see from scanning over this declaration, there is not actually much going on here. There is a definition for cv::SimpleBlobDetector::Params, a structure that holds all of the information needed to actually run a simple blob detector; a static create() method (which takes a Params argument); and read() and write() functions that allow us to store the state of our detector. Of course, there are also the all-important detect() routines inherited from the cv::Feature2D interface.

We already know how the detect() member works, in the sense that what it does is entirely generic to all feature detectors; what really matters to us here is how to set up the parameters in the Params argument to create(). The first group of five parameters controls the basic functioning of the algorithm. We use thresholdStep, minThreshold, and maxThreshold to configure the set of thresholded images to generate. We do so by starting at minThreshold and stepping up by thresholdStep each time, up until, but not including, maxThreshold. It is typical to start with a value around 50 to 64 and step in small increments (e.g., 10) up to about 220 to 235, thus avoiding the often-less-informative ends of the intensity distribution. minRepeatability determines how many (consecutive) threshold images must contain overlapping blob candidates in order for the candidates to be combined into a blob. This number is typically a small integer, but rarely less than two. The actual meaning of “overlapping” is controlled by minDistBetweenBlobs. If two blob candidates have their centers within this distance, they are considered to be related to the same blob. Keep in mind that this parameter is in pixel units, and so should scale with your image. The default constructor for cv::SimpleBlobDetector::Params sets this to 10, which is probably only suitable for images of about 640 × 480.

The remaining parameters affect the different filtering options, and are arranged into small groups, each of which contains a Boolean that turns the particular filtering feature on or off, and a parameter or two that control the filtering (if it is turned on). The first is filterByColor, which has only one associated parameter. That parameter, blobColor, is an intensity value required for a blob candidate to be kept. Because the blob candidates are generated on binary thresholded images, only the values 0 and 255 are meaningful. (Use the former to extract only dark blobs and the latter to extract only light blobs; turn off the feature altogether to get both kinds of blobs.)

The filterByArea parameter, if true, will cause only blobs whose area is greater than or equal to minArea, yet strictly less than maxArea, to be kept. Similarly, the filterByCircularity parameter, if true, will cause only blobs whose circularity is greater than or equal to minCircularity, yet strictly less than maxCircularity, to be kept. The same goes for filterByInertia, minInertiaRatio, and maxInertiaRatio, as well as for filterByConvexity, minConvexity, and maxConvexity.27
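
Here is a minimal sketch of configuring and running the detector; the specific parameter values are illustrative only, and the filename is hypothetical:

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
  cv::Mat img = cv::imread( "scene.jpg", cv::IMREAD_GRAYSCALE );  // hypothetical file
  if( img.empty() ) return 1;

  cv::SimpleBlobDetector::Params params;    // start from the defaults, tweak a few
  params.minThreshold  = 50;
  params.maxThreshold  = 220;
  params.thresholdStep = 10;
  params.filterByColor = true;
  params.blobColor     = 255;               // keep only light blobs
  params.filterByArea  = true;
  params.minArea       = 100;               // in pixels
  params.maxArea       = 5000;

  cv::Ptr<cv::SimpleBlobDetector> blobber = cv::SimpleBlobDetector::create( params );

  std::vector<cv::KeyPoint> blobs;          // blob centers and sizes, as keypoints
  blobber->detect( img, blobs );

  return 0;
}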

The FAST feature detector and cv::FastFeatureDetector

The FAST (Features from Accelerated Segment Test) feature-detection algorithm, originally proposed by Rosten and Drummond [Rosten06], is based on the idea of a direct comparison between a point P and a set of points on a small circle around it (see Figure 16-15). The basic idea is that if only a few of the points nearby are similar to P, then P is going to be a good keypoint. An earlier implementation of this idea, the SUSAN algorithm, compared all of the points in a disk around P. FAST, which can be thought of as a successor to SUSAN, improves on this idea in two ways.

Figure 16-15. Two images of the same vehicle; in each image are the 1,000 strongest FAST features. Notice that in the image on the right, as with the Harris-Shi-Tomasi features, the corners in the background are stronger than those on the car, and so again most of the corners in the right image originate from the background

The first difference is that FAST only uses the points on a ring around P. The second is that individual points on the ring are classified as either darker than P, lighter than P, or similar to P. This classification is done with a threshold t, such that the darker pixels are ones that are less bright than IP − t, the lighter pixels are ones that are more bright than IP + t, and the similar pixels are those that are in between IP − t and IP + t. Once this classification has been done, the FAST detector requires some number of contiguous points on the ring to be either all brighter or all darker than P. If the number of points on the ring is N, then the arc that contains only lighter or darker pixels must contain at least N/2 + 1 pixels (i.e., more than half the total number on the ring).

This algorithm is already very fast, but a moment’s thought will also reveal that this test permits a convenient optimization in which only four equidistant points are tested. In this case, if there is not at least a pair of consecutive points that are brighter or darker than P, then the point P cannot be a FAST feature. In practice this optimization greatly reduces the time required to search an entire image.

One difficulty with the algorithm as described so far is that it will tend to return multiple adjacent pixels all as corners. In Figure 16-16, for example, the pixel directly above P, among others, is also a FAST keypoint. In general this is not desirable.

Figure 16-16. The point P is a keypoint candidate for the FAST algorithm. The ring of points that contribute to the classification of P is identified by a circle around P. In this case there are 16 pixels on that circle, numbered 0–15 here

To avoid this problem, the FAST algorithm defines a score for each corner, and can remove all keypoints that are adjacent to keypoints of higher score. We construct the score by first computing the sum of absolute differences between the “lighter” pixels and the center pixel, then doing the same for the darker pixels, and finally taking the greater of these two.

It is worth noting, mainly because we will come back to this point later when we discuss the ORB feature, that the FAST feature, as defined here, does not have any kind of intrinsic orientation.

Keypoint finder

The FAST feature detector is very simple, and looks very much like the Harris corner detector cv::GoodFeaturesToTrackDetector:

class cv::FastFeatureDetector : public cv::Feature2D {
public:
  enum {
    TYPE_5_8  = 0,                      //  8 points, requires 5 in a row
    TYPE_7_12 = 1,                      // 12 points, requires 7 in a row
    TYPE_9_16 = 2                       // 16 points, requires 9 in a row
  };

  static Ptr<FastFeatureDetector> create(
    int    threshold        = 10,       // center to periphery diff
    bool   nonmaxSuppression = true,    // suppress non-max corners?
    int    type             = TYPE_9_16 // Size of circle (see enum)
  );
...
};

The constructor for the cv::FastFeatureDetector has three arguments: the threshold, a Boolean flag, and the operator type. The value of threshold is measured in pixel intensity units and is therefore an integer. The Boolean value, nonmaxSuppression, turns on or off the suppression of neighboring points with inferior scores. The last argument sets the type of operator, with this type determining the circumference of the circle of sampled points. There are three available types, defined in the cv::FastFeatureDetector class as enumerations. Each type specifies both the circumference of the circle and the number of contiguous points required for the center of that circle to be considered a keypoint. For example, the value cv::FastFeatureDetector::TYPE_9_16 conveys that the circle contains 16 points and that at least 9 contiguous points out of those 16 must be either all brighter or all darker than the center point.

Note

In most cases, you will want to set the threshold to a somewhat large number, like 30. If the threshold is too low, you will get a lot of spurious points in areas of very minor real intensity variation.
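
A minimal sketch of using the detector with such a threshold (the filename is hypothetical):

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
  cv::Mat img = cv::imread( "scene.jpg", cv::IMREAD_GRAYSCALE );  // hypothetical file
  if( img.empty() ) return 1;

  // Threshold of 30, nonmaximum suppression on, 9-of-16 test
  cv::Ptr<cv::FastFeatureDetector> fast = cv::FastFeatureDetector::create(
      30, true, cv::FastFeatureDetector::TYPE_9_16 );

  std::vector<cv::KeyPoint> kps;
  fast->detect( img, kps );

  return 0;
}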

The SIFT feature detector and cv::xfeatures2d::SIFT

The SIFT feature (Scale Invariant Feature Transform),28 originally proposed by David Lowe in 2004 [Lowe04], is widely used and the basis for many subsequently developed features (see Figure 16-17). SIFT features are computationally expensive compared to many other feature types, but they are highly expressive, and thus are well suited to both tracking and recognition tasks.

Figure 16-17. Two images of the same vehicle from different angles. On the left 237 SIFT features are found, while on the right 490 are found. In this case, you can see that the features on the vehicle are relatively stable and can find many correspondences by eye. The density of features on the car is approximately the same in both images, despite the many more features found on the background in the righthand image

The scale invariant property that gives SIFT features their name results from an initial phase of the SIFT algorithm in which a set of convolutions is computed between the input image and Gaussian kernels of increasing size. Each of these convolutions is then subtracted from its successor, the one convolved with a slightly larger Gaussian. The result of this process is a new set of images that approximate the difference of Gaussian (DoG) operator. Given this set of images, which you might visualize as a stack, each pixel in each image in the stack is compared not only with its neighbors in its own image (of which there are eight), but also with itself and its neighbors in the images above and below in the stack (of which there are nine more in the image above and nine more in the image below). If a pixel has a higher (or lower) value in the difference of Gaussian convolution than all 26 of these neighbors, it is considered a scale space extremum of the difference of Gaussian operator (see Figure 16-18).

The intuition here is straightforward. Consider an image that is black except for a white disk in the center. The difference of Gaussians kernel (Figure 16-19) will give the strongest response when the zero crossings are exactly at the edges of the white disk. In this configuration, the positive part of the kernel is multiplied by the positive image values of the white disk, while the negative part of the kernel is entirely multiplied by the zero values of the black background. Neighboring locations or different sizes at the same location will give weaker responses. In this sense, the “feature” of the disk is found, both in position and in scale.

Figure 16-18. We locate scale space extrema by first convolving the original image with Gaussian kernels of various sizes and then computing difference images between convolutions of neighboring sizes. In the difference images, each pixel (shown as a solid square) is compared to all of its neighbors (shown as Xs) in both the same layer and the adjacent layers. If the difference of Gaussians signal is stronger than all neighbors on all three layers, that pixel is considered a scale space extremum
Figure 16-19. The Gaussian kernels G1 and G2, along with the difference G1–G2

Once a set of features is found, the algorithm tests each feature both to determine its quality as a feature and to refine the estimate of its location. It does so by fitting a paraboloid to the 3 × 3 × 3 volume around the extremum (the three dimensions being x, y, and scale). From this quadratic form, we can extract two important pieces of information. The first is an offset to be added to the location of the keypoint; this offset is a subpixel correction for spatial location, and interpolates in scale as well as in between the discrete scales available from the original set of Gaussian convolutions. The second is an estimate of the local curvature at this extremal point, in the form of a Hessian matrix whose determinant can be used as a thresholdable value by which to reject keypoints of low discriminating power. Similarly, a large ratio between the eigenvalues of the purely spatial part of this matrix indicates that the feature is primarily an edge (rather than a “corner”), and is also a means by which candidate features can be rejected. These considerations can be compared to the figure of merit used in the Harris corner or Shi-Tomasi feature we saw earlier in this chapter.

Once all such scale space extrema have been found, the SIFT algorithm proceeds to construct a descriptor for each such object, as shown in Figure 16-20. The first step here is to assign an orientation to the keypoint. The orientation is based essentially on comparing the directional derivatives over the points around the keypoint and picking the orientation that corresponds to the largest derivatives.29 Once such an orientation is found, all subsequent properties of the descriptor can be assigned relative to this primary orientation. In this way, SIFT features are not only scale invariant, but also orientation invariant—the latter in the sense that given a new rotated image, the same keypoint would be found; its orientation would be found to be different, but the rest of the feature descriptors, because they are measured relative to the orientation, would be found to match.

Finally, with the scale and orientation computed, the local image descriptor can be computed. The local image descriptor is also formed from local image gradients, but this time after the local region has been rotated to a fixed orientation relative to the descriptor orientation. These gradients are then grouped into regions (typically 16, in a 4 × 4 pattern around the keypoint, or more), and for each region an angle histogram is created from all of the points in the associated region. Typically this histogram will have 8 entries. Combining the 8 entries per region with the 16 regions gives a vector of 128 components. This 128-component vector is the SIFT keypoint descriptor. This large number of components is essential to the highly descriptive nature of the SIFT keypoints.

Figure 16-20. A SIFT feature is extracted from an image (a). That feature has a size and an orientation, shown in (b). The area around the feature is divided up into blocks (c), and for each block, a directional derivative is computed for every pixel in the cell (d). These directional derivatives are aggregated into a histogram for each block (e), and the magnitudes in each bin in all of the histograms for all of the blocks are concatenated into a vector descriptor for the feature (f)

Keypoint finder and feature extractor

The SIFT implementation in OpenCV implements both the feature-detection and descriptor-extraction parts of the cv::Feature2D interface and, as usual, we recommend using the detectAndCompute() method, which combines keypoint detection and feature extraction into one step. Note that because the SIFT algorithm is patented, starting from OpenCV 3.0 it is placed in the opencv_contrib repository, in the xfeatures2d module. Here is (a slightly abbreviated version of) the class definition for the cv::xfeatures2d::SIFT object:

class SIFT : public Feature2D {

public:

  static Ptr<SIFT> create (
    int    nfeatures         = 0,    // Number of features to use
    int    nOctaveLayers     = 3,    // Layers in each octave
    double contrastThreshold = 0.04, // to filter out weak features
    double edgeThreshold     = 10,   // to filter out "edge" features
    double sigma             = 1.6   // variance of level-0 Gaussian
  );

  int descriptorSize() const;        // descriptor size, always 128
  int descriptorType() const;        // descriptor type, always CV_32F
   ...
};

All of the parameters required by the cv::xfeatures2d::SIFT constructor are used in the construction of the scale-space image representation and also in the keypoint-detection portion of the algorithm. In the current implementation, the actual feature descriptors are always computed with a fixed set of parameters.30

The first argument, nfeatures, indicates the maximum number of features you would like to have computed from the image. If it is left set to the default value of 0, the algorithm will return all features that can be found. The next argument, nOctaveLayers, determines how many layers (different scales of Gaussian convolutions) should be computed for each octave (image in the image pyramid). The number of layers actually computed is the value of the nOctaveLayers argument plus three. Thus, the illustration in Figure 16-18 shows the case of five layers for each image in the pyramid, which corresponds to a value for nOctaveLayers of two.

The next two parameters are threshold values, which are used to determine whether a found keypoint candidate should be retained. Recall that once a keypoint candidate is generated by scale space search, that keypoint is then subjected to two tests. The first test is whether or not the local extremum of the DoG operator is sufficiently distinct from the surrounding region. This test is against the value of contrastThreshold; typical values of this argument are of the order of 0.04 (the default value). The second test has to do with the ratio of spatial eigenvalues, and serves the purpose of rejecting edges. This test is against the value of edgeThreshold; typical values of this argument are of the order 10.0 (the default value).

The final parameter, sigma, is used to create a presmoothing of the image, which incidentally effectively also sets the scale of the very first layer in the scale space. Typical values are of the order of a pixel (the default is 1.6 pixels), but it is often useful to make this value slightly larger for images that contain a bit of noise or other artifacts.

Note

You could just prefilter an image by convolving with your own Gaussian filter, instead of using the sigma parameter, but if you do this, the algorithm will not be aware that there is no information below the scale you used in the filtering. As a result, it will waste time computing layers whose purpose is to search for features that cannot exist. Therefore, it is much more efficient to use the sigma parameter provided for presmoothing. In effect, you are telling the algorithm “I don’t care about anything smaller than this size.”

Once you have created your cv::xfeatures2d::SIFT object, you can use the descriptorSize() and descriptorType() functions to query the size of the features it will compute and the type of elements the feature vector will contain. These two functions always return 128 and CV_32F, respectively, but are often handy when you are using many kinds of feature objects and are handling them by a base class pointer, but need to query an individual one to find out about the feature vectors it will return.

The main function you will be using is the overloaded detectAndCompute() method. Depending on the arguments given, this method will either just compute keypoints, or compute keypoints and their associated descriptors. The keypoints-only case requires just three arguments: img, mask, and keypoints. The first argument is the image you want keypoints extracted from (which can be color or grayscale, but in the former case it will be converted to a grayscale representation internally before the algorithm begins). The image argument should always be of type CV_8U. The mask argument is used to restrict the generated keypoints to only those within some specified area. The mask array should be a single channel of type CV_8U, but it can be set to cv::noArray() if no filtering of this kind is required. The next argument to cv::xfeatures2d::SIFT::detectAndCompute() is keypoints, which must be a reference to an STL-style vector of cv::KeyPoint objects. This is where SIFT will put the keypoints it finds.

The next argument, descriptors, is an output array, analogous to other feature-point descriptors we have seen already. If descriptors is an array, then each row of the array will be a separate descriptor, and the number of such rows is equal to the number of keypoints.

The final argument is the Boolean useProvidedKeypoints. If this argument is set to true, then a keypoint search will not be undertaken, and the keypoints argument will instead be treated as an input. In this case, descriptors will be generated for every keypoint indicated in the keypoints vector.
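
To make the calling sequence concrete, here is a minimal sketch (ours, not taken from the library samples) that spells out all five create() arguments at their default values and then computes keypoints and descriptors for a single image; the filename scene.jpg is only a placeholder:

#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

int main() {

  cv::Mat img = cv::imread( "scene.jpg", cv::IMREAD_GRAYSCALE );

  cv::Ptr<cv::xfeatures2d::SIFT> sift = cv::xfeatures2d::SIFT::create(
    0,      // nfeatures:        keep everything that is found
    3,      // nOctaveLayers:    the default
    0.04,   // contrastThreshold
    10.0,   // edgeThreshold
    1.6     // sigma
  );

  std::vector<cv::KeyPoint> keypoints;
  cv::Mat                   descriptors;

  // One call produces both the keypoints and their descriptors
  sift->detectAndCompute( img, cv::noArray(), keypoints, descriptors );

  // 'descriptors' holds one CV_32F row of length sift->descriptorSize()
  // (i.e., 128) for each entry in 'keypoints'

  return 0;
}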

The SURF feature detector and cv::xfeatures2d::SURF

The SURF feature (Speeded-Up Robust Features)31 was originally proposed in 2006 by Bay et al. [Bay06, Bay08], and is in many ways an evolution of the SIFT feature we just discussed (see Figure 16-21). The creators of SURF were interested in ways in which the different components of the SIFT features could be replaced with more computationally efficient techniques that might give similar or better performance in (primarily) recognition tasks. The resulting features are not only much faster to compute; in many cases, their slightly simpler nature also makes them more robust to changes in orientation or lighting than SIFT features.

Figure 16-21. SURF features are computed for the same vehicle from two different orientations. On the left, 224 features are found, while on the right 591 features are found, with many new features being associated with the visible background in the righthand image. SURF features are orientable like SIFT features. This image was generated with the hessianThreshold set to 1500

Relevant to several phases of the SURF feature detector’s operation is the concept of the integral image. Recall that we encountered integral images in Chapter 12, where we saw that a single transform of the whole image thereafter allows us to compute the sum over any rectangular area with only a few simple operations. The SURF feature relies heavily on computations that can be greatly accelerated by the use of the integral image technique.

Like many other detectors, SURF defines a keypoint in terms of the local Hessian32 at a given point. Previously, we saw that in order to introduce the concept of scale, the SIFT detector computed the local Hessian using a difference of Gaussian convolutions with slightly differing widths (Figure 16-19). The SURF detector, however, computes the local Hessian by convolving with a box filter that approximates the difference of the two Gaussian kernels (Figure 16-22).33 The primary advantage of these box filters is that we can evaluate them quickly using the integral image technique.

Figure 16-22. The difference of two continuous Gaussian kernels is shown on the left (a). A discrete 9 × 9 filter kernel is shown in the center (b) that approximates the second derivative in the vertical direction. A box filter approximation of the DoG filter kernel is shown on the right (c)

Because the cost of the box filters does not change with the size of the filter (because of the integral image), it is not necessary to generate a scale pyramid of the images as was done with SIFT. Instead, a variety of ever larger box filters can be used to evaluate the Hessian at different scales. From the response to these box filters, keypoint features are defined in SURF to be the local extrema of the determinant of this Hessian that also exceed some threshold.

Like SIFT, SURF includes the notion of orientation for a feature, which we compute by using the integral image again to estimate a local bulk gradient of the region around the feature. We do so using a pair of simple Haar wavelets (Figure 16-23c) to approximate local gradients, and by applying these wavelets to different areas of the region around where the scale space extremum was found. If the scale of the feature was found to be s, then we calculate the gradients using wavelets of size 4s, spaced at intervals a distance s apart in a region of radius 6s centered on the feature (Figure 16-23b). We then aggregate these gradient estimates by considering a sliding window in angle of size π/3. By summing all of the gradients in this orientation window (with a weighting factor determined by the area’s distance from the feature center), we choose the maximum orientation found as the orientation of the feature (Figure 16-23d).34 Once an orientation has been computed, the feature vector can then be generated relative to that orientation and, as was the case with SIFT, the feature becomes effectively invariant to orientation.

Figure 16-23. We determine the SURF orientation of an image (a) by analyzing the region around where a scale space extremum was found (b). Two simple wavelets (c) are used to approximate a local gradient and are convolved with the image in many positions near that extremum (dashed boxes in b regularly sampled from the region shown by the solid circle in b). The final orientation is extracted from an analysis of all of the gradients measured in this way (d)

The features themselves are also computed in a manner analogous to the SIFT feature design. For a feature of scale s, the 20s × 20s region centered on the feature is first divided into 16 cells in a 4 × 4 grid. This grid is rotated relative to the feature by an angle given by the orientation just computed. For every such cell, 25 pairs of Haar wavelets (identical to those shown in Figure 16-23c, other than being much smaller) are used to approximate the x- and y-gradients of the image, one pair at each point of a 5 × 5 array within the cell.35 For each cell, the 25 x-direction wavelet convolutions are summed, as are the 25 y-direction wavelet convolutions. For each direction, both the sum and the sum of absolute values are computed. This gives four numbers for each cell in the 4 × 4 grid, or a total of 64 numbers. These 64 values form the entries in a 64-dimensional feature vector for the individual SURF feature (see Figure 16-24).
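
To make the bookkeeping concrete, the following sketch (ours, and purely illustrative; the function and array names are our own) assembles such a 64-element vector from per-subcell wavelet responses dx and dy that are assumed to have been computed already; the actual evaluation of the Haar wavelets over the integral image is omitted:

#include <cmath>
#include <vector>

// Sketch only: assemble a 64-element SURF-style descriptor from per-subcell
// gradient estimates. 'dx' and 'dy' are assumed to hold one Haar-wavelet
// response per subcell of the 4 x 4 grid of 5 x 5 subcells, already rotated
// to the feature orientation.
std::vector<float> assembleSurfDescriptor(
  const float dx[4][4][5][5],
  const float dy[4][4][5][5]
) {
  std::vector<float> desc;
  desc.reserve( 64 );
  for( int i = 0; i < 4; i++ ) {
    for( int j = 0; j < 4; j++ ) {
      float sx = 0.f, sy = 0.f, sax = 0.f, say = 0.f;
      for( int u = 0; u < 5; u++ ) {
        for( int v = 0; v < 5; v++ ) {
          sx  += dx[i][j][u][v];  sax += std::fabs( dx[i][j][u][v] );
          sy  += dy[i][j][u][v];  say += std::fabs( dy[i][j][u][v] );
        }
      }
      // Four numbers per cell: sum dx, sum dy, sum |dx|, sum |dy|
      desc.push_back( sx );  desc.push_back( sy );
      desc.push_back( sax ); desc.push_back( say );
    }
  }
  return desc;   // 16 cells x 4 numbers = 64 entries
}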

There is a variant of the SURF feature, called the “extended” SURF feature, that separates the convolution sums into eight sums rather than four. It does so by summing those values of the x-direction wavelet convolutions for which the corresponding y-direction wavelet convolution was positive separately from those for which the corresponding y-direction convolution was negative (and similarly for the y-direction sums, depending on the sign of the corresponding x-direction convolution). The resulting descriptors are larger, which means that matching them will be slower, but in some cases the greater descriptive power of the extended features has been found to improve recognition performance.

Figure 16-24. We compute the SURF feature from estimating a gradient in each of 400 subcells. The area around the feature is first divided into a 4 × 4 grid of cells (a). Each cell is then divided into 25 subcells, and directional derivatives are estimated for each subcell (b). The directional derivatives for the subcells are then summed to compute four values for each cell in the large grid (c)

Keypoint finder and feature extractor

As with SIFT, the best and most recent implementation of SURF in OpenCV uses the cv::Feature2D interface. The construction process is handled by a single create() method for the cv::xfeatures2d::SURF object, which provides an interface for keypoint finding and feature descriptor extraction, as well as a few other useful functions. Here is the (slightly abbreviated) definition of the SURF object:

class cv::xfeatures2d::SURF : public cv::Feature2D {

public:
  static Ptr<SURF> create (
    double hessianThreshold = 100,   // Keep features above this
    int    nOctaves         = 4,     // Num of pyramid octaves
    int    nOctaveLayers    = 3,     // Num of images in each octave
    bool   extended         = false, // false: 64-element,
                                     //  true: 128-element descriptors
    bool   upright          = false  // true: don't compute orientation
                                     //  (w/out is much faster)
  );

  int descriptorSize() const;        // descriptor size, 64 or 128
  int descriptorType() const;        // descriptor type, always CV_32F

  ...
};
typedef SURF SurfFeatureDetector;
typedef SURF SurfDescriptorExtractor;

The constructor method create() for the cv::xfeatures2d::SURF object has five arguments that are used to configure the algorithm. The first, hessianThreshold, sets the threshold value for the determinant of the Hessian that is required in order for a particular local extremum to be considered a keypoint. The value assigned by the default constructor is 100, but this is a very low value that could be interpreted as meaning “all of them.” A typical value for reasonable selectivity would be something like 1500.36

The extended parameter tells the feature extractor to use the extended (128-dimensional) feature set (described in the previous section). The parameter upright indicates that orientations should not be computed for features, and they should all be treated as “vertical”; this is also known as “upright SURF” or just “U-SURF.”

Note

In automotive or mobile robot applications, it is often safe to assume that the orientation of the camera is fixed relative to the orientation of the objects one wants to detect. Consider, for example, an automobile detecting road signs. In such a case, the use of the upright argument will improve speed and most likely improve matching performance as well.

The final arguments, nOctaves and nOctaveLayers, are closely analogous to the corresponding arguments for cv::xfeatures2d::SIFT(). The nOctaves argument determines how many “doublings” of scale will be searched for keypoints. In the case of SURF, the minimal size feature that can be found is calculated by convolution with a 9 × 9 pixel filter. The default number of octaves is four, which is normally sufficient for most applications. When using very high resolution imagers, however, you might wish to increase this number. Reducing it to three produces only a very small improvement in speed, however, as the scale search for the higher octaves is very inexpensive compared to the lower octaves.37

For each octave, several different kernels will be evaluated. Unlike SIFT, however, the kernels are not distributed to evenly subdivide the octaves. In fact, if more than two octave layers are used, there will be overlap between the sizes of the kernels used in successive octaves. (This does not mean that a larger value is not useful, only that the effect of larger numbers of octave layers is not entirely intuitive.) The default value for nOctaveLayers is 3, but some studies have found values as high as four to be useful (though at increasing computational cost).

The descriptorSize() and descriptorType() methods return the number of elements in the descriptor vector (64 normally, and 128 for extended SURF features) and the type of the descriptor vector. (Currently the latter is always CV_32F.)

There are overloaded methods SURF::detect(), SURF::compute(), and SURF::detectAndCompute(). As usual, when you need both the keypoints and their descriptors, it’s recommended to use the latter method. The parameters are absolutely identical to those of SIFT::detectAndCompute().
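
For example, the following sketch (ours; the function name is our own) computes upright SURF keypoints and descriptors using the Hessian threshold of 1500 mentioned earlier:

#include <opencv2/core.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

void computeUprightSurf( const cv::Mat& img ) {      // img: 8-bit grayscale
  cv::Ptr<cv::xfeatures2d::SURF> surf = cv::xfeatures2d::SURF::create(
    1500.0,  // hessianThreshold: much more selective than the default 100
    4,       // nOctaves
    3,       // nOctaveLayers
    false,   // extended: stay with 64-element descriptors
    true     // upright:  skip orientation computation (U-SURF)
  );

  std::vector<cv::KeyPoint> keypoints;
  cv::Mat descriptors;
  surf->detectAndCompute( img, cv::noArray(), keypoints, descriptors );

  // Each row of 'descriptors' is one 64-element CV_32F vector
}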

Additional functions provided by cv::xfeatures2d::SURF

cv::xfeatures2d::SURF also provides a bunch of methods to set and retrieve the algorithm’s parameters on the fly. Make sure that you do not alter parameters while processing a set of images. Find the optimal parameters and keep using them; otherwise, you may get incomparable descriptors.

The Star/CenSurE feature detector and cv::xfeatures2d::StarDetector

The Star features (see Figure 16-25) were developed originally for visual odometry, measuring the self-motion of a video camera from the image data alone [Agarwal08]. In this context, features such as Harris corners or FAST are desirable because they are highly localized. In contrast, features like SIFT, which rely on image pyramids, can become poorly localized in the original image space as you move higher up the pyramid. Unfortunately, features like the Harris corner or FAST are not scale invariant the way SIFT is, precisely because they lack a scale space search. The Star feature, also known as the Center Surround Extremum (or CenSurE) feature, attempts to provide the level of localization of Harris corners or FAST features while also providing scale invariance. There is no associated unique descriptor attached to the Star/CenSurE feature; in the paper in which it was first introduced, the authors used a variant of the “Upright SURF” or U-SURF feature descriptors.

Conceptually, the approach of Star is to compute all variants of some feature at all scales and select the extrema across scale and location. At the same time, the goal was to have the features be very fast to compute (recall that visual odometry has applications in robotic and many other real-time environments). The CenSurE feature addresses these competing goals with a two-stage process. The first stage is a very fast approximation to something like the difference of Gaussians (DoG) operation used by SIFT (and others), and the extraction of the local extrema of this operation. The second stage attempts to cull things that look too much like edges (as opposed to corners) using a scale-adapted version of the Harris measure.

Figure 16-25. Star features computed for the same vehicle from two slightly different orientations. At default parameters, 137 are found on the left image and 336 are found on the right. In both cases, the number on the automobile is about the same, with the additional features on the background accounting for the greater number in the second image. The found features on the vehicle are few compared to other methods, but you can easily see correspondences, suggesting the features are very stable

To understand how the fast DoG approximation is done, it is useful to think back to how SURF computed the box approximation to similar objects (Figure 16-22). In this case, however, the kernel being approximated is the difference of two similarly sized Gaussians, both rotationally symmetric in the image plane. As a result, the approximation is particularly simple.

The approximation used is square (Figure 16-26) and so it can be constructed at any size. It has only two regions, both of which can be computed via integral images. Thus, the entire evaluation of the feature amounts to three operations for the outer square, three for the inner square, two more to scale them, and one final operation to add the two terms: nine operations in all. This simplicity is why every point can be tested at many scales. The smallest feature that can be constructed this way has a side length of 4 and, in principle, every size above that can be computed. In practice, of course, it is natural to distribute the sizes of the features actually computed in an exponential, rather than linear, way. The result of this stage of the process is that, for every point in the image, we know the size of the DoG kernel for which the response was largest, as well as the magnitude of that response.

Figure 16-26. Box representation of the DoG kernel used by the CenSurE keypoint finder; for a particular size S, the center portion of the feature is of size S/2

Once the DoG kernel has been approximated everywhere, the next step is to threshold this value and then reject those points that are not local extrema. We do this by comparing to the usual 3 × 3 × 3 cube of neighbors in (x, y, scale)–space and keeping only those with the highest (or lowest) value in that 27-element set.

Finally, because features of this kind can still respond rather strongly to edges, the Star algorithm computes a scale-adapted Harris measure. The scale-adapted Harris measure is computed from a matrix very similar to the one we encountered in the Harris-Shi-Tomasi corners discussion earlier, with two important exceptions. The first is that the window over which the summations are done for the individual elements of the autocorrelation matrix is sized proportional to the scale of the feature. The second is that the autocorrelation matrix is constructed from the maximal responses to the CenSurE features rather than the image intensity. The actual test then performed is the test used by Harris, which compares the determinant of this matrix to the squared trace multiplied by the sensitivity constant.

In the OpenCV implementation, there is also a second test, which is similar to the scale-adapted Harris measure described except that it constructs an autocorrelation matrix from the size values associated with the response at each point in the window. This measure is called the binarized scale-adapted Harris measure. It is called “binarized” because for each point in the window, a value of 1, 0, or –1 is assigned based on the rate of change of the size of the maximal response at that point relative to its neighbors. Recall that the original Harris measure used a rate of change of image intensity, and the scale-adapted Harris measure used a rate of change of response to the DoG operator; the binarized measure uses the rate of change of the size of the maximal DoG operator. This binarized test is a way of quantifying the extent to which a particular point is a scale space extremum.

Keypoint finder

As was mentioned earlier, there is no specific feature descriptor extractor associated with the Star algorithm. The detector cv::StarDetector is derived directly from the cv::Feature2D base class. Here is the slightly abbreviated cv::StarDetector object definition:

// Constructor for the Star detector object:
//
class cv::xfeatures2d::StarDetector : public cv::Feature2D {

public:

  static Ptr<StarDetector> create(
    int maxSize                = 45,  // Largest feature considered
    int responseThreshold      = 30,  // Minimum wavelet response
    int lineThresholdProjected = 10,  // Threshold on Harris measure
    int lineThresholdBinarized = 8,   // Threshold on binarized Harris
    int suppressNonmaxSize     = 5    // Keep only best features
                                      //  in this size space
  );

  ...
};

The Star detector constructor takes five arguments. The first is the largest size of feature that will be searched for. The maxSize argument can be set to only one of a finite list of values: 4, 6, 8, 11, 12, 16, 22, 23, 32, 45, 46, 64, 90, or 128. For any value you might choose, all of the lower values in this list will also be checked.

The responseThreshold argument indicates the threshold applied to the convolution with the CenSurE kernel (Figure 16-26c) in order to find keypoint candidates. For all scales above the smallest, the kernel is normalized such that the threshold at all larger scales is equivalent to the given value on the smallest kernel (i.e., of size 4).

The next two arguments, lineThresholdProjected and lineThresholdBinarized, are thresholds associated with the scale-adapted Harris measure described earlier. The projected line threshold is effectively the inverse of the sensitivity constant in the Harris test on the response values; raising lineThresholdProjected will reject more features as lines. The lineThresholdBinarized value does something very similar, except that it is the sensitivity constant for the binarized scale-adapted Harris measure. This second parameter enforces the requirement that the CenSurE feature be a scale-space extremum. Both comparisons are made, and a keypoint candidate must pass both criteria in order to be accepted.

The final argument, suppressNonmaxSize, sets the size of the region within which a Star feature will be rejected if it is not the strongest feature.
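
Putting the pieces together, here is a minimal sketch (ours; the function name is our own) that finds Star/CenSurE keypoints with the default parameters and then, in the spirit of the original paper, attaches upright SURF descriptors to them:

#include <opencv2/core.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

void starWithUprightSurf( const cv::Mat& img ) {      // img: 8-bit grayscale
  cv::Ptr<cv::xfeatures2d::StarDetector> star =
    cv::xfeatures2d::StarDetector::create(
      45,   // maxSize
      30,   // responseThreshold
      10,   // lineThresholdProjected
      8,    // lineThresholdBinarized
      5     // suppressNonmaxSize
    );

  std::vector<cv::KeyPoint> keypoints;
  star->detect( img, keypoints );                     // keypoints only

  // Attach upright SURF (U-SURF) descriptors to the Star keypoints
  cv::Ptr<cv::xfeatures2d::SURF> usurf =
    cv::xfeatures2d::SURF::create( 100, 4, 3, false, true );
  cv::Mat descriptors;
  usurf->compute( img, keypoints, descriptors );
}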

The BRIEF descriptor extractor and cv::BriefDescriptorExtractor

BRIEF, which stands for Binary Robust Independent Elementary Features, is a relatively new algorithm for assigning a novel kind of feature to a keypoint (see Figure 16-27). BRIEF features  were introduced by Calonder et al. and for this reason are also often  known as Calonder features [Calonder10]. BRIEF does not locate keypoints; rather, it is used to generate descriptors for keypoints that can be located through any of the other available feature-detector algorithms.38

Figure 16-27. A visualization of pixel-to-pixel tests that collectively comprise a single BRIEF descriptor; each line connects a pair of pixels that are compared by the test

The basic concept behind the BRIEF descriptor is that a feature is described as a series of tests, each of which simply compares a single pixel in the area of the feature to some other single pixel, yielding a simple binary result (i.e., 0 or 1) based on which portion was brighter (Figure 16-27). The BRIEF descriptor is simply the result of n such tests arranged into a bit string. In order to keep the descriptor from being overly sensitive to noise, the BRIEF descriptor first smooths the image by convolution with a Gaussian kernel. Because the descriptors are binary strings, they not only can be computed quickly and stored efficiently, but they can also be compared to one another extremely efficiently.39

There are many ways to generate the actual pairs that will be matched to form the BRIEF descriptor. One of the best ways is simply to randomly generate all of the pairs by first drawing a point from a Gaussian distribution around the center of the feature, and then computing the second point by drawing from a Gaussian distribution around the first (with one half the standard deviation). The area within which the points are drawn (the total footprint of the feature) is called the patch size, while the standard deviation of the distribution from which the points are drawn is called the kernel size. In the case of Figure 16-27, the ratio of the kernel size to the patch size is approximately 1:5. The current OpenCV implementation fixes these sizes, but in principle they are tunable parameters of the algorithm. The number of such tests generated overall is typically 128, 256, or 512, but following the style of the original creators of the feature, it is traditional to refer to this size in terms of the number of bytes in the descriptor (i.e., 16, 32, or 64 bytes, respectively).
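
The following sketch (ours, not the library’s internal table generator) shows one way such a random pairing table could be produced; the patch size of 48 pixels is purely illustrative, and the 1:5 and one-half ratios follow the description above:

#include <opencv2/core.hpp>
#include <utility>
#include <vector>

// Sketch: generate BRIEF-style test pairs as offsets from the patch center.
std::vector< std::pair<cv::Point2f,cv::Point2f> > makeBriefPairs(
  int   nTests    = 256,    // 256 tests --> a 32-byte descriptor
  float patchSize = 48.f    // footprint of the feature (illustrative value)
) {
  cv::RNG rng( cv::getTickCount() );
  float sigma1 = patchSize / 5.f;     // "kernel size" is ~1/5 the patch size
  float sigma2 = sigma1 / 2.f;        // second point: half the spread
  std::vector< std::pair<cv::Point2f,cv::Point2f> > pairs;
  for( int i = 0; i < nTests; i++ ) {
    // First point: Gaussian around the patch center
    cv::Point2f p1( (float)rng.gaussian( sigma1 ), (float)rng.gaussian( sigma1 ) );
    // Second point: Gaussian around the first point
    cv::Point2f p2( p1.x + (float)rng.gaussian( sigma2 ),
                    p1.y + (float)rng.gaussian( sigma2 ) );
    pairs.push_back( std::make_pair( p1, p2 ) );
  }
  return pairs;
}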

Feature extractor

As was mentioned earlier, the BRIEF algorithm is specifically for extracting feature descriptors, and so the associated class is derived directly from the cv::Feature2D base class and implements only the descriptor-extraction part. The relevant parts of the class definition are the following:

class cv::xfeatures2d::BriefDescriptorExtractor : public cv::Feature2D {

public:

  static Ptr<BriefDescriptorExtractor> create(
    int bytes            = 32,         // can be 16, 32, or 64 bytes
    bool use_orientation = false       // true if point pairs are "rotated"
                                       // according to keypoint orientation
  );


  virtual int descriptorSize() const;  // number of bytes for features
  virtual int descriptorType() const;  // Always returns CV_8UC1
};

At this time, the only two user-configurable parameters of the BRIEF descriptor extractor are the number of bytes of information comprising the feature (equal to the total number of tests divided by eight) and the use_orientation flag, which plays a role complementary to that of the upright parameter of the SURF algorithm (use_orientation=false corresponds to the upright behavior). The same considerations apply here: when the features are unlikely to rotate much (for example, when you are recognizing road signs or stitching images), use_orientation should likely be left set to false. Otherwise, you may wish to set it to true.

In order to actually compute descriptors from an image and a set of keypoint locations, cv::xfeatures2d::BriefDescriptorExtractor uses the compute() interface defined in the cv::Feature2D base class.
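
For example, the following sketch (ours; the function name is our own) locates keypoints with the FAST detector and then attaches 32-byte BRIEF descriptors to them:

#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

void fastPlusBrief( const cv::Mat& img ) {            // img: 8-bit grayscale
  std::vector<cv::KeyPoint> keypoints;
  cv::Ptr<cv::FastFeatureDetector> fast = cv::FastFeatureDetector::create();
  fast->detect( img, keypoints );

  cv::Ptr<cv::xfeatures2d::BriefDescriptorExtractor> brief =
    cv::xfeatures2d::BriefDescriptorExtractor::create( 32 );   // 32 bytes
  cv::Mat descriptors;                   // one 32-byte CV_8U row per keypoint
  brief->compute( img, keypoints, descriptors );
}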

The BRISK algorithm

Not long after the introduction of the BRIEF descriptor, several new techniques appeared that use a similar notion of point-wise comparison as a means to produce a compact descriptor that could be compared quickly. The BRISK40 descriptor (see Figure 16-28), introduced by Leutenegger et al., attempts to improve on BRIEF in two distinct ways [Leutenegger11]. First, BRISK introduces a feature detector of its own (recall that BRIEF is only a method for computing descriptors). Second, the BRISK feature itself, though similar to BRIEF in principle, attempts to organize the binary comparisons in a manner that improves the robustness of the feature as a whole.

Figure 16-28. Here the BRISK feature detector is used on our two reference images. The left image contains 232 features, while the right contains 734. The more complex visible background in the right image contributes most of the new features, however, and the features on the car are relatively stable in both number and location

The feature-detector portion of BRISK is essentially based on the FAST-like AGAST41 detector, with the improvement that it attempts to identify a scale for the feature as well as an orientation. BRISK identifies the scale by first creating a scale space pyramid with a fixed number of scales (factors of two in size), and then computing a fixed number of intra-octaves per scale.42 The first step of the BRISK feature detector is to apply FAST (or, rather, AGAST) to find features at all of these scales. Once this is done, nonmaxima suppression is applied; that is, features whose score (called ρ0 in our discussion of FAST) is not the largest among all of their neighbors are removed; this leaves only the “maximal” features.

Once a list of features is found in this way, BRISK then goes on to compute the AGAST score at the corresponding locations in the images immediately larger and immediately smaller in scale (Figure 16-29). At this point, a simple quadratic is fit to the AGAST scores (as a function of scale), and the maximum point of that quadratic is taken to be the true scale of the BRISK feature. In this way, a continuous value is extracted and the BRISK feature is not forced to be associated with one of the discrete images computed in the pyramid. A similar interpolation method is applied in pixel coordinates to assign a subpixel location to the feature.

Figure 16-29. BRISK constructs the scale space using several octaves with “intra-octaves” in between. When a FAST feature is found at one scale, its FAST strength is computed at that scale and at the scale immediately above and below (left). These strengths are used to fit a quadratic curve and the scale at which the maximum score is expected is extrapolated from that curve (right)

In addition to a scale, BRISK features also have an orientation. To see how, we need to first understand how the sampling pattern used by BRISK differs from the random sampling pattern used by BRIEF. The BRISK descriptor is constructed by a series of rings around the center point. Each ring has some number Ki of sample points allocated to it, and each sample point is assigned a circular area of diameter equal to the circumference of the particular circle Ci divided by Ki (Figure 16-30). This area corresponds to the image being convolved by a Gaussian of that particular radius (σi = Ci/2Ki) and sampled at the indicated point.

Figure 16-30. The test points in the BRISK descriptor are identified in the figure by the small solid dots. The regions contributing to each test point are shown as circles around each point. Note that as the test points move out from the center of the descriptor the size of the associated regions is increased. The left image shows only the points and their associated regions (a). The center image shows all long-range pairings associated with one particular point (b). The right image shows all of the short-range pairings associated with that same point (c)

The brightness comparisons that comprise the bitwise descriptor (analogous to BRIEF) are computed between pairings in all of the circles. In particular, these pairings are divided into two subsets, called short- and long-range pairings. The short-range pairings are all the pairings between points that are less than some specific distance dmax apart, while the long-range pairings are all of the pairings between points that are more than a specific distance dmin apart. The short-range pairings form the descriptor, while the long-range pairings are used to compute a dominant orientation.

To see how this dominant orientation is constructed, first note that the BRISK descriptor, like the BRIEF descriptor, computes differences in intensity between point pairs. The BRISK descriptor, however, goes on to normalize those differences by the distance between the points, thus creating what is in effect a local gradient. By summing these local gradients over all of the long-range pairs, we compute an orientation that can be used to orient the descriptor. The short-range features are then computed relative to this orientation such that the descriptor, made by thresholding these short-range gradients, is effectively orientation independent. By tuning the number of points per circle and the value of dmax, we can give the descriptor any length. By convention, these values are chosen so as to make the descriptor 512 bits long, the same as the (typical) BRIEF descriptor.
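
For reference, our transcription of the local gradient defined in the original paper [Leutenegger11] is:

\[
g(\mathbf{p}_i, \mathbf{p}_j) =
  (\mathbf{p}_j - \mathbf{p}_i)\,
  \frac{I(\mathbf{p}_j, \sigma_j) - I(\mathbf{p}_i, \sigma_i)}
       {\lVert \mathbf{p}_j - \mathbf{p}_i \rVert^{2}},
\qquad
\mathbf{g} = \frac{1}{L} \sum_{(\mathbf{p}_i,\mathbf{p}_j) \in \mathcal{L}}
  g(\mathbf{p}_i, \mathbf{p}_j)
\]

Here I(p, σ) is the smoothed intensity at sample point p, the sum runs over the set of L long-range pairs, and the feature orientation is taken to be atan2(g_y, g_x).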

Keypoint finder and feature extractor

The cv::BRISK object inherits from the cv::Feature2D interface, and thus provides both a feature-finding capability and a descriptor-extraction capability. Here is the (abbreviated as usual) definition of the cv::BRISK object:

class cv::BRISK : public cv::Feature2D {

public:
  static Ptr<BRISK> create(
    int                  thresh       = 30,   // Threshold passed to FAST
    int                  octaves      = 3,    // N doublings in pyramid
    float                patternScale = 1.0f  // Rescale default pattern
  );

  int descriptorSize() const;                 // descriptor size
  int descriptorType() const;                 // descriptor type


   static Ptr<BRISK> create(                  // Compute BRISK features
    const vector<float>& radiusList,          // Radii of sample circles
    const vector<int>&   numberList,          // Sample points per circle
    float                dMax        = 5.85f, // Max distance for short pairs
    float                dMin        = 8.2f,  // Min distance for long pairs
    const vector<int>&   indexChange = std::vector<int>() // Unused
  );


};

The cv::BRISK::create() constructor method accepts three arguments: the AGAST threshold, the number of octaves, and an overall scale factor for the pattern. If you use this constructor, the locations of the sample points will be taken from a fixed lookup table inside of the library. The threshold argument, thresh, sets the threshold that will be used by the AGAST feature detector.43 The number of octaves, octaves, sets the number of full octaves. Remember that if you set this to some number N, then the total number of levels computed will be 2N – 1 (including the intra-octaves). The final patternScale argument applies an overall scale factor to the built-in pattern.

The overloaded cv::BRISK::detectAndCompute() method implements the usual feature detection and descriptor extraction, which are inherited from the cv::Feature2D interface.

Additional functions provided by cv::BRISK

In addition to the preceding methods, which you would expect from any function that inherits from cv::Feature2D, cv::BRISK has an extra extended constructor: cv::BRISK::create(). This function is used if you do not want to rely on the built-in pattern for the sample points, which is provided by the library. If you would rather build your own pattern, then you must provide a list of radii for the circles in the form of an STL-style vector of numbers for the radiusList argument. Similarly, you must provide an STL-style list of integers (of the same length as radiusList) to numberList that indicates the number of sample points to be used at each radius. You can then (optionally) specify dMax and dMin, the maximum distance for short-range pairings and the minimum distance for long-range pairings (dmax and dmin from the previous section). The final argument, indexChange, currently has no effect, and should best be omitted.
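
As a concrete illustration (ours; the function name is our own), here is how both create() overloads might be called; the radii and point counts in the custom pattern are invented purely for the sake of the example:

#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

void briskExample( const cv::Mat& img ) {             // img: 8-bit grayscale
  // Default pattern: AGAST threshold 30, 3 octaves, unscaled pattern
  cv::Ptr<cv::BRISK> brisk = cv::BRISK::create( 30, 3, 1.0f );

  std::vector<cv::KeyPoint> keypoints;
  cv::Mat descriptors;                                 // binary CV_8U rows
  brisk->detectAndCompute( img, cv::noArray(), keypoints, descriptors );

  // Custom pattern: these radii and point counts are invented for illustration
  std::vector<float> radiusList = { 0.0f, 2.9f, 4.9f, 7.4f, 10.8f };
  std::vector<int>   numberList = { 1, 10, 14, 15, 20 };
  cv::Ptr<cv::BRISK> customBrisk =
    cv::BRISK::create( radiusList, numberList, 5.85f, 8.2f );
}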

The ORB feature detector and cv::ORB

For many applications, feature detector speed is not only helpful, but essential. This is particularly true for tasks that are expected to run in real time on video data, such as augmented reality or robotics applications. For this reason, the SURF feature was developed with the goal of providing similar capability to SIFT, but at a much higher speed. Similarly, the ORB feature [Rublee11] was created with the goal of providing a higher-speed alternative to either SIFT or SURF (see Figure 16-31). The ORB feature uses a keypoint detector that is very closely based on FAST (which we saw earlier in this chapter), but uses a substantially different descriptor, based primarily on BRIEF. The BRIEF descriptor is augmented in ORB by an orientation computation that essentially gives ORB features the same sort of rotational invariance enjoyed by SIFT and SURF.44

The first stage of the ORB algorithm is to use FAST to locate a candidate set of features. FAST features are inexpensive to locate, but they have several shortcomings. One is that they tend to respond to edges as well as corners. In order to overcome this, the ORB algorithm computes the Harris corner measure for the located FAST points. This measure, you may recall, is a constraint on the eigenvalues of an autocorrelation matrix formed from pixels near the location of the feature.45 Using this process, we then construct an image pyramid so that a scale space search can be done. Because the Harris corner measure provides not only a test that rejects edge-like FAST responses, but also a ranking of feature quality, it can also be used to select the “best” features in an image. When some particular number of features is desired (which is commonly the case in practical applications), the features are ordered by the Harris corner measure and only the best are retained until the desired number is reached.

Figure 16-31. Two images of the same vehicle each produce 500 ORB features. An interesting characteristic of ORB is visible here, namely that if a corner is large in the image, many ORB features of differing sizes will be found on that same corner

An important contribution of the ORB algorithm relative to FAST (or the Harris corners) is the introduction of an orientation for the located keypoints. The orientation is assigned in a two-step process. First, the first moments (in x and y) are computed for the distribution of intensities inside a box around the feature. This box has a side length equal to twice the scale at which the feature was found (an approximation to a disk of radius given by the scale). The normalized first moments indicate a point, and the direction from the center of the feature to that point gives the orientation of the feature (Figure 16-32). For this reason, the ORB keypoint is also known as an oFAST feature, or oriented-FAST feature.

Figure 16-32. We compute the orientation of the ORB feature by analyzing the first moments (average intensity) of the image in a box whose size is given by the scale at which the FAST feature was found; the orientation of the feature is given by the direction vector from the center of the feature to the point indicated by those moments

Once the feature has been located and an orientation has been assigned, it is possible to compute a feature vector relative to that orientation. The resulting feature can then be used in a rotationally invariant way.46 The feature descriptor used by ORB, as mentioned earlier, is based on the descriptor used in the BRIEF algorithm, but the introduction of orientation information is an important differentiator of the ORB feature relative to its predecessor BRIEF. 

The second significant difference between the ORB and BRIEF features is that ORB’s authors produced their rotation-aware descriptor by analyzing a large data set of images and looking for test pairs that had particular properties: a high variance, an average result close to 0.5, and a minimal correlation with the other test pairs.47 In order to do this analysis, however, they converted each descriptor to a representation that located the test points relative to the orientation of the feature. This analysis was done once in the construction of the ORB descriptor by its authors, and is thereafter “built into” the descriptor. The data set they used was a well-known image data set containing many kinds of images.48

Keypoint finder and feature extractor

As with SIFT and SURF, the ORB algorithm is implemented in OpenCV via the cv::Feature2D interface.  Here is the (slightly abbreviated) definition of the cv::ORB object that implements the ORB algorithm:

class ORB : public Feature2D {
public:
  // the size of the signature in bytes
  enum { kBytes = 32, HARRIS_SCORE = 0, FAST_SCORE = 1 };

  static Ptr<ORB> create(
    int   nfeatures     = 500,    // Maximum features to compute
    float scaleFactor   = 1.2f,   // Pyramid ratio (greater than 1.0)
    int   nlevels       = 8,      // Number of pyramid levels to use
    int   edgeThreshold = 31,     // Size of no-search border
    int   firstLevel    = 0,      // Always '0'
    int   WTA_K         = 2,      // Pts in each comparison: 2, 3, or 4
    int   scoreType     = 0,      // Either HARRIS_SCORE or FAST_SCORE
    int   patchSize     = 31,     // Size of patch for each descriptor
    int   fastThreshold = 20      // Threshold for FAST detector
  );

  int descriptorSize() const;     // descriptor size (bytes), always 32
  int descriptorType() const;     // descriptor type, always CV_8U
};

The ORB constructor supports a somewhat intimidating list of arguments. Most of these can safely be left at their default values, and you will get satisfactory results. The first argument, nfeatures, is probably the one you are most likely to change; it simply determines the maximum number of keypoints you would like cv::ORB to return at a time.

Because the ORB detector uses an image pyramid, you must tell the detector both what the scale factor is between each layer in the pyramid and how many levels the pyramid is to have. You do not want to use a coarse factor-of-two type pyramid of the kind that would be created by cv::buildPyramid() because too many features would get lost in between. The default value for the scale factor is only 1.2. The variables that control the scale factor and the number of levels in the pyramid are scaleFactor and nlevels, respectively.

Because the keypoints are of a specific size in pixels, it is necessary to avoid the boundaries of the image. This distance is set by edgeThreshold. The size of the patch used for individual features can also be set with the patchSize argument. You will notice that they have the same default value of 31. If you change patchSize, you should make sure that edgeThreshold remains equal to or greater than patchSize.

The firstLevel argument allows you to set the scale pyramid such that the level whose scale is unity is not necessarily the first level. In effect, setting firstLevel to a value other than zero means that some number of images in the pyramid will actually be larger than the input image. This is meaningful with ORB features because their descriptors rely intrinsically on a smoothed version of the image anyhow. In most cases, however, making the firstLevel very high will produce features at the resulting smaller scales that are mainly driven by noise.

The argument WTA_K controls the tuple size, which in turn controls precisely how the descriptor is constructed from the binary tests. The case of WTA_K=2 is the scenario we described before in which each bit of each descriptor byte is a separate comparison between pairs of test points. These test points are drawn from a pregenerated list. In the case of WTA_K=3, the bits of the descriptor are set two at a time through a three-way comparison between sets of three test points from that list. Similarly, for WTA_K=4, the bits of the descriptor are set two at a time through a four-way comparison between sets of four test points.

Note

It is important to understand, however, that the pregenerated list of test points described in the beginning of this section is only really meaningful for a tuple size of 2, and for a feature size of 31. If you use features of any other size, the test points will be generated randomly (instead of the precomputed list). If you use a tuple size other than 2, the test points will also be arranged randomly into tuples of the correct length. (So if you use a feature size of 31, but a tuple size other than 2, you will be using the precomputed list of test points, but in a random manner, which is arguably no better than just using random test points altogether.)

The last argument to the cv::ORB constructor is the score type. The scoreType argument can be set to one of two values: cv::ORB::HARRIS_SCORE or cv::ORB::FAST_SCORE. The former case was described in the beginning of this section, in which a large number of features are found, all have their scores recomputed using the Harris metric, and only the best ones by that metric are kept. Unfortunately, this incurs compute cost in two ways. One is that the Harris metric takes time to compute, and the other is the need to compute more feature metrics in the first place.49 The alternative is to use the metric natively associated with FAST. The features are not as good, but it improves run speed slightly.
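
Here is a minimal sketch (ours; the function name is our own) that spells out all of the constructor arguments explicitly and then runs detection and description in one call:

#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

void orbExample( const cv::Mat& img ) {               // img: 8-bit grayscale
  cv::Ptr<cv::ORB> orb = cv::ORB::create(
    1000,                    // nfeatures
    1.2f,                    // scaleFactor
    8,                       // nlevels
    31,                      // edgeThreshold
    0,                       // firstLevel
    2,                       // WTA_K
    cv::ORB::HARRIS_SCORE,   // scoreType
    31,                      // patchSize
    20                       // fastThreshold
  );

  std::vector<cv::KeyPoint> keypoints;
  cv::Mat descriptors;                  // CV_8U, 32 bytes per keypoint
  orb->detectAndCompute( img, cv::noArray(), keypoints, descriptors );
}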

Additional functions provided by cv::ORB

cv::ORB also includes a bunch of get*/set* methods that can be used to retrieve or modify various algorithm parameters after the class instance is constructed.
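
For example (a sketch, assuming the usual get*/set* naming of the OpenCV 3 interface):

#include <opencv2/features2d.hpp>

void tuneOrb() {
  cv::Ptr<cv::ORB> orb = cv::ORB::create();
  orb->setMaxFeatures( 2000 );          // raise the keypoint budget on the fly
  orb->setFastThreshold( 10 );          // relax the underlying FAST test
  int n = orb->getMaxFeatures();        // reads back 2000
  (void)n;                              // silence unused-variable warnings
}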

The FREAK descriptor extractor and cv::xfeatures2d::FREAK

As with the BRIEF descriptor, the FREAK algorithm (see Figure 16-33) computes a descriptor only, and does not have a naturally associated keypoint detector. Originally introduced as an advancement on BRIEF, BRISK, and ORB, the FREAK descriptor is a biologically inspired descriptor that functions much like BRIEF, differing primarily in the manner in which it computes the areas for binary comparison [Alahi12]. The second, more subtle, distinction is that rather than making point comparisons of pixels on a uniformly smoothed image, FREAK uses comparison points that each correspond to a differently sized region of integration, with points farther from the center of the descriptor being assigned larger regions. In this way FREAK captures an essential feature of the human visual system, and thus derives its name, Fast Retina Keypoint.

Figure 16-33. A diagram representing the operation of the FREAK descriptor. Straight lines represent possible comparisons between vertices. The circles represent the “receptive field” associated with each vertex. Notice that the sizes of the receptive fields grow as the vertices get farther from the center of the descriptor. Figure taken with permission from [Alahi12]

To understand the FREAK feature, it is useful to first revisit the BRIEF feature, but from a slightly different perspective than when it was introduced earlier. Recall that the BRIEF feature involves bitwise comparisons between a large number of pairs of individual pixel intensities in the immediate proximity of the feature. To improve robustness to noise, the BRIEF algorithm first applies a Gaussian filter to the image before computing these pixel intensities.

Alternatively, we can think of each of the pixels used in the comparison as representing Gaussian weighted sums over the pixels in the input image corresponding to the immediate vicinity of the output pixel. Mathematically these two are exactly equivalent, but intuitively, this new picture naturally introduces the idea of the receptive field of a comparison pixel, and calls to mind the relationship between the ganglion cells of the retina and the set of individual photoreceptors to which that ganglion cell responds.

One substantial difference, however, between the receptive fields of the BRIEF feature and those of the human eye is that while the receptive fields in BRIEF are of uniform size (recall the Gaussian convolution), those of the human retina are increasingly large as one moves farther from the center of the retina. Inspired by this biological observation, the FREAK descriptor employs a series of receptive fields that increase in size with distance from the center of the feature (the circles in Figure 16-33). In the standard construction there are 43 such visual fields.

As was the case with the ORB descriptor, the creators of the FREAK descriptor then went on to use a learning technique to organize the possible comparisons between these receptive fields in order of their decreasing utility. In this way the comparisons with the greatest discriminative power (across a large training set of features) could be given priority over those with comparatively less discriminative power. Once this ordering was complete, pairs were retained only if they showed strong decorrelation with pairs of higher individual utility. In application, given a few dozen fields of varying sizes and thousands of possible comparisons, only the most useful 512 were found to be worth keeping.

Of these 512 comparisons, the FREAK descriptor organizes them into four sets of 128. Empirically, it was observed that each group appears to successively employ more of the small-scale receptive fields near the center, but that the most initially discriminative comparisons are between the large field areas. As a result, it is possible to first make comparisons only with these large field values and then, if enough similarity is found, to proceed to the finer areas to refine the match. Each set of 128 comparisons corresponds only to an XOR operation (and bit summation) across a single 16-byte value. This number is significant, as many modern processors can make such a comparison in a single cycle. Because it is possible to reject the overwhelming majority of possible matches with just this single comparison, the FREAK descriptor has been found to be extremely efficient.
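
The following sketch (ours, not the library’s matcher code; the function name and thresholds are our own) illustrates the cascade idea: compare only the first 16 bytes of two FREAK descriptors, and fall through to the full comparison only when that coarse distance is small enough:

#include <opencv2/core.hpp>

// Sketch of the cascade: 'd1' and 'd2' are single rows (64 bytes each) of a
// CV_8U FREAK descriptor matrix.
bool freakCascadeMatch(
  const cv::Mat& d1,
  const cv::Mat& d2,
  double coarseThresh,      // reject threshold on the first 128 bits
  double fullThresh         // accept threshold on all 512 bits
) {
  double coarse = cv::norm( d1.colRange( 0, 16 ), d2.colRange( 0, 16 ),
                            cv::NORM_HAMMING );
  if( coarse > coarseThresh ) return false;   // most candidates stop here
  double full = cv::norm( d1, d2, cv::NORM_HAMMING );
  return full <= fullThresh;
}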

Feature extractor

The FREAK feature is implemented in OpenCV by the class of the same name, cv::xfeatures2d::FREAK, which inherits from the cv::Feature2D interface and implements the descriptor extraction part.

class FREAK : public Feature2D {

public:

  static Ptr<FREAK> create(
    bool orientationNormalized = true,  // enable orientation normalization
    bool scaleNormalized       = true,  // enable scale normalization
    float patternScale         = 22.0f, // scaling of the description pattern
    int nOctaves               = 4,     // octaves covered by detected keypoints
    const vector<int>& selectedPairs = vector<int>() // user selected pairs
  );

  virtual int descriptorSize() const;   // returns the descriptor length in bytes
  virtual int descriptorType() const;   // returns the descriptor type

  ...
};

The FREAK constructor has a number of arguments that, as is often the case, can safely be left at their default values most of the time. The first two represent slight extensions to the published algorithm. As originally proposed, the FREAK features have a fixed orientation and scale. By setting orientationNormalized to true, you can ask the OpenCV FREAK object to build descriptors that are orientation invariant. This option reduces the discriminatory power of the keypoints, but in exchange you get orientation invariance.50

To understand the role of the scaleNormalized argument, you must first recall that (most) keypoints have a size, which is set by the keypoint detector that located them in the first place. If scaleNormalized is true, then the image patch around the feature will first be rescaled according to this keypoint size before the feature vector is computed.

The patternScale argument is used to uniformly rescale the FREAK receptor field pattern. It is unlikely you will want to change this. Closely related to the patternScale argument is the nOctaves argument. When the FREAK descriptor object is created, a lookup table is generated containing all of the necessary information to compute the FREAK descriptor at a range of scales. The combination of the patternScale and nOctaves arguments allows you to control the span of this lookup table (in terms of the size of the patterns). The exact scales generated (as implemented in the OpenCV source) are given by the following formula:

scale_i = patternScale · 2^( i · nOctaves / nbScales ),   i = 0, 1, ..., nbScales − 1

Here nbScales is the total number of scales, equal to 64 in the current implementation. Note that in every case the number of scales generated is the same, but the spacing between the scales is increased or reduced when nOctaves is increased or reduced.51

The final argument to the constructor is selectedPairs. This argument is for real experts, and allows you to override the lists of pairs used for comparison in the descriptor’s construction. If present, selectedPairs must be a vector of exactly 512 integers. These integers index into an internal table of all possible pairings of fields.52 In this way, you can give your own pairs as integers. It is unlikely that most users would ever use this feature; it is exposed only for serious power users—those who, after reading the original paper on FREAK, would conclude that it was necessary to repeat the learning process used by the creators of FREAK with some hope of maximizing the efficacy of the descriptor on their own unique data set.
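
A minimal usage sketch (ours; the function name is our own), pairing the BRISK keypoint finder with FREAK descriptors, might look like this:

#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/xfeatures2d.hpp>
#include <vector>

void freakExample( const cv::Mat& img ) {             // img: 8-bit grayscale
  std::vector<cv::KeyPoint> keypoints;
  cv::Ptr<cv::BRISK> detector = cv::BRISK::create();
  detector->detect( img, keypoints );                 // any keypoint finder works

  cv::Ptr<cv::xfeatures2d::FREAK> freak = cv::xfeatures2d::FREAK::create(
    true,    // orientationNormalized
    true,    // scaleNormalized
    22.0f,   // patternScale
    4        // nOctaves
  );
  cv::Mat descriptors;                                // CV_8U, 64 bytes per row
  freak->compute( img, keypoints, descriptors );
}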

Dense feature grids and the cv::DenseFeatureDetector class

There is one other feature detector, which is really a sort of “almost” feature detector in that it does not really detect features; it just generates them. The purpose of the cv::DenseFeatureDetector class53 is just to generate a regular array of features in a grid across your image (see Figure 16-34). In this case, nothing is done yet with these features (i.e., no descriptors are computed). Once these keypoints are generated, however, you can then compute any descriptor for them that you like. It turns out that, in many applications, it is not only sufficient but especially desirable to compute a descriptor everywhere, in the sense that “everywhere” is represented by a uniform grid of some density you choose.

Figure 16-34. Here dense features are generated on the two views of an automobile; in this case, there are three levels, with a starting spatial step of 50, a scale step of 2.0, and the feature step scaled along with the feature size.54

It is also often useful to be able to compute not only a uniform grid of features in image space, but also to sample that space at a uniform number of scales. cv::DenseFeatureDetector will also do this for you if you so desire.

Keypoint finder

The keypoint finder for the dense feature class is just derived from the cv::FeatureDetector interface, as it has no descriptor extraction functionality. Thus, the main thing we need to understand about it is how to use its constructor in order to configure it:

class cv::DenseFeatureDetector : public cv::FeatureDetector {

public:
  explicit DenseFeatureDetector(
    float initFeatureScale      = 1.f,  // Size of first layer
    int   featureScaleLevels    = 1,    // Number of layers
    float featureScaleMul       = 0.1f, // Scale factor for layers
    int   initXyStep            = 6,    // Spacing between features
    int   initImgBound          = 0,    // No-generate boundary
    bool  varyXyStepWithScale   = true, // if true, scale 'initXyStep'
    bool  varyImgBoundWithScale = false // if true, scale 'initImgBound'
  );

  cv::AlgorithmInfo* info() const;

  ...
};

The constructor for the dense feature detector has a lot of arguments because there are a lot of different ways you might want to generate feature grids. The first argument, initFeatureScale, sets the size of the first layer of features. The default value of 1.0 is almost certainly not what you want, but remember that the right size for features depends not only on the nature of your image, but also on the descriptor type you are going to use.

By default, the dense feature detector will generate a single grid of keypoints. You can generate a pyramid of such keypoints, however, by setting featureScaleLevels to any value larger than 1. Each grid generated after the first assigns its features a scale equal to initFeatureScale multiplied by one factor of featureScaleMul for each level above the first.55 Figure 16-35 shows a representation of the use of this detector.

Figure 16-35. cv::DenseFeatureDetector generates one or more grids of keypoints of varying scale

The spacing between the features is set by initXyStep. If featureScaleLevels is larger than one, then the spacing between features will, by default, also be scaled by one power of featureScaleMul for each level after the first. If varyXyStepWithScale is set to false, then only the scale value assigned to the features will change (and not their spacing).

Features will not be generated on the boundary of the image. You can increase the width of the area in which features will not be generated by setting initImgBound. In most cases, it makes sense to set initImgBound to be equal to initFeatureScale (i.e., such that generated features do not spill over the edges of the image). In order to have this boundary scale up in the case of multiple levels, however, you must set varyImgBoundWithScale to true; otherwise, it will keep a constant size.
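
For concreteness, here is a minimal sketch of how the detector might be used with the 2.4-era interface shown in the preceding declaration (recall from the footnote that cv::DenseFeatureDetector is not present in OpenCV 3.0); the image filename is of course just a placeholder:

cv::Mat img = cv::imread( "scene.jpg", cv::IMREAD_GRAYSCALE );

cv::DenseFeatureDetector detector(
  4.0f,   // initFeatureScale: 4-pixel features on the first level
  3,      // featureScaleLevels: three grids
  2.0f,   // featureScaleMul: each level doubles the feature size
  12,     // initXyStep: 12-pixel spacing on the first level
  4,      // initImgBound: skip a 4-pixel border
  true,   // varyXyStepWithScale: spacing grows with each level
  false   // varyImgBoundWithScale: border stays fixed
);

vector< cv::KeyPoint > keypoints;
detector.detect( img, keypoints );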

Keypoint Filtering

It is a relatively common situation that you find yourself with a list of keypoints that needs to be pruned in some way. Sometimes this happens because the list is just too long, and you want to throw out some of the lower-quality ones until you get a manageable number. Other times, you need to remove duplicates or all of the keypoints that are outside of some region. OpenCV handles all of these sorts of tasks with an object called a keypoint filter. The keypoint filter provides a variety of methods you can use to cull lists of keypoints. These functions are heavily used internally by the implementations of various keypoint finders, but you will probably find them very useful as standalone tools in many of your applications as well.

The cv::KeyPointsFilter class

All keypoint filtering is done by methods of a single class: cv::KeyPointsFilter. The cv::KeyPointsFilter class is used internally by many of the keypoint detectors we have already encountered in order to filter by location (apply a mask) or to reduce a list of keypoints to some number of the “best” ones. The salient part of the definition of this class is as follows:

class cv::KeyPointsFilter {
public:
  static void runByImageBorder(
    vector< cv::KeyPoint >& keypoints,  // in/out list of keypoints
    cv::Size                imageSize,  // Size of original image
    int                     borderSize  // Size of border in pixels
  );
  static void runByKeypointSize(
    vector< cv::KeyPoint >& keypoints,  // in/out list of keypoints
    float                   minSize,    // Smallest keypoint to keep
    float                   maxSize  = FLT_MAX // Largest one to keep
  );
  static void runByPixelsMask(
    vector< cv::KeyPoint >& keypoints,  // in/out list of keypoints
    const cv::Mat&          mask        // Keep where mask is nonzero
  );
  static void removeDuplicated(
    vector< cv::KeyPoint >& keypoints   // in/out list of keypoints
  );
  static void retainBest(
    vector< cv::KeyPoint >& keypoints,  // in/out list of keypoints
    int                     npoints     // Keep this many
  );
};

The first thing you probably noticed is that all of these methods are static, so really cv::KeyPointsFilter is more of a namespace than an object. Each of the five filter methods takes a reference to an STL-style vector of cv::KeyPoint objects called keypoints. This is both the input and the output of the function (i.e., you will give the function your keypoints, and when you check, you will find some of them are now missing).

The cv::KeyPointsFilter::runByImageBorder() function removes all of the keypoints that are within borderSize of the edge of the image. You must also tell cv::KeyPointsFilter::runByImageBorder() how big the image was in the first place with the imageSize argument.

The cv::KeyPointsFilter::runByKeypointSize() function removes all of the keypoints that are either smaller than minSize or larger than maxSize.

The cv::KeyPointsFilter::runByPixelsMask() function removes all of the keypoints that are associated with zero-valued pixels in mask.

The cv::KeyPointsFilter::removeDuplicated() function removes all duplicate keypoints.

The cv::KeyPointsFilter::retainBest() function removes keypoints until the target number given by npoints is reached. Keypoints are removed in ascending order of their quality, as indicated by their response member.
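
As a quick illustrative sketch, the following sequence (assuming img is an image in which a detector has already found keypoints; the FAST call and the particular thresholds here are just for illustration) trims a keypoint list down to a manageable set:

vector< cv::KeyPoint > keypoints;
cv::FAST( img, keypoints, 20 );                                     // e.g., FAST with threshold 20

cv::KeyPointsFilter::runByImageBorder( keypoints, img.size(), 16 ); // drop points near the edge
cv::KeyPointsFilter::runByKeypointSize( keypoints, 8.f );           // drop very small keypoints
cv::KeyPointsFilter::removeDuplicated( keypoints );                 // drop exact duplicates
cv::KeyPointsFilter::retainBest( keypoints, 500 );                  // keep the 500 best by response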

Matching Methods

Once you have your keypoints, you will want to use them to do something useful. As we discussed early on, the two most common applications for keypoint methods are object recognition and tracking. In both cases, we saw that it was objects derived from the cv::DescriptorMatcher base class that would provide this functionality for us. In this section, we will look at the options available to us for doing this kind of matching.

At this time, there are essentially two different matching methods you can use. The first is brute force matching, which is the most basic and obvious choice—just compare everything in set A to everything in set B. The second is called FLANN, and is really an interface to a collection of methods for locating nearest neighbors.

Brute force matching with cv::BFMatcher

The (important part of the) declaration of the cv::BFMatcher class is shown here:

class cv::BFMatcher : public cv::DescriptorMatcher {

public:

  BFMatcher( int normType, bool crossCheck=false );

  virtual ~BFMatcher() {}

  virtual bool isMaskSupported() const { return true; }
  virtual Ptr<DescriptorMatcher> clone(
    bool emptyTrainData=false
  ) const;
  ...

};

The brute force matcher essentially implements the most straightforward solution to the matching problem. It takes each descriptor in the query set and attempts to match it with each descriptor in the training set (either the internal dictionary, or a set provided with the query set). The only thing you need to decide when you create a brute force matcher is what distance metric it will use in order to compute distances for comparison. The available options are shown in Table 16-2.

Table 16-2. Available metrics for the brute force matcher, with their associated formulas; the summations are over the dimensions of the feature vector
Metric           Function
NORM_L2          d(a, b) = sqrt( Σ ( a_i − b_i )² )
NORM_L2SQR       d(a, b) = Σ ( a_i − b_i )²
NORM_L1          d(a, b) = Σ | a_i − b_i |
NORM_HAMMING     d(a, b) = Σ ( a_i ≠ b_i ), i.e., the number of differing bits
NORM_HAMMING2    As NORM_HAMMING, but computed over two-bit blocks

It is worth noting that the member function isMaskSupported() always returns true. We will see that this is not the case for the FLANN matcher in the next section. Only the brute force matcher (at this time) supports the mask construct described when we discussed the cv::DescriptorMatcher base class.56

The final feature of the brute force matching object is what is called cross-checking. Cross-checking is turned on with the crossCheck argument to the cv::BFMatcher constructor. When cross-checking is enabled, a match between object i of the query set and object j of the training set will be reported only if both train[j] is query[i]’s closest neighbor in the training set and query[i] is train[j]’s closest neighbor in the query set. This is very useful for eliminating false matches, but costs additional time to calculate.
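
Putting this together, a typical construction and use of the brute force matcher might look like the following sketch (here queryDescriptors and trainDescriptors are assumed to hold binary descriptors, such as those produced by ORB or BRISK, one row per keypoint):

cv::BFMatcher matcher( cv::NORM_HAMMING, true );   // Hamming metric, cross-checking enabled

vector< cv::DMatch > matches;
matcher.match( queryDescriptors, trainDescriptors, matches );

// With crossCheck=true, only mutual nearest-neighbor pairs appear in 'matches'.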

Fast approximate nearest neighbors and cv::FlannBasedMatcher

The term FLANN refers to the Fast Library for Approximate Nearest Neighbor computation. OpenCV provides an interface to FLANN, which itself provides a variety of algorithms for finding (or at least approximately finding) the nearest neighbors of points in high-dimensional spaces. Conveniently, this is precisely what we need for descriptor matching. The common interface to FLANN is through the cv::FlannBasedMatcher object, which of course is derived from the cv::DescriptorMatcher base class:

class cv::FlannBasedMatcher : public cv::DescriptorMatcher {

public:

  FlannBasedMatcher(
    const cv::Ptr< cv::flann::IndexParams>&  indexParams
      = new cv::flann::KDTreeIndexParams(),
    const cv::Ptr< cv::flann::SearchParams>& searchParams
      = new cv::flann::SearchParams()
  );

  virtual void add( const vector<Mat>& descriptors );
  virtual void clear();
  virtual void train();
  virtual bool isMaskSupported() const;

  virtual void read( const FileNode& );     // Read from file node
  virtual void write( FileStorage& ) const; // Write to file storage

  virtual cv::Ptr<DescriptorMatcher> clone(
    bool emptyTrainData = false
  ) const;
  ...
};

Pretty much everything in the preceding declaration is what you should have expected by now. The one important feature is the constructor for cv::FlannBasedMatcher, which takes some special structures that are used to configure the matching. These arguments determine both which method the FLANN matcher will use and how that method will be parameterized. For example, the default value for the indexParams argument is new cv::flann::KDTreeIndexParams(). This tells the FLANN matcher that the index method it should use is kd-trees, as well as how many kd-trees to use (the number of trees is an argument to the cv::flann::KDTreeIndexParams() constructor, whose default happens to be 4). The searchParams argument is somewhat more generic, and plays a role analogous to cv::TermCriteria, but with a somewhat more general function. We will look at each of the indexing methods, and then return to the cv::flann::SearchParams object.

Linear indexing with cv::flann::LinearIndexParams

You can ask the FLANN library to do essentially the same thing that cv::BFMatcher would have done. This is usually useful for benchmark comparisons, but it can also be used to verify that other, more approximate methods are giving satisfactory results. The constructor for cv::flann::LinearIndexParams takes no arguments. By way of example, the following code generates a matcher that is essentially equivalent to cv::BFMatcher:

cv::FlannBasedMatcher matcher(
  new cv::flann::LinearIndexParams(), // Default index parameters
  new cv::flann::SearchParams()       // Default search parameters
);

KD-tree indexing with cv::flann::KDTreeIndexParams

The cv::flann::KDTreeIndexParams index parameters tell the FLANN matcher that you would like to use randomized kd-trees for matching. FLANN’s normal behavior is to assume that you want many such randomized trees to be generated as the indexing method, and then it will search them all when you attempt to match.57 The constructor for cv::flann::KDTreeIndexParams accepts one argument, which specifies the number of such trees to construct. The default value is 4, but it is common to make this number as large as 16 or so. The declaration for cv::flann::KDTreeIndexParams is the following:

struct cv::flann::KDTreeIndexParams : public cv::flann::IndexParams {
  KDTreeIndexParams( int trees = 4 ); // kd-tree needs number of trees
};

An example declaration of a matcher using cv::flann::KDTreeIndexParams is shown here:

cv::FlannBasedMatcher matcher(
  new cv::flann::KDTreeIndexParams( 16 ), // Index using 16 kd-trees
  new cv::flann::SearchParams()           // Default search parameters
);

Hierarchical k-means tree indexing with cv::flann::KMeansIndexParams

Another option for constructing the index is to use hierarchical k-means clustering.58 The advantage of k-means clustering is that it makes some intelligent use of the density of points in the database. Hierarchical k-means clustering is a recursive scheme by which the data points are first grouped into some number of clusters, and then each cluster is grouped into some number of subclusters, and so on. Obviously this is going to be more helpful when you have reason to believe that such structure exists in the data in the first place. The declaration for cv::flann::KMeansIndexParams is the following:

struct cv::flann::KMeansIndexParams : public cv::flann::IndexParams {

  KMeansIndexParams(
    int             branching    = 32,  // Branching factor for tree
    int             iterations   = 11,  // Max for k-means stage
    cv::flann::flann_centers_init_t centers_init
                                 = cv::flann::CENTERS_RANDOM, // Center initialization
    float           cb_index     = 0.2  // Probably don't mess with
  );

};

The default parameters for the cv::flann::KMeansIndexParams structure are all perfectly reasonable values, so in many cases you will just leave them alone. The first argument, branching, is the branching factor used in the hierarchical k-means tree; it determines how many clusters will be formed at each level of the tree. The next argument, iterations, determines how many iterations will be allowed to the k-means algorithm for the formation of each cluster.59 It can be set to -1 to force the clustering algorithm to run to completion at every node in the tree. The third argument, centers_init, controls how the cluster centers are initialized. Traditionally, random initialization was common (cv::flann::CENTERS_RANDOM), but in recent years it has been shown that in most cases a prudent choice of starting centers can give substantially better results. The two additional options cv::flann::CENTERS_GONZALES (Gonzales’s algorithm [Tou77]) and cv::flann::CENTERS_KMEANSPP (the so-called “k-means++” algorithm [Arthur07]) are available, with the latter being an increasingly standard choice. The final argument, cb_index (cluster boundary index), is really there for true experts in the FLANN library; it is used when the tree is being searched, and controls how the tree will be explored. It is best left at its default value, or set to 0 (which indicates that searches that exhaust one domain should move directly to the closest unexplored domain60).
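
Following the pattern of the earlier examples, a matcher using the k-means index might be constructed as in the following sketch (the exact spelling of the center-initialization constants can vary slightly between OpenCV versions, so treat that argument as illustrative):

cv::FlannBasedMatcher matcher(
  new cv::flann::KMeansIndexParams(
    32,                            // branching factor
    11,                            // k-means iterations per node
    cv::flann::CENTERS_KMEANSPP    // "k-means++" center initialization
  ),
  new cv::flann::SearchParams()    // Default search parameters
);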

Combining KD-trees and k-means with cv::flann::CompositeIndexParams

This method simply combines the random kd-tree and k-means methods described earlier and attempts to find the best matches found by either method. Because these are all approximate methods, there is always potential benefit to searching another way. (You could think of this as an extension of the logic behind having multiple random trees in the kd-tree method.) The constructor for the cv::flann::CompositeIndexParams object combines the arguments of the kd-tree and k-means methods:

struct cv::flann::CompositeIndexParams : public cv::flann::IndexParams {

  CompositeIndexParams(
    int             trees        = 4,   // Number of trees
    int             branching    = 32,  // Branching factor for tree
    int             iterations   = 11,  // Max for k-means stage
    cv::flann::flann_centers_init_t centers_init
                                 = cv::flann::CENTERS_RANDOM, // Center initialization
    float           cb_index     = 0.2  // Usually leave as-is
  );

};
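
A matcher using the composite index can be constructed in the same way; the following sketch uses eight kd-trees and the default k-means settings:

cv::FlannBasedMatcher matcher(
  new cv::flann::CompositeIndexParams( 8, 32, 11 ), // 8 kd-trees, branching 32, 11 iterations
  new cv::flann::SearchParams()                     // Default search parameters
);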

Locality-sensitive hash (LSH) indexing with cv::flann::LshIndexParams

Another, very different method of indexing the space of known objects is to attempt to use hash functions to map similar objects into the same buckets. To the extent that this can be done, these hash functions can be used to generate a list of candidate objects very quickly, which can then be evaluated and compared to one another. This technique is called locality-sensitive hashing (LSH). The variation of LSH implemented in OpenCV as part of the FLANN library was first proposed by Lv et al. [Lv07]:

struct cv::flann::LshIndexParams : public cv::flann::IndexParams {

  LshIndexParams(
    unsigned int table_number,      // Number of hash tables to use
    unsigned int key_size,          // key bits, usually '10' to '20'
    unsigned int multi_probe_level  // Best to just set this to '2'
  );

};

The first argument, table_number, is the number of actual hash tables to use. This is typically on the order of tens of tables, with values from 10 to 30 being reasonable. The second argument, key_size, is the size of the hash key (in bits); this is typically between 10 and 20. The last argument, multi_probe_level, controls how neighboring buckets are searched; it is part of what distinguishes the multiprobe algorithm (cited earlier) from previous LSH implementations. The recommended value for multi_probe_level is 2. If it is set to 0, the algorithm will degenerate into non-multiprobe LSH.

Note

LSH indexing in FLANN works only for binary features (using Hamming distances); it should not be applied to other distance metrics.
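
With that restriction in mind, an LSH-based matcher for binary descriptors (such as ORB, BRISK, or FREAK) might be constructed as in this sketch, using values in the recommended ranges:

cv::FlannBasedMatcher matcher(
  new cv::flann::LshIndexParams(
    20,     // table_number: 20 hash tables
    15,     // key_size: 15-bit hash keys
    2       // multi_probe_level: recommended value
  ),
  new cv::flann::SearchParams()     // Default search parameters
);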

Automatic index selection with cv::flann::AutotunedIndexParams

You can also ask OpenCV/FLANN to attempt to identify for you what the best indexing scheme is. Needless to say, this will take a while. The basic idea behind this approach is that you set a target precision, which is the fraction of nearest-neighbor searches that you would like to return the exact correct result. Of course, the higher you make this, the more difficult it will be for the algorithm to find an indexing scheme that can deliver, and the longer it will take to actually generate a full index for all of your data of that type:

struct cv::flann::AutotunedIndexParams : public cv::flann::IndexParams {

  AutotunedIndexParams(
    float target_precision = 0.9,    // Percentage of searches required
                                     //  to return an exact result
    float build_weight     = 0.01,   // Priority for building fast
    float memory_weight    = 0.0,    // Priority for saving memory
    float sample_fraction  = 0.1     // Fraction of training data to use
  );

};

The target precision is set by the argument target_precision. When you create a FLANN-based matcher using cv::flann::AutotunedIndexParams, you will also have to indicate how important it is to you that the index builds quickly; this is controlled by the build_weight argument. If you do not care much how long it takes to build the index, so long as your queries are quick, you can set this to a very small value (such as the default of 0.01). If you need to build your index often, you will want to set this number higher. Similarly, memory_weight controls the priority you want to put on minimizing the amount of memory consumed by the indexes. The default value of this parameter is 0, which means you don’t care.

Finally, there is the question of how much of the training data to actually use in this search. This is controlled by the sample_fraction argument. Clearly, if you make this fraction too big, it will take an enormous amount of time to find a satisfactory solution. On the other hand, if you make it too small, the observed performance on your full data set may be much worse than you saw when you generated the index. For large data sets, the default value of 0.1 is generally found to be a good choice.
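
A matcher with an autotuned index might then be constructed as in the following sketch; the particular numbers here are just plausible choices:

cv::FlannBasedMatcher matcher(
  new cv::flann::AutotunedIndexParams(
    0.95,   // target_precision: 95% of searches should return the exact neighbor
    0.01,   // build_weight: index build time is not a priority
    0.0,    // memory_weight: memory use is not a priority
    0.1     // sample_fraction: tune using 10% of the training data
  ),
  new cv::flann::SearchParams()     // Default search parameters
);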

FLANN search parameters and cv::flann::SearchParams

In addition to the previous arguments used for the indexParams argument, the cv::FlannBasedMatcher constructor requires an object of type cv::flann::SearchParams. This is a straightforward structure that controls some of the matcher’s general behavior. It has the following simple definition:

struct cv::flann::SearchParams : public cv::flann::IndexParams {

  SearchParams(
    int   checks = 32,   // Limit on NN candidates to check
    float eps    = 0,    // (Not used right now)
    bool  sorted = true  // Sort multiple returns if 'true'
  );

};

The parameter checks is used by the kd-tree and k-means algorithms differently, but in each case it essentially limits the number of nearest neighbor candidates that are evaluated in the attempt to find the truly nearest neighbor(s). The eps parameter is currently not used.61 The sorted parameter indicates that, in the case of searches that can return multiple hits (e.g., radius search), the hits returned should be in ascending order of distance from the query point.62
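
To tie the pieces together, the following sketch constructs a kd-tree-based matcher with a larger search budget and asks for the two best hits per query descriptor (queryDescriptors and trainDescriptors are assumed to be vector-valued descriptors such as SIFT or SURF, one row per keypoint):

cv::FlannBasedMatcher matcher(
  new cv::flann::KDTreeIndexParams( 8 ),  // 8 randomized kd-trees
  new cv::flann::SearchParams( 64 )       // evaluate up to 64 candidates per query
);

vector< vector<cv::DMatch> > knnMatches;
matcher.knnMatch( queryDescriptors, trainDescriptors, knnMatches, 2 ); // two best hits each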

Displaying Results

Now that you can compute all of these feature types and compute matches between them from one image to another, the next logical thing to want to do is to actually display the keypoints and matches on the screen. OpenCV provides one function for each of these tasks.

Displaying keypoints with cv::drawKeypoints

void cv::drawKeypoints(
  const cv::Mat&                image,      // Image to draw keypoints
  const vector< cv::KeyPoint >& keypoints,  // List of keypoints to draw
  cv::Mat&                      outImg,     // image and keypoints drawn
  const Scalar&                 color    = cv::Scalar::all(-1),
  int                           flags    = cv::DrawMatchesFlags::DEFAULT
);

Given an image and a set of keypoints, cv::drawKeypoints will annotate all of the keypoints onto the image and put the result in outImg. The color of the annotations is set with color; the special value cv::Scalar::all(-1) indicates that the keypoints should each be drawn in a different color. The flags argument can be set to cv::DrawMatchesFlags::DEFAULT or to cv::DrawMatchesFlags::DRAW_RICH_KEYPOINTS. In the former case, keypoints will be visualized as small circles. In the latter, they will be visualized as circles with radius equal to their size member (if available) and an orientation line given by their angle member (if available).63
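
As a brief sketch, drawing the keypoints found in an image img with rich annotations might look like this:

cv::Mat outImg;
cv::drawKeypoints(
  img,                                        // input image
  keypoints,                                  // keypoints found in 'img'
  outImg,                                     // annotated copy of the image
  cv::Scalar::all( -1 ),                      // a different color for each keypoint
  cv::DrawMatchesFlags::DRAW_RICH_KEYPOINTS   // draw size and orientation as well
);
cv::imshow( "Keypoints", outImg );
cv::waitKey();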

Displaying keypoint matches with cv::drawMatches

Given a pair of images, the associated keypoints, and a list of cv::DMatch objects generated by one of the matchers, cv::drawMatches() will compose an image for you containing the two input images, visualizations of all of the keypoints (in the style of cv::drawKeypoints()), and indicate for you which keypoints in the first image matched which keypoints in the second image (Figure 16-36). There are two variations of cv::drawMatches(); they are the same except for two arguments.

Figure 16-36. Here SIFT keypoints and their descriptors are extracted for two views of the same automobile. Matches were generated using a FLANN-based matcher and visualized with cv::drawMatches(). Matches that were found are marked in white with a line connecting the corresponding features. Keypoints in either image that were not found to have a match are indicated in black

void cv::drawMatches(
  const cv::Mat&                img1,         // "Left" image
  const vector< cv::KeyPoint >& keypoints1,   // Keypoints (lt. img)
  const cv::Mat&                img2,         // "Right" image
  const vector< cv::KeyPoint >& keypoints2,   // Keypoints (rt. img)
  const vector< cv::DMatch >&   matches1to2,  // List of matches
  cv::Mat&                      outImg,       // Result image
  const cv::Scalar&             matchColor       = cv::Scalar::all(-1),
  const cv::Scalar&             singlePointColor = cv::Scalar::all(-1),
  const vector<char>&           matchesMask      = vector<char>(),
  int                           flags
                                      = cv::DrawMatchesFlags::DEFAULT
);

void cv::drawMatches(
  const cv::Mat&                img1,         // "Left" image
  const vector< cv::KeyPoint >& keypoints1,   // Keypoints (lt. img)
  const cv::Mat&                img2,         // "Right" image
  const vector< cv::KeyPoint >& keypoints2,   // Keypoints (rt. img)
  const vector< vector<cv::DMatch> >& matches1to2, // List of lists
                                                   // of matches
  cv::Mat&                      outImg,       // Result image
  const cv::Scalar&             matchColor       // Matches and connecting lines
                                       = cv::Scalar::all(-1),
  const cv::Scalar&             singlePointColor // Unmatched keypoints
                                       = cv::Scalar::all(-1),
  const vector< vector<char> >& matchesMask      // Draw only nonzero entries
                                       = vector< vector<char> >(),
  int                           flags  = cv::DrawMatchesFlags::DEFAULT
);

In both cases the two images are supplied by the img1 and img2 arguments, while the corresponding keypoints are supplied by the keypoints1 and keypoints2 arguments. An argument that differs in the two cases is matches1to2. It has the same meaning in both versions of cv::drawMatches(), but in one case it is an STL vector of cv::DMatch objects, while in the other it is a vector of such vectors. The second form is just a convenience, which is useful when you want to visualize the response to many different match computations at once.

The results are placed in the image outImg. When the output is drawn, those features that have matches will be drawn in the color matchColor (along with a line connecting them), while those features that are not matched will be drawn in singlePointColor. The vector matchesMask indicates which matches should be visualized; only those matches for which matchesMask[i] is nonzero will be drawn. The variation of cv::drawMatches() that expects a vector of vectors for the matches also expects a vector of vectors for the matchesMask argument.

The final argument to cv::drawMatches() is flags. The flags argument can have any of four values, which can be combined (where relevant) with the OR operator.

If flags is set to cv::DrawMatchesFlags::DEFAULT, then the output image will be created for you in outImg, and the keypoints will be visualized as small circles (without additional size or orientation information).64

If flags contains cv::DrawMatchesFlags::DRAW_OVER_OUTIMG, then the output image will not be reallocated, but the annotations will be drawn onto it. This is useful when you have several sets of matches you would like to visualize in different colors; in this case, you can make multiple calls to cv::drawMatches() and use cv::DrawMatchesFlags::DRAW_OVER_OUTIMG for all of the calls after the first.

By default, keypoints that are not part of any match will be drawn in the color indicated by singlePointColor. If you would prefer to not have them drawn at all, you can set the flag cv::DrawMatchesFlags::NOT_DRAW_SINGLE_POINTS.

Finally, the flag cv::DrawMatchesFlags::DRAW_RICH_KEYPOINTS has the same function as in cv::drawKeypoints; it causes the keypoints to be visualized with scale and orientation information (as in Figure 16-36). 
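
As a final sketch, the following call draws rich keypoints for a match list while suppressing the unmatched points (img1, img2, their keypoints, and matches are assumed to come from earlier detection and matching steps):

cv::Mat outImg;
cv::drawMatches(
  img1, keypoints1,                  // "left" image and its keypoints
  img2, keypoints2,                  // "right" image and its keypoints
  matches,                           // vector<cv::DMatch> from a matcher
  outImg,                            // composed result image
  cv::Scalar::all( -1 ),             // random colors for matches
  cv::Scalar::all( -1 ),             // random colors for unmatched points
  vector<char>(),                    // empty mask: draw every match
  cv::DrawMatchesFlags::DRAW_RICH_KEYPOINTS
    | cv::DrawMatchesFlags::NOT_DRAW_SINGLE_POINTS
);
cv::imshow( "Matches", outImg );
cv::waitKey();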

Summary

We started this chapter by reviewing the basic concepts of subpixel corner location and of sparse optical flow, and then we discussed the essential role played by keypoints. From there we saw how OpenCV handles keypoints as a general concept, and looked at a wide array of keypoint-detection schemes implemented by the library. We also saw that the process of identifying keypoints was distinct from the process of characterizing them. This characterization was accomplished by descriptors. As with keypoint identification methods, there were many descriptor types, and we saw how OpenCV handles these descriptors as a general class.

Thereafter, we considered how keypoints and their descriptors could be matched in an efficient manner for object recognition or object tracking. We concluded this chapter by looking at useful functions in the library that allow us to easily visualize keypoints and their matches in the context of the images in which they were found. Note that there are more feature detectors and descriptors in xfeatures2d, mentioned in Appendix B.

Exercises

There are sample code routines included with OpenCV in its .../samples/cpp directory that demonstrate many of the algorithms discussed in this chapter. Use these examples in the following exercises:

  • matchmethod_orb_akaze_brisk.cpp (feature matching in samples/cpp)

  • videostab.cpp (feature tracking to stabilize video in samples/cpp)

  • video_homography.cpp (planar tracking in opencv_contrib/modules/xfeatures2d/samples)

  • lkdemo.cpp (optical flow in samples/cpp)

  1. The covariance Hessian matrix used in cv::goodFeaturesToTrack() is computed over some square region in the image set by block_size in that function.

    1. Conceptually, what happens when block size increases? Do we get more or fewer “good features”? Why?

    2. Dig into the lkdemo.cpp code, search for cv::goodFeaturesToTrack(), and try playing with the block_size to see the difference.

  2. Refer to Figure 16-2 and consider the function that implements subpixel corner finding, cv::findCornerSubPix().

    1. What would happen if, in Figure 16-2, the corner was twisted so that the straight dark-light lines formed curves that met in a point? Would subpixel corner finding still work? Explain.

    2. If you expand the window size around the twisted checkerboard’s corner point (after expanding the win and zero_zone parameters), does subpixel corner finding become more accurate or does it rather begin to diverge? Explain your answer.

  3. Modify matchmethod_orb_akaze_brisk.cpp to train on a planar object (a magazine or book cover, for instance) and track it with a video camera. Study and report how well it finds the correct keypoints using AKAZE, ORB, and BRISK features in the modified code.

  4. Run video_homography.cpp on the same planar pattern as in Exercise 3. Learn the pattern using the key control in the program and then track it, noting how the homography is used to generate a stable output. How many features must be found in order to compute the homography matrix H?

  5. Using what you learned in Exercise 3, test out videostab.cpp and describe how it works to stabilize video.

  6. Features and their descriptors can also be used for recognition. Take one top-down photo of three different book covers against a blank background, then take 10 different off-center photos of each book cover against varied backgrounds, and finally take 10 photos with no books in varied backgrounds. Modify matchmethod_orb_akaze_brisk.cpp to separately store the descriptors for each book. Then modify matchmethod_orb_akaze_brisk.cpp again to detect (as best you can) the correct book in the off-center photos and to indicate no book in the photos with no books. Report the results.

  7. Optical flow:

    1. Describe an object that would be better tracked by block matching than by Lucas-Kanade optical flow.

    2. Describe an object that would be better tracked by Lucas-Kanade optical flow than by block matching.

  8. Compile lkdemo.cpp. Set up a web camera to view a well-textured object (or use a previously captured sequence of a textured moving object). In running the program, note that r autoinitializes tracking, c clears tracking, and a mouse click will enter a new point or turn off an old point. Run lkdemo.cpp and initialize the point tracking by typing r. Observe the effects.

    1. Now go into the code and remove the subpixel point placement function cv::findCornerSubPix(). Does this hurt the results? In what way?

    2. Go into the code again and, in place of cv::goodFeaturesToTrack(), just put down a grid of points in an ROI around the object. Describe what happens to the points and why.

    Hint: part of what happens is a consequence of the aperture problem—given a fixed window size and a line, we can’t tell how the line is moving.

  9. Modify the lkdemo.cpp program to create a program that performs simple image stabilization for moderately moving cameras. Display the stabilized results in the center of a much larger window than the one output by your camera (so that the frame may wander while the first points remain stable).

1 In this chapter we restrict our attention primarily to the features in the main library that have the main library’s licensing terms. There are more “nonfree” features and feature detectors and descriptors in the xfeatures2d module (see Appendix B), as well as new experimental features. Of the nonfree features, we will cover only SIFT and SURF in this chapter due to their great historical, practical, and pedagogical importance.

2 Later in this chapter, we will look at many ways to compute keypoints, which are essentially a generalization of “corners.” In that section, we will discuss many keypoint finding algorithms in detail, including Harris’s algorithm. We will get into more detailed descriptions of the algorithms and these parameters there.

3 Later, we will encounter another autocorrelation matrix in the context of the inner workings of Harris corners. The two are unrelated, however.

4 Black and Anandan have created dense optical flow techniques [Black93; Black96] that are often used in movie production where, for the sake of visual quality, the movie studio is willing to spend the time necessary to obtain detailed flow information. These techniques are slated for inclusion in later versions of OpenCV (see Chapter 23).

5 The definitive description of Lucas-Kanade optical flow in a pyramid framework implemented in OpenCV is an unpublished paper by Bouguet [Bouguet04].

6 Of course, the window could be 3 × 3, 7 × 7, or anything you choose. If the window is too large, then you will end up violating the coherent motion assumption and will not be able to track well. If the window is too small, you will encounter the aperture problem again.

7 In older versions of the library, only single- or three-channel images could be used. The new implementation (as of v2.4) can handle images with arbitrary numbers of channels. This enables the use of textural descriptors or other dense descriptors for pixel tracking (so long as the descriptors can be compared using a Euclidean norm).

8 When we get to the details of the Harris algorithm later in this chapter, the meaning of this flag will become more clear.

9 In the next chapter, we will talk extensively about tracking (following things from one frame to another) and about object recognition (finding a thing in an image about which you have some information in a database of prior experience). A third very important case, which we will not talk about much in these chapters, is sparse stereo. In sparse stereo, one is interested in locating keypoints in two or more images taken at the same time from different points of view. Until we get to discussing stereo vision in Chapter 19, the only important fact you will need to keep in mind is that from the point of view of the library, sparse stereo will use the same methods and interfaces as tracking; that is, you will have two images in hand and want to find correspondences (matches) between the keypoints in each image.

10 In fact, this is the text from the first edition of the very book you are reading.

11 The exact meaning of rotational invariance is actually that the feature is invariant under rotation. Specifically, this means that when a rotation operation is applied to the underlying image, the feature descriptor is unchanged. Such invariances (also called symmetries) are of central importance when you are selecting or designing feature descriptors. Ideally, this includes not only rotations in the plane of the image, but also (at least small) three-dimensional rotations of the object that, from the point of view of the imager, will appear as affine transformations in the image plane.

12 For experts, this is a little subtler of a point than it might at first appear. Most, if not all, keypoints are actually extended objects with some number of pixels that contribute to them. For each keypoint type, however, there is a “center” that is defined in terms of that neighborhood (or, perhaps more accurately, the neighborhood is defined in terms of that center). In this sense the center of a keypoint is similar to the anchor point of a filter or morphological operator.

13 In the case of binary descriptors (we will encounter these later in the chapter), this method returns the number of bytes in the descriptor, not the number of bits. Generally, descriptorSize() returns the number of columns in the descriptor matrix that cv::DescriptorExtractor::compute() would return.

14 In all currently implemented cases, the descriptor type is single channel. The convention is to always “flatten” any channel-like elements of the descriptors into the descriptor’s overall length. As a result, a descriptor might be, for example, 10-dimensional for a grayscale image or 30-dimensional for a color image, but would never be a 10-dimensional array of three-channel objects.

15 This is a somewhat awkward piece of terminology, and is a bit confusing, as the word train is also used to refer to the process of digesting all of the images in the object recognition case into a dictionary that can subsequently be queried. We will use this terminology because it is what is used in the library, but where possible we will always say “training image” to make clear that this noun is distinct from the verb train.

16 Typically, these images contain individual objects that one hopes to be able to find by means of the appearance of their keypoints in subsequent images. For this reason, the keypoints provided as part of the keypoints argument should have their class_id fields set such that, for each object to be recognized, the class_id for the associated keypoints is distinct.

17 Recall that this typically means “one of the objects in your dictionary.”

18 The order always starts with the best match and works down to the kth-best (i.e., d1 ≤ d2 ≤ … ≤ dk).

19 It should be noted, however, that the term distance here is defined by the metric used by a particular matcher, and may or may not be the standard Euclidean distance between the vector representations of the descriptors.

20 Note that even though this second list is called the “train” list, it is not the list “trained” into the matcher with the add() method; rather, it is being used in place of the internal “trained” list.

21 These features were once associated with the name “Good Features To Track” (i.e., cvGoodFeaturesToTrack()) in older versions of the library. This is why the associated detector is now called cv::GFTTDetector, rather than something potentially more intuitive like cv::HarrisCornerDetector.

22 That is, a first-order Taylor approximation.

23 If there had been only one large eigenvalue, then this point would be something like an edge, in that motion along the edge would not seem to change the image, while motion perpendicular to the edge would change the image. If there were no large eigenvalues at all, this would have meant that you could displace the little window in any direction and nothing would happen at all; in other words, the image intensity is constant here.

24 Making this number smaller increases the sensitivity of the algorithm, so the value of 0.04 is on the more sensitive side.

25 Because the method of returning keypoints by the detect() method is to fill an STL vector of cv::KeyPoint objects, there is no real upper bound on how many keypoints you could ask for. In practice, however, it is often useful to limit the number of keypoints for the purpose of computational efficiency, or to bound computation time in downstream processing (particularly in real-time applications). In any case, the returned corners will be “best” corners found, in terms of the magnitude of the smaller eigenvalue of the autocorrelation matrix M(x, y).

26 As we will see as we investigate other more complex feature detectors in this section, there are many other possible approaches to blob detection. Many of them will appear as components of more complex algorithms, and we will look at how they work as we get to them. Difference of Gaussian (DoG), Laplacian of Gaussian (LoG), and Determinant of Hessian (DoH) are all examples of blob detection mechanisms.

27 Recall that the exact definitions of circularity, inertia (or inertia ratio), and convexity were covered previously in the algorithm description for the blob finder.

28 Unlike the open and free algorithms in cv::features2d, cv::xfeatures2d is suspected of having patent issues and so is relegated to a special opencv_contrib directory.

29 For the curious, the way the algorithm does this is to first rescale the pixels around the keypoint using the scale already determined. Then, in this scale-normalized image, Sobel derivatives are used in the x- and y-directions, and then converted to a polar form (magnitude and orientation). These derivatives, one for each point in the area of the feature, are then put into a histogram of orientations, each weighted by its magnitude. Finally, the maximum of the histogram is found, a parabola is fit to that maximum and its immediate neighbors, and the maximum of that parabola serves as an interpolated angle for the overall orientation of the feature.

30 In general, it has been found that SIFT works best with 128 element descriptors. Similarly, some other values used by the descriptor computation (such as the magnification, if you are a SIFT expert) have been found to have values that are essentially always left unchanged. In future implementations, some of these may be exposed if there is found to be value in doing so.

31 As was the case with SIFT, because SURF is a patented algorithm, it’s placed in the xfeatures2d module of the opencv_contrib repository as of the OpenCV 3.0 release.

32 The Hessian is normally understood to be the matrix of second-order derivatives. In this case, however, it is really the matrix of so-called “Gaussian second-order derivatives,” which are defined by L_xx(x, σ) = (∂²g(σ)/∂x²) * I(x) (and similarly for L_xy and L_yy), where g(σ) is a normalized Gaussian of size σ by which the image is convolved before the derivative is approximated.

33 Note that in the SIFT case the actual technique was to convolve with the two different Gaussian kernels and then subtract the results. In the SURF case the kernels are first subtracted and a single convolution is done with the differenced kernel. The two operations are equivalent, but the second operation is more natural and more efficient in the case of the box filter approximation used by SURF.

34 This procedure might seem quite convoluted, but note that the actual number of evaluations required to compute this orientation is actually very small. With only nine cells, and each cell requiring only six additions, the 81 points are computed in fewer than 500 operations.

35 Note that x and y as used here are in the rotated coordinate system of the feature descriptor, not the coordinate system of the image.

36 For the car images in Figure 16-21, the Hessian threshold was empirically adjusted to give a similar number of features to the number found by SIFT in Figure 16-11. For comparison, the OpenCV default value of 100 would have generated 2,017 and 2,475 features, respectively, for the left and right images in Figure 16-21.

37 It is worth noting here that unlike SIFT, which actually reduces the size of the image with each octave, SURF is instead increasing the size of the kernels with which it is convolving. In addition, the step between “adjacent” evaluations of the kernels is also being increased as the kernels themselves are enlarged. As a result, because the kernels have fixed cost regardless of size (remember the integral image trick?), the cost of evaluating higher octaves decreases rapidly.

38 In the paper in which they were introduced, BRIEF descriptors were used with features found through U-SURF.

39 Many modern processors contain single-cycle instructions that will perform an XOR on a 256-bit word (e.g., the Intel SSE4™ instruction set).

40 “BRISK” does not appear to stand for anything; it is not an acronym. Rather, the name is essentially a play on words similar to BRIEF.

41 The feature detector called “AGAST” (Adaptive and Generic Corner Detection Based on the Accelerated Segment Test) [Mair10] is an improvement on FAST and a precursor to BRISK. It is mentioned here for completeness, but is not implemented separately in the OpenCV library.

42 In the original implementation there was just one intra-octave per scale, thus creating just some number N of scales, and N – 1 intra-octaves for a total of 2N – 1 images.

43 Deep inside of the BRISK code, BRISK calls cv::FAST and passes precisely this threshold value to the FAST algorithm.

44 The feature used in ORB is also referred to (by its creators as well as others) as an rBRIEF (or rotation-aware BRIEF) feature, and is closely related to the BRIEF feature we just encountered. In fact this is the origin of the name ORB, which is derived from “Oriented FAST and Rotation Aware BRIEF.”

45 You will recall from our earlier discussion that Harris originally proposed a slightly different constraint than the one later proposed by Shi and Tomasi. The cv::GoodFeaturesToTrackDetector() algorithm (by default) uses the latter, while ORB uses the former.

46 Recall from our earlier discussion of BRIEF that the BRIEF descriptor used a random array of “tests.” This array, however, had no ability to be “aligned” with the feature (e.g., in the manner of SIFT).

47 The astute reader will recognize that the first two properties are actually the same property for a distribution of binary variables.

48 This data set is known as the PASCAL-2006 data set, and is publicly available on the Internet. It is a well-known benchmark data set widely used in computer vision research and widely cited in computer vision papers. It is an open question, however, whether performance on any specialized type of data set might be affected by the choice to train the ORB feature set on PASCAL-2006 (rather generic) data rather than data from the specialized type.

49 As implemented, when you request some specific number of features, exactly twice that many will be retained using the FAST metric, and then for those, the Harris metric will be computed and the best half by that metric will be kept.

50 Recall that a similar trade-off was possible between the intrinsically oriented SURF features and the “upright” (U-SURF) features.

51 If you are wondering why the terminology for this is “octave,” remember that the cv::KeyPoint object has an element octave, which indicates the scale at which that keypoint was found. The nOctaves argument to cv::FREAK::FREAK() corresponds to this same octave value. nOctaves should be set to (at least) the maximum octave of the keypoint detector used, or to the max octave in the set of keypoints for which we want to compute a description.

52 The pairs are indexed from 0 to 902, starting at (1,0), (2,0), (2,1), and so on up to (42,41). There is no natural explanation for this particular ordering, other than that the authors of this module opted to enclose a loop running from index i = 1 to i < 43 around a loop running from j = 0 to j < i, and to construct all of the pairs as pair[k] = (i,j). What is most important, however, is that you will be unlikely to interact with these indices directly. If you are using these indices, chances are that OpenCV constructed them for you with the cv::FREAK::selectPairs() function.

53 DenseFeatureDetector is not (yet?) available in OpenCV 3.0.

54 This density of features is unrealistically low, but was chosen for clarity of illustration.

55 The default value for the scale multiplier is 0.1. It can be a number larger than 1 as well. It is essentially a matter of personal style as to whether you like to think of the grids as getting “finer” or “coarser” with each step.

56 Recall that these masks were the ones that did not mask the image, but rather indicated which features should and should not be compared during matching.

57 This is kind of a subtle point, but what happens is that there is one master list of nearest neighbors, and as each tree is descended, the nearest neighbor candidates are compared not only with what has been found so far on that tree, but with what has been found on all trees thus far.

58 We will get to the k-means algorithm when we get to the machine learning library in Chapter 20. For now, it is enough to know that it attempts to organize some large number of points into k distinct clusters.

59 Recall that k-means is an NP-hard problem, so most algorithms used to compute k-means clustering are approximate, and use an iterative scheme to find a locally optimal solution given some particular starting set of cluster center candidates.

60 If this parameter is nonzero, it indicates that the algorithm should take into account the overall size of the domain as part of its consideration of where to go next.

61 If you are an expert in the FLANN library, you will know that the eps parameter is used by the variant of kd-tree called KDTreeSingleIndex (which is not currently exposed in the OpenCV interface). In that context, it determines when a search down a particular branch can be terminated because a found point is considered close enough that a yet closer point is unlikely to be found.

62 The sorted argument has no effect on kNN search, because in that case the returned hits will already always be in ascending order.

63 Many of the figures appearing earlier in this chapter were made with cv::drawKeypoints. In these figures, DrawMatchesFlags::DRAW_RICH_KEYPOINTS was always used, but where scale or angle information was not present for the feature type, it was not displayed by cv::drawKeypoints. Figures 16-11, 16-13, and 16-17 are examples where only location data was available; only location and scale were available; and location, scale, and orientation data were available, respectively.

64 Actually, the value cv::DrawMatchesFlags::DEFAULT is numerically equal to zero, so combining it with the others, while legal, will be meaningless.
