Local feature representations

Unlike the previous features we used, local features are computed on a small region of the image. Mahotas supports computing a type of feature called Speeded Up Robust Features (SURF). These features are designed to be robust to rotation and illumination changes (that is, their values change only slightly when the illumination changes).

When using these features, we have to decide where to compute them. There are three possibilities that are commonly used:

  • Randomly
  • In a grid
  • Detecting interesting areas of the image (a technique known as keypoint detection or interest point detection)

All of these are valid and will, under the right circumstances, give good results. Mahotas supports all three. Using interest point detection works best if you have a reason to expect that the detected points will correspond to areas of importance in the image.
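As an aside, here is a minimal sketch of the random option (our own illustration, not part of the pipeline we build below): one simple way to approximate random sampling is to compute descriptors on a dense grid with surf.dense and then keep a random subset of them. It assumes that im is a greyscale image, as in the code that follows.

from mahotas.features import surf 
import numpy as np 

# compute descriptors on a regular grid, then keep a random subset of the rows 
dense_descriptors = surf.dense(im, spacing=16) 
rng = np.random.default_rng(0) 
keep = rng.choice(len(dense_descriptors), size=min(128, len(dense_descriptors)), replace=False) 
random_descriptors = dense_descriptors[keep] 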

We will be using the interest point method. Computing the features with mahotas is easy: import the right submodule and call the surf.surf function as follows:

from mahotas.features import surf 
descriptors = surf.surf(im, descriptor_only=True) 

The descriptor_only=True flag means that we are only interested in the local features themselves, and not in their pixel location, size, or orientation (the word descriptor is often used to refer to these local features). Alternatively, we could have used the dense sampling method, using the surf.dense function as follows:

from mahotas.features import surf 
descriptors = surf.dense(im, spacing=16) 

This returns the value of the descriptors computed on points that lie on a regular grid, at a distance of 16 pixels from each other (the spacing argument). Since the position of the points is fixed, the meta-information on the interest points is not very interesting and is not returned by default. In either case, the result (descriptors) is an n x 64 array, where n is the number of points sampled. The number of points depends on the size of your images, their content, and the parameters you pass to the functions. In this example, we are using the default settings, and we obtain a few hundred descriptors per image.
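As a quick sanity check (our own addition, not from the original pipeline), you can inspect the shape of the returned array directly:

print(descriptors.shape)   # (n, 64); n varies from image to image 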

We cannot directly feed these descriptors to a support vector machine, logistic regressor, or similar classification system. In order to use the descriptors from the images, there are several solutions. We could just average them, but the results of doing so are not very good as they throw away all location-specific information. In that case, we would have just another global feature set based on edge measurements.
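For comparison, the averaging approach we just dismissed would look something like the following (a minimal sketch; we will not use it):

# collapse all the local descriptors of one image into a single 64-dimensional vector; 
# this throws away how often and where each local pattern occurs 
averaged_feature = descriptors.mean(axis=0) 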

The solution we will use here is the bag of words model. It was first published in 2004, but it is one of those obvious-in-hindsight ideas: it is very simple to implement and achieves very good results.

It may seem strange to speak of words when dealing with images. It may be easier to understand if you imagine that you have not written words, which are easy to distinguish from each other, but spoken audio. Each time a word is spoken, it will sound slightly different, and different speakers will have their own pronunciation. Thus, a word's waveform will not be identical every time it is spoken. However, by using clustering on these waveforms, we can hope to recover most of the structure, so that all the instances of a given word end up in the same cluster. Even if the process is not perfect (and it will not be), we can still talk of grouping the waveforms into words.

We perform the same operation with image data: we cluster together similar-looking regions from all images and call these visual words.

The number of words used does not usually have a big impact on the final performance of the algorithm. Naturally, if the number is extremely small (10 or 20, when you have a few thousand images), then the overall system will not perform well. Similarly, if you have too many words (many more than the number of images, for example), the system will also not perform well. However, in between these two extremes, there is often a very large plateau, where you can choose the number of words without a big impact on the result. As a rule of thumb, if you have many images, a value such as 256, 512, or 1,024 should give you a good result.

We are going to start by computing the features as follows:

alldescriptors = [] 
for im in images: 
    im = mh.imread(im, as_grey=True) 
    im = im.astype(np.uint8) 
    alldescriptors.append(surf.surf(im, descriptor_only=True)) 
# get all descriptors into a single array 
concatenated = np.concatenate(alldescriptors) 

Now, we use k-means clustering to obtain the centroids. We could use all the descriptors, but we are going to use a smaller sample for extra speed. We have several million descriptors, and it would not be wrong to use them all; however, it would require much more computation for little extra benefit. The sampling and clustering are shown in the following code:

# use only every 64th vector 
concatenated = concatenated[::64] 
from sklearn.cluster import KMeans 
k = 256 
km = KMeans(k) 
km.fit(concatenated) 

After this is done (which will take a while), the km object contains information about the centroids. We now go back to the descriptors and build feature vectors as follows:

sfeatures = [] 
for d in alldescriptors: 
    c = km.predict(d) 
    sfeatures.append(np.bincount(c, minlength=256)) 
# build single array and convert to float 
sfeatures = np.array(sfeatures, dtype=float) 

The end result of this loop is that sfeatures[fi, fj] is the number of times that image fi contains visual word fj. The same could have been computed faster with the np.histogram function, but getting the arguments just right is a little tricky. We convert the result to floating point as we do not want integer arithmetic (with its rounding semantics).
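For reference, the np.histogram equivalent might look like this (our own sketch; the bin edges have to be chosen carefully to reproduce the bincount result):

# one bin per visual word: the edges 0, 1, ..., 256 give exactly 256 bins 
counts, _ = np.histogram(c, bins=np.arange(256 + 1)) 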

The result is that each image is now represented by a single array of features, all of the same size (the number of clusters, which in our case is 256). Therefore, we can use our standard classification methods as follows:

scores = model_selection.cross_val_score( 
   clf, sfeatures, labels, cv=cv) 
print('Accuracy: {:.1%}'.format(scores.mean())) 
Accuracy: 62.4% 

This is worse than before! Have we gained nothing?

In fact, we have, as we can combine all features together to obtain 76.7 percent accuracy, as follows:

allfeatures = np.hstack([ifeatures, sfeatures]) 
scores = model_selection.cross_val_score( 
   clf, allfeatures, labels, cv=cv) 
print('Accuracy: {:.1%}'.format(scores.mean())) 
Accuracy: 76.7% 

This is the best result we have, better than any single feature set. This is due to the fact that the local SURF features are different enough to add new information to the global image features we had before and improve the combined result.
