What is a visual dictionary?

We will use the Bag of Words model to build our object recognizer. In this model, each image is represented as a histogram of visual words. The visual words are the N centroids obtained by clustering the features of all the keypoints extracted from the training images, and together they form the visual dictionary. The pipeline is shown in the image that follows:

[Figure: the Bag of Words pipeline]

From each training image, we detect a set of keypoints and extract a feature for each keypoint. Different images generally yield different numbers of keypoints, but to train a classifier, each image must be represented by a fixed-length feature vector. This feature vector is simply a histogram in which each bin corresponds to a visual word.

Once we have extracted the features from all the keypoints in the training images, we run K-Means clustering on them to obtain N centroids. This N is the length of the feature vector of a given image. Each image will now be represented as a histogram, where each bin corresponds to one of the N centroids. For simplicity, let's say that N is set to 4. Now, in a given image, we extract K keypoints (K here is the number of keypoints in that image, not the number of clusters, which is N). Out of these K keypoints, some will be closest to the first centroid, some will be closest to the second centroid, and so on. So, we build a histogram by counting, for each centroid, the keypoints that are closest to it. This histogram becomes our feature vector. This process is called vector quantization.
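
The clustering step can be sketched in plain Python. This is a minimal illustration, not the implementation used later in the chapter: the function names `nearest_centroid` and `build_visual_dictionary`, the fixed iteration count, and the 2-D toy descriptors are all assumptions made for the example. Real descriptors have many more dimensions, and in practice you would likely use a library K-Means rather than this hand-written loop.

```python
import math
import random


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def build_visual_dictionary(descriptors, n_words, n_iterations=20, seed=0):
    """Cluster descriptors with plain K-Means and return n_words centroids."""
    rng = random.Random(seed)
    # Start from n_words descriptors picked at random.
    centroids = [list(p) for p in rng.sample(descriptors, n_words)]
    for _ in range(n_iterations):
        # Assignment step: group every descriptor under its nearest centroid.
        clusters = [[] for _ in range(n_words)]
        for point in descriptors:
            clusters[nearest_centroid(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


# Toy stand-in for the descriptors pooled from all training images:
# 2-D points scattered around four made-up locations.
rng = random.Random(42)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
all_descriptors = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in true_centers
    for _ in range(50)
]

visual_words = build_visual_dictionary(all_descriptors, n_words=4)
print(len(visual_words))  # N centroids, so N histogram bins per image
```

The returned list plays the role of the visual dictionary: its length fixes the length of every image's feature vector, regardless of how many keypoints a particular image produces.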

To understand vector quantization, let's consider an example. Assume we have an image and we've extracted a certain number of feature points from it. Now our goal is to represent this image in the form of a feature vector. Consider the following image:

[Figure: feature points and four centroids plotted in feature space]

As you can see, we have 4 centroids. Bear in mind that the points in the figures are plotted in feature space; they are not the geometric locations of those feature points in the image. They are drawn this way in the preceding figure so that they are easy to visualize. Points from many different geometric locations in an image can be close to each other in feature space. Our goal is to represent this image as a histogram, where each bin corresponds to one of these centroids. This way, no matter how many feature points we extract from an image, it will always be converted to a fixed-length feature vector. So, we "round off" each feature point to its nearest centroid, as shown in the next image:

[Figure: each feature point assigned to its nearest centroid]

If you build a histogram for this image, it will look like this:

[Figure: histogram for this image, one bin per centroid]
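
The "rounding off" step and the resulting histogram can be sketched as follows. The four centroids and the seven descriptors are made-up 2-D values chosen for illustration, and `quantize` is a hypothetical helper name, not a function from any library.

```python
import math

# Four hypothetical visual words (centroids) in a 2-D feature space.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]


def quantize(descriptors, centroids):
    """Count how many descriptors fall closest to each centroid."""
    histogram = [0] * len(centroids)
    for point in descriptors:
        distances = [math.dist(point, c) for c in centroids]
        histogram[distances.index(min(distances))] += 1
    return histogram


# Made-up descriptors for one image (K = 7 keypoints).
image_descriptors = [
    (0.5, 1.0), (1.2, -0.3), (-0.8, 0.4),   # near centroid 0
    (9.1, 0.7), (10.4, -1.1),               # near centroid 1
    (0.3, 9.5),                             # near centroid 2
    (9.8, 10.2),                            # near centroid 3
]

print(quantize(image_descriptors, centroids))  # [3, 2, 1, 1]
```

The output always has one entry per centroid, so an image with 7 keypoints and an image with 700 keypoints both end up as vectors of length 4.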

Now, if you consider a different image with a different distribution of feature points, it will look like this:

[Figure: feature points of a second image with a different distribution]

The clusters would look like the following:

[Figure: clusters for the second image]

The histogram would look like this:

[Figure: histogram for the second image]

As you can see, the histograms are very different for the two images, even though the points in each seem to be randomly distributed. This is a very powerful technique, and it's widely used in computer vision and signal processing. There are many different ways to do this, and the accuracy depends on how fine-grained you want the quantization to be. If you increase the number of centroids, you will be able to represent the image better, thereby increasing the uniqueness of your feature vector. Having said that, you cannot keep increasing the number of centroids indefinitely. If you do, each bin receives too few feature points, and the histogram becomes too noisy and loses its discriminative power.
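
One way to see why too many centroids hurts is to quantize the same set of descriptors against dictionaries of increasing size. The sketch below is illustrative only: the descriptors are random 2-D points, and `grid_centroids` places centroids on an evenly spaced grid as a stand-in for real K-Means output, which is an assumption made to keep the example short.

```python
import math
import random


def quantize(descriptors, centroids):
    """Count how many descriptors fall closest to each centroid."""
    histogram = [0] * len(centroids)
    for point in descriptors:
        distances = [math.dist(point, c) for c in centroids]
        histogram[distances.index(min(distances))] += 1
    return histogram


def grid_centroids(per_side):
    """Evenly spaced centroids on the unit square (stand-in for K-Means output)."""
    step = 1.0 / per_side
    return [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(per_side) for j in range(per_side)]


rng = random.Random(0)
# 30 made-up descriptors for a single image, uniform on the unit square.
descriptors = [(rng.random(), rng.random()) for _ in range(30)]

for per_side in (2, 4, 16):
    histogram = quantize(descriptors, grid_centroids(per_side))
    empty = histogram.count(0)
    print(f"N={len(histogram):3d}  empty bins={empty:3d}  largest bin={max(histogram)}")
```

With 30 descriptors and N = 256, at least 226 bins must be empty, so the histogram mostly records which few bins happened to catch a point. A small shift in one descriptor can move it to a neighboring bin and change the vector, which is the kind of noise described above.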
