The curse of dimensionality

To better explain the curse of dimensionality and the problem of overfitting, let's walk through an example in which we have a set of images, each containing either a cat or a dog. We would like to build a model that can distinguish between the images with cats and the ones with dogs. Like the fish recognition system in Chapter 1, Data science - Bird's-eye view, we need to find explanatory features that the learning algorithm can use to distinguish between the two classes (cats and dogs). In this example, we can argue that color is a good descriptor for differentiating between cats and dogs, so the average red, average green, and average blue values of an image can be used as explanatory features to distinguish between the two classes.
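
As a rough illustration, the following Python sketch computes these three averages. NumPy and Pillow, the placeholder file path, and the normalization of channel values to the [0, 1] range are all assumptions made for this sketch, not part of the original example:

import numpy as np
from PIL import Image

def color_features(image_path):
    # Load the image, force RGB mode, and scale channel values to [0, 1]
    pixels = np.asarray(Image.open(image_path).convert('RGB'),
                        dtype=np.float64) / 255.0
    # Average each channel over all pixels: (avg_red, avg_green, avg_blue)
    avg_red, avg_green, avg_blue = pixels.mean(axis=(0, 1))
    return avg_red, avg_green, avg_blue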

The algorithm will then combine these three features in some way to form a decision boundary between the two classes.

A simple linear combination of the three features might look like the following:

if 0.5 * avg_red + 0.3 * avg_green + 0.2 * avg_blue > 0.6:
    return 'cat'
else:
    return 'dog'
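
Wrapped around the feature extraction sketched earlier, the rule becomes a toy classifier. Keep in mind that the weights and the 0.6 threshold are the illustrative values from the rule above, not parameters learned from data:

def classify(image_path):
    # Linear combination of the average channel values, then a threshold
    avg_red, avg_green, avg_blue = color_features(image_path)
    score = 0.5 * avg_red + 0.3 * avg_green + 0.2 * avg_blue
    return 'cat' if score > 0.6 else 'dog'

In practice, a learning algorithm would fit both the weights and the threshold from labeled training data rather than using hand-picked values like these.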

These descriptive features alone will not be enough to get a well-performing classifier, so we can decide to add more features that enhance the model's power to discriminate between cats and dogs. For example, we can capture the texture of the image by calculating the average edge or gradient intensity along both dimensions of the image, X and Y, as the sketch below illustrates. After adding these two features, the model's accuracy will improve.
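
This is only one plausible realization of the texture features just described, reusing the NumPy and Pillow imports from the earlier sketch; the grayscale conversion and np.gradient are choices made here, and an edge detector such as Sobel would serve equally well:

def texture_features(image_path):
    # Convert to grayscale and compute per-pixel gradients along each axis
    gray = np.asarray(Image.open(image_path).convert('L'), dtype=np.float64)
    grad_y, grad_x = np.gradient(gray)
    # Average absolute gradient intensity in the X and Y directions
    return np.abs(grad_x).mean(), np.abs(grad_y).mean()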

We can make the model/classifier even more accurate by adding more and more features based on color, texture histograms, statistical moments, and so on. We can easily add a few hundred such features to enhance the model's predictive power. But, counter-intuitively, the results will get worse once the number of features grows beyond some limit. You'll understand this better by looking at Figure 1:

Figure 1: Model performance versus number of features

Figure 1 shows that the classifier's performance improves as the number of features increases, until we reach the optimal number of features. Adding more features while the training set stays the same size then degrades the classifier's performance.
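
The shape of that curve can be reproduced with a small synthetic experiment. The sketch below assumes scikit-learn is installed; the sample size, the choice of a nearest-neighbor classifier, and the feature counts are all arbitrary choices for illustration, and the exact numbers will vary with the random seed:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small fixed training set: 8 informative features, the rest pure noise.
# With shuffle=False the informative columns come first.
X, y = make_classification(n_samples=60, n_features=500, n_informative=8,
                           n_redundant=0, shuffle=False, random_state=0)

# Train on a growing prefix of the features: accuracy rises while we are
# adding informative features, then degrades as noise features pile up.
for k in [2, 4, 8, 20, 100, 500]:
    acc = cross_val_score(KNeighborsClassifier(), X[:, :k], y, cv=5).mean()
    print(f'{k:4d} features -> mean CV accuracy {acc:.2f}')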
