Avoiding the curse of dimensionality

In the previous sections, we showed that a classifier's performance degrades once the number of features exceeds an optimal point. In theory, with an infinite number of training samples, the curse of dimensionality does not arise; in practice, the optimal number of features depends entirely on the size of your data.

One approach that helps you avoid the harm of this curse is to derive a subset of M features from the original N features, where M << N. Each of the M features can be a combination of some of the original N features. Several algorithms can do this for you; they try to find useful, uncorrelated linear combinations of the original N features. A commonly used technique for this is principal component analysis (PCA). PCA tries to find a smaller number of features that capture the largest variance in the original data. You can find more insights and a full explanation of PCA at this interesting blog: http://www.visiondummy.com/2014/05/feature-extraction-using-pca/.

A useful and easy way to apply PCA over your original training features is by using the following code:

# assuming input_values (the feature matrix) is already loaded,
# e.g. as a numpy array or pandas DataFrame
import pandas as pd
from sklearn.decomposition import PCA

# minimum fraction of the variance that should be covered by the
# reduced number of components
variance_percentage = .99

# creating the PCA object; passing a float in (0, 1) as n_components
# tells scikit-learn to keep enough components to cover that variance
pca_object = PCA(n_components=variance_percentage)

# transforming the features (PCA is unsupervised, so no targets are needed)
input_values_transformed = pca_object.fit_transform(input_values)

# creating a DataFrame for the transformed variables from PCA
pca_df = pd.DataFrame(input_values_transformed)

print(pca_df.shape[1], "reduced components describe",
      variance_percentage * 100, "% of the variance")

In the Titanic example, we tried building the classifier both with and without applying PCA to the original features. Because we ultimately used a random forest classifier, we found that applying PCA wasn't very helpful: random forest works well without any feature transformation, and even correlated features don't affect the model much.
