Feature selection

This recipe, along with the two that follow it, will be centered on automatic feature selection. I like to think of this as the feature analogue of parameter tuning. In the same way that we cross-validate to find an appropriately general parameter, we can find an appropriately general subset of features. This will involve several different methods.

The simplest idea is univariate selection; the other methods involve working with combinations of features.

An added benefit of feature selection is that it can ease the burden of data collection. Imagine that you have built a model on a very small subset of the data. If all goes well, you might want to scale up and apply the model to the entire dataset. If the model only needs a handful of features, you can ease the engineering effort of collecting data at that scale.

Getting ready

With univariate feature selection, scoring functions will come to the forefront again. This time, they will define a comparable measure by which we can eliminate features.

In this recipe, we'll fit a regression model with 10,000 features, but only 1,000 points. We'll walk through the various univariate feature selection methods:

>>> from sklearn import datasets
>>> X, y = datasets.make_regression(1000, 10000)
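As an aside, make_regression only makes a small number of these columns truly informative by default (n_informative is 10 unless you say otherwise), so the vast majority of the 10,000 features are pure noise. Here's a sketch of spelling that out explicitly, in case you want to experiment (X_alt and y_alt are just hypothetical names so we don't clobber the data above):

>>> X_alt, y_alt = datasets.make_regression(n_samples=1000, n_features=10000,
...                                         n_informative=10)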

Now that we have the data, we will compare which features get kept by the various methods. This is actually a very common situation when you're working with text analysis or some areas of bioinformatics.

How to do it...

First, we need to import the feature_selection module:

>>> from sklearn import feature_selection
>>> f, p = feature_selection.f_regression(X, y)

Here, f is the f score associated with a linear model fit using just one of the features, and p is the p value associated with that f value. We can then compare the features and, based on this comparison, cull some of them.

In statistics, the p value is the probability of observing a value at least as extreme as the computed test statistic, assuming the null hypothesis is true. Here, the f value is the test statistic:

>>> f[:5]
array([  1.06271357e-03, 2.91136869e+00, 1.01886922e+00,
         2.22483130e+00, 4.67624756e-01])
>>> p[:5]
array([ 0.97400066, 0.08826831, 0.31303204, 0.1361235, 0.49424067])
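As a quick sanity check, here's a sketch of how one of these f values and p values could be reproduced by hand from the correlation between a single feature and y. This mirrors what f_regression does for a centered feature, but treat it as an illustration rather than the library's exact code path:

>>> import numpy as np
>>> from scipy import stats

>>> r = np.corrcoef(X[:, 1], y)[0, 1]           # correlation of one feature with y
>>> F = r ** 2 / (1 - r ** 2) * (len(y) - 2)    # f statistic for a one-feature fit
>>> p_val = stats.f.sf(F, 1, len(y) - 2)        # upper tail of the F distribution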

As we can see, many of the p values are quite large; we would rather they be quite small. So, we can grab NumPy out of our toolbox and choose all the p values less than .05. These will be the features we'll use for the analysis:

>>> import numpy as np
>>> idx = np.arange(0, X.shape[1])
>>> features_to_keep = idx[p < .05]
>>> len(features_to_keep)

501

As you can see, we're actually keeping a relatively large number of features; with 10,000 features and a .05 cut-off, we'd expect roughly 500 of them to slip through by chance alone. Depending on the context of the model, we can tighten this p value, which will reduce the number of features kept.
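As a side note, scikit-learn can apply the same kind of p value cut-off for us. Here's a minimal sketch using SelectFpr, which keeps the features whose p values fall below a given false positive rate (assuming it's available in your version of the feature_selection module):

>>> select = feature_selection.SelectFpr(feature_selection.f_regression, alpha=.05)
>>> X_selected = select.fit_transform(X, y)   # keeps the columns with p < .05
>>> X_selected.shape[1]                       # should line up with len(features_to_keep)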

Another option is to use the VarianceThreshold object. We've learned a bit about it already, but it's important to understand that our ability to fit models largely depends on the variance contributed by the features. If a feature has no variance, it cannot describe the variation in the dependent variable. A nice property, as per the documentation, is that because it does not use the outcome variable, it can be used in unsupervised cases.

We will need to set the threshold below which features are eliminated. In order to do that, we just take the median of the feature variances and supply it:

>>> var_threshold = feature_selection.VarianceThreshold(np.median(np.var(X, axis=0)))

>>> var_threshold.fit_transform(X).shape

(1000, 4835)

As we can see, we eliminated roughly half the features, more or less what we would expect.
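If we want to know which columns survived rather than just how many, the fitted selector can tell us; here's a quick sketch using its get_support method:

>>> kept_idx = var_threshold.get_support(indices=True)   # column indices above the threshold
>>> kept_idx[:5]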

How it works...

In general, all these methods work by fitting a basic model with a single feature at a time and scoring it. Depending on whether we have a classification problem or a regression problem, we can use the appropriate scoring function.
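For example, for a classification problem the analogous scoring function is f_classif, which runs an ANOVA F-test on each feature. Here's a minimal sketch on some made-up classification data (the dataset here is purely illustrative):

>>> X_clf, y_clf = datasets.make_classification(1000, 50)    # 1,000 points, 50 features
>>> f_clf, p_clf = feature_selection.f_classif(X_clf, y_clf)
>>> p_clf[:5]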

Let's look at a smaller problem and visualize how feature selection will eliminate certain features. We'll use the same scoring function from the first example, but just 20 features:

>>> X, y = datasets.make_regression(10000, 20)
>>> f, p = feature_selection.f_regression(X, y)

Now, let's plot the p values of the features; this shows which features will be eliminated and which will be kept:

>>> from matplotlib import pyplot as plt

>>> fig, ax = plt.subplots(figsize=(7, 5))   # use fig, not f, so we don't overwrite the f scores
>>> ax.bar(np.arange(20), p, color='k')
>>> ax.set_title("Feature p values")

The output will be as follows:

[A bar chart titled "Feature p values", showing the p value for each of the 20 features]

As we can see, many of the features won't be kept, but several will be.
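If you want the plot to make the cut-off explicit, one option is to color the bars by whether they clear the .05 line; here's a small sketch (the color choices are arbitrary):

>>> colors = ['k' if val < .05 else '0.7' for val in p]   # dark bars would be kept
>>> fig, ax = plt.subplots(figsize=(7, 5))
>>> ax.bar(np.arange(20), p, color=colors)
>>> ax.axhline(.05, color='r', linestyle='--')            # the p = .05 cut-off
>>> ax.set_title("Feature p values")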
