Feature selection

We will often have a large number of features to choose from, but we wish to select only a small subset. There are many possible reasons for this:

  • Reducing complexity: Many data mining algorithms need more time and resources as the number of features increases. Reducing the number of features is a great way to make an algorithm run faster or with fewer resources.
  • Reducing noise: Adding extra features doesn't always lead to better performance. Extra features may confuse the algorithm, leading it to find correlations and patterns that don't have meaning (this is common in smaller datasets). Choosing only the appropriate features is a good way to reduce the chance of random correlations that have no real meaning.
  • Creating readable models: While many data mining algorithms will happily compute an answer for models with thousands of features, the results may be difficult to interpret for a human. In these cases, it may be worth using fewer features and creating a model that a human can understand.

Some classification algorithms are robust to issues such as these. Even so, getting the data right and choosing features that effectively describe the dataset you are modeling can still help those algorithms.

There are some basic tests we can perform, such as ensuring that the features are at least different. If a feature's values are all the same, it can't give us extra information for our data mining.

The VarianceThreshold transformer in scikit-learn, for instance, will remove any feature that doesn't have at least a minimum level of variance in the values. To show how this works, we first create a simple matrix using NumPy:

import numpy as np
X = np.arange(30).reshape((10, 3))

The result is the numbers zero to 29, in three columns and 10 rows. This represents a synthetic dataset with 10 samples and three features:

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23],
       [24, 25, 26],
       [27, 28, 29]])

Then, we set the entire second column/feature to the value 1:

X[:,1] = 1

The result has lots of variance in the first and third columns, but no variance in the second column:

array([[ 0,  1,  2],
       [ 3,  1,  5],
       [ 6,  1,  8],
       [ 9,  1, 11],
       [12,  1, 14],
       [15,  1, 17],
       [18,  1, 20],
       [21,  1, 23],
       [24,  1, 26],
       [27,  1, 29]])

We can now create a VarianceThreshold transformer and apply it to our dataset:

from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold()
Xt = vt.fit_transform(X)

Now, the result Xt does not have the second column:

array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11],
       [12, 14],
       [15, 17],
       [18, 20],
       [21, 23],
       [24, 26],
       [27, 29]])

We can observe the variances for each column by printing the vt.variances_ attribute:

print(vt.variances_)

The result shows that while the first and third columns contain at least some information, the second column has no variance:

array([ 74.25,   0.  ,  74.25])

A simple and obvious test like this is always good to run when seeing data for the first time. Features with no variance do not add any value to a data mining application; however, they can slow down the performance of the algorithm.
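
As a cross-check on the walkthrough above, the following minimal sketch recomputes the per-column variance directly with NumPy and asks the transformer which columns it kept. The `get_support` method belongs to scikit-learn's feature selection transformers; the names `manual_variances` and `kept` are illustrative only. The values reported in `vt.variances_` are not guaranteed to match a hand computation in every scikit-learn release, so the NumPy result is printed for reference.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Rebuild the example matrix: 10 samples, 3 features, second feature constant
X = np.arange(30).reshape((10, 3))
X[:, 1] = 1

# The default threshold of 0.0 removes only features that never change
vt = VarianceThreshold()
Xt = vt.fit_transform(X)

# Population variance of each column, computed independently of scikit-learn
manual_variances = X.var(axis=0)

# Boolean mask over the original columns: True where the feature was kept
kept = vt.get_support()

print(manual_variances)
print(kept)
```

The mask is handy on real datasets, where it can be used to look up the names of the columns that survived rather than just their count.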

Selecting the best individual features

If we have a number of features, finding the best subset is a difficult task. It amounts to solving the data mining problem itself many times over. As we saw in Chapter 4, Recommending Movies Using Affinity Analysis, the cost of subset-based tasks grows exponentially as the number of features increases. This exponential growth in the time needed also applies to finding the best subset of features.
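
To make the exponential growth concrete, here is a small sketch that enumerates every non-empty subset of five features. The feature names are borrowed from the Adult dataset used later in this section and serve only as labels; nothing here touches the data itself.

```python
from itertools import combinations

features = ["Age", "Education-Num", "Capital-gain", "Capital-loss", "Hours-per-week"]

# Enumerate every non-empty subset of the five features
subsets = [
    subset
    for size in range(1, len(features) + 1)
    for subset in combinations(features, size)
]
print(len(subsets))  # 2**5 - 1 = 31 candidate subsets

# The number of candidates roughly doubles with each extra feature
for n in (10, 20, 40):
    print(n, 2 ** n - 1)
```

With only five features an exhaustive search is feasible, but each candidate subset would still need its own model to be trained and evaluated.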

A workaround to this problem is not to look for a subset that works well together, but rather to find the best individual features. This univariate feature selection gives us a score based on how well a feature performs by itself. This is usually done for classification tasks, and we generally measure some type of correlation between a variable and the target class.

The scikit-learn package has a number of transformers for performing univariate feature selection. They include SelectKBest, which returns the k best performing features, and SelectPercentile, which returns the top r% of features. In both cases, there are a number of methods of computing the quality of a feature.
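
The difference between the two transformers can be sketched on a small stand-in dataset. This example assumes the Iris data bundled with scikit-learn (four non-negative features), not the Adult dataset used in this chapter; `X_demo` and `y_demo` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2

# Stand-in data: 150 samples, 4 non-negative features
X_demo, y_demo = load_iris(return_X_y=True)

# Keep a fixed number of features
k_best = SelectKBest(score_func=chi2, k=2)
X_k = k_best.fit_transform(X_demo, y_demo)

# Keep a fixed percentage of features (here, the top half)
top_half = SelectPercentile(score_func=chi2, percentile=50)
X_p = top_half.fit_transform(X_demo, y_demo)

print(X_k.shape, X_p.shape)
```

A fixed k is convenient when the model needs a known input width, while a percentile scales naturally when the number of candidate features varies between datasets.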

There are many different methods to compute how effectively a single feature correlates with a class value. A commonly used method is the chi-squared (χ2) test. Other methods include mutual information and entropy.
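
As one illustration of an alternative scoring method, scikit-learn provides `mutual_info_classif`, which can be passed to the same transformers. The sketch below again assumes the bundled Iris data as a stand-in; the fixed `random_state` is a choice made here for repeatability, because the estimate involves some randomness.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X_demo, y_demo = load_iris(return_X_y=True)

# Fix the seed so the mutual information estimate is repeatable
score_func = partial(mutual_info_classif, random_state=14)

selector = SelectKBest(score_func=score_func, k=2)
X_mi = selector.fit_transform(X_demo, y_demo)

print(selector.scores_)  # one mutual information estimate per feature
print(X_mi.shape)
```

Unlike chi2, this scoring function returns only scores and no p-values, which the transformer accepts.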

We can observe single-feature tests in action using our Adult dataset. First, we extract a dataset and class values from our pandas DataFrame. We get a selection of the features:

X = adult[["Age", "Education-Num", "Capital-gain", "Capital-loss", "Hours-per-week"]].values

We will also create a target class array by testing whether the Earnings-Raw value is above $50,000 or not. If it is, the class will be True. Otherwise, it will be False. Let's look at the code:

y = (adult["Earnings-Raw"] == ' >50K').values

Next, we create our transformer using the chi2 function and a SelectKBest transformer:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
transformer = SelectKBest(score_func=chi2, k=3)

Running fit_transform will call fit and then transform with the same dataset. The result is a new dataset containing only the best three features. Let's look at the code:

Xt_chi2 = transformer.fit_transform(X, y)

The resulting matrix now contains only three features. We can also get the scores for each column, allowing us to find out which features were used. Let's look at the code:

print(transformer.scores_)

The printed results give us these scores:

[  8.60061182e+03   2.40142178e+03   8.21924671e+07   1.37214589e+06

The highest values are for the first, third, and fourth columns, which correspond to the Age, Capital-gain, and Capital-loss features. Based on a univariate feature selection, these are the best features to choose.
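
Reading column positions off a printed array gets error-prone as the number of features grows. A minimal sketch of pairing scores with names is shown below. It uses a small synthetic dataset with hypothetical feature names (`informative`, `noise_a`, `noise_b`) rather than the Adult data, so the scores it prints are unrelated to the ones above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(14)
y_demo = rng.randint(0, 2, size=200)

# Hypothetical features: one tracks the class, two are pure noise
feature_names = ["informative", "noise_a", "noise_b"]
X_demo = np.column_stack([
    y_demo * 10 + rng.randint(0, 3, size=200),
    rng.randint(0, 3, size=200),
    rng.randint(0, 3, size=200),
])

selector = SelectKBest(score_func=chi2, k=1)
selector.fit(X_demo, y_demo)

# Pair each feature name with its score, highest first
ranked = sorted(zip(feature_names, selector.scores_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(name, score)

# get_support() marks which original columns survive the selection
selected = [name for name, keep
            in zip(feature_names, selector.get_support()) if keep]
print(selected)
```

The same pattern applies to the Adult example by substituting the list of column names used to build X.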


If you'd like to find out more about the features in the Adult dataset, take a look at the adult.names file that comes with the dataset and the academic paper it references.

We could also use other correlation measures, such as the Pearson correlation coefficient. This is implemented in SciPy, a library used for scientific computing (scikit-learn uses it as a base).


If scikit-learn is working on your computer, so is SciPy. You do not need to install anything further to get this sample working.

First, we import the pearsonr function from SciPy:

from scipy.stats import pearsonr

The preceding function almost fits the interface needed for scikit-learn's univariate transformers. A scoring function needs to accept two arrays (x and y in our example) as parameters and return two arrays: the scores for each feature and the corresponding p-values. The chi2 function we used earlier already follows this interface, which allowed us to pass it directly to SelectKBest.

The pearsonr function in SciPy accepts two arrays; however, the X array it accepts must be one-dimensional. We will write a wrapper function that allows us to use it with multivariate arrays like the one we have. Let's look at the code:

def multivariate_pearsonr(X, y):

We create our scores and pvalues arrays, and then iterate over each column of the dataset:

    scores, pvalues = [], []
    for column in range(X.shape[1]):

We compute the Pearson correlation for this column only and then record both the score and the p-value:

        cur_score, cur_p = pearsonr(X[:,column], y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)


The Pearson value can be between -1 and 1. A value of 1 implies a perfect correlation between two variables, while a value of -1 implies a perfect negative correlation, that is, high values in one variable go with low values in the other and vice versa. Such features are really useful to have, but they would be discarded if we ranked by the signed value, because the transformer keeps the features with the highest scores. For this reason, we store the absolute value in the scores array, rather than the original signed value.

Finally, we return the scores and p-values in a tuple:

    return (np.array(scores), np.array(pvalues))
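
Because the wrapper is presented in pieces, it may help to see it assembled in one place. The sketch below follows the steps described above, including storing the absolute value of each correlation. The small check at the end uses made-up data (`X_demo`, `y_demo`) in which the first column moves exactly opposite to the target.

```python
import numpy as np
from scipy.stats import pearsonr

def multivariate_pearsonr(X, y):
    """Score each column of X by its absolute Pearson correlation with y."""
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:, column], y)
        # Keep the magnitude so strong negative correlations rank highly too
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))

# Made-up check: column 0 is perfectly negatively correlated with y,
# column 1 is random noise
rng = np.random.RandomState(14)
y_demo = np.array([0.0, 1.0] * 10)
X_demo = np.column_stack([1.0 - y_demo, rng.rand(20)])

demo_scores, demo_pvalues = multivariate_pearsonr(X_demo, y_demo)
print(demo_scores)
```

Without the call to abs, the first column would receive a score near -1 and rank last, even though it predicts the target perfectly.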

Now, we can use the transformer class as before to rank the features using the Pearson correlation coefficient:

transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_pearson = transformer.fit_transform(X, y)

This returns a different set of features! The features chosen this way are the first, second, and fifth columns: Age, Education-Num, and Hours-per-week. This shows that there is no definitive answer to what the best features are; it depends on the metric.

We can see which feature set is better by running them through a classifier. Keep in mind that the results only indicate which subset is better for a particular classifier and/or feature combination—there is rarely a case in data mining where one method is strictly better than another in all cases! Let's look at the code:

from sklearn.tree import DecisionTreeClassifier
# cross_val_score lives in sklearn.model_selection in current scikit-learn releases
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=14)
scores_chi2 = cross_val_score(clf, Xt_chi2, y, scoring='accuracy')
scores_pearson = cross_val_score(clf, Xt_pearson, y, scoring='accuracy')

The chi2 average here is 0.83, while the Pearson score is lower at 0.77. For this combination, chi2 returns better results!
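
The averages quoted above are the mean of the per-fold scores that cross_val_score returns. The sketch below shows that computation on the bundled Iris data as a stand-in, so its numbers will not match the Adult results. It also makes one design choice that differs from the walkthrough: the selector and classifier are wrapped in a scikit-learn Pipeline so that features are selected from each training fold only, which is one way to avoid letting the test folds influence the selection.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Selection and classification combined into one estimator, so each fold
# selects features using its own training split
pipeline = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=2)),
    ("classify", DecisionTreeClassifier(random_state=14)),
])

fold_scores = cross_val_score(pipeline, X_demo, y_demo,
                              scoring="accuracy", cv=5)
mean_accuracy = np.mean(fold_scores)
print("Mean accuracy: {:.2f}".format(mean_accuracy))
```

Swapping the score_func in the "select" step is enough to compare scoring methods under identical cross-validation conditions.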

It is worth remembering the goal of this data mining activity: predicting wealth. Using a combination of good features and feature selection, we can achieve 83 percent accuracy using just three features of a person!

