We will often have a large number of features to choose from, but we wish to select only a small subset. There are many possible reasons for this:
Some classification algorithms can handle data with issues such as these. Getting the data right and getting the features to effectively describe the dataset you are modeling can still assist algorithms.
There are some basic tests we can perform, such as ensuring that the features are at least different. If a feature's values are all the same, it can't give us extra information to perform our data mining.
The VarianceThreshold transformer in scikit-learn, for instance, will remove any feature that doesn't have at least a minimum level of variance in the values. To show how this works, we first create a simple matrix using NumPy:
import numpy as np
X = np.arange(30).reshape((10, 3))
The result is the numbers zero to 29, in three columns and 10 rows. This represents a synthetic dataset with 10 samples and three features:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23],
       [24, 25, 26],
       [27, 28, 29]])
Then, we set the entire second column/feature to the value 1:
X[:,1] = 1
The result has lots of variance in the first and third columns, but no variance in the second column:
array([[ 0,  1,  2],
       [ 3,  1,  5],
       [ 6,  1,  8],
       [ 9,  1, 11],
       [12,  1, 14],
       [15,  1, 17],
       [18,  1, 20],
       [21,  1, 23],
       [24,  1, 26],
       [27,  1, 29]])
We can now create a VarianceThreshold transformer and apply it to our dataset:
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold()
Xt = vt.fit_transform(X)
Now, the result Xt does not have the second column:
array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11],
       [12, 14],
       [15, 17],
       [18, 20],
       [21, 23],
       [24, 26],
       [27, 29]])
We can observe the variances for each column by printing the vt.variances_ attribute:
print(vt.variances_)
The result shows that while the first and third columns contain at least some information, the second column has no variance:
array([ 74.25, 0. , 74.25])
A simple and obvious test like this is always good to run when seeing data for the first time. Features with no variance do not add any value to a data mining application; however, they can slow down the performance of the algorithm.
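To check which columns survive the test, rather than only seeing the reduced matrix, the transformer's get_support method can be queried. The following is a minimal sketch that rebuilds the same synthetic matrix so that it runs on its own; the variable names kept_mask and kept_indices are illustrative and not part of the original example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Same synthetic data as above: 10 samples, 3 features, middle column constant
X = np.arange(30).reshape((10, 3))
X[:, 1] = 1

# The default threshold of 0.0 removes only features with no variance
vt = VarianceThreshold()
Xt = vt.fit_transform(X)

# get_support() returns a boolean mask over the original columns;
# indices=True returns the positions of the kept columns instead
kept_mask = vt.get_support()
kept_indices = vt.get_support(indices=True)

print(kept_mask)
print(kept_indices)
print(Xt.shape)
```

The mask is handy when the columns have names, because it lets us report which features were dropped instead of only how many.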
If we have a number of features, finding the best subset is a difficult task. It relates to solving the data mining problem itself, multiple times. As we saw in Chapter 4, Recommending Movies Using Affinity Analysis, the number of subset-based tasks increases exponentially as the number of features increases. This exponential growth in time needed is also true for finding the best subset of features.
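To make this growth concrete, the sketch below counts the non-empty subsets of a feature set by brute-force enumeration and compares the count with the closed form 2^n - 1. The helper name count_feature_subsets is hypothetical and exists only for this illustration.

```python
from itertools import combinations


def count_feature_subsets(n_features):
    """Count non-empty feature subsets by enumerating every one of them.

    Only feasible for small n_features, which is exactly the point.
    """
    return sum(1
               for k in range(1, n_features + 1)
               for _ in combinations(range(n_features), k))


for n in (3, 5, 10):
    # The enumerated count matches the closed form 2**n - 1
    print(n, count_feature_subsets(n), 2 ** n - 1)
```

Each added feature doubles the number of candidate subsets, and every candidate would need its own model to be trained and evaluated.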
A workaround to this problem is not to look for a subset that works well together, but rather to just find the best individual features. This univariate feature selection gives us a score based on how well a feature performs by itself. This is usually done for classification tasks, and we generally measure some type of correlation between a variable and the target class.
The scikit-learn package has a number of transformers for performing univariate feature selection. They include SelectKBest, which returns the k best performing features, and SelectPercentile, which returns the top r% of features. In both cases, there are a number of methods of computing the quality of a feature.
There are many different methods to compute how effectively a single feature correlates with a class value. A commonly used method is the chi-squared (χ²) test. Other methods include mutual information and entropy.
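The difference between the two transformers, and between two scoring methods, can be sketched on a small synthetic dataset. Everything in this example is an assumption made for illustration: the data is randomly generated, with one feature built to depend on the class and three that are pure noise, and the names X_demo, y_demo, k_best, and top_half do not come from the main example.

```python
import numpy as np
from sklearn.feature_selection import (SelectKBest, SelectPercentile,
                                       chi2, mutual_info_classif)

# Synthetic, hypothetical data: 200 samples, 4 non-negative integer features
rng = np.random.RandomState(14)
y_demo = rng.randint(0, 2, size=200)
informative = y_demo * 5 + rng.randint(0, 3, size=200)  # depends on the class
noise = rng.randint(0, 10, size=(200, 3))               # independent of the class
X_demo = np.column_stack([informative, noise])

# Keep the single best feature according to the chi-squared score
k_best = SelectKBest(score_func=chi2, k=1)
X_k = k_best.fit_transform(X_demo, y_demo)

# Keep the top 50 percent of features (2 of the 4 here)
top_half = SelectPercentile(score_func=chi2, percentile=50)
X_p = top_half.fit_transform(X_demo, y_demo)

# Mutual information as an alternative score; features treated as discrete
mi_scores = mutual_info_classif(X_demo, y_demo, discrete_features=True,
                                random_state=0)

print(X_k.shape, X_p.shape)
print(k_best.get_support(indices=True))
print(np.argmax(mi_scores))
```

Note that chi2 expects non-negative feature values, which is why the synthetic features are built from non-negative integers.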
We can observe single-feature tests in action using our Adult dataset. First, we extract a dataset and class values from our pandas DataFrame. We get a selection of the features:
X = adult[["Age", "Education-Num", "Capital-gain", "Capital-loss", "Hours-per-week"]].values
We will also create a target class array by testing whether the Earnings-Raw value is above $50,000 or not. If it is, the class will be True. Otherwise, it will be False. Let's look at the code:
y = (adult["Earnings-Raw"] == ' >50K').values
Next, we create our transformer using the chi2 function and a SelectKBest transformer:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
transformer = SelectKBest(score_func=chi2, k=3)
Running fit_transform will call fit and then transform with the same dataset. The result is a new dataset containing only the best three features. Let's look at the code:
Xt_chi2 = transformer.fit_transform(X, y)
The resulting matrix now only contains three features. We can also get the scores for each column, allowing us to find out which features were used. Let's look at the code:
print(transformer.scores_)
The printed results give us these scores:
[ 8.60061182e+03 2.40142178e+03 8.21924671e+07 1.37214589e+06 6.47640900e+03]
The highest values are for the first, third, and fourth columns, which correspond to the Age, Capital-gain, and Capital-loss features. Based on univariate feature selection, these are the best features to choose.
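Reading the winning columns off a raw array of scores is error-prone, so it can help to pair each score with its feature name. The sketch below does this with randomly generated stand-in values, not the Adult data, so its scores will not match the ones printed above. The names X_demo, y_demo, ranked, and selected are illustrative, and the class is deliberately made to depend on the third column.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

feature_names = ["Age", "Education-Num", "Capital-gain",
                 "Capital-loss", "Hours-per-week"]

# Random stand-in values, NOT the Adult data: 50 samples, 5 features
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 100, size=(50, 5))
y_demo = X_demo[:, 2] > 50  # the class depends on the third column only

transformer = SelectKBest(score_func=chi2, k=3)
transformer.fit(X_demo, y_demo)

# Pair each name with its score and sort from best to worst
ranked = sorted(zip(feature_names, transformer.scores_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("{}: {:.1f}".format(name, score))

# get_support() gives a boolean mask over the original columns
selected = [name
            for name, keep in zip(feature_names, transformer.get_support())
            if keep]
print(selected)
```

The same pattern should apply to the real dataset by replacing X_demo and y_demo with the X and y arrays extracted earlier.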
We could also use other correlation measures, such as Pearson's correlation coefficient. This is implemented in SciPy, a library used for scientific computing (scikit-learn uses it as a base).
First, we import the pearsonr function from SciPy:
from scipy.stats import pearsonr
The preceding function almost fits the interface needed to be used in scikit-learn's univariate transformers. The function needs to accept two arrays (x and y in our example) as parameters and return two arrays: the scores for each feature and the corresponding p-values. The chi2 function we used earlier already follows the required interface, which allowed us to pass it directly to SelectKBest.
The pearsonr function in SciPy accepts two arrays; however, the X array it accepts must be one-dimensional. We will write a wrapper function that allows us to use it for multivariate arrays like the one we have. Let's look at the code:
def multivariate_pearsonr(X, y):
We create our scores and pvalues arrays, and then iterate over each column of the dataset:
    scores, pvalues = [], []
    for column in range(X.shape[1]):
We compute the Pearson correlation for this column only and then record both the score and the p-value:
        cur_score, cur_p = pearsonr(X[:,column], y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
The Pearson value can be between -1 and 1. A value of 1 implies a perfect correlation between two variables, while a value of -1 implies a perfect negative correlation, that is, high values in one variable give low values in the other and vice versa. Such features are really useful to have, but because SelectKBest keeps the highest scores, a strongly negative value would rank last and the feature would be discarded. For this reason, we have stored the absolute value in the scores array, rather than the original signed value.
Finally, we return the scores and p-values in a tuple:
    return (np.array(scores), np.array(pvalues))
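Because the function is presented in pieces, it may help to see it assembled in one place. The block below joins the fragments above without changing their logic and then tries the function on a small synthetic array. The demo data (X_demo, y_demo) is an assumption added for illustration: one column equal to the target, one equal to its negation, and one of random noise.

```python
import numpy as np
from scipy.stats import pearsonr


def multivariate_pearsonr(X, y):
    """Score every column of X against y using the absolute Pearson value."""
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        # pearsonr works on one-dimensional arrays, so pass a single column
        cur_score, cur_p = pearsonr(X[:, column], y)
        # Keep the magnitude so strong negative correlations also rank highly
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))


# Synthetic check: perfectly correlated, perfectly anti-correlated, and noise
rng = np.random.RandomState(14)
y_demo = rng.rand(100)
X_demo = np.column_stack([y_demo, -y_demo, rng.rand(100)])

scores, pvalues = multivariate_pearsonr(X_demo, y_demo)
print(scores)
```

The first two scores should both be very close to 1, which shows the effect of taking the absolute value: the negated column ranks as highly as the identical one.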
Now, we can use the transformer class as before to rank the features using the Pearson correlation coefficient:
transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_pearson = transformer.fit_transform(X, y)
print(transformer.scores_)
This returns a different set of features! The features chosen this way are the first, second, and fifth columns: the Age, Education-Num, and Hours-per-week features. This shows that there is no definitive answer to what the best features are; it depends on the metric.
We can see which feature set is better by running them through a classifier. Keep in mind that the results only indicate which subset is better for a particular classifier and/or feature combination—there is rarely a case in data mining where one method is strictly better than another in all cases! Let's look at the code:
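from sklearn.tree import DecisionTreeClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=14)
# Cross-validated accuracy for each three-feature subset
scores_chi2 = cross_val_score(clf, Xt_chi2, y, scoring='accuracy')
scores_pearson = cross_val_score(clf, Xt_pearson, y, scoring='accuracy')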
The chi2 average here is 0.83, while the Pearson score is lower at 0.77. For this combination, chi2 returns better results!
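The averages quoted here are means over the per-fold accuracies that cross_val_score returns. As a hedged illustration of that last step, the sketch below compares two feature subsets on synthetic stand-in data, not the Adult dataset, so its numbers are unrelated to the 0.83 and 0.77 reported above. The names X_a, X_b, and y_demo are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Adult dataset): 300 samples
rng = np.random.RandomState(14)
y_demo = rng.randint(0, 2, size=300)
# Subset A: three features that shift with the class
X_a = y_demo[:, np.newaxis] + rng.normal(scale=0.3, size=(300, 3))
# Subset B: three features of pure noise
X_b = rng.normal(size=(300, 3))

clf = DecisionTreeClassifier(random_state=14)
scores_a = cross_val_score(clf, X_a, y_demo, scoring='accuracy')
scores_b = cross_val_score(clf, X_b, y_demo, scoring='accuracy')

# cross_val_score returns one accuracy per fold; the mean summarizes them
print("Subset A accuracy: {:.3f}".format(np.mean(scores_a)))
print("Subset B accuracy: {:.3f}".format(np.mean(scores_b)))
```

With the real data, the same np.mean call on scores_chi2 and scores_pearson would give the two averages being compared.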
It is worth remembering the goal of this data mining activity: predicting wealth. Using a combination of good features and feature selection, we can achieve 83 percent accuracy using just three features of a person!