In previous sections, you learned how to use L1 regularization to zero out irrelevant features via logistic regression, and how to use the SBS algorithm for feature selection in combination with a KNN classifier. Another useful approach to select relevant features from a dataset is to use a random forest, an ensemble technique that we introduced in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn. Using a random forest, we can measure the feature importance as the averaged impurity decrease computed from all decision trees in the forest, without making any assumptions about whether our data is linearly separable or not. Conveniently, the random forest implementation in scikit-learn already collects the feature importance values for us so that we can access them via the feature_importances_ attribute after fitting a RandomForestClassifier. By executing the following code, we will now train a forest of 500 trees on the Wine dataset and rank the 13 features by their respective importance measures. Remember from our discussion in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, that we don't need to use standardized or normalized features in tree-based models:
>>> # df_wine, X_train, and y_train are assumed to exist from the previous sections
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from sklearn.ensemble import RandomForestClassifier
>>> feat_labels = df_wine.columns[1:]
>>> forest = RandomForestClassifier(n_estimators=500,
...                                 random_state=1)
>>> forest.fit(X_train, y_train)
>>> importances = forest.feature_importances_
>>> indices = np.argsort(importances)[::-1]
>>> for f in range(X_train.shape[1]):
...     print("%2d) %-*s %f" % (f + 1, 30,
...                             feat_labels[indices[f]],
...                             importances[indices[f]]))
>>> plt.title('Feature Importance')
>>> plt.bar(range(X_train.shape[1]),
...         importances[indices],
...         align='center')
>>> plt.xticks(range(X_train.shape[1]),
...            feat_labels[indices], rotation=90)
>>> plt.xlim([-1, X_train.shape[1]])
>>> plt.tight_layout()
>>> plt.show()
 1) Proline                        0.185453
 2) Flavanoids                     0.174751
 3) Color intensity                0.143920
 4) OD280/OD315 of diluted wines   0.136162
 5) Alcohol                        0.118529
 6) Hue                            0.058739
 7) Total phenols                  0.050872
 8) Magnesium                      0.031357
 9) Malic acid                     0.025648
10) Proanthocyanins                0.025570
11) Alcalinity of ash              0.022366
12) Nonflavanoid phenols           0.013354
13) Ash                            0.013279
After executing the code, we obtain a plot that ranks the features in the Wine dataset by their relative importance. Note that the feature importance values are normalized so that they sum up to 1.0.
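To make the averaging and the normalization more concrete, the following minimal sketch recomputes the forest-level importances by hand from the individual trees. It is an illustration under a few assumptions rather than part of the original example: it loads scikit-learn's bundled copy of the Wine data via load_wine instead of df_wine, fits on all samples instead of a training split, and uses only 100 trees to keep it quick; the names per_tree and manual are ours.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Assumption: scikit-learn's bundled Wine data stands in for df_wine
X, y = load_wine(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X, y)

# Every fitted tree exposes its own impurity-based importances,
# already normalized to sum to 1 within that tree
per_tree = np.array([tree.feature_importances_
                     for tree in forest.estimators_])

# Average over all trees, then renormalize so the result sums to 1
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print("Matches forest.feature_importances_:",
      np.allclose(manual, forest.feature_importances_))
print("Sum of importances: %.6f" % forest.feature_importances_.sum())
```

If both checks succeed, the value reported for each feature is simply its mean importance across the trees, which is why the 13 values can be compared directly with each other.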
We can conclude that the proline and flavonoid levels, the color intensity, the OD280/OD315 ratio of diluted wines, and the alcohol concentration of wine are the most discriminative features in the dataset based on the average impurity decrease in the 500 decision trees. Interestingly, two of the top-ranked features in the plot are also in the three-feature subset selection from the SBS algorithm that we implemented in the previous section (alcohol concentration and OD280/OD315 of diluted wines). However, as far as interpretability is concerned, the random forest technique comes with an important gotcha that is worth mentioning. If two or more features are highly correlated, one feature may be ranked very highly while the information of the other feature(s) may not be fully captured. On the other hand, we don't need to be concerned about this problem if we are merely interested in the predictive performance of a model rather than the interpretation of feature importance values.
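The effect of correlated features can be explored with a small experiment. The sketch below is a deliberately extreme, hypothetical setup that is not part of the original example: it appends an exact copy of the proline column to the data, so that two features carry identical information, and compares the importance of proline before and after. As before, load_wine is assumed as a stand-in for df_wine, and the names base, dup, and X_dup are ours.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Assumption: scikit-learn's bundled Wine data stands in for df_wine
wine = load_wine()
X, y = wine.data, wine.target
proline = list(wine.feature_names).index("proline")

# Reference forest on the original 13 features
base = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)

# Hypothetical perfectly correlated feature: an exact copy of proline
X_dup = np.hstack([X, X[:, [proline]]])
dup = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_dup, y)

print("proline alone:        %f" % base.feature_importances_[proline])
print("proline next to copy: %f" % dup.feature_importances_[proline])
print("the copy itself:      %f" % dup.feature_importances_[-1])
```

Because the trees can split on either column interchangeably, we would expect the importance that proline had on its own to be shared between the two columns, so each of them looks less important than the underlying information really is. Real datasets rarely contain exact duplicates, but strongly correlated features can dilute each other's ranking in a similar way.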
To conclude this section about feature importance values and random forests, it is worth mentioning that scikit-learn also implements a SelectFromModel object that selects features based on a user-specified threshold after model fitting. This is useful if we want to use the RandomForestClassifier as a feature selector and intermediate step in a scikit-learn Pipeline object, which allows us to connect different preprocessing steps with an estimator, as we will see in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning. For example, we could set the threshold to 0.1 to reduce the dataset to the five most important features using the following code:
>>> from sklearn.feature_selection import SelectFromModel
>>> sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
>>> X_selected = sfm.transform(X_train)
>>> print('Number of features that meet this threshold criterion:',
...       X_selected.shape[1])
Number of features that meet this threshold criterion: 5
>>> for f in range(X_selected.shape[1]):
...     print("%2d) %-*s %f" % (f + 1, 30,
...                             feat_labels[indices[f]],
...                             importances[indices[f]]))
 1) Proline                        0.185453
 2) Flavanoids                     0.174751
 3) Color intensity                0.143920
 4) OD280/OD315 of diluted wines   0.136162
 5) Alcohol                        0.118529
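The use of SelectFromModel as an intermediate step in a Pipeline, as mentioned above, could look roughly like the following sketch. Several choices here are our own assumptions for illustration: load_wine replaces df_wine, the train/test split parameters are arbitrary, and the step names select, scale, and knn as well as the KNN classifier at the end are placeholders for whatever estimator we want to train on the reduced feature set. Note that the selector is not prefit in this setting; the pipeline fits the forest itself on the training data.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Wine data stands in for df_wine
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3,
    random_state=0, stratify=wine.target)

pipe = Pipeline([
    # Step 1: keep features whose forest importance is at least 0.1
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=500, random_state=1),
        threshold=0.1)),
    # Step 2: KNN is distance-based, so the kept features are standardized
    ("scale", StandardScaler()),
    # Step 3: the final estimator (an arbitrary choice for this sketch)
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)

# get_support() returns a boolean mask over the original columns
mask = pipe.named_steps["select"].get_support()
selected = [name for name, keep in zip(wine.feature_names, mask) if keep]
print("Selected features:", selected)
print("Test accuracy: %.3f" % pipe.score(X_test, y_test))
```

Reading the selected column names from get_support() avoids relying on a separately sorted index array, and wrapping the selector in a pipeline ensures that the feature selection is learned from the training data only.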