Feature selection

Depending on the machine learning algorithm that you are going to use, irrelevant and redundant features can contribute to a less interpretable model, longer training times and, most importantly, overfitting and poor generalization.

Overfitting is related to the ratio between the number of observations and the number of variables available in your dataset. When there are many variables compared to the observations, your learning algorithm has a greater chance of ending up in some local optimum or of fitting spurious noise arising from the correlation between variables.

Apart from dimensionality reduction, which requires you to transform the data, feature selection can be the solution to these problems. It simplifies high-dimensional structures by choosing the most predictive set of variables; that is, it picks the features that work well together, even if some of them are not particularly good predictors when taken individually.

The Scikit-learn package offers a wide range of feature selection methods:

  • Selection based on the variance
  • Univariate selection
  • Recursive elimination
  • Randomized logistic regression/stability selection
  • L1-based feature selection
  • Tree-based feature selection

Variance-based selection, univariate selection, and recursive elimination can be found in the feature_selection module. The others are byproducts of specific machine learning algorithms. Apart from tree-based selection (which will be mentioned in Chapter 4, Machine Learning), we are going to present all of the preceding methods and point out how they can help you improve your learning from the data.
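
As a quick orientation before the detailed discussion, the following minimal sketch shows how three of the selectors listed above are typically called in Scikit-learn. The Iris dataset, the 0.2 variance threshold, and the choice of keeping two features are arbitrary values used purely for illustration, not examples taken from this book's case studies:

from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Selection based on the variance: drop features whose variance is below 0.2
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Univariate selection: keep the 2 features with the highest ANOVA F-score
X_uni = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Recursive elimination: repeatedly drop the weakest feature until 2 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

print(X_var.shape, X_uni.shape, X_rfe.shape)

Each selector implements the usual fit/transform interface, so it can also be placed inside a Pipeline together with the final estimator.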
