
5. Feature Engineering


In machine learning, feature engineering is a blanket term covering both the statistical and the business-judgment aspects of modeling real-world problems. The term was coined to give due importance to the domain knowledge required to select sets of features for machine learning algorithms, which is one reason many machine learning professionals call it an informal process. In this chapter, we provide an easy-to-use guide to the key terms and methodology used in feature engineering. The chapter gives due weight to domain knowledge and to some common business limitations encountered while using machine learning algorithms to solve business problems.

The discussions will throw light on both aspects of feature engineering:

  • Domain knowledge and business limitations

  • Statistical principles

Before we lay out the learning objectives of this chapter, let's spend some time understanding how feature engineering differs from what we have learned so far in previous chapters. We will explain it with two questions:

  • What are my features and their properties?

  • How do my features interact with each other to fit a model?

In order to quantify meaningful relationships between the response variable and predictor variables, we need to know the individual properties of the features and how they interact with each other. Descriptive statistics and distribution of features provide us with insight into what they are and how they behave in our dataset. Our previous chapters have addressed this first question.

The next step in machine learning involves asking questions and choosing the right set of features (or variables), along with the criteria for choosing them. These questions cannot be answered by studying the individual properties of features alone; we need to understand how the features interact with each other and with the response variable. That is where we search for the answer to the second question and quantify the relationships to arrive at the set of features that is best for the machine learning algorithm.

Learning objectives:

  • Introduction to feature engineering

  • Feature ranking

  • Variable subset selection

  • Dimensionality reduction

The chapter works through some hands-on examples that apply general statistical methods to these concepts within the feature engineering space. The later part of the chapter discusses some examples showing how business-critical thinking helps with feature selection.

The illustrations in this chapter are based on loan default data. The data contains the loss on each loan; for defaulted loans, the loss is graded between 1 and 100. For the cases where the full loan was recovered, the loss is set to 0, which means there was no default on that loan. A loss of 60 means that only 40% of the loan was recovered. The data is set up to create a default prediction model.

The data's feature names are anonymized to keep the focus on the statistical quantification of relationships among features. There are some key terms associated with loan default in the financial services industry: Probability of Default (PD), Exposure at Default (EAD), and Loss Given Default (LGD). While the focus of this chapter is to show how the statistical methods work, you are encouraged to draw parallel analogies to your own business problems, in which case loan default is a good reference point.

5.1 Introduction to Feature Engineering

Feature engineering has become a core process in developing any data solution. The emergence of feature engineering as an integral part of the machine learning solution development is mainly driven by two factors:

  • Increase in a set of features/variables

  • Time and complexity of machine learning algorithms

With technological advances, it's now possible to collect a lot of data at a fraction of the cost. In many cases, to improve modeling output, we merge lots of data from third-party and external open sources into internal data. This creates huge sets of features for machine learning algorithms. All the features in our consideration set might not be important from a machine learning perspective and, even if they are, all of them might not be needed to attain a given level of confidence in model predictions.

The other aspect is time and complexity; machine learning algorithms are resource intensive, and run time can increase exponentially with each feature added to the model. A data scientist has to strike a balance between this complexity and the benefit to the final model accuracy.

To completely understand the feature engineering concepts, we have to decouple this terminology into two separate but supporting processes:

  • Feature selection (or variable selection)

  • Business/domain knowledge

The former is statistics-intensive and provides empirical evidence as to why a certain feature or set of features is important for the machine learning algorithm. It is based on quantifiable and comparable metrics, created either independently of the response variable or with respect to it. The latter applies business logic to make sure the features make sense and provide the right insights the business is looking for.

In many cases, business logic takes precedence over statistical results. This is not a hard and fast rule, but business insights are not always backed by sound statistical results, and when there is a conflict, business requirements take precedence over statistical inferences. For instance, suppose the unemployment rate is used for identifying loan defaults in a region. For a given set of data, it might happen that the unemployment rate is not significant at the 95% confidence level but is significant at the 90% confidence level. If the business believes that the unemployment rate is an important variable, then we might create an exception in the variable selection so that the unemployment rate is captured with relaxed statistical constraints.

Business/domain knowledge varies with industry and application. Business needs are evolving and are very difficult to capture in a time-bound manner. We will discuss an example from the financial services domain to explain how variable selection and domain knowledge come together in deciding which features to use in the model. The main focus of the chapter is on statistical aspects of feature engineering, which we discuss under the sections of variable selection and feature creation.

The main benefits that come out of a robust and structured variable selection are:

  • Improved predictive performance of the model

  • Faster and less complex machine learning process

  • Better understanding of underlying data relationships

  • Explainable and implementable machine learning models/solutions

The first three benefits are intuitive and can be related back to our prior discussion. Let's spend some time giving due importance to the fourth point. Business insights are generally driven by simple and explainable models; the more complicated a machine is, the harder it is to explain. Try to think about features as business action points. If the machine being built has features that cannot be explained in clear terms to the business, the business loses value because the model output doesn't map back to actionable points. That means the whole purpose of machine learning is lost.

Any model that you develop has to be deployed in a live environment for use by the end users. In a live environment, each added feature in the model means an added data feed into the live system, which in turn may mean accessing a whole new database. This creates a lot of IT system changes and dependencies within the system. The implementation and maintenance costs then have to be weighed against the explainability of the model and whether keeping so many variables is essential. If the same underlying behavior can be explained with fewer features, the implementation should use fewer features. Agility to compute and provide quick results often outweighs a better model with more features.

The feature selection methods are broadly divided into three groups—filter, wrapper, and embedded.

5.1.1 Filter Methods

Filter methods select variables regardless of the model. They rank the features by general measures such as their correlation with the variable to predict or their variance. The ranked list then supports a decision to keep or remove features based on their ranks. Filter methods are often univariate and consider each feature independently of the others. The scoring can be purely univariate or computed with regard to the dependent variable.

Some of the best-known filter techniques include the chi-square test, correlation coefficients, and information gain metrics. For example, we know that high variance in a feature normally reflects more information in it. In filter methods, we can filter out the features that have low variance and keep the ones with high variance for further analysis.
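As a minimal sketch of the variance-threshold idea, assuming a hypothetical data frame named numeric_features containing only standardized numeric columns and an illustrative cutoff of 0.1, the filter can be written in a few lines of base R:

#Minimal sketch of a variance-threshold filter; numeric_features is a
#hypothetical data frame of standardized numeric columns, and 0.1 is an
#illustrative cutoff, not a recommendation
feature_variance <- sapply(numeric_features, var, na.rm = TRUE)

#Keep only the features whose variance exceeds the threshold
selected <- names(feature_variance[feature_variance > 0.1])
filtered_features <- numeric_features[, selected, drop = FALSE]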

5.1.2 Wrapper Methods

Wrapper methods consider a set of features to find the best subset of features for a modeling problem. This method treats the features selection process as a search problem, where different combinations of features are tested against performance criteria and compared with other combinations. A predictive model is used to evaluate the different sets of features and an accuracy metric is used to score the set of features. The set of features with the highest accuracy measure is chosen for modeling.

The search process may use heuristics like forward selection, backward selection, and so on; be probabilistic, such as a random hill-climbing algorithm; or be methodical, like a best-first search or a full brute-force search. Another advanced example of a wrapper method is the recursive feature elimination algorithm. A simple example can be constructed around forward selection of a variable subset: the model starts with a single variable and then adds more variables, measuring how much improvement each new variable brings to the model. When the addition of a variable doesn't bring any improvement, we stop. This way, we can search the model subset space to find the best subset.

5.1.3 Embedded Methods

Embedded methods are improved versions of wrapper algorithms. They introduce a penalty factor into the evaluation criteria of the model to bias the model toward lower complexity, trying to balance complexity against accuracy. Regularization is the most common embedded method for variable subset selection, e.g., L1 and L2 regularization, ridge regression, etc. LASSO stands for least absolute shrinkage and selection operator; it will be discussed later in this chapter.

5.2 Understanding the Working Data

The data used in this chapter is credit risk data from a public competition. Credit risk modeling is one of the most involved modeling problems in the banking industry. The process of building a credit risk model is not only complicated in terms of data but also requires in-depth knowledge of business and market dynamics.

A credit risk is the risk of default on a debt that may arise from a borrower failing to make required payments.

A little more background on key terms from credit risk modeling will help you relate this data problem to other similar domain problems. We briefly introduce a few key concepts in credit risk modeling.

  • Willingness to pay and ability to pay: The credit risk model tries to quantify these two aspects of any borrower. Ability to pay can be quantified by studying the financial conditions of the borrower (variables like income, wealth, etc.), while the tough part is measuring willingness to pay, where we use variables that capture behavioral properties (like default history, fraudulent activities, etc.).

  • Probability of default (PD): PD is a measure that indicates how likely the borrower is going to default in the next period. The higher the value, the higher the chances of default. It is a measure having value between 0 and 1 (boundary inclusive). Banks want to lend money to borrowers having a low PD.

  • Loss Given Default (LGD): LGD is a measure of how much the lender is likely to lose if the borrower defaults in the next period. Generally, lenders hold some kind of collateral to limit the downside risk of default. In simple terms, this measure is the amount lent minus the value of the collateral, and it is usually expressed as a percentage. Borrowers having high LGDs are a risk.

  • Exposure at Default (EAD): EAD is the amount to which the bank/lender is exposed at the current point in time. This is the amount that the lender is likely to lose if the borrower defaults right now. It is one of the most closely watched metrics in any bank's credit risk division.

These terms will help you think through how we can draw different information from the same data by tweaking the way we do feature engineering. All these metrics can be predicted from the same loan default data, but the way we go about selecting features will differ.

5.2.1 Data Summary

A data summary of the input data provides vital information about the data. For this chapter, we need to understand some features of the data before we apply the different techniques used to select the feature set for modeling. The important aspects that we will be looking at are as follows:

  • Properties of dependent variable

  • Feature availability: continuous or categorical

  • Setting up data assumptions

5.2.2 Properties of Dependent Variable

In our dataset, loss is used as the dependent variable throughout this chapter. The modeling is to be done for credit loss. A loan is a type of credit, so we will use credit loss and loan loss interchangeably. The loss variable has values between 0 and 100. We will look at the loss variable's distribution in this chapter.

The following code loads the data and shows the dimension of the dataset created. The dimension is the number of records by the number of features.

#Input the data and store it in a data table
library(data.table)

data <- fread("Dataset/Loan Default Prediction.csv", header = T, verbose = FALSE, showProgress = TRUE)
 Read 105471 rows and 771 (of 771) columns from 0.476 GB file in 00:01:02
dim(data)
 [1] 105471    771

There are 105,471 records with 771 attributes. Out of 771, there is one dependent series and one primary key. We have 769 features to create a feature set for this credit loss model.

We know that the dependent variable is loss on a scale of 0 to 100. For analysis purposes, we will analyze the dependent variable as continuous and discrete. As a continuous variable, we will look at descriptive statistics and, as a discrete variable, we will look at the distribution.

#Summary of the data                
summary(data$loss)
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
   0.0000   0.0000   0.0000   0.7996   0.0000 100.0000

The highest loss that is recorded is 100, which is equivalent to saying that all the outstanding credit on the loan was lost. The mean is close to 0 and the first and third quartiles are 0. Certainly the loss cannot be dealt with as a continuous variable, as most of the values are concentrated toward 0. In other words, the number of cases with default is low.

hist(data$loss,
     main = "Histogram for Loss Distribution",
     xlab = "Loss",
     border = "blue",
     col = "red",
     las = 0,
     breaks = 100,
     prob = TRUE)

The distribution of loss in Figure 5-1 shows that loss is equal to zero for most of the distribution. We can see that using loss as a continuous variable is not feasible in this setting, so we will convert our dependent variable into a dichotomous variable, with 0 representing non-default and 1 a default. The problem is thus reduced to default prediction, and we now know what kind of machine learning algorithm we intend to use down the line. This prior information will help us choose the appropriate feature selection methods and metrics.

Figure 5-1. Distribution of loss (including no default)

Let's now see for the cases where there is default (i.e., loss not equal to zero), how the loss is distributed (recall the LGD measure).

#Subset the data into non-loss and loss cases (i.e., loss > 0)

subset_loss <- subset(data, loss != 0)

#Distribution of cases where there is some loss registered

hist(subset_loss$loss,
     main = "Histogram for Loss Distribution (Only Default cases)",
     xlab = "Loss",
     border = "blue",
     col = "red",
     las = 0,
     breaks = 100,
     prob = TRUE)


The distribution plot below excludes non-default cases; in other words, it uses only the cases where loss > 0.

In more than 90% of the cases, the loss is below 25%, hence the Loss Given Default (LGD) is low (see Figure 5-2); the lender can recover a large portion of the amount due. For further discussion around feature selection, we will create a dichotomous variable called default, which will be 0 if the loss is equal to 0 and 1 otherwise.

Figure 5-2. Distribution of loss (excluding no default)

default = 0: there is no default and hence no loss; default = 1: there is a default
#Create the default variable

data[, default := ifelse(data$loss == 0, 0, 1)]

#Distribution of defaults
table(data$default)

     0     1
 95688  9783

#Event rate is defined as the ratio of default cases to the total population

print(table(data$default)*100/nrow(data))

         0         1
 90.724465  9.275535

So we have converted our dependent variable into a dichotomous variable, and our feature selection problem will be geared toward finding the best set of features to model this default behavior in our data. The distribution table states that 9.3% of the cases in our dataset are defaults. This is sometimes called the event rate in model development data.

5.2.3 Feature Availability: Continuous or Categorical

The data has 769 features available to create a model for the credit loss. We have to identify how many of these features are continuous and how many are categorical. This will allow us to design the feature selection process appropriately, as many metrics are not directly comparable for ranking, e.g., correlation for continuous variables is measured differently than the association measure for categorical variables.

Tip

If you don't have any prior knowledge of a feature's valid values, you can treat variables with more than 30 levels as continuous and ones with 30 or fewer levels as categorical variables.

The following code snippet does three things to identify the type of treatment a variable needs to be given, i.e., continuous or categorical:

  • Remove the id, loss, and default columns from this analysis, as these are identifiers or the dependent variable.

  • Find the number of unique values in each feature; if it is less than or equal to 30, assign that feature to the categorical set.

  • If the number of unique values is greater than 30, assign the feature to the continuous set.

This idea works for our data; however, you have to be cautious about variables like ZIP codes (a nominal variable), states (the number of states can exceed 30 and the values are characters), and other features having character values.

continuous <- character()
categorical <- character()
#Write a loop to go over all features and find unique values
p <- 1
q <- 1
for (i in names(data))
{
  unique_levels = length(unique(data[, get(i)]))

  if(i %in% c("id", "loss", "default"))
  {
    next;
  }
  else
  {
    if (unique_levels <= 30 | is.character(data[, get(i)]))
    {
#     cat("The feature ", i, " is a categorical variable")
      categorical[p] <- i
      p = p + 1
#     Convert the categorical feature into a factor
      data[[i]] <- factor(data[[i]])
    }
    else
    {
#     cat("The feature ", i, " is a continuous variable")
      continuous[q] <- i
      q = q + 1
    }
  }
}


# subtract 1 as one is dependent variable = default
cat(" Total number of continuous variables in feature set ", length(continuous) -1)


 Total number of continuous variables in feature set  717
# subtract 2 as one is loss and one is id
cat(" Total number of categorical variable in feature set ", length(categorical) -2)


 Total number of categorical variable in feature set  49

These iterations have divided the features into categorical and continuous sets, containing 49 and 717 features, respectively. We will ignore the domain-specific meaning of these features, as our focus is on the statistical aspects of feature selection.

5.2.4 Setting Up Data Assumptions

To explain the different aspects of feature selection, we will be using some assumptions:

  • We do not have any prior knowledge of feature importance or domain-specific restrictions.

  • The machine/model we want to create will predict the dichotomous variable default.

  • The order of steps is just for illustration; multiple variations do exist.

5.3 Feature Ranking

Feature ranking is one of the most popular methods of identifying the explanatory power of a feature against the set purpose of the model. In our case the purpose is to predict a 0 or 1. The explanatory power has to be captured in a predefined metric, so we can put the features in an ordinal manner.

In our problem setup, we can use the following steps to get feature rankings:

  • For each feature, fit a logistic model (a more elaborate treatment of this topic is covered in Chapter 6) with the dependent variable being default.

  • Calculate the Gini coefficient. Here, the Gini coefficient is the metric we defined to measure the explanatory power of the feature.

  • Rank order features using the Gini coefficient, where the higher Gini coefficient means greater explanatory power of the feature.

Package "MLmetrics"

This is a collection of evaluation metrics, including loss, score, and utility functions, that measure regression, classification, and ranking performance. It is a useful package for calculating classifier performance metrics. We will be using the Gini() function from this package to get the Gini coefficient.
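As a quick, self-contained illustration of how the Gini() call looks (the vectors below are made-up toy values, not the loan data):

#Toy illustration of MLmetrics::Gini(); the values below are made up
library(MLmetrics)

toy_actual    <- c(0, 0, 1, 0, 1, 1, 0, 0)
toy_predicted <- c(0.10, 0.20, 0.80, 0.30, 0.60, 0.90, 0.20, 0.40)

#First argument is the predicted score, second is the observed outcome
Gini(y_pred = toy_predicted, y_true = toy_actual)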

The following code snippet implements these steps:

  1. For each feature in the data, fit a logistic regression using the logit link function.

  2. Calculate the Gini coefficient on all the data (you can also train on training data and calculate the Gini on testing data).

  3. Order all the features by the Gini coefficient (higher to lower).

library(MLmetrics)
performance_metric_gini <- data.frame(feature = character(), Gini_value = numeric())

#Write a loop to go over all features, fit a single-variable logistic model, and compute the Gini coefficient
for (feature in names(data))
{
  if(feature %in% c("id", "loss", "default"))
  {
    next;
  }
  else
  {
    tryCatch({
      glm_model <- glm(default ~ get(feature), data = data, family = binomial(link = "logit"));

      predicted_values <- predict.glm(glm_model, newdata = data, type = "response");

      Gini_value <- Gini(predicted_values, data$default);

      performance_metric_gini <- rbind(performance_metric_gini, cbind(feature, Gini_value));
    }, error = function(e){})
  }
}


performance_metric_gini$Gini_value <-as.numeric(as.character(performance_metric_gini$Gini_value))
#Rank the features by value of Gini Coefficient


Ranked_Features <-performance_metric_gini[order(-performance_metric_gini$Gini_value),]

print("Top 5 Features by Gini Coefficients ")
 [1] "Top 5 Features by Gini Coefficients "
head(Ranked_Features)
     feature Gini_value
 710    f766  0.2689079
 389    f404  0.2688113
 584    f629  0.2521622
 585    f630  0.2506394
 269    f281  0.2503371
 310    f322  0.2447725
Tip

When you are running loops over large datasets, it is possible that the loop might stop due to errors on some iterations. To escape that, consider using the tryCatch() function in R.
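A minimal sketch of this pattern, where risky_model_fit() and feature_list are hypothetical stand-ins for whatever computation might fail and the collection being looped over:

#Sketch of skipping failed iterations with tryCatch();
#risky_model_fit() and feature_list are hypothetical stand-ins
for (feature in feature_list)
{
  tryCatch({
    result <- risky_model_fit(feature)
    #... store or use the result here ...
  }, error = function(e) {
    #On error, do nothing; the loop simply moves on to the next feature
  })
}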

The ranking method tells us that the top six features by their individual predictive power are f766, f404, f629, f630, f281, and f322. The top feature's Gini coefficient is 0.268 (or 26.8%). Now, using this set of top six features, let's create a logistic model and check the same performance metric.

The following code uses the top six features to fit a logistic model on our data. After fitting the model, it then prints out the Gini coefficient of the model.

#Create a logistic model with the top 6 features (f766, f404, f629, f630, f281, and f322)

glm_model <- glm(default ~ f766 + f404 + f629 + f630 + f281 + f322, data = data, family = binomial(link = "logit"));

predicted_values <- predict.glm(glm_model, newdata = data, type = "response");

Gini_value <- Gini(predicted_values, data$default);

summary(glm_model)

 Call:
 glm(formula = default ~ f766 + f404 + f629 + f630 + f281 + f322,
     family = binomial(link = "logit"), data = data)


 Deviance Residuals:
     Min       1Q   Median       3Q      Max  
 -0.7056  -0.4932  -0.4065  -0.3242   3.3407  


 Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
 (Intercept) -3.071639   2.160885  -1.421    0.155    
 f766        -1.609598   2.150991  -0.748    0.454    
 f404         0.351095   2.147072   0.164    0.870    
 f629        -0.505835   0.077767  -6.505 7.79e-11 ***
 f630        -0.090988   0.057619  -1.579    0.114    
 f281        -0.004073   0.008245  -0.494    0.621    
 f322         0.262128   0.055992   4.682 2.85e-06 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 (Dispersion parameter for binomial family taken to be 1)

     Null deviance: 65044  on 105147  degrees of freedom
 Residual deviance: 62855  on 105141  degrees of freedom
   (323 observations deleted due to missingness)
 AIC: 62869


 Number of Fisher Scoring iterations: 6
Gini_value
 [1] 0.2824955

The model result shows that four of the features (f766, f404, f630, and f281) are not significant, and their standard errors are very high. This gives us an indication that the features themselves are highly correlated and hence are not adding value by being in the model. As you can see, the Gini coefficient has barely improved even after adding more variables. The reason for the top features being insignificant could be that all of them are highly correlated. To investigate this multicollinearity issue, we will create the correlation matrix of the six features.

#Create the correlation matrix for the 6 features (f766, f404, f629, f630, f281, and f322)

top_6_feature <- data.frame(data$f766, data$f404, data$f629, data$f630, data$f281, data$f322)

cor(top_6_feature, use = "complete")
            data.f766  data.f404  data.f629  data.f630  data.f281
 data.f766  1.0000000  0.9996710  0.6830923  0.6420238  0.8067094
 data.f404  0.9996710  1.0000000  0.6827368  0.6416069  0.8065005
 data.f629  0.6830923  0.6827368  1.0000000  0.9114775  0.6515478
 data.f630  0.6420238  0.6416069  0.9114775  1.0000000  0.6102867
 data.f281  0.8067094  0.8065005  0.6515478  0.6102867  1.0000000
 data.f322 -0.7675846 -0.7675819 -0.5536863 -0.5127184 -0.7280321
            data.f322
 data.f766 -0.7675846
 data.f404 -0.7675819
 data.f629 -0.5536863
 data.f630 -0.5127184
 data.f281 -0.7280321
 data.f322  1.0000000

It's clear from the correlation structure that the features f766, f404, f630, and f281 are highly correlated, and hence the model results show them to be insignificant. This exercise shows that while feature ranking helps in measuring and quantifying the individual power of variables, it might not be directly usable as a method of variable selection for model development.
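One practical way to act on such a correlation structure, sketched here under the assumption that the caret package is installed and with an illustrative cutoff of 0.9, is to drop one variable from each highly correlated pair before refitting the model:

#Sketch: drop one feature from each highly correlated pair (assumes the caret package)
library(caret)

cor_matrix <- cor(top_6_feature, use = "complete")

#Column indices flagged for removal at an absolute-correlation cutoff of 0.9
high_cor_index <- findCorrelation(cor_matrix, cutoff = 0.9)

if (length(high_cor_index) > 0) {
  reduced_features <- top_6_feature[, -high_cor_index, drop = FALSE]
} else {
  reduced_features <- top_6_feature
}
names(reduced_features)

Refitting the logistic model on this reduced set would help avoid the inflated standard errors seen above.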

Guyon and Elisseeff provide the following criticism for this variable ranking method:

[The] variable ranking method leads to the selection of a redundant subset. The same performance could possibly be achieved with a smaller subset of complementary variables.

You can verify this by looking at the correlation matrix and the significant variables in the model. The two significant variables are complementary and provide a similar Gini coefficient.

5.4 Variable Subset Selection

Variable subset selection is the process of selecting a subset of features (or variables) to use in the machine learning model. In the previous section, we tried to create a subset of variables using the individual ranking of variables, but we observed the limitations of feature ranking as a variable selection method. Now we formally introduce the process of variable subset selection. We will discuss one method from each broad category and show an example using the credit loss data. You are encouraged to compare the results and assess which method suits your machine learning problem best.

Isabelle Guyon and Andre Elisseeff provide a comprehensive introduction to the various methods of variable (or feature) selection. They describe the criteria of the different methods as measuring the "usefulness" or "relevance" of features to qualify them for the variable subset. The three broad methods—filter, wrapper, and embedded—are illustrated with our credit loss data.

5.4.1 Filter Method

The filter method uses the intrinsic properties of variables, ignoring the machine learning method itself. This method is useful for classification problems where each variable adds incremental classification power.

Criterion: Measure feature/feature subset "relevance"

Search: Order features by individual feature ranking or nested subset of features

Assessment: Using statistical tests

Statistical Approaches

  1. Information gain

  2. Chi-square test

  3. Fisher score

  4. Correlation coefficient

  5. Variance threshold

Results

  1. Relatively more robust against overfitting

  2. Might not select the most "useful" set of features

For this method, we will show the variance threshold approach, which is based on the basic concept that variables with high variability carry more information. Variance thresholding is a simple baseline approach: we remove all the variables whose variance is less than a threshold, which automatically removes the variables having zero variance.

Note

The features in our dataset are not standardized and hence we cannot directly compare their variances. We will be using the coefficient of variation (CoV) to choose the top features for model building. Also, the following exercise is shown only for continuous features; for categorical variables, use a chi-square test.

The coefficient of variation (CoV), also known as the relative standard deviation, provides a standardized measure of the dispersion of a variable. It is defined as the ratio of the standard deviation to the mean of the variable: $$ c_v = \frac{\sigma}{\mu} $$

Here, we calculate the mean and standard deviation of each continuous variable and take their ratio to compute the coefficient of variation (CoV). The features are then ordered by decreasing CoV.

#Calculate the coefficient of variation for each continuous variable (standard deviation divided by the mean)

coefficient_of_variance <- data.frame(feature = character(), cov = numeric())

#Write a loop to go over all continuous features and calculate the coefficient of variation
for (feature in names(data))
{
  if(feature %in% c("id", "loss", "default"))
  {
    next;
  }
  else if(feature %in% continuous)
  {
    tryCatch({
      cov <- abs(sd(data[[feature]], na.rm = TRUE)/mean(data[[feature]], na.rm = TRUE));
      if(cov != Inf){
        coefficient_of_variance <- rbind(coefficient_of_variance, cbind(feature, cov));
      } else {
        next;
      }
    }, error = function(e){})
  }
  else
  {
    next;
  }
}


coefficient_of_variance$cov <-as.numeric(as.character(coefficient_of_variance$cov))

#Order the list by highest to lowest coefficient of variation

Ranked_Features_cov <-coefficient_of_variance[order(-coefficient_of_variance$cov),]

print("Top 5 Features by Coefficient of Variance ")
 [1] "Top 5 Features by Coefficient of Variance "
head(Ranked_Features_cov)
     feature       cov
 295    f338 164.46714
 378    f422 140.48973
 667    f724  87.22657
 584    f636  78.06823
 715    f775  70.24765
 666    f723  46.31984

The coefficient of variation gives us the top six features ordered by their CoV values. The features that show up in the top six (f338, f422, f724, f636, f775, and f723) are then used to fit a binomial logistic model. We calculate the Gini coefficient of that model to assess whether these variables improve the Gini over the individual features, as discussed earlier.

#Create a logistic model with the top 6 features (f338, f422, f724, f636, f775, and f723)

glm_model <- glm(default ~ f338 + f422 + f724 + f636 + f775 + f723, data = data, family = binomial(link = "logit"));

predicted_values <- predict.glm(glm_model, newdata = data, type = "response");

Gini_value <- Gini(predicted_values, data$default);

summary(glm_model)

 Call:
 glm(formula = default ~ f338 + f422 + f724 + f636 + f775 + f723,
     family = binomial(link = "logit"), data = data)


 Deviance Residuals:
     Min       1Q   Median       3Q      Max  
 -1.0958  -0.4839  -0.4477  -0.4254   2.6363  


 Coefficients:
               Estimate Std. Error  z value Pr(>|z|)    
 (Intercept) -2.206e+00  1.123e-02 -196.426  < 2e-16 ***
 f338        -1.236e-25  2.591e-25   -0.477    0.633    
 f422         1.535e-01  1.373e-02   11.183  < 2e-16 ***
 f724         1.392e+01  9.763e+00    1.426    0.154    
 f636        -1.198e-06  2.198e-06   -0.545    0.586    
 f775         6.412e-02  1.234e-02    5.197 2.03e-07 ***
 f723        -5.181e+00  4.623e+00   -1.121    0.262    
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 (Dispersion parameter for binomial family taken to be 1)

     Null deviance: 59064  on 90687  degrees of freedom
 Residual deviance: 58898  on 90681  degrees of freedom
   (14783 observations deleted due to missingness)
 AIC: 58912


 Number of Fisher Scoring iterations: 6
cat("The Gini Coefficient for the fitted model is  ",Gini_value);
 The Gini Coefficient for the fitted model is   0.1445109

This method does not show any improvement in the number of significant variables among the top six; only two features, f422 and f775, are significant. Also, the model's overall performance is worse, i.e., the Gini coefficient is only 0.144 (14.4%). For completeness of analysis, let's create the correlation matrix for these six features to see whether the variables are correlated and hence insignificant.

#Create the correlation matrix for 6 features (f338,f422,f724,f636,f775 and f723)                  

top_6_feature <-data.frame(as.double(data$f338),as.double(data$f422),as.double(data$f724),as.double(data$f636),as.double(data$f775),as.double(data$f723))

cor(top_6_feature, use="complete")
                      as.double.data.f338. as.double.data.f422.
 as.double.data.f338.         1.000000e+00          0.009542857
 as.double.data.f422.         9.542857e-03          1.000000000
 as.double.data.f724.         4.335480e-02          0.006249059
 as.double.data.f636.        -6.708839e-05          0.011116608
 as.double.data.f775.         5.537591e-03          0.050666549
 as.double.data.f723.         5.048078e-02          0.005556227
                      as.double.data.f724. as.double.data.f636.
 as.double.data.f338.         0.0433548003        -6.708839e-05
 as.double.data.f422.         0.0062490589         1.111661e-02
 as.double.data.f724.         1.0000000000        -1.227539e-04
 as.double.data.f636.        -0.0001227539         1.000000e+00
 as.double.data.f775.         0.0121451180        -7.070228e-03
 as.double.data.f723.         0.9738147134        -2.157437e-04
                      as.double.data.f775. as.double.data.f723.
 as.double.data.f338.          0.005537591         0.0504807821
 as.double.data.f422.          0.050666549         0.0055562270
 as.double.data.f724.          0.012145118         0.9738147134
 as.double.data.f636.         -0.007070228        -0.0002157437
 as.double.data.f775.          1.000000000         0.0190753853
 as.double.data.f723.          0.019075385         1.0000000000

You can clearly see that the correlation structure is not dominating this feature set; rather, individual feature relevance is driving selection into the modeling subset. This is expected, as we selected the variables based on CoV, which is computed independently of any other variable.

5.4.2 Wrapper Methods

Wrapper methods use a search algorithm to search the space of possible feature subsets and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of overfitting to the model.

Criterion: Measure feature subset "usefulness"

Search: Search the space of all feature subsets and select the set with the highest score

Assessment: Cross validation

Statistical Approaches

  1. Recursive feature elimination

  2. Sequential feature selection algorithms

    • Sequential Forward Selection

    • Sequential Backward Selection

    • Plus-l Minus-r Selection

    • Bidirectional Search

    • Sequential Floating Selection

  3. Genetic algorithm

Results

  1. Give the most useful features for model building

  2. Can cause overfitting

We will discuss sequential methods for illustration purposes. The most popular sequential methods are forward and backward selection. A variation that combines both is called the stepwise method.

Steps in a forward variable selection algorithm are as follows:

  1. Choose a model with only one variable, which gives the maximum value in your evaluation function.

  2. Add the next variable that improves the evaluation function by a maximum value.

  3. Keep repeating Step 2 until there is no more improvement by adding a new variable.

As you can see, this method is computationally intensive and iterative. It's important to start with a set of variables carefully chosen for the problem; using all the available features might not be cost effective. Filter methods can help shorten your list of variables to a manageable set for wrapper methods.

To set up the illustrative example, let's take a subset of 10 features from the total set of features: the top five continuous variables from our filter method output and five randomly chosen categorical variables.

#Pull 5 variables we had from highest coefficient of variation (from filter method)(f338,f422,f724,f636 and f775)                  

predictor_set <-c("f338","f422","f724","f636","f775")

#Randomly Pull 5 variables from categorical variable set ( Reader can apply filter method to categorical variable and can choose these 5 variables systematically as well)
set.seed(101);
ind <-sample(1:length(categorical), 5, replace=FALSE)
p<-1
for (i in ind)
{
  predictor_set [5+p] <-categorical[i]
  p=p+1
}


#Print the set of 10 variables we will be working with

print(predictor_set)
  [1] "f338" "f422" "f724" "f636" "f775" "f222" "f33"  "f309" "f303" "f113"
#Replaced f33 by f93 as f33 does not have levels


predictor_set[7] <- "f93"

#Print final list of variables

print(predictor_set)
  [1] "f338" "f422" "f724" "f636" "f775" "f222" "f93"  "f309" "f303" "f113"

We are preparing to predict the probability of someone defaulting in the next one-year time period. Our objective is to select the model based on following characteristics:

  • A fewer number of predictors is preferable

  • Penalize a model having a lot of predictors

  • Penalize a model for a bad fit

To measure these effects, we will use the Akaike Information Criterion (AIC) as the evaluation metric. AIC is founded on information theory; it measures the quality of a model relative to other models. When comparing models, it captures the tradeoff between the goodness of fit of the model and the complexity of the model, where complexity is represented by the number of variables in the model (more variables mean greater complexity).

In statistics, AIC is defined as: $$ \mathrm{AIC} = 2k - 2\ln(L) = 2k + \mathrm{Deviance} $$
where k is the number of parameters (or features) and L is the maximized value of the model's likelihood function.
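For a binary logistic model such as the ones fitted earlier, the AIC reported by R can be reproduced directly from this formula. The sketch below assumes glm_model is any of the fitted glm objects from the previous code:

#Sketch: reproduce AIC = 2k + Deviance from a fitted binary logistic glm;
#glm_model is assumed to be one of the models fitted earlier
k <- length(coef(glm_model))        #number of estimated parameters (assumes no aliased terms)
manual_aic <- 2 * k + deviance(glm_model)

#Compare with the built-in extractor
AIC(glm_model)
manual_aic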

Note

AIC is a relative measure; hence, it does not tell you anything about the quality of the model in the absolute sense.

To illustrate the feature selection by forward selection, we need to first develop two models, one with all features and one with no features:

  • Full model: A model with all the variables included in it. This model provides an upper limit on the complexity of model

  • Null model: A model with no variables in it, just an intercept term. This model provides a lower limit on the complexity of model.

Once we have these two models, we can start the feature selection based on the AIC measure. These models are important because AIC is a relative measure of model fit; the candidate models will be assessed between these two extremes. Let's first create a full model with all the predictors and see its summary (the output is truncated):

#Create a small modeling dataset with only the predictors and the dependent variable
library(data.table)
data_model <- data[, .(id, f338, f422, f724, f636, f775, f222, f93, f309, f303, f113, default), ]

#Make sure to remove the missing cases to resolve errors regarding null values
data_model <- na.omit(data_model)

#Full model uses all the 10 variables
full_model <- glm(default ~ f338 + f422 + f724 + f636 + f775 + f222 + f93 + f309 + f303 + f113, data = data_model, family = binomial(link = "logit"))

#Summary of the full model
summary(full_model)


 Call:
 glm(formula = default ~ f338 + f422 + f724 + f636 + f775 + f222 +
     f93 + f309 + f303 + f113, family = binomial(link = "logit"),
     data = data_model)


 Deviance Residuals:
     Min       1Q   Median       3Q      Max  
 -0.9844  -0.4803  -0.4380  -0.4001   2.7606  


 Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
 (Intercept) -2.423e+00  3.146e-02 -77.023  < 2e-16 ***
 f338        -1.379e-25  2.876e-25  -0.480 0.631429    
 f422         1.369e-01  1.387e-02   9.876  < 2e-16 ***
 f724         3.197e+00  1.485e+00   2.152 0.031405 *  
 f636        -9.976e-07  1.851e-06  -0.539 0.589891    
 f775         5.965e-02  1.287e-02   4.636 3.55e-06 ***
......Output truncated
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 (Dispersion parameter for binomial family taken to be 1)

     Null deviance: 58874  on 90287  degrees of freedom
 Residual deviance: 58189  on 90214  degrees of freedom
 AIC: 58337


 Number of Fisher Scoring iterations: 12

This output shows the summary of the full model built using all 10 variables. Now, let's similarly create the null model:

#Null model uses no variables
null_model <- glm(default ~ 1, data = data_model, family = binomial(link = "logit"))

#Summary of the null model
summary(null_model)


 Call:
 glm(formula = default ~ 1, family = binomial(link = "logit"),
     data = data_model)


 Deviance Residuals:
     Min       1Q   Median       3Q      Max  
 -0.4601  -0.4601  -0.4601  -0.4601   2.1439  


 Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
 (Intercept) -2.19241    0.01107    -198   <2e-16 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


 (Dispersion parameter for binomial family taken to be 1)

     Null deviance: 58874  on 90287  degrees of freedom
 Residual deviance: 58874  on 90287  degrees of freedom
 AIC: 58876


 Number of Fisher Scoring iterations: 4

At this stage, we have seen the two extremes of model performance: a model with all the variables and a model without any variables (basically the historical average of the dependent variable). With these extreme models in place, we will perform forward selection, starting with the null model and adding variables to it.

Forward selection is done in iterations over the variable subset. Observe that the base model for the first iteration is the null model, with an AIC of 58876. Below that is the list of candidate variables to add to the model.

#Summary of the forward selection method
forwards <- step(null_model, scope = list(lower = formula(null_model), upper = formula(full_model)), direction = "forward")
 Start:  AIC=58876.26
 default ~ 1


        Df Deviance   AIC
 + f222  7    58522 58538
 + f422  1    58743 58747
 + f113  7    58769 58785
 + f303 24    58780 58830
 + f775  1    58841 58845
 + f93   7    58837 58853
 + f309 23    58806 58854
 + f724  1    58870 58874
 <none>       58874 58876
 + f636  1    58873 58877
 + f338  1    58874 58878


Iteration 1: The procedure added f222 to the model.

 Step:  AIC=58538.39
 default ~ f222


        Df Deviance   AIC
 + f422  1    58405 58423
 + f113  7    58461 58491
 + f303 24    58434 58498
 + f775  1    58495 58513
 + f93   7    58486 58516
 + f309 23    58462 58524
 + f724  1    58518 58536
 <none>       58522 58538
 + f636  1    58522 58540
 + f338  1    58522 58540


Iteration 2: The procedure added f422 to the model.

 Step:  AIC=58422.87
 default ~ f222 + f422


        Df Deviance   AIC
 + f113  7    58346 58378
 + f303 24    58323 58389
 + f93   7    58370 58402
 + f775  1    58383 58403
 + f309 23    58353 58417
 + f724  1    58401 58421
 <none>       58405 58423
 + f636  1    58404 58424
 + f338  1    58404 58424


Iteration 3: The procedure added f113 to the model.

 Step:  AIC=58377.8
 default ~ f222 + f422 + f113


        Df Deviance   AIC
 + f303 24    58265 58345
 + f775  1    58325 58359
 + f309 23    58295 58373
 + f724  1    58342 58376
 <none>       58346 58378
 + f636  1    58345 58379
 + f338  1    58345 58379
 + f93   7    58338 58384


Iteration 4: The procedure added f303 to the model.

 Step:  AIC=58345.04
 default ~ f222 + f422 + f113 + f303


        Df Deviance    AIC
 + f775  1    58245 58327
 + f724  1    58261 58343
 <none>       58265 58345
 + f636  1    58264 58346
 + f338  1    58265 58347
 + f309 23    58225 58351
 + f93   7    58257 58351


Iteration 5: The procedure added f775 to the model.

 Step:  AIC=58326.96
 default ~ f222 + f422 + f113 + f303 + f775


        Df Deviance   AIC
 + f724  1    58241 58325
 <none>       58245 58327
 + f636  1    58244 58328
 + f338  1    58244 58328
 + f309 23    58202 58330
 + f93   7    58237 58333


Iteration 6: The procedure added f724 to the model.

 Step:  AIC=58325.08
 default ~ f222 + f422 + f113 + f303 + f775 + f724


        Df Deviance   AIC
 <none>       58241 58325
 + f636  1    58240 58326
 + f338  1    58240 58326
 + f309 23    58199 58329
 + f93   7    58233 58331

In the last iteration, iteration six, you can see that our model has reached its final set of variables. The top suggestion is <none>, which means we are better off not adding any more variables to the model. Now let's see how our final forward selection model looks:

#Summary of the final model from the forward selection process
formula(forwards)
 default ~ f222 + f422 + f113 + f303 + f775 + f724

The forward selection method says that the best model under the AIC criterion can be created with these six features: f222, f422, f113, f303, f775, and f724. Other feature selection methods, like backward selection and stepwise selection, can be run in a similar manner, as sketched below. In the next section, we introduce embedded methods, which are computationally better than wrapper methods.
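For reference, a sketch of backward and stepwise selection with the same step() function, reusing the full_model and null_model objects built earlier, would look like this:

#Sketch: backward elimination starts from the full model and drops one variable at a time
backwards <- step(full_model,
                  scope = list(lower = formula(null_model), upper = formula(full_model)),
                  direction = "backward")

#Stepwise selection can add or drop a variable at each iteration
stepwise <- step(null_model,
                scope = list(lower = formula(null_model), upper = formula(full_model)),
                direction = "both")

formula(backwards)
formula(stepwise)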

5.4.3 Embedded Methods

Embedded methods are similar to wrapper methods in that they also optimize an objective function, usually a model performance evaluation function. The difference from wrapper methods is that an intrinsic model-building metric is used during the learning of the model itself. Essentially, this is still a search problem, but a guided search, and hence it is computationally less expensive.

Criterion: Measure feature subset "usefulness"

Search: Search the space of all feature subsets guided by the learning process

Assessment: Cross validation

Statistical Approaches

  1. L1 (LASSO) regularization

  2. Decision tree

  3. Forward selection with Gram-Schmidt orthogonalization

  4. Gradient descent methods

Results

  1. Similar to wrapper but with guided search

  2. Less computationally expensive

  3. Less prone to overfitting

For this method, we will show a regularization technique. In machine learning, regularization is a process of introducing additional information to prevent overfitting while searching through the variable subset space. In this section, we show an illustration of L1 regularization for variable selection.

L1 regularization for variable selection is also called LASSO (Least Absolute Shrinkage and Selection Operator ). This method was introduced by Robert Tibshirani in his famous 1996 paper titled “Regression Shrinkage and Selection via the Lasso,” published in the Journal of the Royal Statistical Society.

In L1 or LASSO regression, we add a penalty term against complexity to reduce the degree of overfitting or the variance of the model by adding additional bias. So the objective function to minimize looks like this: $$ \mathrm{regularized\ cost} = \mathrm{cost} + \mathrm{regularization\ penalty} $$

In LASSO regularization, the general form of the objective function is $$ \frac{1}{N}\sum_{i=1}^{N} f\left(x_i, y_i, \alpha, \beta\right) $$

The LASSO regularized version of the estimator is the solution to $$ \min_{\alpha,\beta}\ \frac{1}{N}\sum_{i=1}^{N} f\left(x_i, y_i, \alpha, \beta\right) \quad \text{subject to } \|\beta\|_1 \le t $$ where only β is penalized, while α is free to take any allowed value. Minimizing this penalized objective shrinks some coefficients exactly to zero, which is what performs the variable selection.

The objective function for penalized logistic regression uses the negative binomial log-likelihood and is as follows: $$ \min_{(\beta_0,\beta)\in\mathbb{R}^{p+1}} -\left[\frac{1}{N}\sum_{i=1}^{N} y_i\cdot\left(\beta_0+x_i^T\beta\right)-\log\left(1+e^{\beta_0+x_i^T\beta}\right)\right]+\lambda\left[(1-\alpha)\|\beta\|_2^2/2+\alpha\|\beta\|_1\right] $$

Logistic regression is often plagued with degeneracy when p > N and exhibits wild behavior even when N is close to p; the elastic-net penalty alleviates these issues, and regularizes and selects variables as well. Source: https://web.stanford.edu/~hastie/glmnet/glmnet_beta.html .
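In glmnet, this elastic-net mix is controlled by the alpha argument (alpha = 1, the default, is pure LASSO). The following sketch assumes a numeric predictor matrix x and a 0/1 response y, as constructed in the example that follows:

#Sketch: the alpha argument of glmnet() sets the elastic-net mixing parameter;
#x and y are assumed to be the matrices built in the next code block
library(glmnet)

fit_lasso   <- glmnet(x, y, family = "binomial", alpha = 1)    #pure L1 (LASSO), the default
fit_ridge   <- glmnet(x, y, family = "binomial", alpha = 0)    #pure L2 (ridge)
fit_elastic <- glmnet(x, y, family = "binomial", alpha = 0.5)  #equal mix of L1 and L2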

We will run this example on a set of 10 continuous variables in the dataset.

#Create data frame with dependent and independent variables (Remove NA)                  

data_model <-na.omit(data)

y <-as.matrix(data_model$default)

x <-as.matrix(subset(data_model, select=continuous[250:260]))

#We will be using the glmnet package to fit the regularized model
library("glmnet")

#Fit a model with dependent variable of binomial family
fit = glmnet(x, y, family = "binomial")


#Summary of fit model
summary(fit)
          Length Class     Mode     
a0          44    -none-    numeric  
beta       440    dgCMatrix S4       
df          44    -none-    numeric  
dim          2    -none-    numeric  
lambda      44    -none-    numeric  
dev.ratio   44    -none-    numeric  
nulldev      1    -none-    numeric  
npasses      1    -none-    numeric  
jerr         1    -none-    numeric  
offset       1    -none-    logical  
classnames   2    -none-    character
call         4    -none-    call     
nobs         1    -none-    numeric

Figure 5-3 plots the coefficients of these 10 variables against the fraction of deviance explained.

Figure 5-3. Coefficient and fraction of deviance explained by each feature/variable

#Plot the output of the glmnet fit model
plot(fit, xvar = "dev", label = TRUE)

In the plot of the 10 variables shown in Figure 5-3, you can see the coefficients of all the variables except #7 and #5, which are 0. As the next step, we will cross-validate our fit. For logistic regression, we use cv.glmnet(), which has similar arguments and usage to the Gaussian case. For instance, let's use the misclassification error as the criterion for 10-fold cross-validation.

#Fit a cross validated binomial model                  
fit_logistic =cv.glmnet(x,y, family="binomial", type.measure ="class")


#Summary of fitted Cross Validated Linear Model

summary(fit_logistic)
Length Class  Mode     
lambda     43     -none- numeric  
cvm        43     -none- numeric  
cvsd       43     -none- numeric  
cvup       43     -none- numeric  
cvlo       43     -none- numeric  
nzero      43     -none- numeric  
name        1     -none- character
glmnet.fit 13     lognet list     
lambda.min  1     -none- numeric  
lambda.1se  1     -none- numeric

The plot in Figure 5-4 explains how the misclassification rate changes with the penalization factor, and hence with the set of features retained in the model. The plot shows that the model is pretty bad, as the variables we provided perform poorly on this data.

Figure 5-4. Misclassification error and log of penalization factor (lambda)

#Plot the results
plot(fit_logistic)

For a good model, Figure 5-4 would show an upward trend in the red dots as the penalty increases; that is when you know the retained variables are capturing real variability in your dataset.

We can now pull the regularization factor from the cross-validated glmnet fit and extract the variable coefficients and variable names.

#Print the minimum lambda - regularization factor                  
print(fit_logistic$lambda.min)
 [1] 0.003140939
print(fit_logistic$lambda.1se)
 [1] 0.03214848
#Against the lambda minimum value we can get the coefficients
param <-coef(fit_logistic, s="lambda.min")


param <-as.data.frame(as.matrix(param))

param$feature<-rownames(param)

#The list of variables suggested by the embedded method

param_embeded <-param[param$`1`>0,]

print(param_embeded)
                 1 feature
 f279 8.990477e-03    f279
 f298 2.275977e-02    f298
 f322 1.856906e-01    f322
 f377 1.654554e-04    f377
 f452 1.326603e-04    f452
 f453 1.137532e-05    f453
 f471 1.548517e+00    f471
 f489 1.741923e-02    f489

The final features suggested by the LASSO method are f279, f298, f322, f377, f452, f453, f471, and f489. Feature selection is a statistically intense topic. You are encouraged to read more about these methods and make sure the chosen methodology fits the business problem you are trying to solve. In most real scenarios, data scientists have to design a mixture of techniques to get the desired set of variables for machine learning.

5.5 Dimensionality Reduction

In recent years, there has been an explosion in the amount as well as the type of data available at the data scientist's disposal. Traditional machine learning algorithms partly break down because of the volume of data and mostly because of the number of variables associated with each observation. The dimension of the data is the number of variables we have for each observation.

Higher dimensions mean both opportunity and challenge for machine learning algorithms. Higher dimensions can allow you to capture structure that cannot be observed in low dimensions and, at the same time, they make the machine learning problem hard to converge. In this context, Richard E. Bellman coined the term Curse of Dimensionality, which refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) and that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.

In machine learning problems, the addition of each feature into the dataset exponentially increases the requirement of data points to train the model. The learning algorithm needs an enormous amount of data to search the right model in the higher dimensional space. With a fixed number of training samples, the predictive power reduces as the dimensionality increases, and this is known as the Hughes phenomenon (named after Gordon F. Hughes).

Dimensionality reduction is the process of deriving a smaller set of degrees of freedom that can reproduce most of the variability of a dataset. Essentially, you create new orthogonal features from the raw data that can explain a large part of the variance in the actual features.

In mathematical terms, the problem we investigate can be stated as follows: given the p-dimensional random variable x = (x_1, . . . , x_p)^T, find a lower dimensional representation of it, s = (s_1, . . . , s_k)^T with k ≤ p, that captures the content of the original data according to some criterion.

Dimensionality reduction is a process of feature extraction rather than feature selection. Feature extraction transforms the data in the high-dimensional space into a space of fewer dimensions. The data transformation may be linear, as in Principal Component Analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning. For example, by using PCA you can reduce a set of variables to a smaller set of principal components to model with: rather than using all 100 features in raw form, you might use the top 10 principal components and build a model with performance similar to the full model.

Within the scope of this chapter, we will discuss the most popular technique, Principal Component Analysis (PCA). PCA is based on the covariance matrix; it is a second-order method. A covariance matrix is a matrix whose element in the (i, j) position is the covariance between the ith and jth elements of a random vector. The covariance matrix plays a key role in financial economics, especially in portfolio theory, its mutual fund separation theorem, and the capital asset pricing model. PCA creates a linear mapping of the data to a lower-dimensional space such that the variance of the data in the low-dimensional representation is maximized. The method is closely related to techniques known under other names, e.g., Singular Value Decomposition (SVD) and the Hotelling transformation.
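
Because PCA is driven entirely by the covariance matrix, you can reproduce what prcomp() reports with a plain eigen decomposition. The toy example below uses simulated data (any numeric matrix would do) and checks that the eigenvalues of the covariance matrix of the scaled data match the squared standard deviations returned by prcomp(); it is only a sketch to make the "second order" nature of PCA concrete.

#Minimal sketch: PCA as an eigen decomposition of the covariance matrix
set.seed(7)
X <- matrix(rnorm(500), ncol = 5)   #toy data: 100 rows, 5 columns
X_scaled <- scale(X)                #center and scale, mean=0 and sd=1

eig <- eigen(cov(X_scaled))         #eigen decomposition of the covariance
pca <- prcomp(X_scaled)

#Eigenvalues equal the variances of the principal components
round(eig$values, 6)
round(pca$sdev^2, 6)

#Eigenvectors match the rotation (loadings), up to sign
round(eig$vectors[, 1], 4)
round(pca$rotation[, 1], 4)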

To illustrate PCA, we will work with 10 randomly chosen continuous variables from our data, create the principal components, and check how much of the variability in the data they explain.

Here are the steps for principal component analysis:

  1. Load the data as a data.frame.

  2. Normalize/scale the data.

  3. Apply the prcomp() function to get the principal components.

This performs a principal components analysis on the given data matrix and returns the results as an object of class prcomp.

#Take a subset of 10 features                
pca_data <-data[,.(f381,f408,f495,f529,f549,f539,f579,f634,f706,f743)]


pca_data <-na.omit(pca_data)

head(pca_data)
       f381 f408   f495       f529  f549 f539   f579   f634    f706
 1: 1598409    5 238.58 1921993.90 501.0  552 462.61  0.261  4.1296
 2:  659959    6   5.98  224932.72 110.0   76  93.77 11.219  4.1224
 3: 2036578   13  33.61  192046.42 112.0  137 108.60 16.775  9.2215
 4:  536256    4 258.23  232373.41 161.0  116 127.84  1.120  3.2036
 5: 2264524   26   1.16   52265.58  21.0   29  20.80 17.739 21.0674
 6: 5527421   22  38.91  612209.01 375.9  347 317.27 11.522 17.8663
         f743
 1:    -21.82
 2:    -72.44
 3:    -79.48
 4:     18.15
 5: -10559.05
 6:   8674.08


#Normalize the data before applying PCA: mean=0 and sd=1
scaled_pca_data <-scale(pca_data)


head(scaled_pca_data)
            f381       f408       f495       f529       f549       f539
 [1,] -0.5692025 -0.6724669  1.7551841  0.4825810  0.9085923  0.9507127
 [2,] -0.6549414 -0.6186983 -0.9505976 -0.4712597 -0.5448800 -0.6449880
 [3,] -0.5291705 -0.2423176 -0.6291842 -0.4897436 -0.5374454 -0.4404970
 [4,] -0.6662432 -0.7262356  1.9837680 -0.4670777 -0.3552967 -0.5108955
 [5,] -0.5083448  0.4566750 -1.0066675 -0.5683081 -0.8757215 -0.8025467
 [6,] -0.2102394  0.2416004 -0.5675306 -0.2535894  0.4435555  0.2634886
            f579        f634       f706       f743
 [1,]  1.0324757 -0.30383519 -0.5885608 -0.1716417
 [2,] -0.5546476  0.06876713 -0.5890247 -0.1751343
 [3,] -0.4908339  0.25768651 -0.2604470 -0.1756200
 [4,] -0.4080440 -0.27462681 -0.6482307 -0.1688839
 [5,] -0.8686385  0.29046517  0.5028836 -0.8986722
 [6,]  0.4070758  0.07906997  0.2966099  0.4283437

Do the decomposition on the scaled series:

pca_results <-prcomp(scaled_pca_data)

print(pca_results)
 Standard deviations:
  [1] 1.96507747 1.63138621 0.98482612 0.96399979 0.92767640 0.61171578
  [7] 0.55618915 0.13051700 0.12485945 0.03347933


 Rotation:
              PC1          PC2         PC3         PC4          PC5
 f381  0.05378102  0.467799305  0.12132602 -0.42802089  0.126159741
 f408  0.15295858  0.564941709 -0.01768741 -0.07653169  0.024978144
 f495 -0.20675453 -0.006500783 -0.16011133 -0.40648723 -0.872112347
 f529 -0.43704261  0.071515698  0.03229563  0.02515962  0.023404863
 f549 -0.48355364  0.131867970  0.03001595  0.07933850  0.098468782
 f539 -0.49110704  0.119977024  0.03264945  0.06070189  0.084331260
 f579 -0.48599970  0.130907456  0.03066637  0.07796726   0.098436970
 f634  0.08047589  0.148642810  0.80498132  0.42520275 -0.369686177
 f706  0.13666005  0.563301330 -0.06671534 -0.04782415  0.003828164
 f743  0.05999412  0.261771729 -0.55039555  0.66778245 -0.243211544
              PC6         PC7          PC8         PC9          PC10
 f381 -0.73377400 -0.14656999  0.020865868  0.06391263  2.449224e-03
 f408  0.33818854  0.09731467  0.100148531 -0.71887123  1.864559e-03
 f495  0.05113531 -0.05328517  0.010515158 -0.01387541  2.371417e-03
 f529 -0.16222155  0.87477550  0.099118491  0.01647113  3.417335e-03
 f549  0.10180105 -0.29558279  0.504078886  0.07149433 -6.123361e-01
 f539  0.02135767 -0.16116039 -0.811700032 -0.13982619 -1.664926e-01
 f579  0.09037093 -0.27164477  0.222859021  0.03262620  7.728324e-01
 f634 -0.07273691 -0.01754913 -0.002658235  0.01905427 -2.230924e-05
 f706  0.42035273  0.10600450 -0.130052945  0.67259788 -5.277646e-03
 f743 -0.34249087 -0.04793683  0.007771732 -0.01404485  3.873828e-04

Here is the summary of 10 principal components we get after applying the prcomp() function.

summary(pca_results)
 Importance of components:
                           PC1    PC2     PC3     PC4     PC5     PC6
 Standard deviation     1.9651 1.6314 0.98483 0.96400 0.92768 0.61172
 Proportion of Variance 0.3861 0.2661 0.09699 0.09293 0.08606 0.03742
 Cumulative Proportion  0.3861 0.6523 0.74928 0.84221 0.92827 0.96569
                            PC7    PC8     PC9    PC10
 Standard deviation     0.55619 0.1305 0.12486 0.03348
 Proportion of Variance 0.03093 0.0017 0.00156 0.00011
 Cumulative Proportion  0.99663 0.9983 0.99989 1.00000

The plot in Figure 5-5 shows the variance explained by each principal component. From the cumulative proportions in the summary, the first five principal components capture roughly 93% of the variance stored in the 10 variables.

Figure 5-5. Variance explained by principal components
plot(pca_results)
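
Rather than reading the cut-off from the plot, you can compute the cumulative proportion of variance directly from the prcomp object and pick the smallest number of components that crosses a chosen threshold. A minimal sketch using the pca_results object fitted above, with a 90% threshold as an example:

#Minimal sketch: pick the number of components explaining >= 90% of variance
var_explained <- pca_results$sdev^2 / sum(pca_results$sdev^2)
cum_var <- cumsum(var_explained)

n_components <- which(cum_var >= 0.90)[1]
print(round(cum_var, 3))
print(n_components)   #5 for this data, in line with the summary above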

The plot in Figure 5-6 is a biplot of principal component 1 against principal component 2. Because the decomposition is orthogonal, the PC1 and PC2 axes in the plot are at 90 degrees to each other.

Figure 5-6. Orthogonality of principal components 1 and 2
#Create the biplot with principal components
biplot(pca_results, col =c("red", "blue"))

So instead of using all 10 variables for machine learning, you can use the top five principal components to train the model and still preserve more than 90% of the information.
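
The component scores themselves are stored in the prcomp object (and can be generated for new observations with predict()), so they can be dropped into a model in place of the raw columns. A minimal sketch, assuming a binary target vector y aligned with the rows of pca_data; y is a hypothetical placeholder for the response in your own data.

#Minimal sketch: use the first five principal component scores as features
#y is a hypothetical binary target aligned with the rows of pca_data
pc_scores <- as.data.frame(pca_results$x[, 1:5])

model_data <- cbind(pc_scores, y = y)
pc_model <- glm(y ~ ., data = model_data, family = binomial())

#New observations must be scaled with the same centers and scales
#before being rotated into the principal component space:
#new_scores <- predict(pca_results,
#                      newdata = scale(new_raw_data,
#                                      center = attr(scaled_pca_data, "scaled:center"),
#                                      scale  = attr(scaled_pca_data, "scaled:scale")))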

Advantages of principal component analysis include:

  • Reduces the time and storage space required.

  • Removes multicollinearity and improves the performance of the machine learning model.

  • Makes it easier to visualize the data when reduced to very low dimensions such as 2D or 3D.

5.6 Feature Engineering Checklist

The feature selection checklist is a great source of decision-making steps for the variable selection process. The list is sourced from the paper “An Introduction to Variable and Feature Selection” by Isabelle Guyon and Andre Elisseeff. For a more in-depth understanding, refer to the paper.

Selection problem checklist:

  1. Do you have domain knowledge? If yes, construct a better set of features.

  2. Are your features commensurate? If no, consider normalizing them.

  3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow.

  4. Do you need to prune the input variables (e.g., for cost, speed, or data understanding reasons)? If no, construct disjunctive features or weighted sums of features (e.g., by clustering or matrix factorization).

  5. Do you need to assess features individually (e.g., to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; otherwise, do it anyway to get baseline results.

  6. Do you need a predictor? If no, stop.

  7. Do you suspect your data is dirty (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.

  8. Do you know what to try first? If no, use a linear predictor and a forward selection method with the “probe” method as a stopping criterion, or use the 0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve the performance with a smaller subset? If yes, try a non-linear predictor with that subset.

  9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection, and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.

  10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several bootstraps (a small sketch of this idea follows the list).
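
Point 10 can be made concrete with a small bootstrap experiment: repeat the LASSO-based selection on several resamples and keep only the features that are picked consistently. The sketch below uses simulated data so that it is self-contained and runnable; with real data you would replace X and y with your own predictor matrix and response.

#Minimal sketch: bootstrap stability of LASSO feature selection
library(glmnet)
set.seed(123)

n <- 300; p <- 20
X <- matrix(rnorm(n * p), ncol = p,
            dimnames = list(NULL, paste0("f", 1:p)))
y <- rbinom(n, 1, plogis(0.8 * X[, 1] - 0.6 * X[, 2]))   #only f1 and f2 matter

B <- 20                                   #number of bootstrap resamples
selected <- matrix(0, nrow = B, ncol = p,
                   dimnames = list(NULL, colnames(X)))

for (b in 1:B) {
  idx <- sample(n, replace = TRUE)
  fit <- cv.glmnet(X[idx, ], y[idx], family = "binomial")
  coefs <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]   #drop the intercept
  selected[b, ] <- as.numeric(coefs != 0)
}

#Selection frequency per feature; keep those chosen in, say, >= 80% of runs
sort(colMeans(selected), decreasing = TRUE)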

5.7 Summary

Feature engineering is an integral part of machine learning model development. The volume of data can be reduced by applying sampling techniques, and feature selection helps reduce the width of the data by selecting the most powerful features. We developed an understanding of three core methods of variable selection: filter, wrapper, and embedded. Toward the end of this chapter, we showed examples of Principal Component Analysis and learned how PCA can reduce dimensionality without losing much of the information in the original features.

The next chapter, Chapter 6, is the core of this book. It will show you how to bring your business problems into your IT system and then solve them using the R tool.

5.8 References

  1. Guyon, I., and Elisseeff, A., “An Introduction to Variable and Feature Selection,” Journal of Machine Learning Research 3, 2003.

  2. Pearson, K., “On Lines and Planes of Closest Fit to Systems of Points in Space,” Philosophical Magazine, 1901.

  3. Jolliffe, I.T., Principal Component Analysis, Springer Series in Statistics.
