Data preprocessing

In the data preprocessing step, we will focus mainly on two things: data type transformations and data normalization. Finally, we will split the data into training and testing datasets for predictive modeling. You can access the code for this section in the data_preparation.R file. We will be using the utility functions shown in the following code snippet; remember to load them into memory by running them in the R console:

## data type transformations - factoring
to.factors <- function(df, variables){
  for (variable in variables){
    df[[variable]] <- as.factor(df[[variable]])
  }
  return(df)
}

## normalizing - scaling
scale.features <- function(df, variables){
  for (variable in variables){
    # z-score normalize; as.numeric() drops the matrix attributes
    # that scale() attaches, keeping the column a plain numeric vector
    df[[variable]] <- as.numeric(scale(df[[variable]], center=T, scale=T))
  }
  return(df)
}

The preceding functions operate on the data frame to transform the data. For data type transformations, we mainly perform factoring of the categorical variables, converting their data type from numeric to factor. There are also several numeric variables, namely credit.amount, age, and credit.duration.months, which take values over widely differing ranges and, as we saw from the distributions in the previous chapter, are all skewed. This can have multiple adverse effects, such as induced collinearity, poorly scaled gradients, and models taking longer to converge. Hence, we will be using z-score normalization, where each value x of a feature E is transformed as z = (x - μ) / σ, where μ represents the mean and σ represents the standard deviation of the feature E. We use the following code snippet to perform these transformations on our data:

> # normalize variables
> numeric.vars <- c("credit.duration.months", "age", 
                    "credit.amount")
> credit.df <- scale.features(credit.df, numeric.vars)
> # factor variables
> categorical.vars <- c('credit.rating', 'account.balance', 
+                       'previous.credit.payment.status',
+                       'credit.purpose', 'savings', 
+                       'employment.duration', 'installment.rate',
+                       'marital.status', 'guarantor', 
+                       'residence.duration', 'current.assets',
+                       'other.credits', 'apartment.type', 
+                       'bank.credits', 'occupation', 
+                       'dependents', 'telephone', 
+                       'foreign.worker')
> credit.df <- to.factors(df=credit.df, 
                          variables=categorical.vars)
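To make both transformations concrete, here is a small standalone illustration using made-up vectors rather than the credit data (the variable names below are purely hypothetical):

```r
# z-score normalization on a toy numeric vector
amount <- c(100, 250, 400, 550, 700)
amount.z <- as.numeric(scale(amount, center=TRUE, scale=TRUE))
mean(amount.z)  # effectively 0 after scaling
sd(amount.z)    # 1

# the same result computed directly from the formula z = (x - mu) / sigma
amount.manual <- (amount - mean(amount)) / sd(amount)

# factoring a numeric categorical code so that models treat its
# values as distinct categories rather than magnitudes
marital.status <- as.factor(c(1, 2, 2, 3, 1))
class(marital.status)   # "factor"
levels(marital.status)  # "1" "2" "3"
```

Note that the scaled vector agrees with the manual formula term by term, which is exactly what scale() computes internally.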

Once the preprocessing is complete, we will split our data into training and test datasets in a 60:40 ratio, so that 600 of the 1,000 tuples end up in the training dataset and 400 in the testing dataset. The rows will be selected at random as follows (note that sample() returns a different split on each run unless you first fix the random seed with set.seed()):

> # split data into training and test datasets in 60:40 ratio
> indexes <- sample(1:nrow(credit.df), size=0.6*nrow(credit.df))
> train.data <- credit.df[indexes,]
> test.data <- credit.df[-indexes,]
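A quick sanity check on the split logic, sketched here with a synthetic stand-in data frame since credit.df is built from the earlier chapter's data:

```r
# synthetic 1,000-row stand-in for credit.df, just to illustrate the split
set.seed(42)  # fix the seed so the random split is reproducible
credit.df <- data.frame(credit.rating=sample(0:1, 1000, replace=TRUE),
                        age=sample(18:75, 1000, replace=TRUE))

# 60:40 split: sample() draws 600 distinct row indexes without replacement
indexes <- sample(1:nrow(credit.df), size=0.6*nrow(credit.df))
train.data <- credit.df[indexes,]
test.data <- credit.df[-indexes,]

nrow(train.data)  # 600
nrow(test.data)   # 400
```

Because sample() draws indexes without replacement, every row lands in exactly one of the two datasets.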

Now that we have our datasets ready, we will explore feature importance and selection in the following section.
