Data analysis and transformation

Now that we have processed our data, it is ready for analysis. We will be carrying out descriptive and exploratory analysis in this section, as mentioned earlier. We will analyze the different dataset attributes and talk about their significance, semantics, and relationship with the dependent credit rating attribute. We will be using statistical functions, contingency tables, and visualizations to depict all of this.

Besides this, we will also be doing data transformation for some of the features in our dataset, namely the categorical variables. We will be doing this to combine category classes which have similar semantics and to remove classes having a very small proportion of observations by merging them with a semantically similar class. Some reasons for doing this include preventing the overfitting of our predictive models, which we will be building in Chapter 6, Credit Risk Detection and Prediction – Predictive Analytics, linking semantically similar classes together, and the fact that modeling techniques like logistic regression do not handle categorical variables with a large number of classes very well. We will analyze each feature/variable in the dataset first and then perform any transformations if necessary.
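To make this idea concrete, the following is a minimal, hypothetical sketch (using a toy factor, not our actual credit data) of how a sparse class can be merged into a semantically similar one with the recode() function from the car package, which we will formally introduce later in this section:

> # toy example: class 4 is very rare, so we merge it into the similar class 3
> library(car)
> toy.feature <- factor(c(1, 1, 2, 2, 2, 3, 3, 4))
> table(toy.feature)    # frequencies before merging
> toy.merged <- recode(toy.feature, "1=1;2=2;3=3;4=3")
> table(toy.merged)     # frequencies after merging: classes 3 and 4 combined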

Building analysis utilities

Before we begin our analysis, we will be developing some utility functions which we will be using to analyze the dataset features. Do note that all the utility functions are defined in a separate .R file called descriptive_analytics_utils.R. You can load all the functions in memory or in any other R script file by using the command source('descriptive_analytics_utils.R') and then start using them. We will be talking about these utility functions now.
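For example, assuming the file sits in your current working directory, the following command loads all the utilities into your R session:

> # load the descriptive analytics utility functions
> source('descriptive_analytics_utils.R')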

We will now talk about the various packages we have used. We have used some packages such as pastecs and gmodels for getting summary statistics of features and for building contingency tables. The packages gridExtra and ggplot2 have been used for grid layouts and building visualizations respectively. If you do not have them installed, you can use the install.packages command to install them. Next, load the packages as shown in the following code snippet:

# load dependencies
library(gridExtra) # grid layouts
library(pastecs) # detailed summary stats
library(ggplot2) # visualizations
library(gmodels) # build contingency tables

Now that we have all the required dependencies, we will first implement a function to get summary statistics about the numerical variables. The following code snippet achieves this. As you can see, we make use of the stat.desc and summary functions to get detailed and condensed summary statistics about the variable, respectively. By convention, independent variables and dependent variables are denoted by indep.var and dep.var in the code segments that follow and in other functions later on.

# summary statistics
get.numeric.variable.stats <- function(indep.var, detailed=FALSE){
  options(scipen=100)
  options(digits=2)
  if (detailed){
    var.stats <- stat.desc(indep.var)
  }else{
    var.stats <- summary(indep.var)
  }
  
  df <- data.frame(round(as.numeric(var.stats),2))
  colnames(df) <- deparse(substitute(indep.var))
  rownames(df) <- names(var.stats)
  
  if (names(dev.cur()) != "null device"){
    dev.off()
  }
  grid.table(t(df))
}

Next, we will build some functions for visualizing the numeric variables. We will be doing that by using histograms/density plots and box plots to depict the attribute distributions.

# visualizations
# histograms/density plots
visualize.distribution <- function(indep.var){
  pl1 <- qplot(indep.var, geom="histogram", 
               fill=I('gray'), binwidth=5,
               col=I('black')) + theme_bw()
  pl2 <- qplot(indep.var, geom="density",
               fill=I('gray'), 
               col=I('black')) + theme_bw()
  
  grid.arrange(pl1, pl2, ncol=2)
}

# box plots
visualize.boxplot <- function(indep.var, dep.var){
  pl1 <- qplot(factor(0),indep.var, geom="boxplot", 
               xlab = deparse(substitute(indep.var)), 
               ylab="values") + theme_bw()
  pl2 <- qplot(dep.var,indep.var,geom="boxplot",
               xlab = deparse(substitute(dep.var)),
               ylab = deparse(substitute(indep.var))) + theme_bw()
  
  grid.arrange(pl1,pl2, ncol=2)
}

We have used the qplot function from the ggplot2 package for building the visualizations which we will be seeing in action soon. Now we will be shifting our focus to categorical variables. We will start with building a function to get summary statistics of any categorical variable.

# summary statistics
get.categorical.variable.stats <- function(indep.var){
  
  feature.name = deparse(substitute(indep.var))
  df1 <- data.frame(table(indep.var))
  colnames(df1) <- c(feature.name, "Frequency")
  df2 <- data.frame(prop.table(table(indep.var)))
  colnames(df2) <- c(feature.name, "Proportion")
  
  df <- merge(
    df1, df2, by = feature.name
  )
  ndf <- df[order(-df$Frequency),]
  if (names(dev.cur()) != "null device"){
    dev.off()
  }
  grid.table(ndf)
}

The preceding function will summarize the categorical variable, showing how many classes or categories are present in it along with other details such as frequency and proportion. If you remember, we mentioned earlier that we will also be depicting the relationship of categorical variables with the class/dependent variable credit.rating. The following function will help us achieve this in the form of contingency tables:

# generate contingency table
get.contingency.table <- function(dep.var, indep.var, 
                                          stat.tests=F){
  if(stat.tests == F){
    CrossTable(dep.var, indep.var, digits=1, 
               prop.r=F, prop.t=F, prop.chisq=F)
  }else{
    CrossTable(dep.var, indep.var, digits=1, 
               prop.r=F, prop.t=F, prop.chisq=F,
               chisq=T, fisher=T)
  }
}

We will also build some functions for depicting visualizations. We will be visualizing categorical variable distribution using bar charts by using the following function:

# visualizations
# barcharts
visualize.barchart <- function(indep.var){
  qplot(indep.var, geom="bar", 
        fill=I('gray'), col=I('black'),
        xlab = deparse(substitute(indep.var))) + theme_bw()
}

We will use mosaic plots to depict visualizations of the previously mentioned contingency tables using the following function:

# mosaic plots
visualize.contingency.table <- function(dep.var, indep.var){
  if (names(dev.cur()) != "null device"){
    dev.off()
  }
  mosaicplot(dep.var ~ indep.var, color=T,  
             main = "Contingency table plot")
}

Now that we have built all the necessary utilities, we will begin analyzing our data in the following section.

Analyzing the dataset

We will be analyzing each feature of the dataset in this section and depicting our analysis in the form of summary statistics, relationships, statistical tests, and visualizations wherever necessary. We will summarize the necessary analysis carried out for each variable in a table where appropriate. An important point to remember is that the dependent feature, denoted in code by dep.var, will always be credit.rating, since this is the variable which depends on the other features; the other features are independent variables and will often be denoted as indep.var in the tables and plots.

We will carry out detailed analysis and transformations for some of the important features which have a lot of significance, especially data features having a large number of classes, so that we can clearly understand data distributions and how they change on transformation of the data. For the remaining features, we will not focus too much on the summary statistics but emphasize more on feature engineering through transformations and their relationships with the dependent credit.rating variable.

Now we will attach the data frame so that we can access the individual features easily. You can do that using the following code snippet:

> # access dataset features directly
> attach(credit.df)

Now we will be starting our analysis with the dependent variable credit.rating, also known as the class variable in our dataset, which we will be trying to predict in the next chapter.

The following code snippet helps us in getting the required summary statistics for this feature:

> # credit.rating stats
> get.categorical.variable.stats(credit.rating)
> # credit.rating visualizations
> visualize.barchart(credit.rating)

The following visualizations tell us that credit.rating has two classes, 1 and 0, and give the necessary statistics. Basically, customers with a credit rating of 1 are creditworthy and those with a rating of 0 are not creditworthy. We also observe from the bar chart that the proportion of creditworthy customers in the bank is significantly higher than that of the rest.

[Figure: summary statistics and bar chart for credit.rating]

Next, we will analyze the account.balance feature. Basically, this attribute indicates the balance of the customer's current account.

We will start with getting the summary statistics and plotting a bar-chart using the following code snippet. We will include both the outputs together for better understanding.

> # account.balance stats and bar chart
> get.categorical.variable.stats(account.balance)
> visualize.barchart(account.balance)

From the following visualizations, you can see that there are four distinct classes for account.balance and they each have some specific semantics which we will be talking about soon.

[Figure: summary statistics and bar chart for account.balance]

These classes have specific semantics, as defined next. The currency DM indicates Deutsche Mark, the former official currency of Germany.

The four classes indicate the following main semantics for the checking account held for at least a year:

  • 1: No running bank account
  • 2: No balance or debit
  • 3: Balance of < 200 DM
  • 4: Balance of >=200 DM

We will be doing some feature engineering here and will combine classes 3 and 4 together to indicate customers who have a positive balance in their account. We will do this because the proportion of class 3 is quite small compared to the rest and we don't want to unnecessarily keep too many classes per feature unless they are critical. We will achieve this by using the following code snippets.

First, we will load the necessary package for doing this. Install it using the command install.packages("car") in case you do not have the package installed.

> #load dependencies
> library(car)

Now we will recode the necessary classes, as shown next:

> # recode classes and update data frame
> new.account.balance <- recode(account.balance,
+                           "1=1;2=2;3=3;4=3")
> credit.df$account.balance <- new.account.balance

We will now see the relationship between new.account.balance and credit.rating using a contingency table, as discussed earlier, and visualize it using a mosaic plot by using the following code snippet. We will also perform some statistical tests which I will explain in brief later.

> # contingency table and mosaic plot 
> get.contingency.table(credit.rating, new.account.balance, 
                                                  stat.tests=T)
> visualize.contingency.table(credit.rating, new.account.balance)

In the following figure, you can now see how the various classes of account.balance are distributed with regard to credit.rating in both the table and the plot. An interesting observation is that 90% of people with funds in their account are not potential credit risks, which sounds reasonable.

[Figure: contingency table and mosaic plot for credit.rating vs. account.balance]

We also perform two statistical tests here: the Chi-squared test and Fisher's exact test, both of which are used extensively for hypothesis testing on contingency tables. Going into the details of the statistical calculations involved in these tests is beyond the scope of this chapter, so I will put it in a way that is easy to understand. We usually start with a null hypothesis that there exists no association or relationship between the two variables depicted previously, and an alternative hypothesis that there is a possible relationship or association between the two variables. Only if the p-value obtained from the test is less than or equal to 0.05 can we reject the null hypothesis in favor of the alternative hypothesis. In this case, you can clearly see that both tests give p-values < 0.05, which definitely favors the alternative hypothesis that there is some association between credit.rating and account.balance. These types of tests are extremely useful when we build statistical models. You can look up the preceding tests on the Internet or in any statistics book to get a deeper insight into what p-values signify and how they work.
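To make this decision rule concrete, here is a small, self-contained sketch on toy data (not our credit dataset) showing how both tests are called and where the p-value comes from:

> # toy data with an obvious association between the two variables
> x <- factor(c(rep("good", 60), rep("bad", 40)))
> y <- factor(c(rep("yes", 50), rep("no", 10), rep("yes", 5), rep("no", 35)))
> chisq.test(x, y)$p.value    # Pearson's Chi-squared test
> fisher.test(x, y)$p.value   # Fisher's exact test
> # p-values <= 0.05 would lead us to reject the null hypothesis of no association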

Note

Do note that going forward we will show only the most important analysis results for each feature. However, you can always try getting relevant information for the various analysis techniques using the functions we explained earlier. For contingency tables, use the get.contingency.table() function. Statistical tests can be performed by setting the stat.tests parameter as TRUE in the get.contingency.table() function. You can also use the visualize.contingency.table() function to view mosaic plots.

Now we will look at credit.duration.months, which signifies the duration of the credit in months. This is a numerical variable and the analysis will be a bit different from the other categorical variables.

> # credit.duration.months analysis
> get.numeric.variable.stats(credit.duration.months)

We can visualize the same from the following figure:

[Figure: summary statistics for credit.duration.months]

The values we see are in months and we get the typical summary statistics for this feature, including the mean, median, and quartiles. We will now visualize the overall distribution of the values for this feature using both histograms/density plots and boxplots.

> # histogram/density plot
> visualize.distribution(credit.duration.months)

The preceding snippet produces the following plots. We can clearly observe that this is a multimodal distribution with several peaks.

[Figure: histogram and density plot for credit.duration.months]

We now visualize the same in the form of box plots, including the one showing associations with credit.rating next.

> # box plot
> visualize.boxplot(credit.duration.months, credit.rating)

Interestingly, from the following plots we see that the median credit duration for people who have a bad credit rating is higher than those who have a good credit rating. This seems to be plausible if we assume that many customers with long credit durations defaulted on their payments.

[Figure: box plots for credit.duration.months and its association with credit.rating]

Moving on to the next variable, previous.credit.payment.status indicates the status of the customer with regard to paying off previous credits. This is a categorical variable and we get its statistics as shown next:

> # previous.credit.payment.status stats and bar chart
> get.categorical.variable.stats(previous.credit.payment.status)
> visualize.barchart(previous.credit.payment.status)

This gives us the following table and bar chart depicting the data distribution:

[Figure: summary statistics and bar chart for previous.credit.payment.status]

The classes indicate the following as the main semantics:

  • 0: Hesitant payment
  • 1: Problematic running account
  • 2: No previous credits left
  • 3: No problem with the current credits at this bank
  • 4: Paid back the previous credits at this bank

We will be applying the following transformations to this feature, so the new semantics will be:

  • 1: Some problems with payment
  • 2: All credits paid
  • 3: No problems and credits paid in this bank only

We will perform the transformations in the following code snippet:

> # recode classes and update data frame
> new.previous.credit.payment.status <- 
                           recode(previous.credit.payment.status,
+                                           "0=1;1=1;2=2;3=3;4=3")
> credit.df$previous.credit.payment.status <-      
                                new.previous.credit.payment.status

The contingency table for the transformed feature is obtained as follows:

> # contingency table
> get.contingency.table(credit.rating,
                             new.previous.credit.payment.status)

We observe from the following table that most people who have a good credit rating have paid off their previous credits without any problems, while those who do not have a good credit rating had some problems with their payments, which makes sense!

[Figure: contingency table for credit.rating vs. previous.credit.payment.status]

The next feature we will look at is credit.purpose, which signifies the purpose of the credit amount. This is also a categorical variable and we get its summary statistics and plot the bar chart showing the frequency of its various classes as follows:

> # credit.purpose stats and bar chart
> get.categorical.variable.stats(credit.purpose)
> visualize.barchart(credit.purpose)

This gives us the following table and bar chart depicting the data distribution:

[Figure: summary statistics and bar chart for credit.purpose]

We observe that there are a staggering 11 classes just for this feature. Besides this, we also observe that several classes have extremely low proportions compared to the top 5 classes and class label 7 doesn't even appear in the dataset! This is exactly why we need to do feature engineering by grouping some of these classes together, as we did previously.

The classes indicate the following as the main semantics:

  • 0: Others
  • 1: New car
  • 2: Used car
  • 3: Furniture items
  • 4: Radio or television
  • 5: Household appliances
  • 6: Repair
  • 7: Education
  • 8: Vacation
  • 9: Retraining
  • 10: Business

We will be transforming this feature by combining some of the existing classes and the new semantics after transformation will be the following:

  • 1: New car
  • 2: Used car
  • 3: Home related items
  • 4: Others

We will do this by using the following code snippet:

> # recode classes and update data frame
> new.credit.purpose <- recode(credit.purpose,"0=4;1=1;2=2;3=3;
+                                              4=3;5=3;6=3;7=4;
+                                              8=4;9=4;10=4")
> credit.df$credit.purpose <- new.credit.purpose

The contingency table for the transformed feature is then obtained by the following code snippet:

> # contingency table
> get.contingency.table(credit.rating, new.credit.purpose)

Based on the following table, we see that customers whose credit purpose is home related items or other items have the highest proportions in the bad credit rating category:

[Figure: contingency table for credit.rating vs. credit.purpose]

The next feature we will analyze is credit.amount, which basically signifies the amount of credit in DM being asked from the bank by the customer. This is a numerical variable and we use the following code for getting the summary statistics:

> # credit.amount analysis
> get.numeric.variable.stats(credit.amount)
[Figure: summary statistics for credit.amount]

We see the usual statistics, such as the average credit amount of around 3270 DM and the median of around 2320 DM. We will now visualize the distribution of the preceding data using a histogram and density plot as follows:

> # histogram/density plot
> visualize.distribution(credit.amount)

This will give us the histogram and density plot for credit.amount, and you can see that it is a right-skewed distribution in the following figure:

[Figure: histogram and density plot for credit.amount]

Next, we will visualize the data using boxplots to see the data distribution and its relationship with credit.rating using the following code snippet:

> # box plot
> visualize.boxplot(credit.amount, credit.rating)

This generates the following boxplots, where you can clearly see the right skew in the distribution shown by the numerous dots in the boxplots. We also see an interesting insight: the median credit amount is higher for customers with a bad credit rating, which seems likely assuming many of them failed to make all the payments required to pay off the credit amount.

[Figure: box plots for credit.amount and its association with credit.rating]

Now that you have a good idea about how to perform descriptive analysis for categorical and numerical variables, going forward we will not be showing outputs of all the different analysis techniques for each feature. Feel free to experiment with the functions we used earlier on the remaining variables to obtain the summary statistics and visualizations if you are interested in digging deeper into the data!
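For example, the same pattern we have used so far can be applied to any of the remaining features; a quick sketch for the telephone feature (analyzed only briefly later in this section) would look like this:

> # telephone stats, bar chart, and relationship with credit.rating
> get.categorical.variable.stats(telephone)
> visualize.barchart(telephone)
> get.contingency.table(credit.rating, telephone, stat.tests=TRUE)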

The next feature is savings, which is a categorical variable having the following semantics for the 5 class labels:

  • 1: No savings
  • 2: < 100 DM
  • 3: Between [100, 499] DM
  • 4: Between [500, 999] DM
  • 5: >= 1000 DM

The feature signifies the average amount of savings/stocks belonging to the customer. We will be transforming it to the following four class labels:

  • 1: No savings
  • 2: < 100 DM
  • 3: Between [100, 999] DM
  • 4: >= 1000 DM

We will be using the following code snippet:

> # feature: savings - recode classes and update data frame
> new.savings <- recode(savings,"1=1;2=2;3=3;
+                                4=3;5=4")
> credit.df$savings <- new.savings

Now we analyze the relationship between savings and credit.rating using the following code for the contingency table:

> # contingency table
> get.contingency.table(credit.rating, new.savings)

This generates the following contingency table. On observing the table values, it is clear that people with no savings form the largest proportion among customers who have a bad credit rating, which is not surprising! This count is also high for customers with a good credit rating, simply because there are far more good credit rating records than bad credit rating records in the dataset. However, we also see that the proportion of people having >= 1000 DM in savings and a good credit rating is quite high compared to the proportion of people having >= 1000 DM in savings and a bad credit rating.

[Figure: contingency table for credit.rating vs. savings]

We will now look at the feature named employment.duration, which is a categorical variable signifying the duration for which the customer has been employed until present. The semantics for the five classes of the feature are:

  • 1: Unemployed
  • 2: < 1 year
  • 3: Between [1, 4] years
  • 4: Between [4, 7] years
  • 5: >= 7 years

We will be transforming it to the following four classes:

  • 1: Unemployed or < 1 year
  • 2: Between [1,4] years
  • 3: Between [4,7] years
  • 4: >= 7 years

We will be using the following code:

> # feature: employment.duration - recode classes and update data frame
> new.employment.duration <- recode(employment.duration,
+                                   "1=1;2=1;3=2;4=3;5=4")
> credit.df$employment.duration <- new.employment.duration

Now we analyze its relationship using the contingency table, as follows:

> # contingency table
> get.contingency.table(credit.rating, new.employment.duration)

What we observe from the following table is that the proportion of customers who have no or very few years in employment and a bad credit rating is much higher than that of similar customers with a good credit rating. For the employment.duration feature, the value 1 indicates people who are unemployed or have < 1 year of employment. The proportion of these people having a bad credit rating is 93 out of 300 people, which gives 31%; this is a lot higher than the same metric for customers having a good credit rating, which is 141 out of 700 customers, or 20%.

[Figure: contingency table for credit.rating vs. employment.duration]

We now move on to the next feature named installment.rate, which is a categorical variable with the following semantics for its four classes:

  • 1: >=35%
  • 2: Between [25, 35]%
  • 3: Between [20, 25]%
  • 4: < 20%

There wasn't too much information in the original metadata for this attribute, so there is some ambiguity, but what we assumed is that it indicates the percentage of the customer's salary used to pay the monthly installments of the credit loan. We won't be doing any transformations here, so we will go directly to the relationships.

> # feature: installment.rate - contingency table and statistical tests
> get.contingency.table(credit.rating, installment.rate, 
+                      stat.tests=TRUE)

We performed the statistical tests for this variable in the code snippet because we weren't really sure whether our assumption about its semantics was correct or whether it could be a significant variable. From the following results, we see that both statistical tests yield p-values > 0.05, so we fail to reject the null hypothesis. This tells us that these two variables do not have a significant association between them and this feature might not be one to consider when we make feature sets for our predictive models. We will look at feature selection in more detail in the next chapter.

[Figure: contingency table and statistical test results for credit.rating vs. installment.rate]

The next variable we will analyze is marital.status, which indicates the marital status of the customer and is a categorical variable. It has four classes with the following semantics:

  • 1: Male divorced
  • 2: Male single
  • 3: Male married/widowed
  • 4: Female

We will be transforming them into three classes with the following semantics:

  • 1: Male divorced/single
  • 2: Male married/widowed
  • 3: Female

We will be using the following code:

> # feature: marital.status - recode classes and update data frame
> new.marital.status <- recode(marital.status, "1=1;2=1;3=2;4=3")
> credit.df$marital.status <- new.marital.status

We now observe the relationship between marital.status and credit.rating by building a contingency table using the following code snippet:

> # contingency table
> get.contingency.table(credit.rating, new.marital.status)

From the following table, we notice that the ratio of single men to married men among customers with a good credit rating is 1:2, compared to nearly 1:1 for customers with a bad credit rating. Does this mean that more married men tend to pay their credit debts on time? That could be a possibility for this dataset, but do remember that, in general, correlation does not imply causation.

[Figure: contingency table for credit.rating vs. marital.status]

The statistical tests give us a p-value of 0.01, indicating that there might be some association between the two features.

The next feature is guarantor, which signifies if the customer has any further debtors or guarantors. This is a categorical variable with three classes having the following semantics:

  • 1: None
  • 2: Co-applicant
  • 3: Guarantor

We transform them into two classes with the following semantics:

  • 1: No
  • 2: Yes

For the transformation, we use the following code snippet:

> # feature: guarantor - recode classes and update data frame
> new.guarantor <- recode(guarantor, "1=1;2=2;3=2")
> credit.df$guarantor <- new.guarantor

Performing statistical tests on this yields a p-value of 1, which is much greater than 0.05; hence, we fail to reject the null hypothesis, implying that there is probably no association between guarantor and credit.rating.

Tip

You can also run the statistical tests using direct functions instead of calling the get.contingency.table(…) function each time. For Fisher's exact test, call fisher.test(credit.rating, guarantor), and for Pearson's Chi-squared test, call chisq.test(credit.rating, guarantor). Feel free to substitute guarantor with any of the other independent variables to carry out these tests.
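For reference, the two calls mentioned in the preceding tip look like this when run on the console:

> # direct statistical tests for the guarantor feature
> fisher.test(credit.rating, guarantor)
> chisq.test(credit.rating, guarantor)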

The next feature is residence.duration, which signifies how long the customer has been residing at his current address.

This is a categorical variable with the following semantics for the four classes:

  • 1: < 1 year
  • 2: Between [1,4] years
  • 3: Between [4,7] years
  • 4: >= 7 years

We will not be doing any transformations and will directly carry out statistical tests to see whether this feature has any association with credit.rating. As mentioned in the previous tip, using the fisher.test and chisq.test functions both give us a p-value of 0.9, which is significantly > 0.05, so there is no significant relationship between them. We will show the outputs of both statistical tests here, just so you can get an idea of what they depict.

> # perform statistical tests for residence.duration
> fisher.test(credit.rating, residence.duration)
> chisq.test(credit.rating, residence.duration)

You can see from the following outputs that we get the same p-value from both the tests we talked about earlier:

[Figure: Fisher's exact test and Chi-squared test outputs for residence.duration]

We now shift our focus to current.assets, which is a categorical variable having the following semantics for the four classes:

  • 1: No assets
  • 2: Car/other
  • 3: Life insurance/savings contract
  • 4: House/land ownership

We will not be doing any transformations on this data and will directly run the same statistical tests to check whether it has any association with credit.rating. We get a p-value of 3 x 10^-5, which is definitely < 0.05, and thus we can conclude that the alternative hypothesis holds good, that is, there is some association between the variables.

The next variable we will analyze is age. This is a numeric variable and we will get its summary statistics as follows:

> # age analysis
> get.numeric.variable.stats(age)

Output:

[Figure: summary statistics for age]

We can observe that the average age of customers is 35.5 years and the median age is 33 years. To view the feature distributions, we will visualize it using a histogram and density plot using the following code snippet:

> # histogram/density plot
> visualize.distribution(age)

We can observe from the following plots that the distribution is right-skewed, with the majority of customer ages ranging from 25 to 45 years:

[Figure: histogram and density plot for age]

We will now observe the relationship between age and credit.rating by visualizing it through boxplots, as follows:

> # box plot
> visualize.boxplot(age, credit.rating)

The right-skew from the following plots is clearly distinguishable in the boxplots by the cluster of dots we see at the extreme end. The interesting observation we can make from the right plot is that people who have a bad credit rating have a lower median age than people who have a good credit rating.

[Figure: box plots for age and its association with credit.rating]

One reason for this association could be that younger people who are not yet well settled and employed have failed to repay the credit loans which they took from the bank. But, once again, this is just an assumption which we cannot verify unless we look into the full background of each customer.

Next, we will look at the feature other.credits, which has the following semantics for the three classes:

  • 1: At other banks
  • 2: At stores
  • 3: No further credits

This feature indicates if the customer has any other pending credits elsewhere. We will transform this to two classes with the following semantics:

  • 1: Yes
  • 2: No

We will be using the following code snippet:

> # feature: other.credits - recode classes and update data frame
> new.other.credits <- recode(other.credits, "1=1;2=1;3=2")
> credit.df$other.credits <- new.other.credits

On performing statistical tests on the newly transformed feature, we get a p-value of 0.0005, which is < 0.05, and thus favors the alternative hypothesis over the null, indicating that there is some association between this feature and credit.rating, assuming there is no influence from anything else.

The next feature apartment.type is a categorical variable having the following semantics for the three classes:

  • 1: Free apartment
  • 2: Rents flat
  • 3: Owns occupied flat

This feature basically signifies the type of apartment in which the customer resides. We will not be doing any transformation to this variable and will be directly moving on to the statistical tests. Both the tests give us a p-value of < 0.05, which signifies that some association is present between apartment.type and credit.rating, assuming no other factors affect it.

Now we will look at the feature bank.credits, which is a categorical variable having the following semantics for the four classes:

  • 1: One
  • 2: Two/three
  • 3: Four/five
  • 4: Six or more

This feature signifies the total number of credit loans taken by the customer from this bank including the current one. We will transform this into a binary feature with the following semantics for the two classes:

  • 1: One
  • 2: More than one

We will be using the following code:

> # feature: bank.credits - recode classes and update data frame
> new.bank.credits <- recode(bank.credits, "1=1;2=2;3=2;4=2")
> credit.df$bank.credits <- new.bank.credits

Carrying out statistical tests on this transformed feature gives us a p-value of 0.2, which is much > 0.05, and hence we know that the null hypothesis still holds good that there is no significant association between bank.credits and credit.rating. Interestingly, if you perform statistical tests with the untransformed version of bank.credits, you will get an even higher p-value of 0.4, which indicates no significant association.
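If you want to verify this comparison yourself, a quick sketch (assuming, as is the case here, that the attached bank.credits still holds the original, untransformed values) is the following:

> # compare the association tests for the transformed and original feature
> chisq.test(credit.rating, new.bank.credits)  # p-value around 0.2
> chisq.test(credit.rating, bank.credits)      # p-value around 0.4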

The next feature is occupation, which obviously signifies the present job of the customer. This is a categorical variable with the following semantics for its four classes:

  • 1: Unemployed with no permanent residence
  • 2: Unskilled with permanent residence
  • 3: Skilled worker/minor civil servant
  • 4: Executive/self-employed/higher civil servant

We won't be applying any transformations on this feature since each class is quite distinct in its characteristics. Hence, we will be moving on directly to analyzing the relationships with statistical tests. Both the tests yield a p-value of 0.6, which is definitely > 0.05, and the null hypothesis holds good that there is no significant relationship between the two features.

We will now look at the next feature dependents, which is a categorical variable having the following semantics for its two class labels:

  • 1: Zero to two
  • 2: Three or more

This feature signifies the total number of people who are dependents for the customer. We will not be applying any transformations since it is already a binary variable. Carrying out statistical tests on this feature yields a p-value of 1, which tells us that this feature does not have a significant relationship with credit.rating.

Next up is the feature telephone, which is a binary categorical variable which has two classes with the following semantics indicating whether the customer has a telephone:

  • 1: No
  • 2: Yes

We do not need any further transformations here since it is a binary variable, so we move on to the statistical tests, which give us a p-value of 0.3. Since this is > 0.05, we fail to reject the null hypothesis, indicating that no significant association exists between telephone and credit.rating.

The final feature in the dataset is foreign.worker, which is a binary categorical variable having two classes with the following semantics indicating if the customer is a foreign worker:

  • 1: Yes
  • 2: No

We do not perform any transformations since it is already a binary variable with two distinct classes and move on to the statistical tests. Both the tests give us a p-value of < 0.05, which might indicate that this variable has a significant relationship with credit.rating.

With this, we come to an end of our data analysis phase for the dataset.

Saving the transformed dataset

We have performed a lot of feature engineering using data transformations for several categorical variables and since we will be building predictive models on the transformed feature sets, we need to store this dataset separately to disk. We use the following code snippet for the same:

> ## Save the transformed dataset
> write.csv(file='credit_dataset_final.csv', x = credit.df, 
+           row.names = F)

We can load the above file into R directly the next time we start building predictive models, which we will be covering in the next chapter.
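For instance, a minimal way to load it back (assuming the CSV file sits in your working directory) would be:

> # reload the transformed dataset when needed
> credit.df <- read.csv("credit_dataset_final.csv", header = TRUE)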
