Getting ready

In this example, we use a dataset from the UCI ML repository on credit card defaults. This dataset contains the following information:

Default payments
Demographic factors
Credit data
History of payments
Bill statements of credit card clients

The data and the data descriptions are provided in the book's GitHub folder.

We will start by loading the required libraries and reading our dataset:

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

We set our working folder as follows:

# Set your working directory according to your requirement
os.chdir(".../Chapter 6/Random Forest")
os.getcwd()

Let's now read our data. We prefix the DataFrame name with df_ so that it is easy to recognize as a DataFrame later in the code:

df_creditcarddata = pd.read_csv("UCI_Credit_Card.csv")

We check the shape of the dataset:

df_creditcarddata.shape

We check the datatypes:

df_creditcarddata.dtypes
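
As a minimal sketch of what these two checks return, the following snippet builds a tiny stand-in DataFrame and inspects it. The values are invented and df_example is a hypothetical name that is not part of the recipe:

```python
import pandas as pd

# Hypothetical three-row stand-in for the credit card data (values are invented)
df_example = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0],
    "AGE": [24, 26, 34],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df_example.shape
print(n_rows, n_cols)

# dtypes is a Series that maps each column name to its datatype
print(df_example.dtypes)
```

On the real dataset, the same two attributes tell us how many records and columns were read and whether any column was parsed with an unexpected type.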

We drop the ID column, as this is not required:

df_creditcarddata = df_creditcarddata.drop("ID", axis=1)
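
Note that drop returns a new DataFrame rather than modifying the original, which is why the result is assigned back. The sketch below illustrates this on a hypothetical two-row frame (df_example and the other names are ours). It also shows the optional errors="ignore" argument, which may be convenient if the cell is likely to be run more than once:

```python
import pandas as pd

# Hypothetical stand-in data (values are invented)
df_example = pd.DataFrame({"ID": [1, 2], "AGE": [24, 26]})

# drop returns a new DataFrame; df_example itself still has the ID column
df_no_id = df_example.drop(columns=["ID"])

# errors="ignore" skips labels that are already gone,
# so dropping ID a second time does not raise a KeyError
df_no_id_again = df_no_id.drop(columns=["ID"], errors="ignore")

print(list(df_example.columns))
print(list(df_no_id_again.columns))
```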

We can explore our data in various ways. Let's take a look at a couple of different methods:

selected_columns = df_creditcarddata[['AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6', 'LIMIT_BAL']]

selected_columns.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);

Note that we have used a semicolon at the end of the last line in the preceding code block. The semicolon suppresses the text output that the notebook would otherwise print above the plots. xlabelsize and ylabelsize adjust the font size of the labels on the x-axis and the y-axis.

The following plot shows the distribution of the numeric variables:

We will now explore the payment defaults by age group. We bucket the age variable and store the binned values in a new variable, age_group, in df_creditcarddata:

df_creditcarddata['age_group'] = pd.cut(df_creditcarddata['AGE'], range(0, 100, 10), right=False)
df_creditcarddata.head()
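
To make the binning behavior concrete, here is a small sketch on a handful of invented ages (ages, age_bins, and too_old are hypothetical names). With right=False, each bin includes its left edge and excludes its right edge, so an age of 30 falls into the bin that starts at 30. Because range(0, 100, 10) ends at 90, any age of 90 or more would fall outside all bins and become NaN:

```python
import pandas as pd

# Invented ages chosen to sit on and around the bin edges
ages = pd.Series([21, 29, 30, 39, 40, 75])

# Same bin edges as in the recipe: 0, 10, ..., 90; bins are closed on the left
age_bins = pd.cut(ages, list(range(0, 100, 10)), right=False)
print(age_bins)

# A value beyond the last edge (90) is not assigned to any bin
too_old = pd.cut(pd.Series([95]), list(range(0, 100, 10)), right=False)
print(too_old.isna().tolist())
```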

We then use our new age_group variable to plot the number of defaults per age group:

# Default vs Age
pd.crosstab(df_creditcarddata.age_group, 
           df_creditcarddata["default.payment.next.month"]).plot(kind='bar',stacked=False, grid=True) 

plt.title('Count of Defaults by AGE')
plt.xlabel('AGE')
plt.ylabel('# of Default')
plt.legend(loc='upper left')
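
The table behind this bar chart can also be inspected directly. The following sketch uses a hypothetical five-row frame with invented values. It computes the same kind of count table and, as an optional extra that is not part of the recipe, passes normalize="index" to get the share of defaults within each age group, which can be easier to compare when the groups differ in size:

```python
import pandas as pd

# Hypothetical stand-in data (values are invented)
df_example = pd.DataFrame({
    "age_group": ["[20, 30)", "[20, 30)", "[30, 40)", "[30, 40)", "[30, 40)"],
    "default.payment.next.month": [1, 0, 0, 0, 1],
})

# Raw counts: one row per age group, one column per default flag
counts = pd.crosstab(df_example["age_group"],
                     df_example["default.payment.next.month"])
print(counts)

# Row-wise proportions: each row sums to 1
rates = pd.crosstab(df_example["age_group"],
                    df_example["default.payment.next.month"],
                    normalize="index")
print(rates)
```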

The following plot shows the number of defaults per age group:

We can drop the age_group variable from df_creditcarddata since we do not need it anymore:

df_creditcarddata = df_creditcarddata.drop(columns = ['age_group'])
df_creditcarddata.head()

We will now look at the payment defaults according to the credit limits of the account holders:

fig_facetgrid = sns.FacetGrid(df_creditcarddata, hue='default.payment.next.month', aspect=4)
# fill=True replaces shade=True, which is deprecated in seaborn 0.11 and later
fig_facetgrid.map(sns.kdeplot, 'LIMIT_BAL', fill=True)
max_limit_bal = df_creditcarddata['LIMIT_BAL'].max()
fig_facetgrid.set(xlim=(0,max_limit_bal));
fig_facetgrid.set(ylim=(0.0,0.000007));
fig_facetgrid.set(title='Distribution of limit balance by default.payment')
fig_facetgrid.add_legend()

The preceding code gives us the following plot:

We can also assign labels to some of our variables to make them easier to interpret. We assign labels to the gender (SEX), marriage (MARRIAGE), and education (EDUCATION) variables.

We also change the datatype of the pay variables to string:

GenderMap = {2:'female', 1:'male'}
MarriageMap = {1:'married', 2:'single', 3:'other', 0: 'other'}
EducationMap = {1:'graduate school', 2:'university', 3:'high school', 4:'others', 5:'unknown', 6:'unknown', 0:'unknown'}

df_creditcarddata['SEX'] = df_creditcarddata.SEX.map(GenderMap)
df_creditcarddata['MARRIAGE'] = df_creditcarddata.MARRIAGE.map(MarriageMap) 
df_creditcarddata['EDUCATION'] = df_creditcarddata.EDUCATION.map(EducationMap)
# The pay columns are PAY_0 and PAY_2 to PAY_6; there is no PAY_1
for pay_col in ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']:
    df_creditcarddata[pay_col] = df_creditcarddata[pay_col].astype(str)
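
One thing to watch for when mapping codes to labels: map returns NaN for any value that is missing from the dictionary. The sketch below shows one way to check for this on a few invented codes (codes, labels, unmapped, and pay_example are hypothetical names, and 9 is a made-up code used only to trigger the case):

```python
import pandas as pd

MarriageMap = {1: 'married', 2: 'single', 3: 'other', 0: 'other'}

# Invented codes; 9 is deliberately absent from the dictionary
codes = pd.Series([1, 2, 3, 0, 9])
labels = codes.map(MarriageMap)

# Codes that had no entry in the dictionary end up as NaN
unmapped = codes[labels.isna()].unique()
print(list(unmapped))

# astype(str) keeps the sign, so a pay status of -1 becomes the string '-1'
pay_example = pd.Series([-1, 0, 2]).astype(str)
print(pay_example.tolist())
```

Running a check like this after each map call is a cheap way to confirm that the dictionary covers every code that appears in the column.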

There are more explorations available in the code bundle provided with this book. We now move on to training our random forest model.
