We'll use Google Colab to build our model. We explained how to use Google Colaboratory in the There's more... section of Chapter 10, Heterogeneous Ensemble Classifiers Using H2O.
We'll start by installing H2O in Google Colab as follows:
! pip install h2o
Executing the preceding command prints a few progress messages, with the final line showing the following (the H2O version number will differ depending on the latest available release):
Successfully installed colorama-0.4.1 h2o-3.22.1.2
We import all the required libraries, as follows:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn import tree
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
We'll then initialize H2O:
# Initialize H2O
h2o.init()
Upon successful initialization, we'll see the information shown in the following screenshot. This information might be different, depending on the environment:
We'll read our dataset from Google Drive. In order to do this, we first need to mount the drive:
from google.colab import drive
drive.mount('/content/drive')
It will instruct you to go to a URL to get the authorization code. You'll need to click on the URL, copy the authorization code, and paste it. Upon successful mounting, you can read your file from the respective folder in Google Drive:
# Reading dataset from Google drive
df_creditcarddata = h2o.import_file("/content/drive/My Drive/Colab Notebooks/UCI_Credit_Card.csv")
Note that h2o.import_file creates an h2o.frame.H2OFrame. This is similar to a pandas DataFrame; the difference is that a pandas DataFrame holds its data in local memory, whereas the data in an H2OFrame resides on the H2O cluster.
You can run similar methods on an H2O DataFrame as you can on pandas. For example, in order to see the first 10 observations in the DataFrame, you can use the following command:
df_creditcarddata.head()
To check the dimensions of the DataFrame, we use the following command:
df_creditcarddata.shape
To see all the column names, we run the following command:
df_creditcarddata.columns
In a pandas DataFrame, we used dtypes to see the datatypes of each column. With an H2O frame, we use the following instead:
df_creditcarddata.types
This gives us the following output. Note that the categorical variables appear as 'enum':
We have our target variable, default.payment.next.month, in the dataset. This tells us which customers have and have not defaulted on their payments. We want to see the distribution of the defaulters and non-defaulters:
df_creditcarddata['default.payment.next.month'].table()
This gives us the count of each class in the default.payment.next.month variable:
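The counting that table() performs on the H2O cluster can be sketched locally with pandas. The labels below are a small hypothetical stand-in for the real default.payment.next.month column, just to illustrate the idea of checking the class balance:

```python
import pandas as pd

# Hypothetical labels standing in for the real target column;
# in the recipe, the counts come from
# df_creditcarddata['default.payment.next.month'].table()
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0],
                   name='default.payment.next.month')

# Count of each class, analogous to H2O's table()
counts = labels.value_counts().sort_index()
print(counts)

# A quick imbalance check: the share of defaulters (class 1)
default_rate = counts.loc[1] / counts.sum()
print(f"Default rate: {default_rate:.0%}")
```

A skewed default rate like this is worth noting early, since class imbalance affects how we should evaluate the classifiers we build later.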
We don't need the ID column for predictive modeling, so we remove it from our DataFrame:
df_creditcarddata = df_creditcarddata.drop(["ID"], axis = 1)
We can see the distribution of the numeric variables using the hist() method:
import pylab as pl
df_creditcarddata[['AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6', 'LIMIT_BAL']].as_data_frame().hist(figsize=(20,20))
pl.show()
The following screenshot shows the plotted variables, which can help us analyze each of them:
To extend our analysis, we can see the distribution of defaulters and non-defaulters by gender, education, and marital status:
# Defaulters by gender
columns = ["default.payment.next.month", "SEX"]
default_by_gender = df_creditcarddata.group_by(by=columns).count(na="all")
print(default_by_gender.get_frame())

# Defaulters by education
columns = ["default.payment.next.month", "EDUCATION"]
default_by_education = df_creditcarddata.group_by(by=columns).count(na="all")
print(default_by_education.get_frame())

# Defaulters by marital status
columns = ["default.payment.next.month", "MARRIAGE"]
default_by_marriage = df_creditcarddata.group_by(by=columns).count(na="all")
print(default_by_marriage.get_frame())
In the following screenshot, we get to see the distribution of defaulters by different categories:
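For readers more familiar with pandas, the same cross-tabulation logic can be sketched locally. The toy DataFrame below is hypothetical (SEX is coded 1/2, as in the UCI data); it shows how group_by(...).count() corresponds to a pandas groupby:

```python
import pandas as pd

# Toy stand-in for df_creditcarddata (hypothetical values)
df = pd.DataFrame({
    'default.payment.next.month': [0, 1, 0, 0, 1, 0],
    'SEX': [1, 1, 2, 2, 2, 1],
})

# Equivalent of H2O's group_by(by=columns).count(): one row per
# (target, SEX) combination with its frequency
default_by_gender = (df.groupby(['default.payment.next.month', 'SEX'])
                       .size()
                       .reset_index(name='nrow'))
print(default_by_gender)
```

The key difference is where the work happens: pandas aggregates in local memory, while H2O's group_by runs on the cluster and only the small result frame is pulled back with get_frame().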
We'll now convert the categorical variables into factors:
# Convert the categorical variables into factors
df_creditcarddata['SEX'] = df_creditcarddata['SEX'].asfactor()
df_creditcarddata['EDUCATION'] = df_creditcarddata['EDUCATION'].asfactor()
df_creditcarddata['MARRIAGE'] = df_creditcarddata['MARRIAGE'].asfactor()
df_creditcarddata['PAY_0'] = df_creditcarddata['PAY_0'].asfactor()
df_creditcarddata['PAY_2'] = df_creditcarddata['PAY_2'].asfactor()
df_creditcarddata['PAY_3'] = df_creditcarddata['PAY_3'].asfactor()
df_creditcarddata['PAY_4'] = df_creditcarddata['PAY_4'].asfactor()
df_creditcarddata['PAY_5'] = df_creditcarddata['PAY_5'].asfactor()
df_creditcarddata['PAY_6'] = df_creditcarddata['PAY_6'].asfactor()
We also encode the dichotomous target variable, default.payment.next.month, as a factor variable. After the conversion, we check the classes of the target variable with the levels() method:
# Also, encode the binary response variable as a factor
df_creditcarddata['default.payment.next.month'] = df_creditcarddata['default.payment.next.month'].asfactor()
df_creditcarddata['default.payment.next.month'].levels()
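The asfactor()/levels() pair has a close pandas analogue, which may make the conversion easier to picture. This is a local sketch on a hypothetical series named y, not part of the recipe's H2O code:

```python
import pandas as pd

# Pandas analogue of asfactor() and levels(): convert to the
# 'category' dtype and inspect the distinct classes
y = pd.Series([0, 1, 0, 0, 1], name='default.payment.next.month')
y = y.astype('category')

print(y.dtype)                   # category
print(list(y.cat.categories))    # the two classes, [0, 1]
```

In both libraries, marking the target as categorical is what tells the estimators to treat the problem as classification rather than regression.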
We'll then define our predictor and target variables:
# Define predictors manually
predictors = ['LIMIT_BAL','SEX','EDUCATION','MARRIAGE','AGE','PAY_0','PAY_2','PAY_3',
'PAY_4','PAY_5','PAY_6','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4',
'BILL_AMT5','BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
target = 'default.payment.next.month'
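When listing predictors by hand, it is easy to drop or duplicate a column, so a quick sanity check can be worthwhile. The following restates the lists above and verifies them; the specific assertions are our own addition, not part of the original recipe:

```python
# Restate the manual predictor list and target from the recipe
predictors = ['LIMIT_BAL','SEX','EDUCATION','MARRIAGE','AGE','PAY_0','PAY_2','PAY_3',
              'PAY_4','PAY_5','PAY_6','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4',
              'BILL_AMT5','BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4',
              'PAY_AMT5','PAY_AMT6']
target = 'default.payment.next.month'

# The target must not leak into the predictors, and there should be
# exactly 23 distinct features (ID was dropped earlier)
assert target not in predictors
assert len(predictors) == len(set(predictors)) == 23
print("Predictor list looks consistent")
```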
We then split our DataFrame into training (70%) and testing (30%) sets using the split_frame() method. Note that split_frame() assigns rows at random, so the actual split proportions are approximate rather than exact:
splits = df_creditcarddata.split_frame(ratios=[0.7], seed=1)
The following code shows us the two resulting splits:
splits
In the following screenshot, we can see the two splits:
We separate the splits into train and test subsets:
train = splits[0]
test = splits[1]
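The approximate nature of split_frame() can be illustrated locally with NumPy. This sketch mimics the per-row random assignment (the row count of 30,000 is roughly the size of the UCI credit card dataset; it is an assumption here, not read from the file):

```python
import numpy as np

# Local sketch of split_frame(ratios=[0.7], seed=1): each row is
# assigned to a split by an independent random draw, so the 70/30
# proportions hold only approximately
rng = np.random.default_rng(seed=1)
n_rows = 30000                       # assumed dataset size
assignment = rng.random(n_rows) < 0.7

n_train = int(assignment.sum())
n_test = int((~assignment).sum())
print(n_train, n_test)               # close to, but not exactly, 21000 / 9000
```

Because every row lands in exactly one split, the two subset sizes always sum to the original row count, which is a useful check to run on the real train and test frames as well.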