We'll use Google Colab to build our model. We explained how to use Google Colaboratory in the There's more... section of Chapter 10, Heterogeneous Ensemble Classifiers Using H2O.
We'll start by installing H2O in Google Colab as follows:
! pip install h2o
Executing the preceding command prints a few progress messages, with the final line showing the following (the H2O version number will differ depending on the latest available release):
Successfully installed colorama-0.4.1 h2o-3.22.1.2
We import all the required libraries, as follows:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn import tree
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
We'll then initialize H2O:
# Initialize H2O
h2o.init()
Upon successful initialization, we'll see the information shown in the following screenshot. This information might be different, depending on the environment:
We'll read our dataset from Google Drive. In order to do this, we first need to mount the drive:
from google.colab import drive
drive.mount('/content/drive')
It will instruct you to go to a URL to get the authorization code. You'll need to click on the URL, copy the authorization code, and paste it. Upon successful mounting, you can read your file from the respective folder in Google Drive:
# Reading dataset from Google drive
df_creditcarddata = h2o.import_file("/content/drive/My Drive/Colab Notebooks/UCI_Credit_Card.csv")
Note that h2o.import_file creates an h2o.frame.H2OFrame. This is similar to a pandas DataFrame; the difference is that a pandas DataFrame holds its data in local memory, whereas the data in an H2OFrame resides on the H2O cluster.
You can run similar methods on an H2O DataFrame as you can on pandas. For example, in order to see the first 10 observations in the DataFrame, you can use the following command:
df_creditcarddata.head()
To check the dimensions of the DataFrame, we use the following command:
df_creditcarddata.shape
To see all the column names, we run the following command:
df_creditcarddata.columns
In a pandas DataFrame, we used dtypes to see the datatypes of each column. With an H2O frame, we use the following instead:
df_creditcarddata.types
This gives us the following output. Note that the categorical variables appear as 'enum':
We have our target variable, default.payment.next.month, in the dataset. This tells us which customers have and have not defaulted on their payments. We want to see the distribution of the defaulters and non-defaulters:
df_creditcarddata['default.payment.next.month'].table()
This gives us the count of each class in the default.payment.next.month variable:
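The counting that table() performs on the H2O cluster can be sketched locally with pandas. The labels below are a small hypothetical stand-in for the real default.payment.next.month column, just to illustrate the idea of checking the class balance:

```python
import pandas as pd

# Hypothetical labels standing in for the real target column;
# in the recipe, the counts come from
# df_creditcarddata['default.payment.next.month'].table()
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0],
                   name='default.payment.next.month')

# Count of each class, analogous to H2O's table()
counts = labels.value_counts().sort_index()
print(counts)

# A quick imbalance check: the share of defaulters (class 1)
default_rate = counts.loc[1] / counts.sum()
print(f"Default rate: {default_rate:.0%}")
```

A skewed default rate like this is worth noting early, since class imbalance affects how we should evaluate the classifiers we build later.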
We don't need the ID column for predictive modeling, so we remove it from our DataFrame:
df_creditcarddata = df_creditcarddata.drop(["ID"], axis = 1)
We can see the distribution of the numeric variables using the hist() method:
import pylab as pl
df_creditcarddata[['AGE','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4','BILL_AMT5','BILL_AMT6', 'LIMIT_BAL']].as_data_frame().hist(figsize=(20,20))
pl.show()
The following screenshot shows the plotted variables, which can help us analyze each of them:
To extend our analysis, we can see the distribution of defaulters and non-defaulters by gender, education, and marital status:
# Defaulters by gender
columns = ["default.payment.next.month", "SEX"]
default_by_gender = df_creditcarddata.group_by(by=columns).count(na="all")
print(default_by_gender.get_frame())

# Defaulters by education
columns = ["default.payment.next.month", "EDUCATION"]
default_by_education = df_creditcarddata.group_by(by=columns).count(na="all")
print(default_by_education.get_frame())

# Defaulters by marital status
columns = ["default.payment.next.month", "MARRIAGE"]
default_by_marriage = df_creditcarddata.group_by(by=columns).count(na="all")
print(default_by_marriage.get_frame())
In the following screenshot, we get to see the distribution of defaulters by different categories:
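For readers more familiar with pandas, the same cross-tabulation logic can be sketched locally. The toy DataFrame below is hypothetical (SEX is coded 1/2, as in the UCI data); it shows how group_by(...).count() corresponds to a pandas groupby:

```python
import pandas as pd

# Toy stand-in for df_creditcarddata (hypothetical values)
df = pd.DataFrame({
    'default.payment.next.month': [0, 1, 0, 0, 1, 0],
    'SEX': [1, 1, 2, 2, 2, 1],
})

# Equivalent of H2O's group_by(by=columns).count(): one row per
# (target, SEX) combination with its frequency
default_by_gender = (df.groupby(['default.payment.next.month', 'SEX'])
                       .size()
                       .reset_index(name='nrow'))
print(default_by_gender)
```

The key difference is where the work happens: pandas aggregates in local memory, while H2O's group_by runs on the cluster and only the small result frame is pulled back with get_frame().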
We'll now convert the categorical variables into factors:
# Convert the categorical variables into factors
df_creditcarddata['SEX'] = df_creditcarddata['SEX'].asfactor()
df_creditcarddata['EDUCATION'] = df_creditcarddata['EDUCATION'].asfactor()
df_creditcarddata['MARRIAGE'] = df_creditcarddata['MARRIAGE'].asfactor()
df_creditcarddata['PAY_0'] = df_creditcarddata['PAY_0'].asfactor()
df_creditcarddata['PAY_2'] = df_creditcarddata['PAY_2'].asfactor()
df_creditcarddata['PAY_3'] = df_creditcarddata['PAY_3'].asfactor()
df_creditcarddata['PAY_4'] = df_creditcarddata['PAY_4'].asfactor()
df_creditcarddata['PAY_5'] = df_creditcarddata['PAY_5'].asfactor()
df_creditcarddata['PAY_6'] = df_creditcarddata['PAY_6'].asfactor()
We also encode the dichotomous target variable, default.payment.next.month, as a factor variable. After the conversion, we check the classes of the target variable with the levels() method:
# Also, encode the binary response variable as a factor
df_creditcarddata['default.payment.next.month'] = df_creditcarddata['default.payment.next.month'].asfactor()
df_creditcarddata['default.payment.next.month'].levels()
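The asfactor()/levels() pair has a close pandas analogue, which may make the conversion easier to picture. This is a local sketch on a hypothetical series named y, not part of the recipe's H2O code:

```python
import pandas as pd

# Pandas analogue of asfactor() and levels(): convert to the
# 'category' dtype and inspect the distinct classes
y = pd.Series([0, 1, 0, 0, 1], name='default.payment.next.month')
y = y.astype('category')

print(y.dtype)                   # category
print(list(y.cat.categories))    # the two classes, [0, 1]
```

In both libraries, marking the target as categorical is what tells the estimators to treat the problem as classification rather than regression.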
We'll then define our predictor and target variables:
# Define predictors manually
predictors = ['LIMIT_BAL','SEX','EDUCATION','MARRIAGE','AGE','PAY_0','PAY_2','PAY_3',
'PAY_4','PAY_5','PAY_6','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4',
'BILL_AMT5','BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4','PAY_AMT5','PAY_AMT6']
target = 'default.payment.next.month'
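When listing predictors by hand, it is easy to drop or duplicate a column, so a quick sanity check can be worthwhile. The following restates the lists above and verifies them; the specific assertions are our own addition, not part of the original recipe:

```python
# Restate the manual predictor list and target from the recipe
predictors = ['LIMIT_BAL','SEX','EDUCATION','MARRIAGE','AGE','PAY_0','PAY_2','PAY_3',
              'PAY_4','PAY_5','PAY_6','BILL_AMT1','BILL_AMT2','BILL_AMT3','BILL_AMT4',
              'BILL_AMT5','BILL_AMT6','PAY_AMT1','PAY_AMT2','PAY_AMT3','PAY_AMT4',
              'PAY_AMT5','PAY_AMT6']
target = 'default.payment.next.month'

# The target must not leak into the predictors, and there should be
# exactly 23 distinct features (ID was dropped earlier)
assert target not in predictors
assert len(predictors) == len(set(predictors)) == 23
print("Predictor list looks consistent")
```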
We then split our DataFrame into training (70%) and testing (30%) sets using the split_frame() method. Note that split_frame() assigns rows at random, so the actual split proportions are approximate rather than exact:
splits = df_creditcarddata.split_frame(ratios=[0.7], seed=1)
The following code shows us the two resulting splits:
splits
In the following screenshot, we can see the two splits:
We separate the splits into train and test subsets:
train = splits[0]
test = splits[1]
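The approximate nature of split_frame() can be illustrated locally with NumPy. This sketch mimics the per-row random assignment (the row count of 30,000 is roughly the size of the UCI credit card dataset; it is an assumption here, not read from the file):

```python
import numpy as np

# Local sketch of split_frame(ratios=[0.7], seed=1): each row is
# assigned to a split by an independent random draw, so the 70/30
# proportions hold only approximately
rng = np.random.default_rng(seed=1)
n_rows = 30000                       # assumed dataset size
assignment = rng.random(n_rows) < 0.7

n_train = int(assignment.sum())
n_test = int((~assignment).sum())
print(n_train, n_test)               # close to, but not exactly, 21000 / 9000
```

Because every row lands in exactly one split, the two subset sizes always sum to the original row count, which is a useful check to run on the real train and test frames as well.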