How it works...

We used Google Colab to train our models. After we installed H2O in Google Colab, we initialized the H2O instance. We also imported the required libraries.

In order to use the H2O libraries, we imported H2OGeneralizedLinearEstimator, H2ORandomForestEstimator, and H2OGradientBoostingEstimator from h2o.estimators. We also imported H2OStackedEnsembleEstimator to train our model using a stacked ensemble.
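A minimal sketch of that setup (import paths as in recent h2o-py releases; the !pip line assumes a notebook cell):

```python
# Install and start H2O inside the Colab notebook
!pip install h2o

import h2o
from h2o.estimators import (H2OGeneralizedLinearEstimator,
                            H2ORandomForestEstimator,
                            H2OGradientBoostingEstimator,
                            H2OStackedEnsembleEstimator)
from h2o.grid import H2OGridSearch

h2o.init()  # start (or connect to) a local H2O cluster
```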

We mounted Google Drive and read our dataset using h2o.import_file(). This created an H2O DataFrame, which is very similar to a pandas DataFrame. Unlike a pandas DataFrame, however, the data is not held in the Python session's memory; it resides in the H2O cluster.
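A sketch of the loading step; the Drive path and file name are placeholders for wherever the credit card default dataset is stored:

```python
from google.colab import drive
drive.mount('/content/drive')

# Path and file name are illustrative
df = h2o.import_file('/content/drive/My Drive/UCI_Credit_Card.csv')
```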

We then performed basic operations on the H2O DataFrame to analyze our data. We took a look at the dimensions, the top few rows, and the data types of each column. The shape attribute returned a tuple with the number of rows and columns. The head() method returned the top 10 observations. The types attribute returned the data types of each column.
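For example (assuming the H2O DataFrame is called df):

```python
print(df.shape)    # (number of rows, number of columns)
df.head(rows=10)   # first 10 observations
print(df.types)    # dict mapping each column name to its H2O type
```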

Note that a categorical variable in an H2O DataFrame is marked as an enum.

Our target variable was default.payment.next.month. With the table() method, we saw the distribution of the two classes of our target variable; it returned the counts for classes 1 and 0 in this case.
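For example:

```python
# Counts of classes 0 and 1 for the target variable
df['default.payment.next.month'].table()
```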

We didn't need the ID column, so we removed it using the drop() method with axis=1 as a parameter. With axis=1, drop() removes columns; the default value of axis=0 would have removed rows instead.
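For example:

```python
# Remove the ID column; axis=1 drops columns, axis=0 would drop rows
df = df.drop('ID', axis=1)
```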

We analyzed the distribution of the numeric variables. There's no limit to how far you can explore your data. We also saw the distribution of both of the classes of our target variable by various categories, such as gender, education, and marriage.

We then converted the categorical variables to factor type with the asfactor() method. This was done for the target variable as well.
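A sketch of the conversion; the list of categorical columns is an assumption based on the credit card default dataset:

```python
for col in ['SEX', 'EDUCATION', 'MARRIAGE', 'default.payment.next.month']:
    df[col] = df[col].asfactor()
```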

We created a list of predictor variables and the target variable. We split our DataFrame into train and test subsets with the split_frame() method.

We passed ratios to the split_frame() method. In our case, we split the dataset into 70% and 30%. However, note that this didn't give an exact split of 70%-30%. H2O uses a probabilistic splitting method instead of using the exact ratios to split the dataset. This is to make the split more efficient on big data.
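A sketch of the split; the seed is illustrative:

```python
target = 'default.payment.next.month'
predictors = [c for c in df.columns if c != target]

# Probabilistic ~70/30 split
train, test = df.split_frame(ratios=[0.7], seed=42)
```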

After we split our dataset into train and test subsets, we moved on to training our models. We trained GLM, random forest, and gradient boosting machine (GBM) models as base learners, and then combined them with a stacked ensemble.

In the How to do it... section, in Step 1 and Step 2, we showcased the code to train a GLM model with the default settings. We used cross-validation to train our model. 
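A minimal sketch, assuming a binomial family and 5-fold cross-validation (the exact settings used in the recipe may differ):

```python
glm_default = H2OGeneralizedLinearEstimator(family='binomial', nfolds=5)
glm_default.train(x=predictors, y=target, training_frame=train)
```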

In Step 3, we trained a GLM model with lambda_search, which helps to find the optimal regularization parameter.
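A sketch under the same assumptions:

```python
glm_lambda = H2OGeneralizedLinearEstimator(family='binomial',
                                           lambda_search=True,
                                           nfolds=5)
glm_lambda.train(x=predictors, y=target, training_frame=train)
```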

In Step 4, we used grid-search parameters to train our GLM model. We set our hyper-parameters and provided them to the H2OGridSearch() method. This helps us search for the optimum parameters across models. In the H2OGridSearch() method, we used the RandomDiscrete search-criteria strategy.

The default search-criteria strategy is Cartesian, which covers the entire space of hyperparameter combinations. The random discrete strategy carries out a random search of all the combinations of the hyperparameters provided.
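A sketch of a random grid search for the GLM; the alpha values, max_models, seed, and cross-validation settings are illustrative (keep_cross_validation_predictions=True is set here so the best model can feed the stacked ensemble later):

```python
glm_hyper_params = {'alpha': [0.0, 0.25, 0.5, 0.75, 1.0]}
search_criteria = {'strategy': 'RandomDiscrete', 'max_models': 10, 'seed': 42}

glm_grid = H2OGridSearch(
    model=H2OGeneralizedLinearEstimator(family='binomial', nfolds=5,
                                        keep_cross_validation_predictions=True,
                                        seed=42),
    hyper_params=glm_hyper_params,
    search_criteria=search_criteria)
glm_grid.train(x=predictors, y=target, training_frame=train)
```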

In Step 5, with the get_grid() method, we looked at the AUC score of each model built with different combinations of the parameters provided. In Step 6, we extracted the best model from the random grid search. We can also use the print() method on the best model to see the model performance metrics on both the train data and the cross-validation data.
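For example:

```python
# Rank the grid models by AUC and keep the best one
sorted_glm_grid = glm_grid.get_grid(sort_by='auc', decreasing=True)
print(sorted_glm_grid)

best_glm = sorted_glm_grid.models[0]
print(best_glm)  # metrics on the training and cross-validation data
```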

In Step 7, we trained a random forest model with the default settings, and in Step 8 we looked at the summary of the resulting model. In Step 9 and Step 10, we showcased the code to train a random forest model using grid search. We set multiple values for various acceptable hyper-parameters, such as sample_rate, col_sample_rate_per_tree, max_depth, and ntrees.

sample_rate refers to row sampling without replacement; it takes a value between 0 and 1, indicating the fraction of the data sampled for each tree. col_sample_rate_per_tree is the column-sampling rate for each tree, also without replacement. max_depth specifies the maximum depth to which each tree is built; deeper trees may perform better on the training data, but they take more computing time and may overfit and fail to generalize to unseen data. ntrees specifies, for tree-based algorithms, the number of trees to build.
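A sketch of the grid-searched random forest; the specific hyper-parameter values are illustrative:

```python
rf_hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                   'col_sample_rate_per_tree': [0.8, 0.9, 1.0],
                   'max_depth': [10, 20, 30],
                   'ntrees': [50, 100, 200]}

rf_grid = H2OGridSearch(
    model=H2ORandomForestEstimator(nfolds=5,
                                   keep_cross_validation_predictions=True,
                                   seed=42),
    hyper_params=rf_hyper_params,
    search_criteria={'strategy': 'RandomDiscrete', 'max_models': 10, 'seed': 42})
rf_grid.train(x=predictors, y=target, training_frame=train)
```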

In Step 11 and Step 12, we printed the AUC score of each model generated by the grid-search and extracted the best model from it.

We also trained GBM models to fit our data. In Step 13, we built the GBM using the default settings. In Step 14, we set the hyper-parameter space for the grid search. We used this in Step 15, where we trained our GBM. For the GBM, we set values for hyper-parameters such as learn_rate, sample_rate, col_sample_rate, max_depth, and ntrees. The learn_rate parameter controls how fast the GBM learns by scaling the contribution of each new tree. A lower value for learn_rate is generally better and can help to avoid overfitting, but it is more costly in terms of computing time because more trees are needed.

In H2O, learn_rate is available in GBM and XGBoost.
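A sketch of the GBM grid search with the hyper-parameters named above; the specific values are illustrative:

```python
gbm_hyper_params = {'learn_rate': [0.01, 0.05, 0.1],
                    'sample_rate': [0.7, 0.8, 0.9],
                    'col_sample_rate': [0.8, 0.9, 1.0],
                    'max_depth': [3, 5, 7],
                    'ntrees': [100, 200]}

gbm_grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(nfolds=5,
                                       keep_cross_validation_predictions=True,
                                       seed=42),
    hyper_params=gbm_hyper_params,
    search_criteria={'strategy': 'RandomDiscrete', 'max_models': 10, 'seed': 42})
gbm_grid.train(x=predictors, y=target, training_frame=train)
```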

Step 16 showed us the AUC score of each resulting model from the grid search. We extracted the best grid-searched GBM in Step 17.

In Step 18 through Step 20, we trained our stacked ensemble model using H2OStackedEnsembleEstimator from H2O. We evaluated the performance of the resulting model on the test data.
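A sketch of the stacking step. It assumes the base learners were cross-validated with the same nfolds on the same training frame and with keep_cross_validation_predictions=True, which H2O requires for stacking; best_rf and best_gbm are extracted from their grids in the same way as best_glm:

```python
best_rf = rf_grid.get_grid(sort_by='auc', decreasing=True).models[0]
best_gbm = gbm_grid.get_grid(sort_by='auc', decreasing=True).models[0]

ensemble = H2OStackedEnsembleEstimator(
    base_models=[best_glm.model_id, best_rf.model_id, best_gbm.model_id])
ensemble.train(x=predictors, y=target, training_frame=train)

# Performance of the ensemble on the held-out test data
ensemble_perf = ensemble.model_performance(test)
print(ensemble_perf.auc())
```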

In Step 21, we evaluated all the GLM models we built on our test data. We did the same with all the models we trained using RF and GBM. Step 22 gave us the model with the maximum AUC score. In Step 23, we evaluated the AUC score of the stacked ensemble model on the test data in order to compare the performance of the stacked ensemble model with the individual base learners.
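For example:

```python
# Compare the test-set AUC of the best base learners and the stacked ensemble
for name, model in [('GLM', best_glm), ('Random Forest', best_rf),
                    ('GBM', best_gbm), ('Stacked Ensemble', ensemble)]:
    print(name, model.model_performance(test).auc())
```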
