How it works...

In Step 1, we used the h2o.import_file() function to read our dataset. The h2o.import_file() function returns an H2OFrame instance.
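The following is a minimal sketch of this step; the file path, dataset.csv, is a placeholder for whatever dataset you are working with:

    import h2o

    # Start (or connect to) a local H2O cluster
    h2o.init()

    # Parse the file in the cluster; this returns an H2OFrame
    df = h2o.import_file("dataset.csv")  # placeholder path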

In Step 2, we split our H2OFrame into training and testing subsets. In Step 3, we checked the dimensions of these subsets to verify that our split is adequate for our requirements.
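Steps 2 and 3 might be sketched as follows; the 80/20 split ratio and the seed are illustrative values:

    # Step 2: split the H2OFrame into training and testing subsets
    train, test = df.split_frame(ratios=[0.8], seed=42)

    # Step 3: check the dimensions (rows, columns) of each subset
    print(train.shape)
    print(test.shape)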

In Step 4, we took a look at the first few rows to check whether the data had loaded correctly. In Step 5, we separated out the column names of our response and predictor variables, and in Step 6, we converted the response variable into a categorical type with the asfactor() function.
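A sketch of Steps 4 to 6; the response column name, target, is a placeholder for the actual column in your dataset:

    # Step 4: inspect the first few rows
    train.head()

    # Step 5: separate the response and predictor column names
    response = "target"  # placeholder name
    predictors = [col for col in df.columns if col != response]

    # Step 6: convert the response to a categorical (factor) type
    train[response] = train[response].asfactor()
    test[response] = test[response].asfactor()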

We defined a variable called nfolds in Step 7, which we used for cross-validation. We also defined a variable called encoding, which we used in the subsequent steps to instruct H2O to use one-hot encoding for the categorical variables. In Steps 8 to 10, we built our base learners.
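A sketch of Step 7; the fold count is illustrative, and "one_hot_explicit" is the value passed to the estimators' categorical_encoding parameter to request explicit one-hot encoding:

    # Number of folds for K-fold cross-validation
    nfolds = 5

    # Encoding scheme for the base learners' categorical_encoding parameter
    encoding = "one_hot_explicit"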

In Step 8, we trained a Gradient Boosting Machine (GBM) model. We passed values to a few hyperparameters, as follows (a sketch of this step follows the list):

  • nfolds: Number of folds for K-fold cross-validation.
  • fold_assignment: This option specifies the scheme to use for cross-validation fold assignment. This option is only applicable if a value for nfolds is specified and a fold_column isn't specified. 
  • distribution: Specifies the distribution of the response variable. In our case, since the response variable has two classes, we set distribution to "bernoulli".
  • ntrees: Number of trees.
  • max_depth: Denotes the maximum tree depth.
  • min_rows: Fewest allowed observations in a leaf.
  • learn_rate: The learning rate; it takes a value from 0.0 to 1.0.
Note that, for all of the base learners, the number of cross-validation folds must be the same, and keep_cross_validation_predictions must be set to True.
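Here is a sketch of the GBM base learner; the specific hyperparameter values and the seed are illustrative, not tuned:

    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    gbm = H2OGradientBoostingEstimator(
        distribution="bernoulli",       # two-class response
        ntrees=100,                     # number of trees
        max_depth=3,                    # maximum tree depth
        min_rows=2,                     # fewest observations allowed in a leaf
        learn_rate=0.1,                 # learning rate, from 0.0 to 1.0
        nfolds=nfolds,                  # same folds for all base learners
        fold_assignment="Random",       # CV fold-assignment scheme
        categorical_encoding=encoding,  # one-hot encode categorical variables
        keep_cross_validation_predictions=True,  # required for stacking
        seed=42,
    )
    gbm.train(x=predictors, y=response, training_frame=train)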

In Step 9, we trained a random forest base learner using the following hyperparameters: ntrees, nfolds, and fold_assignment.
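A sketch of this step, under the same assumptions as the GBM example above:

    from h2o.estimators.random_forest import H2ORandomForestEstimator

    rf = H2ORandomForestEstimator(
        ntrees=100,
        nfolds=nfolds,
        fold_assignment="Random",
        categorical_encoding=encoding,
        keep_cross_validation_predictions=True,  # required for stacking
        seed=42,
    )
    rf.train(x=predictors, y=response, training_frame=train)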

In Step 10, we trained a GLM base learner. Note that we did not encode the categorical variables for the GLM.
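A sketch of this step; note that no categorical_encoding is passed, leaving the categorical columns to GLM's internal handling:

    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    glm = H2OGeneralizedLinearEstimator(
        family="binomial",  # two-class response
        nfolds=nfolds,
        fold_assignment="Random",
        keep_cross_validation_predictions=True,  # required for stacking
        seed=42,
    )
    glm.train(x=predictors, y=response, training_frame=train)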

H2O recommends allowing GLM to handle categorical columns itself, as it can take advantage of the categorical columns for better performance and efficient memory utilization.

From H2O.ai: "We strongly recommend avoiding one-hot encoding categorical columns with any levels into many binary columns, as this is very inefficient. This is especially true for Python users who are used to expanding their categorical variables manually for other frameworks."

In Step 11, we generated the test AUC values for each of the base learners and printed the best AUC.
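Step 11 could be sketched as follows:

    # Evaluate each base learner on the test subset and print the best AUC
    all_aucs = [model.model_performance(test_data=test).auc()
                for model in (gbm, rf, glm)]
    print("Best base learner test AUC:", max(all_aucs))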

In Step 12, we trained a stacked ensemble model by combining the output of the base learners using H2OStackedEnsembleEstimator. We then used the trained ensemble model to make predictions on our test subset. Note that, by default, GLM is used as the meta-learner for H2OStackedEnsembleEstimator; however, we used deep learning as the meta-learner in our example.
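A sketch of this step; setting metalearner_algorithm to "deeplearning" overrides the default GLM meta-learner:

    from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

    ensemble = H2OStackedEnsembleEstimator(
        base_models=[gbm, rf, glm],
        metalearner_algorithm="deeplearning",  # default is GLM
    )
    ensemble.train(x=predictors, y=response, training_frame=train)

    # Evaluate the ensemble on the test subset
    print("Ensemble test AUC:", ensemble.model_performance(test_data=test).auc())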

Note that we used the default hyperparameter values for our meta-learner. We can specify hyperparameter values with metalearner_params. The metalearner_params option allows us to pass in a dictionary of hyperparameters to use for the algorithm that is used as the meta-learner.
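For example, the following sketch passes a couple of deep learning hyperparameters to the meta-learner; the values shown are illustrative, not tuned:

    ensemble_tuned = H2OStackedEnsembleEstimator(
        base_models=[gbm, rf, glm],
        metalearner_algorithm="deeplearning",
        metalearner_params={"hidden": [32, 32], "epochs": 20},  # illustrative values
    )
    ensemble_tuned.train(x=predictors, y=response, training_frame=train)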

Fine-tuning the hyperparameters can deliver better results.
