How to do it...

Let's move on to training our models using the algorithms we mentioned earlier in this chapter. We'll start by training our generalized linear model (GLM) models. We'll build three GLM models:

  • A GLM model with default values for the parameters
  • A GLM model with Lambda search (regularization)
  • A GLM model with grid search

After the GLM models, we'll train random forest and GBM models, and finally combine the best of them in a stacked ensemble.

  1. Let's train our first model:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

GLM_default_settings = H2OGeneralizedLinearEstimator(family = 'binomial',
                                                     model_id = 'GLM_default',
                                                     nfolds = 10,
                                                     fold_assignment = "Modulo",
                                                     keep_cross_validation_predictions = True)

H2OGeneralizedLinearEstimator fits a generalized linear model. It takes in a response variable and a set of predictor variables. 

H2OGeneralizedLinearEstimator can handle both regression and classification tasks. In the case of a regression problem, it returns an H2ORegressionModel subclass, while for classification, it returns an H2OBinomialModel subclass. 
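Note that H2O infers the task from the type of the target column: a numeric target produces a regression model, while a categorical (factor) target produces a classification model. As a minimal sketch, assuming the train frame and target variable from the Getting ready section, we can make sure our binary target is treated as categorical:

# Convert the target column to a factor so that the GLM
# treats the problem as binomial classification
train[target] = train[target].asfactor()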

  2. We created predictor and target variables in the Getting ready section. Pass the predictor and target variables to the model:
GLM_default_settings.train(x = predictors, y = target, training_frame = train)
  3. Train the GLM model using the lambda_search parameter:
GLM_regularized = H2OGeneralizedLinearEstimator(family = 'binomial',
                                                model_id = 'GLM',
                                                lambda_search = True,
                                                nfolds = 10,
                                                fold_assignment = "Modulo",
                                                keep_cross_validation_predictions = True)

GLM_regularized.train(x = predictors, y = target, training_frame = train)

lambda_search helps the GLM find an optimal regularization parameter, λ. The lambda_search parameter takes a Boolean value. When set to True, the GLM first fits a model at the highest lambda value, which is known as maximum regularization, and then decreases lambda at each step until it reaches the minimum lambda. The model fitted with the best lambda value found along this path is the one returned.
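The shape of the search path itself can also be tuned. The following is a minimal sketch; the nlambdas and lambda_min_ratio values are illustrative choices, not taken from the recipe:

# Optionally control the lambda search path
GLM_lambda_path = H2OGeneralizedLinearEstimator(family = 'binomial',
                                                lambda_search = True,
                                                nlambdas = 30,             # number of lambdas on the path
                                                lambda_min_ratio = 1e-4)   # min lambda as a fraction of max lambda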

  4. Train the model using the GLM with a grid search:
hyper_parameters = {'alpha': [0.001, 0.01, 0.05, 0.1, 1.0],
                    'lambda': [0.001, 0.01, 0.1, 1]}

search_criteria = {'strategy': "RandomDiscrete",
                   'seed': 1,
                   'stopping_metric': "AUTO",
                   'stopping_rounds': 5}

from h2o.grid.grid_search import H2OGridSearch

GLM_grid_search = H2OGridSearch(H2OGeneralizedLinearEstimator(family = 'binomial',
                                                              nfolds = 10,
                                                              fold_assignment = "Modulo",
                                                              keep_cross_validation_predictions = True),
                                hyper_parameters,
                                grid_id = "GLM_grid",
                                search_criteria = search_criteria)

GLM_grid_search.train(x = predictors, y = target, training_frame = train)
  5. Get the grid results sorted by AUC using the get_grid() method:
# Get the grid results, sorted by validation AUC
GLM_grid_sorted = GLM_grid_search.get_grid(sort_by='auc', decreasing=True)
GLM_grid_sorted

In the following screenshot, we can see the AUC score for each model, each of which was trained with a different combination of the alpha and lambda parameters:
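If you prefer to work with these results programmatically rather than reading them off the screen, the grid object can return the same information as a table. A short sketch, assuming sorted_metric_table() is available on the grid object:

# Retrieve the grid results as a summary table
print(GLM_grid_sorted.sorted_metric_table())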

  6. We can see the model metrics on our train data and our cross-validation data:
# Extract the best model from random grid search
Best_GLM_model_from_Grid = GLM_grid_sorted.model_ids[0]

# model performance
Best_GLM_model_from_Grid = h2o.get_model(Best_GLM_model_from_Grid)
print(Best_GLM_model_from_Grid)

From the preceding code block, you can evaluate the model metrics, which include MSE, RMSE, Null and Residual Deviance, AUC, and Gini, along with the Confusion Matrix. At a later stage, we will use the best model from the grid search for our stacked ensemble.

Let us look at the following image and evaluate the model metrics:
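Individual metrics can also be pulled from the model object directly, which is handy for logging or automated comparisons. A minimal sketch; passing xval=True requests the cross-validation metrics rather than the training metrics:

# Pull selected metrics from the best GLM model
print(Best_GLM_model_from_Grid.auc(xval = True))
print(Best_GLM_model_from_Grid.mse(xval = True))
print(Best_GLM_model_from_Grid.confusion_matrix())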

  7. Train the model using random forest. The code for a random forest with default settings looks as follows:
from h2o.estimators.random_forest import H2ORandomForestEstimator

# Build an RF model with default settings
RF_default_settings = H2ORandomForestEstimator(model_id = 'RF_D',
                                               nfolds = 10,
                                               fold_assignment = "Modulo",
                                               keep_cross_validation_predictions = True)

# Use train() to build the model
RF_default_settings.train(x = predictors, y = target, training_frame = train)
  8. To get the summary output of the model, use the following code:
RF_default_settings.summary()
  9. Train the random forest model using a grid search. Set the hyperparameters as shown in the following code block:
hyper_params = {'sample_rate': [0.7, 0.9],
                'col_sample_rate_per_tree': [0.8, 0.9],
                'max_depth': [3, 5, 9],
                'ntrees': [200, 300, 400]}
  10. Use the hyperparameters on H2OGridSearch() to train the RF model using grid search:
RF_grid_search = H2OGridSearch(H2ORandomForestEstimator(nfolds = 10,
                                                        fold_assignment = "Modulo",
                                                        keep_cross_validation_predictions = True,
                                                        stopping_metric = 'AUC',
                                                        stopping_rounds = 5),
                               hyper_params = hyper_params,
                               grid_id = 'RF_gridsearch')

# Use train() to start the grid search
RF_grid_search.train(x = predictors, y = target, training_frame = train)
  11. Sort the results by AUC score to see which model performs best:
# Sort the grid models
RF_grid_sorted = RF_grid_search.get_grid(sort_by='auc', decreasing=True)
print(RF_grid_sorted)
  12. Extract the best model from the grid search result:
Best_RF_model_from_Grid = RF_grid_sorted.model_ids[0]

# Model performance
Best_RF_model_from_Grid = h2o.get_model(Best_RF_model_from_Grid)
print(Best_RF_model_from_Grid)

In the following screenshot, we see the model metrics for the grid model on the train data and the cross-validation data:
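Beyond these metrics, tree-based models such as random forest also expose variable importances, which show how much each predictor contributed to the trees' splits. A short sketch:

# Variable importances of the best RF model
print(Best_RF_model_from_Grid.varimp(use_pandas = True))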

  13. Train the model using GBM. Here's how to train a GBM with the default settings:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

GBM_default_settings = H2OGradientBoostingEstimator(model_id = 'GBM_default',
                                                    nfolds = 10,
                                                    fold_assignment = "Modulo",
                                                    keep_cross_validation_predictions = True)

# Use train() to build the model
GBM_default_settings.train(x = predictors, y = target, training_frame = train)
  14. Use a grid search on the GBM. To perform a grid search, set the hyperparameters as follows:
hyper_params = {'learn_rate': [0.001, 0.01, 0.1],
                'sample_rate': [0.8, 0.9],
                'col_sample_rate': [0.2, 0.5, 1],
                'max_depth': [3, 5, 9]}
  15. Use the hyperparameters on H2OGridSearch() to train the GBM model using grid search:
GBM_grid_search = H2OGridSearch(H2OGradientBoostingEstimator(nfolds = 10,
                                                             fold_assignment = "Modulo",
                                                             keep_cross_validation_predictions = True,
                                                             stopping_metric = 'AUC',
                                                             stopping_rounds = 5),
                                hyper_params = hyper_params,
                                grid_id = 'GBM_Grid')

# Use train() to start the grid search
GBM_grid_search.train(x = predictors, y = target, training_frame = train)
  16. As with the earlier models, we can view the results sorted by AUC:
# Sort and show the grid search results
GBM_grid_sorted = GBM_grid_search.get_grid(sort_by='auc', decreasing=True)
print(GBM_grid_sorted)
  17. Extract the best model from the grid search:
Best_GBM_model_from_Grid = GBM_grid_sorted.model_ids[0]

Best_GBM_model_from_Grid = h2o.get_model(Best_GBM_model_from_Grid)
print(Best_GBM_model_from_Grid)

We can use H2OStackedEnsembleEstimator to build a stacked ensemble model that combines the models we have built using H2O algorithms to improve predictive performance. H2OStackedEnsembleEstimator helps us find the optimal combination of a collection of predictive algorithms. This is also why every base model above was trained with the same nfolds, fold_assignment = "Modulo", and keep_cross_validation_predictions = True: stacking requires that the base models be cross-validated on identical folds and retain their cross-validated predictions, since these are what the metalearner is trained on.

  18. Create a list of the best models from the earlier models that we built using grid search:
# list the best models from each grid
all_models = [Best_GLM_model_from_Grid, Best_RF_model_from_Grid, Best_GBM_model_from_Grid]
  19. Set up a stacked ensemble model using H2OStackedEnsembleEstimator:
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

# Set up the stacked ensemble. H2O uses a GLM as the default
# metalearner; here we override it with a deep learning metalearner
ensemble = H2OStackedEnsembleEstimator(model_id = "ensemble",
                                       base_models = all_models,
                                       metalearner_algorithm = "deeplearning")

ensemble.train(y = target, training_frame = train)
  20. Evaluate the ensemble performance on the test data:
# Eval ensemble performance on the test data
Ens_model = ensemble.model_performance(test)
Ens_AUC = Ens_model.auc()
  21. Compare the performance of the base learners on the test data. The following code tests the model performance of all the GLM models we've built:
# Checking the model performance for all GLM models built
model_perf_GLM_default = GLM_default_settings.model_performance(test)
model_perf_GLM_regularized = GLM_regularized.model_performance(test)
model_perf_Best_GLM_model_from_Grid = Best_GLM_model_from_Grid.model_performance(test)

The following code tests the model performance of all the random forest models we've built:

# Checking the model performance for all RF models built
model_perf_RF_default_settings = RF_default_settings.model_performance(test)
model_perf_Best_RF_model_from_Grid = Best_RF_model_from_Grid.model_performance(test)

The following code tests the model performance of all the GBM models we've built:

# Checking the model performance for all GBM models built
model_perf_GBM_default_settings = GBM_default_settings.model_performance(test)
model_perf_Best_GBM_model_from_Grid = Best_GBM_model_from_Grid.model_performance(test)
  22. To get the best AUC from the base learner models, execute the following commands:
# Best AUC from the base learner models
best_auc = max(model_perf_GLM_default.auc(),
               model_perf_GLM_regularized.auc(),
               model_perf_Best_GLM_model_from_Grid.auc(),
               model_perf_RF_default_settings.auc(),
               model_perf_Best_RF_model_from_Grid.auc(),
               model_perf_GBM_default_settings.auc(),
               model_perf_Best_GBM_model_from_Grid.auc())

print("Best AUC out of all the base learner models: ", best_auc)
  23. The following commands show the AUC of the stacked ensemble model:
# Eval ensemble performance on the test data
Ensemble_model_perf = ensemble.model_performance(test)
Ensemble_AUC = Ensemble_model_perf.auc()
print("AUC of the stacked ensemble: ", Ensemble_AUC)
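To see at a glance whether stacking helped, we can compare the ensemble's AUC against the best base learner AUC computed in the previous step. A simple sketch using the variables defined above:

# Compare the stacked ensemble against the strongest base learner
if Ensemble_AUC > best_auc:
    print("The stacked ensemble outperforms the best base learner.")
else:
    print("The best base learner matches or beats the stacked ensemble.")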