There's more...

In the preceding example, we obtained an AUC of 0.76 and a log loss of 0.44.
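Both of these metrics can be read directly off a model performance object. The following is a minimal sketch, assuming RF is the name of the model trained in the main recipe:

perf = RF.model_performance(test)  # RF is assumed from the main recipe
print(perf.auc())      # area under the ROC curve
print(perf.logloss())  # logarithmic loss

We can try to improve on both metrics: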

  1. We can apply cross-validation by passing nfolds as a parameter to H2ORandomForestEstimator():
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

# Train a random forest with 10-fold cross-validation
RF_cv = H2ORandomForestEstimator(model_id = 'RF_cv',
                                 seed = 12345,
                                 ntrees = 500,
                                 sample_rate = 0.9,
                                 col_sample_rate_per_tree = 0.9,
                                 nfolds = 10)

RF_cv.train(x = predictors, y = target, training_frame = train)
print(RF_cv.model_performance(test))
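Alongside the test-set performance printed above, the per-fold cross-validation results can be inspected directly on the trained model; a minimal sketch using H2O's cross-validation summary:

# Aggregated and per-fold metrics for the 10 cross-validation folds
print(RF_cv.cross_validation_metrics_summary())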

We notice that the AUC has slightly improved to 0.77 and that the log loss has dropped to 0.43:

  2. We can also apply a grid search to extract the best model from the given options. We set our search strategy and hyperparameter options as follows:
# RandomDiscrete samples hyperparameter combinations in random order
search_criteria = {'strategy': "RandomDiscrete"}

hyper_params = {'sample_rate': [0.5, 0.6, 0.7],
                'col_sample_rate_per_tree': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}
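Left unbounded, the RandomDiscrete strategy keeps sampling until every combination has been tried. The search can also be capped by model count or runtime; the limits below are illustrative values, not part of the recipe:

# Optionally cap the random search by model count and/or runtime
search_criteria = {'strategy': "RandomDiscrete",
                   'max_models': 10,         # stop after 10 models
                   'max_runtime_secs': 600,  # or after 10 minutes
                   'seed': 12345}            # reproducible sampling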
  3. We build the model with the preceding search parameters:
from h2o.grid.grid_search import H2OGridSearch

RF_Grid = H2OGridSearch(
    H2ORandomForestEstimator(
        model_id = 'RF_Grid',
        ntrees = 200,
        nfolds = 10,
        stopping_metric = 'AUC',
        stopping_rounds = 25),
    search_criteria = search_criteria,  # RandomDiscrete, defined above
    hyper_params = hyper_params)

RF_Grid.train(x = predictors, y = target, training_frame = train)
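The stopping_metric and stopping_rounds arguments enable early stopping: each model stops adding trees once its AUC no longer improves across the given number of scoring rounds. A stopping_tolerance can make the criterion stricter; the value below is illustrative:

# Stop adding trees once AUC fails to improve by at least 0.1%
# relative to the best of the last 25 scoring rounds
RF_early = H2ORandomForestEstimator(model_id = 'RF_early',
                                    ntrees = 200,
                                    stopping_metric = 'AUC',
                                    stopping_rounds = 25,
                                    stopping_tolerance = 0.001)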
  4. We now sort all of the models by AUC in descending order and then pick the first model, which has the highest AUC:
RF_Grid_sorted = RF_Grid.get_grid(sort_by = 'auc', decreasing = True)
print(RF_Grid_sorted)

# The first model ID in the sorted grid belongs to the best model
best_RF_model = RF_Grid_sorted.model_ids[0]
best_RF_from_RF_Grid = h2o.get_model(best_RF_model)
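To see which hyperparameter combination won, the chosen values can be read back from the winning model; a minimal sketch, assuming the actual_params dictionary that H2O model objects expose:

# Read back the tuned hyperparameter values from the best model
for name in ['sample_rate', 'col_sample_rate_per_tree', 'max_depth']:
    print(name, '=', best_RF_from_RF_Grid.actual_params[name])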
  5. We apply the best model to our test data:
print(best_RF_from_RF_Grid.model_performance(test))
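Beyond the scoring metrics, the same model can generate class predictions and per-class probabilities for the test frame with predict():

# Predicted class plus per-class probabilities for each test row
predictions = best_RF_from_RF_Grid.predict(test)
print(predictions.head())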
  6. We can plot the variable importance from the best model that we have achieved so far:
best_RF_from_RF_Grid.varimp_plot()

This gives us a bar chart of the predictors ranked by their relative importance in the best model.
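If a tabular view is preferred over the plot, the same importances can be pulled into a pandas DataFrame:

# Variable importances as a pandas DataFrame instead of a plot
importances = best_RF_from_RF_Grid.varimp(use_pandas = True)
print(importances.head())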
