There's more...

In the preceding example, we obtained an AUC of 0.76 and a log loss of 0.44.
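Both of these metrics can be read directly off a model performance object. The following is a minimal sketch, assuming RF is the name of the model trained in the main recipe:

perf = RF.model_performance(test)  # RF is assumed from the main recipe
print(perf.auc())      # area under the ROC curve
print(perf.logloss())  # logarithmic loss

We can try to improve on both metrics: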

  1. We can apply cross-validation by passing nfolds as a parameter to H2ORandomForestEstimator():
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

# Train a random forest with 10-fold cross-validation
RF_cv = H2ORandomForestEstimator(model_id = 'RF_cv',
                                 seed = 12345,
                                 ntrees = 500,
                                 sample_rate = 0.9,
                                 col_sample_rate_per_tree = 0.9,
                                 nfolds = 10)

RF_cv.train(x = predictors, y = target, training_frame = train)
print(RF_cv.model_performance(test))
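Alongside the test-set performance printed above, the per-fold cross-validation results can be inspected directly on the trained model; a minimal sketch using H2O's cross-validation summary:

# Aggregated and per-fold metrics for the 10 cross-validation folds
print(RF_cv.cross_validation_metrics_summary())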

We notice that the AUC has slightly improved to 0.77 and that the log loss has dropped to 0.43:

  2. We can also apply a grid search to extract the best model from the given options. We set our search strategy and hyperparameter options as follows:
# RandomDiscrete samples hyperparameter combinations in random order
search_criteria = {'strategy': "RandomDiscrete"}

hyper_params = {'sample_rate': [0.5, 0.6, 0.7],
                'col_sample_rate_per_tree': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}
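Left unbounded, the RandomDiscrete strategy keeps sampling until every combination has been tried. The search can also be capped by model count or runtime; the limits below are illustrative values, not part of the recipe:

# Optionally cap the random search by model count and/or runtime
search_criteria = {'strategy': "RandomDiscrete",
                   'max_models': 10,         # stop after 10 models
                   'max_runtime_secs': 600,  # or after 10 minutes
                   'seed': 12345}            # reproducible sampling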
  3. We build the model with the preceding search parameters:
from h2o.grid.grid_search import H2OGridSearch

RF_Grid = H2OGridSearch(
    H2ORandomForestEstimator(
        model_id = 'RF_Grid',
        ntrees = 200,
        nfolds = 10,
        stopping_metric = 'AUC',
        stopping_rounds = 25),
    search_criteria = search_criteria,  # RandomDiscrete, defined above
    hyper_params = hyper_params)

RF_Grid.train(x = predictors, y = target, training_frame = train)
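The stopping_metric and stopping_rounds arguments enable early stopping: each model stops adding trees once its AUC no longer improves across the given number of scoring rounds. A stopping_tolerance can make the criterion stricter; the value below is illustrative:

# Stop adding trees once AUC fails to improve by at least 0.1%
# relative to the best of the last 25 scoring rounds
RF_early = H2ORandomForestEstimator(model_id = 'RF_early',
                                    ntrees = 200,
                                    stopping_metric = 'AUC',
                                    stopping_rounds = 25,
                                    stopping_tolerance = 0.001)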
  4. We now sort all of the models by AUC in descending order and then pick the first model, which has the highest AUC:
RF_Grid_sorted = RF_Grid.get_grid(sort_by = 'auc', decreasing = True)
print(RF_Grid_sorted)

# The first model ID in the sorted grid belongs to the best model
best_RF_model = RF_Grid_sorted.model_ids[0]
best_RF_from_RF_Grid = h2o.get_model(best_RF_model)
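To see which hyperparameter combination won, the chosen values can be read back from the winning model; a minimal sketch, assuming the actual_params dictionary that H2O model objects expose:

# Read back the tuned hyperparameter values from the best model
for name in ['sample_rate', 'col_sample_rate_per_tree', 'max_depth']:
    print(name, '=', best_RF_from_RF_Grid.actual_params[name])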
  5. We apply the best model to our test data:
print(best_RF_from_RF_Grid.model_performance(test))
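Beyond the scoring metrics, the same model can generate class predictions and per-class probabilities for the test frame with predict():

# Predicted class plus per-class probabilities for each test row
predictions = best_RF_from_RF_Grid.predict(test)
print(predictions.head())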
  6. We can plot the variable importance from the best model that we have achieved so far:
best_RF_from_RF_Grid.varimp_plot()

This gives us a bar chart of the predictors ranked by their relative importance in the best model.
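If a tabular view is preferred over the plot, the same importances can be pulled into a pandas DataFrame:

# Variable importances as a pandas DataFrame instead of a plot
importances = best_RF_from_RF_Grid.varimp(use_pandas = True)
print(importances.head())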
