Estimating probabilities from an ensemble

Random Forests offer a wide range of advantages, and they are often considered the first algorithm you should try on your data to get an idea of the results that can be obtained. This is because Random Forests do not have many hyperparameters to tune, and they work well out of the box. They handle multiclass problems naturally. Moreover, Random Forests offer a way to estimate the importance of variables (useful for insight or feature selection), and they help in estimating the similarity between examples, since similar cases should end up in the same terminal leaves of many trees in the ensemble.
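For instance, reading the importance of each variable only requires accessing the feature_importances_ attribute of a fitted forest. The following is a minimal sketch on synthetic data (the dataset here is purely illustrative):

In: from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# Synthetic data used only for illustration
X, y = make_classification(n_samples=1000, n_informative=4, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Impurity-based importance of each variable (values sum to 1)
for idx, importance in enumerate(forest.feature_importances_):
    print('feature %i: %0.3f' % (idx, importance))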

However, in classification problems, the algorithm lacks the capability of predicting reliable probabilities for an outcome (unless calibrated using the probability calibration offered in scikit-learn by CalibratedClassifierCV). In classification problems, it often does not suffice to predict a response label; we also need the probability associated with it (how likely it is to be true; that is, the confidence of the prediction). This is particularly useful for multiclass problems, since the right answer may be the second or third most probable one (therefore, probabilities provide a ranking of the answers).
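As a quick illustration of the ranking idea, the following sketch orders the classes of a single prediction from the most to the least probable (the probability values and class names here are made up):

In: import numpy as np
# A hypothetical row of predict_proba output for one example
probas = np.array([0.15, 0.55, 0.30])
class_names = ['class_A', 'class_B', 'class_C']
# Sort the class indices by decreasing probability
for rank, idx in enumerate(np.argsort(probas)[::-1], start=1):
    print('%i) %s (p=%0.2f)' % (rank, class_names[idx], probas[idx]))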

The problem is that, when Random Forests are asked to estimate the probability of the response classes, they simply report how many times an example has been classified into each class across the ensemble, divided by the total number of trees in the ensemble itself. Such a ratio does not correspond to the correct probability; it is a biased estimate (the predicted probability is only correlated with the true one; it does not represent it in a numerically correct way).
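The following sketch, again on synthetic data, shows what this vote ratio looks like: it counts the hard votes of the single trees stored in estimators_ and compares them with the output of predict_proba (note that scikit-learn actually averages the per-tree leaf frequencies, which for fully grown trees is almost the same as counting votes):

In: import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
sample = X[:1]
# Hard vote of every tree of the ensemble for a single example
votes = np.array([tree.predict(sample)[0] for tree in forest.estimators_])
vote_ratio = np.array([np.mean(votes == k) for k in range(len(forest.classes_))])
print(vote_ratio)                       # share of trees voting for each class
print(forest.predict_proba(sample)[0])  # the probability reported by the forest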

To help Random Forests, and other algorithms affected by the same problem (such as Naive Bayes or linear SVM), emit correct response probabilities, the CalibratedClassifierCV wrapper class has been introduced in scikit-learn.

CalibratedClassifierCV remaps the response of a machine learning algorithm to probabilities using two methods: Platt's scaling and isotonic regression (the latter is a better-performing non-parametric method, on the condition that you have enough examples, that is, at least 1,000). Both approaches fit a sort of second-level model whose only purpose is to map the original response of an algorithm to the expected probabilities. The results can be visualized by plotting the original probability estimates against the calibrated ones.
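In terms of code, switching between the two approaches is just a matter of the method parameter. Here is a minimal sketch on toy data (the dataset is only illustrative; the Covertype example follows):

In: from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=2000, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
# Platt's scaling: fits a sigmoid mapping on top of the raw scores
platt = CalibratedClassifierCV(forest, method='sigmoid', cv=5).fit(X, y)
# Isotonic regression: non-parametric, needs at least ~1,000 examples
isotonic = CalibratedClassifierCV(forest, method='isotonic', cv=5).fit(X, y)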

As an example, here we refit our model on the Covertype problem using CalibratedClassifierCV:

In: import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.calibration import calibration_curve
# Wrap the Random Forest into the calibration procedure
hypothesis = RandomForestClassifier(n_estimators=100, random_state=101)
calibration = CalibratedClassifierCV(hypothesis, method='sigmoid', cv=5)
# First 15,000 examples for training, the following 10,000 for testing
covertype_X = covertype_dataset.data[:15000,:]
covertype_y = covertype_dataset.target[:15000]
covertype_test_X = covertype_dataset.data[15000:25000,:]
covertype_test_y = covertype_dataset.target[15000:25000]

To evaluate the behavior of the calibration, we prepare a test set made of 10,000 examples that we do not use for training. Our calibration model is based on Platt's method (method='sigmoid') and uses five cross-validation folds to tune the calibration:

In: hypothesis.fit(covertype_X, covertype_y)
calibration.fit(covertype_X, covertype_y)
# Estimate the class probabilities on the held-out test set with both models
prob_raw = hypothesis.predict_proba(covertype_test_X)
prob_cal = calibration.predict_proba(covertype_test_X)

After fitting both the raw and the calibrated model, we estimate the probabilities and plot them in a scatterplot to highlight the differences. Projecting the estimated probabilities for the Ponderosa Pine class, it appears that the original Random Forests probabilities (actually, percentages of votes) have been rescaled to resemble a logistic curve. Let's write some code to explore the kind of changes that calibration brings to the probability outputs:

In: %matplotlib inline
# Column index of the Ponderosa Pine class
tree_kind = covertypes.index('Ponderosa Pine')
probs = pd.DataFrame(list(zip(prob_raw[:, tree_kind],
                              prob_cal[:, tree_kind])),
                     columns=['raw', 'calibrated'])
plot = probs.plot(kind='scatter', x='raw', y='calibrated', s=64,
                  c='blue', edgecolors='white')

Calibration does not change the performance of the model but, by reshaping the probability output, it helps you obtain probabilities that correspond more closely to those observed in your training data. In the resulting scatterplot, you can observe how the calibration procedure has modified the original probabilities by adding some non-linearity as a correction.
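A more direct way to check the effect is to draw reliability curves with the calibration_curve function we imported earlier, binarizing the target on the same class. The following is a sketch that reuses the variables defined above:

In: # Binary target: is the test example a Ponderosa Pine or not?
is_pine = (covertype_test_y == hypothesis.classes_[tree_kind]).astype(int)
frac_pos_raw, mean_pred_raw = calibration_curve(is_pine,
                                                prob_raw[:, tree_kind], n_bins=10)
frac_pos_cal, mean_pred_cal = calibration_curve(is_pine,
                                                prob_cal[:, tree_kind], n_bins=10)
plt.plot([0, 1], [0, 1], 'k--', label='perfectly calibrated')
plt.plot(mean_pred_raw, frac_pos_raw, 's-', label='raw Random Forests')
plt.plot(mean_pred_cal, frac_pos_cal, 'o-', label='calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.legend(loc='best')
plt.show()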

