Stratified k-fold

In this recipe, we'll quickly look at stratified k-fold validation. We've walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is nice because its scheme is specifically designed to maintain the class proportions.

Getting ready

We're going to create a small dataset and then use stratified k-fold validation on it. We want it small so that we can see the variation. For larger samples, it probably won't be as big of a deal.

We'll then plot the class proportions at each step to illustrate how the class proportions are maintained:

>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=int(1e3), weights=[1./11])

Let's check the overall class weight distribution:

>>> y.mean()

0.90300000000000002

Roughly 90.3 percent of the samples are 1, with the balance 0.

How to do it...

Let's create a stratified k-fold object and iterate through each fold. We'll measure the proportion of y values that are 1 in each training set. After that, we'll plot the proportion of classes by the split number to see how and if it changes. This code will hopefully illustrate how this is beneficial. We'll also compare it against a basic ShuffleSplit:

>>> from sklearn import cross_validation

>>> n_folds = 50

>>> strat_kfold = cross_validation.StratifiedKFold(y, n_folds=n_folds)
>>> shuff_split = cross_validation.ShuffleSplit(n=len(y), n_iter=n_folds)

>>> kfold_y_props = []
>>> shuff_y_props = []

>>> for (k_train, k_test), (s_train, s_test) in zip(strat_kfold, shuff_split):
       kfold_y_props.append(y[k_train].mean())
       shuff_y_props.append(y[s_train].mean())

Now, let's plot the proportions over each fold:

>>> import matplotlib.pyplot as plt

>>> f, ax = plt.subplots(figsize=(7, 5))

>>> ax.plot(range(n_folds), shuff_y_props, label="ShuffleSplit", 
            color='k')
>>> ax.plot(range(n_folds), kfold_y_props, label="Stratified", 
            color='k', ls='--')
>>> ax.set_title("Comparing class proportions.")

>>> ax.legend(loc='best')
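Note that if you're running this as a script rather than in an interactive session, you may also need to render the figure explicitly:

>>> plt.show()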

The output will be as follows:

[Plot: Comparing class proportions, with the ShuffleSplit (solid) and Stratified (dashed) per-fold proportions]

We can see that the class proportion for stratified k-fold stays stable across folds, whereas the ShuffleSplit proportion varies from split to split.
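If you'd rather quantify this stability than read it off the plot, a minimal sketch (reusing the kfold_y_props and shuff_y_props lists we just built) is to compare the spread of the per-fold proportions:

>>> import numpy as np

>>> np.std(kfold_y_props), np.std(shuff_y_props)

The first value should come out noticeably smaller than the second, which is exactly what the flat dashed line in the plot reflects.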

How it works...

Stratified k-fold works by looking at the y values: it first computes the overall proportion of each class, and then splits the training and test sets so that each fold preserves those proportions. This generalizes to multiple classes:

>>> import numpy as np

>>> three_classes = np.random.choice([1,2,3], p=[.1, .4, .5], 
                    size=1000)

>>> import itertools as it

>>> for train, test in cross_validation.StratifiedKFold(three_classes, 5):
       print np.bincount(three_classes[train])

[  0  90 314 395]
[  0  90 314 395]
[  0  90 314 395]
[  0  91 315 395]
[  0  91 315 396]

As we can see, each training set contains roughly the same number of samples from each class, so the original class proportions are preserved across the training and testing splits.
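As a side note, the cross_validation module used throughout this recipe was deprecated in later scikit-learn releases. If you're on version 0.18 or newer, the equivalent splitter lives in sklearn.model_selection and takes the data in its split method rather than in its constructor. A rough sketch of the three-class check above (with print written as a function so it also works on Python 3) would be:

>>> from sklearn.model_selection import StratifiedKFold

>>> skf = StratifiedKFold(n_splits=5)
>>> # split() only uses X for its length, so a dummy array works here
>>> for train, test in skf.split(np.zeros((len(three_classes), 1)), three_classes):
       print(np.bincount(three_classes[train]))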
