Stratified k-fold

In this recipe, we'll quickly look at stratified k-fold validation. We've walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is nice because its scheme is specifically designed to maintain the class proportions.

Getting ready

We're going to create a small dataset and then use stratified k-fold validation on it. We want it small so that we can see the variation. For larger samples, it probably won't be as big of a deal.

We'll then plot the class proportions at each step to illustrate how the class proportions are maintained:

>>> from sklearn import datasets
>>> X, y = datasets.make_classification(n_samples=int(1e3), weights=[1./11])

Let's check the overall class weight distribution:

>>> y.mean()

0.90300000000000002

Roughly 90.3 percent of the samples are 1, with the balance 0.

How to do it...

Let's create a stratified k-fold object and iterate through each fold. We'll measure the proportion of y values that are 1 in each training set. After that, we'll plot the proportion of classes by the split number to see how and if it changes. This code will hopefully illustrate how this is beneficial. We'll also compare it against a basic ShuffleSplit:

>>> from sklearn import cross_validation

>>> n_folds = 50

>>> strat_kfold = cross_validation.StratifiedKFold(y, n_folds=n_folds)
>>> shuff_split = cross_validation.ShuffleSplit(n=len(y), n_iter=n_folds)

>>> kfold_y_props = []
>>> shuff_y_props = []

>>> for (k_train, k_test), (s_train, s_test) in zip(strat_kfold, shuff_split):
       kfold_y_props.append(y[k_train].mean())
       shuff_y_props.append(y[s_train].mean())

Now, let's plot the proportions over each fold:

>>> import matplotlib.pyplot as plt

>>> f, ax = plt.subplots(figsize=(7, 5))

>>> ax.plot(range(n_folds), shuff_y_props, label="ShuffleSplit", 
            color='k')
>>> ax.plot(range(n_folds), kfold_y_props, label="Stratified", 
            color='k', ls='--')
>>> ax.set_title("Comparing class proportions.")

>>> ax.legend(loc='best')
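Note that if you're running this as a script rather than in an interactive session, you may also need to render the figure explicitly:

>>> plt.show()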

The output will be as follows:

[Plot: Comparing class proportions, with the ShuffleSplit (solid) and Stratified (dashed) per-fold proportions]

We can see that the class proportion for stratified k-fold stays stable across folds, whereas the ShuffleSplit proportion varies from split to split.
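If you'd rather quantify this stability than read it off the plot, a minimal sketch (reusing the kfold_y_props and shuff_y_props lists we just built) is to compare the spread of the per-fold proportions:

>>> import numpy as np

>>> np.std(kfold_y_props), np.std(shuff_y_props)

The first value should come out noticeably smaller than the second, which is exactly what the flat dashed line in the plot reflects.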

How it works...

Stratified k-fold works by looking at the y values: it first computes the overall proportion of each class, and then splits the training and test sets so that each fold preserves those proportions. This generalizes to multiple classes:

>>> import numpy as np

>>> three_classes = np.random.choice([1,2,3], p=[.1, .4, .5], 
                    size=1000)

>>> import itertools as it

>>> for train, test in cross_validation.StratifiedKFold(three_classes, 5):
       print np.bincount(three_classes[train])

[  0  90 314 395]
[  0  90 314 395]
[  0  90 314 395]
[  0  91 315 395]
[  0  91 315 396]

As we can see, each training set contains roughly the same number of samples from each class, so the original class proportions are preserved across the training and testing splits.
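As a side note, the cross_validation module used throughout this recipe was deprecated in later scikit-learn releases. If you're on version 0.18 or newer, the equivalent splitter lives in sklearn.model_selection and takes the data in its split method rather than in its constructor. A rough sketch of the three-class check above (with print written as a function so it also works on Python 3) would be:

>>> from sklearn.model_selection import StratifiedKFold

>>> skf = StratifiedKFold(n_splits=5)
>>> # split() only uses X for its length, so a dummy array works here
>>> for train, test in skf.split(np.zeros((len(three_classes), 1)), three_classes):
       print(np.bincount(three_classes[train]))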
