K-fold cross validation

In this recipe, we'll look at quite possibly the most important post-model validation exercise: cross validation. There are several varieties of cross validation, each with a slightly different randomization scheme; k-fold is perhaps the best known of them.

Getting ready

We'll create some data and then fit a classifier on the different folds. It's worth mentioning that if you can keep a holdout set, that's best. For example, suppose we have a dataset where N = 1000. If we hold out 200 data points, we can then use cross validation on the other 800 points to determine the best parameters.
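As a plain-Python sketch of this holdout idea (the data here is just a stand-in for the 1,000 points in the example):

```python
# Hold out 200 of 1,000 points before doing any cross validation.
data = list(range(1000))      # stand-in for 1,000 data points
holdout = 200

holdout_set = data[:holdout]  # reserved for a final, untouched check
working_set = data[holdout:]  # used for cross validation

print(len(holdout_set), len(working_set))  # 200 800
```

Only the working set is ever shown to the cross validation loop; the holdout set is scored once, at the very end.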

How to do it...

First, we'll create some fake data, then we'll examine the parameters, and finally, we'll look at the size of the resulting dataset:

>>> N = 1000
>>> holdout = 200

>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(N, shuffle=True)

Now that we have the data, let's hold out 200 points, and then go through the fold scheme like we normally would:

>>> X_h, y_h = X[:holdout], y[:holdout]
>>> X_t, y_t = X[holdout:], y[holdout:]

>>> from sklearn.cross_validation import KFold

KFold gives us the option of choosing how many folds we want, whether we want the values to be indices or Booleans, whether we want to shuffle the dataset, and finally, the random state (mainly for reproducibility). The indices argument will actually be removed in later versions of scikit-learn, where it is assumed to be True.

Let's create the cross validation object:

>>> kfold = KFold(len(y_t), n_folds=4)

Now, we can iterate through the k-fold object:

>>> output_string = "Fold: {}, N_train: {}, N_test: {}"

>>> for i, (train, test) in enumerate(kfold):
       print output_string.format(i, len(y_t[train]), len(y_t[test]))

Fold: 0, N_train: 600, N_test: 200
Fold: 1, N_train: 600, N_test: 200
Fold: 2, N_train: 600, N_test: 200
Fold: 3, N_train: 600, N_test: 200

Each iteration should return the same split size.
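Note that in scikit-learn 0.18 and later, KFold moved to sklearn.model_selection and the constructor no longer takes the number of samples; you pass n_splits and call .split() on the data instead. A sketch of the equivalent modern call, assuming that newer API (the array here is a stand-in for our 800 training points):

```python
from sklearn.model_selection import KFold  # modern location (scikit-learn 0.18+)
import numpy as np

X_t = np.arange(800).reshape(-1, 1)  # stand-in for the 800 non-holdout points

kfold = KFold(n_splits=4)
for i, (train, test) in enumerate(kfold.split(X_t)):
    print("Fold: {}, N_train: {}, N_test: {}".format(i, len(train), len(test)))
```

The loop body is unchanged; only the construction and the .split() call differ from the older API shown above.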

How it works...

It's probably clear, but k-fold works by iterating through the folds, holding out 1/n_folds * N data points for testing on each iteration, where N for us was len(y_t).
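A minimal sketch of what this fold bookkeeping amounts to; this is an illustration of the idea, not scikit-learn's actual implementation:

```python
def kfold_indices(n, n_folds):
    """Yield (train, test) index lists; each test fold gets roughly n / n_folds points."""
    # Distribute any remainder across the first few folds.
    fold_sizes = [n // n_folds + (1 if i < n % n_folds else 0)
                  for i in range(n_folds)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for i, (train, test) in enumerate(kfold_indices(800, 4)):
    print("Fold: {}, N_train: {}, N_test: {}".format(i, len(train), len(test)))
# Fold: 0, N_train: 600, N_test: 200
# Fold: 1, N_train: 600, N_test: 200
# Fold: 2, N_train: 600, N_test: 200
# Fold: 3, N_train: 600, N_test: 200
```

Every point lands in exactly one test fold, and the train/test sizes match the output we saw above.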

From a Python perspective, the cross validation objects are iterators that can be looped over with the in operator. Oftentimes, it's useful to write a wrapper around a cross validation object that iterates over a subset of the data. For example, we may have a dataset with repeated measures for each data point, or a dataset of patients where each patient has multiple measurements.

We're going to mix it up and use pandas for this part:

>>> import numpy as np
>>> import pandas as pd

>>> patients = np.repeat(np.arange(0, 100, dtype=np.int8), 8)

>>> measurements = pd.DataFrame({'patient_id': patients, 
                   'ys': np.random.normal(0, 1, 800)})

Now that we have the data, we want to hold out certain patients instead of individual data points:

>>> custids = np.unique(measurements.patient_id)
>>> customer_kfold = KFold(custids.size, n_folds=4)

>>> output_string = "Fold: {}, N_train: {}, N_test: {}"

>>> for i, (train, test) in enumerate(customer_kfold):
       train_cust_ids = custids[train]
       training = measurements[measurements.patient_id.isin(
                  train_cust_ids)]
       testing = measurements[~measurements.patient_id.isin(
                 train_cust_ids)]
       print output_string.format(i, len(training), len(testing))

Fold: 0, N_train: 600, N_test: 200
Fold: 1, N_train: 600, N_test: 200
Fold: 2, N_train: 600, N_test: 200
Fold: 3, N_train: 600, N_test: 200
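The same group-wise idea can be sketched without pandas; note that modern scikit-learn also ships a GroupKFold class in sklearn.model_selection for exactly this case. A plain-Python illustration with the same shape as the example above (the function name and its remainder handling are our own, not scikit-learn's):

```python
# 100 patients, 8 measurements each; we split by patient, not by row.
patients = [p for p in range(100) for _ in range(8)]

def group_kfold(group_ids, n_folds):
    """Yield (train_rows, test_rows), keeping all rows of a group on one side."""
    unique = sorted(set(group_ids))
    fold_size = len(unique) // n_folds  # sketch: drops remainder groups if uneven
    for i in range(n_folds):
        test_groups = set(unique[i * fold_size:(i + 1) * fold_size])
        train = [r for r, g in enumerate(group_ids) if g not in test_groups]
        test = [r for r, g in enumerate(group_ids) if g in test_groups]
        yield train, test

for i, (train, test) in enumerate(group_kfold(patients, 4)):
    print("Fold: {}, N_train: {}, N_test: {}".format(i, len(train), len(test)))
# Fold: 0, N_train: 600, N_test: 200
# Fold: 1, N_train: 600, N_test: 200
# Fold: 2, N_train: 600, N_test: 200
# Fold: 3, N_train: 600, N_test: 200
```

Because 25 patients with 8 measurements each go into every test fold, no patient's measurements are ever split between train and test.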