ShuffleSplit is one of the simplest cross validation techniques: on each iteration, it randomly samples a portion of the data for training and holds out the rest, repeating for the number of iterations specified. We'll specify the total number of elements in the dataset, and it will take care of the rest. To see why this is useful, we'll walk through an example of estimating the mean of a univariate dataset. This is somewhat similar to resampling, but it illustrates one reason to use cross validation while also demonstrating cross validation itself.
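To see concretely what ShuffleSplit produces, here is a minimal sketch that prints the random train/test index splits for a tiny array. Note that this sketch uses the newer `sklearn.model_selection` import path and keyword arguments (`n_splits`, `test_size`), which differ from the older `sklearn.cross_validation` interface used in the rest of this recipe:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10)

# Three random splits; 30% of the indices are held out each time.
ss = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

for train_idx, test_idx in ss.split(X):
    print("train:", train_idx, "test:", test_idx)
```

Each iteration yields a fresh random partition of the indices, so the same element can appear in the training set of several iterations.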
First, we need to create the dataset. We'll use NumPy to create a dataset, where we know the underlying mean. We'll sample half of the dataset to estimate the mean and see how close it is to the underlying mean:
>>> import numpy as np
>>> true_loc = 1000
>>> true_scale = 10
>>> N = 1000
>>> dataset = np.random.normal(true_loc, true_scale, N)

>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.hist(dataset, color='k', alpha=.65, histtype='stepfilled');
>>> ax.set_title("Histogram of dataset");
>>> f.savefig("978-1-78398-948-5_06_06.png")
Matplotlib will give us the following histogram:
Now, let's hold out the first half of the data and estimate the mean from the second half:
>>> holdout_set = dataset[:N // 2]
>>> fitting_set = dataset[N // 2:]
>>> estimate = fitting_set.mean()

>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.set_title("True Mean vs Regular Estimate")
>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,
              alpha=.65, label='true mean')
>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,
              alpha=.65, label='regular estimate')
>>> ax.set_xlim(999, 1001)
>>> ax.legend()
>>> f.savefig("978-1-78398-948-5_06_07.png")
We'll get the following output:
Now, we can use ShuffleSplit to fit the estimator on several smaller datasets:
>>> from sklearn.cross_validation import ShuffleSplit
>>> shuffle_split = ShuffleSplit(len(fitting_set))

>>> mean_p = []

>>> for train, _ in shuffle_split:
        mean_p.append(fitting_set[train].mean())

>>> shuf_estimate = np.mean(mean_p)

>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,
              alpha=.65, label='true mean')
>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,
              alpha=.65, label='regular estimate')
>>> ax.vlines(shuf_estimate, 0, 1, color='b', linestyles='-', lw=5,
              alpha=.65, label='shufflesplit estimate')
>>> ax.set_title("All Estimates")
>>> ax.set_xlim(999, 1001)
>>> ax.legend(loc=3)
The output will be as follows:
As we can see, the ShuffleSplit estimate is close to the true mean, much like the regular estimate, but it was built by averaging over many random subsamples rather than a single one.
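For readers on newer versions of scikit-learn, the `cross_validation` module used above was removed in version 0.20. The same experiment can be sketched with the `sklearn.model_selection` API; the variable names below mirror those in the recipe, and a fixed random seed is used here purely for reproducibility:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

rng = np.random.RandomState(0)
true_loc, true_scale, N = 1000, 10, 1000
dataset = rng.normal(true_loc, true_scale, N)

# Hold out the first half; fit on the second half, as in the recipe.
fitting_set = dataset[N // 2:]

# Average the sample mean over 100 random half-sized subsamples.
ss = ShuffleSplit(n_splits=100, train_size=0.5, random_state=0)
means = [fitting_set[train].mean() for train, _ in ss.split(fitting_set)]
shuf_estimate = np.mean(means)
print(shuf_estimate)
```

The averaged estimate lands near the true mean of 1000, just as the plot above shows.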