ShuffleSplit is one of the simplest cross validation techniques: on each iteration, it randomly samples a portion of the data for training and holds out the rest, repeating for the number of iterations specified. We'll specify the total number of elements in the dataset, and it will take care of the rest. To see why this is useful, we'll walk through an example of estimating the mean of a univariate dataset. This is somewhat similar to resampling, but it illustrates one reason to use cross validation while also demonstrating cross validation itself.
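To see concretely what ShuffleSplit produces, here is a minimal sketch that prints the random train/test index splits for a tiny array. Note that this sketch uses the newer `sklearn.model_selection` import path and keyword arguments (`n_splits`, `test_size`), which differ from the older `sklearn.cross_validation` interface used in the rest of this recipe:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10)

# Three random splits; 30% of the indices are held out each time.
ss = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

for train_idx, test_idx in ss.split(X):
    print("train:", train_idx, "test:", test_idx)
```

Each iteration yields a fresh random partition of the indices, so the same element can appear in the training set of several iterations.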
First, we need to create the dataset. We'll use NumPy to create a dataset, where we know the underlying mean. We'll sample half of the dataset to estimate the mean and see how close it is to the underlying mean:
>>> import numpy as np
>>> true_loc = 1000
>>> true_scale = 10
>>> N = 1000
>>> dataset = np.random.normal(true_loc, true_scale, N)

>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.hist(dataset, color='k', alpha=.65, histtype='stepfilled');
>>> ax.set_title("Histogram of dataset");
>>> f.savefig("978-1-78398-948-5_06_06.png")
Matplotlib will give us the following histogram:
Now, let's hold out the first half of the data and estimate the mean from the second half:
>>> holdout_set = dataset[:N // 2]
>>> fitting_set = dataset[N // 2:]
>>> estimate = fitting_set.mean()

>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.set_title("True Mean vs Regular Estimate")
>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,
              alpha=.65, label='true mean')
>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,
              alpha=.65, label='regular estimate')
>>> ax.set_xlim(999, 1001)
>>> ax.legend()
>>> f.savefig("978-1-78398-948-5_06_07.png")
We'll get the following output:
Now, we can use ShuffleSplit to fit the estimator on several smaller datasets:
>>> from sklearn.cross_validation import ShuffleSplit
>>> shuffle_split = ShuffleSplit(len(fitting_set))

>>> mean_p = []

>>> for train, _ in shuffle_split:
        mean_p.append(fitting_set[train].mean())

>>> shuf_estimate = np.mean(mean_p)

>>> import matplotlib.pyplot as plt
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,
              alpha=.65, label='true mean')
>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,
              alpha=.65, label='regular estimate')
>>> ax.vlines(shuf_estimate, 0, 1, color='b', linestyles='-', lw=5,
              alpha=.65, label='shufflesplit estimate')
>>> ax.set_title("All Estimates")
>>> ax.set_xlim(999, 1001)
>>> ax.legend(loc=3)
The output will be as follows:
As we can see, the ShuffleSplit estimate is close to the true mean, much like the regular estimate, but it was built by averaging over many random subsamples rather than a single one.
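For readers on newer versions of scikit-learn, the `cross_validation` module used above was removed in version 0.20. The same experiment can be sketched with the `sklearn.model_selection` API; the variable names below mirror those in the recipe, and a fixed random seed is used here purely for reproducibility:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

rng = np.random.RandomState(0)
true_loc, true_scale, N = 1000, 10, 1000
dataset = rng.normal(true_loc, true_scale, N)

# Hold out the first half; fit on the second half, as in the recipe.
fitting_set = dataset[N // 2:]

# Average the sample mean over 100 random half-sized subsamples.
ss = ShuffleSplit(n_splits=100, train_size=0.5, random_state=0)
means = [fitting_set[train].mean() for train, _ in ss.split(fitting_set)]
shuf_estimate = np.mean(means)
print(shuf_estimate)
```

The averaged estimate lands near the true mean of 1000, just as the plot above shows.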