Putting it all together with Pipelines

Now that we've used Pipelines and data transformation techniques, we'll walk through a more complicated example that combines several of the previous recipes into a pipeline.

Getting ready

In this section, we'll show off some more of Pipeline's power. When we used it earlier to impute missing values, it was only a quick taste; we'll chain together multiple preprocessing steps to show how Pipelines can remove extra work.

Let's briefly load the iris dataset and seed it with some missing values:

>>> from sklearn.datasets import load_iris
>>> import numpy as np

>>> iris = load_iris()
>>> iris_data = iris.data

>>> mask = np.random.binomial(1, .25, iris_data.shape).astype(bool)
>>> iris_data[mask] = np.nan
>>> iris_data[:5]
array([[ 5.1,  3.5,  1.4,  nan],
       [ nan,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  nan,  0.2]])

How to do it...

The goal of this recipe is to first impute the missing values of iris_data and then perform PCA on the corrected dataset. You can imagine (and we'll do it later) that this workflow might need to be split between a training dataset and a holdout set; Pipelines will make this easier, but first we need to take a baby step.

Let's load the required libraries:

>>> from sklearn import pipeline, preprocessing, decomposition

Next, create the imputer and PCA objects:

>>> pca = decomposition.PCA()
>>> imputer = preprocessing.Imputer()

Now that we have the objects we need, we can load them into a Pipeline:

>>> pipe = pipeline.Pipeline([('imputer', imputer), ('pca', pca)])
>>> iris_data_transformed = pipe.fit_transform(iris_data)
>>> iris_data_transformed[:5]
array([[ -2.42e+00,  -3.59e-01,  -6.88e-01,  -3.49e-01],
       [ -2.44e+00,  -6.94e-01,   3.27e-01,   4.87e-01], 
       [ -2.94e+00,   2.45e-01,  -1.85e-03,   4.37e-02], 
       [ -2.79e+00,   4.29e-01,  -8.05e-03,   9.65e-02], 
       [ -6.46e-01,   8.87e-01,   7.54e-01,  -5.19e-01]])

This would take a lot more management if we used separate steps. Instead of each step requiring its own fit and transform, the whole chain is handled with a single fit_transform call. Not to mention that we only have to keep track of one object!
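For comparison, here is a sketch of what the same workflow looks like without a Pipeline; the intermediate names imputed and manual_transformed are just illustrative, and because both versions fit the same estimators on the same data, the results should match:

>>> imputed = imputer.fit_transform(iris_data)        # step 1 by hand
>>> manual_transformed = pca.fit_transform(imputed)   # step 2 by hand
>>> np.allclose(manual_transformed, iris_data_transformed)
True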

How it works...

Hopefully it was obvious, but each step is passed to the Pipeline object via a list of tuples, with the first element of each tuple being the step's name and the second being the actual estimator object.

Under the hood, these steps are looped through when a method such as fit_transform is called on the Pipeline object.
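To make that concrete, here is a purely illustrative sketch of that loop; it is not scikit-learn's actual implementation (which also handles cases such as a final estimator that doesn't transform), and the function name toy_fit_transform is made up for this example:

>>> def toy_fit_transform(steps, X):
...     # fit each step on the output of the previous one and
...     # pass the transformed data along the chain
...     for name, step in steps:
...         X = step.fit_transform(X)
...     return X
>>> toy_result = toy_fit_transform(pipe.steps, iris_data)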

That said, there is a quick and dirty way to create a Pipeline, much in the same way there was a quick way to perform scaling (even though we can use StandardScaler if we want more power). The make_pipeline function will automatically create the names for the Pipeline's steps:

>>> pipe2 = pipeline.make_pipeline(imputer, pca)
>>> pipe2.steps
[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)),
('pca', PCA(copy=True, n_components=None, whiten=False))]

This creates the same kind of object as the more verbose method and produces the same transformation:

>>> iris_data_transformed2 = pipe2.fit_transform(iris_data)
>>> iris_data_transformed2[:5]
array([[ -2.42e+00,  -3.59e-01,  -6.88e-01,  -3.49e-01], 
       [ -2.44e+00,  -6.94e-01,   3.27e-01,   4.87e-01], 
       [ -2.94e+00,   2.45e-01,  -1.85e-03,   4.37e-02], 
       [ -2.79e+00,   4.29e-01,  -8.05e-03,   9.65e-02], 
       [ -6.46e-01,   8.87e-01,   7.54e-01,  -5.19e-01]])

There's more...

We just walked through Pipelines at a very high level, but it's unlikely that we will want to run every step with its default parameters. Therefore, the attributes of each object in the Pipeline can be set with the set_params method, where the parameter name follows the <step_name>__<parameter_name> convention. For example, let's change the pca step to use two components:

>>> pipe2.set_params(pca__n_components=2)
Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, 
         missing_values='NaN', strategy='mean', verbose=0)), 
         ('pca', PCA(copy=True, n_components=2, whiten=False))])

Tip

The __ notation is pronounced as dunder in the Python community.

Notice n_components=2 in the preceding output. Just as a test, let's run the same transformation we've already performed twice; this time, the output will be an N x 2 matrix:

>>> iris_data_transformed3 = pipe2.fit_transform(iris_data)
>>> iris_data_transformed3[:5]
array([[-2.42, -0.36], 
       [-2.44, -0.69], 
       [-2.94,  0.24], 
       [-2.79,  0.43], 
       [-0.65,  0.89]])
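
As mentioned in the How to do it... section, the real payoff comes when a workflow has to be fit on training data and then applied to held-out data. The following is a minimal sketch using a crude positional split purely for illustration (a proper train/test split comes later); only the fit call learns anything, and transform simply reuses the fitted imputation means and PCA components:

>>> train, holdout = iris_data[:100], iris_data[100:]   # crude split, for illustration only
>>> pipe2.fit(train)
Pipeline(steps=[('imputer', Imputer(axis=0, copy=True,
         missing_values='NaN', strategy='mean', verbose=0)),
         ('pca', PCA(copy=True, n_components=2, whiten=False))])
>>> pipe2.transform(holdout).shape
(50, 2)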