Now that we've used Pipelines and data transformation techniques, we'll walk through a more complicated example that combines several of the previous recipes into a pipeline.
In this section, we'll show off some more of Pipeline's power. When we used it earlier to impute missing values, it was only a quick taste; we'll chain together multiple preprocessing steps to show how Pipelines can remove extra work.
Let's briefly load the iris dataset and seed it with some missing values:
>>> from sklearn.datasets import load_iris
>>> import numpy as np
>>> iris = load_iris()
>>> iris_data = iris.data
>>> mask = np.random.binomial(1, .25, iris_data.shape).astype(bool)
>>> iris_data[mask] = np.nan
>>> iris_data[:5]
array([[ 5.1,  3.5,  1.4,  nan],
       [ nan,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  nan,  0.2]])
The goal of this recipe is to first impute the missing values of iris_data, and then perform PCA on the corrected dataset. You can imagine (and we'll do it later) that this workflow might need to be split between a training dataset and a holdout set; Pipelines will make this easier, but first we need to take a baby step.
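To preview why the train/holdout split matters, here is a minimal sketch of fitting the pipeline on training data only and applying it to a holdout set. Note that this sketch uses SimpleImputer from sklearn.impute, the replacement for the now-removed preprocessing.Imputer used in the recipe's output; the variable names are our own:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

iris_data = load_iris().data
mask = np.random.binomial(1, .25, iris_data.shape).astype(bool)
iris_data[mask] = np.nan

X_train, X_test = train_test_split(iris_data, random_state=0)

pipe = Pipeline([('imputer', SimpleImputer()), ('pca', PCA())])
pipe.fit(X_train)                   # imputation means and PCA components learned from training data only
X_test_t = pipe.transform(X_test)   # the holdout set is transformed, never refit
```

Because both steps live in one object, there is no way to accidentally fit the imputer or the PCA on the holdout data.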
Let's load the required libraries:
>>> from sklearn import pipeline, preprocessing, decomposition
Next, create instances of the imputer and PCA:
>>> pca = decomposition.PCA()
>>> imputer = preprocessing.Imputer()
Now that we have the objects we need, we can load them into a Pipeline:
>>> pipe = pipeline.Pipeline([('imputer', imputer), ('pca', pca)])
>>> iris_data_transformed = pipe.fit_transform(iris_data)
>>> iris_data_transformed[:5]
array([[ -2.42e+00,  -3.59e-01,  -6.88e-01,  -3.49e-01],
       [ -2.44e+00,  -6.94e-01,   3.27e-01,   4.87e-01],
       [ -2.94e+00,   2.45e-01,  -1.85e-03,   4.37e-02],
       [ -2.79e+00,   4.29e-01,  -8.05e-03,   9.65e-02],
       [ -6.46e-01,   8.87e-01,   7.54e-01,  -5.19e-01]])
This would take a lot more management if we used separate steps. Instead of calling fit_transform on each step individually, the whole sequence is performed in a single call. Not to mention that we only have to keep track of one object!
Hopefully it was obvious, but the steps are passed to the Pipeline object as a list of tuples, with the first element of each tuple being the step's name and the second being the actual transformer object. Under the hood, these steps are looped through when a method such as fit_transform is called on the Pipeline object.
That said, there is a quick and dirty way to create a Pipeline, much in the same way there was a quick way to perform scaling, even though we can use StandardScaler if we want more power. The make_pipeline function will automatically create the names for the Pipeline's steps:
>>> pipe2 = pipeline.make_pipeline(imputer, pca)
>>> pipe2.steps
[('imputer', Imputer(axis=0, copy=True, missing_values='NaN',
     strategy='mean', verbose=0)),
 ('pca', PCA(copy=True, n_components=None, whiten=False))]
This pipeline performs the same transformation as the one created with the more verbose method:
>>> iris_data_transformed2 = pipe2.fit_transform(iris_data)
>>> iris_data_transformed2[:5]
array([[ -2.42e+00,  -3.59e-01,  -6.88e-01,  -3.49e-01],
       [ -2.44e+00,  -6.94e-01,   3.27e-01,   4.87e-01],
       [ -2.94e+00,   2.45e-01,  -1.85e-03,   4.37e-02],
       [ -2.79e+00,   4.29e-01,  -8.05e-03,   9.65e-02],
       [ -6.46e-01,   8.87e-01,   7.54e-01,  -5.19e-01]])
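We can verify the equivalence directly by fitting both pipelines on the same data and comparing their outputs. As before, this sketch substitutes SimpleImputer for the deprecated Imputer, and the small array is hypothetical; note that make_pipeline derives step names by lowercasing the class names:

```python
import numpy as np
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

X = np.array([[1., np.nan],
              [2., 3.],
              [np.nan, 4.],
              [5., 6.]])

verbose_pipe = Pipeline([('simpleimputer', SimpleImputer()), ('pca', PCA())])
auto_pipe = make_pipeline(SimpleImputer(), PCA())

# make_pipeline names each step after its lowercased class name
print(auto_pipe.steps[0][0])   # 'simpleimputer'

out_verbose = verbose_pipe.fit_transform(X)
out_auto = auto_pipe.fit_transform(X)
```

Both pipelines are built from freshly constructed steps here, so their outputs on identical input agree to floating-point precision.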
We just walked through Pipelines at a very high level, but it's unlikely that we will always want to apply only the base transformations. Therefore, the parameters of each object in the Pipeline can be changed with the set_params method, where the parameter name follows the <step_name>__<parameter_name> convention (note the double underscore). For example, let's change the pca object to use two components:
>>> pipe2.set_params(pca__n_components=2)
Pipeline(steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN',
     strategy='mean', verbose=0)),
 ('pca', PCA(copy=True, n_components=2, whiten=False))])
Notice n_components=2 in the preceding output. Just as a test, we can run the same transformation we have already performed twice, and the output will now be an N x 2 matrix:
>>> iris_data_transformed3 = pipe2.fit_transform(iris_data)
>>> iris_data_transformed3[:5]
array([[-2.42, -0.36],
       [-2.44, -0.69],
       [-2.94,  0.24],
       [-2.79,  0.43],
       [-0.65,  0.89]])
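The shape change is easy to confirm programmatically. The sketch below again assumes SimpleImputer in place of the deprecated Imputer and uses a random hypothetical array rather than the iris data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

X = np.random.rand(10, 4)   # 10 samples, 4 features

pipe = make_pipeline(SimpleImputer(), PCA())
pipe.set_params(pca__n_components=2)   # double underscore: <step_name>__<parameter_name>
X_t = pipe.fit_transform(X)
print(X_t.shape)   # (10, 2): N rows in, 2 components out
```

The same double-underscore convention is what lets GridSearchCV tune parameters of individual steps inside a Pipeline.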