In the previous chapters, you learned about the essential machine learning algorithms for classification and how to get your data into shape before feeding it into those algorithms. Now it's time to learn about the best practices for building good machine learning models by fine-tuning the algorithms and evaluating the models' performance. In this chapter, we will learn how to:
When we applied different preprocessing techniques in the previous chapters, such as standardization for feature scaling in Chapter 4, Building Good Training Sets – Data Preprocessing, or principal component analysis for data compression in Chapter 5, Compressing Data via Dimensionality Reduction, you learned that we have to reuse the parameters that were obtained during the fitting of the training data to scale and compress any new data, for example, the samples in the separate test dataset. In this section, you will learn about an extremely handy tool, the Pipeline class in scikit-learn. It allows us to fit a model including an arbitrary number of transformation steps and apply it to make predictions about new data.
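To make the idea of reusing training parameters concrete, here is a minimal sketch of the manual workflow that a pipeline automates. The arrays `X_train_demo` and `X_test_demo` are synthetic placeholders introduced only for this illustration; they are not the dataset used in this chapter.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the dataset used in this chapter)
rng = np.random.RandomState(1)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
# fit_transform estimates the per-column mean and standard deviation
# from the training data only, then scales the training data
X_train_std = scaler.fit_transform(X_train_demo)
# transform reuses those training parameters; nothing is re-estimated
# from the test data
X_test_std = scaler.transform(X_test_demo)

# The same scaling computed by hand from the stored training parameters
manual_test_std = (X_test_demo - scaler.mean_) / scaler.scale_
print(np.allclose(X_test_std, manual_test_std))
```

Calling `fit` again on the test data would estimate a second, different set of parameters, which is exactly the mistake that chaining the steps in a pipeline helps us avoid.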
In this chapter, we will be working with the Breast Cancer Wisconsin dataset, which contains 569 samples of malignant and benign tumor cells. The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M=malignant, B=benign), respectively. Columns 3-32 contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant. The Breast Cancer Wisconsin dataset has been deposited in the UCI machine learning repository, and more detailed information about this dataset can be found at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).
In this section we will read in the dataset, and split it into training and test datasets in three simple steps:
>>> import pandas as pd
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
Next, we assign the 30 features to a NumPy array X. Using LabelEncoder, we transform the class labels from their original string representation (M and B) into integers:
>>> from sklearn.preprocessing import LabelEncoder
>>> X = df.loc[:, 2:].values
>>> y = df.loc[:, 1].values
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)
After encoding the class labels (diagnosis) in an array y, the malignant tumors are now represented as class 1 and the benign tumors as class 0, which we can illustrate by calling the transform method of LabelEncoder on two dummy class labels:
>>> le.transform(['M', 'B'])
array([1, 0])
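The mapping of M to 1 and B to 0 follows from how LabelEncoder assigns integers: it sorts the unique labels and uses each label's position as its code. The short sketch below shows this on a handful of made-up labels; the list `labels` and the name `le_demo` are hypothetical and stand in for the diagnosis column.

```python
from sklearn.preprocessing import LabelEncoder

# A few hypothetical diagnosis labels, standing in for column 1 of the dataset
labels = ['M', 'B', 'B', 'M', 'B']

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's index is its integer code
print(le_demo.classes_)
print(encoded)

# inverse_transform maps integer codes back to the original strings
print(le_demo.inverse_transform([1, 0]))
```

Inspecting `classes_` is a quick way to confirm which class a predicted integer refers to before interpreting a model's output.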
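Finally, we use train_test_split from scikit-learn's model_selection module to divide the dataset into a training set (80 percent of the samples) and a test set (20 percent):
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = \
...     train_test_split(X, y, test_size=0.20, random_state=1)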
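As a quick sanity check on what such a split produces, the following sketch applies the same call to placeholder arrays that have the dataset's dimensions (569 samples, 30 features). The arrays `X_demo` and `y_demo` contain random values and are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's dimensions (569 samples, 30 features);
# the values are random and purely illustrative
rng = np.random.RandomState(0)
X_demo = rng.rand(569, 30)
y_demo = rng.randint(0, 2, size=569)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1)
# With test_size=0.20, 114 of the 569 samples go to the test set
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same partition
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` keeps the partition reproducible, so that results obtained on the test set can be compared across runs.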
In the previous chapters, you learned that many learning algorithms require input features on the same scale for optimal performance. Thus, we need to standardize the columns in the Breast Cancer Wisconsin dataset before we can feed them to a linear classifier, such as logistic regression. Furthermore, let's assume that we want to compress our data from the initial 30 dimensions onto a lower two-dimensional subspace via principal component analysis (PCA), a feature extraction technique for dimensionality reduction that we introduced in Chapter 5, Compressing Data via Dimensionality Reduction. Instead of going through the fitting and transformation steps for the training and test datasets separately, we can chain the StandardScaler, PCA, and LogisticRegression objects in a pipeline:
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import Pipeline
>>> pipe_lr = Pipeline([('scl', StandardScaler()),
...                     ('pca', PCA(n_components=2)),
...                     ('clf', LogisticRegression(random_state=1))])
>>> pipe_lr.fit(X_train, y_train)
>>> print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
Test Accuracy: 0.947
The Pipeline object takes a list of tuples as input, where the first value in each tuple is an arbitrary identifier string that we can use to access the individual elements in the pipeline, as we will see later in this chapter, and the second element in every tuple is a scikit-learn transformer or estimator.
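As a brief preview of how the identifier strings can be used, the sketch below fits a pipeline with the same three steps on synthetic data and then looks up individual steps by name. The data from `make_classification` and the names `X_demo`, `y_demo`, and `pipe_demo` are stand-ins introduced for this example, not the Breast Cancer Wisconsin data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 30 features (not the Breast Cancer Wisconsin data)
X_demo, y_demo = make_classification(n_samples=200, n_features=30,
                                     random_state=1)

pipe_demo = Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=2)),
                      ('clf', LogisticRegression(random_state=1))])
pipe_demo.fit(X_demo, y_demo)

# named_steps maps each identifier string to the corresponding fitted object
pca_step = pipe_demo.named_steps['pca']
print(pca_step.explained_variance_ratio_)

# The '<name>__<parameter>' convention addresses a parameter of a single step
pipe_demo.set_params(clf__C=10.0)
print(pipe_demo.named_steps['clf'].C)
```

The double-underscore naming is what makes the identifier strings useful in practice: a parameter of any step can be set without rebuilding the pipeline.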
The intermediate steps in a pipeline constitute scikit-learn transformers, and the last step is an estimator. In the preceding code example, we built a pipeline that consisted of two intermediate steps, a StandardScaler and a PCA transformer, and a logistic regression classifier as the final estimator. When we executed the fit method on the pipeline pipe_lr, the StandardScaler performed fit and transform on the training data, and the transformed training data was then passed on to the next object in the pipeline, the PCA. Similar to the previous step, PCA also executed fit and transform on the scaled input data and passed the result to the final element of the pipeline, the estimator. We should note that there is no limit to the number of intermediate steps in a pipeline. The concept of how pipelines work is summarized in the following figure: