In the previous chapters, you learned about the essential machine learning algorithms for classification and how to get our data into shape before we feed it into those algorithms. Now, it's time to learn about the best practices of building good machine learning models by fine-tuning the algorithms and evaluating the model's performance! In this chapter, we will learn how to do the following:
When we applied different preprocessing techniques in the previous chapters, such as standardization for feature scaling in Chapter 4, Building Good Training Sets – Data Preprocessing, or principal component analysis for data compression in Chapter 5, Compressing Data via Dimensionality Reduction, you learned that we have to reuse the parameters that were obtained during the fitting of the training data to scale and compress any new data, such as the samples in the separate test dataset. In this section, you will learn about an extremely handy tool, the Pipeline class in scikit-learn. It allows us to fit a model including an arbitrary number of transformation steps and apply it to make predictions about new data.
In this chapter, we will be working with the Breast Cancer Wisconsin dataset, which contains 569 samples of malignant and benign tumor cells. The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnoses (M = malignant, B = benign), respectively. Columns 3-32 contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant. The Breast Cancer Wisconsin dataset has been deposited in the UCI Machine Learning Repository, and more detailed information about this dataset can be found at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).
You can find a copy of the breast cancer dataset (and all other datasets used in this book) in the code bundle of this book, which you can use if you are working offline or if the UCI server at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data is temporarily unavailable. For instance, to load the Breast Cancer Wisconsin dataset from a local directory, take the following lines:
df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                 'machine-learning-databases'
                 '/breast-cancer-wisconsin/wdbc.data',
                 header=None)
Replace the preceding lines with this:
df = pd.read_csv('your/local/path/to/wdbc.data', header=None)
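If we would rather not edit the code by hand whenever the server is unreachable, one option is to try several sources in order. The following is only a minimal sketch of that idea: the helper name read_first_available is hypothetical (it is not part of pandas), and LOCAL_PATH is the placeholder path from the text, which has to be replaced with a real location.

```python
import pandas as pd

REMOTE_URL = ('https://archive.ics.uci.edu/ml/'
              'machine-learning-databases'
              '/breast-cancer-wisconsin/wdbc.data')
LOCAL_PATH = 'your/local/path/to/wdbc.data'  # placeholder, adjust to your setup


def read_first_available(sources):
    """Hypothetical helper: return a DataFrame from the first readable source."""
    last_error = None
    for source in sources:
        try:
            # The file has no header row, so we pass header=None
            return pd.read_csv(source, header=None)
        except OSError as err:
            # OSError covers missing local files as well as URL errors
            last_error = err
    raise FileNotFoundError(
        f'none of the sources could be read: {sources}') from last_error


# Usage (not executed here, since it needs network access or a local copy):
# df = read_first_available([REMOTE_URL, LOCAL_PATH])
```

Listing the remote URL first keeps the default behavior of the chapter's code, and the local copy is only used when the download fails.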
In this section, we will read in the dataset and split it into training and test datasets in three simple steps:
First, we read in the dataset directly from the UCI website using pandas:
>>> import pandas as pd
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/'
...                  'machine-learning-databases'
...                  '/breast-cancer-wisconsin/wdbc.data',
...                  header=None)
Next, we assign the 30 features to a NumPy array X. Using a LabelEncoder object, we transform the class labels from their original string representation ('M' and 'B') into integers:
>>> from sklearn.preprocessing import LabelEncoder
>>> X = df.loc[:, 2:].values
>>> y = df.loc[:, 1].values
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)
>>> le.classes_
array(['B', 'M'], dtype=object)
After encoding the class labels (diagnosis) in an array y, the malignant tumors are now represented as class 1, and the benign tumors are represented as class 0. We can double-check this mapping by calling the transform method of the fitted LabelEncoder on two dummy class labels:
>>> le.transform(['M', 'B'])
array([1, 0])
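To see why 'M' ends up as 1 and 'B' as 0, it can help to run the same slicing and encoding on a tiny table. The sketch below uses a made-up four-row DataFrame as a stand-in for the real file (the feature values and the names toy, X_toy, y_toy, and le_toy are illustrative assumptions). LabelEncoder sorts the distinct labels, and the position of each label in classes_ becomes its integer code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real file: column 0 = ID, column 1 = diagnosis,
# columns 2-4 = made-up feature values (the real dataset has 30 features).
toy = pd.DataFrame([
    [101, 'M', 17.9, 10.4, 122.8],
    [102, 'B', 13.5, 14.4, 87.5],
    [103, 'B', 12.4, 15.7, 82.6],
    [104, 'M', 20.6, 17.8, 132.9],
])

X_toy = toy.loc[:, 2:].values  # label-based slice: column 2 through the last
y_raw = toy.loc[:, 1].values   # string labels 'M' and 'B'

le_toy = LabelEncoder()
y_toy = le_toy.fit_transform(y_raw)

print(X_toy.shape)                       # (4, 3)
print(le_toy.classes_)                   # sorted labels: 'B' first, then 'M'
print(y_toy)                             # [1 0 0 1]
print(le_toy.inverse_transform([0, 1]))  # maps the codes back to 'B' and 'M'
```

The inverse_transform call is handy later on when we want to report predictions using the original string labels.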
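Finally, we divide the dataset into a separate training dataset (80 percent of the data) and a separate test dataset (20 percent of the data):
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = \
...     train_test_split(X, y,
...                      test_size=0.20,
...                      stratify=y,
...                      random_state=1)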
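The stratify=y argument asks train_test_split to keep the class proportions roughly equal in both subsets. The following sketch illustrates this on synthetic labels rather than on the real data: it assumes the class distribution commonly reported for this dataset (357 benign and 212 malignant samples) and uses a dummy single-column feature array, so the names X_demo and y_demo are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the assumed size and class balance of the dataset:
# 357 benign (0) and 212 malignant (1) samples.
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    stratify=y_demo,
    random_state=1)

# Subset sizes and the fraction of malignant samples in each subset
print(len(y_tr), len(y_te))
print('malignant fraction overall: %.3f' % y_demo.mean())
print('malignant fraction train:   %.3f' % y_tr.mean())
print('malignant fraction test:    %.3f' % y_te.mean())
```

The three printed fractions should be nearly identical; without stratification, a small test set could by chance contain noticeably more or fewer malignant samples than the training set.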
In Chapter 4, you learned that many learning algorithms require input features on the same scale for optimal performance. Thus, we need to standardize the columns in the Breast Cancer Wisconsin dataset before we can feed them to a linear classifier, such as logistic regression. Furthermore, let's assume that we want to compress our data from the initial 30 dimensions onto a lower two-dimensional subspace via Principal Component Analysis (PCA), a feature extraction technique for dimensionality reduction that we introduced in Chapter 5, Compressing Data via Dimensionality Reduction.
Instead of going through the fitting and transformation steps for the training and test datasets separately, we can chain the StandardScaler, PCA, and LogisticRegression objects in a pipeline:
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import make_pipeline
>>> pipe_lr = make_pipeline(StandardScaler(),
...                         PCA(n_components=2),
...                         LogisticRegression(random_state=1))
>>> pipe_lr.fit(X_train, y_train)
>>> y_pred = pipe_lr.predict(X_test)
>>> print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
Test Accuracy: 0.956
The make_pipeline function takes an arbitrary number of scikit-learn transformers (objects that support the fit and transform methods) as input, followed by a scikit-learn estimator that implements the fit and predict methods. In our preceding code example, we provided two transformers, StandardScaler and PCA, and a LogisticRegression estimator as inputs to the make_pipeline function, which constructs a scikit-learn Pipeline object from these objects.
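The relationship between make_pipeline and Pipeline can be made concrete with a short sketch. make_pipeline generates a name for each step from the lowercased class name, whereas the Pipeline constructor expects explicit (name, object) pairs. The step names 'scl' and 'clf' and the variable names below are arbitrary choices for illustration.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives the step names automatically
auto_named = make_pipeline(StandardScaler(),
                           PCA(n_components=2),
                           LogisticRegression(random_state=1))
auto_names = [name for name, _ in auto_named.steps]
print(auto_names)  # ['standardscaler', 'pca', 'logisticregression']

# The Pipeline constructor takes (name, object) pairs that we choose ourselves
explicit = Pipeline(steps=[('scl', StandardScaler()),
                           ('pca', PCA(n_components=2)),
                           ('clf', LogisticRegression(random_state=1))])
explicit_names = [name for name, _ in explicit.steps]
print(explicit_names)  # ['scl', 'pca', 'clf']

# Both approaches yield the same kind of object
print(isinstance(auto_named, Pipeline))  # True
```

Explicit names take a little more typing but can be convenient when we later want to refer to a particular step, for example via explicit.named_steps['pca'].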
We can think of a scikit-learn Pipeline as a meta-estimator or wrapper around those individual transformers and estimators. If we call the fit method of Pipeline, the data will be passed down a series of transformers via fit and transform calls on these intermediate steps until it arrives at the estimator object (the final element in a pipeline). The estimator will then be fitted to the transformed training data.
When we executed the fit method on the pipe_lr pipeline in the preceding code example, StandardScaler first performed fit and transform calls on the training data. Second, the transformed training data was passed on to the next object in the pipeline, PCA. Similar to the previous step, PCA also executed fit and transform on the scaled input data and passed it to the final element of the pipeline, the estimator.
Finally, the LogisticRegression estimator was fit to the training data after it underwent transformations via StandardScaler and PCA. Again, we should note that there is no limit to the number of intermediate steps in a pipeline; however, the last pipeline element has to be an estimator.
Similar to calling fit on a pipeline, pipelines also implement a predict method. If we feed a dataset to the predict call of a Pipeline object instance, the data will pass through the intermediate steps via transform calls. In the final step, the estimator object will then return a prediction on the transformed data.
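The fit and predict behavior described above can be checked by carrying out the same steps by hand and comparing the results. The following sketch does this on synthetic data generated with make_classification (an assumption made so that the example runs without downloading the breast cancer file); all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 30 features (not the breast cancer dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=30,
                                   n_informative=5, random_state=1)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.20,
                                      stratify=y_syn, random_state=1)

# Pipeline version: one fit call and one predict call
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=2),
                     LogisticRegression(random_state=1))
pipe.fit(X_a, y_a)
pipe_pred = pipe.predict(X_b)

# Manual version of fit: fit and transform with each transformer in turn,
# then fit the final estimator on the transformed training data
sc = StandardScaler()
pca = PCA(n_components=2)
lr = LogisticRegression(random_state=1)
X_a_pca = pca.fit_transform(sc.fit_transform(X_a))
lr.fit(X_a_pca, y_a)

# Manual version of predict: transform only (no refitting), then predict
X_b_pca = pca.transform(sc.transform(X_b))
manual_pred = lr.predict(X_b_pca)

print(np.array_equal(pipe_pred, manual_pred))  # True
```

Note that the test data is only ever passed through transform, never fit, so the scaling and projection parameters come from the training data alone, which is exactly the bookkeeping the pipeline takes care of for us.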
The pipelines of the scikit-learn library are immensely useful wrapper tools, which we will use frequently throughout the rest of this book. To make sure that you've got a good grasp of how the Pipeline object works, please take a close look at the following illustration, which summarizes our discussion from the previous paragraphs: