Chapter 9. Design Strategies and Case Studies

With the possible exception of data munging, evaluation is probably where machine learning scientists spend most of their time: staring at lists of numbers and graphs, watching hopefully as their models run, and trying earnestly to make sense of the output. Evaluation is a cyclical process; we run models, evaluate the results, and plug in new parameters, each time hoping for a performance gain. Our work becomes more enjoyable and productive as we increase the efficiency of each evaluation cycle, and there are tools and techniques that can help us achieve this. This chapter will introduce some of these through the following topics:

  • Evaluating model performance
  • Model selection
  • Real-world case studies
  • Machine learning design at a glance

Evaluating model performance

Measuring a model's performance is an important machine learning task, and there are many different metrics and heuristics for doing it. The importance of defining a scoring strategy should not be underestimated, and in Sklearn, there are basically three approaches:

  • Estimator score: This refers to using the estimator's inbuilt score() method, specific to each estimator
  • Scoring parameters: This refers to cross-validation tools relying on an internal scoring strategy
  • Metric functions: These are implemented in the metrics module

We have seen examples of the estimator score() method, for example, clf.score(). In the case of a linear classifier, the score() method returns the mean accuracy. It is a quick and easy way to gauge an individual estimator's performance. However, this method is usually insufficient in itself for a number of reasons.
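As a quick refresher, here is a minimal sketch, not taken from the chapter's own examples, that fits a linear classifier on synthetic data and calls its score() method; for classifiers, score() returns the mean accuracy on the supplied test data. It uses the same sklearn.cross_validation module as the rest of the chapter's code:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split

# Synthetic binary classification data, split into training and test sets
X, y = make_classification(n_samples=200, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a linear classifier and report its mean accuracy on the test set
clf = LogisticRegression().fit(Xtrain, ytrain)
print(clf.score(Xtest, ytest))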

If we remember, accuracy is the sum of the true positive and true negative cases divided by the total number of samples. Consider a test performed on a number of patients to see if they have a particular disease: if the disease is rare, simply predicting that every patient is disease free would give us a high accuracy. Obviously, this is not what we want.

A better way to measure performance is by using precision (P) and recall (R). If you remember from the table in Chapter 4, Models – Learning from Information, precision is the proportion of predicted positive instances that are correct, that is, TP/(TP+FP). Recall, or sensitivity, is TP/(TP+FN). The F-measure is defined as 2*R*P/(R+P). These measures ignore the true negative rate, so they do not evaluate how well a model handles negative cases.
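To make these definitions concrete, here is a small sketch using made-up labels for the disease-screening scenario above, rather than data from the chapter. A predictor that labels every patient as disease free scores well on accuracy but has zero recall, and for a more plausible set of predictions the F-measure computed from the definition agrees with Sklearn's f1_score:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up screening data: 95 healthy patients (0) and 5 with the disease (1)
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "model" that predicts disease free for everyone
y_pred_trivial = np.zeros_like(y_true)
print(accuracy_score(y_true, y_pred_trivial))   # 0.95, yet every positive case is missed
print(recall_score(y_true, y_pred_trivial))     # 0.0

# More plausible predictions: 3 true positives, 5 false positives, 2 false negatives
y_pred = y_true.copy()
y_pred[:5] = 1        # five healthy patients incorrectly flagged (false positives)
y_pred[95:97] = 0     # two diseased patients missed (false negatives)

P = precision_score(y_true, y_pred)   # TP/(TP+FP) = 3/8
R = recall_score(y_true, y_pred)      # TP/(TP+FN) = 3/5
print("precision %.3f recall %.3f" % (P, R))
print("F-measure %.3f (f1_score: %.3f)" % (2 * R * P / (R + P), f1_score(y_true, y_pred)))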

Rather than use the score method of the estimator, it often makes sense to use specific scoring parameters such as those provided by the cross_val_score function. This has a cv parameter that controls how the data is split. It is usually set as an int, which determines the number of folds: the data is split into that many consecutive folds, each used once as the validation set. The parameter can also be set to an iterable of train and test splits, or to an object that can be used as a cross-validation generator.

Also important in cross_val_score is the scoring parameter. This is usually set to a string indicating a scoring strategy. For classification, the default is accuracy, and some common values are f1, precision, and recall, as well as the micro-averaged, macro-averaged, and weighted versions of these. For regression estimators, the scoring values are mean_absolute_error, mean_squared_error, median_absolute_error, and r2.

The following code estimates the performance of three models on a dataset using 10-fold cross validation. Here, we print out the mean and standard deviation of several scores for each of the three models. In a real-world situation, we will probably need to preprocess our data in one or more ways, and it is important to apply these data transformations to our test set as well as to the training set. To make this easier, we can use the sklearn.pipeline module. This sequentially applies a list of transforms and a final estimator, and it allows us to assemble several steps that can be cross-validated together. Here, we also use the StandardScaler() class to scale the data. Scaling is applied to the decision tree and the logistic regression model by using two pipelines:

from sklearn import cross_validation
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import samples_generator
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic classification data and a 50/50 train/test split
X, y = samples_generator.make_classification(n_samples=1000, n_informative=5, n_redundant=0, random_state=42)
le = LabelEncoder()
y = le.fit_transform(y)
Xtrain, Xtest, ytrain, ytest = cross_validation.train_test_split(X, y, test_size=0.5, random_state=1)

# The three models to compare
clf1 = DecisionTreeClassifier(max_depth=2, criterion='gini')
clf2 = svm.SVC(kernel='linear', probability=True, random_state=0)
clf3 = LogisticRegression(penalty='l2', C=0.001)

# Pipelines ensure the scaler is fitted on each training fold and applied to the validation fold
pipe1 = Pipeline([('sc', StandardScaler()), ('mod', clf1)])
pipe2 = Pipeline([('sc', StandardScaler()), ('mod', clf3)])

mod_labels = ['Decision Tree', 'SVM', 'Logistic Regression']
print('10 fold cross validation:\n')
for mod, label in zip([pipe1, clf2, pipe2], mod_labels):
    auc_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain, cv=10, scoring='roc_auc')
    p_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain, cv=10, scoring='precision_macro')
    r_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain, cv=10, scoring='recall_macro')
    f_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain, cv=10, scoring='f1_macro')

    print(label)
    print("auc scores %.3f +/- %.3f" % (auc_scores.mean(), auc_scores.std()))
    print("precision %.3f +/- %.3f" % (p_scores.mean(), p_scores.std()))
    print("recall %.3f +/- %.3f" % (r_scores.mean(), r_scores.std()))
    print("f scores %.3f +/- %.3f" % (f_scores.mean(), f_scores.std()))

On execution, you will see the following output:

[Output screenshot: the mean and standard deviation of the AUC, precision, recall, and F1 scores for each model]

There are several variations on these techniques, the most common being k-fold cross validation. Here, the model is trained using k - 1 of the folds as training data, and the remaining fold is used to compute the performance measure. This is repeated so that each fold serves once as the test set, and the performance is reported as the average over all the folds. When k equals the number of samples, this becomes what is sometimes referred to as the leave-one-out strategy.

Sklearn implements this using the cross_validation.KFold object. The important parameters are a required int, indicating the total number of samples, and an n_folds parameter, defaulting to 3, which indicates the number of folds. It also takes optional shuffle and random_state parameters, indicating whether to shuffle the data before splitting and how to generate the random state. The default random_state is to use the global NumPy random number generator.
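As a minimal sketch of these parameters (using the same, now older, sklearn.cross_validation API as the rest of this chapter's code), the following splits ten samples into five folds, shuffling the indices first with a fixed random state:

from sklearn import cross_validation

# Ten samples split into five folds; shuffle before splitting, with a fixed seed
k_fold = cross_validation.KFold(10, n_folds=5, shuffle=True, random_state=0)
for k, (train, test) in enumerate(k_fold):
    print("fold %d: train=%s, test=%s" % (k, train, test))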

In the following snippet, we use the LassoCV object. This is a linear model trained with L1 regularization. The objective function for regularized linear regression, if you remember, includes a constant (alpha) that multiplies the L1 regularization term. The LassoCV object sets this alpha value automatically, and to see how effective this is, we can compare the selected alpha and the score on each of the k folds:

import numpy as np
from sklearn import cross_validation, datasets, linear_model

# Synthetic data and a candidate range of alpha values for the L1 penalty
X, y = datasets.make_blobs(n_samples=80, centers=2, random_state=0, cluster_std=2)
alphas = np.logspace(-4, -.5, 30)
lasso_cv = linear_model.LassoCV(alphas=alphas)
k_fold = cross_validation.KFold(len(X), 5)

# On each fold, LassoCV selects its own alpha; we then score on the held-out fold
for k, (train, test) in enumerate(k_fold):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".format(
        k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))

The output of the preceding commands is as follows:

[Output screenshot: the selected alpha and the score for each of the five folds]

Sometimes, it is necessary to preserve the percentage of each class in every fold. This is done using stratified cross validation. It is helpful when classes are imbalanced, that is, when there are many samples of some classes and very few of others. Using stratified folds may help correct a bias that arises when a class is not represented in a fold in large enough numbers to make an accurate prediction. However, this may also cause an unwanted increase in variance.
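A minimal illustration of the difference, with made-up imbalanced labels and the same older cross_validation API used throughout this chapter: plain k-fold can leave some folds with few or no samples of the rare class, whereas stratified folds preserve the class proportions:

import numpy as np
from sklearn.cross_validation import KFold, StratifiedKFold

# Imbalanced labels: 90 samples of class 0 and 10 of class 1
y = np.array([0] * 90 + [1] * 10)

# Plain k-fold ignores the labels; here four of the five folds contain no positive samples
for train, test in KFold(len(y), n_folds=5):
    print("KFold test fold class counts: %s" % np.bincount(y[test], minlength=2))

# Stratified k-fold keeps roughly the same class proportions in every fold
for train, test in StratifiedKFold(y, n_folds=5):
    print("StratifiedKFold test fold class counts: %s" % np.bincount(y[test], minlength=2))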

In the following example, we use stratified cross validation to test how significant the classification score is. This is done by repeating the classification procedure after randomizing the labels. The p value is the proportion of permutation runs in which the score obtained is greater than or equal to the classification score obtained on the original labels. This code snippet uses the cross_validation.permutation_test_score function, which takes the estimator, data, and labels as parameters. Here, we print out the initial test score, the p value, and the score on each permutation:

import numpy as np
from sklearn import linear_model
from sklearn.cross_validation import StratifiedKFold, permutation_test_score
from sklearn import datasets

# Synthetic classification data and a logistic regression classifier
X, y = datasets.make_classification(n_samples=100, n_features=5)
n_classes = np.unique(y).size
cls = linear_model.LogisticRegression()

# Two stratified folds, each preserving the class proportions of y
cv = StratifiedKFold(y, 2)

# Repeat the classification 10 times with randomly permuted labels
score, permutation_scores, pvalue = permutation_test_score(cls, X, y, scoring="f1", cv=cv, n_permutations=10, n_jobs=1)

print("Classification score %s (pvalue : %s)" % (score, pvalue))
print("Permutation scores %s" % (permutation_scores))

This gives the following output:

[Output screenshot: the classification score, its p value, and the score on each permutation]