Cross-validation

If you have run the previous experiment, you may have realized that:

Both the validation and test results vary, as their samples are different.
The chosen hypothesis is often the best one, but this is not always the case.

Unfortunately, relying on separate validation and test samples brings uncertainty, and it also reduces the number of examples available for training (the fewer the examples, the higher the variance of the model's estimates).

A solution would be to use cross-validation, and Scikit-learn offers a complete module for cross-validation and performance evaluation (sklearn.model_selection).

By resorting to cross-validation, you'll just need to separate your data into a training and test set, and you will be able to use the training data for both model optimization and model training.

How does cross-validation work? The idea is to divide your training data into a certain number of partitions (called folds) and train your model once per partition, holding a different partition out of the training phase each time. After every model training, you will test the result on the fold that was left out and store the score. In the end, you will have as many results as there are folds, and you can calculate both their average and their standard deviation:

In the preceding graphical example, the chart depicts a dataset that's been divided into five equally sized folds, which are differently used, depending on the iteration, as part of the train or test set during the machine learning process.
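The fold-by-fold procedure described above can be sketched by hand as follows. This is only an illustration under stated assumptions: it loads the digits dataset directly, uses KFold with five shuffled folds, and picks an RBF-kernel SVC as a stand-in model; none of these choices comes from the chapter's own script.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Assumption: the digits dataset stands in for the training data
X, y = load_digits(return_X_y=True)

# Five folds, shuffled once with a fixed seed so the run is repeatable
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
model = SVC(kernel='rbf', gamma=0.001)  # illustrative model choice

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Train on four folds, then score on the fold that was left out
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

fold_scores = np.array(fold_scores)
print("mean = %0.3f std = %0.3f" % (fold_scores.mean(), fold_scores.std()))
```

Each sample ends up in the left-out fold exactly once, so every example contributes to both training and evaluation.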

Ten folds is a common configuration for cross-validation, and it is the one we recommend. Using fewer folds can be fine with biased estimators such as linear regression, but it may penalize more complex machine learning algorithms. In some cases, you really need to use more folds to ensure that each iteration has enough training data for the algorithm to generalize properly. This happens quite commonly with medical datasets, where data points are scarce. On the other hand, when the number of examples is not an issue, using more folds is more computationally intensive, and the cross-validation may take longer to complete. Sometimes, using five folds is a good compromise between the accuracy of the estimates and running time.
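To get a feel for this trade-off, the following sketch runs the same model with 3, 5, and 10 folds and prints the share of data used for training in each iteration next to the resulting scores. The dataset and the SVC settings are assumptions made for illustration, and the results dictionary is a hypothetical helper.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Assumption: digits data and an RBF-kernel SVC as the model under test
X, y = load_digits(return_X_y=True)
model = SVC(kernel='rbf', gamma=0.001)

results = {}
for n_folds in (3, 5, 10):
    scores = cross_val_score(model, X, y, cv=n_folds, scoring='accuracy')
    # With k folds, each iteration trains on (k - 1) / k of the data
    train_share = (n_folds - 1) / n_folds
    results[n_folds] = (train_share, scores.mean(), scores.std())
    print("folds = %2d train share = %0.2f mean = %0.3f std = %0.3f"
          % (n_folds, train_share, scores.mean(), scores.std()))
```

More folds mean more model fits (10 instead of 3 here), which is where the extra running time comes from.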

The standard deviation hints at how much your model is influenced by the data provided for training (the variance of the model, actually), and the mean provides a fair estimate of its general performance. By comparing the mean cross-validation results of different models (which may differ in model type, in the selection of training variables, or in hyperparameters), you can confidently choose the best-performing hypothesis to be tested for general performance.

We strongly suggest that you use cross-validation just for optimization purposes and not for performance estimation (that is, to figure out what the error of the model might be on fresh data). Cross-validation just points out the best algorithm and parameter choice based on the best averaged result. Because that result is the best of several candidates, reporting it as the expected performance gives a more optimistic estimate than is warranted. In order to report an unbiased estimate of your likely performance, you should prefer using a test set.
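A minimal sketch of this workflow: cross-validation on the training set picks a candidate, and the held-out test set is touched only once to report performance. The two candidate models and the names candidates, cv_means, and best_name are assumptions made for illustration, not the chapter's own hypotheses.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Hypothetical candidates to choose between
candidates = {
    'rbf': SVC(kernel='rbf', gamma=0.001),
    'poly': SVC(kernel='poly', degree=3),
}

# Step 1: use cross-validation on the training data only, for selection
cv_means = {
    name: cross_val_score(est, X_train, y_train,
                          cv=10, scoring='accuracy').mean()
    for name, est in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Step 2: refit the winner on all training data, score once on the test set
best_model = candidates[best_name].fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)
print("selected: %s cv mean = %0.3f test accuracy = %0.3f"
      % (best_name, cv_means[best_name], test_accuracy))
```

The test accuracy, not the winning cross-validation mean, is the figure to report.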

Let's execute an example in order to see cross-validation in action. At this point, we can review the previous evaluation of three possible hypotheses for our digits dataset:

In: import numpy as np
    from sklearn import model_selection
    # X, y and the hypotheses h1, h2, h3 come from the previous experiment
    chosen_random_state = 1
    cv_folds = 10 # Try 3, 5 or 20
    eval_scoring = 'accuracy' # Try also f1_macro (plain f1 is for binary targets)
    workers = -1 # -1 uses all the available CPU cores
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
                                      X, y,
                                      test_size=0.30,
                                      random_state=chosen_random_state)
    for hypothesis in [h1, h2, h3]:
        scores = model_selection.cross_val_score(hypothesis,
                     X_train, y_train,
                     cv=cv_folds, scoring=eval_scoring, n_jobs=workers)
        # Adjacent string literals are joined, so the format string stays valid
        print("%s -> cross validation accuracy: mean = %0.3f "
              "std = %0.3f" % (hypothesis, np.mean(scores),
                               np.std(scores)))

Out: LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
      intercept_scaling=1, loss='squared_hinge', max_iter=1000,
      multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
      verbose=0) -> cross validation accuracy: mean = 0.930 std = 0.021

     SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False) -> cross validation accuracy: 
      mean = 0.990 std = 0.007

     SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False) -> cross validation accuracy: 
      mean = 0.987 std = 0.010

The core of the script is the model_selection.cross_val_score function. The function in our script receives the following parameters:

A learning algorithm (estimator)
A training set of predictors (X)
A target variable (y)
The number of cross-validation folds (cv)
A scoring function (scoring)
The number of CPUs to be used (n_jobs)

Given such an input, the function wraps a series of more complex operations. It runs n iterations, each time training a model on the n-1 in-sample folds, testing it on the remaining out-of-sample fold, and storing the resulting score. In the end, the function returns an array of the recorded scores, like this:

In: scores

Out: array([ 0.96899225, 0.96899225, 0.9921875, 0.98412698, 0.99206349, 
             1., 1., 0.984, 0.99186992, 0.98347107])

The main advantage of using cross_val_score resides in its simplicity of usage and in the fact that it automatically incorporates all of the necessary steps for a correct cross-validation. For example, when deciding how to split the training sample into folds, if the estimator is a classifier and a y vector is provided, it stratifies the folds, that is, it keeps roughly the same proportion of each target class label in every fold as in the y that was initially provided.
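The stratification can be checked directly with StratifiedKFold, the splitter that cross_val_score applies to classifiers when cv is an integer. The sketch below, which assumes the digits dataset, compares each fold's class shares with those of the full target vector; overall and max_gap are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold

X, y = load_digits(return_X_y=True)

# Share of each of the ten digit classes in the full target vector
overall = np.bincount(y) / len(y)

skf = StratifiedKFold(n_splits=10)
max_gap = 0.0
for _, test_idx in skf.split(X, y):
    # Class shares inside this held-out fold
    fold_share = np.bincount(y[test_idx], minlength=10) / len(test_idx)
    max_gap = max(max_gap, np.abs(fold_share - overall).max())

print("largest deviation from the overall class share: %0.4f" % max_gap)
```

A small deviation confirms that every fold mirrors the class balance of the original y.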
