If you have run the previous experiment, you may have realized that:
- Both the validation and test results vary, as their samples are different.
- The chosen hypothesis is often the best one, but this is not always the case.
Unfortunately, relying on separate validation and test samples brings uncertainty, and it also reduces the number of examples left for training (the fewer the examples, the higher the variance of the model's estimates).
A solution is to use cross-validation. Scikit-learn offers a complete module for cross-validation and performance evaluation (sklearn.model_selection).
By resorting to cross-validation, you'll just need to separate your data into a training and test set, and you will be able to use the training data for both model optimization and model training.
How does cross-validation work? The idea is to divide your training data into a certain number of partitions (called folds) and train your model once per fold, each time keeping a different fold out of the training phase. After every training run, you test the model on the fold that was left out and store the result. In the end, you will have as many results as there are folds, and you can calculate both their average and their standard deviation:
In the preceding graphical example, the chart depicts a dataset that's been divided into five equally sized folds, which are differently used, depending on the iteration, as part of the train or test set during the machine learning process.
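The procedure just described can be sketched by hand with a plain loop over the folds. This is only an illustration under some assumptions: the digits data is loaded directly with load_digits, the whole dataset plays the role of the training data, and an RBF SVC stands in for whatever hypothesis you want to evaluate.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Assumption: digits data and an RBF SVC stand in for your own X, y and model
X, y = load_digits(return_X_y=True)
model = SVC(kernel='rbf', gamma=0.001)

# Five folds, shuffled with a fixed seed so the split is reproducible
folds = KFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = []
for train_idx, test_idx in folds.split(X):
    # Train on four folds, leaving the fifth one out
    model.fit(X[train_idx], y[train_idx])
    # Measure accuracy on the fold that was left out and store it
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# One score per fold, summarized by mean and standard deviation
print("mean = %0.3f std = %0.3f" % (np.mean(fold_scores), np.std(fold_scores)))
```

Every example ends up in the held-out fold exactly once, so each score is computed on data the model did not see during that iteration.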
The standard deviation provides a hint of how much your model is influenced by the data provided for training (the variance of the model, actually), and the mean provides a fair estimate of its general performance. Using the mean of the cross-validation results obtained from different models (because a different model type is employed, a different selection of the training variables is used, or different hyperparameters are set), you can confidently choose the best performing hypothesis to be tested for general performance.
Let's execute an example in order to see cross-validation in action. At this point, we can review the previous evaluation of three possible hypotheses for our digits dataset:
In: import numpy as np
    from sklearn import model_selection

    # X, y and the hypotheses h1, h2, h3 come from the previous experiment
    chosen_random_state = 1
    cv_folds = 10  # Try 3, 5 or 20
    eval_scoring = 'accuracy'  # Try also 'f1_macro' (plain 'f1' is binary only)
    workers = -1  # this will use all your CPU cores
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y,
        test_size=0.30,
        random_state=chosen_random_state)
    for hypothesis in [h1, h2, h3]:
        # One score per fold, computed on the training portion only
        scores = model_selection.cross_val_score(
            hypothesis, X_train, y_train,
            cv=cv_folds, scoring=eval_scoring, n_jobs=workers)
        print("%s -> cross validation accuracy: mean = %0.3f std = %0.3f"
              % (hypothesis, np.mean(scores), np.std(scores)))
Out: LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0) -> cross validation accuracy: mean = 0.930 std = 0.021
     SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False) -> cross validation accuracy: mean = 0.990 std = 0.007
     SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False) -> cross validation accuracy: mean = 0.987 std = 0.010
The core of the script is the model_selection.cross_val_score function. The function in our script receives the following parameters:
- A learning algorithm (estimator)
- A training set of predictors (X)
- A target variable (y)
- The number of cross-validation folds (cv)
- A scoring function (scoring)
- The number of CPUs to be used (n_jobs)
Given such an input, the function wraps a series of more complex operations. It runs n iterations, one per cross-validation fold: each time, it trains a model on the n-1 in-sample folds, tests it on the remaining out-of-sample fold, and stores the resulting score. In the end, the function returns an array of the recorded scores, like this one:
In: scores
Out: array([ 0.96899225, 0.96899225, 0.9921875, 0.98412698, 0.99206349,
     1., 1., 0.984, 0.99186992, 0.98347107])
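As a quick check, the array can be summarized in the same way as in the loop. The values below are simply copied from the output above. The variable scores keeps whatever the last iteration of the loop assigned to it, so the summary should reproduce the figures printed for the third hypothesis.

```python
import numpy as np

# Fold scores copied from the output shown above
scores = np.array([0.96899225, 0.96899225, 0.9921875, 0.98412698, 0.99206349,
                   1., 1., 0.984, 0.99186992, 0.98347107])

# Same summary statistics used in the evaluation loop
summary = "mean = %0.3f std = %0.3f" % (np.mean(scores), np.std(scores))
print(summary)  # mean = 0.987 std = 0.010
```

Note that np.std computes the population standard deviation by default (it divides by the number of folds, not by that number minus one).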
The main advantage of cross_val_score lies in its simplicity and in the fact that it automatically takes care of the steps necessary for a correct cross-validation. For example, when deciding how to split the training sample into folds, if the estimator is a classifier and cv is an integer, it stratifies the folds on y, so each fold keeps approximately the same proportion of target class labels as the y that was initially provided.
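The effect of stratification can be inspected directly with StratifiedKFold, the splitter that, as far as the Scikit-learn documentation states, cross_val_score falls back on for classifiers when cv is an integer. The following is a minimal sketch, assuming the digits data is loaded with load_digits. It compares the class shares of each held-out fold with those of the whole target vector.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold

# Assumption: the digits data stands in for your own X and y
X, y = load_digits(return_X_y=True)

# Share of each of the ten digit classes in the full target vector
overall_share = np.bincount(y) / len(y)

skf = StratifiedKFold(n_splits=10)
max_gap = 0.0
for _, test_idx in skf.split(X, y):
    # Share of each class inside the held-out fold
    fold_share = np.bincount(y[test_idx], minlength=10) / len(test_idx)
    # Track the largest difference from the overall share
    max_gap = max(max_gap, np.abs(fold_share - overall_share).max())

print("largest deviation from the overall class share: %0.4f" % max_gap)
```

A small deviation confirms that every fold mirrors the class distribution of y. With a plain, non-stratified split on data sorted by class, some folds could end up missing whole classes.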