Using cross-validation iterators

Though the cross_val_score function from the model_selection module acts as a complete helper function for most cross-validation purposes, you may need to build up your own cross-validation process. In this case, the same model_selection module provides a comprehensive selection of iterators.

Before examining the most useful ones, let's provide a clear overview of how they function by studying how one of the iterators, model_selection.KFold, works.

KFold is quite simple in its functionality. Given a number of folds n, it returns n iterations over the indexes of the training and validation sets, so that each fold is used for testing exactly once.

Let's say that we have a training set made up of 100 examples and we would like to create a 10-fold cross-validation. First, let's set up our iterator:

In: kfolding = model_selection.KFold(n_splits=10, shuffle=True,
                                     random_state=1)
    for train_idx, validation_idx in kfolding.split(range(100)):
        print(train_idx, validation_idx)

Out: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 18 19 20 21 22 23 24 25 26 27
28 29 30 31 32 34 35 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63 64 66 67 68 70 71 72 73 74 75 76 77 78 79
83 85 86 87 88 89 90 91 92 94 95 96 97 98 99] [17 33 36 65 69 80 81 82
84 93] ...

The iterator performs the folding on 100 indexes because we passed range(100) to the split method, while n_splits specifies the number of folds. When shuffle is set to True, the fold components are chosen randomly. If instead it is set to False, the folds are created with respect to the order of the indexes (so the first validation fold will be [0 1 2 3 4 5 6 7 8 9]).

As usual, the random_state parameter ensures the reproducibility of the fold generation.
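A quick check can make both behaviors concrete. The following sketch (the variable names are only illustrative and not part of the original example) verifies that, without shuffling, the first validation fold simply follows the index order, and that two iterators sharing the same random_state produce identical folds:

from sklearn import model_selection

# With shuffle=False, folds follow the order of the indexes
ordered_kfolding = model_selection.KFold(n_splits=10, shuffle=False)
train_idx, validation_idx = next(ordered_kfolding.split(range(100)))
print(validation_idx)  # [0 1 2 3 4 5 6 7 8 9]

# With shuffle=True, fixing random_state makes the random folds reproducible
kf_a = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
kf_b = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
print(all((a[1] == b[1]).all()
          for a, b in zip(kf_a.split(range(100)),
                          kf_b.split(range(100)))))  # True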

During the iterator loop, the indexes for training and validation are provided so that you can evaluate your hypothesis (let's see how it works by using h1, the linear SVC). You just have to select both X and y accordingly with the help of fancy indexing:

In: h1.fit(X[train_idx], y[train_idx])
    h1.score(X[validation_idx], y[validation_idx])

Out: 0.90000000000000002

As you can see, a cross-validation iterator provides you with just the index functionality; it is up to you to use those indexes to score your hypothesis. This opens up opportunities for sophisticated validation operations.
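As an illustration, here is a minimal sketch of a complete validation loop built on the kfolding iterator defined above. It assumes that X, y, and the h1 classifier are available from the previous examples, and it reproduces by hand what cross_val_score does for you:

import numpy as np

scores = list()
for train_idx, validation_idx in kfolding.split(X):
    # fit on the training folds, score on the held-out fold
    h1.fit(X[train_idx], y[train_idx])
    scores.append(h1.score(X[validation_idx], y[validation_idx]))
print('Mean accuracy: %0.3f (std: %0.3f)' % (np.mean(scores), np.std(scores)))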

Among the other most useful iterators, the following are worth mentioning:

  •  StratifiedKFold works like KFold, but it always returns folds with approximately the same class percentages as the training set. This keeps each fold balanced; therefore, the learner is fitted on the correct proportion of classes. Instead of just the number of cases, it needs the target variable y as an input to its split method. It is the iterator that is wrapped, by default, in the cross_val_score function, as we saw in the preceding section (a setup sketch follows this list).
  •  LeaveOneOut works like KFold, but it returns a validation set made of only one observation. Therefore, in the end, the number of folds is equivalent to the number of examples in the training set. We recommend that you use this cross-validation approach only when the training set is heavily unbalanced (such as in fraud detection problems) or very small, especially when there are fewer than 100 observations and a k-fold validation would reduce the training set a lot.
  •  LeavePOut is similar to LeaveOneOut with regard to its advantages and limitations, but its validation set is made up of P cases. Therefore, the total number of folds will be the number of combinations of P cases from all the available cases (which can actually be quite a large number as the size of your dataset grows).
  •  LeaveOneLabelOut provides a convenient way to cross-validate according to a scheme that you have prepared or computed in advance. In fact, it acts like KFold, except that the folds are already labeled and provided to the labels parameter.
  •  LeavePLabelOut is a variant of LeaveOneLabelOut. In this instance, the test folds are made of a number of labels according to the scheme that you prepare in advance.
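The following is a brief sketch of how the first two iterators in the preceding list can be set up, assuming that X and y are the training data and target used so far in the chapter:

from sklearn import model_selection

# StratifiedKFold needs y in its split method to preserve class proportions
stratified = model_selection.StratifiedKFold(n_splits=10, shuffle=True,
                                             random_state=1)
for train_idx, validation_idx in stratified.split(X, y):
    pass  # each validation fold mirrors the class percentages of y

# LeaveOneOut creates as many folds as there are examples
leave_one_out = model_selection.LeaveOneOut()
print(leave_one_out.get_n_splits(X))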
To learn more about the specific parameters required by each iterator, we suggest that you check out the Scikit-learn website: http://Scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection.

As a matter of fact, cross-validation can also be used for prediction purposes. For specific data science projects, you may be required to build a model from your available data and then produce predictions on that very same data. As seen previously, using training predictions will lead to overly optimistic estimates, given that the model has been fitted on that very data and has thus memorized many of its characteristics.

The cross-validation process applied to prediction can come to the rescue:

  • Create a cross-validation iterator (preferably with a large number of k folds).
  • Iterate through the cross-validation and each time train your model with the k-1 training folds.
  • At each iteration, produce predictions on the validation fold (which is actually an out-of-sample fold) and store them away, keeping track of their index. The best way of doing so is to have a prediction matrix that is populated with predictions by using fancy indexing.

Such an approach is commonly referred to as out-of-cross-validation-fold prediction (or simply out-of-fold prediction).
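A minimal sketch of this procedure follows, again assuming that X, y, and the h1 classifier from the previous examples are available:

import numpy as np
from sklearn import model_selection

kfolding = model_selection.KFold(n_splits=10, shuffle=True, random_state=1)
# pre-allocate a vector of predictions, to be filled fold by fold
out_of_fold_pred = np.zeros(len(y))
for train_idx, validation_idx in kfolding.split(X):
    h1.fit(X[train_idx], y[train_idx])
    # fancy indexing places each out-of-sample prediction at its original index
    out_of_fold_pred[validation_idx] = h1.predict(X[validation_idx])

Note that scikit-learn also offers the model_selection.cross_val_predict function, which produces the same kind of out-of-fold predictions in a single call.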
