Cross-validation

Cross-validation (which you may hear some data scientists refer to as rotation estimation, or describe simply as a general technique for assessing models) is another method for assessing a model's performance, or its accuracy.

Mainly used with predictive modeling to estimate how accurately a model might perform in practice, cross-validation checks how well a model will generalize; in other words, how well the model will apply what it infers from its training samples to an entire population (or dataset).

With cross-validation, you identify a known dataset on which training is run (your training dataset), along with a dataset of unknown (first-seen) data against which the model will be tested (this is known as your testing dataset). The objective is to keep problems such as overfitting (where peculiarities of the training data unduly influence the results) under control, as well as to provide insight into how the model will generalize to a real problem or a real data file.

This process consists of separating the data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set):

Separation → Analysis → Validation
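
To make this flow concrete, here is a minimal sketch of a single separation-analysis-validation pass in Python. It assumes scikit-learn and synthetic data; neither the library nor the dataset comes from this example, so treat both as stand-ins:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 500 samples, 10 features, a binary outcome.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Separation: hold out 20% of the data as the testing (validation) set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Analysis: fit the model on the training set only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Validation: score the model on data it has never seen.
print("Held-out accuracy:", model.score(X_test, y_test))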

To reduce variability, multiple iterations (also called folds or rounds) of cross-validation are performed using different partitions of the data, and the validation results are averaged over the rounds. Typically, a data scientist will let the model's stability determine how many rounds of cross-validation should actually be performed.
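
Extending the previous sketch, the following performs five rounds (folds) of cross-validation and averages the per-round results; again, scikit-learn and the synthetic data are assumptions made for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Five different partitions of the same data, one per round.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_index, test_index in kfold.split(X):
    model = LogisticRegression(max_iter=1000)
    # Analysis on the training fold.
    model.fit(X[train_index], y[train_index])
    # Validation on the held-out fold.
    scores.append(model.score(X[test_index], y[test_index]))

# Averaging over the rounds reduces the variability of the estimate.
print("Per-round accuracy:", np.round(scores, 3))
print("Average accuracy:", round(float(np.mean(scores)), 3))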

Again, the cross-validation method can perhaps be better understood by thinking of selecting a subset of the data and manually calculating the results. Once you know the correct results, they can be compared to the results the model produces on a separate subset of the data. This is one round. Multiple rounds are performed, the compared results are averaged and reviewed, and this eventually provides a fair estimate of the model's prediction performance.

Suppose a university provides data on its student body over time. Each student is described by various characteristics, such as whether their high school GPA was greater or less than 3.0, whether a family member graduated from the school, whether the student was active in non-program activities, was a resident (lived on campus), was a student athlete, and so on. Our predictive model attempts to predict which characteristics are shared by students who graduate early.

The following table represents the results of using a five-round cross-validation process to estimate our model's expected accuracy:

[Table: Cross-validation]

Given the preceding figures, I'd say our predictive model is expected to be very accurate!
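
As a hedged illustration of this example, the sketch below fabricates stand-in student characteristics (every feature name, value, and label here is hypothetical, invented purely for demonstration) and runs a five-round cross-validation to produce per-round accuracies of the kind summarized in the table:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400

# Hypothetical student characteristics (all invented for illustration).
gpa_over_3 = rng.integers(0, 2, n)   # high school GPA greater than 3.0
legacy     = rng.integers(0, 2, n)   # a family member graduated from the school
activities = rng.integers(0, 2, n)   # active in non-program activities
resident   = rng.integers(0, 2, n)   # lived on campus
athlete    = rng.integers(0, 2, n)   # student athlete

X = np.column_stack([gpa_over_3, legacy, activities, resident, athlete])
# Fabricated outcome: early graduation loosely tied to GPA and residency.
y = ((gpa_over_3 + resident + rng.random(n)) > 1.5).astype(int)

# Five rounds of cross-validation; each score plays the role of one table row.
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print("Round accuracies:", np.round(scores, 3))
print("Expected accuracy:", round(float(scores.mean()), 3))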

In summary, cross-validation combines (averages) measures of fit (prediction error) to derive a more accurate estimate of model prediction performance. This method is typically used in cases where there is not enough data available to hold out a separate test set without losing significant modeling or testing quality.
