Evaluating a model is an essential part of the workflow. There is no point in creating the most sophisticated model if you do not have the tools to assess its quality. The validation process consists of defining some quantitative reliability criteria, setting a strategy such as a K-fold cross-validation scheme, and selecting the appropriate labeled data.
The purpose of this section is to create a reusable Scala class to validate models. For starters, the validation process relies on a set of metrics to quantify the fitness of a model generated through training.
Let's consider a simple classification model with two classes, defined as positive (versus negative) and represented with a black (versus white) color in the following diagram. Data scientists use the following terminology:
- True positives (TP): Observations that belong to the positive class and are correctly labeled as positive
- True negatives (TN): Observations that belong to the negative class and are correctly labeled as negative
- False positives (FP): Observations that belong to the negative class but are incorrectly labeled as positive
- False negatives (FN): Observations that belong to the positive class but are incorrectly labeled as negative
This simplistic representation can be extended to classification problems that involve more than two classes. For instance, a false positive for a given class is an observation that is incorrectly assigned to that class while it belongs to another one. These four counts are used for evaluating accuracy, precision, recall, and the F and G measures, as follows:
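These metrics follow the standard definitions in terms of the four counts:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad \text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad G = \sqrt{\text{precision} \cdot \text{recall}}$$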
The computation of the precision, recall, and F1 score depends on the number of classes used in the classifier. We will consider the following implementations:
The binomial F validation computes the precision, recall, and F scores for the positive class.
Let's implement the F-score or F-measure as a specialization of the following Validation trait:
trait Validation { def score: Double }
The BinFValidation class encapsulates the computation of the F1 score as well as the precision and recall, by counting the occurrences of the TP, TN, FP, and FN values. It implements the M3 formula. In the tradition of Scala programming, the class is immutable; it computes the counters for TP, TN, FP, and FN when the class is instantiated. The class takes the following three parameters:
- expected: The vector of expected values, with the 0 value for a negative outcome and 1 for a positive outcome
- xt: The vector of observations used for validating the model
- predict: The function that classifies observations (line 1)

class BinFValidation[T <: AnyVal](
    expected: Vector[Int],
    xt: XVSeries[T])(predict: Array[T] => Int)
    (implicit f: T => Double) extends Validation { //1

  private val POSITIVE = 1  // convention: 1 denotes the positive class

  val counters = {
    val predicted = xt.map(predict(_))
    expected.zip(predicted)
      .aggregate(new Counter[Label])((cnt, ep) =>
        cnt + classify(ep._1, ep._2), _ ++ _) //2
  }

  override def score: Double = f1 //3
  lazy val f1 = 2.0*precision*recall/(precision + recall)
  lazy val precision = compute(FP) //4
  lazy val recall = compute(FN)

  def compute(n: Label): Double = {
    val denom = counters(TP) + counters(n)
    counters(TP).toDouble/denom
  }

  private def classify(expected: Int, predicted: Int): Label = //5
    if(expected == predicted) { if(expected == POSITIVE) TP else TN }
    else if(expected == POSITIVE) FN else FP
}
The constructor counts the number of occurrences of each of the four outcomes (TP, TN, FP, and FN) (line 2). The precision, recall, and f1 values are defined as lazy values so they are computed only once, when they are either accessed directly or the score method is invoked (line 4). The F1 measure is the most commonly used scoring value for validating classifiers; therefore, it is the default score (line 3). The classify private method extracts the qualifier from the expected and predicted values (line 5).
The BinFValidation class is independent of the type of classifier, its training, the labeling process, and the type of observations.
Unlike Java, which defines an enumerated type as a special kind of class, Scala implements enumerations as singleton objects that extend the scala.Enumeration abstract class:
object Label extends Enumeration {
type Label = Value
val TP, TN, FP, FN = Value
}
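As an illustration, here is a minimal, self-contained sketch of the same counting logic, with hypothetical expected and predicted label vectors (the values are invented for the example):

object BinFValidationDemo extends App {
  val POSITIVE = 1
  val expected = Vector(1, 0, 1, 1, 0, 1, 0, 0)   // hypothetical ground truth
  val predicted = Vector(1, 0, 0, 1, 1, 1, 0, 0)  // hypothetical classifier output

  // Count (TP, TN, FP, FN) in a single pass over the (expected, predicted) pairs
  val (tp, tn, fp, fn) = expected.zip(predicted)
    .foldLeft((0, 0, 0, 0)) { case ((tp, tn, fp, fn), (e, p)) =>
      if (e == p) { if (e == POSITIVE) (tp + 1, tn, fp, fn) else (tp, tn + 1, fp, fn) }
      else if (e == POSITIVE) (tp, tn, fp, fn + 1)
      else (tp, tn, fp + 1, fn)
    }

  val precision = tp.toDouble/(tp + fp)
  val recall = tp.toDouble/(tp + fn)
  val f1 = 2.0*precision*recall/(precision + recall)
  println(f"precision=$precision%.2f recall=$recall%.2f F1=$f1%.2f")
}

Running this sketch on the eight pairs above yields a precision, recall, and F1 score of 0.75.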
The generalized F-score formula, Fn with n > 1, weighs recall more heavily than precision, as shown in the following graph:
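For reference, the generalized measure follows the standard definition:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

which reduces to the F1 score for β = 1 and converges toward the recall as β increases.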
The validation metric is defined by the M3 formula. The idea is quite simple: the precision and recall values are computed for all the classes and then they are averaged to produce a single precision and recall value for the entire model. The precision and recall for the entire model leverage the counts of TP, FP, FN, and TN introduced in the previous section.
There are two commonly used sets of formulas to compute the precision and recall of a multiclass model: the micro and the macro versions:
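Using the per-class counts TPc, FPc, and FNc over C classes, the standard definitions are:

$$\text{precision}_{micro} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} (TP_c + FP_c)} \qquad \text{recall}_{micro} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} (TP_c + FN_c)}$$

$$\text{precision}_{macro} = \frac{1}{C}\sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c} \qquad \text{recall}_{macro} = \frac{1}{C}\sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}$$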
We will use the macro formulas from now on.
The computation of the precision and recall factor for a classifier with more than two classes requires the extraction and manipulation of the confusion matrix. We use the following convention: expected values are defined as columns and predicted values are defined as rows.
The MultiFValidation multinomial validation class takes the following four parameters:
- expected: The vector of expected class indices
- xv: The vector of observations used for validating the model
- classes: The number of classes in the model
- predict: The predictive function that classifies observations (line 7)

The code is as follows:
class MultiFValidation[T <: AnyVal](
    expected: Vector[Int],
    xv: XVSeries[T],
    classes: Int)(predict: Array[T] => Int)
    (implicit f: T => Double) extends Validation { //7

  val labeled = xv.zip(expected)
  val confusionMatrix: Matrix[Int] = //8
    labeled./:(Matrix[Int](classes)){ case (m, (x, n)) =>
      m + (predict(x), n, 1) } //9

  val macroStats: DblPair = { //10
    val pr = Range(0, classes)./:((0.0, 0.0))((s, n) => {
      val tp = confusionMatrix(n, n) //11
      val fn = confusionMatrix.col(n).sum - tp //12
      val fp = confusionMatrix.row(n).sum - tp //13
      (s._1 + tp.toDouble/(tp + fp), s._2 + tp.toDouble/(tp + fn))
    })
    (pr._1/classes, pr._2/classes)
  }

  lazy val precision: Double = macroStats._1
  lazy val recall: Double = macroStats._2
  def score: Double = 2.0*precision*recall/(precision + recall)
}
The core element of the multiclass validation is the confusion matrix, confusionMatrix (line 8). Following our convention, its element at the indices (i, j) = (index of the predicted class for an observation, index of the expected class for the same observation) is incremented for each validation observation (line 9).
As stated in the introduction of the section, we use the macro definition of the precision and recall (line 10). The count of true positives, tp, for each class corresponds to the diagonal element of the confusion matrix (line 11). The count of false negatives, fn, for a class is computed as the sum of the counts over all the predicted classes (the column values) for a given expected class, minus the true positive count (line 12). The count of false positives, fp, for a class is computed as the sum of the counts over all the expected classes (the row values) for a given predicted class, minus the true positive count (line 13).
The formula for the computation of the F1 score is the same as the formula used in the binomial validation.
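To make the row and column bookkeeping concrete, here is a small, self-contained sketch that applies the same macro computation to a hypothetical three-class confusion matrix (the counts are invented for the example):

object MacroFValidationDemo extends App {
  val classes = 3
  // cm(i)(j): observations of expected class j (column) predicted as class i (row)
  val cm = Array(
    Array(8, 1, 0),
    Array(2, 7, 1),
    Array(0, 2, 9)
  )
  val (sumP, sumR) = (0 until classes).foldLeft((0.0, 0.0)) { case ((p, r), n) =>
    val tp = cm(n)(n)                                  // diagonal element
    val fn = (0 until classes).map(cm(_)(n)).sum - tp  // column n minus the diagonal
    val fp = cm(n).sum - tp                            // row n minus the diagonal
    (p + tp.toDouble/(tp + fp), r + tp.toDouble/(tp + fn))
  }
  val (precision, recall) = (sumP/classes, sumR/classes)
  val f1 = 2.0*precision*recall/(precision + recall)
  println(f"macro precision=$precision%.3f macro recall=$recall%.3f F1=$f1%.3f")
}

The diagonal holds the true positives of each class; the remaining entries of a column are its false negatives, and the remaining entries of a row are its false positives.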
It is quite common that the labeled dataset (observations plus their expected outcomes) available to data scientists is not very large. The solution is to break the original labeled dataset into K groups of data.
The one-fold cross validation is the simplest scheme used for extracting a training set and validation set from a labeled dataset, as described in the following diagram:
The one-fold cross validation methodology consists of the following three steps:
1. Select the ratio of the size of the training set over the size of the labeled dataset.
2. Randomly select the labeled observations for the training set.
3. Create the validation set from the remaining labeled observations.
The one-fold cross validation is implemented by the OneFoldXValidation class. It takes the following three arguments: the xt vector of observations, the expected vector of expected classes, and the ratio of the size of the training set over the size of the entire labeled dataset (line 14):
type ValidationType[T] = Vector[(Array[T], Int)]

class OneFoldXValidation[T <: AnyVal](
    xt: XVSeries[T],
    expected: Vector[Int],
    ratio: Double)(implicit f: T => Double) { //14

  val dataSet: (ValidationType[T], ValidationType[T]) //15 initialized below

  def trainingSet: ValidationType[T] = dataSet._1
  def validationSet: ValidationType[T] = dataSet._2
}
The constructor of the OneFoldXValidation class generates the segregated training and validation sets from the set of observations and expected classes (line 15):
import scala.util.Random

val dataSet: (ValidationType[T], ValidationType[T]) = {
  val labeledData = xt.zip(expected) //16
  val trainingSize = (ratio*expected.size).floor.toInt //17
  val valSz = labeledData.size - trainingSize
  val adjSz =
    if(valSz < 2) 1
    else if(valSz >= labeledData.size) labeledData.size - 1
    else valSz //18
  val ordLabeledData = labeledData
    .map((_, Random.nextDouble)) //19
    .sortWith(_._2 < _._2).unzip._1 //20
  (ordLabeledData.dropRight(adjSz), ordLabeledData.takeRight(adjSz)) //21
}
The initialization of the OneFoldXValidation class creates the labeledData vector of labeled observations by zipping the observations with the expected outcomes (line 16). The training ratio value is used to compute the respective sizes of the training set (line 17) and the validation set (line 18).
In order to create the training and validation sets randomly, we zip the labeled dataset with a random generator (line 19), and then reorder the labeled dataset by sorting on the random values (line 20). Finally, the initializer returns the pair of the training set and the validation set (line 21).
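Here is a hypothetical usage sketch of the class, assuming XVSeries[T] is defined as Vector[Array[T]] (the dataset values are invented for the example):

// 100 two-feature observations with alternating labels, split 80/20
val xt: XVSeries[Double] = Vector.tabulate(100)(n => Array(n.toDouble, 2.0*n))
val expected: Vector[Int] = Vector.tabulate(100)(n => n % 2)

val xValidation = new OneFoldXValidation[Double](xt, expected, 0.8)
val trainingSet = xValidation.trainingSet     // 80 labeled observations
val validationSet = xValidation.validationSet // 20 labeled observations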
The data scientist creates K training-validation datasets by selecting one of the groups as a validation set and then combining all the remaining groups into a training set, as illustrated in the following diagram. The process is known as the K-fold cross validation [2:7].
In this example, the third segment, S3, is used as the validation data, and all the other segments are combined into a single training set. This process is applied to each segment of the original labeled dataset.
The challenge is to create a model that fits both the training set and subsequent observations to be classified during the validation phase.
If the model tightly fits the observations selected for training, there is a high probability that new observations may not be correctly classified. This is usually the case when the model is complex. This model is characterized as having a low bias with a high variance. Such a scenario can be attributed to the fact that the scientist is overly confident that the observations she/he selected for training are representative of the real world.
Conversely, the probability that a new observation is correctly classified increases as the selected model fits the training set more loosely. In this case, the model is characterized as having a high bias with a low variance.
The mathematical definition for the bias, variance, and mean square error (MSE) of the distribution are defined by the following formulas:
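These follow the standard statistical definitions: for a target function f and a model estimate trained on sampled data,

$$\text{bias}\big(\hat f(x)\big) = E\big[\hat f(x)\big] - f(x)$$

$$\text{var}\big(\hat f(x)\big) = E\Big[\big(\hat f(x) - E[\hat f(x)]\big)^2\Big]$$

$$\text{MSE}(x) = E\Big[\big(\hat f(x) - f(x)\big)^2\Big] = \text{bias}^2\big(\hat f(x)\big) + \text{var}\big(\hat f(x)\big)$$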
Let's illustrate the concepts of bias, variance, and mean square error with an example. At this stage, you have not been introduced to most of the machine learning techniques; therefore, we create a simulator to illustrate the relationship between the bias and the variance of a classifier. The components of the simulation are as follows:
- training: The training set
- target: The model of the target, of the Double => Double type, extracted from the training set
- models: The list of models to evaluate

A model that exactly matches the training data overfits the target model. Models that merely approximate the target model will most likely underfit. The models in this example are defined by single variable functions.
These models are evaluated against a validation dataset. The BiasVariance class takes the target model, target, and the size of the validation test, nValues, as parameters (line 22). It merely implements the formulas to compute the bias and variance for each model:
type Dbl_F = Double => Double

class BiasVariance[T](target: Dbl_F, nValues: Int)
    (implicit f: T => Double) { //22

  def fit(models: List[Dbl_F]): List[DblPair] = //23
    models.map(accumulate(_, models.size)) //24
}
The fit method computes the variance and bias of each of the models compared to the target model (line 23). The variance and bias are accumulated in the accumulate method (line 24):
def accumulate(f: Dbl_F, numModels: Int): DblPair =
  Range(0, nValues)./:((0.0, 0.0)){ case ((s, t), x) =>
    // deviation from the target, normalized by the number of models
    val diff = (f(x) - target(x))/numModels
    (s + diff*diff, t + Math.abs(f(x) - target(x))) //25
  }
The training data is generated by the single variable function with the r1 and r2 noise components:
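Reconstructed from the training generator in the code below, the training data follows:

$$y(x) = 0.2\,x\,\big(1 + \sin(0.1\,x + r_1)\big) + r_2$$

where r1 and r2 are the random noise components.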
The accumulate method returns a tuple (variance, bias) for each model, f (line 25). The model candidates are defined by the following family of single variable functions for the values n = 1, 2, and 4:
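From the template function in the code below, the family is:

$$f_n(x) = 0.2\,x\,\Big(1 + \frac{\sin(0.1\,x)}{n}\Big)$$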
The target model (line 26) and the models under evaluation (line 27) belong to the same family of single variable functions:
val template = (x: Double, n: Int) =>
  0.2*x*(1.0 + Math.sin(x*0.1)/n)

val training = (x: Double) => {
  val r1 = 0.45*(Random.nextDouble - 0.5)
  val r2 = 38.0*(Random.nextDouble - 0.5) + Math.sin(x*0.3)
  0.2*x*(1.0 + Math.sin(x*0.1 + r1)) + r2
}
val target = (x: Double) => template(x, 1) //26

val models = List[(Dbl_F, String)]( //27
  ((x: Double) => template(x, 4), "Underfit1"),
  ((x: Double) => template(x, 2), "Underfit2"),
  ((x: Double) => training(x), "Overfit"),
  (target, "target")
)
val evaluator = new BiasVariance[Double](target, 200)
evaluator.fit(models.map(_._1)) match { /* ... */ }
The JFreeChart library is used to display the training dataset and the models:
The model that replicates the training data overfits. The models that smooth the training data with a lower amplitude for the sine component of the template function underfit. The variance-bias trade-off for the different models and the training data is illustrated in the following scatter chart:
The variance of each of the smoothing or approximating models is lower than the variance of the training set. As expected, the target model, 0.2*x*(1 + sin(x/10)), has no bias and no variance. The training set has a very high variance because it overfits any target model. The last chart compares the mean square error between each of the models, the training set, and the target model:
You can apply the methodology presented in this example to any classification or regression model. The list of models with low variance includes constant functions and models independent of the training set. High-degree polynomials, complex functions, and deep neural networks have a high variance. Linear regression applied to linear data has a low bias, while linear regression applied to nonlinear data has a higher bias [2:8].
Overfitting affects all aspects of the modeling process negatively, for example:
However, there are well-proven solutions to reduce overfitting [2:9]: