Assessing a model

Evaluating a model is an essential part of the workflow. There is no point in creating the most sophisticated model if you do not have the tools to assess its quality. The validation process consists of defining some quantitative reliability criteria, setting a strategy such as a K-fold cross-validation scheme, and selecting the appropriate labeled data.

Validation

The purpose of this section is to create a reusable Scala class to validate models. For starters, the validation process relies on a set of metrics to quantify the fitness of a model generated through training.

Key quality metrics

Let's consider a simple classification model with two classes, defined as positive (versus negative) and represented with the black (versus white) color in the following diagram. Data scientists use the following terminology:

  • True positives (TP): These are observations that are correctly labeled as belonging to the positive class (white dots on a dark background)
  • True negatives (TN): These are observations that are correctly labeled as belonging to the negative class (black dots on a light background)
  • False positives (FP): These are observations that are incorrectly labeled as belonging to the positive class (white dots on a dark background)
  • False negatives (FN): These are observations that are incorrectly labeled as belonging to the negative class (dark dots on a light background)
Categorization of validation results

This simplistic representation can be extended to classification problems that involve more than two classes. For instance, false positives are observations that are incorrectly labeled as belonging to any class other than their correct class. These four counts are used for evaluating accuracy, precision, recall, and the F and G measures, as follows:

  • Accuracy: This is the percentage of observations correctly classified and is represented as ac.
  • Precision: This is the percentage of observations correctly classified as positive in the group that the classifier has declared positive. It is represented as p.
  • Recall: This is the percentage of observations labeled as positive that are correctly classified and is represented as r.
  • F1-measure or F1-score: This measure strikes a balance between precision and recall. It is computed as the harmonic mean of the precision and recall with values ranging between 0 (worst score) and 1 (best score). It is represented as F1.
  • Fn score: This is the generic F scoring method with an arbitrary degree n. It is represented as Fn.
  • G measure: This is similar to the F-measure but is computed as the geometric mean of precision p and recall r. It is represented as G.

Note

Validation metrics

M3: Accuracy ac, precision p, recall r, F1, Fn, and G scores are defined as follows:

$$ac = \frac{TP + TN}{TP + TN + FP + FN} \qquad p = \frac{TP}{TP + FP} \qquad r = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2\,p\,r}{p + r} \qquad F_n = \frac{(1 + n^2)\,p\,r}{n^2\,p + r} \qquad G = \sqrt{p\,r}$$
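As a simple illustration of the M3 formulas, the following sketch computes these metrics from raw TP, TN, FP, and FN counts; the counts are hypothetical values chosen for the example:

object QualityMetrics extends App {
  val (tp, tn, fp, fn) = (82.0, 51.0, 17.0, 9.0)  // hypothetical counts

  val accuracy = (tp + tn)/(tp + tn + fp + fn)
  val precision = tp/(tp + fp)
  val recall = tp/(tp + fn)
  val f1 = 2.0*precision*recall/(precision + recall)
  def fScore(n: Int): Double =                     // generic Fn score
    (1 + n*n)*precision*recall/(n*n*precision + recall)
  val g = Math.sqrt(precision*recall)              // geometric mean

  println(s"ac=$accuracy p=$precision r=$recall F1=$f1 F2=${fScore(2)} G=$g")
}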

The computation of the precision, recall, and F1 score depends on the number of classes used in the classifier. We will consider the following implementations:

  • F-score validation for binomial (two classes) classification (that is, a positive and negative outcome)
  • F-score validation for multinomial (more than two classes) classification

F-score for binomial classification

The binomial F validation computes the precision, recall, and F scores for the positive class.

Let's implement the F-score or F-measure as a specialization of the following Validation trait:

trait Validation { def score: Double }

The BinFValidation class encapsulates the computation of the F1 score as well as the precision and recall by counting the occurrences of the TP, TN, FP, and FN values. It implements the M3 formula. In the tradition of Scala programming, the class is immutable; it computes the counters for TP, TN, FP, and FN when the class is instantiated. The class takes the following three parameters:

  • The expected values with the 0 value for a negative outcome and 1 for a positive outcome
  • The set of observations, xt, is used for validating the model
  • The predictive predict function classifies observations (line 1)

The code will be as follows:

import Label._

class BinFValidation[T <: AnyVal](
     expected: Vector[Int],
     xt: XVSeries[T])
     (predict: Array[T] => Int)(implicit f: T => Double) 
  extends Validation { //1

  private val POSITIVE = 1

  val counters = {
    val predicted = xt.map( predict(_))
    expected.zip(predicted)
      .aggregate(new Counter[Label])((cnt, ap) => 
         cnt + classify(ap._1, ap._2), _ ++ _) //2
  }

  override def score: Double = f1   //3
  lazy val f1 = 2.0*precision*recall/(precision + recall)
  lazy val precision = compute(FP)  //4
  lazy val recall = compute(FN) 

  def compute(n: Label): Double = {
    val denom = counters(TP) + counters(n)
    counters(TP).toDouble/denom
  }
  private def classify(expected: Int, predicted: Int): Label = //5
    if(expected == predicted) if(expected == POSITIVE) TP else TN
    else if(expected == POSITIVE) FN else FP 
}

The constructor counts the number of occurrences for each of the four outcomes (TP, TN, FP, and FN) (line 2). The precision, recall, and f1 values are defined as lazy values so they are computed only once, when they are accessed directly or the score method is invoked (line 4). The F1 measure is the most commonly used scoring value for validating classifiers. Therefore, it is the default score (line 3). The classify private method extracts the qualifier from the expected and predicted values (line 5).

The BinFValidation class is independent of the type of classifier, training, labeling process, and type of observations.
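As an illustration, a minimal usage sketch is shown below; the threshold classifier, the observations, and the assumption that XVSeries[T] is a vector of feature arrays are hypothetical:

// Hypothetical usage sketch: validate a simple threshold classifier.
// XVSeries[Double] is assumed to be a Vector[Array[Double]] of observations.
val xt: Vector[Array[Double]] = Vector(
  Array(0.1), Array(0.8), Array(0.45), Array(0.9), Array(0.2))
val expected = Vector(0, 1, 0, 1, 1)                // labeled outcomes
val predict = (x: Array[Double]) => if(x(0) > 0.5) 1 else 0

val validation = new BinFValidation[Double](expected, xt)(predict)
println(s"F1 score: ${validation.score}")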

Unlike Java, which defines an enumeration as a special type of class, Scala implements enumerations as singleton objects that extend the scala.Enumeration abstract class:

object Label extends Enumeration {
  type Label = Value
  val TP, TN, FP, FN = Value
}

The generic F-score formula (Fn) with n > 1 places more weight on recall than on precision, which is shown in the following graph:

A comparative analysis of the impact of precision on the F1, F2, and F3 scores for a given recall

Note

Multiclass scoring

Our implementation of the binomial validation computes the precision, recall, and F1 score for the positive class only. The generic multinomial validation class presented in the next section computes these quality metrics for both positive and negative classes.

F-score for multinomial classification

The validation metric is defined by the M3 formula. The idea is quite simple: the precision and recall values are computed for all the classes and then they are averaged to produce a single precision and recall value for the entire model. The precision and recall for the entire model leverage the counts of TP, FP, FN, and TN introduced in the previous section.

There are two commonly used sets of formulas to compute the precision and recall of a model:

  • Macro: This method computes the precision and recall for each class, and then averages them across the classes.
  • Micro: This method sums the numerators and denominators of the per-class precision and recall formulas across all the classes before computing the precision and recall.

We will use the macro formulas from now on.

Note

Macro formulas for multinomial precision and recall

M4: The macro version of the precision p and recall r for a model of c classes is computed as follows:

$$p = \frac{1}{c}\sum_{i=1}^{c}\frac{TP_i}{TP_i + FP_i} \qquad r = \frac{1}{c}\sum_{i=1}^{c}\frac{TP_i}{TP_i + FN_i}$$
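The following sketch contrasts the macro and micro averaging strategies using hypothetical per-class TP, FP, and FN counts:

// Macro versus micro averaging from hypothetical per-class counts
val tp = Array(40, 25, 10)   // true positives per class
val fp = Array( 5, 10,  4)   // false positives per class
val fn = Array( 7,  3,  9)   // false negatives per class
val c = tp.length

// Macro: average of the per-class precision and recall values
val macroPrecision = tp.indices.map(i => tp(i).toDouble/(tp(i) + fp(i))).sum/c
val macroRecall    = tp.indices.map(i => tp(i).toDouble/(tp(i) + fn(i))).sum/c

// Micro: sum the numerators and denominators across classes first
val microPrecision = tp.sum.toDouble/(tp.sum + fp.sum)
val microRecall    = tp.sum.toDouble/(tp.sum + fn.sum)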

The computation of the precision and recall factors for a classifier with more than two classes requires the extraction and manipulation of the confusion matrix. We use the following convention: expected values are defined as columns and predicted values are defined as rows.

A confusion matrix for a six-class classification

The MultiFValidation multinomial validation class takes the following four parameters:

  • The expected vector of class indices, ranging from 0 to the number of classes - 1
  • The set of observations, xv, used for validating the model
  • The number of classes in the model
  • The predict predictive function classifies observations (line 7)

The code will be as follows:

class MultiFValidation[T <: AnyVal](
    expected: Vector[Int],
    xv: XVSeries[T],
    classes: Int)
    (predict: Array[T] => Int)(implicit f : T => Double)
  extends Validation { //7

  private val labeled = xv.zip(expected)

  val confusionMatrix: Matrix[Int] = //8
    labeled./:(Matrix[Int](classes)){ case (m, (x, n)) => 
      m + (predict(x), n, 1) }  //9

  val macroStats: DblPair = { //10
    val pr = Range(0, classes)./:((0.0, 0.0))((s, n) => {
      val tp = confusionMatrix(n, n)   //11
      val fn = confusionMatrix.col(n).sum - tp  //12
      val fp = confusionMatrix.row(n).sum - tp  //13
      (s._1 + tp.toDouble/(tp + fp), s._2 + tp.toDouble/(tp + fn))
    })
    (pr._1/classes, pr._2/classes)
  }
  lazy val precision: Double = macroStats._1
  lazy val recall: Double = macroStats._2
  override def score: Double = 2.0*precision*recall/(precision + recall)
}

The core element of the multiclass validation is the confusion matrix, confusionMatrix (line 8). Its element at indices (i, j) = (index of the predicted class, index of the expected class) counts the observations of expected class j that the model predicts as class i (line 9).

As stated in the introduction of this section, we use the macro definition of the precision and recall (line 10). The count of true positives, tp, for each class corresponds to the diagonal element of the confusion matrix (line 11). The count of false negatives, fn, for a class is the sum of the counts over all the predicted classes (column values) for that expected class, minus the true positive count (line 12). The count of false positives, fp, for a class is the sum of the counts over all the expected classes (row values) for that predicted class, minus the true positive count (line 13).

The formula for the computation of the F1 score is the same as the formula used in the binomial validation.
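As an illustration, a minimal usage sketch for a hypothetical three-class classifier on a single feature could look like the following (XVSeries[Double] is again assumed to be a Vector[Array[Double]]):

// Hypothetical usage sketch: validate a three-class classifier that
// discretizes a single feature into the classes 0, 1, and 2.
val xv: Vector[Array[Double]] = Vector(
  Array(0.1), Array(0.5), Array(0.9), Array(0.4), Array(0.7))
val expected = Vector(0, 1, 2, 1, 2)
val predict = (x: Array[Double]) =>
  if(x(0) < 0.33) 0 else if(x(0) < 0.66) 1 else 2

val multiValidation = new MultiFValidation[Double](expected, xv, 3)(predict)
println(s"Macro F1 score: ${multiValidation.score}")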

Cross-validation

It is quite common that the labeled dataset (observations plus the expected outcomes) available to the data scientist is not very large. The solution is to break the original labeled dataset into K groups of data.

One-fold cross validation

The one-fold cross validation is the simplest scheme used for extracting a training set and validation set from a labeled dataset, as described in the following diagram:

An illustration of the generation of a one-fold validation set

The one-fold cross validation methodology consists of the following three steps:

  1. Select the ratio of the size of the training set over the size of the entire labeled dataset.
  2. Randomly select the labeled observations for the validation phase.
  3. Create the training set as the remaining labeled observations.

The one-fold cross validation is implemented by the OneFoldXValidation class. It takes the following three arguments: an xt vector of observations, the expected vector of class labels, and the ratio of the size of the training set over the size of the entire labeled dataset (line 14):

import scala.util.Random

type ValidationType[T] = Vector[(Array[T], Int)]

class OneFoldXValidation[T <: AnyVal](
    xt: XVSeries[T],
    expected: Vector[Int], 
    ratio: Double)(implicit f: T => Double) {  //14
  val datasSet: (ValidationType[T], ValidationType[T]) //15 defined below
  def trainingSet: ValidationType[T] = datasSet._1
  def validationSet: ValidationType[T] = datasSet._2
}

The constructor of the OneFoldXValidation class generates the segregated training and validation sets from the set of observations and expected classes (line 15):

val datasSet: (ValidationType[T], ValidationType[T]) = { 
  val labeledData = xt.zip(expected)  //16
  val trainingSize = (ratio*expected.size).floor.toInt //17
  
  val valSz = labeledData.size - trainingSize
  val adjSz = if(valSz < 2) 1 
          else if(valSz >= labeledData.size) labeledData.size - 1 
          else valSz  //18
  val ordLabeledData = labeledData
      .map( (_, Random.nextDouble) )  //19
      .sortWith( _._2 < _._2).unzip._1 //20
 
  (ordLabeledData.dropRight(adjSz),   
   ordLabeledData.takeRight(adjSz))  //21
}

The initialization of the OneFoldXValidation class creates a labeledData vector of labeled observations by zipping the observations and the expected outcomes (line 16). The training ratio is used to compute the respective sizes of the training set (line 17) and the validation set (line 18).

In order to create the training and validation sets randomly, we pair each labeled observation with a random value (line 19), and then reorder the labeled dataset by sorting on the random values (line 20). Finally, the initialization returns the pair of training and validation sets (line 21).
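As an illustration, a minimal usage sketch, assuming XVSeries[T] is a vector of feature arrays, could look like the following:

// Hypothetical usage sketch: split a small labeled dataset with a
// 0.75 training-to-total ratio.
val xt: Vector[Array[Double]] = Vector.tabulate(20)(n => Array(n*0.1))
val expected = Vector.tabulate(20)(n => n % 2)

val xValidation = new OneFoldXValidation[Double](xt, expected, 0.75)
val training = xValidation.trainingSet      // about 15 labeled observations
val validation = xValidation.validationSet  // about 5 labeled observations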

K-fold cross validation

The data scientist creates K training-validation datasets by selecting one of the groups as a validation set and then combining all the remaining groups into a training set, as illustrated in the following diagram. The process is known as the K-fold cross validation [2:7].

An illustration of the generation of a K-fold cross validation set

In the diagram, the third segment, S3, is used as the validation data and all the other segments are combined into a single training set. This process is repeated for each segment of the original labeled dataset.
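A minimal sketch of the K-fold scheme, using a hypothetical kFolds method over labeled observations, could look like the following:

// Minimal sketch of K-fold cross validation folds (illustrative only)
def kFolds[T](labeled: Vector[(Array[T], Int)], k: Int)
  : Seq[(Vector[(Array[T], Int)], Vector[(Array[T], Int)])] = {

  val foldSize = (labeled.size.toDouble/k).ceil.toInt
  val folds = labeled.grouped(foldSize).toVector     // K segments

  // Each segment is used once as the validation set; the remaining
  // segments are concatenated into the training set
  folds.indices.map(i =>
    (folds.indices.filter(_ != i).flatMap(folds(_)).toVector, folds(i)))
}

Each (training, validation) pair can then be evaluated with one of the validation classes defined earlier in this section.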

Bias-variance decomposition

The challenge is to create a model that fits both the training set and subsequent observations to be classified during the validation phase.

If the model tightly fits the observations selected for training, there is a high probability that new observations may not be correctly classified. This is usually the case when the model is complex. This model is characterized as having a low bias with a high variance. Such a scenario can be attributed to the fact that the scientist is overly confident that the observations she/he selected for training are representative of the real world.

Conversely, if the selected model only loosely fits the training set, it fails to capture the underlying structure of the data, and new observations are also likely to be misclassified. In this case, the model is characterized as having a high bias with a low variance.

The bias, variance, and mean square error (MSE) of an estimator are defined mathematically by the following formulas:

Note

M5: The variance and bias of an estimator of a true model, θ, are defined as:

$$var(\hat{\theta}) = E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2\right] \qquad bias(\hat{\theta}) = E[\hat{\theta}] - \theta$$

M6: Mean square error is defined as:

$$MSE(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right] = var(\hat{\theta}) + bias(\hat{\theta})^2$$
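As a quick numerical check of the M5 and M6 formulas, the following sketch estimates the bias, variance, and MSE of a deliberately biased, noisy estimator of a known value; the estimator and its parameters are hypothetical:

import scala.util.Random

// Numerical check of M5 and M6: MSE = variance + bias^2
val theta = 2.0                                   // true value
val estimates = Seq.fill(10000)(theta + 0.5 + 0.3*Random.nextGaussian)

val mean = estimates.sum/estimates.size
val bias = mean - theta                           // expected to be close to 0.5
val variance = estimates.map(x => (x - mean)*(x - mean)).sum/estimates.size
val mse = estimates.map(x => (x - theta)*(x - theta)).sum/estimates.size

println(s"bias=$bias variance=$variance mse=$mse var+bias^2=${variance + bias*bias}")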

Let's illustrate the concepts of bias, variance, and mean square error with an example. At this stage, you have not been introduced to most of the machine learning techniques. Therefore, we create a simulator to illustrate the relationship between the bias and variance of a classifier. The components of the simulation are as follows:

  • A training set, training
  • A simulated target model, target, of the Double => Double type, extracted from the training set
  • A set of candidate models to evaluate

A model that exactly replicates the training data overfits the target model. Models that merely approximate the training data will most likely underfit. The models in this example are defined by single-variable functions.

These models are evaluated against a validation dataset. The BiasVariance class takes the target model, target, and the size of the validation set, nValues, as parameters (line 22). It merely implements the formulas to compute the bias and variance of each model:

type Dbl_F = Double => Double 

class BiasVariance[T](target: Dbl_F, nValues: Int)
     (implicit f: T => Double) { //22
  def fit(models: List[Dbl_F]): List[DblPair] = { //23
    // Mean of the candidate models' predictions for each validation value
    val means = Array.tabulate(nValues)(x => 
        models.map(_(x.toDouble)).sum/models.size)
    models.map(accumulate(_, means, models.size)) //24
  }
}

The fit method computes the mean of the candidate models' predictions over the validation range, and then the variance and bias of each model relative to this mean and to the target model (line 23). The accumulation is performed by the accumulate method (line 24):

def accumulate(f: Dbl_F, means: Array[Double], numModels: Int): DblPair = 
  Range(0, nValues)./:((0.0, 0.0)){ case((s, t), x) => { 
    val diff = (f(x) - means(x))/numModels
    (s + diff*diff, t + Math.abs(f(x) - target(x))) //25
  }}

The training data is generated by the following single-variable function with the r1 and r2 noise components:

$$f(x) = 0.2\,x\left(1 + \sin(0.1\,x + r_1)\right) + r_2$$

The accumulate method returns a tuple (variance, bias) for each model, f (line 25). The candidate models are defined by the following family of single-variable functions for the values n = 1, 2, and 4:

$$f_n(x) = 0.2\,x\left(1 + \frac{\sin(0.1\,x)}{n}\right)$$

The target model (line 26) and models (line 27) belong to the same family of single variable functions:

import scala.util.Random

val template = (x: Double, n: Int) => 
                        0.2*x*(1.0 + Math.sin(x*0.1)/n) 
val training = (x: Double) => {
  val r1 = 0.45*(Random.nextDouble - 0.5)
  val r2 = 38.0*(Random.nextDouble - 0.5) + Math.sin(x*0.3)
  0.2*x*(1.0 + Math.sin(x*0.1 + r1)) + r2
}
val target = (x: Double) => template(x, 1) //26
val models = List[(Dbl_F, String)](  //27
  ((x: Double) => template(x, 4), "Underfit1"),  
  ((x: Double) => template(x, 2), "Underfit2"),
  ((x: Double) => training(x), "Overfit"),
  (target, "target")
)
val evaluator = new BiasVariance[Double](target, 200)
evaluator.fit(models.map( _._1)) match { /* … */ }

The JFreeChart library is used to display the training dataset and the models:

Fitting models to the dataset

The model that replicates the training data overfits. The models that smooth the training data by reducing the amplitude of the sine component of the template function underfit. The variance-bias trade-off for the different models and the training data is illustrated in the following scatter chart:

Scatter plot of the bias-variance trade-off for four models, one duplicating the training set

The variance of each of the smoothing or approximating models is lower than the variance of the training set. As expected, the target model, 0.2*x*(1 + sin(x/10)), has no bias and no variance. The training set has a very high variance because it overfits any target model. The last chart compares the mean square error between each of the models, the training set, and the target model:

Comparative mean square error for four models

Note

Evaluating bias and variance

The section uses a fictitious target model and training set to illustrate the concept of the bias and variance of models. The bias and variance of machine learning models are actually estimated using validation data.

Overfitting

You can apply the methodology presented in the example to any classification and regression model. The list of models with low variance includes constant functions and models independent of the training set. High degree polynomials, complex functions, and deep neural networks have high variance. Linear regression applied to linear data has a low bias, while linear regression applied to nonlinear data has a higher bias [2:8].

Overfitting affects all aspects of the modeling process negatively, for example:

  • It renders debugging difficult
  • It makes the model too dependent on minor fluctuations (long tail) and noisy data
  • It may discover irrelevant relationships between observed and latent features
  • It leads to poor predictive performance

However, there are well-proven solutions to reduce overfitting [2:9]:

  • Increasing the size of the training set whenever possible
  • Reducing noise in labeled observations using smoothing and filtering techniques
  • Decreasing the number of features using techniques such as principal components analysis, as discussed in the Principal components analysis section in Chapter 4, Unsupervised Learning
  • Modeling observable and latent noisy data using Kalman or auto regressive models, as discussed in Chapter 3, Data Preprocessing
  • Reducing inductive bias in a training set by applying cross-validation
  • Penalizing extreme values for some of the model's features using regularization techniques, as discussed in the Regularization section in Chapter 6, Regression and Regularization