Assessing a model

Evaluating a model is an essential part of the workflow. There is no point in creating the most sophisticated model if you do not have the tools to assess its quality. The validation process consists of defining some quantitative reliability criteria, setting a strategy such as a K-fold cross-validation scheme, and selecting the appropriate labeled data.

Validation

The purpose of this section is to create a Scala class to be used in future chapters for validating models. For starters, the validation process relies on a set of metrics to quantify the fitness of a model generated through training.

Key metrics

Let's consider a simple classification model with two classes, positive and negative, represented in black and white respectively in the following diagram. Data scientists use the following terminology:

  • True positives (TP): These are observations that are correctly labeled as belonging to the positive class (white dots on a dark background)
  • True negatives (TN): These are observations that are correctly labeled as belonging to the negative class (black dots on a light background)
  • False positives (FP): These are observations incorrectly labeled as belonging to the positive class (white dots on a dark background)
  • False negatives (FN): These are observations incorrectly labeled as belonging to the negative class (black dots on a light background)

    Categorization of validation results

This simplistic representation can be extended to classification problems that involve more than two classes. For instance, the false positives for a given class are the observations assigned to that class that actually belong to any other class. These four counts are used for evaluating accuracy, precision, recall, and the F and G measures:

  • Accuracy: Represented as ac, this is the percentage of observations correctly classified.
  • Precision: Represented as p, this is the percentage of observations correctly classified as positive in the group that the classifier has declared positive.
  • Recall: Represented as r, this is the percentage of observations that actually belong to the positive class and are correctly classified as positive.
  • F-Measure or F-score F1: This is the score of a test's accuracy that strikes a balance between precision and recall. It is computed as the harmonic mean of the precision and recall with values ranging between 0 (worst score) and 1 (best score).
  • G-measure: Represented as G, this is similar to the F-measure but is computed as the geometric mean of precision p and recall r.
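    $$ac = \frac{TP + TN}{TP + TN + FP + FN} \qquad p = \frac{TP}{TP + FP} \qquad r = \frac{TP}{TP + FN}$$
    $$F_1 = \frac{2pr}{p + r} \qquad G = \sqrt{p \cdot r}$$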
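The metrics above can be computed directly from the four counts. The following is a minimal sketch; the ConfusionCounts class and its field names are hypothetical and are not part of the book's library, and the sketch assumes the denominators are not zero.

```scala
// Hypothetical helper for illustration only (not the book's API).
// Assumes tp + fp > 0, tp + fn > 0 and at least one observation.
final case class ConfusionCounts(tp: Int, tn: Int, fp: Int, fn: Int) {
  private val total = tp + tn + fp + fn

  // Fraction of all observations that are correctly classified
  def accuracy: Double = (tp + tn).toDouble / total

  // Fraction of predicted positives that are actually positive
  def precision: Double = tp.toDouble / (tp + fp)

  // Fraction of actual positives that are predicted positive
  def recall: Double = tp.toDouble / (tp + fn)

  // Harmonic mean of precision and recall
  def f1: Double = 2.0 * precision * recall / (precision + recall)

  // Geometric mean of precision and recall
  def g: Double = math.sqrt(precision * recall)
}
```

For instance, ConfusionCounts(tp = 40, tn = 45, fp = 5, fn = 10) has an accuracy of 85/100, a precision of 40/45, and a recall of 40/50.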

Implementation

Let's implement the validation formulas using the same trait-based modular design used in creating the preprocessor and classifier modules. The Validation trait defines the signature for the validation of a classification model: the computation of the F1 score and the precision-recall pair:

trait Validation {
  def f1: Double
  def precisionRecall: (Double, Double)
}

Let's provide a default implementation of the Validation trait with the F1Validation class. In the tradition of Scala programming, the class is immutable; it computes the counters for TP, TN, FP, and FN when the class is instantiated. The class takes two parameters:

  • The array of actual versus expected class: actualExpected
  • The target class for true positive observations: tpClass
    // Counter is assumed to be a counting helper defined elsewhere in the book's library
    import Label._

    class F1Validation(actualExpected: Array[(Int, Int)], tpClass: Int) extends Validation {
      // Tally the TP, TN, FP and FN outcomes once, at instantiation
      val counts = actualExpected.foldLeft(new Counter[Label])((cnt, oSeries) => cnt + classify(oSeries._1, oSeries._2))
    
      lazy val accuracy = {
        val num = counts(TP) + counts(TN)
        num.toDouble/counts.foldLeft(0)( (s,kv)  => s + kv._2)
      }
     
      lazy val precision = counts(TP).toDouble/(counts(TP) + counts(FP)) 
      lazy val recall = counts(TP).toDouble/(counts(TP) + counts(FN))
    
      override def f1: Double  = 2.0*precision*recall/(precision + recall)
      override def precisionRecall: (Double, Double) = (precision, recall)
    
      // Label one (actual, expected) pair relative to the target class tpClass
      def classify(actual: Int, expected: Int): Label = {
         if(actual == expected) { if(actual == tpClass) TP else TN }
         else { if (actual == tpClass) FP else FN }
       }
    }

The precision and recall variables are defined as lazy so they are computed only once, when they are either accessed for the first time or the f1 and precisionRecall functions are invoked. The class is independent of the selected machine learning algorithm, the training, the labeling process, and the type of observations.

Contrary to Java, which defines an enumeration as a class of types, Scala requires enumerations to be singletons that inherit the functionality of the Enumeration class:

object Label extends Enumeration {
  type Label = Value
  val TP, TN, FP, FN = Value
}
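To see how the Label enumeration drives the counting step, here is a minimal, self-contained sketch that tallies the four outcomes with groupBy instead of the Counter class. The LabelTally object and its tally method are hypothetical names introduced for illustration, and the sketch assumes that the first element of each pair is the class produced by the model.

```scala
object LabelTally {
  // Same enumeration as in the text, repeated so the sketch compiles on its own
  object Label extends Enumeration {
    type Label = Value
    val TP, TN, FP, FN = Value
  }
  import Label._

  // Same rule as F1Validation.classify, with the target class passed explicitly.
  // Assumption: `predicted` is the class produced by the model.
  def classify(predicted: Int, expected: Int, tpClass: Int): Label =
    if (predicted == expected) { if (predicted == tpClass) TP else TN }
    else { if (predicted == tpClass) FP else FN }

  // Count each outcome; outcomes that never occur default to 0
  def tally(pairs: Seq[(Int, Int)], tpClass: Int): Map[Label, Int] =
    pairs
      .groupBy { case (p, e) => classify(p, e, tpClass) }
      .map { case (label, group) => label -> group.size }
      .withDefaultValue(0)
}
```

With the pairs (1,1), (1,0), (0,1), (0,0), (1,1) and a target class of 1, the tally contains two true positives and one each of the other three outcomes.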

K-fold cross-validation

The labeled dataset used for both training and validation is often not large enough. The solution is to break the original labeled dataset into K data groups. The data scientist creates K training-validation datasets by selecting one of the groups as a validation set and then combining all the remaining groups into a training set, as illustrated in the next diagram. The process is known as K-fold cross-validation [2:7].

In the illustration, the third segment, S3, is used as validation data and all the other segments of the dataset are combined into a single training set. This process is applied in turn to each segment of the original labeled dataset.
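The splitting scheme can be sketched as follows. This is a minimal illustration, not the book's implementation: the KFold object and its split method are hypothetical names, the folds are contiguous segments, and the data is assumed to have been shuffled beforehand if its order carries information.

```scala
object KFold {
  // Returns K (training, validation) pairs. Fold i uses the i-th contiguous
  // segment as the validation set and every other observation for training.
  def split[T](data: Vector[T], k: Int): Seq[(Vector[T], Vector[T])] = {
    require(k > 1 && k <= data.size, "k must be between 2 and data.size")
    val n = data.size
    (0 until k).map { i =>
      val start = i * n / k        // first index of the validation segment
      val end = (i + 1) * n / k    // index just past the validation segment
      val validation = data.slice(start, end)
      val training = data.take(start) ++ data.drop(end)
      (training, validation)
    }
  }
}
```

Each observation appears in exactly one validation set, so a model can be trained and scored K times and the K scores averaged.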

Bias-variance decomposition

There is an obvious challenge in creating a model that fits both the training set and subsequent observations to be classified during the validation phase.

If the model tightly fits the observations selected for training, there is a high probability that new observations will not be correctly classified. This is usually the case when the model is complex. Such a model is characterized as having a low bias with a high variance. This scenario can be attributed to the scientist being overly confident that the observations selected for training are representative of the real world.

Conversely, when the selected model fits the training set loosely, its predictions depend less on the particular observations selected for training. In this case, the model is characterized as having a high bias with a low variance.

The bias, variance, and mean squared error (MSE) of the distribution are defined by the following formulas:

Note

Variance and bias for a true model, θ:

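$$Var(\hat{\theta}) = E\left[(\hat{\theta} - E[\hat{\theta}])^2\right] \qquad Bias(\hat{\theta}) = E[\hat{\theta}] - \theta$$

Here, $\hat{\theta}$ denotes the model estimated from a training set.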

Mean square error:

$$MSE(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right] = Var(\hat{\theta}) + Bias(\hat{\theta})^2$$
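The mean square error of an estimate equals its variance plus the square of its bias. The following sketch checks this identity numerically on a handful of estimates of a known value; the BiasVarianceCheck object and its decompose method are hypothetical names used for illustration, and expectations are approximated by plain averages.

```scala
object BiasVarianceCheck {
  private def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Returns the empirical (variance, bias, mse) of a set of estimates
  // of a known true value. Assumes `estimates` is not empty.
  def decompose(estimates: Seq[Double], truth: Double): (Double, Double, Double) = {
    val m = mean(estimates)
    val variance = mean(estimates.map(e => (e - m) * (e - m)))
    val bias = m - truth
    val mse = mean(estimates.map(e => (e - truth) * (e - truth)))
    (variance, bias, mse)
  }
}
```

For the estimates 1.0, 2.0, and 3.0 of a true value of 1.0, the variance is 2/3, the bias is 1, and the mean square error is 5/3, which is 2/3 + 1².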

Let's illustrate the concepts of bias, variance, and mean square error with an example. At this stage, most of the machine learning techniques have not been introduced yet. Therefore, the example emulates multiple models, fEst: Double => Double, generated from non-overlapping training sets.

These models are evaluated against a test/validation dataset that is emulated by a model, emul. The BiasVarianceEmulator class takes the emulator function, emul, and the size of the validation set, nValues, as parameters. It merely implements the formulas to compute the bias and variance for each of the fEst models:

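// XYTSeries is assumed to be the book library's alias for Array[(Double, Double)]
class BiasVarianceEmulator[T <% Double](emul: Double => Double, nValues: Int) {
    
  // Returns one (variance, bias) pair per model in fEst
  def fit(fEst: List[Double => Double]): Option[XYTSeries] = {
     val rf = Range(0, fEst.size)
     // Mean of all the models at each validation point
     val meanFEst = Array.tabulate(nValues)( x =>  
         rf.foldLeft(0.0)((s, n) => s+fEst(n)(x))/fEst.size) // 1

     val r = Range(0, nValues)
     Some(fEst.map(fe => {
        // Accumulate (variance, bias) over the validation points
        r.foldLeft((0.0, 0.0))((s, x) => { 
          val diff = (fe(x) - meanFEst(x))/ fEst.size   // 2
          (s._1 + diff*diff, s._2 + Math.abs(fe(x)-emul(x)))} )
     }).toArray)
  }
}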

The fit method computes the variance and bias for each of the fEst models generated from training. First, the mean of all the models is computed (line 1), and then it is used in the computation of the variance and bias (line 2). The method returns a tuple (variance, bias) for each of the fEst models.

Let's apply the emulator to three regression models evaluated against validation data:

$$y = \frac{x}{5} \qquad y = 3 \cdot 10^{-4} x^2 + 0.18x \qquad y = \frac{x}{5}\left(1 + \sin\frac{x}{20}\right)$$

The client code for the emulator consists of defining the emul emulator function, and a list, fEst, of three models defined as tuples of (function, descriptor) of type (Double=>Double, String). The fit method is called on the model functions extracted through a map, as shown in the following code:

// Emulated validation data
val emul = (x: Double) => 0.2*x*(1.0 + Math.sin(x*0.05))
// Three candidate models, each paired with a descriptor
val fEst = List[(Double=>Double, String)] (
  ((x: Double) => 0.2*x, "y=x/5"),
  ((x: Double) => 0.0003*x*x + 0.18*x, "y=3e-4.x^2+0.18x"),
  ((x: Double) => 0.2*x*(1+Math.sin(x*0.05)),
                "y=x(1+sin(x/20))/5"))
val emulator = new BiasVarianceEmulator[Double](emul, 200)
emulator.fit(fEst.map( _._1)) match {
  case Some(varBias) => show(varBias)
  case None => …
}

The JFreeChart library is used to display the test dataset and the three model functions.

Fitting models to dataset

The variance-bias trade-off is illustrated in the following scatter chart using the absolute value of the bias:

The more complex the function, the lower the bias. A low bias is usually, but not always, associated with a high variance. The most complex function, y=x(1+sin(x/20))/5, has by far the highest variance and the lowest bias. It matches the training dataset fairly well. As expected, the mean square error reflects the ability of each of the three models to fit the test data.

Mean square error bar chart

The low bias of the complex model is reflected in its ability to predict new observations correctly. Its MSE is therefore low, as expected.

Complex models with low bias and high variance are said to overfit. Models with high bias and low variance are said to underfit.

Overfitting

The methodology presented in the example can be applied to any classification or regression model. Models with low variance include constant functions and models that are independent of the training set. High-degree polynomials, complex functions, and deep neural networks have high variance. Linear regression applied to linear data has a low bias, while linear regression applied to nonlinear data has a higher bias [2:8].

Overfitting affects all aspects of the modeling process negatively, for example:

  • It is a sure sign of an overly complex model, which is difficult to debug and consumes computation resources
  • It makes the model represent minor fluctuations and noise
  • It may discover irrelevant relationships between observed and latent features
  • It has poor predictive performance

However, there are well-proven solutions to reduce overfitting [2:9]:

  • Increasing the size of the training set whenever possible
  • Reducing noise in labeled and input data through filtering
  • Decreasing the number of features using techniques such as principal component analysis
  • Modeling observable and latent noise using filtering techniques such as Kalman or autoregressive models
  • Reducing inductive bias in a training set by applying cross-validation
  • Penalizing extreme values for some of the model's features using regularization techniques
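
As an illustration of the last item, here is a minimal sketch of L2 (ridge) regularization for a one-feature linear model without an intercept. The RidgeSketch object, the slope method, and the lambda parameter are hypothetical names chosen for this example; the closed form minimizes the sum of squared errors plus lambda times the square of the weight.

```scala
object RidgeSketch {
  // Weight w of the model y ≈ w*x that minimizes
  //   sum((y - w*x)^2) + lambda * w^2
  // Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lambda).
  // Assumes xs contains a non-zero value or lambda > 0.
  def slope(xs: Seq[Double], ys: Seq[Double], lambda: Double): Double = {
    require(xs.size == ys.size, "xs and ys must have the same size")
    require(lambda >= 0.0, "lambda must be non-negative")
    val sxy = xs.zip(ys).map { case (x, y) => x * y }.sum
    val sxx = xs.map(x => x * x).sum
    sxy / (sxx + lambda)
  }
}
```

With lambda set to 0, the method returns the ordinary least-squares weight; larger values of lambda shrink the weight toward zero, which penalizes extreme values.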