Naïve Bayes classifiers

This conditional independence between the features X is an essential requirement of the Naïve Bayes classifier; it also restricts its applicability. Naïve Bayes classification is best understood through simple, concrete examples [5:5].

Introducing the multinomial Naïve Bayes

Let's consider the problem of predicting the change in interest rates. The first step is to list the factors that may trigger or cause an increase or decrease in interest rates. For the sake of illustrating Naïve Bayes, we will select the consumer price index (CPI), the change in the Federal funds rate (FDF), and the gross domestic product (GDP) as a first set of features. The terminology is described in the Terminology section under Finances 101 in Appendix A, Basic Concepts.

The use case is to predict the direction of the change in the yield of the 1-year Treasury bill (1yTB), taking into account the change in the current CPI, FDF, and GDP. The objective is, therefore, to create a predictive model using a combination of these three features.

It is assumed that no financial investment expert is available to supply rules or policies for predicting interest rates. Therefore, the model depends heavily on historical data. Intuitively, if one feature always increases when the yield of the 1-year Treasury bill increases, then we can conclude that there is a strong correlation or causal relationship between that feature and the output variation in interest rates.

[Figure: The Naïve Bayes model for predicting the change in the yield of the 1-year T-bill]

The correlation (or cause-effect relationship) is derived from historical data. The methodology consists of counting the number of times each feature either increases (UP) or decreases (DOWN), and recording the corresponding output (or labeled data), as illustrated in the following table:

ID  | GDP  | FDF  | CPI  | 1yTB
1   | UP   | DOWN | UP   | UP
2   | UP   | UP   | UP   | UP
3   | DOWN | UP   | DOWN | DOWN
4   | UP   | DOWN | DOWN | DOWN
…   | …    | …    | …    | …
256 | DOWN | DOWN | UP   | DOWN

First, let's tabulate the number of occurrences of each change {UP, DOWN} for the three features and the output value (the 1-year Treasury bill):

Number   | GDP  | FDF  | CPI  | 1yTB
UP       | 169  | 184  | 175  | 159
DOWN     | 97   | 72   | 81   | 97
Total    | 256  | 256  | 256  | 256
UP/Total | 0.66 | 0.72 | 0.68 | 0.625

Next, let's compute the number of positive directions for each of the features when the yield of the 1-year Treasury bill increases (159 occurrences):

Number   | GDP  | Fed funds | CPI
UP       | 110  | 136       | 127
DOWN     | 49   | 23        | 32
Total    | 159  | 159       | 159
UP/Total | 0.69 | 0.85      | 0.80

From this table, we conclude that the yield of the 1-year Treasury bill increases when the GDP increases (69 percent of the time), the Federal funds rate increases (85 percent of the time), and the CPI increases (80 percent of the time).

Let's formalize the Naïve Bayes model before turning these findings into a probabilistic model.

Formalism

Let's start by clarifying the terminology used in the Bayesian model:

  • Class prior probability or class prior is the probability of a class
  • Likelihood is the probability of an observation given a class, also known as the probability of the predictor given a class
  • Evidence is the probability of observations occurring, also known as the prior probability of the predictor
  • Posterior probability is the probability of an observation x being in a given class

No model can be simpler! The log likelihood, log p(x|C), is commonly used instead of the likelihood, p(x|C) (the probability of an observation given a class), in order to reduce the impact of the features y that have a low likelihood, p(y|C).

The objective of the Naïve Bayes classification of a new observation is to compute the class that has the highest log likelihood. The mathematical notation for the Naïve Bayes model is also straightforward.

Note

The posterior probability

Given:

  • x = {xi}, i = 0, …, n-1, a set of n features
  • {Cj}, a set of classes with their class priors p(Cj)
  • p(x), the evidence of the new observation
  • p(xi|Cj), the likelihood of each feature for a class

The posterior probability (Bayes' rule):

p(Cj|x) = p(x|Cj) . p(Cj) / p(x)

The posterior probability with conditional independence, where the features xi are independent and the probabilities are normalized for the evidence, p(x) = 1:

p(Cj|x) = Π(i=0,n-1) p(xi|Cj) . p(Cj)

The log-likelihood:

log p(Cj|x) = Σ(i=0,n-1) log p(xi|Cj) + log p(Cj)

The Naïve Bayes classification:

C* = argmax(j) { Σ(i=0,n-1) log p(xi|Cj) + log p(Cj) }
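
Expressed in code, the classification rule is simply an argmax over the per-class scores. The following is a minimal, generic sketch of the decision rule (not the book's implementation, which follows in the Implementation section):

// Minimal sketch of the Naïve Bayes decision rule: select the class that
// maximizes the sum of the log-likelihoods plus the log of the class prior.
// logLikelihoods(j)(i) = log p(xi|Cj), logPriors(j) = log p(Cj)
def naiveBayesArgmax(logLikelihoods: Array[Array[Double]],
                     logPriors: Array[Double]): Int =
  logLikelihoods.zip(logPriors)
                .map { case (ll, lp) => ll.sum + lp }
                .zipWithIndex
                .maxBy(_._1)
                ._2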

This particular use case has a major drawback—the GDP statistics are provided quarterly, while the CPI data is made available once a month and a change in the FDF rate is rather infrequent.

The frequentist perspective

The ability to compute the posterior probability depends on the formulation of the likelihood using historical data. A simple solution is to count the occurrences of observations for each class and compute their frequency.

Let's consider the first example that predicts the direction of change in the yield of the 1-year Treasury bill given changes in the GDP, FDF, and CPI.

The results are expressed with simple probabilistic formulas and a directed graphical model:

p(GDP=UP|1yTB=UP) = 110/159
p(1yTB=UP) = num occurrences(1yTB=UP)/total num of occurrences = 159/256
p(1yTB=UP|GDP=UP,FDF=UP,CPI=UP) = p(GDP=UP|1yTB=UP) x
                                  p(FDF=UP|1yTB=UP) x
                                  p(CPI=UP|1yTB=UP) x
                                  p(1yTB=UP) = 0.69 x 0.85 x 0.80 x 0.625
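
For illustration, the same computation can be scripted directly from the tabulated counts. The following is a minimal, self-contained sketch (the counts are hard-coded from the tables above; it is not part of the book's library):

// Minimal sketch: frequentist estimate of the (unnormalized) posterior
// p(1yTB=UP | GDP=UP, FDF=UP, CPI=UP) from the counts tabulated above.
object FrequentistScore extends App {
  val total = 256.0                    // total number of observations
  val upTB = 159.0                     // occurrences of 1yTB=UP
  val prior = upTB/total               // p(1yTB=UP) = 0.625

  // Likelihoods p(feature=UP | 1yTB=UP) from the last table
  val likelihoods = Seq(110.0/upTB,    // GDP: ~0.69
                        136.0/upTB,    // FDF: ~0.85
                        127.0/upTB)    // CPI: ~0.80

  val posterior = likelihoods.product*prior
  println(f"p(1yTB=UP | GDP=UP, FDF=UP, CPI=UP) ~ $posterior%.3f") // ~0.295
}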

[Figure: The Bayesian network for the prediction of the change in the yield of the 1-year Treasury bill]

Tip

Overfitting

The Naïve Bayes model is not immune to overfitting, in case the number of observations is not large enough relative to the number of features. One approach to address this problem is to perform a feature selection, using the mutual information exclusion [5:6].

This problem is not a good candidate for a Bayesian classification for two reasons:

  • The training set is not large enough to compute accurate prior probabilities and generate a stable model; decades of quarterly GDP data is needed to train and validate the model
  • The features have different rates of change, which predominately favor the feature with the highest frequency; in this case, the CPI

Let's select another use case for which a large historical data set is available and can be automatically labeled.

The predictive model

The second use case consists of predicting the direction of the closing price of a stock, pr(t+1) = {UP, DOWN}, at trading day t+1, given the history of the direction of its price, volume, and volatility over the previous t days, pr(i), i=1,…,t. The features volume and volatility have already been used in the Creating a model (learning) section under Let's kick the tires in Chapter 1, Getting Started.

Therefore, the three features under consideration are:

  • The closing price, pr(t), of the last trading session, t, is above or below the average closing price over the n previous trading days, [t-n, t]
  • The volume of the last trading day, vl(t), is above or below the average volume of the n previous trading days
  • The volatility on the last trading day, vt(t), is above or below the average volatility of the previous n trading days

The directed graphical model can be expressed using one output variable (the price at session t+1 is greater than the price at session t) and three features: the price condition (1), the volume condition (2), and the volatility condition (3).

[Figure: A Bayesian model for predicting the future direction of the stock price]

This model works under the assumption that there is at least one observation, and ideally a few observations, for each feature and for each labeled output.

The zero-frequency problem

It is possible that the training set does not contain any observed data for a feature for a specific label or class. In this case, the mean is 0/N = 0, and therefore, the likelihood is null, making classification unfeasible. The case for which there are only a few observations for a feature in a given class is also an issue, as it skews the likelihood.

There are a couple of correcting or smoothing formulas that address this issue for unobserved features or features with a low number of occurrences, such as the Laplace and Lidstone smoothing formulas.

Note

The smoothing factor for counters

Laplace smoothing of the mean k/N out of N observations of features of dimension n:

µ' = (k + 1)/(N + n)

Lidstone smoothing with a factor λ:

µ' = (k + λ)/(N + λn)

The two formulas are commonly used in natural language processing applications, for which occurrence of a specific word or tag is a feature [5:7].
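
A minimal sketch of the two smoothing formulas, written as standalone helpers (the names are hypothetical and not part of the book's source code), shows how an unobserved feature no longer produces a zero likelihood:

// Minimal sketch of the Laplace and Lidstone smoothing of a frequency k/N
// (hypothetical helpers, not from the book's library).
// k: occurrences of the feature for the class, N: observations for the class,
// n: dimension of the feature space
object Smoothing {
  def laplace(k: Int, N: Int, n: Int): Double =
    (k + 1.0)/(N + n)

  def lidstone(k: Int, N: Int, n: Int, lambda: Double): Double =
    (k + lambda)/(N + lambda*n)
}
// An unobserved feature (k = 0) gets a small, non-zero estimate:
// Smoothing.laplace(0, 159, 3)       // ~0.0062 instead of 0.0
// Smoothing.lidstone(0, 159, 3, 0.5) // ~0.0031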

Implementation

I think it is time to write some Scala code and toy around with Naïve Bayes. Let's start with an overview of the software components.

Software design

Our implementation of the Naïve Bayes classifier uses the following components:

  • A generic model, NaiveBayesModel, of the type Model, which is initialized through training during the instantiation of the class.
  • A model for the binomial classification, BinNaiveBayesModel, which subclasses NaiveBayesModel. The model consists of a density function of the type Density, and a pair of positive and negative Likelihood instances.
  • A model for the multinomial classification MultiNaiveBayesModel.
  • The predictive or classification routine is implemented as a data transformation extending the PipeOperator trait.
  • The NaiveBayes classifier class has three parameters: a smoothing factor, such as the Laplace value; a labeled training set of the XTSeries type; and a probability density function.

The principle of software architecture applied to the implementation of classifiers is described in the Design template for classifiers section in Appendix A, Basic Concepts.

The key software components of the Naïve Bayes classifier are described in the following UML class diagram:

[Figure: The UML class diagram for the Naïve Bayes classifier]

Training

The objective of the training phase is to build a model consisting of the likelihood for each feature and the class prior. The likelihood for a feature is identified as:

  • The number of occurrences k of this feature for N > k observations, in the case of binary features or counters
  • The mean value of all the observations of this feature, in the case of numeric or continuous features

For the sake of this test case, it is assumed that the features (the technical analysis indicators price, volume, and volatility) are conditionally independent. This assumption is not actually correct.

Note

Conditional dependency

Recent models, known as Hidden Naïve Bayes (HNB), relax the restrictions on the independence between features. The HNB algorithm uses conditional mutual information to describe the interdependency between some of the features [5:8].

Let's write the code to train the multinomial Naïve Bayes. The first step is to define the likelihood for each feature using historical data. The Likelihood class has the following attributes:

  • The label of the observed class, label
  • An array of tuples of the Laplace- or Lidstone-smoothed mean and the standard deviation, muSigma
  • The class prior, prior, that computes p(c)

As with any code snippet presented in this book, the validation of class parameters and method arguments are omitted in order to keep the code readable. The Likelihood class is defined as follows:

type Density = (Double*) => Double //1
type XYTSeries = Array[(Double, Double)]
val MINLOGARG = 1e-32   // floor on a probability to avoid log(0)

class Likelihood[T <% Double](val label: Int, val muSigma: XYTSeries, prior: Double) { //2
  def score(obs: Array[T], density: Density): Double =
    (obs, muSigma).zipped
      .foldLeft(0.0)((post, xms) => {
        val mean = xms._2._1
        val stdDev = xms._2._2
        val _obs = xms._1
        val prob = density(mean, stdDev, _obs)
        // Clamp the probability before taking the log to avoid -Infinity
        post + Math.log(if (prob < MINLOGARG) MINLOGARG else prob)
      }) + Math.log(prior) //3
}

The functions of the Density type compute the probability density for the values of a feature (line 1). The method takes a variable number of arguments: the mean, the standard deviation, and the input value for the Gaussian distribution, or the mean and the input value {0, 1} for the Bernoulli distribution. The default probability density function is the normal distribution implemented by Stats.gauss.
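
Since Likelihood.score always invokes the density with the mean, the standard deviation, and the observed value, a Bernoulli density for binary features can be written with the same variable-argument signature. The following is a minimal sketch under that assumption (it is not the book's implementation):

// Minimal sketch of a Bernoulli density compatible with the Density type.
// x(0): smoothed mean (probability of success), x(1): standard deviation
// (ignored for Bernoulli), x(2): observed value, expected to be 0 or 1.
object BernoulliDensity {
  def bernoulli(x: Double*): Double =
    if (x(2) > 0.5) x(0) else 1.0 - x(0)
}
// It can be passed wherever a Density is expected, for example:
// val score = likelihood.score(obs, BernoulliDensity.bernoulli)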

The parameterized, view-bounded class, Likelihood, has two purposes:

  • Define the model extracted from training (the likelihood for each feature and the class prior) in the constructor (line 2)
  • Compute the score of a new observation as part of the classification process, score (line 3). The computation of the log of the likelihood uses a density method of the Density type, passed as an argument of the score method. As seen in the next section, the density can be either a Gaussian or a Bernoulli distribution. The score method uses Scala's zipped method to merge the observation values with the model parameters, muSigma.

The next step is to define the BinNaiveBayesModel model for a two-class classification scheme. The two-class model consists of two Likelihood instances: positives for the label UP (value == 1) and negatives for the label DOWN (value == 0). In order to make the model generic, we created NaiveBayesModel, an abstract class that can be extended as needed to support both the binomial and multinomial Naïve Bayes models, as follows:

abstract class NaiveBayesModel[T <% Double](density: Density) {
  def classify(x: Array[T]): Int
}

class BinNaiveBayesModel[T <% Double](positives: Likelihood[T], negatives: Likelihood[T], density: Density) extends NaiveBayesModel[T](density) {
  override def classify(x: Array[T]): Int =
    if (positives.score(x, density) > negatives.score(x, density)) 1
    else 0
}

The classification is executed by the classify method called by the |> operator in the Naïve Bayes classifier. It returns 1 for the class containing the positive cases and 0 for the negative.

Tip

Model validation

The parameters of the Naïve Bayes model (likelihood) are computed through training and the model value is instantiated regardless of whether the model is actually validated in this example. A commercial application would require the model to be validated using a methodology such as the K-fold validation and F1 measure. (Refer to the Design template for classifiers section in Appendix A, Basic Concepts.)

The multinomial Naïve Bayes model, defined by the MultiNaiveBayesModel class is very similar to the BinNaiveBayesModel class:

class MultiNaiveBayesModel[T <% Double](likelihoodXs: List[Likelihood[T]], density: Density) extends NaiveBayesModel[T](density) {
  override def classify(x: Array[T]): Int =
    likelihoodXs.sortWith((p1, p2) => p1.score(x, density) > p2.score(x, density))
                .head.label
}

The multinomial Naïve Bayes model differs from the binomial version in the following ways:

  • The likelihood is defined as a list, likelihoodXs (one likelihood per class)
  • The runtime classification sorts the class by the log likelihood (sortWith), selects the class with the highest score, and returns the class ID

Finally, the Naïve Bayes classifier is implemented by the NaiveBayes class. It implements the training and runtime classification using the Naïve Bayes formula. Any supervised learning model needs to be validated. In order to force the developer to define a validation for any new supervised learning technique, the class inherits from the Supervised trait that declares the validation method, validate:

trait Supervised[T] {
  def validate(xt: XTSeries[(Array[T],Int)], tpClass:Int): Double
}

The validate method takes a labeled time series xt, as an array of tuples (observation, class label), and the tpClass index of the class that contains the true positive outcomes (that is, an increase in the stock price). The method returns the F1 measure.
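
For reference, the F1 measure is the harmonic mean of precision and recall. A minimal, standalone sketch of its computation from predicted and expected labels (a hypothetical helper, not the book's validate implementation) could look as follows:

// Minimal sketch of the F1 measure for a classifier (hypothetical helper,
// not the validate method declared by the Supervised trait).
object F1Measure {
  def f1(predicted: Array[Int], expected: Array[Int], tpClass: Int): Double = {
    val pairs = predicted.zip(expected)
    val tp = pairs.count { case (p, e) => p == tpClass && e == tpClass }
    val fp = pairs.count { case (p, e) => p == tpClass && e != tpClass }
    val fn = pairs.count { case (p, e) => p != tpClass && e == tpClass }
    val precision = if (tp + fp > 0) tp.toDouble/(tp + fp) else 0.0
    val recall = if (tp + fn > 0) tp.toDouble/(tp + fn) else 0.0
    if (precision + recall > 0) 2.0*precision*recall/(precision + recall) else 0.0
  }
}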

Besides inheriting the Supervised trait, the NaiveBayes class inherits the PipeOperator trait so that it can be integrated into a generic workflow as one of the computation units.

The attributes of the multinomial Naïve Bayes are as follows:

  • The smoothing formula (Laplace, Lidstone, and so on): smoothing
  • The labeled training set defined as a time series: xt
  • The probability density function: density

The NaiveBayes class is defined as follows:

class NaiveBayes[T <% Double](smoothing: Double, xt: XTSeries[(Array[T], Int)], density: Density) extends PipeOperator[XTSeries[Array[T]], Array[Int]] with Supervised[T] {

  val model = BinNaiveBayesModel[T](train(1), train(0), density) //1

  def train(label: Int)(implicit f: Array[T] => DblVector): Likelihood[T] = { //2
    val xi = xt.toArray
    val values = xi.filter(_._2 == label).map(x => f(x._1))
    val dim = xi(0)._1.size
    val vt = XTSeries[DblVector](values.toArray) //3
    val muStdDev = statistics(vt).map(stat =>
      (stat.lidstoneMean(smoothing, dim), stat.stdDev))
    Likelihood(label, muStdDev, values.size.toDouble/xi.size) //4
  }
  …

The classifier uses the binomial Naïve Bayes model, BinNaiveBayesModel (line 1). The training process is implemented in the constructor by invoking the private train method (line 2). The method relies on an implicit conversion, f: Array[T] => DblVector, because of the Array type erasure. Performing the training during instantiation hides the details of the model and its training from the client code; we cannot assume that the user of the model is the same person as its creator.

Tip

Training and class instantiation

There are several benefits to instantiating the Naïve Bayes model only once, when it is trained. It prevents the client code from invoking the algorithm on an untrained or partially trained model, and it reduces the number of states of the model (untrained and trained). It is an elegant way to hide the details of the training of the model from the user.

The train method takes the labeled observations (observations and labels) as input. The vt time series is extracted (line 3) and the likelihoods are computed by counting the positive and negative labels and computing the mean, corrected with the Lidstone smoothing formula (line 4). The lidstoneMean method and the standard deviation, stdDev, use the statistics method of the XTSeries singleton.

The NaiveBayes class also defines the runtime classification method |> and the F1 validation method. Both methods are described in the next section.

Note

Handling missing data

Naïve Bayes has a no-nonsense approach to handling missing data: you just ignore the attributes in the observations for which the value is missing. In this case, the likelihood for this particular attribute is not computed for these observations. This workaround is obviously made possible by the conditional independence between features.
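
As an illustration only, and assuming missing values are encoded as Double.NaN (an assumption, not a convention from the book), the scoring loop would simply skip those attributes when accumulating the log-likelihood:

// Minimal sketch: skip attributes with missing (NaN-encoded) values when
// summing the log-likelihoods; conditional independence makes the remaining
// terms valid on their own. Illustrative only, not the book's Likelihood class.
def scoreIgnoringMissing(obs: Array[Double],
                         muSigma: Array[(Double, Double)],
                         prior: Double): Double = {
  def gauss(mu: Double, sigma: Double, x: Double): Double = {
    val y = x - mu
    Math.exp(-0.5*y*y/(sigma*sigma))/(sigma*Math.sqrt(2.0*Math.PI))
  }
  obs.zip(muSigma)
     .filterNot { case (x, _) => x.isNaN }   // ignore missing attributes
     .foldLeft(Math.log(prior)) { case (sum, (x, (mu, sigma))) =>
        sum + Math.log(gauss(mu, sigma, x))
     }
}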

Classification

The likelihood and the class prior that have been computed through training are used for validating the model and classifying new observations.

The score represents the log of the likelihood estimate (or the posterior probability), which is computed as the sum of the logs of the Gaussian distribution of each feature, using the mean and standard deviation extracted from the training phase, plus the log of the class prior.

The Naïve Bayes classification using Gaussian distribution is illustrated using two classes, C1 and C2, and a model with two features (x and y):

[Figure: Illustration of the Gaussian Naïve Bayes using a 2-dimensional model]

The Gaussian mixture is particularly suited for modeling datasets for which the features have large sets of discrete values or are continuous variables. The conditional probabilities for a feature x are described by the normal probability density function [5:9].

Note

Naïve Bayes classification using Gaussian density

For a Lidstone or Laplace smoothed mean µ' and a standard deviation σ, the log likelihood of a posterior probability is defined as:

log p(Cj|x) = Σ(i) [ -(xi - µ')²/(2σ²) - log(σ √(2π)) ] + log p(Cj)

In this example, we used the Gaussian distribution as our probability density function as defined in the Stats object, which was introduced in Chapter 2, Hello World!. The implementation of the computation of the Gaussian probability density is quite simple, shown as follows:

object Stats {
  final val INV_SQRT_2PI = 1.0/Math.sqrt(2.0*Math.PI)

  def gauss(mu: Double, sigma: Double, x: Double): Double = {
    val y = x - mu
    INV_SQRT_2PI/sigma * Math.exp(-0.5*y*y/(sigma*sigma))
  }
  def gauss(x: Double*): Double = gauss(x(0), x(1), x(2))
  …
}

The second version of the Gaussian density is required to handle the Density type, (Double*) => Double.
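
As a quick sanity check, the density can be invoked either directly or through the variable-argument version used as a Density; the expected values below are the standard results for the unit normal distribution:

// Standard normal density at the mean and one standard deviation away.
val atMean = Stats.gauss(0.0, 1.0, 0.0)              // ~0.3989
val atOneSigma = Stats.gauss(0.0, 1.0, 1.0)          // ~0.2420
// The variable-argument version matches the Density type used by Likelihood.score:
val viaDensity = Stats.gauss(Seq(0.0, 1.0, 1.0): _*) // ~0.2420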

Finally, the classification method is implemented as the pipe operator |> of the NaiveBayes class. The classification model and the density function are provided at runtime as attributes of the class:

def |> : PartialFunction[XTSeries[Array[T]], Array[Int]] = {
  case xt: XTSeries[Array[T]] if (xt != null && xt.size > 0 && model != None) =>
    xt.toArray.map(model.classify(_))
}

Labeling

The most critical element in the training of a supervised learning algorithm is the creation of labeled data. Fortunately, in this case, the label (or expected class) can be automatically generated. The objective is to predict the direction of the price of a stock for the next trading day, taking into account the average price, volume, and volatility over the last n days.

The first step is to extract the average price, volume, and volatility for each stock over the period between Jan 1, 2000 and Dec 31, 2014, using daily and weekly closing prices. Let's use the simple moving average to compute these averages over the [t-n, t] window.

First, the extractor function extracts the closing, high, and low prices, and volume for each trading day, using the toDouble and % operators described in the Data extraction and Data sources section in Appendix A, Basic Concepts, as follows:

val extractor = toDouble(CLOSE)   // stock closing price
             :: ratio(HIGH, LOW)  // volatility (HIGH-LOW)/HIGH
             :: toDouble(VOLUME)  // daily stock trading volume
             :: List[Array[String] => Double]()

Secondly, the data source extractor outputs the three statistics for each stock (line 1), for which the average over a window period is computed (line 3) using a simple moving average, mv (line 2):

val xs = DataSource(symbol, path, true) |> extractor //1
val mv = SimpleMovingAverage(period) //2

val ratios = xs.map(x => { //3
  val xt = mv get x.toArray
  val zValues = x.drop(period).zip(xt.drop(period))
  zValues.map(z => if(z._1 > z._2) 1 else 0).toArray //4
})
var prev = xs(0)(period)
val label = xs(0).drop(period+1).map(x => { //5
  val y = if(x > prev) 1 else 0
  prev = x; y
}).toArray
ratios.transpose.take(label.size).zip(label) //6

Scala's drop method is used to shift the time series in order to compute the average of the three variables: price, toDouble(CLOSE); volume, toDouble(VOLUME); and volatility, ratio(HIGH, LOW) (line 4). The labeled data, the direction of the price action for the next trading day, is added to the three ratios (line 5). Finally, the array is transposed to extract the list of tuples (the list of UP/DOWN values for each feature, paired with the price direction for the next trading day, that is, the labeled data) (line 6).

The labeled data extracted from the input CSV file is used in the training and validation of the time series using the Naïve Bayes classifier:

val trainValidRatio = 0.8
val period = 10

val labels = XTSeries[(Array[Int], Int)](input.map(x =>
                     (x._1.toArray, x._2)).toArray) //7
val numObsToTrain = (trainValidRatio*labels.size).floor.toInt //8
val nb = NaiveBayes[Int](labels.take(numObsToTrain)) //9
validate(labels.drop(numObsToTrain+1), nb) //10

The original labeled dataset, labels, is split between training and validation labeled data (line 7) using the trainValidRatio ratio (line 8). The NaiveBayes constructor initializes the model through training (line 9). Finally, the validate method returns the F1 measure for the validation test (line 10).

Results

The next chart plots the value of the F1 measure of the predictor of the direction of the IBM stock using price, volume, and volatility over the previous n trading days, with n varying from 1 to 12 trading days:

[Figure: A graph of the F1 measure for the validation of the Naïve Bayes model]

The preceding chart illustrates the impact of the value of the averaging period (number of trading days) on the quality of the multinomial Naïve Bayesian prediction, using the value of stock price, volatility, and volume relative to their average over the averaging period.

From this experiment, we conclude that:

  • The prediction of the stock movement using the average price, volume, and volatility is not very good. The F1 measure for the models using weekly (respectively, daily) closing prices varies between 0.68 and 0.74 (respectively, between 0.56 and 0.66).
  • The prediction using weekly closing prices is more accurate than the prediction using daily closing prices. In this particular example, the distribution of the weekly closing prices is more reflective of an intermediate-term trend than the distribution of daily prices.
  • The prediction is somewhat independent of the period used to average the features.