This conditional independence between the features X is an essential requirement of the Naïve Bayes classifier; it also restricts its applicability. The Naïve Bayes classification is best understood through simple, concrete examples [5:5].
Let's consider the problem of predicting the change in interest rates. The first step is to list the factors that may trigger or cause an increase or decrease in interest rates. For the sake of illustrating Naïve Bayes, we select the consumer price index (CPI), the change in the federal funds rate (FDF), and the gross domestic product (GDP) as a first set of features. The terminology is described in the Terminology section under Finances 101 in Appendix A, Basic Concepts.
The use case is to predict direction of the change in the yield of the 1-year Treasury bill (1yTB), taking into account the change in the current CPI, FDF, and GDP. The objective is, therefore, to create a predictive model using a combination of these three features.
It is assumed that no financial investment expert is available to supply rules or policies for predicting interest rates. Therefore, the model depends heavily on historical data. Intuitively, if one feature consistently increases whenever the yield of the 1-year Treasury bill increases, then we can conclude that there is a strong correlation or causal relationship between the features and the output variation in interest rates.
The correlation (or cause-effect relationship) is derived from historical data. The methodology consists of counting the number of times each feature either increases (UP) or decreases (DOWN), and recording the corresponding output (or labeled data), as illustrated in the following table:
| ID  | GDP  | FDF  | CPI  | 1yTB |
|-----|------|------|------|------|
| 1   | UP   | DOWN | UP   | UP   |
| 2   | UP   | UP   | UP   | UP   |
| 3   | DOWN | UP   | DOWN | DOWN |
| 4   | UP   | DOWN | DOWN | DOWN |
| …   | …    | …    | …    | …    |
| 256 | DOWN | DOWN | UP   | DOWN |
First, let's tabulate the number of occurrences of each change {UP, DOWN} for the three features and the output value (the 1-year Treasury bill):
| Number   | GDP  | FDF  | CPI  | 1yTB  |
|----------|------|------|------|-------|
| UP       | 169  | 184  | 175  | 159   |
| DOWN     | 87   | 72   | 81   | 97    |
| Total    | 256  | 256  | 256  | 256   |
| UP/Total | 0.66 | 0.72 | 0.68 | 0.625 |
Next, let's compute the number of positive directions for each of the features when the yield of the 1-year Treasury bill increases (159 occurrences):
| Number   | GDP  | FDF  | CPI  |
|----------|------|------|------|
| UP       | 110  | 136  | 127  |
| DOWN     | 49   | 23   | 32   |
| Total    | 159  | 159  | 159  |
| UP/Total | 0.69 | 0.85 | 0.80 |
From this table, we conclude that the yield of the 1-year Treasury bill increases when the GDP increases (69 percent of the time), when the federal funds rate increases (85 percent of the time), and when the CPI increases (80 percent of the time).
Let's formalize the Naïve Bayes model before turning these findings into a probabilistic model.
Let's start by clarifying the terminology used in the Bayesian model:

- Class prior probability, p(Cj): The probability of the class Cj before any observation is taken into account
- Likelihood, p(x|Cj): The probability of an observation x given the class Cj
- Evidence, p(x): The probability of the observation x occurring, regardless of the class
- Posterior probability, p(Cj|x): The probability that the observation x belongs to the class Cj
No model can be simpler! The log-likelihood, log(p(x|C)), is commonly used instead of the likelihood, p(x|C) (the probability of an observation given a class), in order to reduce the impact of features with a very low likelihood and to avoid numerical underflow when many small probabilities are multiplied.

The objective of the Naïve Bayes classification of a new observation is to compute the class with the highest log-likelihood. The mathematical notation for the Naïve Bayes model is also straightforward.
The posterior probability of a class Cj given an observation x = {x1, x2, …, xn}:

$$p(C_j \mid x) = \frac{p(x \mid C_j)\,p(C_j)}{p(x)}$$

The posterior probability, assuming the conditional independence of the features:

$$p(C_j \mid x) \propto p(C_j)\prod_{i=1}^{n} p(x_i \mid C_j)$$

The log-likelihood:

$$\log p(x \mid C_j) = \sum_{i=1}^{n} \log p(x_i \mid C_j)$$

The Naïve Bayes classification:

$$\hat{c} = \arg\max_{C_j}\left(\log p(C_j) + \sum_{i=1}^{n} \log p(x_i \mid C_j)\right)$$
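To see why the sum of the logs is preferred over the product of the raw conditional probabilities, consider the following minimal sketch (illustrative only, not part of the chapter's source code):

```scala
// Multiplying many small conditional probabilities underflows to zero,
// while summing their logarithms remains numerically stable and comparable.
val probs = Array.fill(1000)(1e-4)        // 1,000 features, each with likelihood 1e-4

val product = probs.product               // underflows to 0.0
val sumLogs = probs.map(Math.log).sum     // ≈ -9210.3, still usable for comparison
```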
This particular use case has a major drawback—the GDP statistics are provided quarterly, while the CPI data is made available once a month and a change in the FDF rate is rather infrequent.
The ability to compute the posterior probability depends on the formulation of the likelihood using historical data. A simple solution is to count the occurrences of observations for each class and compute the frequency.
Let's consider the first example that predicts the direction of change in the yield of the 1-year Treasury bill given changes in the GDP, FDF, and CPI.
The results are expressed with simple probabilistic formulas and a directed graphical model:
$$p(\text{GDP=UP} \mid \text{1yTB=UP}) = 110/159 \approx 0.69$$

$$p(\text{1yTB=UP}) = \frac{\text{number of occurrences of (1yTB=UP)}}{\text{total number of observations}} = 159/256$$

$$p(\text{1yTB=UP} \mid \text{GDP=UP},\ \text{FDF=UP},\ \text{CPI=UP}) \propto p(\text{GDP=UP} \mid \text{1yTB=UP}) \times p(\text{FDF=UP} \mid \text{1yTB=UP}) \times p(\text{CPI=UP} \mid \text{1yTB=UP}) \times p(\text{1yTB=UP}) = 0.69 \times 0.85 \times 0.80 \times 0.625$$
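Carrying out the multiplication gives the unnormalized score of the UP class; the analogous product for the DOWN class (using its own conditional frequencies) would have to be computed and compared in order to select the more likely outcome:

$$0.69 \times 0.85 \times 0.80 \times 0.625 \approx 0.293$$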
This problem is not a good candidate for a Bayesian classification for two reasons:
Let's select another use case for which a large historical data set is available and can be automatically labeled.
The second use case consists of predicting the direction of the closing price of a stock, pr(t+1) = {UP, DOWN}, at trading day t+1, given the history of its price, volume, and volatility over the previous t days, pr(i), i=1..t. The features volume and volatility have already been used in the Creating a model (learning) section under Let's kick the tires in Chapter 1, Getting Started.
Therefore, the three features under consideration are:

- The closing price of the stock relative to its average over the previous n trading days
- The trading volume of the stock relative to its average over the previous n trading days
- The volatility of the stock relative to its average over the previous n trading days
The directed graphical model can be expressed using one output variable (the price at session t+1 is greater than the price at session t) and three features: the price condition (1), the volume condition (2), and the volatility condition (3).

This model works under the assumption that there is at least one observation, and ideally a few observations, for each feature and for each labeled output.

It is possible that the training set does not contain any data actually observed for a feature for a specific label or class. In this case, the observed frequency is 0/N = 0 and, therefore, the likelihood is null, making classification infeasible. The case for which there are only a few observations for a feature in a given class is also an issue, as it skews the likelihood.
There are a couple of correcting or smoothing formulas that address the issue of unobserved features or features with a low number of occurrences, such as the Laplace and Lidstone smoothing formulas.
The two formulas are commonly used in natural language processing applications, for which occurrence of a specific word or tag is a feature [5:7].
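As a rough illustration (this helper is not part of the book's library), the Lidstone formula adds a constant alpha to each count so that an unobserved feature value no longer produces a zero likelihood; Laplace smoothing is the special case alpha = 1:

```scala
// A sketch of Lidstone smoothing, assuming raw counts per feature value.
// alpha = 1.0 corresponds to Laplace smoothing.
def lidstone(count: Int, total: Int, numValues: Int, alpha: Double): Double =
  (count + alpha) / (total + alpha * numValues)

// A value never observed for a class no longer yields a zero likelihood:
val p = lidstone(0, 159, 2, 1.0)   // (0 + 1) / (159 + 1.0 * 2) ≈ 0.0062
```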
I think it is time to write some Scala code and toy around with Naïve Bayes. Let's start with an overview of the software components.
Our implementation of the Naïve Bayes classifier uses the following components:
- A model, NaiveBayesModel, of the type Model, which is initialized through training during the instantiation of the class.
- A model for the two-class classification, BinNaiveBayesModel, which subclasses NaiveBayesModel. The model consists of a density function of the type Density, and a pair of positive and negative Likelihood instances.
- A model for the multiclass classification, MultiNaiveBayesModel.
- The classification of new observations, implemented through the PipeOperator trait.
- The NaiveBayes classifier class, which has two parameters: a smoothing function such as Laplace, and a labeled training set of the XTSeries type.

The principle of software architecture applied to the implementation of classifiers is described in the Design template for classifiers section in Appendix A, Basic Concepts.
The key software components of the Naïve Bayes classifier are described in the following UML class diagram:
The objective of the training phase is to build a model consisting of the likelihood for each feature and the class prior. The likelihood of a feature is characterized by the mean and the standard deviation of its values within each class; these parameters feed the probability density function used at classification time.
It is assumed, for the sake of this test case, that the features (the technical analysis indicators for price, volume, and volatility) are conditionally independent. This assumption is not actually correct.
Let's write the code to train the multinomial Naïve Bayes. The first step is to define the likelihood for each feature using historical data. The Likelihood class has the following attributes:

- label: The label of the class
- muSigma: The array of (mean, standard deviation) pairs, one pair per feature
- prior: The class prior that computes p(c)

As with any code snippet presented in this book, the validation of class parameters and method arguments is omitted in order to keep the code readable. The Likelihood class is defined as follows:
```scala
type Density = (Double*) => Double                 //1
type XYTSeries = Array[(Double, Double)]

val MINLOGARG = 1e-32
val MINLOGVALUE = -MINLOGARG

class Likelihood[T <% Double](val label: Int, val muSigma: XYTSeries, prior: Double) {  //2

  def score(obs: Array[T], density: Density): Double =
    (obs, muSigma).zipped
      .foldLeft(0.0)((post, xms) => {
        val mean = xms._2._1
        val stdDev = xms._2._2
        val _obs = xms._1
        val prob = density(mean, stdDev, _obs)
        // Cap the contribution of features with a negligible likelihood
        post + (if (prob < MINLOGARG) MINLOGVALUE else Math.log(prob))
      }) + Math.log(prior)                         //3
}
```
The functions of the Density type compute the probability density for the values of a feature (line 1). The function takes a variable number of arguments: the mean, the standard deviation, and the input value for the Gaussian distribution, or the mean and the input value {0, 1} for the Bernoulli distribution. The default probability density function is the normal distribution implemented by Stats.gauss.
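As a quick, hypothetical usage of the Likelihood class (the mean and standard deviation pairs and the prior below are made-up values, not parameters obtained from training):

```scala
// Hypothetical parameters for the UP class: one (mean, stdDev) pair per feature.
val upLikelihood = new Likelihood[Double](
  label = 1,
  muSigma = Array((1.02, 0.05), (0.98, 0.11), (1.05, 0.08)),
  prior = 0.62                                       // assumed p(C = UP)
)
val obs = Array(1.04, 1.01, 0.97)                    // one value per feature
val logScore = upLikelihood.score(obs, Stats.gauss)  // log-likelihood + log(prior) for the UP class
```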
The parameterized, view-bounded class, Likelihood, has two purposes:

- Define the model extracted from training: the mean and standard deviation of each feature for the class, and the class prior (line 2)
- Compute the score of a new observation as part of the classification process (line 3)

The computation of the log of the likelihood uses a density method of the type Density, which is an argument of the score method. As seen in the next section, the density can be either a Gaussian or a Bernoulli distribution. The score method uses Scala's zipped method to merge the observation values with the model parameters (the pairs of mean and standard deviation).

The next step is to define the BinNaiveBayesModel
model for a two-class classification scheme. The two-class model consists of the two Likelihood
instances: positives for the label UP (value==1
) and negatives for the label DOWN (value== 0
). In order to make the model generic, we created NaiveBayesModel
, an abstract class that can be extended as needed to support both the Binomial and Multinomial Naïve Bayes models, as follows:
```scala
abstract class NaiveBayesModel[T <% Double](density: Density) {
  def classify(x: Array[T]): Int
}

class BinNaiveBayesModel[T <% Double](positives: Likelihood[T], negatives: Likelihood[T], density: Density) extends NaiveBayesModel[T](density) {

  override def classify(x: Array[T]): Int =
    if (positives.score(x, density) > negatives.score(x, density)) 1 else 0
}
```
The classification is executed by the classify
method called by the |>
operator in the Naïve Bayes classifier. It returns 1
for the class containing the positive cases and 0
for the negative.
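A hypothetical usage of BinNaiveBayesModel, with two illustrative (untrained) likelihoods, one per class, and the Gaussian density implemented later in this section by the Stats object:

```scala
// Illustrative likelihoods for the UP (1) and DOWN (0) classes; the values are made up.
val positives = new Likelihood[Double](1, Array((1.02, 0.05), (1.01, 0.04)), 0.62)
val negatives = new Likelihood[Double](0, Array((0.97, 0.06), (0.99, 0.05)), 0.38)

val binModel = new BinNaiveBayesModel[Double](positives, negatives, Stats.gauss)
val predicted = binModel.classify(Array(1.03, 1.00))   // 1 => UP, 0 => DOWN
```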
Model validation
The parameters of the Naïve Bayes model (the likelihoods) are computed through training, and the model value is instantiated regardless of whether the model is actually validated in this example. A commercial application would require the model to be validated using a methodology such as K-fold validation and the F1 measure. (Refer to the Design template for classifiers section in Appendix A, Basic Concepts.)
The multinomial Naïve Bayes model, defined by the MultiNaiveBayesModel
class is very similar to the BinNaiveBayesModel
class:
```scala
class MultiNaiveBayesModel[T <% Double](likelihoodXs: List[Likelihood[T]], density: Density) extends NaiveBayesModel[T](density) {

  override def classify(x: Array[T]): Int =
    likelihoodXs.sortWith((p1, p2) => p1.score(x, density) > p2.score(x, density)).head.label
}
```
The multinomial Naïve Bayes model differs from the binomial version in the following ways:
- The likelihood is defined as a list, likelihoodXs, with one likelihood per class
- The runtime classification sorts the classes by their log-likelihood (the sortWith method), selects the class with the highest score, and returns its class ID

Finally, the Naïve Bayes classifier is implemented by the NaiveBayes
class. It implements the training and runtime classification using the Naïve Bayes formula. Any supervised learning model needs to be validated. In order to force the developer to define a validation for any new supervised learning technique, the class inherits from the Supervised
trait that declares the validation method, validate
:
```scala
trait Supervised[T] {
  def validate(xt: XTSeries[(Array[T], Int)], tpClass: Int): Double
}
```
The validate
method takes a labeled time series xt
as an array of tuples (observation, class label) and the tpClass
index, which identifies the class of the true positive outcomes (that is, an increase in the stock price). The method returns the F1 measure.
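A minimal sketch of how the F1 measure can be computed for a binary outcome (this is not the library's implementation, just an illustration of the metric returned by validate):

```scala
// F1 = harmonic mean of precision and recall for the class identified by tpClass.
def f1(predicted: Array[Int], expected: Array[Int], tpClass: Int): Double = {
  val pairs = predicted.zip(expected)
  val tp = pairs.count { case (p, e) => p == tpClass && e == tpClass }   // true positives
  val fp = pairs.count { case (p, e) => p == tpClass && e != tpClass }   // false positives
  val fn = pairs.count { case (p, e) => p != tpClass && e == tpClass }   // false negatives

  val precision = if (tp + fp > 0) tp.toDouble/(tp + fp) else 0.0
  val recall    = if (tp + fn > 0) tp.toDouble/(tp + fn) else 0.0
  if (precision + recall > 0) 2.0*precision*recall/(precision + recall) else 0.0
}
```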
Besides inheriting the Supervised
trait, the NaiveBayes
class inherits the PipeOperator
trait so that it can be integrated into a generic workflow as one of the computation units.
The attributes of the multinomial Naïve Bayes are as follows:
- smoothing: The smoothing factor for the Lidstone (or Laplace) smoothing formula
- xt: The labeled training set, defined as a time series of (observation, class label) tuples
- density: The probability density function used to compute the likelihood (Gaussian by default)

The NaiveBayes class is defined as follows:
```scala
class NaiveBayes[T <% Double](smoothing: Double, xt: XTSeries[(Array[T], Int)], density: Density) extends PipeOperator[XTSeries[Array[T]], Array[Int]] with Supervised[T] {

  val model = BinNaiveBayesModel[T](train(1), train(0), density)   //1

  def train(label: Int)(implicit f: Array[T] => DblVector): Likelihood[T] = {   //2
    val xi = xt.toArray
    val values = xi.filter(_._2 == label).map(x => f(x._1))
    val dim = xi(0)._1.size

    val vt = XTSeries[DblVector](values.toArray)   //3
    val muStdDev = statistics(vt).map(stat =>
      (stat.lidstoneMean(smoothing, dim), stat.stdDev))
    Likelihood(label, muStdDev, values.size.toDouble/xi.size)   //4
  }
  …
```
The classifier uses the binomial Naïve Bayes model, BinNaiveBayesModel
(line 1
). The training process is implemented in the constructor by invoking the private train
method (line 2
). The method relies on an implicit conversion, f: Array[T] => DblVector
, because of the Array
type erasure. The main reason for this is to hide the details of the model and its training from the client code. We cannot assume that the user of the model is the same person as the creator of the model.
Training and class instantiation
There are several benefits of allowing the instantiation of the Naïve Bayes model only once, when it is trained. It prevents the client code from invoking the algorithm on an untrained or partially trained model, and it reduces the number of states of the model (untrained and trained). It is an elegant way to hide the details of the training of the model from the user.
The train
method takes the labeled observations (observations and their labels) as input. The vt time series is extracted (line 3) and the likelihoods are calculated by counting the positive and negative labels and computing the mean of each feature, corrected with the Lidstone smoothing formula (line 4). The lidstoneMean method and the standard deviation, stdDev, rely on the statistics method of the XTSeries singleton.
The NaiveBayes class also defines the runtime classification method |>
and the F1-validation methods. Both methods are described in the next section.
Handling missing data
Naïve Bayes has a no-nonsense approach to handling missing data. You just ignore the attribute in the observations for which the value is missing. In this case, the likelihood of this particular attribute is simply not included in the score for these observations. This workaround is obviously made possible because of the conditional independence between features.
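The following sketch illustrates the idea (it is not the book's implementation): features encoded as Double.NaN are skipped, so they contribute nothing to the log-likelihood sum:

```scala
// Score a single observation, ignoring missing (NaN) feature values.
def scoreIgnoringMissing(obs: Array[Double], muSigma: Array[(Double, Double)], prior: Double): Double =
  obs.zip(muSigma).foldLeft(Math.log(prior)) { case (sum, (x, (mu, sigma))) =>
    if (x.isNaN) sum                                    // missing value: skip this feature
    else sum + Math.log(Stats.gauss(mu, sigma, x))
  }
```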
The likelihood and class prior that have been computed through training are used for validating the model and classifying new observations.

The score represents the log-likelihood estimate (which is proportional to the log of the posterior probability), computed as the sum of the log of the Gaussian density, using the mean and standard deviation extracted from the training phase, and the log of the class prior stored in the Likelihood class.
The Naïve Bayes classification using Gaussian distribution is illustrated using two classes, C1
and C2
, and a model with two features (x
and y
):
The Gaussian distribution is particularly suited to modeling datasets for which the features have large sets of discrete values or are continuous variables. The conditional probability of a feature x is described by the normal probability density function [5:9].
In this example, we used the Gaussian distribution as our probability density function as defined in the Stats
object, which was introduced in Chapter 2, Hello World!. The implementation of the computation of the Gaussian probability density is quite simple, shown as follows:
```scala
object Stats {
  final val INV_SQRT_2PI = 1.0/Math.sqrt(2.0*Math.PI)

  // Gaussian density with mean mu and standard deviation sigma
  def gauss(mu: Double, sigma: Double, x: Double): Double = {
    val y = x - mu
    INV_SQRT_2PI/sigma * Math.exp(-0.5*y*y/(sigma*sigma))
  }

  def gauss(x: Double*): Double = gauss(x(0), x(1), x(2))
  …
}
```
The second version of the Gaussian density is required to conform to the Density type, (Double*) => Double, defined earlier.
Finally, the classification method is implemented as the pipe operator |>
of the NaiveBayes
class. The classification model and the density function are provided at runtime as attributes of the class:
```scala
def |> : PartialFunction[XTSeries[Array[T]], Array[Int]] = {
  case xt: XTSeries[Array[T]] if(xt != null && xt.size > 0 && model != None) =>
    xt.toArray.map(model.classify(_))
}
```
The most critical element in the training of a supervised learning algorithm is the creation of labeled data. Fortunately, in this case, the label (or expected class) can be automatically generated. The objective is to predict the direction of the price of a stock for the next trading day, taking into account the average price, volume, and volatility over the last n days.
The first step is to extract the average price, volume, and volatility for each stock between Jan 1, 2000 and Dec 31, 2014, using daily and weekly closing prices. Let's use the simple moving average to compute these averages over the [t-n, t] window.
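As a reminder of what a simple moving average computes, here is a minimal, self-contained sketch (the book uses its own SimpleMovingAverage class; the values below are arbitrary):

```scala
// Average of each sliding window of n observations.
def sma(xs: Array[Double], n: Int): Array[Double] =
  xs.sliding(n).map(_.sum / n).toArray

sma(Array(10.0, 11.0, 12.0, 13.0, 14.0), 3)   // Array(11.0, 12.0, 13.0)
```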
First, the extractor
function extracts the closing, high, and low prices, and volume for each trading day, using the toDouble
and %
operators described in the Data extraction and Data sources section in Appendix A, Basic Concepts, as follows:
```scala
val extractor = toDouble(CLOSE)        // stock closing price
   :: ratio(HIGH, LOW)                 // volatility (HIGH-LOW)/HIGH
   :: toDouble(VOLUME)                 // daily stock trading volume
   :: List[Array[String] => Double]()
```
Secondly, the data source extractor outputs the three statistics for each stock (line 1), for which the average over a window period is computed (line 3) using a simple moving average, mv (line 2):
```scala
val xs = DataSource(symbol, path, true) |> extractor    //1
val mv = SimpleMovingAverage(period)                    //2

val ratios = xs.map(x => {                              //3
  val xt = mv get x.toArray
  val zValues = x.drop(period).zip(xt.drop(period))
  zValues.map(z => if(z._1 > z._2) 1 else 0).toArray    //4
})

var prev = xs(0)(period)
val label = xs(0).drop(period+1).map(x => {             //5
  val y = if(x > prev) 1 else 0
  prev = x
  y
}).toArray

ratios.transpose.take(label.size).zip(label)            //6
```
Scala's drop
method is used to shift the time series to compute the average of the three variables: price, toDouble(CLOSE)
; volume, toDouble(VOLUME)
; and volatility, ratio(HIGH, LOW)
(line 4
). The labeled data, direction of the price action for the next trading day, is added to the three ratios (line 5
). Finally, the array is transposed to extract the list of tuples (list of UP/DOWN values for each feature and price direction for next trading day/labeled data) (line 6
).
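To make the labeling logic concrete, here is a toy illustration with hypothetical closing prices (not from the actual dataset): the label is 1 when the next day's close is above the previous close, and 0 otherwise:

```scala
// Hypothetical closing prices over five trading days.
val closes = Array(10.0, 10.5, 10.2, 10.8, 11.0)

val prevCloses = closes.init                // closes shifted by one day
val label = closes.tail.zip(prevCloses).map { case (cur, prev) => if (cur > prev) 1 else 0 }
// label = Array(1, 0, 1, 1)
```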
The labeled data extracted from the input
CSV file is used in the training and validation of the time series using the Naïve Bayes classifier:
```scala
val trainValidRatio = 0.8
val period = 10

val labels = XTSeries[(Array[Int], Int)](input.map(x => (x._1.toArray, x._2)).toArray)  //7
val numObsToTrain = (trainValidRatio*labels.size).floor.toInt   //8
val nb = NaiveBayes[Int](labels.take(numObsToTrain))            //9
validate(labels.drop(numObsToTrain+1), nb)                      //10
```
The original labeled dataset, labels
, is split between training and validation labeled data (line 7
) using the trainValidRatio
ratio (line 8
). The NaiveBayes
constructor initializes the model through training (line 9
). Finally, the validate
method returns the F1 measure for the validation test (line 10
).
The next chart plots the value of the F1 measure of the predictor of the direction of the IBM stock using price, volume, and volatility over the previous n trading days, with n varying from 1 to 12 trading days:
The preceding chart illustrates the impact of the value of the averaging period (number of trading days) on the quality of the multinomial Naïve Bayesian prediction, using the value of stock price, volatility, and volume relative to their average over the averaging period.
From this experiment, we conclude that: