Despite its name, the logistic regression is a classifier. As a matter of fact, the logistic regression is one of the most used discriminative learning techniques because of its simplicity and its ability to leverage a large variety of optimization algorithms. The technique is used to quantify the relationship between an observed target variable y and a set of variables x that it depends on. Once the model is created (trained), it is used to classify real-time data.
A logistic regression can be either binomial (two classes) or multinomial (three and more classes). In a binomial classification, the observed outcome is defined as {true, false}, {0, 1}, or {-1, +1}.
The conditional probability in a linear regression model is a linear function of its weights [6:13]. The logistic regression model addresses the nonlinear regression problem by defining the logarithm of the conditional probability as a linear function of its parameters.
First, let's introduce the logistic function and its derivative, which are defined as follows:
Have a look at the following graph:
The remainder of this section is dedicated to the application of the multivariate logistic regression to a binary classification (two classes).
The logistic regression is popular for several reasons; some are as follows:
Let's consider the classification problem using two classes. As discussed in the Validation section of Chapter 2, Hello World!, even the best classifier produces false positives and false negatives. The training procedure for a binomial classification is illustrated in the following diagram:
The purpose of the training is to compute the hyperplane that separates the observations into two categories or classes. Mathematically speaking, a hyperplane in an n-dimensional space (number of features) is a subspace of n-1 dimensions. The separating hyperplane of a three-dimension space is a curved surface. The separating hyperplane of a two-dimension problem (plane) is a line. In our preceding example, the hyperplane segregates/separates a training set into two very distinct classes (or groups), class 1 and class 2, in an attempt to reduce the overlap (false positive and false negative).
The equation of the hyperplane is defined as the logistic function of the dot product of the regression parameters (or weights) and features.
The logistic function accentuates the difference between the two groups of training observations, separated by the hyperplane. It pushes the observations away from the separating hyperplane towards either of the classes.
In the case of two classes, c1
and c2
with their respective probabilities, p(C=c1| X=xi|w) = p(xi|w) and p(C=c2 |X= xi|w) = 1- p(xi|w), where w
is the model parameters set or weights in the case of the logistic regression, the following functions can be defined:
Let's implement the logistic regression without a penalty term using the Apache Commons Math library. The library contains several least squares optimizers, allowing you to specify the minimizing algorithm, optimizer
, for the loss function in the logistic regression class LogisticRegression
:
class LogisticRegression[T <% Double](xt: XTSeries[Array[T]], labels: Array[Int], optimizer: LogisticRegressionOptimizer) extends PipeOperator[Array[T], Int]{ val model: Option[RegressionModel] = { … } … }
The parameters of the logistic regression class are the multivariate time series (features) xt
, the target or labeled data, labels
, and the optimizer
algorithm used to minimize the loss function or residual sum of squares. In the case of the binomial logistic regression, labels
are assigned the values of 1 for one class and 0 for the other.
The purpose of the training is to determine the regression coefficient, model._1
, which minimizes the loss function. The residual sum of squares (RSS) is computed as model._2
.
Target values
There is no specific rule to assign the two values to the observed data for the binomial logistic regression: {-1, +1}, {0, 1}, or {false, true}. The values pair {0, 1} is convenient because it allows the developer to reuse the code for multinomial logistic regression using normalized class values.
For convenience, the definition and the configuration of the optimizer are encapsulated in the LogisticRegressionOptimizer
class.
The implementation of the logistic regression uses the following components:
RegressionModel
of the Model
type, which is initialized through training during the instantiation of the classifier. We reuse the RegressionModel
type introduced in the Linear regression section.|>
extending the PipeOperator
trait.LogisticRegression
, has three parameters: the least squares optimizer of the type LogisticRegresssionOptimizer
(used in training), a features set XTSeries
, and a label vector DblVector
.The key software components of the logistic regression are described in the following UML class diagram:
Our implementation of the training of the logistic regression model leverages either the Gauss-Newton or the Levenberg-Marquardt nonlinear least squares optimizers, (refer to the Nonlinear least squares minimization section in Appendix A, Basic Concepts) packaged with the Apache Commons Math library.
The training of the logistic regression is performed by the train
method:
val model: Option[RegressionModel] = train
Handling exceptions from the Apache Commons Math library
The training of the logistic regression using the Apache Commons Math library requires handling ConvergenceException
, DimensionMismatchException
, TooManyEvaluationsException
, TooManyIterationsException
, and MathRuntimeException
. Debugging is greatly facilitated by understanding the context of these exceptions in the Apache library source code.
The implementation of the training method, train
, relies on the following five steps:
The workflow and the Apache Commons Math classes used in the training of the logistic regression are visualized by the following flow diagram:
The first four steps are required by the Apache Commons Math library to initialize the configuration of the logistic regression prior to the minimization of the loss function. Let's start with the configuration of the least squares optimizer.
In this step, you have to specify the algorithm to minimize the residual of the sum of squares. The LogisticRegressionOptimizer
class is responsible for configuring the optimizer. The class has the following two purposes:
LeastSquaresOptimizer
interface defined in the Apache Commons Math libraryConsider the following code:
class LogisticRegressionOptimizer(maxIters: Int, maxEvals: Int,eps: Double, lsOptimizer: LeastSquaresOptimizer){ def optimize(lsProblem: LeastSquaresProblem): Optimum = lsOptimizer.optimize(lsProblem) }}
The configuration of the logistic regression optimizer is defined using the maximum number of iterations (maxIters
), the maximum number of evaluations (maxEval
) for the logistic function and its derivative, the convergence criteria (eps
) on the residual sum of squares, and the instance of the least squares problem (org.apache.commons.math3.fitting.leastsquares.LeastSquaresProblem
).
The next step consists of computing the value of the logistic function and its first order partial derivatives with respect to the weights by overriding the value
method of the fitting.leastsquares.MultivariateJacobianFunction
interface:
final val initWeight = 0.5 val weights0 = Array.fill(xt(0) +1)(initWeight) //1 val lrJacobian = new MultivariateJacobianFunction { override def value(w:RealVector):Pair[RealVector,RealMatrix] ={ val _w = w.toArray val gradient = xt.toArray .map( g => { //2 val expn = g.zip(_w.drop(1)) .foldLeft(_w(0))((s,z) => s + z._1*z._2) val logIt = 1.0/(1.0 + Math.exp(-expn)) //3 (logIt, logIt *(1- logIt)) //4 }) val jacobian = Array.ofDim[Double](xt.size, weights0.size)//5 xt.toArray.zipWithIndex.foreach(xi => { //6 val df: Double = gradient(xi._2)._2 Range(0, xi._1.size).foreach(j => jacobian(xi._2)(j+1) = xi._1(j)*df) jacobian(xi._2)(0) = 1.0 //7 }) (new ArrayRealVector(gradient.map(_._1)), new Array2DRowRealMatrix(jacobian)) //8 } }
The regression weights, weights0
, are initialized with the arbitrary value of 0.5.
The value
method uses the primitives types RealVector
, RealMatrix
, ArrayRealVector
, and Array2DRowRealMatrix
defined in the org.apache.commons.math3.linear
Apache Commons Math package.
It takes the regression weight, w
, and computes the gradient
(line 2
) of the logistic function for each data point and returns the value of logit
(line 3
) and its derivative (line 4
) as a tuple. The Jacobian matrix is created (line 5
), and then initialized with logit
and its derivative (line 6
). The first element of each column of the Jacobian matrix is set to 1.0 to take into account the intercept (line 7
). Finally, the vector of the logit
values for each observation and the Jacobian matrix are returned (line 8
) as a tuple to comply with the return type of the function value.
The third step defines the exit condition for the optimizer. It is accomplished by overriding the converged
method of the parameterized org.apache.commons.math3.optim.ConvergenceChecker
interface:
val exitCheck = new ConvergenceChecker[PointVectorValuePair] { override def converged(iteration: Int, prev: PointVectorValuePair, current:PointVectorValuePair): Boolean = { val delta = prev.getValue .zip(current.getValue) .foldLeft(0.0)((s, z) =>{ val d = z._1 - z._2 s + diff*diff }) Math.sqrt(delta)<optimizer.eps && iteration>=optimizer.maxIters } }
This implementation computes the convergence or exit condition as follows:
delta
, is smaller than the convergence criteria, eps
maxIters
allowedThe Apache Commons Math least squares optimizer package requires all the input to the nonlinear least squares minimizer to be defined as an instance of LeastSquareProblem
generated by the factory LeastSquareBuilder
class:
val builder = new LeastSquaresBuilder val diagWeights0 = Array.fill(xt.size)(1.0) //1 val wMatrix = MatrixUtils.createRealDiagonalMatrix(diagWeights0) val lsp = builder.model(lrJacobian) //2 .weight(wMatrix) .target(labels) //7 .checkerPair(exitCheck) //5 .maxEvaluations(optimizer.maxEvals) //3 .start(weights0) //6 .maxIterations(optimizer.maxIters) //4 .build
The diagonal elements of the weights matrix are initialized to 1.0 (line 1
). Besides the initialization of the model with the Jacobian matrix, lrJacobian
(line 2
), the maximum number of evaluations (line 3
), maximum number of iterations (line 4
), and the exit condition (line 5
) are also initialized.
The regression weights are initialized as 0.5 (weights0
) (line 6
). Finally, the labeled or target values are initialized (line 7
).
The training is executed with a simple call to the least squares minimizer, lsp
:
val optimum = optimizer.optimize(lsp) (optimum.getPoint.toArray, optimum.getRMS)
The regression coefficients (or weights) and the residuals mean square (RMS) are returned by invoking the getPoint
method on the optimum
class of the Apache Commons Math library.
Let's test our implementation of the binomial multivariate logistic regression using the example of the Copper ETF price variation versus volatility and volume, used in the previous two sections. The only difference is that we need to define the target values as 0 if the ETF price decreases between two consecutive trading sessions, and 1 otherwise. Therefore, the deltaPrice
vector used in the linear and ridge regression is to be modified to support the binary outcome:
val deltaPrice = prices.drop(1).zip(prices).dropRight(1)) .map(p => if(p._1>p._2) 1 else 0)
Executing the test case is just a matter of instantiating the LogisticRegression
class with the appropriate configuration parameters. The implementation reuses the code already defined for the least squares and ridge regression to load data from CSV files (src
, price
, volatility
, and volume
) and normalize the observations:
val MAXITERS = 80; val MAXEVALS = 1000; val EPS = 1e-4 val lsOptimizer = LogisticRegressionOptimizer(MAXITERS, MAXEVALS, EPS, new LevenbergMarquardtOptimizer) val xt = XTSeries[DblVector](features) val regression = new LogisticRegression[Double](xt, deltaPrice, lsOptimizer) val rms = regression.rms.get val weights = regression.weights.get
In this example, the Levenberg-Marquardt algorithm is used to minimize the loss function.
Levenberg-Marquardt parameters
The driver code uses the LevenbergMarquardtOptimizer
with the default tuning parameters configuration to keep the implementation simple. However, the algorithm has a few important parameters, such as relative tolerance for cost and matrix inversion, that are worth tuning for commercial applications (refer to the Levenberg-Marquardt section in Appendix A, Basic Concepts).
The execution of the test produces the following results:
The last step is the classification of the real-time data.
As mention earlier and despite its name, the binomial logistic regression is a binary classifier. The classification method is implemented as a data transformation by overriding the pipe operator:
type Feature = Array[T]
final val MARGIN = 0.01
def |> : PartialFunction[Feature, Int] = { //1
case x: Feature if(model!=None && model.get.size-1==x.size) =>{
val w = _model.get.weights
val dot = x.zip(w.drop(1))
.foldLeft(w(0))((s,xw) => s + xw._1*xw._2)//2
if(logit(dot) > 0.5 + MARGIN) 1 else 0 //3
}
}
The classification method, |>
, checks if the number of model parameters (weights) is equal to the number of features plus 1 (line 1
) and throws an exception if the test fails. The dot product of the weights and the features is computed using a fold. Finally, the method returns 1
(class 1, which signifies that the price variation of the ETF is positive) if the value of the sigmoid is greater than 0.5. It returns 0
otherwise (class 2, which signifies that the price variation of the ETF is negative) (line 3
).
Let's apply the classification to the original training set, features
, to validate our model (weights):
val predicted = features.map(x => regression |> x)
The direction of the price variation of the Copper ETF, price(t+1) – price(t)
, is compared to the direction predicted by the logistic regression. The result is plotted with the success
value if the positive or negative direction is correctly classified, otherwise, it is plotted with the failure
value:
The logistic regression was able to classify 78 out of 121 trading sessions (65 percent accuracy).
Now, let's use the logistic regression to predict the positive price variation for the Copper ETF, given its volatility and trading volume. This trading or investment strategy is known as being long on the market. This use case ignores the trading sessions for which the price was either flat or declined:
The logistic regression was able to correctly predict the positive price variation for 58 out of 64 trading sessions (90.6 percent accuracy). What is the difference between the first and second test cases?
In the first case, the separating hyperplane equation, w0 + w1.volatility + w2.volume
, is used to segregate both the features generating either positive or negative price variation. The overall accuracy of the classification is negatively impacted by the overlap of the features from the two classes.
In the second case, the classifier has to consider only the observations located on one side of the hyperplane equation, without taking into account the false negatives.
Impact of rounding errors
Under some circumstances, the generation of the rounding errors during the computation of the Jacobian matrix has an impact on the accuracy of the separating hyperplane equation: w0 + w1.volatility + w2.volume
. This negatively impacts the prediction of both the positive and negative price variation.
The accuracy of the binary classifier can be further improved by considering the positive variation of price as price(t+1) – price(t) > EPS.