The logistic regression

Despite its name, the logistic regression is a classifier. In fact, the logistic regression is one of the most widely used discriminative learning techniques because of its simplicity and its ability to leverage a large variety of optimization algorithms. The technique is used to quantify the relationship between an observed target variable y and a set of variables x that it depends on. Once the model is created (trained), it is used to classify real-time data.

A logistic regression can be either binomial (two classes) or multinomial (three and more classes). In a binomial classification, the observed outcome is defined as {true, false}, {0, 1}, or {-1, +1}.

The logit function

The conditional probability in a linear regression model is a linear function of its weights [6:13]. The logistic regression model addresses the nonlinear regression problem by defining the logarithm of the conditional probability as a linear function of its parameters.

First, let's introduce the logistic function and its derivative, which are defined as follows:

$$ f(x) = \frac{1}{1 + e^{-x}} \qquad f'(x) = f(x)\,\bigl(1 - f(x)\bigr) $$

Have a look at the following graph:


The graph of the logistic function and its derivative
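As a quick illustration, here is a minimal Scala sketch, not part of the library used in this chapter, of the logistic function and its derivative as defined above:

def logistic(x: Double): Double = 1.0/(1.0 + Math.exp(-x))

def logisticGradient(x: Double): Double = {
  val f = logistic(x)
  f*(1.0 - f)   // the derivative is expressed in terms of the function itself
}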

The remainder of this section is dedicated to the application of the multivariate logistic regression to a binary classification (two classes).

Binomial classification

The logistic regression is popular for several reasons; some are as follows:

  • It is available with most statistical software packages and open source libraries
  • Its S-shape describes the combined effect of several explanatory variables
  • Its range of values [0, 1] is intuitive from a probabilistic perspective

Let's consider the classification problem using two classes. As discussed in the Validation section of Chapter 2, Hello World!, even the best classifier produces false positives and false negatives. The training procedure for a binomial classification is illustrated in the following diagram:


Illustration of the binomial classification for a two-dimensional dataset

The purpose of the training is to compute the hyperplane that separates the observations into two categories or classes. Mathematically speaking, a hyperplane in an n-dimensional space (where n is the number of features) is a subspace of n-1 dimensions. The separating hyperplane of a three-dimensional space is a plane; the separating hyperplane of a two-dimensional problem (a plane) is a line. In the preceding example, the hyperplane separates the training set into two distinct classes (or groups), class 1 and class 2, in an attempt to minimize the overlap (false positives and false negatives).

The equation of the hyperplane is derived from the logistic function applied to the dot product of the regression parameters (or weights) and the features.

The logistic function accentuates the difference between the two groups of training observations, separated by the hyperplane. It pushes the observations away from the separating hyperplane towards either of the classes.
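In other words, with weights w0, w1, …, wn and features x1, …, xn, the separating hyperplane is the set of points for which the logistic function evaluates to 0.5, that is:

$$ w_0 + w_1 x_1 + \dots + w_n x_n = 0 $$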

In the case of two classes, c1 and c2, with their respective conditional probabilities, p(C=c1 | X=xi, w) = p(xi|w) and p(C=c2 | X=xi, w) = 1 - p(xi|w), where w is the set of model parameters (the weights in the case of the logistic regression), the following functions can be defined:

Note

The log likelihood:

$$ \log L(w) = \sum_{i=1}^{n} \log\, p(y_i \mid x_i, w) $$

Conditional probabilities using the logit function:

$$ p(x_i \mid w) = \frac{1}{1 + e^{-(w_0 + w_1 x_{i1} + \dots + w_k x_{ik})}} $$

The log likelihood for the binomial logistic regression:

$$ \log L(w) = \sum_{i=1}^{n} \Bigl[\, y_i \log p(x_i \mid w) + (1 - y_i) \log\bigl(1 - p(x_i \mid w)\bigr) \Bigr] $$

First order derivative for the log likelihood:

$$ \frac{\partial \log L(w)}{\partial w_j} = \sum_{i=1}^{n} x_{ij}\,\bigl(y_i - p(x_i \mid w)\bigr) $$
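To make the log likelihood concrete, here is a minimal Scala sketch, independent of the library code used later in this section, that evaluates it for labels in {0, 1}; weights holds the intercept w0 followed by w1..wn:

def logLikelihood(weights: Array[Double], xt: Array[Array[Double]], labels: Array[Int]): Double =
  xt.zip(labels).foldLeft(0.0){ case (sum, (x, y)) =>
    val dot = x.zip(weights.drop(1)).foldLeft(weights.head)((s, z) => s + z._1*z._2)
    val p = 1.0/(1.0 + Math.exp(-dot))   // conditional probability p(x|w)
    sum + y*Math.log(p) + (1 - y)*Math.log(1.0 - p)
  }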

Let's implement the logistic regression without a penalty term using the Apache Commons Math library. The library contains several least squares optimizers, allowing you to specify the minimization algorithm, optimizer, for the loss function of the LogisticRegression class:

class LogisticRegression[T <% Double](xt: XTSeries[Array[T]],
    labels: Array[Int],
    optimizer: LogisticRegressionOptimizer)
  extends PipeOperator[Array[T], Int] {

  val model: Option[RegressionModel] = { … }
  …
}

The parameters of the logistic regression class are the multivariate time series (features) xt, the target or labeled data, labels, and the optimizer algorithm used to minimize the loss function or residual sum of squares. In the case of the binomial logistic regression, labels are assigned the values of 1 for one class and 0 for the other.

The purpose of the training is to determine the regression coefficients, model._1, which minimize the loss function. The residual sum of squares (RSS) is computed as model._2.

Tip

Target values

There is no specific rule to assign the two values to the observed data for the binomial logistic regression: {-1, +1}, {0, 1}, or {false, true}. The values pair {0, 1} is convenient because it allows the developer to reuse the code for multinomial logistic regression using normalized class values.

For convenience, the definition and the configuration of the optimizer are encapsulated in the LogisticRegressionOptimizer class.

Software design

The implementation of the logistic regression uses the following components:

  • RegressionModel of the Model type, which is initialized through training during the instantiation of the classifier. We reuse the RegressionModel type introduced in the Linear regression section.
  • The predictive or classification routine is implemented as a data transformation |> extending the PipeOperator trait.
  • The logistic regression class, LogisticRegression, has three parameters: the least squares optimizer of the LogisticRegressionOptimizer type (used in training), a features set of the XTSeries type, and a vector of labels.

The key software components of the logistic regression are described in the following UML class diagram:


The UML class diagram for the logistic regression

The training workflow

Our implementation of the training of the logistic regression model leverages either the Gauss-Newton or the Levenberg-Marquardt nonlinear least squares optimizer (refer to the Nonlinear least squares minimization section in Appendix A, Basic Concepts) packaged with the Apache Commons Math library.

The training of the logistic regression is performed by the train method:

val model: Option[RegressionModel] = train

Tip

Handling exceptions from the Apache Commons Math library

The training of the logistic regression using the Apache Commons Math library requires handling ConvergenceException, DimensionMismatchException, TooManyEvaluationsException, TooManyIterationsException, and MathRuntimeException. Debugging is greatly facilitated by understanding the context of these exceptions in the Apache library source code.

The implementation of the training method, train, relies on the following five steps:

  1. Select and configure the least squares optimizer.
  2. Define the logit function and its Jacobian.
  3. Specify the convergence and exit criteria.
  4. Compute the residuals using the least squares problem builder.
  5. Run the optimizer.
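As a preview, the five steps might be assembled in train along the following lines. This is a hypothetical condensed sketch, not the verbatim implementation; the helper values (the configured optimizer and the least squares problem, lsp) are defined in the subsections that follow, and the construction of RegressionModel from the weights and the RMS is an assumption:

import scala.util.Try

def train: Option[RegressionModel] = Try {
  // Steps 1 to 4 configure the optimizer, the Jacobian, the exit condition,
  // and the least squares problem, lsp; step 5 runs the minimization
  val optimum = optimizer.optimize(lsp)
  RegressionModel(optimum.getPoint.toArray, optimum.getRMS)  // assumed constructor
}.toOption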

The workflow and the Apache Commons Math classes used in the training of the logistic regression are visualized by the following flow diagram:


The workflow for training the logistic regression using Apache Commons Math

The first four steps are required by the Apache Commons Math library to initialize the configuration of the logistic regression prior to the minimization of the loss function. Let's start with the configuration of the least squares optimizer.

Configuring the least squares optimizer

In this step, you have to specify the algorithm that minimizes the residual sum of squares. The LogisticRegressionOptimizer class is responsible for configuring the optimizer. The class has the following two purposes:

  • Encapsulating the configuration parameters for the optimizer
  • Invoking the LeastSquaresOptimizer interface defined in the Apache Commons Math library

Consider the following code:

class LogisticRegressionOptimizer(maxIters: Int, maxEvals: Int, eps: Double, lsOptimizer: LeastSquaresOptimizer) {
  def optimize(lsProblem: LeastSquaresProblem): Optimum = lsOptimizer.optimize(lsProblem)
}

The configuration of the logistic regression optimizer is defined using the maximum number of iterations (maxIters), the maximum number of evaluations (maxEvals) of the logistic function and its derivative, the convergence criterion (eps) on the residual sum of squares, and an instance of the Apache Commons Math least squares optimizer (org.apache.commons.math3.fitting.leastsquares.LeastSquaresOptimizer). The optimize method takes the least squares problem (org.apache.commons.math3.fitting.leastsquares.LeastSquaresProblem) as its argument.
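For instance, a configuration for training might look as follows; the values are illustrative, and GaussNewtonOptimizer is one of the least squares optimizers available in the Apache Commons Math fitting.leastsquares package:

import org.apache.commons.math3.fitting.leastsquares.GaussNewtonOptimizer

val lsOptimizer = new LogisticRegressionOptimizer(
  200,     // maxIters: maximum number of iterations (illustrative value)
  1000,    // maxEvals: maximum number of evaluations of the logistic function
  1e-4,    // eps: convergence criterion on the residual sum of squares
  new GaussNewtonOptimizer
)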

Computing the Jacobian matrix

The next step consists of computing the value of the logistic function and its first order partial derivatives with respect to the weights by overriding the value method of the fitting.leastsquares.MultivariateJacobianFunction interface:

final val initWeight = 0.5
val weights0 = Array.fill(xt(0).size + 1)(initWeight) //1

val lrJacobian = new MultivariateJacobianFunction {
  override def value(w: RealVector): Pair[RealVector, RealMatrix] = {
    val _w = w.toArray
    val gradient = xt.toArray.map(g => {  //2
      val expn = g.zip(_w.drop(1))
                  .foldLeft(_w(0))((s, z) => s + z._1*z._2)
      val logIt = 1.0/(1.0 + Math.exp(-expn))  //3
      (logIt, logIt*(1.0 - logIt))  //4
    })

    val jacobian = Array.ofDim[Double](xt.size, weights0.size)  //5
    xt.toArray.zipWithIndex.foreach(xi => {  //6
      val df: Double = gradient(xi._2)._2
      Range(0, xi._1.size).foreach(j =>
          jacobian(xi._2)(j+1) = xi._1(j)*df)
      jacobian(xi._2)(0) = 1.0  //7
    })
    (new ArrayRealVector(gradient.map(_._1)),
     new Array2DRowRealMatrix(jacobian))  //8
  }
}

The regression weights, weights0, are initialized with the arbitrary value of 0.5.

The value method uses the linear algebra types RealVector, RealMatrix, ArrayRealVector, and Array2DRowRealMatrix defined in the org.apache.commons.math3.linear package of Apache Commons Math.

It takes the regression weights, w, computes the gradient of the logistic function for each data point (line 2), and returns the value of the logit (line 3) and its derivative (line 4) as a tuple. The Jacobian matrix is created (line 5) and then populated with the product of each feature value and the derivative of the logit (line 6). The first element of each row of the Jacobian matrix is set to 1.0 to take into account the intercept (line 7). Finally, the vector of the logit values for each observation and the Jacobian matrix are returned as a tuple (line 8) to comply with the return type of the value function.
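To relate line 6 of the code to the underlying math: by the chain rule, the partial derivative of the logistic function f applied to z = w0 + w1x1 + … + wnxn, with respect to the weight wj of feature xj, is:

$$ \frac{\partial f(z)}{\partial w_j} = x_j\, f(z)\bigl(1 - f(z)\bigr) $$

which is the product of the feature value, xi._1(j), and the derivative, df, computed in the code.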

Defining the exit conditions

The third step defines the exit condition for the optimizer. It is accomplished by overriding the converged method of the parameterized org.apache.commons.math3.optim.ConvergenceChecker interface:

val exitCheck = new ConvergenceChecker[PointVectorValuePair] {
  override def converged(iteration: Int, prev: PointVectorValuePair,
      current: PointVectorValuePair): Boolean = {
    val delta = prev.getValue
                    .zip(current.getValue)
                    .foldLeft(0.0)((s, z) => {
                       val d = z._1 - z._2
                       s + d*d
                    })
    Math.sqrt(delta) < optimizer.eps || iteration >= optimizer.maxIters
  }
}

This implementation computes the convergence or exit condition as follows:

  • Either the L2 norm of the difference between the weights of the current iteration and the weights of the previous iteration, delta, is smaller than the convergence criterion, eps
  • Or the iteration count reaches the maximum number of iterations allowed, maxIters

Defining the least squares problem

The Apache Commons Math least squares optimizer package requires all the inputs to the nonlinear least squares minimizer to be defined as an instance of LeastSquaresProblem, generated by the LeastSquaresBuilder factory class:

val builder = new LeastSquaresBuilder
val diagWeights0 = Array.fill(xt.size)(1.0) //1
val wMatrix = MatrixUtils.createRealDiagonalMatrix(diagWeights0)
val lsp = builder.model(lrJacobian)  //2
                 .weight(wMatrix) 
                 .target(labels)  //7
                 .checkerPair(exitCheck) //5
                 .maxEvaluations(optimizer.maxEvals) //3
                 .start(weights0) //6
                 .maxIterations(optimizer.maxIters) //4
                 .build

The diagonal elements of the weights matrix are initialized to 1.0 (line 1). Besides the initialization of the model with the Jacobian function, lrJacobian (line 2), the maximum number of evaluations (line 3), the maximum number of iterations (line 4), and the exit condition (line 5) are also initialized.

The regression weights are initialized as 0.5 (weights0) (line 6). Finally, the labeled or target values are initialized (line 7).

Minimizing the loss function

The training is executed with a simple call to the least squares optimizer on the problem, lsp:

val optimum = optimizer.optimize(lsp)
(optimum.getPoint.toArray, optimum.getRMS)

The regression coefficients (or weights) and the residual mean square (RMS) are retrieved by invoking the getPoint and getRMS methods on the Optimum instance returned by the Apache Commons Math optimizer.

Test

Let's test our implementation of the binomial multivariate logistic regression using the example of the Copper ETF price variation versus volatility and volume, used in the previous two sections. The only difference is that we need to define the target values as 0 if the ETF price decreases between two consecutive trading sessions, and 1 otherwise. Therefore, the deltaPrice vector used in the linear and ridge regressions has to be modified to support the binary outcome:

val deltaPrice = prices.drop(1).zip(prices.dropRight(1))
                       .map(p => if(p._1 > p._2) 1 else 0)

Executing the test case is just a matter of instantiating the LogisticRegression class with the appropriate configuration parameters. The implementation reuses the code already defined for the least squares and ridge regression to load data from CSV files (src, price, volatility, and volume) and normalize the observations:

val MAXITERS = 80; val MAXEVALS = 1000; val EPS = 1e-4

val lsOptimizer = LogisticRegressionOptimizer(MAXITERS, MAXEVALS, EPS, new LevenbergMarquardtOptimizer)
val xt = XTSeries[DblVector](features)
val regression = new LogisticRegression[Double](xt, deltaPrice, lsOptimizer)
val rms = regression.rms.get
val weights = regression.weights.get

In this example, the Levenberg-Marquardt algorithm is used to minimize the loss function.

Tip

Levenberg-Marquardt parameters

The driver code uses the LevenbergMarquardtOptimizer with its default tuning parameters to keep the implementation simple. However, the algorithm has a few important parameters, such as the relative tolerance for the cost and for matrix inversion, that are worth tuning for commercial applications (refer to the Levenberg-Marquardt section in Appendix A, Basic Concepts).
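As an illustration only, and assuming the fluent withXxx configuration methods of the org.apache.commons.math3.fitting.leastsquares.LevenbergMarquardtOptimizer class, the tolerances could be adjusted as follows (the values are placeholders, not recommendations):

val lmOptimizer = new LevenbergMarquardtOptimizer()
  .withCostRelativeTolerance(1e-7)        // relative tolerance on the cost
  .withParameterRelativeTolerance(1e-7)   // relative tolerance on the weights
  .withOrthoTolerance(1e-7)               // orthogonality tolerance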

The execution of the test produces the following results:

  • Residual mean square is 0.497
  • Weights are -0.124 for intercept, 0.453 for ETF volatility, and -0.121 for ETF volume

The last step is the classification of the real-time data.

Classification

As mentioned earlier, despite its name, the binomial logistic regression is a binary classifier. The classification method is implemented as a data transformation by overriding the pipe operator, |>:

type Feature = Array[T]
final val MARGIN = 0.01

def |> : PartialFunction[Feature, Int] = { //1
  case x: Feature if(model != None && model.get.size-1 == x.size) => {
    val w = model.get.weights
    val dot = x.zip(w.drop(1))
               .foldLeft(w(0))((s, xw) => s + xw._1*xw._2) //2
    if(logit(dot) > 0.5 + MARGIN) 1 else 0 //3
  }
}

The classification method, |>, checks if the number of model parameters (weights) is equal to the number of features plus 1 (line 1) and throws an exception if the test fails. The dot product of the weights and the features is computed using a fold (line 2). Finally, the method returns 1 (class 1, which signifies that the price variation of the ETF is positive) if the value of the sigmoid exceeds 0.5 plus a small margin, MARGIN. It returns 0 otherwise (class 2, which signifies that the price variation of the ETF is negative) (line 3).

Note

Class identification

The class that the new data x belongs to is determined by the logit(dot) > 0.5 test, where dot is the dot product of the features and the regression weights (w0 + w1.volatility + w2.volume). Since the logistic function is monotonically increasing and equals 0.5 at zero, this test is equivalent to dot > 0.0. You may find either condition in the literature.

Let's apply the classification to the original training set, features, to validate our model (weights):

val predicted = features.map(x => regression |> x)

The direction of the price variation of the Copper ETF, price(t+1) – price(t), is compared to the direction predicted by the logistic regression. The result is plotted with the success value if the positive or negative direction is correctly classified; otherwise, it is plotted with the failure value:


The logistic regression was able to classify 78 out of 121 trading sessions (65 percent accuracy).

Now, let's use the logistic regression to predict the positive price variation for the Copper ETF, given its volatility and trading volume. This trading or investment strategy is known as being long on the market. This use case ignores the trading sessions for which the price was either flat or declined:


The logistic regression was able to correctly predict the positive price variation for 58 out of 64 trading sessions (90.6 percent accuracy). What is the difference between the first and second test cases?

In the first case, the separating hyperplane equation, w0 + w1.volatility + w2.volume, is used to separate the observations that generate a positive price variation from those that generate a negative one. The overall accuracy of the classification is negatively impacted by the overlap of the features from the two classes.

In the second case, the classifier has to consider only the observations located on one side of the hyperplane equation, without taking into account the false negatives.

Tip

Impact of rounding errors

Under some circumstances, rounding errors generated during the computation of the Jacobian matrix affect the accuracy of the separating hyperplane equation, w0 + w1.volatility + w2.volume. This negatively impacts the prediction of both positive and negative price variations.

The accuracy of the binary classifier can be further improved by defining a positive price variation as price(t+1) – price(t) > EPS, where EPS is a small threshold.
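A minimal sketch of this adjustment, reusing the deltaPrice definition with a hypothetical threshold EPS:

final val EPS = 1e-3   // hypothetical threshold on the price variation
val deltaPrice = prices.drop(1).zip(prices.dropRight(1))
                       .map(p => if(p._1 - p._2 > EPS) 1 else 0)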

Tip

Validation methodology

The validation set is generated by randomly selecting data points from the original labeled set. A formal validation requires the use of a K-fold validation methodology to compute the recall, precision, and F1 measure for the logistic regression model.
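As a starting point for such a validation, here is a minimal sketch, not part of the book's library, that computes the precision, recall, and F1 measure from predicted and expected binary labels:

def f1Score(predicted: Array[Int], expected: Array[Int]): Double = {
  val pairs = predicted.zip(expected)
  val tp = pairs.count(p => p._1 == 1 && p._2 == 1).toDouble  // true positives
  val fp = pairs.count(p => p._1 == 1 && p._2 == 0).toDouble  // false positives
  val fn = pairs.count(p => p._1 == 0 && p._2 == 1).toDouble  // false negatives
  val precision = tp/(tp + fp)
  val recall = tp/(tp + fn)
  2.0*precision*recall/(precision + recall)
}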
