This section describes very briefly some of the mathematical concepts used in this book.
Many algorithms used in machine learning such as minimization of a convex loss function, principal component analysis, or least squares regression invariably involve manipulation and transformation of matrices. There are many good books on the subject, from the inexpensive [A:2] to the sophisticated [A:3].
QR decomposition (or QR factorization) is the decomposition of a matrix A into a product of an orthogonal matrix Q and upper triangular matrix R. So, A=QR and QTQ=I [A:4].
The decomposition is unique if A is a real, square, and invertible matrix. In the case of a rectangle matrix A, m by n with m > n, the decomposition is implemented as the dot product of two vectors of matrix A = [Q1, Q2].[R1, R2]T, where Q1 is an m by n matrix, Q2 is an m by n matrix, R1 is an n by n upper triangular matrix, and R2 is an m by n null matrix.
QR decomposition is a reliable method of solving a large system of linear equations in which the number of equations (rows) exceeds the number of variables (columns). Its asymptotic computational time complexity for a training set of m dimensions and n observations is O(mn2-n3/3).
It is used to minimize the loss function for ordinary least squares regression (refer to the Ordinary least squares (OLS) regression section of Chapter 6, Regression and Regularization).
LU factorization is a technique used to solve a matrix equation A.x = b where A is a non-singular matrix and x and b are two vectors. The technique consists of decomposing the original matrix A as the product of simple matrices A = A1A2…An. It is of two types as follows:
LDL decomposition for real matrices defines a real positive matrix A as the product of a lower unit triangular matrix L, a diagonal matrix D, and the transposed matrix of L, that is LT. So, A = LDLT.
The Cholesky factorization or Cholesky decomposition of real matrices is a special case of LU factorization [A:4]. It decomposes a positive definite matrix A into a product of a lower triangular matrix L and its conjugate transpose LT. So, A = LLT.
The asymptotic computational time complexity for the Cholesky factorization is O(mn2), where m is the number of features (model parameters) and n is the number of observations. Cholesky factorization is used in the linear least squares Kalman filter (refer to the The recursive algorithm section of Chapter 3, Data Preprocessing.
The singular value decomposition (SVD) of real matrices defines an m by n real matrix A as the product of an m square real unitary matrix U, an m by n rectangular diagonal matrix Σ, and the transpose VT matrix of a real matrix. So, A=UΣVT.
The columns of the matrices U and V are the orthogonal bases and the value of the diagonal matrix Σ is a singular value [A:4]. The asymptotic computational time complexity for the singular value decomposition for n observations and m features is O(mn2-n3). Singular value decomposition is used to minimize the total least squares and solve homogeneous linear equations.
The Eigen decomposition of a real square matrix A is the canonical factorization as Ax = λx.
λ is the eigenvalue (scalar) corresponding to the vector x. The n by n matrix A is then defined as A = QDQT. Q is the square matrix that contains the eigenvectors and D is the diagonal matrix whose elements are the eigenvalues associated to the eigenvectors [A:5], [A:6]. Eigen decomposition is used in Principal Components Analysis (refer to the Principal components analysis (PCA) section of Chapter 4, Unsupervised Learning).
There are many more open source algebraic libraries available to developers as APIs besides Apache Commons Math, which is used in Chapter 3, Data preprocessing; Chapter 5, Naïve Bayes Classifiers, and Chapter 6, Regression and Regularization, and Apache Spark/MLlib used in Chapter 12, Scalable Frameworks. They are as follows:
The Apache Spark/MLlib framework bundles jBlas, Colt, and Breeze. The Iitb framework for conditional random fields uses Colt linear algebra components.
Alternative to Java/Scala libraries
If your application or project needs a high-performance numerical processing tool under limited resources (CPU, RAM memory, and so on), and if portability is not a constraint, then using a C- or C++-compiled library is an excellent alternative. The binary functions can be accessed through the Java Native Interface (JNI).
Propositional logic is the formulation of axioms or propositions. There are several formal representations of propositions:
Propositional logic is subject to the rules of Boolean calculus. Let's consider three propositions P, Q, and R and three Boolean operators NOT, AND, OR. So the following rules apply:
NOT (NOT P) = P
P AND false = false, P AND true = P, P OR false = P, P OR true = P
P AND Q = Q AND P, P OR Q = Q OR P
P AND (Q AND R) = (P AND Q) AND R
First-order predicate logic, also known as first-order predicate calculus, is the quantification of propositional logic [A:7]. The most common formulations of the first order logic are as follows:
IF P THEN action
rulesFirst order logic is used to describe the classifiers in learning classifier systems. Refer to the XCS rules section of Chapter 11, Reinforcement Learning for more information.
Let's consider a function with n variables xi and m outputs yj such that f: { xi } -> {yj =fj(x)}.
The Jacobian matrix [A:8] is the matrix of the first order partial derivatives of the output values of a continuous, differential function:
The Hessian matrix is the square matrix of the second order of partial derivatives of a continuous, twice differentiable function:
An example is as follows:
The same comments regarding linear algebra algorithms apply to optimization. Treating such techniques in depth would have rendered the book impractical. However, optimization is critical to the efficiency and, to a lesser extent, the accuracy of the machine learning algorithms. Some basic knowledge in this field goes a long way to build practical solutions for large data sets.
The steepest descent (or gradient descent) method is one of the simplest techniques used to find a local minimum of any continuous, differentiable function F or the global minimum of any defined, differentiable, and convex function [A:9]. The value of a vector or data point xt+1 at the iteration t + 1 is computed from the previous value xt using the gradient F of function F and the slope γ:
The steepest gradient algorithm is used for solving systems of non-linear equations and minimization of the loss function in the logistic regression (refer to the Numerical optimization section of Chapter 6, Regression and Regularization), in support vector classifiers (refer to the The nonlinear SVM section of Chapter 8, Kernel Models and Support Vector Machines), and in multilayer perceptrons (refer to the The multilayer perceptron (MLP) section of Chapter 9, Artificial Neural Networks).
The conjugate gradient solves unconstrained optimization problems and systems of linear equations. It is an alternative to the LU factorization for positive, definite, and symmetric square matrices. The solution x* to the equation Ax = b is expanded as the weighted summation of n basis orthogonal directions pi (or conjugate directions):
The solution x* is extracted by computing the ith conjugate vector pi and then computing the coefficients αi.
The stochastic gradient method is a variant of the steepest descent method that minimizes the convex function by defining the objective function F as the sum of differentiable, basis functions fi:
The solution xt+1 at iteration t+1 is computed from the value xt at iteration t, the step size (or learning rate) α, and the sum of the gradient of the basis functions [A:10]. The stochastic gradient descent is usually faster than other gradient descent or quasi-Newton methods in converging towards a solution for convex functions. The stochastic gradient descent method is used in logistic regression, support vector machines, and back-propagation neural networks.
Stochastic gradient is particularly suitable for discriminative models with large datasets [A:11]. Spark/MLlib makes extensive use of the stochastic gradient method.
Quasi-Newton algorithms are variations of Newton's method of finding the value of a vector or data point that maximizes or minimizes a function F whose first order derivative is null [A:12].
Newton's method is a well-known and simple optimization method used to find the solution to equations F(x) = 0 for which F is continuous and differentiable up to the second order. It relies on the Taylor series expansion to approximate the function F with a quadratic approximation on the variable ∆x = xt+1-xt, to compute the value at the next iteration using the first order F' and second order F" derivatives:
Contrary to Newton's method, quasi-Newton methods do not require that the second order derivative, Hessian matrix of the objective function be computed. It just has to be approximated [A:13]. There are several approaches to approximate the computation of the Hessian matrix.
The Broyden-Fletcher-Goldfarb-Shanno (BGFS) method is a quasi-Newton iterative numerical method to solve unconstrained nonlinear problems. The Hessian matrix Ht+1 at an iteration t+1 is approximated using the value of the previous iteration t as Ht+1=Ht + Ut + Vt applied to the Newton equation for the direction pt:
The BFGS method is used in minimization of the cost function for the conditional random field, and L1 and L2 regression.
The performance of the BFGS algorithm can be improved by caching the intermediate computation in memory in order to approximate the Hessian matrix. The obvious drawback is that the memory becomes a limiting factor in the scalability of the optimizer.
The Limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm or L-BFGS is a variant of BFGS that uses a minimum amount of computer RAM. The algorithm maintains the last m incremental updates of the values ∆xt and the gradient ∆Gt at iteration t, and then computes those values for the next step t+1:
It is supported by the Apache Commons Math 3.3 and above, Apache Spark/MLlib 1.0 and above, Colt 1.0 and above, and Iiitb CRF libraries. L-BFGS is used in minimization of the loss function in conditional random fields. For more information, refer to the Conditional random fields section of Chapter 7, Sequential Data Models.
Let's consider the classic minimization of the least squares of a nonlinear function y = F(x, w) with wi parameters for observations {y, xi}. The objective is to minimize the sum of the squares of residuals ri:
The Gauss-Newton technique is a generalization of Newton's method. The technique solves nonlinear least squares by updating the parameters wt+1 at iteration t+1 using the first order derivative, or Jacobian:
The Gauss-Newton algorithm is used in logistic regression. For more information, refer to the The logistic regression section of Chapter 6, Regression and Regularization.
The Levenberg-Marquardt algorithm is an alternative to the Gauss-Newton technique for solving nonlinear least squares and curve fitting problems. The method consists of adding the gradient or Jacobian terms to the residuals ri to approximate the least squares error:
The Levenberg-Marquardt algorithm is used in the training of logistic regression. For more information, refer to the The logistic regression section of Chapter 6, Regression and Regularization.
The Lagrange multipliers methodology is an optimization technique to find the local optima of a multivariate function, subject to equality constraints [A:14]. The problem is stated as maximize f(x) subject to g(x) = c, where c is a constant and x is a variable or features vector.
This methodology introduces a new variable λ to integrate the constraint g into a function, known as the Lagrange function (x, λ). Let's note , which is the gradient of over the variables xi and λ. The Lagrange multipliers are computed by maximizing :
An example is as follows:
Lagrange multipliers are used in minimizing the loss function in the non-separable case of linear support vector machines. For more information, refer to The nonseparable case (soft margin) section of Chapter 8, Kernel Models and Support Vector Machines.
The purpose of dynamic programming is to break down an optimization problem into a sequence of steps known as substructures [A:15]. There are two types of problems for which dynamic programming is suitable.
The solution of a global optimization problem can be broken down into optimal solutions for its subproblems. The solutions of the subproblems are known as optimal substructures. Greedy algorithms or the computation of the minimum span of a graph are examples of decomposition into optimal substructures. Such algorithms can be implemented either recursively or iteratively.
The solution of the global problem is applied recursively to the subproblems if the number of subproblems is small. This approach is known as dynamic programming using overlapping substructures. Forward-backward passes on hidden Markov models, the Viterbi algorithm (refer to the The Viterbi algorithm section of Chapter 7, Sequential Data Models), or the back-propagation of error in a multilayer perceptron (refer to the Step 3 – error backpropagation section of Chapter 9, Artificial Neural Networks) are good examples of overlapping substructures.
The mathematical formulation of dynamic programming solutions is specific to the problem it attempts to solve. Dynamic programming techniques are also commonly used in mathematical puzzles such as the Tower of Hanoi.