Without prior knowledge of the problem domain, data scientists include all possible features in their first attempt to create a classification, prediction, or regression model. After all, making assumptions is a poor and dangerous approach to reduce the search space. It is not uncommon for a model to use hundreds of features, adding complexity and significant computation costs to build and validate the model.
Noise filtering techniques reduce the sensitivity of the model to features that are associated with sporadic behavior. However, these noise-related features are not known prior to the training phase, and therefore, cannot be completely discarded. As a consequence, training of the model becomes a very cumbersome and time-consuming task.
Overfitting is another hurdle that can arise from a large feature set. A training set of limited size does not allow you to create an accurate model with a large number of features.
Dimension reduction techniques alleviate these problems by detecting features that have little influence on the overall model behavior.
There are three approaches to reduce the number of features in a model:
The next section introduces one of the most commonly used algorithms of the third category: principal component analysis.
The purpose of principal components analysis is to transform the original set of features into a new set of ordered features by decreasing the order of variance. The original observations are transformed into a set of variables with a lower degree of correlation. Let's consider a model with two features, {x, y}, and a set of observations, {xi, yi}, plotted on the following chart:
The x and y features are converted into two X and Y variables (that is rotation) to appropriately match the distribution of observations. The variable with the highest variance is known as the first principal component. The variable with the nth highest variance is known as the nth principal component.
I highly recommend that you read the tutorial from Lindsay Smith [4:13] that describes the PCA algorithm in a very concrete and simple way using a two-dimensional model.
PCA and covariance matrix
M11: The covariance of two X and Y features with the observations set {xi, yi} and their respective mean values is defined as:
Here, and are the respective mean values for the observations, x and y.
M12: The covariance is computed from the Z-score of each observation:
M13: For a model with n features, xi, the covariance matrix is defined as:
M14: The transformation of x to X consists of computing the eigenvalues of the covariance matrix:
M15: The eigenvalues are ranked by their decreasing order of variance. Finally, the m top eigenvalues for which the cumulative of variance exceeds a predefined threshold (percentage of the trace of the matrix) are the principal components:
The algorithm is implemented in five steps:
The extraction of principal components by diagonalization of the covariance matrix Σ is visualized in the following diagram. The shades of grey used to represent the covariance value varies from white (lowest value) to black (highest value):
The eigenvalues (variance of X) are ranked by the decreasing order of their values. The PCA algorithm succeeds when the cumulative value of the last eigenvalues (the right-bottom section of the diagonal matrix) becomes insignificant.
The principal components analysis can be easily implemented using the Apache Commons Math library methods that compute the eigenvalues and eigenvectors. The PCA
class is defined as a data transformation of the ITransform
type, as described in the Monadic data transformation section in Chapter 2, Hello World!
The PCA
class has a single argument: the xt
training set (line 1
). The output type has a Double
for the projection of an observation along with the eigenvectors (line 2
). The constructor defines the z-score norm
function (line 3
):
class PCA[@specialized(Double) T <: AnyVal]( xt: XVSeries[T])(implicit f: T => Double) extends ITransform[Array[T]](xt) with Monitor[T] { //1 type V = Double //2 val norm = (xv: XVSeries[T]) => zScores(xv) //3 val model: Option[PCAModel] = train //4 override def |> : PartialFunction[U, Try[V]] }
The model for the PCA algorithm is defined by the PCAModel
case class (line 4
).
The model for the PCA algorithm, PCAModel
, consists of the covariance matrix, covariance
defined in the formula M11, and array of eigenvalues computed in the formula M16:
case class PCAModel(val covariance: DblMatrix, val eigenvalues: DblArray)
The |>
transformative method implements the computation of the principal components (that is, the eigenvector and eigenvalues):
def train: Option[PCAModel] = zScores(xt).map(x => { //5 val obs: DblMatrix = x.toArray val cov = new Covariance(obs).getCovarianceMatrix //6 val transform = new EigenDecomposition(cov) //7 val eigenVectors = transform.getV //8 val eigenValues = new ArrayRealVector(transform.getRealEigenvalues) val covariance = obs.multiply(eigenVectors).getData //9 PCAModel(covariance, eigenValues.toArray) //10 }) match {/* … */}
The normalization function zScores
the Z-score transformation (formula M12) (line 5
). Next, the method computes the covariance
matrix from the normalized data (line 6
). The eigenvectors, eigenVectors
, are computed (line 7
) and then retrieved using the getV
method in the Apache Commons Math EigenDecomposition
class (line 8
). The method computes the diagonal, transformed covariance matrix from the eigenvector (line 9
). Finally, the data transformation returns an instance of the PCA model (line 10
).
The |>
predictive method consists of projecting an observation onto the principal components:
override def |>
: PartialFunction[Array[T], Try[V]] = {
case x: Array[T]
if(isModel && x.length == dimension(xt)) =>
Try( inner(x, model.get.eigenvalues) )
}
The inner
method of the XTSeries
object computes the dot product of the values x
and the eigenvalues
model.
Let's apply the PCA algorithm to extract a subset of the features that represents some of the financial metrics ratios of 34 S&P 500 companies. The metrics under consideration are as follows:
The financial metrics are described in the Terminology section under Finances 101 in the Appendix A, Basic Concepts.
The input data is specified with the following format as a tuple (a ticker symbol and an array of five financial ratios, PE, PS, PB, ROE, and OM):
val data = Array[(String, DblVector)] ( // Ticker PE PS PB ROE OM ("QCOM", Array[Double](20.8, 5.32, 3.65, 17.65,29.2)), ("IBM", Array[Double](13, 1.22, 12.2, 88.1,19.9)), … )
The client code that executes the PCA algorithm is defined simply as follows:
val dim = data.head._2.size val input = data.map( _._2.take(dim)) val pca = new PCA[Double](input) //11 show(s"PCA model: ${pca.toString}") //12
The PCA is instantiated with the input
data (line 11
) and then displayed in a textual format (line 12
).
The first test on the 34 financial ratios uses a model that has five dimensions. As expected, the algorithm produces a list of five ordered eigenvalues:
2.5321, 1.0350, 0.7438, 0.5218, 0.3284
Let's plot the relative value of the eigenvalues (that is, relative importance of each feature) on the following bar chart:
The chart shows that three out of five features account for 85 percent of the total variance (trace of the transformed covariance matrix). I invite you to experiment with different combinations of these features. The selection of a subset of the existing features is as simple as applying Scala's take
or drop
methods:
data.map( _._2.take(dim))
Let's plot the cumulative eigenvalues for the three different model configurations:
The graph will be as follows:
The chart displays the cumulative value of eigenvalues that are the variance of the transformed features, Xi. If we apply a threshold of 90 percent to the cumulative variance, then the number of principal components for each test model is as follows:
In conclusion, the PCA algorithm reduced the dimension of the model by 33 percent for the three-feature model, 25 percent for the four-feature model, and 40 percent for the five-feature model for a threshold of 90 percent.
Cross-validation of PCA
Like any other unsupervised learning technique, the resulting principal components have to be validated through a one or K-fold cross-validation methodology using a regression estimator such as Partial Least Square Regression (PLSR) or the Predicted Residual Error Sum of Squares (PRESS). For those who are not afraid of statistics, I recommend that you read Fast Cross-validation in Robust PCA by S. Engelen and M. Hubert [4:14]. You need to be aware of the fact that the implementation of these regression estimators is not simple. The validation of the PCA is beyond the scope of this book.
Principal components analysis is a special case of the more general factor analysis. The latter class of algorithm does not require the transformation of the covariance matrix to be orthogonal.
The principal components analysis technique requires the model to be linear. Although the study of such algorithms is beyond the scope of this book, it's worth mentioning two approaches that extend PCA for non-linear models:
PCA extracts a set of orthogonal linear projections of an array of correlated values, X = {xi}. The kernel PCA algorithm consists of extracting a similar set of orthogonal projection of the inner product matrix, XTX. Nonlinearity is supported by applying a kernel function to the inner product. Kernel functions are described in the Kernel functions section in Chapter 8, Kernel Models and Support Vector Machines. The kernel PCA is an attempt to extract a low dimension feature set (or manifold) from the original observation space. The linear PCA is the projection on the tangent space of the manifold.
The concept of a manifold is borrowed from differential geometry. Manifolds generalize the notions of curves in a two-dimension space or surfaces in a three dimension space to a higher dimension. Nonlinear models are associated with Riemann manifolds whose metric is the inner product, XTX, on a tangent space. The manifold represents a low dimension feature space embedded into the original observation space. The idea is to project the principal components from the linear tangent space to a manifold using a exponential map. This feat is accomplished using a variety of techniques from Local Linear Embedding and density preserving maps to Laplacian Eigenmaps [4:15].
The vector of observations cannot be directly used on a manifold because metrics such as norms or inner products depend on the location of the manifold the vector is applied to. Computation on manifolds relies on tensors such as contravariant and covariant vectors. Tensors algebra is supported by covariant and contravariant functors, which is introduced in the Abstraction section in Chapter 1, Getting Started.
Techniques that use differentiable manifolds are known as spectral dimensionality reduction.
Alternative dimension reduction techniques
Here are some more alternative techniques, listed as references: factor analysis, principal factor analysis, maximum likelihood factor analysis, independent component analysis, Random projection, nonlinear ICA, Kohonen's self-organizing maps, neural networks, and multidimensional scaling, just to name a few [4:16].
Manifold learning algorithms such as classifiers and dimension reduction techniques are associated with semi-supervised learning.