Dimension reduction

Without prior knowledge of the data domain, data scientists include all possible features in their first attempt to create a classification, prediction, or regression model. After all, making a priori assumptions is a poor and dangerous way to reduce the search space. It is not uncommon for a model to use hundreds of features, which adds complexity and significant computation costs to building and validating the model.

Noise-filtering techniques reduce the sensitivity of the model to features that are associated with sporadic behavior. However, these noise-related features are not known prior to the training phase, and therefore, cannot be discarded. As a consequence, training of the model becomes a very cumbersome and time-consuming task.

Overfitting is another hurdle that can arise from a large feature set: a training set of limited size cannot support a reliable model with a large number of features.

Dimension reduction techniques alleviate these problems by detecting features that have little influence on the overall model behavior.

There are three approaches to reduce the number of features in a model:

  • Statistical analysis solutions such as ANOVA for smaller feature sets
  • Regularization and shrinking techniques, which are introduced in Chapter 6, Regression and Regularization
  • Algorithms that maximize the variance of the dataset by transforming the covariance matrix

The next section introduces one of the most commonly used algorithms of the third category—principal component analysis.

Principal components analysis (PCA)

The purpose of principal components analysis is to transform the original set of features into a new set of features ordered by decreasing variance. The original observations are transformed into a set of variables with a lower degree of correlation. Let's consider a model with two features {x, y} and a set of observations {xi, yi} plotted on the following chart:

Visualization of principal components for a 2-dimension model

The features x and y are converted into two variables X and Y (that is, a rotation) to more appropriately match the distribution of observations. The variable with the highest variance is known as the first principal component. The variable with the nth highest variance is known as the nth principal component.
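
Formally, for a feature vector x with mean value of the observations written as x-bar, the first principal component is the unit vector w1 that maximizes the variance of the projected, mean-centered observations. This is the standard textbook formulation, not a formula reproduced from this chapter:

$$w_1 = \arg\max_{\|w\|=1} \frac{1}{n}\sum_{i=1}^{n} \left((x_i - \bar{x}) \cdot w\right)^2$$

Each subsequent principal component is defined in the same way, under the constraint that it is orthogonal to all the previous ones.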

Algorithm

I highly recommend the tutorial from Lindsay Smith [4:13] that describes the PCA algorithm in a very concrete and simple way using a 2-dimension model.

Note

PCA and covariance matrix

The covariance of two features X and Y with the observations set {xi, yi} is defined as:

$$\mathrm{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Here, $\bar{x}$ and $\bar{y}$ are the respective mean values for the observations x and y.

The covariance is computed from the zScore of each observation:

$$\mathrm{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\, z_{y,i} \quad \text{with} \quad z_{x,i} = \frac{x_i - \bar{x}}{\sigma_x},\; z_{y,i} = \frac{y_i - \bar{y}}{\sigma_y}$$

For a model with n features, xi, the covariance matrix is defined as:

$$\Sigma = \left[\mathrm{cov}(x_i, x_j)\right]_{1 \le i,\, j \le n}$$
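
As an illustration only, here is a minimal sketch that computes the z-scores and the covariance matrix directly from these definitions; the zScore, cov, and covarianceMatrix functions are hypothetical helpers, not part of the ScalaMl library:

// Standardize a single feature: subtract the mean, divide by the standard deviation
def zScore(x: Array[Double]): Array[Double] = {
  val mean = x.sum / x.length
  val stdDev = math.sqrt(x.map(v => (v - mean) * (v - mean)).sum / x.length)
  x.map(v => (v - mean) / stdDev)
}

// Covariance of two standardized features: average of the products of z-scores
def cov(zx: Array[Double], zy: Array[Double]): Double =
  zx.zip(zy).map { case (a, b) => a * b }.sum / zx.length

// Covariance matrix for observations given as rows of feature values
def covarianceMatrix(obs: Array[Array[Double]]): Array[Array[Double]] = {
  val z = obs.transpose.map(zScore)   // one standardized array per feature
  Array.tabulate(z.length, z.length)((i, j) => cov(z(i), z(j)))
}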

The transformation of x to X consists of computing the eigenvalues of the covariance matrix:

$$\Sigma\, v_j = \lambda_j\, v_j \quad (j = 1, \dots, n) \qquad X = x\, V \quad \text{with} \quad V = [v_1, \dots, v_n]$$

The eigenvalues are ranked in decreasing order of variance and the cumulative variance for each eigenvalue is computed. Finally, the m top eigenvalues for which the cumulative variance exceeds a predefined threshold (a percentage of the trace of the matrix) are the principal components, or reduced feature set.

$$\frac{\sum_{j=1}^{m} \lambda_j}{\mathrm{tr}(\Sigma')} = \frac{\sum_{j=1}^{m} \lambda_j}{\sum_{j=1}^{n} \lambda_j} \ge \alpha \quad \text{(threshold } \alpha\text{)}$$

The algorithm is implemented in five steps:

  1. Compute the zScore for the observations by standardizing the mean and standard deviation.
  2. Compute the covariance matrix Σ for the original set of observations.
  3. Compute the new covariance matrix Σ' for the transformed features by extracting the eigenvalues and eigenvectors of Σ.
  4. Rank the eigenvalues in decreasing order of variance. The ordered eigenvalues are the principal components.
  5. Select the principal components for which the cumulative variance exceeds a threshold, expressed as a percentage of the trace of the new covariance matrix (a minimal sketch of this step follows the list).
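
The selection in step 5 is not part of the implementation shown in the next section, which returns all the eigenvalues. The following sketch, built around the hypothetical helper numPrincipalComponents, shows one way to perform it, assuming the eigenvalues are already sorted in decreasing order:

// Number of principal components required to reach a cumulative variance threshold
// (hypothetical helper; eigenvalues assumed sorted in decreasing order)
def numPrincipalComponents(eigenvalues: Array[Double], threshold: Double): Int = {
  val trace = eigenvalues.sum                              // total variance
  val cumulative = eigenvalues.scanLeft(0.0)(_ + _).tail   // running sum of variances
  cumulative.indexWhere(_ / trace >= threshold) + 1
}

For instance, numPrincipalComponents(eigenvalues, 0.9) returns the size of the reduced feature set for a 90 percent threshold.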

The extraction of principal components by diagonalization of the covariance matrix Σ is visualized in the following diagram. The color used to represent the covariance value varies from white (lowest value) to black (highest value):

Visualization of the extraction of eigenvalues in PCA

The eigenvalues (the variances of the transformed features X) are ranked in decreasing order of their values. The PCA algorithm succeeds when the cumulative value of the last eigenvalues (the right-bottom section of the diagonal matrix) becomes insignificant.

Implementation

PCA can be easily implemented by using the Apache Commons Math library methods that compute the eigenvalues and eigenvectors. Once again, the main routine is implemented as a pipe operator so that it can be used in a generic workflow, as defined in The Pipe Operator section under Designing a workflow in Chapter 2, Hello World!.

import org.apache.commons.math3.linear.{ArrayRealVector, EigenDecomposition}
import org.apache.commons.math3.stat.correlation.Covariance
import types.ScalaMl._, types.CommonMath._  //2

def |> : PartialFunction[XTSeries[Array[T]], (DblMatrix, DblVector)] = {
  case xt: XTSeries[Array[T]] if xt != null && xt.size > 1 =>
    zScoring(xt) match {  //1
      case Some(obs) =>
        val covariance = new Covariance(obs).getCovarianceMatrix  //3
        val transf = new EigenDecomposition(covariance)
        val eigVectors = transf.getV  //4
        val eigValues = new ArrayRealVector(transf.getRealEigenvalues)
        val cov = obs.multiply(eigVectors).getData
        (cov, eigValues.toArray)  //5
  …

PCA requires that the original set of observations is standardized using the z-score transformation. It is implemented using the XTSeries.zScoring function introduced in the Normalization and Gauss distribution section in Chapter 1, Getting Started (line 1).

The assignment forces the implicit conversion of a time series of features of the type T into a matrix of the type Double. The implicit conversions between Scala primitives and ScalaMl types such as DblMatrix are defined in Types.ScalaMl, as mentioned in the Type conversions section in Chapter 1, Getting Started; the conversions between Apache Commons Math types and ScalaMl types are defined in Types.CommonMath, introduced in the Time series section in Chapter 3, Data Preprocessing (line 2). The covariance matrix is computed from the zScore created from the original observations (line 3). The eigenvectors, eigVectors, are computed using the getV method of the Apache Commons Math EigenDecomposition class, and the eigenvalues, eigValues, are extracted as the principal components (line 4).

Finally, the data transformation returns a tuple of the matrix of observations projected onto the eigenvectors and the array of eigenvalues (line 5).

Test case

Let's apply the PCA algorithm to extract a subset of the features that represents some of the financial metrics ratios of 34 S&P 500 companies. The metrics under consideration are:

  • Trailing Price-to-Earnings ratio (PE)
  • Price-to-Sale ratio (PS)
  • Price-to-Book ratio (PB)
  • Return on Equity (ROE)
  • Operating Margin (OM)

The financial metrics are described in the Terminology section under Finances 101 in Appendix A, Basic Concepts.

The input data is specified with the following format as a tuple: the ticker symbol and an array of five financial ratios, PE, PS, PB, ROE, and OM:

val data = Array[(String, DblVector)](
  // Ticker              PE     PS     PB    ROE    OM
  ("QCOM", Array[Double](20.8,  5.32,  3.65, 17.65, 29.2)),
  ("IBM",  Array[Double](13.0,  1.22, 12.2,  88.1,  19.9)),
   …
)

The client code that executes the PCA algorithm is defined simply as follows:

val pca = new PCA[Double]  //1
val input = data.map(_._2.take(3))
val cov = pca |> XTSeries[DblVector](input)  //2
Display.show(toString(cov), logger)  //3

Once the PCA class is instantiated (line 1), the eigenvalues and the matrix of transformed observations, cov, are computed (line 2), and then displayed using the utility singleton Display, which formats messages and appends them to the logger (line 3).

Evaluation

The first test on the financial ratios of the 34 companies uses a model that has five dimensions. As expected, the algorithm produces a list of five ordered eigenvalues:

2.5321, 1.0350, 0.7438, 0.5218, 0.3284

Let's plot the relative value of the eigenvalues (that is, relative importance of each feature) on a bar chart:

Distribution of eigenvalues in PCA for 5 dimensions

The chart shows that 3 out of 5 features account for 85 percent of total variance (trace of the transformed covariance matrix). I invite you to experiment with different combinations of these features. The selection of a subset of the existing features is as simple as applying Scala's take or drop methods:

val numFeatures = 4
val ts = XTSeries[DblVector](data.map(_._2.take(numFeatures)))

Let's plot the cumulative eigenvalues for the three different model configurations:

  • Five features: PE, PS, PB, ROE, and OM
  • Four features: PE, PS, PB, and ROE
  • Three features: PE, PS, and PB

Distribution of eigenvalues in PCA for 3, 4, and 5 features

The chart displays the cumulative value of eigenvalues that are the variance of the transformed features Xi. If we apply a threshold of 90 percent to the cumulative variance, then the number of principal components for each test model is as follows:

  • {PE, PS, PB}: 2
  • {PE, PS, PB, ROE}: 3
  • {PE, PS, PB, ROE, OM}: 3

In conclusion, the PCA algorithm reduced the dimension of the model by 33 percent for the 3-feature model, 25 percent for the 4-feature model, and 40 percent for the 5-feature model for a threshold of 90 percent.

Note

Cross-validation of PCA

Like any other unsupervised learning technique, the resulting principal components have to be validated through a one-fold or K-fold cross-validation using a regression estimator such as partial least squares regression (PLSR) or the predicted residual error sum of squares (PRESS). For those not afraid of statistics, I recommend Fast Cross-validation in Robust PCA by S. Engelen and M. Hubert [4:14]. You need to be aware, however, that the implementation of these regression estimators is not simple.
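
For reference, PRESS is commonly defined as the sum of the squared leave-one-out prediction errors; this is the standard definition, not a formula taken from this chapter:

$$\mathrm{PRESS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_{(i)}\right)^2$$

Here, $\hat{y}_{(i)}$ is the prediction for the i-th observation produced by a model fitted without that observation.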

The principal components can be validated through a 1-fold or K-fold cross-validation by applying a regression estimator or the expectation-maximization (EM) algorithm to the same dataset. The validation of the PCA is beyond the scope and space allocated to this chapter.

Principal components analysis is a special case of the more general factor analysis. The latter class of algorithms does not require the transformation of the covariance matrix to be orthogonal.

Other dimension reduction techniques

Although quite popular, principal components analysis is far from being the only dimension reduction method. Here are some alternative techniques, listed for reference: factor analysis, principal factor analysis, maximum likelihood factor analysis, independent component analysis (ICA), random projection, nonlinear PCA, nonlinear ICA, Kohonen's self-organizing maps, neural networks, and multidimensional scaling, just to name a few [4:15].
