Feature reduction using principal component analysis

As the curse of dimensionality (https://en.wikipedia.org/wiki/Curse_of_dimensionality) suggests, a large number of features is computationally expensive. One way of reducing the number of features is to manually choose which features to keep and which to ignore. However, identifying redundant features (the same feature represented differently) or highly correlated features is laborious when we have a huge number of features. Dimensionality reduction aims at reducing the number of features in the data while still retaining most of its variability.

Say we have a dataset of housing prices and there are two features that represent the area of the house, one in square feet and one in square meters; we can always drop one of the two. Dimensionality reduction is very useful when dealing with text, where the number of features easily runs into a few thousand.

In this recipe, we'll be looking into Principal Component Analysis (PCA) as a means to reduce the dimensions of data that is meant for both supervised and unsupervised learning.

How to do it...

As we have seen earlier, the only difference between the data for supervised and unsupervised learning is that the training and test data for supervised learning have labels attached to them. This brings in a small complication, considering that we are interested only in reducing the dimensions of the feature vectors and would like to retain the labels as they are.

Dimensionality reduction of data for supervised learning

The only thing that we have to watch out for while reducing the dimensions of training data for supervised learning is that PCA must be applied to the training data only; the test set must not be used to extract the components. Using the test data for PCA would bleed information from the test set into the components. This may result in higher accuracy numbers while testing, but the model could perform poorly on unseen production data.
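In code, the boundary looks roughly like the following sketch (trainingFeatures and testFeatures are placeholder RDD[Vector]s standing in for the feature vectors that this recipe builds from the SMS dataset later on):

  //Fit the principal components on the training features only
  val fitMatrix = new RowMatrix(trainingFeatures)
  val components = fitMatrix.computePrincipalComponents(100)

  //Project both splits using the components learned from the training data
  val projectedTrain = fitMatrix.multiply(components).rows
  val projectedTest = new RowMatrix(testFeatures).multiply(components).rows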

The singular value vector available in the SingularValueDecomposition object helps us pick the least number of components that still maintain a sufficiently high variance. The singular values, available via svd.s, show the amount of variance captured by each component. The first component is the most important (it contributes the highest variance), and the importance slowly diminishes from there.

In order to come up with a probable number of dimensions, we can watch how quickly the singular values diminish. Alternatively, we can just use a simple heuristic and come up with a reasonable number when the features run into a few thousand:

  //Take a 20% sample of the training data (trainingData, built in the next section)
  //and inspect how quickly its singular values taper off
  val dimensionDecidingSample = new RowMatrix(trainingData.randomSplit(Array(0.8, 0.2))(1).map(lp => lp.features))

  val svd = dimensionDecidingSample.computeSVD(500, computeU = false)
  val sum = svd.s.toArray.sum

  //Calculate the number of principal components that retains a variance of 95%
  val featureRange = (0 to 500)

  svd.s.toArray.zip(featureRange).foldLeft(0.0) {
    case (cum, (curr, component)) =>
      val percent = (cum + curr) / sum
      println(s"Component and percent ${component + 1} :: $percent :::: Singular value is : $curr")
      cum + curr
  }
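If we would rather pick this number programmatically than by eyeballing the printed percentages, a small helper along the lines of the following sketch (not part of the original recipe) applies the same cumulative heuristic as the foldLeft above and returns the smallest number of components that crosses a chosen threshold:

  //Hypothetical helper: smallest number of leading components whose cumulative
  //share of the singular values reaches the threshold (95% here)
  def componentsForThreshold(singularValues: Array[Double], threshold: Double = 0.95): Int = {
    val total = singularValues.sum
    val cumulativeShares = singularValues.scanLeft(0.0)(_ + _).tail.map(_ / total)
    cumulativeShares.indexWhere(_ >= threshold) match {
      case -1    => singularValues.length //threshold never reached; keep everything
      case index => index + 1
    }
  }

  println(s"Components needed : ${componentsForThreshold(svd.s.toArray)}")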

The steps that are involved are as follows:

  1. Mean-normalizing the training data.
  2. Extracting the principal components.
  3. Preparing the labeled data.
  4. Preparing the test data.
  5. Classifying and evaluating the metrics.

Mean-normalizing the training data

It is highly recommended that the data be centered before running PCA on it. We achieve this using the fit and transform functions of StandardScaler. However, since the scaler in Spark does not support mean-centering of a SparseVector, we'll use the Vectors.dense factory method to convert the features in each labeled point into a DenseVector:

val docs = sc.textFile("SMSSpamCollection").map(line => {
    val words = line.split("\t")
    Document(words.head.trim(), words.tail.mkString(" "))
  }).cache()

val labeledPointsWithTf = getLabeledPoints(docs)
val lpTfIdf = withIdf(labeledPointsWithTf).cache()

  //Split dataset
val spamPoints = lpTfIdf.filter(point => point.label == 1).randomSplit(Array(0.8, 0.2))
val hamPoints = lpTfIdf.filter(point => point.label == 0).randomSplit(Array(0.8, 0.2))

val trainingSpamSplit = spamPoints(0)
val trainingHamSplit = hamPoints(0)

val trainingData = trainingSpamSplit ++ trainingHamSplit

//The held-out 20% of spam and ham forms the test split used later
val testSplit = spamPoints(1) ++ hamPoints(1)

val unlabeledTrainData = trainingData.map(lpoint => Vectors.dense(lpoint.features.toArray)).cache()

  //Scale data - Does not support scaling of SparseVector.
  val scaler = new StandardScaler(withMean = true, withStd = false).fit(unlabeledTrainData)
  val scaledTrainingData = scaler.transform(unlabeledTrainData).cache()

Extracting the principal components

The computePrincipalComponents function is available on RowMatrix, so we wrap our scaled training data into a RowMatrix and then extract 100 principal components from it (as shown earlier, the number 100 comes from running SVD against a sample of the data and inspecting the singular value vector). Our training data is currently a 4419 x 5000 matrix: 4419 instances of data with 5000 features, the limit we set while generating the term frequencies using HashingTF. We then multiply this training matrix (4419 x 5000) by the principal component matrix (5000 x 100) to arrive at a 4419 x 100 matrix: 4419 instances of data with 100 features (the principal components). We can extract the feature vectors from this matrix through its rows field:

  val trainMatrix = new RowMatrix(scaledTrainingData)
  val pcomp: Matrix = trainMatrix.computePrincipalComponents(100)

  val reducedTrainingData = trainMatrix.multiply(pcomp).rows.cache()
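As a quick sanity check (a small addition, not part of the original recipe), we can confirm the shape of the reduced data: the row count should still be the number of training instances, and each row should now have 100 elements:

  //Verify that the reduced training data is (number of instances) x 100
  println(s"Reduced training data : ${reducedTrainingData.count()} rows x ${reducedTrainingData.first().size} columns")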

Preparing the labeled data

Now that we have reduced the data fifty-fold, the next step is to feed this reduced data to our algorithm and see how it fares. The classification algorithm (in this case, LogisticRegressionWithLBFGS) requires an RDD of LabeledPoints. To construct each LabeledPoint, we extract the label from the original trainingData and the feature vector from the dimension-reduced dataset:

  val reducedTrainingSplit = trainingData.zip(reducedTrainingData).map { case (labeled, reduced) => new LabeledPoint(labeled.label, reduced) }

Preparing the test data

Before running our test data through the algorithm, we need to bring it down to the same dimensions as the training data. This is achieved by multiplying the test matrix by the principal components extracted from the training data. As discussed earlier, we just need to make sure that we don't compute the principal components afresh here:

  val unlabeledTestData = testSplit.map(lpoint => lpoint.features)
  val testMatrix = new RowMatrix(unlabeledTestData)
  val reducedTestData = testMatrix.multiply(pcomp).rows.cache()
  val reducedTestSplit = testSplit.zip(reducedTestData).map { case (labeled, reduced) => new LabeledPoint(labeled.label, reduced) }

Classifying and evaluating the metrics

The final step is to classify and evaluate the results of the algorithm. This step is the same as in the classification recipe that we saw earlier. From the output, we can see that we not only reduced the number of features from 5,000 to 100, but also managed to maintain the accuracy of the algorithm at the same level:

val logisticWithBFGS = getAlgorithm(10, 1, 0.001)
  val logisticWithBFGSPredictsActuals = runClassification(logisticWithBFGS, reducedTrainingSplit, reducedTestSplit)
  calculateMetrics(logisticWithBFGSPredictsActuals, "Logistic with BFGS")

  def getAlgorithm(iterations: Int, stepSize: Double, regParam: Double) = {
    //The stepSize parameter is not used by the LBFGS optimizer
    val algo = new LogisticRegressionWithLBFGS()
    algo.setIntercept(true).optimizer.setNumIterations(iterations).setRegParam(regParam)
    algo
  }

  def runClassification(algorithm: GeneralizedLinearAlgorithm[_ <: GeneralizedLinearModel], trainingData: RDD[LabeledPoint],
    testData: RDD[LabeledPoint]): RDD[(Double, Double)] = {
    val model = algorithm.run(trainingData)
    println ("predicting")
    val predicted = model.predict(testData.map(point => point.features))
    val actuals = testData.map(point => point.label)
    val predictsAndActuals: RDD[(Double, Double)] = predicted.zip(actuals)
    println (predictsAndActuals.collect)
    predictsAndActuals
  }

  def calculateMetrics(predictsAndActuals: RDD[(Double, Double)], algorithm: String) {

    val accuracy = 1.0 * predictsAndActuals.filter(predActs => predActs._1 == predActs._2).count() / predictsAndActuals.count()
    val binMetrics = new BinaryClassificationMetrics(predictsAndActuals)
    println(s"************** Printing metrics for $algorithm ***************")
    println(s"Area under ROC ${binMetrics.areaUnderROC}")
    println(s"Accuracy $accuracy")

    val metrics = new MulticlassMetrics(predictsAndActuals)
    println(s"Precision : ${metrics.precision}")
    println(s"Confusion Matrix 
${metrics.confusionMatrix}")
    println(s"************** ending metrics for $algorithm *****************")
  }

With the area under the ROC staying at roughly the same level as before, we have considerably reduced the running time by cutting the number of features fifty-fold. This is the output:

************** Printing metrics for Logistic with BFGS ***************
Area under ROC 0.9428948576675849
Accuracy 0.9829136690647482
Confusion Matrix
965.0  3.0
16.0   128.0
************** ending metrics for Logistic with BFGS *****************

Dimensionality reduction of data for unsupervised learning

Unlike reducing the dimensions of labeled data, reducing the dimensionality of data for unsupervised learning is very simple: we just apply PCA to the entire dataset. This helps a lot in improving the performance of algorithms such as K-means, where every data point has to be placed in a high-dimensional space and the entire dataset has to be visited multiple times. Fewer features means fewer dimensions and less data to be held in memory.

For this recipe, we use the Iris.data dataset that we used for clustering earlier. The dataset has only four features, so it isn't a great candidate for dimensionality reduction as such. However, the process of reducing dimensions for unlabeled data is the same as for any other dataset.

The steps that are involved are as follows:

  1. Mean-normalizing the training data.
  2. Extracting the principal components.
  3. Arriving at the number of components.
  4. Evaluating the metrics.

Mean-normalizing the training data

As we saw earlier, mean-normalizing the data is a must before reducing its dimensions:

  val scaler = new StandardScaler(withMean = true, withStd = false).fit(data)
  val scaledData = scaler.transform(data).cache()
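The data variable used above is assumed to be the RDD[Vector] of Iris features built in the earlier clustering recipe; a minimal sketch of how it might be prepared from Iris.data (four comma-separated numeric columns followed by the species label) looks like this:

  import org.apache.spark.mllib.linalg.Vectors

  //Parse the four numeric columns of each line into a dense vector
  val data = sc.textFile("iris.data")
    .map(_.split(","))
    .filter(_.length == 5) //skip blank or malformed lines
    .map(cols => Vectors.dense(cols.take(4).map(_.toDouble)))
    .cache()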

Extracting the principal components

As we saw earlier, to compute the principal components, we wrap our scaled data into a RowMatrix. We then multiply that matrix by the principal component matrix to arrive at the reduced matrix, and extract the feature vectors from it through its rows field:

  val matrix = new RowMatrix(scaledData)
  val pcomp: Matrix = matrix.computePrincipalComponents(3)
  val reducedData = matrix.multiply(pcomp).rows

Arriving at the number of components

While we would like to keep the number of components as low as possible, the competing goal is to retain as much of the variance in the data as possible. In this case, a run against all four components shows that holding on to just two of them retains about 85% of the variance. Since we wanted to get close to 95%, 3 was chosen:

  val svd = matrix.computeSVD(4)

  val sum = svd.s.toArray.sum

  svd.s.toArray.zipWithIndex.foldLeft(0.0) {
    case (cum, (curr, component)) =>
      val percent = (cum + curr) / sum
      println(s"Component and percent ${component + 1} :: $percent :::: Singular value is : $curr")
      cum + curr
  }

The output is as follows:

Component and percent 1 :: 0.6893434455825798 :::: Singular value is : 25.089863978899867
Component and percent 2 :: 0.8544090583609627 :::: Singular value is : 6.0078525425063365
Component and percent 3 :: 0.9483881906752903 :::: Singular value is : 3.4205353829523646
Component and percent 4 :: 1.0 :::: Singular value is : 1.878502340103494

Evaluating the metrics

After we have reduced the dimensions of the data from four to three, we run the reduced data against a range of one to seven clusters to see the elbow bend. When we compare these results with those of K-means clustering without dimensionality reduction, they look practically the same:

 val clusterCost = (1 to 7).map { noOfClusters =>
    val kmeans = new KMeans()
      .setK(noOfClusters)
      .setMaxIterations(5)
      .setInitializationMode(KMeans.K_MEANS_PARALLEL) //KMeans||

    val model = kmeans.run(reducedData)
    (noOfClusters, model.computeCost(reducedData))
  }
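Since the original output is shown as a screenshot, a quick way to inspect the numbers directly (a small addition, not part of the original recipe) is to print the cost for each number of clusters:

  //Print the within-cluster cost for each k so the elbow can be read off numerically
  clusterCost.foreach { case (clusters, cost) =>
    println(f"Clusters: $clusters%d, cost: $cost%.4f")
  }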

The following screenshot shows the cost across various numbers of clusters:

[Screenshot: cluster cost plotted against the number of clusters]

Here is a screenshot that shows strikingly similar results for the elbow bend:

[Screenshot: elbow bend for the dimensionality-reduced data]

In this chapter, we first saw the difference between supervised and unsupervised learning. Then we explored a sample of the machine learning algorithms in Spark: LinearRegression for predicting continuous values, LogisticRegression and SVM for classification, K-means for clustering, and finally PCA for dimensionality reduction. There are plenty of other algorithms in Spark, and more are being added with every release, in both batch and streaming flavors.
