Chapter 4. Unsupervised Learning

Labeling a set of observations for classification or regression can be a daunting task, especially in the case of a large feature set. In some cases, labeled observations are either not available or not possible to create. In an attempt to extract hidden associations or structures from observations, the data scientist relies on unsupervised learning techniques to detect patterns or similarity in data.

The goal of unsupervised learning is to discover patterns of regularity and irregularity in a set of observations. These techniques are also applied to reduce the solution space or feature set, in a manner similar to the divide-and-conquer approach commonly used in computer science.

There are numerous unsupervised algorithms; some are more appropriate to handle dependent features while others generate more relevant groups in the case of hidden features [4:1]. In this chapter, you will learn three of the most common unsupervised learning algorithms:

  • K-means: Clustering observed features
  • Expectation-maximization (EM): Clustering observed and latent features
  • Principal components analysis (PCA): Reducing the dimension of the model

Any of these algorithms can be applied to technical analysis or fundamental analysis. Fundamental analysis of financial ratios and technical analysis of price movements are described in the Technical analysis section under Finances 101 in Appendix A, Basic Concepts. The K-means algorithm is fully implemented in Scala while expectation-maximization and principal components analysis leverage the Apache Commons Math library.

Clustering

Problems involving a large number of features for large datasets become quickly intractable, and it is quite difficult to evaluate the independence between features. Any computation that requires some level of optimization and, at a minimum, the computation of first-order derivatives requires a significant amount of computing power to manipulate high-dimensional matrices. As with many engineering fields, a divide-and-conquer approach to classifying very large datasets is quite effective. The objective is to reduce continuous, infinite, or very large datasets into a small group of observations that share some common attributes.

Visualization of data clustering

This approach is known as vector quantization. Vector quantization is a method that divides a set of observations into groups of similar size. The main benefit of vector quantization is that the analysis using a representative of each group is far simpler than an analysis of the entire dataset [4:2].

Clustering, also known as cluster analysis, is a form of vector quantization that relies on a concept of distance or similarity to generate groups known as clusters.

Note

Learning vector quantization (LVQ)

Vector quantization should not be confused with learning vector quantization. Learning vector quantization is a special case of artificial neural networks that relies on a winner-take-all learning strategy to compress signals, images, or videos.

This chapter introduces two of the most commonly applied clustering algorithms:

  • K-means, which is used for quantitative types and minimizes the total error (known as the reconstruction error) given the number of clusters and the distance formula.
  • Expectation-maximization (EM), which is a two-step probabilistic approach that maximizes the likelihood estimates of a set of parameters. EM is particularly suitable to handle missing data.

K-means clustering

K-means is a popular iterative clustering algorithm. The representative of each cluster is computed as the center of the cluster, known as the centroid. The similarity between observations within a single cluster relies on the concept of distance between observations.

Measuring similarity

There are many ways to measure the similarity between observations. The most appropriate measure has to be intuitive and avoid computational complexity. This section reviews three similarity measures:

  • The Manhattan distance
  • The Euclidean distance
  • The cosine distance

The Manhattan distance is defined by the absolute distance between two variables or vectors, {xi} and {yi}, of the same size:

d(x, y) = \sum_i |x_i - y_i|

The implementation is generic enough to compute the distance between two arrays of elements of different types as long as an implicit conversion between each of these types to Double values is already defined, as shown here:

def manhattan[T <% Double, U <% Double](x: Array[T], y: Array[U]): Double = (x, y).zipped.foldLeft(0.0)((s, t) => s + Math.abs(t._1 - t._2))

The ubiquitous Euclidean distance is defined as the square root of the sum of the squared differences between two vectors, {xi} and {yi}, of the same size:

d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}

def euclidean[T <% Double, U <% Double](x: Array[T], y: Array[U]): Double =  Math.sqrt((x, y).zipped.foldLeft(0.0)((s, t) => { val d = t._1 - t._2; s + d*d} ))

The cosine distance is defined as the cosine of the angle between two vectors, {xi} and {yi}, of the same size:

cos(x, y) = \sum_i x_i y_i / (\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2})

In this implementation, the computation of the dot product and the norms for each dataset is done simultaneously using the tuple within the fold method:

def cosine[T <% Double, U <% Double](x: Array[T], y: Array[U]): Double = {
  val zeros = (0.0, 0.0, 0.0)
  val norms = (x, y).zipped.foldLeft(zeros)((s, t) => 
     (s._1 + t._1*t._2, s._2 + t._1*t._1, s._3 + t._2*t._2))
  norms._1/Math.sqrt(norms._2*norms._3)
}
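
The following snippet is a quick sanity check, assuming the three methods defined above are in scope; the numeric results are easy to verify by hand:

val x = Array(1.0, 2.0, 3.0)
val y = Array(4.0, 6.0, 3.0)

println(manhattan(x, y))  // |1-4| + |2-6| + |3-3| = 7.0
println(euclidean(x, y))  // sqrt(9 + 16 + 0) = 5.0
println(cosine(x, y))     // (4 + 12 + 9)/(sqrt(14)*sqrt(61)) = 0.855 approximately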

Tip

Performance of zip

The scalar product of two vectors is one of the most common operations. It is tempting to implement the dot product using the generic zip method:

def dot(x: Array[Double], y: Array[Double]): Array[Double] = 
    x.zip(y).map(xy => xy._1 * xy._2)

A functional alternative is to use the Tuple2.zipped method:

def dot(x: Array[Double], y: Array[Double]): Array[Double] = (x, y).zipped map ( _ * _)

If readability is not a primary issue, you can always implement the dot method with a while loop.
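
Here is a minimal sketch of that while loop alternative, consistent with the element-wise versions shown above:

// Element-wise product with an explicit while loop (same signature as above)
def dot(x: Array[Double], y: Array[Double]): Array[Double] = {
  val result = new Array[Double](x.length)
  var i = 0
  while (i < x.length) {
    result(i) = x(i) * y(i)
    i += 1
  }
  result
}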

Overview of the K-means algorithm

The main advantage of the K-means algorithm (and the reason for its popularity) is its simplicity [4:3].

Note

Let's consider K clusters {Ck} with means {mk}. The K-means algorithm is indeed an optimization problem, the objective of which is to minimize the reconstruction error (or total error), defined as the total sum of the distances between the observations and the centroid of their assigned cluster:

\min_{m_1, ..., m_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i, m_k)
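
To make the objective concrete, here is a minimal sketch of the reconstruction error computation; observations, centroids, membership, and distance are hypothetical local values, not part of the implementation that follows:

// Sum of the distances between each observation and the centroid of its assigned cluster
def totalError(
    observations: Array[Array[Double]],
    centroids: Array[Array[Double]],
    membership: Array[Int],
    distance: (Array[Double], Array[Double]) => Double): Double =
  observations.zipWithIndex
              .map { case (x, n) => distance(centroids(membership(n)), x) }
              .sum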

The steps of the iterative algorithm are:

  1. Initialize the centroids or means mk of the K clusters.
  2. Assign observations to the nearest cluster given mk.
  3. Iterate until no observations are reassigned to a cluster:
    • Compute centroids mk that minimize the total error reconstruction for the current assignment
    • Reassign the observations given the new centroids mk

Step 1 – cluster configuration

The configuration of the K clusters consists of defining the following parameters for the K-means algorithm: the number of clusters K, the distance metric, the maximum number of iterations, and the initial values of the clusters' centroids.

Defining clusters

The first step is to define a cluster. A cluster is defined by the following parameters:

  • Centroid: center
  • The indices of the observations that belong to this cluster: members

The following code shows the definition of a cluster:

import scala.collection.mutable.ListBuffer

class Cluster[T <% Double](val center: DblVector) {
  val members = new ListBuffer[Int]

The cluster is responsible for managing its members (data points) at any point of the iterative computation of the K-means algorithm. It is assumed that a cluster will never contain the same data point twice.

The constructor of the Cluster class is implemented by the apply method in the companion object (for convenience, refer to the Class constructor template section in Appendix A, Basic Concepts):

object Cluster {
  def apply[T <% Double](c:DblVector):Cluster[T] = new Cluster[T](c)
}

At a minimum, a cluster should be able to manage its membership of observations, update its center, and compute the variance or standard deviation of all its member observations:

def += (n:Int): Unit = members.append(n)
def moveCenter(xt: XTSeries[Array[T]]): Cluster[T] = {
   val sums = members.map(xt(_).map(_.toDouble)).toList
                     .transpose
                     .map( _.sum)
  Cluster[T](sums.map( _ / members.size).toArray)
}

def stdDev(xt: XTSeries[Array[T]], distance: (DblVector, Array[T]) => Double): Double = {
   Stats[Double](members.map(xt( _))
             .map( distance(center, _)).toArray).stdDev
}

The three important methods that define the behavior of a cluster instance are as follows:

  • +=: Add a member (index of an observation in the original time series).
  • moveCenter: Create a new cluster with the existing members and a new centroid computed as the mean of all the observations contained in the cluster.
  • stdDev: Compute the standard deviation (or density) of all the observations contained in the cluster relative to its center. The distance between each member and the centroid is extracted through a map, and then folded to generate the statistics. The function to compute the distance between the center and an observation is an argument of the method. The default distance is Euclidean.

Note

Cluster selection

There are different ways to select the most appropriate cluster when reassigning an observation (updating its membership). In this implementation, we will select the cluster with the largest spread or lowest density. An alternative is to select the cluster with the largest membership.

Defining K-means

Let's declare the K-means algorithm class, KMeans, with its public methods.

The KMeans class takes the number of clusters, K, and the maximum number of iterations, maxIters, as parameters. The implicit conversion of type T to a Double is specified by the T <% Double view bound. An Ordering instance has to be passed implicitly as a parameter because it is required by the sortWith call in initialize and by the maxBy method. A Manifest is required to preserve the type of Array[T] despite the JVM's type erasure:

class KMeans[T <% Double](K: Int, maxIters: Int, distance: (DblVector,Array[T]) => Double)(implicit order: Ordering[T], m: Manifest[T]) extends PipeOperator[XTSeries[Array[T]], List[Cluster[T]]] {
  def |> : PartialFunction[XTSeries[Array[T]], List[Cluster[T]]]
  def initialize(xt:XTSeries[Array[T]]): List[Cluster[T]]

As with other data processing units, the extraction of K-means clusters is encapsulated by the pipe operator |>, so clustering can be integrated into a workflow using dependency injection described in the Dependency injection section in Chapter 2, Hello World!. The initialization of the centroids of each of the K clusters is performed by the private initialize method.

Initializing clusters

The initialization of the cluster centroids is important to ensure fast convergence of K-means. Solutions range from the simple random generation of centroids to the application of genetic algorithms to evaluate the fitness of centroid candidates. We selected an efficient and fast initialization algorithm developed by M. Agha and W. Ashour [4:4].

The steps of the initialization are as follows:

  1. Compute the standard deviation of the set of observations.
  2. Select the dimension k {xk,0, xk,1 … xk,n} with maximum standard deviation.
  3. Rank the observations by their increasing value of standard deviation for the dimension k.
  4. Divide the ranked observations set equally into K sets {Sm}.
  5. Find the median value of each set Sm (the observation at position size(Sm)/2).
  6. Use the corresponding observations as centroids.

The initialization algorithm is implemented by the private initialize method:

  def initialize(xt:XTSeries[Array[T]]): List[Cluster[T]]={
   val stats = statistics(xt) //1
   val maxSDevDim = Range(0,stats.size).maxBy (stats( _ ).stdDev)//2
   val rankedObs = xt.zipWithIndex
                     .map(x=> (x._1(maxSDevDim), x._2)) //2
                     .sortWith( _._1  < _._1) //3
   val halfSegSize = ((rankedObs.size>>1)/K).floor.toInt //4
   val centroids = rankedObs.filter(isContained( _, halfSegSize, rankedObs.size) ).map(n => xt(n._2)) //6
   Range(0, K).foldLeft(List[Cluster[T]]())((xs, i) => Cluster[T](centroids(i)) :: xs) //7
}

Let's deconstruct the implementation of the Agha-Ashour algorithm in the initialize method.

The statistics function is applied to the input time series to extract the standard deviation for each dimension of the observations set (line 1). The index of the dimension with the maximum standard deviation, maxSDevDim, is computed using the maxBy method (line 2). Then, the observations are ranked by their increasing value along that dimension, rankedObs (line 3).

The ordered sequence of observations is then broken into xt.size/K segments (line 4) and the indices of the centroids are selected as the midpoint (or median) observations of those segments using the filtering condition, isContained:

def isContained(t: (T,Int), hSz: Int, dim: Int): Boolean = 
    (t._2 % hSz == 0) && (t._2 % (hSz<<1) != 0)

The indices of the centroid in the time series are converted to actual observations using a map method (line 6). Finally, the list of clusters is generated using a fold (foldLeft) method on the range of cluster indices (0, K-1) (line 7).
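
As a quick illustration of the filtering condition, consider a hypothetical case of 12 ranked observations and K = 3 clusters; the selected indices fall near the midpoint of each segment:

// Worked example with hypothetical values: 12 ranked observations, K = 3
val halfSegSize = (12 >> 1) / 3   // = 2
val selected = (0 until 12).filter(n => n % halfSegSize == 0 && n % (halfSegSize << 1) != 0)
// selected: Vector(2, 6, 10) -> one index near the midpoint of each of the 3 segments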

Step 2 – cluster assignment

The second step in the K-means algorithm is the assignment of the observations to the clusters for which the centroids have been initialized in step 1. This feat is accomplished by the private assignToClusters method:

def assignToClusters(xt: XTSeries[Array[T]], clusters: List[Cluster[T]], membership: Array[Int]): Int = {
  xt.toArray
    .zipWithIndex
    .filter(x => { //1
       val nearestCluster = getNearestCluster(clusters, x._1)//2
       val reassigned = nearestCluster != membership(x._2) 
       clusters(nearestCluster) += x._2 //3
       membership(x._2) = nearestCluster //4
       reassigned
     }).size
}

The core of the assignment of observations to each cluster is the filter on the time series (line 1). The filter computes the index of the closest cluster and checks whether the observation is to be reassigned (line 2). The observation at the index x._2 is added to the nearest cluster, clusters(nearestCluster) (line 3). The current membership of the observations is then updated (line 4).

The cluster closest to an observation is computed by the getNearestCluster method as follows:

def getNearestCluster(clusters: List[Cluster[T]], x: Array[T]): Int = {
  clusters.zipWithIndex.foldLeft((Double.MaxValue, 0))((p, c) => { 
      val measure = distance(c._1.center, x)
      if (measure < p._1) (measure, c._2) else p
  })._2
}
A fold is used to extract from the list of clusters the cluster that is closest to the observation x using the distance metric defined in the K-means constructor.

Step 3 – iterative reconstruction

The final step is to implement the iterative computation of the reconstruction error. In this implementation, the iteration terminates when no more observations are reassigned to different clusters.

The generation of the K clusters is executed by the data transformation |>:

def |> : PartialFunction[XTSeries[Array[T]], List[Cluster[T]]] = {
  case xt: XTSeries[Array[T]] if(xt.size > 2 && xt(0).size > 0) => {
    val clusters = initialize(xt)  //1

    if( clusters.isEmpty) List.empty
    else {
      val membership = Array.fill(xt.size)(0)
      val reassigned = assignToClusters(xt, clusters, membership) //2
      var newClusters: List[Cluster[T]] = List.empty
      Range(0, maxIters).find( _ => {
        newClusters = clusters.map( c => {
          if( c.size > 0) c.moveCenter(xt) 
          else clusters.filter( _.size > 0)
                       .maxBy( _.stdDev(xt, distance))
        }) //3
        assignToClusters(xt, newClusters, membership) == 0
      }) match {
        case Some(index) => newClusters
        case None => { … }
      } //4
    }
  }
}

As described in the algorithm overview section, the main method creates and initializes the clusters (line 1), initializes the membership of all the observations, and assigns the observations to clusters using the assignToClusters method (line 2). The iteration recomputes the centroid of each non-empty cluster using the moveCenter method and substitutes the cluster with the highest standard deviation for any empty cluster (line 3). The iterative loop exits when no more reassignments are needed (line 4).

Tip

K-means algorithm exit condition

In some rare instances, the algorithm may reassign the same few observations between clusters, preventing its convergence toward a solution in a reasonable time. Therefore, it is recommended to add a maximum number of iterations as an exit condition. If K-means does not converge with the maximum number of iterations, then the cluster centroids need to be reinitialized and the iterative process needs to be executed once again.
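
One possible way to enforce this recommendation is to wrap the execution in a retry loop that reinitializes the centroids whenever convergence is not reached. The following sketch uses a hypothetical runKMeans function and is not part of the implementation above:

// A sketch of the reinitialize-and-retry policy. runKMeans is a hypothetical
// function that reinitializes the centroids, runs K-means, and returns Some(clusters)
// on convergence or None if the maximum number of iterations is exceeded.
def kMeansWithRetries[T](maxRetries: Int)(runKMeans: () => Option[List[T]]): Option[List[T]] = {
  var result: Option[List[T]] = None
  var attempt = 0
  while (result.isEmpty && attempt < maxRetries) {
    result = runKMeans()
    attempt += 1
  }
  result
}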

The companion object for KMeans implements the apply constructor and the computation of the stdDev standard deviation for each cluster. The default constructor uses the Euclidean distance:

def apply[T <% Double](K: Int, maxIters: Int)(implicit order: Ordering[T], m: Manifest[T]): KMeans[T] = new KMeans[T](K, maxIters, euclidean)
def stdDev[T <% Double](c: List[Cluster[T]], xt: XTSeries[Array[T]]): List[Double] =  c.map( _.stdDev(xt, euclidean))

The stdDev method computes the standard deviation of the distances between each data point that belongs to a c cluster and its centroid.

Note

Centroid versus mean

The terms centroid and mean refer to the same entity: the center of a cluster. This chapter uses these two terms interchangeably.

Note that an Ordering trait and a Manifest have to be provided implicitly to the apply constructor because there is no guarantee that the client code provides these capabilities at runtime.

Curse of dimensionality

A model with a significant number of features (high dimensions) requires a larger number of observations in order to extract robust clusters. K-means clustering with very small datasets, of size less than 50, produces models with high bias and a limited number of clusters that are affected by the order of observations [4:5]. I have been using the following simple empirical rule of thumb for a training set of size n, expected K clusters, and N features: n < K.N.

Note

Dimensionality versus size of training set

The issue with the dimensionality of models versus the number of observations is not specific to unsupervised learning algorithms. All supervised learning techniques face the same challenge to set up a viable training plan.

Whichever empirical rule you follow, such a restriction is particularly an issue when analyzing stocks using historical quotes. Let's consider our example of using technical analysis to categorize stocks according to their price behavior over a period of 1 year (or approximately 250 trading days). The dimension of the problem is 250 (250 daily closing prices). To satisfy the rule, the number of stocks (observations) would have to exceed several hundred!

Price model for K-means clustering

There are options to get around this limitation and shrink the number of observations; among them are:

  • Sampling the trading data without losing a significant amount of information from the raw data, assuming the distribution of observations follows a known probability density function.
  • Smoothing the data to remove the noise as seen in Chapter 3, Data Preprocessing, assuming the noise is Gaussian. In our test, a smoothing technique will remove the price outliers for each stock and therefore reduce the number of features (trading sessions). This approach differs from the sampling approach because it does not require an assumption that the dataset follows a known density function. On the other hand, the reduction of features will be less significant.

These approaches are workaround solutions at best, used for the sake of this tutorial, but they are not recommended for actual commercial analytical applications. The principal component analysis introduced in the last section of this chapter is one of the most reliable dimension reduction techniques.

Experiment

The objective is to extract clusters from the price action of a set of stocks between January 1 and December 31, 2013, used as features. For this test, 127 stocks are randomly selected from the S&P 500 list. The following chart visualizes the behavior of the normalized price of a subset of these 127 stocks:

Price action of stocks used in K-means clustering

The key is to select the appropriate features prior to clustering and the time window to operate on. It would make sense to consider the entire historical price over the 252 trading days as a feature. However, the number of observations (stocks) is too limited to use the entire price range. The (SAMPLES = 50) observations are the stock closing price for each trading session between the 80th and 130th days. The adjusted daily closing prices are normalized using the minimum and maximum values.
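
The min-max normalization mentioned here can be sketched as follows; this is a generic helper for illustration only, not the normalization performed by the DataSource class used later:

// A minimal sketch of min-max normalization of a series of prices into [0, 1]
def minMaxNormalize(prices: Array[Double]): Array[Double] = {
  val min = prices.min
  val max = prices.max
  if (max > min) prices.map(p => (p - min) / (max - min)) 
  else prices.map(_ => 0.0)   // degenerate case: constant series
}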

First, let's create a simple function to execute the K-means algorithm:

val MAX_ITERS = 150
def run(K: Int, obs: DblMatrix): Unit = {
  val kmeans = KMeans[Double](K, MAX_ITERS) //1

  val clusters = kmeans |> XTSeries[DblVector](obs) //2
  clusters.foreach( _.center.foreach( show( _ ))) //3
  clusters.map( _.stdDev(XTSeries[DblVector](obs), euclidean)).foreach( show( _ ) )  //4
}

The KMeans class is first initialized with a number of clusters, K, and a maximum number of iterations, MAX_ITERS (line 1). These two parameters are domain and problem specific. The clustering algorithm is executed (line 2) returning a list of clusters. The clusters' centroid information is then displayed (line 3) and the standard deviation is computed for each of the clusters for a given number of clusters, K, and observations, obs (line 4).

Let's load the data from CSV files using the DataSource class (refer to the Data extraction section in Appendix A, Basic Concepts):

final val path = "resources/data/chap4/"
val extractor = YahooFinancials.adjClose :: List[Array[String] =>Double]() // 5
def symbols = DataSource.listSymbols(path)  //6
        
final val START = 80
final val SAMPLES = 50
val normalize=true
val prices = symbols.map(s =>DataSource(s,path,normalize) |> extractor)  //7
prices.find(_.isEmpty) match {  //8
  case Some(noPrice) => { … }
  case None => {
    val values = prices.map(x => x(0))
                       .map(_.drop(START).take(SAMPLES))
    args.map(_.toInt) foreach( run(_, values)) //9
  }
}

As mentioned earlier, the cluster analysis applies to the closing price in the range between the 80th and 130th trading day. The extractor is defined to extract the adjusted closing price for a stock whose price information is retrieved from YahooFinancials (line 5). The list of stock symbols is used to extract price information from CSV files located at the path (line 6). For instance, the ticker symbol for General Electric Corp. is GE and the trading data is located in GE.csv.

The 50 daily prices for each stock are extracted by an instance of DataSource (line 7). The find method detects whether the price extraction failed for any of the stocks (line 8). The normalized data for the specific time window is extracted by a combination of calls to the drop and take Scala array methods, and the run method introduced earlier is invoked for each number of clusters passed as an argument (line 9).

The first test run is executed with K=3 clusters. The mean (or centroid) vector for each cluster is plotted as follows:

Chart of means of clusters using K-means K=3

The mean vectors of the three clusters are quite distinctive. The top and bottom mean vectors, 1 and 2 in the chart, have respective standard deviations of 0.34 and 0.27 and share a very similar pattern. The difference between the elements of the mean vectors of clusters 1 and 2 is almost constant: 0.37. The cluster with the mean vector 3 represents the group of stocks that behave like the stocks in cluster 2 at the beginning of the time period, and behave like the stocks in cluster 1 towards the end of the time period.

This behavior can be easily explained by the fact that the time window or trading period, the 80th to 130th trading day, corresponds to the shift in the monetary policy of the Federal Reserve with regard to its quantitative easing program. Here is the list of stocks for each of the clusters whose centroid values are displayed on the chart:

  • Cluster 1: AET, AHS, BBBY, BRCM, C, CB, CL, CLX, COH, CVX, CYH, DE, DG, DHI, DO, DUK, EA, EBAY, EXC, EXP, FE, GLW, GPS, IBM, JCP, JNJ, JWN, K, KF, KMI, KO, KRFT, LEN, LINC, LRCX, MSFT, NVMI, THC, XRT
  • Cluster 2: AA, AAPL, ADBE, ADSK, AFAM, AMZN, AU, BHI, BTU, CAT, CCL, CCMP, COP, CSC, CU, DOW, EMR, ENTG, ETFC, FCX, FDX, FFIV, FISV, FLIR, FLR, FLS, FTR, GLD, GRMN, GT, JCI, QCOM, QQQ, SIL, SLV, SLW
  • Cluster 3: ADM, ADP, AXP, BA, BBT, BEN, BK, BSX, CA, CBS, CCE, CELG, CHK, CI, CME, CMG, CSCO, CVS, DAL, DD, DNB, EMC, EXPE, F, FDO, FITB, FMC, GCI, GE, GM, GME, GS, HCA, JNPR, JPM, KLAC, LH, LLL, LM, LMT, LNC, LO, MKSI, MU, NEM, TRW, TXN, UNH, WDC, XLF, XLNX, ZNGA

Let's evaluate the impact of the number of clusters K on the characteristics of each cluster.

Tuning the number of clusters

We repeat the previous test on the 127 stocks and the same time window with the number of clusters varying from 2 to 15.

The mean (or centroid) vector for each cluster is plotted as follows for K = 2:

Chart of means of clusters using K-means K=2

The chart of the results of the K-means algorithm with 2 clusters shows that the mean vector for the cluster labeled 2 is similar to the mean vector labeled 3 on the chart with K = 3 clusters. However, the cluster with the mean vector 1 somewhat reflects the aggregation or summation of the mean vectors for the clusters 1 and 3 in the chart for K = 3. The aggregation effect explains why the standard deviation for cluster 1, 0.55, is twice that for cluster 2, 0.28.

The mean (or centroid) vector for each cluster is plotted as follows for K = 5:

Chart of means of clusters using K-means K=5

In this chart, we can assess that the clusters 1 (with the highest mean), 2 (with the lowest mean), and 3 are very similar to the clusters with the same labels in the chart for K = 3. The cluster with the mean vector 4 contains stocks whose behaviors are quite similar to those in cluster 3, but in the opposite direction. In other words, the stocks in clusters 3 and 4 reacted in opposite ways following the announcement of the change in the monetary policy.

In the tests with high values of K, the distinction between the different clusters becomes murky, as shown in the following chart for K = 10:

Chart of means of clusters using K-means K=10

The means for clusters 1, 2, and 3 seen in the first chart for the case K = 3 are still visible. It is fair to assume that these are very likely the most reliable clusters. These clusters happened to have a low standard deviation or high density.

Let's define the density of a cluster Cj with a centroid cj as the inverse of the sum of the Euclidean distances between each member of the cluster and its mean (or centroid):

density(C_j) = 1 / \sum_{x_i \in C_j} d(x_i, c_j)
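
A simple sketch of this density computation, using plain arrays rather than the XTSeries type, could look like this:

// A minimal sketch: the inverse of the sum of the Euclidean distances between
// each member of a cluster and its centroid. 'members' and 'centroid' are
// hypothetical plain arrays.
def density(members: Array[Array[Double]], centroid: Array[Double]): Double = {
  val sumOfDistances = members.map(x =>
    Math.sqrt(x.zip(centroid).map { case (xi, ci) => (xi - ci)*(xi - ci) }.sum)
  ).sum
  1.0 / sumOfDistances
}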

The density of the cluster is plotted against the number of clusters with K = 1 to 13:

Bar chart of the average cluster density for K = 1 to 13

As expected, the average density of each cluster increases as K increases. From this experiment, we can draw the simple conclusion that the density of each cluster does not significantly increase in the test runs for K = 5 and beyond. You may observe that the density does not always increase as the number of clusters increases (K = 6 to K = 11). The anomaly can be explained by the following three factors:

  • The original data is noisy
  • The model is somewhat dependent on the initialization of the centroids
  • The exit condition is too loose

Validation

There are several methodologies to validate the output of a K-means algorithm, from purity to mutual information [4:6]. One effective way to validate the output of a clustering algorithm is to label each cluster and run those clusters through a new batch of labeled observations. For example, if during one of these tests you find that one of the clusters, CC, contains most of the commodity-related stocks, then you can select another commodity-related stock, SC, which is not part of the first batch, and run the entire clustering algorithm again. If SC is assigned to CC, then the clustering has performed as expected. If this is the case, you should run a new set of stocks, some of which are commodity related, and measure the number of true positives, true negatives, false positives, and false negatives. The precision, recall, and F1 measures introduced in the Assessing a model section of Chapter 2, Hello World!, confirm whether the tuning parameters and labels you selected for your clusters are indeed correct.
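
As a reminder of these metrics (introduced in Chapter 2), a minimal computation from the counts of true positives, false positives, and false negatives could be sketched as follows:

// Precision, recall, and F1 from the counts of true positives (tp),
// false positives (fp), and false negatives (fn)
def f1Measures(tp: Int, fp: Int, fn: Int): (Double, Double, Double) = {
  val precision = tp.toDouble / (tp + fp)
  val recall = tp.toDouble / (tp + fn)
  val f1 = 2.0 * precision * recall / (precision + recall)
  (precision, recall, f1)
}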

Tip

Validation

The quality of the clusters, as measured by the F1 statistic, depends on the labeling of the clusters and the rule (that is, label a cluster with the industry that has the highest relative percentage of stocks in the cluster) used to assign a label. This process is very subjective. The only practical way to validate the labeling is to evaluate several labeling schemes and select the one that generates the highest F1 statistic.

We reviewed some of the tuning parameters that impact the quality of the results of the K-means clustering. They are as follows:

  • Initial selection of centroid
  • Number of K clusters

In some cases, the similarity criterion (that is, Euclidean distance versus cosine value) can have an impact on the cleanness or density of the clusters.

The final and important consideration is the computational complexity of the K-means algorithm. The previous sections of the chapter described some of the performance issues with K-means and possible remedies.

Despite its many benefits, the K-means algorithm does not handle missing data or unobserved features very well. Features that depend on each other indirectly may in fact depend on a common hidden (also known as latent) variable. The expectation-maximization algorithm described in the next section addresses some of these limitations.

Expectation-maximization (EM) algorithm

The expectation-maximization algorithm was originally introduced to estimate the maximum likelihood in the case of incomplete data [4:7]. It is an iterative method to compute the model features that maximize the likelihood estimate for the observed values, taking into account unobserved values.

The iterative algorithm consists of computing:

  • The expectation, E, of the maximum likelihood for the observed data by inferring the latent values (E-step)
  • The model features that maximize the expectation E (M-step)

The expectation-maximization algorithm is applied to solve clustering problems by assuming that each latent variable follows a Normal or Gaussian distribution. This is similar to the K-means algorithm for which the distance of each data point to the center of each cluster follows a Gaussian distribution [4:8]. Therefore, a set of latent variables is a mixture of Gaussian distributions.

Gaussian mixture model

The observed variables X can be visualized as the behavior (or symptoms) of the model, for which the latent variables Z are the root causes:

Visualization of observed and latent features

The latent values Z follow a Gaussian distribution. For the statisticians among us, the mathematics of a mixture model is described in the following information box.

Note

The mixture model

If {xi} is a set of observed features associated with latent features {zk}, the probability of the feature xi, given that zk has a value j, is:

p(x_i | z_k = j) = p_j(x_i | \theta_j)

The probability p is called the base distribution. If we extend to the entire model, θ= {xi, zk}, the conditional probability is defined as follows:

p(x | \theta) = \sum_j w_j \, p_j(x | \theta_j), with \sum_j w_j = 1

The most widely used mixture model is the Gaussian mixture model that represents the base distribution p as a Normal distribution and the conditional probability as a weighted Normal multivariate distribution:

p(x | \theta) = \sum_j w_j \, \mathcal{N}(x | \mu_j, \Sigma_j)

EM overview

As far as the implementation is concerned, the expectation-maximization algorithm can be broken down into three stages:

  1. The computation of the log likelihood for the model features given some latent variables (LL).
  2. The computation of the expectation of the log likelihood at iteration t (E-step).
  3. The maximization of the expectation at iteration t (M-step).

Note

Log likelihood

  • LL: Let's consider a set of observed variables X = {xi} and latent variables Z = {zi}. The log likelihood for X given Z is:
    \log p(X | \theta) = \log \sum_Z p(X, Z | \theta)
  • E-step: The expectation for the model variable θ at iteration t is computed as:
    Q(\theta | \theta^{(t)}) = E_{Z|X,\theta^{(t)}}[\log p(X, Z | \theta)]
  • M-step: The function Q is maximized for the model features θ as:
    \theta^{(t+1)} = \arg\max_{\theta} Q(\theta | \theta^{(t)})

A formal, detailed, but short mathematical formulation of the EM algorithm can be found in S. Borman's tutorial [4:9].
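
Before delegating the computation to a library, it may help to see the E-step and M-step spelled out for the simplest possible case: a univariate mixture of K Gaussian distributions. The following sketch is purely illustrative and is unrelated to the Apache Commons Math implementation used in the next section; all names are hypothetical.

// Illustrative single EM iteration for a univariate Gaussian mixture with K components.
// weights, means, and variances are the model features (theta); x are the observations.
def gaussian(x: Double, mean: Double, variance: Double): Double =
  Math.exp(-(x - mean)*(x - mean)/(2.0*variance)) / Math.sqrt(2.0*Math.PI*variance)

def emStep(x: Array[Double], weights: Array[Double], means: Array[Double], 
    variances: Array[Double]): (Array[Double], Array[Double], Array[Double]) = {
  val K = weights.length
  // E-step: posterior probability (responsibility) of each component for each observation
  val resp = x.map(xi => {
    val p = Array.tabulate(K)(k => weights(k)*gaussian(xi, means(k), variances(k)))
    val sum = p.sum
    p.map(_ / sum)
  })
  // M-step: re-estimate the weights, means, and variances that maximize the expectation
  val nk = Array.tabulate(K)(k => resp.map(_(k)).sum)
  val newWeights = nk.map(_ / x.length)
  val newMeans = Array.tabulate(K)(k => 
    x.zip(resp).map{ case (xi, r) => r(k)*xi }.sum / nk(k))
  val newVariances = Array.tabulate(K)(k =>
    x.zip(resp).map{ case (xi, r) => r(k)*(xi - newMeans(k))*(xi - newMeans(k)) }.sum / nk(k))
  (newWeights, newMeans, newVariances)
}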

Implementation

Let's implement the three steps (LL, E-step, and M-step) in Scala. The internal calculations of the EM algorithm are a bit complex and overwhelming. You may not benefit much from the details of a specific implementation such as the computation of the eigenvalues of the covariance matrix of the expectation of the log likelihood. This implementation hides some complexities by using the Apache Commons Math library package [4:10].

Tip

Inner workings of EM

You may want to download the source code for the implementation of the EM algorithm in the Apache Commons Math library if you need to understand the condition for which an exception is thrown.

First, let's define convenient internal types:

type EM = MultivariateNormalMixtureExpectationMaximization
type EMOutput = List[(Double, DblVector, DblVector)]
import scala.collection.JavaConversions._ //1

The constructor of the MultivariateEM class uses the standard template for machine learning algorithm classes:

  • Parameterized view bound type
  • Implementation of EM as a data transformation by extending PipeOperator

Here is an implementation of the constructor of MultivariateEM:

class MultivariateEM[T <% Double](K: Int) extends PipeOperator[XTSeries[Array[T]], EMOutput]

The Apache Commons Math Java implementation of the EM uses Java container classes that need to be explicitly converted to Scala collections. Those conversions are defined in the JavaConversions package (line 1).

The implementation of the EM algorithm in the data transformation |> operator uses the Apache Commons Math MultivariateNormalMixture class for the Gaussian mixture model and the MultivariateNormalMixtureExpectationMaximization class for the EM algorithm:

def |> : PartialFunction[XTSeries[Array[T]], EMOutput] = {
 case xt: XTSeries[Array[T]] if(xt.size>0 && dimension(xt)>0) =>{
   val data: DblMatrix = xt  //2
   val multivariateEM = new EM(data)
   val est = MultivariateNormalMixtureExpectationMaximization
             .estimate(data, K)
   multivariateEM.fit(est) //3

   val newMixture = multivariateEM.getFittedModel //4
   val components = newMixture.getComponents.toList //5
   components.map(p => (p.getKey.toDouble, p.getValue.getMeans, p.getValue.getStandardDeviations)) //6
….

Let's look at the main |> method of the MultivariateEM wrapper class. The first step is to convert the time series into a primitive matrix of Double with observations and historical quotes as rows and the stock symbols as columns (line 2).

The initial mixture of Gaussian distributions can be provided by the user or can be extracted from the dataset as an estimate (line 3). The getFittedModel method triggers the M-step (line 4).

The Apache library uses Java primitives that need to be converted to Scala types using the package import scala.collection.JavaConversions. An instance of java.util.List is converted to scala.collection.immutable.List using toList, which invokes the asScalaIterator method of WrapAsScala, one of the base traits of JavaConversions (line 5).
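
For reference, this is the kind of implicit conversion at work; the snippet below is a standalone illustration and not part of the clustering code:

import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConversions._

// The implicit conversions in JavaConversions allow a java.util.List to be
// treated as a Scala collection, so toList yields a scala.collection.immutable.List
val javaList = new JArrayList[Double]()
javaList.add(1.0)
javaList.add(2.0)
val scalaList: List[Double] = javaList.toList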

The key-value pairs of the type <Double, MultivariateNormalDistribution>, extracted from the fitted model returned by getFittedModel, are converted to tuples containing the mean and standard deviation for each cluster (line 6).

Tip

Third-party library exceptions

Scala does not enforce the declaration of exceptions as part of the signature of a method. Therefore, there is no guarantee that all types of exceptions will be caught locally. This problem occurs when exceptions are thrown from a third-party library in two scenarios:

  • The documentation of the API does not list all the types of exceptions
  • The library is updated and a new type of exception is added to a method

One easy workaround is to leverage the Scala exception-handling mechanism:

  import scala.util.{Try, Success, Failure}

  Try {
       ..
  } match {
      case Success(results) => …
      case Failure(exception) => ...
  }

Testing

Let's apply the MultivariateEM class to the clustering of the same 127 stocks used in evaluating the K-means algorithm.

As discussed in the paragraph related to the curse of dimensionality, the number of stocks (127) to analyze restricts the number of observations to be used by the EM algorithm. A simple option is to filter out some of the noise of the stocks and apply a basic sampling method. The maximum sampling rate is restricted by the frequencies in the spectrum of noises of different types in the historical price of every stock.

Tip

Filtering and sampling

The preprocessing of the data using a combination of a simple moving average and fixed interval sampling prior to clustering is very rudimentary in this example. For instance, we cannot assume that the historical quotes of all the stocks share the same noise characteristics. The noise pattern in the quotes of momentum and heavily traded stocks is certainly different from that of blue-chip securities with strong ownership, which are mostly held by large mutual funds.

The sampling rate should take into account the spectrum of frequency of the noise. It should be set as at least twice the frequency of the noise with the lowest frequency.

The object of the test is to evaluate the impact of the sampling rate, samplingRate, and the number K of clusters used in the EM algorithm:

val extractor = YahooFinancials.adjClose :: List[Array[String] =>Double]() //1

val period = 8
val samplingRate = 10
val smAv = SimpleMovingAverage[Double](period) //2
val obs = DataSource.listSymbols(path).map(sym => { //3
  val xs = DataSource(sym, path, true) |> extractor //2
  val values : XTSeries[Double] = XTSeries.|>(xs).head //4
  val filtered = smAv |> values
  filtered.zipWithIndex //5
          .drop(period+1).toArray //6
          .filter( _._2%samplingRate==0)
          .map( _._1)
})

The first step is to extract the historical quotes for all the stocks using the same extractor as in the K-means test case (line 1).

The symbols of the stocks under consideration are extracted from the names of the files in the path directory. The historical data is contained in the CSV file named path/STOCK_NAME.csv (line 3). An implicit conversion is triggered by an assignment of values of the type XTSeries[Double] (line 4). The simple moving average algorithm zeroes out the first period values of the smoothed data, filtered (line 5). Those null values have to be dropped before applying the sampling (line 6).

The first test is to execute the EM algorithm with K=3 clusters and a sampling period of 10 on data smoothed by a simple moving average with a period of 8:

MultivariateEM[Double](K) |> XTSeries[DblVector](obs) foreach (…)

The driver prints the key (line 3), the mean (coordinates of the centroid vector) (line 4), and the standard deviation for each component (cluster).

The sampling of historical prices of the 127 stocks between January 1, 2013 and December 31, 2013 with a frequency of 0.1 hertz produces 24 data points. The following chart displays the mean or centroid of each of the 3 clusters:

Chart of the normalized means per cluster using EM K=3

The mean vectors of clusters 2 and 3 have similar patterns, which may suggest that 2 components or clusters could provide a first insight into the similarity within groups of stocks. The following is a chart of the normalized standard deviation per cluster using EM K = 3:

Chart of the normalized standard deviation per cluster using EM K=3

The distribution of the standard deviation along the mean vector of each cluster can be explained by the fact that the prices of stocks from a couple of industries went down in unison, while others went up as a semihomogeneous group, following the announcement from the Federal Reserve that the monthly quantity of bonds purchased as part of the quantitative easing program would be reduced in the near future.

Note

Relation to K-means

You may wonder what is the relation between EM and K-means as both techniques address the same problem. The K-means algorithm assigns each observation uniquely to one and only one cluster. The EM algorithm assigns an observation based on posterior probability. K-means is a special case of the EM for Gaussian mixtures [4:11].

Online EM

Online learning is a powerful strategy for training a clustering model when dealing with very large datasets. This strategy has regained interest from scientists lately. The description of online EM is beyond the scope of this tutorial. However, you may need to know that there are several algorithms available for online EM if you ever have to deal with large datasets: batch EM, stepwise EM, incremental EM, and Monte Carlo EM [4:12].
