Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 4. Unsupervised Learning

Labeling a set of observations for classification or regression can be a daunting task, especially in the case of a large feature set. In some cases, labeled observations are either unavailable or not possible to create. In an attempt to extract some hidden associations or structures from observations, the data scientist relies on unsupervised learning techniques to detect patterns or similarity in data.

The goal of unsupervised learning is to discover patterns of regularities and irregularities in a set of observations. These techniques are also applied in reducing the solution or feature space.

There are numerous unsupervised algorithms; some are more appropriate to handle dependent features while others generate affinity groups in the case of hidden features [4:1]. In this chapter, you will learn three of the most common unsupervised learning algorithms:

K-means: This used for clustering observed features
Expectation-maximization (EM): This is used for clustering observed and latent features
Principal Components Analysis (PCA): This is used to reduce the dimension of the model

Any of these algorithms can be applied to technical analysis or fundamental analysis. Fundamental analysis of financial ratios and technical analysis of price movements is discussed in the Technical analysis section under Finances 101 in the Appendix A, Basic Concepts. The K-means algorithm is fully implemented in Scala while expectation-maximization and Principal Components Analysis leverage the Apache Commons Math library.

The chapter concludes with a brief overview of dimension reduction techniques for non-linear models.

Clustering

Problems involving a large number of features for large datasets become quickly intractable, and it is quite difficult to evaluate the independence between features. Any computation that requires some level of optimization and, at a minimum, the computation of first order derivatives requires a significant amount of computing power to manipulate high-dimension matrices. As with many engineering fields, a divide-and-conquer approach to classifying very large datasets is quite effective. The objective is to reduce very large sets of observations into a small group of observations that share some common attributes.

Visualization of data clustering

This approach is known as vector quantization. Vector quantization is a method that divides a set of observations into groups of similar size. The main benefit of vector quantization is that the analysis using a representative of each group is far simpler than an analysis of the entire dataset [4:2].

Clustering, also known as cluster analysis, is a form of vector quantization that relies on a concept of distance or similarity to generate groups known as clusters.

Note

Learning vector quantization (LVQ)

Vector quantization should not be confused with learning vector quantization; learning vector quantization is a special case of artificial neural networks that relies on a winner-take-all learning strategy to compress signals, images, or videos.

This chapter introduces two of the most commonly applied clustering algorithms:

K-means: This is used for quantitative types and minimizes the total error (known as the reconstruction error) given the number of clusters and the distance formula.
Expectation-maximization (EM): This is a two-step probabilistic approach that maximizes the likelihood estimates of a set of parameters. EM is particularly suitable for handling missing data.

K-means clustering

K-means is a popular clustering algorithm that can be implemented either iteratively or recursively. The representative of each cluster is computed as the center of the cluster, known as the centroid. The similarity between observations within a single cluster relies on the concept of distance (or similarity) between observations.

Measuring similarity

There are many ways to measure the similarity between observations. The most appropriate measure has to be intuitive and avoid computational complexity. This section reviews three similarity measures:

The Manhattan distance
The Euclidean distance
The normalized inner or dot product

The Manhattan distance is defined by the absolute distance between two variables or vectors, {x_i} and {y_i}, of the same size (M1):

The implementation is generic enough to compute the distance between two vectors of elements of different types as long as an implicit conversion between each of these types to the Double values is already defined, as follows:

def manhattan[T <: AnyVal, U <: AnyVal](
    x: Array[T], 
    y: Array[U])(implicit f: T => Double): Double = 
 (x,y).zipped.map{case (u,v) => Math.abs(u-v)}.sum

The ubiquitous Euclidean distance between two vectors, {x_i} and {y_i}, of the same size is defined by the following formula (M2):

The code will be as follows:

def euclidean[T <: AnyVal, U <: AnyVal](
    x: Array[T], 
    y: Array[U])(implicit f: T => Double): Double = 
  Math.sqrt((x,y).zipped.map{case (u,v) =>u-v}.map(sqr(_)).sum)

The normalized inner product or cosine distance between two vectors, {x_i} and {y_i}, is defined by the following formula (M3):

In this implementation, the computation of the dot product and the norms for each dataset is done simultaneously using the tuple within the fold method:

def cosine[T <: AnyVal, U < : AnyVal] (
     x: Array[T], 
     y: Array[U])(implicit f : T => Double): Double = {
  val norms = (x,y).zipped
          .map{ case (u,v) => Array[Double](u*v, u*u, v*v)}
          ./:(Array.fill(3)(0.0))((s, t) => s ++ t) 
  norms(0)/Math.sqrt(norms(1)*norms(2))
}

Note

Performance of zip and zipped

The scalar product of two vectors is one of the most common operations. It is tempting to implement the dot product using the generic zip method:

 def dot(x:Array[Double], y:Array[Double]): Array[Double] = x.zip(y).map{case(x, y) => f(x,y) )

A functional alternative is to use the Tuple2.zipped method:

 def dot(x:Array[Double], y:Array[Double]): Array[Double] = (x, y).zipped map ( _ * _)

If readability is not a primary issue, you can always implement the dot method with a while loop, which prevents you from using the ubiquitous while loop.

Defining the algorithm

The main advantage of the K-means algorithm (and the reason for its popularity) is its simplicity [4:3].

Note

K-means objective

M4: Let's consider K clusters, {C_k}, with means or centroids, {m_k}. The K-means algorithm is indeed an optimization problem whose objective is to minimize the reconstruction or total error defined as the total sum of the distance:

The four steps of the K-means algorithm are as follows:

Cluster configuration (initializing the centroids or means m_k of the K clusters).
Cluster assignment (assigning observations to the nearest cluster given the centroids m_k).
Error minimization (computing the total reconstruction error):
1. Compute centroids m_k that minimize the total reconstruction error for the current assignment.
2. Reassign the observations given the new centroids m_k.
3. Repeat the computation of the total reconstruction error until no observations are reassigned.
Classification of a new observation by assigning the observation to the closest cluster.

We need to define the components of the K-means in Scala before implementing the algorithm.

Step 1 – cluster configuration

Let's create the two main components of the K-means algorithms: clusters of observations and the implementation of the K-means algorithm.

Defining clusters

The first step is to define a cluster. A cluster is defined by the following parameters:

The centroid (center) (line 1)
The indices of the observations that belong to this cluster (members) (line 2)

The code will be as follows:

class Cluster[T <: AnyVal](val center: DblArray)
     (implicit f: T => Double) {  //1
  type DistanceFunc[T] = (DblArray, Array[T])=> Double
  val members = new ListBuffer[Int]    //2
  def moveCenter(xt: XVSeries[T]): Cluster[T] 
  ...
}

The cluster is responsible for managing its members (data points) at any point of the iterative computation of the K-means algorithm. It is assumed that a cluster will never contain the same data points twice. The two key methods in the Cluster class are as follows:

moveCenter: This recomputes the centroid of a cluster
stdDev: This computes the standard deviation of the distance between all the observation members and the centroid

The constructor of the Cluster class is implemented by the apply method in the companion object. For convenience, refer to the Class constructor template section in the Appendix A, Basic Concepts:

object Cluster {
  def apply[T <: AnyVal](center: DblArray)
      (implicit f: T => Double): Cluster[T] = 
     new Cluster[T](center)
}

Let's take a look at the moveCenter method. It creates a new cluster with the existing members and a new centroid. The computation of the values of the centroid requires the transposition of the matrix of observations by features into a matrix of feature by observations (line 3). The new centroid is computed by normalizing the sum of each feature across all the observations by the number of data points (line 4):

def moveCenter(xt: XVSeries[T])
       (implicit m: Manifest[T], num: Numeric[T])
       : Cluster[T] = {  
  val sum = transpose(members.map( xt(_)).toList)
             .map(_.sum)  //3
  Cluster[T](sum.map( _ / members.size).toArray) //4
}

The stdDev method computes the standard deviation of all the observations contained in the cluster relative to its center. The distance value between each member and the centroid is extracted through a map invocation (line 5). It is then loaded into a statistics instance to compute the standard deviation (line 6). The function to compute the distance between the center and an observation is an argument of the method. The default distance is euclidean:

def stdDev(xt: XVSeries[T], distance: DistanceFunc): Double = {
  val ts = members.map(xt( _)).map(distance(center,_)) //5
  Stats[Double](ts).stdDev  //6
}

Note

Cluster selection

There are different ways to select the most appropriate cluster when reassigning an observation (updating its membership). In this implementation, we will select the cluster with the larger spread or lowest density. An alternative is to select the cluster with the largest membership.

Initializing clusters

The initialization of the cluster centroids is important to ensure fast convergence of the K-means algorithm. Solutions range from the simple random generation of centroids to the application of genetic algorithms to evaluate the fitness of centroid candidates. We selected an efficient and fast initialization algorithm developed by M. Agha and W. Ashour [4:4].

The steps of the initialization are as follows:

Compute the standard deviation of the set of observations.
Compute the index of the feature {x_k,0, x_k,1 … x_k,n} with the maximum standard deviation.
Rank the observations by their increasing value of standard deviation for the dimension k.
Divide the ranked observations set equally into K sets {S_m}.
Find the median value's size(S_m)/2.
Use the resulting observations as initial centroids.

Let's deconstruct the implementation of the Agha-Ashour algorithm in the initialize method:

def initialize(xt: U): V = {
  val stats = statistics(xt)  //7
  val maxSDevVar = Range(0,stats.size)  //8
                   .maxBy(stats( _ ).stdDev )

  val rankedObs = xt.zipWithIndex
                .map{case (x, n) => (x(maxSDevVar), n)}
                .sortWith( _._1  < _._1)  //9

  val halfSegSize = ((rankedObs.size>>1)/_config.K)
                  .floor.toInt //10
  val centroids = rankedObs
     .filter(isContained( _, halfSegSize, rankedObs.size))
     .map{ case(x, n) => xt(n)} //11
  centroids.aggregate(List[Cluster[T]]())((xs, c) => 
        Cluster[T](c) :: xs, _ ::: _)  //12
}

The statistics method on time series of the XVSeries type is defined in the Time series in Scala section in Chapter 3, Data Preprocessing (line 7). The dimension (or feature) with the maxSDevVar maximum variance or standard deviation is computed using the maxBy method on a Stats instance (line 8). Then, the observations are ranked by the increasing value of the rankedObs standard deviation (line 9).

The ordered sequence of observations is then broken into the xt.size/_config.K segments (line 10) and the indices of the centroids are selected as the midpoint (or median) observations of those segments using the isContained filtering condition (line 11):

def isContained(t: (T,Int), hSz: Int, dim: Int): Boolean = 
    (t._2 % hSz == 0) && (t._2 %(hSz<<1) != 0)

Finally, the list of clusters is generated using an aggregate call on the set of centroids (line 12).

Step 2 – cluster assignment

The second step in the K-means algorithm is the assignment of the observations to the clusters for which the centroids have been initialized in step 1. This feat is accomplished by the private assignToClusters method:

def assignToClusters(xt: U, clusters: V, 
     members: Array[Int]): Int =  {

  xt.zipWithIndex.filter{ case(x, n) => {  //13
    val nearestCluster = getNearestCluster(clusters, x) //14
    val reassigned = nearestCluster != members(n) 

    clusters(nearestCluster) += n //15
    members(n) = nearestCluster //16
    reassigned
  }}.size
}

The core of the assignment of observations to each cluster is the filter on the time series (line 13). The filter computes the index of the closest cluster and checks whether the observation is to be reassigned (line 14). The observation at the n index is added to the nearest cluster, clusters(nearestCluster) (line 15). The current membership of the observations is then updated (line 16).

The cluster closest to an observation data is computed by the private getNearestCluster method as follows:

def getNearestCluster(clusters: V, x: Array[T]): Int = 
  clusters.zipWithIndex./:((Double.MaxValue, 0)){
    case (p, (c, n) ) => { 
     val measure = distance(c.center, x) //17
     if( measure < p._1) (measure, n)  else p
  }}._2

A fold is used to extract the cluster that is closest to the x observation from the list of clusters using the distance metric defined in the K-means constructor (line 17).

As with other data processing units, the extraction of K-means clusters is encapsulated in a data transformation so that clustering can be integrated into a workflow using the composition of mixins described in the Composing mixins to build a workflow section in Chapter 2, Hello World!

Note

K-means algorithm exit condition

In some rare instances, the algorithm may reassign the same few observations between clusters, preventing its convergence toward a solution in a reasonable time. Therefore, it is recommended that you add a maximum number of iterations as an exit condition. If K-means does not converge with the maximum number of iterations, then the cluster centroids need to be reinitialized and the iterative (or recursive) execution needs to be restarted.

The |> transformation requires that the computation of the standard deviation of the distance of the observations related to the centroid, c, is computed in the stdDev method:

type KMeansModel[T} = List[Cluster[T]] 
def stdDev[T](
    clusters: KMeansModel[T], 
    xt: XVSeries[T], 
    distance: DistanceFunc[T]): DblVector = 
 clusters.map( _.stdDev(xt, distance)).toVector

Note

Centroid versus mean

The terms centroid and mean refer to the same entity: the center of a cluster. This chapter uses these two terms interchangeably.

Step 3 – reconstruction/error minimization

The clusters are initialized with predefined set of observations as their members. The algorithm updates the membership of each cluster by minimizing the total reconstruction error. There are two effective strategies to execute the K-means algorithm:

Tail recursive execution
Iterative execution

Creating K-means components

Let's declare the K-means algorithm class, KMeans, with its public methods. KMeans implements an ITransform data transformation using an implicit model extracted from a training set and is described in the Monadic data transformation section in Chapter 2, Hello World! (line 18). The configuration of the KMeansConfig type consists of the tuple (K, maxIters) with K being the number of clusters and maxIters being the maximum number of iterations allowed for the convergence of the algorithm:

case class KMeansConfig(val K: Int, maxIters: Int)

The KMeans class takes the following three arguments:

config: This is the configuration used for the execution of the algorithm
distance: This is the function used to compute the distance between any observation and a cluster centroid
xt: This is the training set

The implicit conversion of the T type to a Double is implemented as a view bound. The instantiation of the KMeans class initializes a V type of output from K-means as Cluster[T] (line 20). The num instance of the Numeric class has to be passed implicitly as a class parameter because it is required by the sortWith invocation in initialize, the maxBy method, and the Cluster.moveCenter method (line 19). The Manifest is required to preserve the erasure type for Array[T] in the JVM:

class KMeans[T <: AnyVal](config: KMeansConfig,  //18
    distance: DistanceFunc[T],
    xt: XVSeries[T])
    (implicit m: Manifest[T], num: Numeric[T], f: T=>Double) //19
  extends ITransform[Array[T]](xt) with Monitor[T] { 

  type V = Cluster[T]     //20
  val model: Option[KMeansModel[T]] = train
  def train: Option[KMeansModel[T]]
  override def |> : PartialFunction[U, Try[V]]
  ...
}

The KMeansModel model is defined as the list of clusters extracted through training.

Tail recursive implementation

The transformation or clustering function is implemented by the train training method that creates a partial function with XVSeries[T] as the input and KMeansModel[T] as the output:

def train: Option[KMeansModel[T]] = Try {
         // STEP 1
  val clusters =initialize(xt) //21
  if( clusters.isEmpty)  /* ...  */
  else  {
         // STEP 2
    val members = Array.fill(xt.size)(0)
    assignToClusters(xt, clusters, members) //22
    var iters = 0

    // Declaration of the tail recursion def update      
    if( iters >= _config.maxIters )
      throw new IllegalStateException( /* .. */)
         // STEP 3
    update(clusters, xt, members)  //23
  } 
} match {
   case Success(clusters) => Some(clusters)
   case Failure(e) => /* … */
}

The K-means training algorithm is implemented through the following three steps:

Initialize the cluster's centroid using the initialize method (line 21).
Assign observations to each cluster using the assignToClusters method (line 22).
Recompute the total error reconstruction using the update recursive method (line 23).

The computation of the total error reconstruction is implemented as a tail recursive method, update, as follows:

@tailrec
def update(clusters: KMeansModel[T], xt: U, 
       members: Array[Int]): KMeansModel[T] = {  //24

  val newClusters = clusters.map( c => {         
      if( c.size > 0) c.moveCenter(xt) //25
    else clusters.filter( _.size >0)
          .maxBy(_.stdDev(xt, distance)) //26
  }) 
  iters += 1
  if(iters >= config.maxIters ||       //27
      assignToClusters(xt, newClusters, members) ==0) 
    newClusters
  else 
    update(newClusters, xt, membership)   //28
}

The recursion takes the following three arguments (line 24):

The current list of clusters that is updated during the recursion
The xt input time series
The indices of membership to the clusters, members

A new list of clusters, newClusters, is computed by either recalculating each centroid if the cluster is not empty (line 25) or evaluating the standard deviation of the distance of each observation relative to each centroid (line 26). The execution exits when either the maximum number of the maxIters recursive calls is reached or when no more observations are reassigned to a different cluster (line 27). Otherwise, the method invokes itself with an updated list of clusters (line 28).

Iterative implementation

The implementation of an iterative execution is presented for an informational purpose. It follows the same sequence of calls as with the recursive implementation. The new clusters are computed (line 29) and the execution exits when either the maximum number of allowed iterations is reached (line 30) or when no more observations are reassigned to a different cluster (line 31):

val members = Array.fill(xt.size)(0)
assignToClusters(xt, clusters, members) 
var newClusters: KMeansModel[T] = List.empty[Cluster[T]]
Range(0,  maxIters).find( _ => {  //29
  newClusters = clusters.map( c => {   //30
    if( c.size > 0)  c.moveCenter(xt) 
    else clusters.filter( _.size > 0)
           .maxBy(_.stdDev(xt, distance))
  }) 
  assignToClusters(xt, newClusters, members) > 0  //31
}).map(_ => newClusters)

The density of the clusters is computed in the KMeans class as follows:

def density: Option[DblVector] = 
  model.map( _.map( c => 
    c.getMembers.map(xt(_)).map( distance(c.center, _)).sum)

Step 4 – classification

The objective of the classification is to assign an observation to a cluster with the closest centroid:

override def |> : PartialFunction[Array[T], Try[V]] = {
  case x: Array[T] if( x.length == dimension(xt) 
                     && model != None) => 
  Try (model.map( _.minBy(c => distance(c.center,x))).get )
}

The most appropriate cluster is computed by selecting the c cluster whose center is the closest to the x observation using the minBy higher order method.

The curse of dimensionality

A model with a significant number of features (high dimensions) requires a larger number of observations in order to extract relevant and reliable clusters. K-means clustering with very small datasets (< 50) produces models with high bias and a limited number of clusters, which are affected by the order of observations [4:5]. I have been using the following simple empirical rule of thumb for a training set of size n, expected K clusters, and N features: n < K.N.

Note

Dimensionality and the size of the training set

The issue of sizing the training set given the dimensionality of a model is not specific to unsupervised learning algorithms. All supervised learning techniques face the same challenge to set up a viable training plan.

Whichever empirical rule you follow, such a restriction is particularly an issue for analyzing stocks using historical quotes. Let's consider our examples of using technical analysis to categorize stocks according to their price behavior over a period of 1 year (or approximately 250 trading days). The dimension of the problem is 250 (250 daily closing prices). The number of stocks (observations) would have exceeded several hundred!

Price model for K-means clustering

There are options to get around this limitation and shrink the numbers of observations, which are as follows:

Sampling the trading data without losing a significant amount of information from the raw data, assuming that the distribution of observations follows a known probability density function.
Smoothing the data to remove the noise as seen in Chapter 3, Data Preprocessing, assuming that the noise is Gaussian. In our test, a smoothing technique will remove the price outliers for each stock and therefore reduce the number of features (trading session). This approach differs from the first (sampling) technique because it does not require an assumption that the dataset follows a known density function. On the other hand, the reduction of features will be less significant.

These approaches are workaround solutions at best, used for the sake of this tutorial. You need to consider the quality of your data before applying these techniques to the actual commercial applications. The principal component analysis introduced in the last paragraph of this chapter is one of the most reliable dimension reduction techniques.

Setting up the evaluation

The objective is to extract clusters from a set of stock price actions during a period of time between January 1 and Dec 31, 2013 as features. For this test, 127 stocks are randomly selected from the S&P 500 list. The following chart visualizes the behavior of the normalized price of a subset of these 127 stocks:

Price action of a basket of stocks used in K-means clustering

The key is to select the appropriate features prior to clustering and the time window to operate on. It would make sense to consider the entire historical price over the 252 trading days as a feature. However, the number of observations (stocks) is too limited to use the entire price range. The observations are the stock closing prices for each trading session between the 80^th and 130^th trading days. The adjusted daily closing prices are normalized using their respective minimum and maximum values.

First, let's create a simple method to compute the density of the clusters:

val MAX_ITERS = 150
def density(K: Int, obs: XVSeries[Double]): DblVector = 
  KMeans[Double](KMeansConfig(K, MAX_ITERS)).density.get //32

The density method invokes KMeans.density described in step 3. Let's load the data from CSV files using the DataSource class, as described in the Data extraction section in the Appendix A, Basic Concepts:

import YahooFinancials.) 
val START_INDEX = 80; val NUM_SAMPLES = 50  //33
val PATH = "resources/data/chap4/"

type INPUT = Array[String] => Double
val extractor = adjClose :: List[INPUT]() //34
val symbolFiles = DataSource.listSymbolFiles(PATH) //35

for {
  prices <- getPrices  //36
  values <- Try(getPricesRange(prices))  //37
  stdDev <- Try(ks.map( density(_, values.toVector))) //38
  pfnKmeans <- Try { 
    KMeans[Double](KMeansConfig(5,MAX_ITERS),values.toVector) |> 
  }   //39
  predict <- pfnKmeans(values.head)   //40
} yield {
  val results = s"""Daily prices ${prices.size} stocks")
     | 
Clusters density ${stdDev.mkString(", ")}"""
  .stripMargin
  show(results)
}

As mentioned earlier, the cluster analysis applies to the closing price in the range between the 80^th and 130^th trading days (line 33). The extractor function retrieves the adjusted closing price for a stock from YahooFinancials (line 34). The list of stock tickers (or symbols) are extracted as a list of CSV filenames located in path (line 35). For instance, the ticker symbol for General Electric Corp. is GE and the trading data is located in GE.csv.

The execution extracts 50 daily prices using DataSource and then filters out the incorrectly formatted data using filter (line 36):

type XVSeriesSet = Array[XVSeries[Double]]
def getPrices: Try[XVSeriesSet] = Try {
   symbolFiles.map( DataSource(_, path) |> extractor )
   .filter( _.isSuccess ).map( _.get)
}

The historical stock prices for the trading session between the 80^th and 130^th days are generated by the getPricesRange closure (line 37):

def getPricesRange(prices: XVSeriesSet) = 
   prices.view.map(_.head.toArray)
    .map( _.drop(START_INDEX).take(NUM_SAMPLES))

It computes the density of the clusters by invoking the density method for each ks value of the number of clusters (line 38).

The pfnKmeans partial classification function is created for a 5-cluster, KMeans (line 39), and then used to classify one of the observations (line 40).

Evaluating the results

The first test run is executed with K=3 clusters. The mean (or centroid) vector for each cluster is plotted as follows:

A chart of means of clusters using K-means K=3

The means vectors of the three clusters are quite distinctive. The top and bottom means 1 and 2 in the chart have the respective standard deviation of 0.34 and 0.27 and share a very similar pattern. The difference between the elements of the 1 and 2 cluster mean vectors is almost constant: 0.37. The cluster with a mean vector 3 represents the group of stocks that behave like the stocks in cluster 2 at the beginning of the time period and behave like the stocks in cluster 1 toward the end of the time period.

This behavior can be easily explained by the fact that the time window or trading period, the 80^th to 130^th trading day, correspond to the shift in the monetary policy of the federal reserve in regard to the quantitative easing program. Here is the partial list of stocks for each of the clusters whose centroid values are displayed on the chart:

Cluster 1	AET, AHS, BBBY, BRCM, C, CB, CL, CLX, COH, CVX, CYH, DE, …
Cluster 2	AA, AAPL, ADBE, ADSK, AFAM, AMZN, AU, BHI, BTU, CAT, CCL, …
Cluster 3	ADM, ADP, AXP, BA, BBT, BEN, BK, BSX, CA, CBS, CCE, CELG, CHK, …

Let's evaluate the impact of the number of clusters K on the characteristics of each cluster.

Tuning the number of clusters

We repeat the previous test on the 127 stocks and the same time window with the number of clusters varying from 2 to 15.

The mean (or centroid) vector for each cluster for K = 2 is plotted as follows:

A chart of means of clusters using K-means K=2

The chart of the results of the K-means algorithms with 2 clusters shows that the mean vector for the cluster labeled 2 is similar to the mean vector labeled 3 on the chart with K = 5 clusters. However, the cluster with the mean vector 1 reflects somewhat the aggregation or summation of the mean vectors for the clusters 1and 3 in the chart K = 5. The aggregation effect explains why the standard deviation for the cluster 1 (0.55) is twice as much as the standard deviation for the cluster 2 (0.28).

The mean (or centroid) vector for each cluster for K = 5 is plotted as follows:

A chart of means of clusters using K-means K=5

In this chart, we can assess that the clusters 1 (with the highest mean), 2 (with the lowest mean), and 3 are very similar to the clusters with the same labels in the chart for K = 3. The cluster with the mean vector 4 contains stocks whose behaviors are quite similar to those in cluster 3, but in the opposite direction. In other words, the stocks in cluster 3 and 4 reacted in opposite ways following the announcement of the change in the monetary policy.

In the tests with high values of K, the distinction between the different clusters becomes murky, as shown in the following chart for K = 10:

A chart of means of clusters using K-means K=10

The means for clusters 1, 2, and 3 seen in the first chart for the case K = 2 are still visible. It is fair to assume that these are very likely the most reliable clusters. These clusters happened to have a low standard deviation or high density.

Let's define the density of a cluster, C_j, with a centroid, c_j, as the inverse of the Euclidean distance between all the members of each cluster and its mean (or centroid) (M6):

The density of the cluster is plotted against the number of clusters with K = 1 to K = 13:

A bar chart of the average cluster density for K = 1 to 13

As expected, the average density of each cluster increases as K increases. From this experiment, we can draw the simple conclusion that the density of each cluster does not significantly increase in the test runs for K = 5 and beyond. You may observe that the density does not always increase as the number of clusters increases (K = 6 and K = 11). The anomaly can be explained by the following three factors:

The original data is noisy
The model is somewhat dependent on the initialization of the centroids
The exit condition is too loose

Validation

There are several methodologies to validate the output of a K-means algorithm from purity to mutual information [4:6]. One effective way to validate the output of a clustering algorithm is to label each cluster and run those clusters through a new batch of labeled observations. For example, if during one of these tests, you find that one of the clusters, CC, contains most of the commodity-related stocks, then you can select another commodity-related stock, SC, which is not part of the first batch, and run the entire clustering algorithm again. If SC is a subset of CC, then K-means has performed as expected. If this is the case, you should run a new set of stocks, some of which are commodity-related, and measure the number of true positives, true negatives, false positives, and false negatives. The values for the precision, recall, and F₁ score introduced in the Assessing a model section in Chapter 2, Hello World!, confirms whether the tuning parameters and labels you selected for your cluster are indeed correct.

Note

F1 validation for K-means

The quality of the clusters, as measured by the F₁ score, depends on the rule, policy, or formula used to label observations (that is, label a cluster with the industry with the highest relative percentage of stocks in the cluster). This process is quite subjective. The only sure way to validate a methodology is to evaluate several labeling schemes and select the one that generate the highest F₁ score.

An alternative to measure the homogeneity of the distribution of observations across the clusters is to compute the statistical entropy. A low entropy value indicates that the clusters have a low level of impurity. Entropy can be used to find the optimal number of clusters K.

We reviewed some of the tuning parameters that affect the quality of the results of the K-means clustering, which are as follows:

Initial selection of a centroid
Number of K clusters

In some cases, the similarity criterion (that is, the Euclidean distance or cosine distance) can have an impact on the cleanness or density of the clusters.

The final and important consideration is the computational complexity of the K-means algorithm. The previous sections of the chapter described some of the performance issues with K-means and possible remedies.

Despite its many benefits, the K-means algorithm does not handle missing data or unobserved features very well. Features that depend on each other indirectly may in fact depend on a common hidden (also known as latent) feature. The expectation-maximization algorithm described in the next section addresses some of these limitations.

The expectation-maximization algorithm

The expectation-maximization algorithm was originally introduced to estimate the maximum likelihood in the case of incomplete data [4:7]. It is an iterative method to compute the model features that maximize the likely estimate for observed values, taking into account unobserved values.

The iterative algorithm consists of computing the following:

The expectation, E, of the maximum likelihood for the observed data by inferring the latent values (E-step)
The model features that maximize the expectation E (M-step)

The expectation-maximization algorithm is applied to solve clustering problems by assuming that each latent variable follows a normal or Gaussian distribution. This is similar to the K-means algorithm for which the distance of each data point to the center of each cluster follows a Gaussian distribution [4:8]. Therefore, a set of latent variables is a mixture of Gaussian distributions.

Gaussian mixture models

Latent variables, Z_i, can be visualized as the behavior (or symptoms) of a model (observed) X for which Z are the root causes of the behavior:

Visualization of observed and latent features

The latent values, Z, follow a Gaussian distribution. For the statisticians among us, the mathematics of a mixture model is described here:

Note

Maximization of the log likelihood

M7: If x = {x_i} is a set of observed features associated with latent features z = {z_i}, the probability for the feature x_i, of the observation x, given a model parameter θ, is defined as:

M8: The objective is to maximize the likelihood, L(θ), as shown here:

Overview of EM

As far as the implementation is concerned, the expectation-maximization algorithm can be broken down into three stages:

The computation of the log likelihood for the model features given some latent variables (initial step).
The computation of the expectation of the log likelihood at iteration t (E step).
The maximization of the expectation at iteration t (M step).

Note

The E step

M9: The expectation, Q, of the complete data log likelihood for the model parameters, θ_n, at iteration, n, is computed using the posterior distribution of latent variable, z, p(z|x, θ), and the joint probability of the observation and the latent variable:

The M-step

M10: The expectation function Q is maximized for the model features θ to compute the model parameters θ_n+1 for the next iteration:

A formal, detailed, but short mathematical formulation of the EM algorithm can be found in S. Borman's tutorial [4:9].

Implementation

Let's implement the three steps (initial step, E step, and M step) in Scala. The internal calculations of the EM algorithm are a bit complex and overwhelming. You may not benefit much from the details of a specific implementation such as computation of the eigenvalues of the covariance matrix of the expectation of the log likelihood. This implementation hides some complexities using the Apache Commons Math library package [4:10].

Note

The inner workings of EM

You may want to download the source code for the implementation of the EM algorithm in the Apache Commons Math library, if you need to understand the condition for which an exception is thrown.

The expectation-maximization algorithm of the MultivariateEM type is implemented as a data transformation of an ITransform type, as described in the Monadic data transformation section in Chapter 2, Hello World!. The two arguments of the constructors are the number of K clusters (or gauss distribution) and the xt training set (line 1). The constructor initializes the V type of the output as EMCluster (line 2):

class MultivariateEM[T <: AnyVal](K: Int, 
     xt: XVSeries[T])(implicit f: T => Double)
  extends ITransform[Array[T]](xt) with Monitor[T] {  //1
  type V = EMCluster  //2
  val model: Option[EMModel ] = train //3
  override def |> : PartialFunction[U, Try[V]]
}

The multivariate expectation-maximization class has a model that consists of a list of EM clusters of the EMCluster type. The Monitor trait is used to collect the profiling information during training (refer to the Monitor section under Utility classes in the Appendix A, Basic Concepts).

The information about an EM cluster, EMCluster, is defined by key, the centroid or means value, and density of the cluster that is the standard deviation of the distance of all the data points to the mean (line 4):

case class EMCluster(key: Double, val means: DblArray, 
      val density: DblArray)   //4
type EMModel = List[EMCluster]

The implementation of the EM algorithm in the train method uses the Apache Commons Math MultivariateNormalMixture for the Gaussian mixture model and MultivariateNormalMixtureExpectationMaximization for the EM algorithm:

  def train: Option[EMModel ] = Try {
  val data: DblMatrix = xt  //5
  val multivariateEM = new EM(data)
  multivariateEM.fit( estimate(data, K) ) //6

  val newMixture = multivariateEM.getFittedModel  //7
  val components = newMixture.getComponents.toList  //8
  components.map(p =>EMCluster(p.getKey, p.getValue.getMeans, 
                      p.getValue.getStandardDeviations)) //9
} match {/* … */}

Let's take a look at the main train method of the MultivariateEM wrapper class. The first step is to convert the time series into a primitive matrix of Double with observations/historical quotes as rows and the stock symbols as columns.

The xt time series of the XVSeries[T] type is converted to a DblMatrix through an induced implicit conversion (line 5).

The initial mixture of Gaussian distributions can be provided by the user or can be extracted from the estimate datasets (line 6). The getFittedModel triggers the M-step (line 7).

Note

Conversion from Java and Scala collections

Java primitives need to be converted to Scala types using the import scala.collection.JavaConversions package. For example, java.util.List is converted to scala.collection.immutable.List by invoking the asScalaIterator method of the WrapAsScala class, one of the base traits of JavaConversions.

The Apache Commons Math getComponents method returns a java.util.List that is converted to scala.collection.immutable.List by invoking the toList method (line 8). Finally, the data transform returns a list of cluster information of the EMCluster type (line 9).

Note

Third-party library exceptions

Scala does not enforce the declaration of exceptions as part of the signature of a method. Therefore, there is no guarantee that all types of exceptions will be caught locally. This problem occurs when exceptions are thrown from a third-party library in two scenarios:

The documentation of the API does not list all the types of exceptions
The library is updated and a new type of exception is added to a method

One easy workaround is to leverage the Scala exception-handling mechanism:

Try {
     ..
} match {
    case Success(results) => …
    case Failure(exception)  => ...
}

Classification

The classification of a new observation or data points is implemented by the |> method:

override def |> : PartialFunction[Array[T], Try[V]] = {
  case x: Array[T] 
    if(isModel && x.length == dimension(xt)) => 
  Try( model.map(_.minBy(c => euclidean(c.means,x))).get)
}

The |> method is similar to the KMeans.|> classifier.

Testing

Let's apply the MultivariateEM class to the clustering of the same 127 stocks used in evaluating the K-means algorithm.

As discussed in the The curse of dimensionality section, the number of stocks (127) to analyze restricts the number of observations to be used by the EM algorithm. A simple option is to filter out some of the noise of the stock's prices and apply a simple sampling method. The maximum sampling rate is restricted by the frequencies in the spectrum of noises of different types in the historical price of every stock.

Note

Filtering and sampling

The preprocessing of the data using a combination of a simple moving average and fixed interval sampling prior to clustering is very rudimentary in this example. For instance, we cannot assume that the historical price of all the stocks share the same noise characteristics. The noise pattern in the quotation of momentum and heavily traded stocks is certainly different from blue-chip securities with a strong ownership, and these stocks are held by large mutual funds.

The sampling rate should take into account the spectrum of frequency of the noise. It should be set as at least twice the frequency of the noise with the lowest frequency.

The object of the test is to evaluate the impact of the sampling rate, samplingRate, and the number of K clusters used in the EM algorithm:

val K = 4; val period = 8
val smAve = SimpleMovingAverage[Double](period)  //10
val pfnSmAve = smAve |>    //11

val obs = symbolFiles.map(sym => (
  for {
    xs <- DataSource(sym, path, true, 1) |>extractor //12
    values <- pfnSmAve(xs.head)  //13
    y <- Try {   
        values.view.zipWithIndex.drop(period+1).toVector
          .filter( _._2 % samplingRate == 0)
          .map( _._1).toArray //14
    }
  } yield y).get) 

em(K, obs)  //15

The first step is to create a simple moving average with a predefined period (line 10), as described in the The simple moving average section in Chapter 3, Data Preprocessing. The test code instantiates the pfnSmAve partial function that implements the moving average computation (line 11). The symbols of the stocks under consideration are extracted from the name of the files in the path directory. The historical data is contained in the CSV file whose name is path/STOCK_NAME.csv (line 12).

The execution of the moving average (line 13) generates a set of smoothed values that is sampled given a sampling rate, samplingRate (line 14). Finally, the expectation-maximization algorithm is instantiated to cluster the sampled data in the em method (line 15):

def em(K: Int, obs: DblMatrix): Int = {
  val em = MultivariateEM[Double](K, obs.toVector) //16
  show(s"${em.toString}")  //17
}

The em method instantiates the EM algorithm for a specific number K of clusters (line 16). The content of the model is displayed by invoking MultivariateEM.toString. The results are aggregated, then displayed in a textual format on the standard output (line 17).

The first test is to execute the EM algorithm with K = 3 clusters and a sampling period of 10 on data smoothed by a simple moving average with period of 8. The sampling of historical prices of the 127 stocks between January 1, 2013 and December 31, 2013 with a frequency of 0.1 hertz produces 24 data points. The following chart displays the mean of each of the three clusters:

A chart of the normalized means per cluster using EM K=3

The mean vectors of clusters 2 and 3 have similar patterns, which may suggest that a set of three clusters is accurate enough to provide a first insight into the similarity within groups of stocks. The following is a chart of the normalized standard deviation per cluster using EM with K = 3:

A chart of the normalized standard deviation per cluster using EM K=3

The distribution of the standard deviation along with the mean vector of each cluster can be explained by the fact that the price of stocks from a couple of industries went down in synergy, while others went up as a semi-homogenous group following the announcement from the Federal Reserve that the monthly quantity of bonds purchased as part of the quantitative easing program would be reduced in the near future.

Note

Relation to K-means

You may wonder what is the relation between EM and K-means, as both the techniques address the same problem. The K-means algorithm assigns each observation uniquely to one and only one cluster. The EM algorithm assigns an observation based on posterior probability. K-means is a special case of the EM for Gaussian mixtures [4:11].

The online EM algorithm

Online learning is a powerful strategy for training a clustering model when dealing with very large datasets. This strategy has regained interest from scientists lately. The description of the online EM algorithm is beyond the scope of this tutorial. However, you may need to know that there are several algorithms for online EM available if you ever have to deal with large datasets, such as batch EM, stepwise EM, incremental EM, and Monte Carlo EM [4:12].

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 4. Unsupervised Learning

Create new playlist

Sign In

Sign Up

Chapter 4. Unsupervised Learning

Clustering

Note

K-means clustering

Measuring similarity

Note

Defining the algorithm

Note

Step 1 – cluster configuration

Defining clusters

Note

Initializing clusters

Step 2 – cluster assignment

Note

Note

Step 3 – reconstruction/error minimization

Creating K-means components

Tail recursive implementation

Iterative implementation

Step 4 – classification

The curse of dimensionality

Note

Setting up the evaluation

Evaluating the results

Tuning the number of clusters

Validation

Note

The expectation-maximization algorithm

Gaussian mixture models

Note

Overview of EM

Note

Implementation

Note

Note

Note

Classification

Testing

Note

Note

The online EM algorithm

Table of Contents for
4. Unsupervised Learning