Labeling a set of observations for classification or regression can be a daunting task, especially in the case of a large feature set. In some cases, labeled observations are either unavailable or not possible to create. In an attempt to extract some hidden associations or structures from observations, the data scientist relies on unsupervised learning techniques to detect patterns or similarity in data.
The goal of unsupervised learning is to discover patterns of regularities and irregularities in a set of observations. These techniques are also applied in reducing the solution or feature space.
There are numerous unsupervised algorithms; some are more appropriate to handle dependent features while others generate affinity groups in the case of hidden features [4:1]. In this chapter, you will learn three of the most common unsupervised learning algorithms:
Any of these algorithms can be applied to technical analysis or fundamental analysis. Fundamental analysis of financial ratios and technical analysis of price movements is discussed in the Technical analysis section under Finances 101 in the Appendix A, Basic Concepts. The K-means algorithm is fully implemented in Scala while expectation-maximization and Principal Components Analysis leverage the Apache Commons Math library.
The chapter concludes with a brief overview of dimension reduction techniques for non-linear models.
Problems involving a large number of features for large datasets become quickly intractable, and it is quite difficult to evaluate the independence between features. Any computation that requires some level of optimization and, at a minimum, the computation of first order derivatives requires a significant amount of computing power to manipulate high-dimension matrices. As with many engineering fields, a divide-and-conquer approach to classifying very large datasets is quite effective. The objective is to reduce very large sets of observations into a small group of observations that share some common attributes.
This approach is known as vector quantization. Vector quantization is a method that divides a set of observations into groups of similar size. The main benefit of vector quantization is that the analysis using a representative of each group is far simpler than an analysis of the entire dataset [4:2].
Clustering, also known as cluster analysis, is a form of vector quantization that relies on a concept of distance or similarity to generate groups known as clusters.
This chapter introduces two of the most commonly applied clustering algorithms:
K-means is a popular clustering algorithm that can be implemented either iteratively or recursively. The representative of each cluster is computed as the center of the cluster, known as the centroid. The similarity between observations within a single cluster relies on the concept of distance (or similarity) between observations.
There are many ways to measure the similarity between observations. The most appropriate measure has to be intuitive and avoid computational complexity. This section reviews three similarity measures:
The Manhattan distance is defined by the absolute distance between two variables or vectors, {xi} and {yi}, of the same size (M1):
The implementation is generic enough to compute the distance between two vectors of elements of different types as long as an implicit conversion between each of these types to the Double
values is already defined, as follows:
def manhattan[T <: AnyVal, U <: AnyVal](
x: Array[T],
y: Array[U])(implicit f: T => Double): Double =
(x,y).zipped.map{case (u,v) => Math.abs(u-v)}.sum
The ubiquitous Euclidean distance between two vectors, {xi} and {yi}, of the same size is defined by the following formula (M2):
The code will be as follows:
def euclidean[T <: AnyVal, U <: AnyVal](
x: Array[T],
y: Array[U])(implicit f: T => Double): Double =
Math.sqrt((x,y).zipped.map{case (u,v) =>u-v}.map(sqr(_)).sum)
The normalized inner product or cosine distance between two vectors, {xi} and {yi}, is defined by the following formula (M3):
In this implementation, the computation of the dot product and the norms for each dataset is done simultaneously using the tuple within the fold
method:
def cosine[T <: AnyVal, U < : AnyVal] (
x: Array[T],
y: Array[U])(implicit f : T => Double): Double = {
val norms = (x,y).zipped
.map{ case (u,v) => Array[Double](u*v, u*u, v*v)}
./:(Array.fill(3)(0.0))((s, t) => s ++ t)
norms(0)/Math.sqrt(norms(1)*norms(2))
}
Performance of zip and zipped
The scalar product of two vectors is one of the most common operations. It is tempting to implement the dot product using the generic zip
method:
def dot(x:Array[Double], y:Array[Double]): Array[Double] = x.zip(y).map{case(x, y) => f(x,y) )
A functional alternative is to use the Tuple2.zipped
method:
def dot(x:Array[Double], y:Array[Double]): Array[Double] = (x, y).zipped map ( _ * _)
If readability is not a primary issue, you can always implement the dot
method with a while
loop, which prevents you from using the ubiquitous while
loop.
The main advantage of the K-means algorithm (and the reason for its popularity) is its simplicity [4:3].
The four steps of the K-means algorithm are as follows:
We need to define the components of the K-means in Scala before implementing the algorithm.
Let's create the two main components of the K-means algorithms: clusters of observations and the implementation of the K-means algorithm.
The first step is to define a cluster. A cluster is defined by the following parameters:
center
) (line 1
)members
) (line 2
)The code will be as follows:
class Cluster[T <: AnyVal](val center: DblArray) (implicit f: T => Double) { //1 type DistanceFunc[T] = (DblArray, Array[T])=> Double val members = new ListBuffer[Int] //2 def moveCenter(xt: XVSeries[T]): Cluster[T] ... }
The cluster is responsible for managing its members (data points) at any point of the iterative computation of the K-means algorithm. It is assumed that a cluster will never contain the same data points twice. The two key methods in the Cluster
class are as follows:
moveCenter
: This recomputes the centroid of a clusterstdDev
: This computes the standard deviation of the distance between all the observation members and the centroidThe constructor of the Cluster
class is implemented by the apply
method in the companion object. For convenience, refer to the Class constructor template section in the Appendix A, Basic Concepts:
object Cluster { def apply[T <: AnyVal](center: DblArray) (implicit f: T => Double): Cluster[T] = new Cluster[T](center) }
Let's take a look at the moveCenter
method. It creates a new cluster with the existing members and a new centroid. The computation of the values of the centroid requires the transposition of the matrix of observations by features into a matrix of feature by observations (line 3
). The new centroid is computed by normalizing the sum of each feature across all the observations by the number of data points (line 4
):
def moveCenter(xt: XVSeries[T]) (implicit m: Manifest[T], num: Numeric[T]) : Cluster[T] = { val sum = transpose(members.map( xt(_)).toList) .map(_.sum) //3 Cluster[T](sum.map( _ / members.size).toArray) //4 }
The stdDev
method computes the standard deviation of all the observations contained in the cluster relative to its center. The distance
value between each member and the centroid is extracted through a map invocation (line 5
). It is then loaded into a statistics instance to compute the standard deviation (line 6
). The function to compute the distance between the center and an observation is an argument of the method. The default distance is euclidean
:
def stdDev(xt: XVSeries[T], distance: DistanceFunc): Double = { val ts = members.map(xt( _)).map(distance(center,_)) //5 Stats[Double](ts).stdDev //6 }
Cluster selection
There are different ways to select the most appropriate cluster when reassigning an observation (updating its membership). In this implementation, we will select the cluster with the larger spread or lowest density. An alternative is to select the cluster with the largest membership.
The initialization of the cluster centroids is important to ensure fast convergence of the K-means algorithm. Solutions range from the simple random generation of centroids to the application of genetic algorithms to evaluate the fitness of centroid candidates. We selected an efficient and fast initialization algorithm developed by M. Agha and W. Ashour [4:4].
The steps of the initialization are as follows:
Let's deconstruct the implementation of the Agha-Ashour algorithm in the initialize
method:
def initialize(xt: U): V = { val stats = statistics(xt) //7 val maxSDevVar = Range(0,stats.size) //8 .maxBy(stats( _ ).stdDev ) val rankedObs = xt.zipWithIndex .map{case (x, n) => (x(maxSDevVar), n)} .sortWith( _._1 < _._1) //9 val halfSegSize = ((rankedObs.size>>1)/_config.K) .floor.toInt //10 val centroids = rankedObs .filter(isContained( _, halfSegSize, rankedObs.size)) .map{ case(x, n) => xt(n)} //11 centroids.aggregate(List[Cluster[T]]())((xs, c) => Cluster[T](c) :: xs, _ ::: _) //12 }
The statistics
method on time series of the XVSeries
type is defined in the Time series in Scala section in Chapter 3, Data Preprocessing (line 7
). The dimension (or feature) with the maxSDevVar
maximum variance or standard deviation is computed using the maxBy
method on a Stats
instance (line 8
). Then, the observations are ranked by the increasing value of the rankedObs
standard deviation (line 9
).
The ordered sequence of observations is then broken into the xt.size/_config.K
segments (line 10
) and the indices of the centroids are selected as the midpoint (or median) observations of those segments using the isContained
filtering condition (line 11
):
def isContained(t: (T,Int), hSz: Int, dim: Int): Boolean =
(t._2 % hSz == 0) && (t._2 %(hSz<<1) != 0)
Finally, the list of clusters is generated using an aggregate
call on the set of centroids (line 12
).
The second step in the K-means algorithm is the assignment of the observations to the clusters for which the centroids have been initialized in step 1. This feat is accomplished by the private assignToClusters
method:
def assignToClusters(xt: U, clusters: V, members: Array[Int]): Int = { xt.zipWithIndex.filter{ case(x, n) => { //13 val nearestCluster = getNearestCluster(clusters, x) //14 val reassigned = nearestCluster != members(n) clusters(nearestCluster) += n //15 members(n) = nearestCluster //16 reassigned }}.size }
The core of the assignment of observations to each cluster is the filter on the time series (line 13
). The filter computes the index of the closest cluster and checks whether the observation is to be reassigned (line 14
). The observation at the n
index is added to the nearest cluster, clusters(nearestCluster)
(line 15
). The current membership of the observations is then updated (line 16
).
The cluster closest to an observation data is computed by the private getNearestCluster
method as follows:
def getNearestCluster(clusters: V, x: Array[T]): Int = clusters.zipWithIndex./:((Double.MaxValue, 0)){ case (p, (c, n) ) => { val measure = distance(c.center, x) //17 if( measure < p._1) (measure, n) else p }}._2
A fold is used to extract the cluster that is closest to the x
observation from the list of clusters using the distance metric defined in the K-means constructor (line 17
).
As with other data processing units, the extraction of K-means clusters is encapsulated in a data transformation so that clustering can be integrated into a workflow using the composition of mixins described in the Composing mixins to build a workflow section in Chapter 2, Hello World!
K-means algorithm exit condition
In some rare instances, the algorithm may reassign the same few observations between clusters, preventing its convergence toward a solution in a reasonable time. Therefore, it is recommended that you add a maximum number of iterations as an exit condition. If K-means does not converge with the maximum number of iterations, then the cluster centroids need to be reinitialized and the iterative (or recursive) execution needs to be restarted.
The |>
transformation requires that the computation of the standard deviation of the distance of the observations related to the centroid, c
, is computed in the stdDev
method:
type KMeansModel[T} = List[Cluster[T]]
def stdDev[T](
clusters: KMeansModel[T],
xt: XVSeries[T],
distance: DistanceFunc[T]): DblVector =
clusters.map( _.stdDev(xt, distance)).toVector
The clusters are initialized with predefined set of observations as their members. The algorithm updates the membership of each cluster by minimizing the total reconstruction error. There are two effective strategies to execute the K-means algorithm:
Let's declare the K-means algorithm class, KMeans
, with its public methods. KMeans
implements an ITransform
data transformation using an implicit model extracted from a training set and is described in the Monadic data transformation section in Chapter 2, Hello World! (line 18
). The configuration of the KMeansConfig
type consists of the tuple (K
, maxIters
) with K
being the number of clusters and maxIters
being the maximum number of iterations allowed for the convergence of the algorithm:
case class KMeansConfig(val K: Int, maxIters: Int)
The KMeans
class takes the following three arguments:
config
: This is the configuration used for the execution of the algorithmdistance
: This is the function used to compute the distance between any observation and a cluster centroidxt
: This is the training setThe implicit conversion of the T
type to a Double
is implemented as a view bound. The instantiation of the KMeans
class initializes a V
type of output from K-means as Cluster[T]
(line 20
). The num
instance of the Numeric
class has to be passed implicitly as a class parameter because it is required by the sortWith
invocation in initialize
, the maxBy
method, and the Cluster.moveCenter
method (line 19
). The Manifest
is required to preserve the erasure type for Array[T]
in the JVM:
class KMeans[T <: AnyVal](config: KMeansConfig, //18 distance: DistanceFunc[T], xt: XVSeries[T]) (implicit m: Manifest[T], num: Numeric[T], f: T=>Double) //19 extends ITransform[Array[T]](xt) with Monitor[T] { type V = Cluster[T] //20 val model: Option[KMeansModel[T]] = train def train: Option[KMeansModel[T]] override def |> : PartialFunction[U, Try[V]] ... }
The KMeansModel
model is defined as the list of clusters extracted through training.
The transformation or clustering function is implemented by the train
training method that creates a partial function with XVSeries[T]
as the input and KMeansModel[T]
as the output:
def train: Option[KMeansModel[T]] = Try { // STEP 1 val clusters =initialize(xt) //21 if( clusters.isEmpty) /* ... */ else { // STEP 2 val members = Array.fill(xt.size)(0) assignToClusters(xt, clusters, members) //22 var iters = 0 // Declaration of the tail recursion def update if( iters >= _config.maxIters ) throw new IllegalStateException( /* .. */) // STEP 3 update(clusters, xt, members) //23 } } match { case Success(clusters) => Some(clusters) case Failure(e) => /* … */ }
The K-means training algorithm is implemented through the following three steps:
initialize
method (line 21
).assignToClusters
method (line 22
).update
recursive method (line 23
).The computation of the total error reconstruction is implemented as a tail recursive method, update
, as follows:
@tailrec def update(clusters: KMeansModel[T], xt: U, members: Array[Int]): KMeansModel[T] = { //24 val newClusters = clusters.map( c => { if( c.size > 0) c.moveCenter(xt) //25 else clusters.filter( _.size >0) .maxBy(_.stdDev(xt, distance)) //26 }) iters += 1 if(iters >= config.maxIters || //27 assignToClusters(xt, newClusters, members) ==0) newClusters else update(newClusters, xt, membership) //28 }
The recursion takes the following three arguments (line 24
):
clusters
that is updated during the recursionxt
input time seriesmembers
A new list of clusters, newClusters
, is computed by either recalculating each centroid if the cluster is not empty (line 25
) or evaluating the standard deviation of the distance of each observation relative to each centroid (line 26
). The execution exits when either the maximum number of the maxIters
recursive calls is reached or when no more observations are reassigned to a different cluster (line 27
). Otherwise, the method invokes itself with an updated list of clusters (line 28
).
The implementation of an iterative execution is presented for an informational purpose. It follows the same sequence of calls as with the recursive implementation. The new clusters are computed (line 29
) and the execution exits when either the maximum number of allowed iterations is reached (line 30
) or when no more observations are reassigned to a different cluster (line 31
):
val members = Array.fill(xt.size)(0) assignToClusters(xt, clusters, members) var newClusters: KMeansModel[T] = List.empty[Cluster[T]] Range(0, maxIters).find( _ => { //29 newClusters = clusters.map( c => { //30 if( c.size > 0) c.moveCenter(xt) else clusters.filter( _.size > 0) .maxBy(_.stdDev(xt, distance)) }) assignToClusters(xt, newClusters, members) > 0 //31 }).map(_ => newClusters)
The density of the clusters is computed in the KMeans
class as follows:
def density: Option[DblVector] =
model.map( _.map( c =>
c.getMembers.map(xt(_)).map( distance(c.center, _)).sum)
The objective of the classification is to assign an observation to a cluster with the closest centroid:
override def |> : PartialFunction[Array[T], Try[V]] = {
case x: Array[T] if( x.length == dimension(xt)
&& model != None) =>
Try (model.map( _.minBy(c => distance(c.center,x))).get )
}
The most appropriate cluster is computed by selecting the c
cluster whose center
is the closest to the x
observation using the minBy
higher order method.
A model with a significant number of features (high dimensions) requires a larger number of observations in order to extract relevant and reliable clusters. K-means clustering with very small datasets (< 50) produces models with high bias and a limited number of clusters, which are affected by the order of observations [4:5]. I have been using the following simple empirical rule of thumb for a training set of size n, expected K clusters, and N features: n < K.N.
Whichever empirical rule you follow, such a restriction is particularly an issue for analyzing stocks using historical quotes. Let's consider our examples of using technical analysis to categorize stocks according to their price behavior over a period of 1 year (or approximately 250 trading days). The dimension of the problem is 250 (250 daily closing prices). The number of stocks (observations) would have exceeded several hundred!
There are options to get around this limitation and shrink the numbers of observations, which are as follows:
These approaches are workaround solutions at best, used for the sake of this tutorial. You need to consider the quality of your data before applying these techniques to the actual commercial applications. The principal component analysis introduced in the last paragraph of this chapter is one of the most reliable dimension reduction techniques.
The objective is to extract clusters from a set of stock price actions during a period of time between January 1 and Dec 31, 2013 as features. For this test, 127 stocks are randomly selected from the S&P 500 list. The following chart visualizes the behavior of the normalized price of a subset of these 127 stocks:
The key is to select the appropriate features prior to clustering and the time window to operate on. It would make sense to consider the entire historical price over the 252 trading days as a feature. However, the number of observations (stocks) is too limited to use the entire price range. The observations are the stock closing prices for each trading session between the 80th and 130th trading days. The adjusted daily closing prices are normalized using their respective minimum and maximum values.
First, let's create a simple method to compute the density of the clusters:
val MAX_ITERS = 150 def density(K: Int, obs: XVSeries[Double]): DblVector = KMeans[Double](KMeansConfig(K, MAX_ITERS)).density.get //32
The density
method invokes KMeans.density
described in step 3. Let's load the data from CSV files using the DataSource
class, as described in the Data extraction section in the Appendix A, Basic Concepts:
import YahooFinancials.) val START_INDEX = 80; val NUM_SAMPLES = 50 //33 val PATH = "resources/data/chap4/" type INPUT = Array[String] => Double val extractor = adjClose :: List[INPUT]() //34 val symbolFiles = DataSource.listSymbolFiles(PATH) //35 for { prices <- getPrices //36 values <- Try(getPricesRange(prices)) //37 stdDev <- Try(ks.map( density(_, values.toVector))) //38 pfnKmeans <- Try { KMeans[Double](KMeansConfig(5,MAX_ITERS),values.toVector) |> } //39 predict <- pfnKmeans(values.head) //40 } yield { val results = s"""Daily prices ${prices.size} stocks") | Clusters density ${stdDev.mkString(", ")}""" .stripMargin show(results) }
As mentioned earlier, the cluster analysis applies to the closing price in the range between the 80th and 130th trading days (line 33
). The extractor
function retrieves the adjusted closing price for a stock from YahooFinancials
(line 34
). The list of stock tickers (or symbols) are extracted as a list of CSV filenames located in path
(line 35
). For instance, the ticker symbol for General Electric Corp. is GE and the trading data is located in GE.csv
.
The execution extracts 50 daily prices using DataSource
and then filters out the incorrectly formatted data using filter
(line 36
):
type XVSeriesSet = Array[XVSeries[Double]] def getPrices: Try[XVSeriesSet] = Try { symbolFiles.map( DataSource(_, path) |> extractor ) .filter( _.isSuccess ).map( _.get) }
The historical stock prices for the trading session between the 80th and 130th days are generated by the getPricesRange
closure (line 37
):
def getPricesRange(prices: XVSeriesSet) =
prices.view.map(_.head.toArray)
.map( _.drop(START_INDEX).take(NUM_SAMPLES))
It computes the density of the clusters by invoking the density
method for each ks
value of the number of clusters (line 38
).
The pfnKmeans
partial classification function is created for a 5-cluster, KMeans
(line 39
), and then used to classify one of the observations (line 40
).
The first test run is executed with K=3 clusters. The mean (or centroid) vector for each cluster is plotted as follows:
The means vectors of the three clusters are quite distinctive. The top and bottom means 1 and 2 in the chart have the respective standard deviation of 0.34 and 0.27 and share a very similar pattern. The difference between the elements of the 1 and 2 cluster mean vectors is almost constant: 0.37. The cluster with a mean vector 3 represents the group of stocks that behave like the stocks in cluster 2 at the beginning of the time period and behave like the stocks in cluster 1 toward the end of the time period.
This behavior can be easily explained by the fact that the time window or trading period, the 80th to 130th trading day, correspond to the shift in the monetary policy of the federal reserve in regard to the quantitative easing program. Here is the partial list of stocks for each of the clusters whose centroid values are displayed on the chart:
Cluster 1 |
AET, AHS, BBBY, BRCM, C, CB, CL, CLX, COH, CVX, CYH, DE, … |
Cluster 2 |
AA, AAPL, ADBE, ADSK, AFAM, AMZN, AU, BHI, BTU, CAT, CCL, … |
Cluster 3 |
ADM, ADP, AXP, BA, BBT, BEN, BK, BSX, CA, CBS, CCE, CELG, CHK, … |
Let's evaluate the impact of the number of clusters K on the characteristics of each cluster.
We repeat the previous test on the 127 stocks and the same time window with the number of clusters varying from 2 to 15.
The mean (or centroid) vector for each cluster for K = 2 is plotted as follows:
The chart of the results of the K-means algorithms with 2 clusters shows that the mean vector for the cluster labeled 2 is similar to the mean vector labeled 3 on the chart with K = 5 clusters. However, the cluster with the mean vector 1 reflects somewhat the aggregation or summation of the mean vectors for the clusters 1and 3 in the chart K = 5. The aggregation effect explains why the standard deviation for the cluster 1 (0.55) is twice as much as the standard deviation for the cluster 2 (0.28).
The mean (or centroid) vector for each cluster for K = 5 is plotted as follows:
In this chart, we can assess that the clusters 1 (with the highest mean), 2 (with the lowest mean), and 3 are very similar to the clusters with the same labels in the chart for K = 3. The cluster with the mean vector 4 contains stocks whose behaviors are quite similar to those in cluster 3, but in the opposite direction. In other words, the stocks in cluster 3 and 4 reacted in opposite ways following the announcement of the change in the monetary policy.
In the tests with high values of K, the distinction between the different clusters becomes murky, as shown in the following chart for K = 10:
The means for clusters 1, 2, and 3 seen in the first chart for the case K = 2 are still visible. It is fair to assume that these are very likely the most reliable clusters. These clusters happened to have a low standard deviation or high density.
Let's define the density of a cluster, Cj, with a centroid, cj, as the inverse of the Euclidean distance between all the members of each cluster and its mean (or centroid) (M6):
The density of the cluster is plotted against the number of clusters with K = 1 to K = 13:
As expected, the average density of each cluster increases as K increases. From this experiment, we can draw the simple conclusion that the density of each cluster does not significantly increase in the test runs for K = 5 and beyond. You may observe that the density does not always increase as the number of clusters increases (K = 6 and K = 11). The anomaly can be explained by the following three factors:
There are several methodologies to validate the output of a K-means algorithm from purity to mutual information [4:6]. One effective way to validate the output of a clustering algorithm is to label each cluster and run those clusters through a new batch of labeled observations. For example, if during one of these tests, you find that one of the clusters, CC, contains most of the commodity-related stocks, then you can select another commodity-related stock, SC, which is not part of the first batch, and run the entire clustering algorithm again. If SC is a subset of CC, then K-means has performed as expected. If this is the case, you should run a new set of stocks, some of which are commodity-related, and measure the number of true positives, true negatives, false positives, and false negatives. The values for the precision, recall, and F1 score introduced in the Assessing a model section in Chapter 2, Hello World!, confirms whether the tuning parameters and labels you selected for your cluster are indeed correct.
F1 validation for K-means
The quality of the clusters, as measured by the F1 score, depends on the rule, policy, or formula used to label observations (that is, label a cluster with the industry with the highest relative percentage of stocks in the cluster). This process is quite subjective. The only sure way to validate a methodology is to evaluate several labeling schemes and select the one that generate the highest F1 score.
An alternative to measure the homogeneity of the distribution of observations across the clusters is to compute the statistical entropy. A low entropy value indicates that the clusters have a low level of impurity. Entropy can be used to find the optimal number of clusters K.
We reviewed some of the tuning parameters that affect the quality of the results of the K-means clustering, which are as follows:
In some cases, the similarity criterion (that is, the Euclidean distance or cosine distance) can have an impact on the cleanness or density of the clusters.
The final and important consideration is the computational complexity of the K-means algorithm. The previous sections of the chapter described some of the performance issues with K-means and possible remedies.
Despite its many benefits, the K-means algorithm does not handle missing data or unobserved features very well. Features that depend on each other indirectly may in fact depend on a common hidden (also known as latent) feature. The expectation-maximization algorithm described in the next section addresses some of these limitations.
The expectation-maximization algorithm was originally introduced to estimate the maximum likelihood in the case of incomplete data [4:7]. It is an iterative method to compute the model features that maximize the likely estimate for observed values, taking into account unobserved values.
The iterative algorithm consists of computing the following:
The expectation-maximization algorithm is applied to solve clustering problems by assuming that each latent variable follows a normal or Gaussian distribution. This is similar to the K-means algorithm for which the distance of each data point to the center of each cluster follows a Gaussian distribution [4:8]. Therefore, a set of latent variables is a mixture of Gaussian distributions.
Latent variables, Zi, can be visualized as the behavior (or symptoms) of a model (observed) X for which Z are the root causes of the behavior:
The latent values, Z, follow a Gaussian distribution. For the statisticians among us, the mathematics of a mixture model is described here:
As far as the implementation is concerned, the expectation-maximization algorithm can be broken down into three stages:
The E step
M9: The expectation, Q, of the complete data log likelihood for the model parameters, θn, at iteration, n, is computed using the posterior distribution of latent variable, z, p(z|x, θ), and the joint probability of the observation and the latent variable:
The M-step
M10: The expectation function Q is maximized for the model features θ to compute the model parameters θn+1 for the next iteration:
A formal, detailed, but short mathematical formulation of the EM algorithm can be found in S. Borman's tutorial [4:9].
Let's implement the three steps (initial step, E step, and M step) in Scala. The internal calculations of the EM algorithm are a bit complex and overwhelming. You may not benefit much from the details of a specific implementation such as computation of the eigenvalues of the covariance matrix of the expectation of the log likelihood. This implementation hides some complexities using the Apache Commons Math library package [4:10].
The expectation-maximization algorithm of the MultivariateEM
type is implemented as a data transformation of an ITransform
type, as described in the Monadic data transformation section in Chapter 2, Hello World!. The two arguments of the constructors are the number of K
clusters (or gauss distribution) and the xt
training set (line 1
). The constructor initializes the V
type of the output as EMCluster
(line 2
):
class MultivariateEM[T <: AnyVal](K: Int, xt: XVSeries[T])(implicit f: T => Double) extends ITransform[Array[T]](xt) with Monitor[T] { //1 type V = EMCluster //2 val model: Option[EMModel ] = train //3 override def |> : PartialFunction[U, Try[V]] }
The multivariate expectation-maximization class has a model that consists of a list of EM clusters of the EMCluster
type. The Monitor
trait is used to collect the profiling information during training (refer to the Monitor section under Utility classes in the Appendix A, Basic Concepts).
The information about an EM cluster, EMCluster
, is defined by key
, the centroid or means
value, and density
of the cluster that is the standard deviation of the distance of all the data points to the mean (line 4
):
case class EMCluster(key: Double, val means: DblArray, val density: DblArray) //4 type EMModel = List[EMCluster]
The implementation of the EM algorithm in the train
method uses the Apache Commons Math MultivariateNormalMixture
for the Gaussian mixture model and MultivariateNormalMixtureExpectationMaximization
for the EM algorithm:
def train: Option[EMModel ] = Try { val data: DblMatrix = xt //5 val multivariateEM = new EM(data) multivariateEM.fit( estimate(data, K) ) //6 val newMixture = multivariateEM.getFittedModel //7 val components = newMixture.getComponents.toList //8 components.map(p =>EMCluster(p.getKey, p.getValue.getMeans, p.getValue.getStandardDeviations)) //9 } match {/* … */}
Let's take a look at the main train
method of the MultivariateEM
wrapper class. The first step is to convert the time series into a primitive matrix of Double
with observations/historical quotes as rows and the stock symbols as columns.
The xt
time series of the XVSeries[T]
type is converted to a DblMatrix
through an induced implicit conversion (line 5
).
The initial mixture of Gaussian distributions can be provided by the user or can be extracted from the estimate
datasets (line 6
). The getFittedModel
triggers the M-step (line 7
).
Conversion from Java and Scala collections
Java primitives need to be converted to Scala types using the import scala.collection.JavaConversions
package. For example, java.util.List
is converted to scala.collection.immutable.List
by invoking the asScalaIterator
method of the WrapAsScala
class, one of the base traits of JavaConversions
.
The Apache Commons Math getComponents
method returns a java.util.List
that is converted to scala.collection.immutable.List
by invoking the toList
method (line 8
). Finally, the data transform returns a list of cluster information of the EMCluster
type (line 9
).
Third-party library exceptions
Scala does not enforce the declaration of exceptions as part of the signature of a method. Therefore, there is no guarantee that all types of exceptions will be caught locally. This problem occurs when exceptions are thrown from a third-party library in two scenarios:
One easy workaround is to leverage the Scala exception-handling mechanism:
Try { .. } match { case Success(results) => … case Failure(exception) => ... }
The classification of a new observation or data points is implemented by the |>
method:
override def |> : PartialFunction[Array[T], Try[V]] = {
case x: Array[T]
if(isModel && x.length == dimension(xt)) =>
Try( model.map(_.minBy(c => euclidean(c.means,x))).get)
}
Let's apply the MultivariateEM
class to the clustering of the same 127 stocks used in evaluating the K-means algorithm.
As discussed in the The curse of dimensionality section, the number of stocks (127) to analyze restricts the number of observations to be used by the EM algorithm. A simple option is to filter out some of the noise of the stock's prices and apply a simple sampling method. The maximum sampling rate is restricted by the frequencies in the spectrum of noises of different types in the historical price of every stock.
Filtering and sampling
The preprocessing of the data using a combination of a simple moving average and fixed interval sampling prior to clustering is very rudimentary in this example. For instance, we cannot assume that the historical price of all the stocks share the same noise characteristics. The noise pattern in the quotation of momentum and heavily traded stocks is certainly different from blue-chip securities with a strong ownership, and these stocks are held by large mutual funds.
The sampling rate should take into account the spectrum of frequency of the noise. It should be set as at least twice the frequency of the noise with the lowest frequency.
The object of the test is to evaluate the impact of the sampling rate, samplingRate
, and the number of K
clusters used in the EM algorithm:
val K = 4; val period = 8 val smAve = SimpleMovingAverage[Double](period) //10 val pfnSmAve = smAve |> //11 val obs = symbolFiles.map(sym => ( for { xs <- DataSource(sym, path, true, 1) |>extractor //12 values <- pfnSmAve(xs.head) //13 y <- Try { values.view.zipWithIndex.drop(period+1).toVector .filter( _._2 % samplingRate == 0) .map( _._1).toArray //14 } } yield y).get) em(K, obs) //15
The first step is to create a simple moving average with a predefined period (line 10
), as described in the The simple moving average section in Chapter 3, Data Preprocessing. The test code instantiates the pfnSmAve
partial function that implements the moving average computation (line 11
). The symbols of the stocks under consideration are extracted from the name of the files in the path directory. The historical data is contained in the CSV file whose name is path/STOCK_NAME.csv
(line 12
).
The execution of the moving average (line 13
) generates a set of smoothed values that is sampled given a sampling rate, samplingRate
(line 14
). Finally, the expectation-maximization algorithm is instantiated to cluster the sampled data in the em
method (line 15
):
def em(K: Int, obs: DblMatrix): Int = { val em = MultivariateEM[Double](K, obs.toVector) //16 show(s"${em.toString}") //17 }
The em
method instantiates the EM algorithm for a specific number K
of clusters (line 16
). The content of the model is displayed by invoking MultivariateEM.toString
. The results are aggregated, then displayed in a textual format on the standard output (line 17
).
The first test is to execute the EM algorithm with K = 3 clusters and a sampling period of 10 on data smoothed by a simple moving average with period of 8. The sampling of historical prices of the 127 stocks between January 1, 2013 and December 31, 2013 with a frequency of 0.1 hertz produces 24 data points. The following chart displays the mean of each of the three clusters:
The mean vectors of clusters 2 and 3 have similar patterns, which may suggest that a set of three clusters is accurate enough to provide a first insight into the similarity within groups of stocks. The following is a chart of the normalized standard deviation per cluster using EM with K = 3:
The distribution of the standard deviation along with the mean vector of each cluster can be explained by the fact that the price of stocks from a couple of industries went down in synergy, while others went up as a semi-homogenous group following the announcement from the Federal Reserve that the monthly quantity of bonds purchased as part of the quantitative easing program would be reduced in the near future.
Relation to K-means
You may wonder what is the relation between EM and K-means, as both the techniques address the same problem. The K-means algorithm assigns each observation uniquely to one and only one cluster. The EM algorithm assigns an observation based on posterior probability. K-means is a special case of the EM for Gaussian mixtures [4:11].
Online learning is a powerful strategy for training a clustering model when dealing with very large datasets. This strategy has regained interest from scientists lately. The description of the online EM algorithm is beyond the scope of this tutorial. However, you may need to know that there are several algorithms for online EM available if you ever have to deal with large datasets, such as batch EM, stepwise EM, incremental EM, and Monte Carlo EM [4:12].