Labeling a set of observations for classification or regression can be a daunting task, especially in the case of a large feature set. In some cases, labeled observations are either not available or not possible to create. In an attempt to extract some hidden association or structures from observations, the data scientist relies on unsupervised learning techniques to detect patterns or similarity in data.
The goal of unsupervised learning is to discover patterns of regularities and irregularities in a set of observations. These techniques are also applied in reducing the solution space or feature set similarly to the divide-and-conquer approach commonly used in Computer Science.
There are numerous unsupervised algorithms; some are more appropriate to handle dependent features while others generate more relevant groups in the case of hidden features [4:1]. In this chapter, you will learn three of the most common unsupervised learning algorithms:
Any of these algorithms can be applied to technical analysis or fundamental analysis. Fundamental analysis of financial ratios and technical analysis of price movements are described in the Technical analysis section under Finances 101 in Appendix A, Basic Concepts. The K-means algorithm is fully implemented in Scala while expectation-maximization and principal components analysis leverage the Apache Commons Math library.
Problems involving a large number of features for large datasets become quickly intractable, and it is quite difficult to evaluate the independence between features. Any computation that requires some level of optimization and, at a minimum, computation of first order derivatives requires a significant amount of computing power to manipulate high-dimension matrices. As with many engineering fields, a divide-and-conquer approach to classifying very large datasets is quite effective. The objective is to reduce continuous, infinite, or very large datasets into a small group of observations that share some common attributes.
This approach is known as vector quantization. Vector quantization is a method that divides a set of observations into groups of similar size. The main benefit of vector quantization is that the analysis using a representative of each group is far simpler than an analysis of the entire dataset [4:2].
Clustering, also known as cluster analysis, is a form of vector quantization that relies on a concept of distance or similarity to generate groups known as clusters.
This chapter introduces two of the most commonly applied clustering algorithms:
K-means is a popular iterative clustering algorithm. The representative of each cluster is computed as the center of the cluster, known as the centroid. The similarity between observations within a single cluster relies on the concept of distance between observations.
There are many ways to measure the similarity between observations. The most appropriate measure has to be intuitive and avoid computational complexity. This section reviews three similarity measures:
The Manhattan distance is defined by the absolute distance between two variables or vectors, {xi} and {yi}, of the same size:
The implementation is generic enough to compute the distance between two arrays of elements of different types as long as an implicit conversion between each of these types to Double
values is already defined, as shown here:
def manhattan[T <% Double, U <% Double](x: Array[T], y: Array[U]): Double = (x, y).zipped.foldLeft(0.0)((s, t) => s + Math.abs(t._1 - t._2))
The ubiquitous Euclidean distance is defined as the square of the distance between two vectors, {xi} and {yi}, of the same size:
def euclidean[T <% Double, U <% Double](x: Array[T], y: Array[U]): Double = Math.sqrt((x, y).zipped.foldLeft(0.0)((s, t) => { val d = t._1 - t._2; s + d*d} ))
The cosine distance is defined as the cosine of an angle between two vectors, {xi} and {yi}, of the same size:
In this implementation, the computation of the dot product and the norms for each dataset is done simultaneously using the tuple within the fold method:
def cosine[T <% Double, U <% Double](x: Array[T], y: Array[U]): Double = {
val zeros = (0.0, 0.0, 0.0)
val norms = (x, y).zipped.foldLeft(zeros)((s, t) =>
(s._1 + t._1*t._2, s._2 + t._1*t._1, s._3 + t._2*t._2))
norms._1/Math.sqrt(norms._2*norms._3)
}
Performance of zip
The scalar product of two vectors is one of the most common operations. It is tempting to implement the dot product using the generic zip
method:
def dot (x:Array[Double], y:Array[Double]): Array[Double] = x.zip(y).map( x => f(x._1, x._2) ) An functional alternative is to use the Tuple2.zipped method. def dot(x:Array[Double], y:Array[Double]): Array[Double] = (x, y).zipped map ( _ * _)
If readability is not a primary issue, you can always implement the dot
method with a while
loop.
The main advantage of the K-means algorithm (and the reason for its popularity) is its simplicity [4:3].
The steps of the iterative algorithm are:
The configuration of the K clusters consists of defining the following parameters for the K-means algorithm: number of K clusters, the distance metrics, the maximum number of iterations, and the initial value of the cluster's centroid.
The first step is to define a cluster. A cluster is defined by the following parameters:
center
members
The following code shows the definition of a cluster:
class Cluster[T <% Double](val center: DblVector) {
val members = new ListBuffer[Int]
The cluster is responsible for managing its members (data points) at any point of the iterative computation of the K-means algorithm. It is assumed that a cluster will never contain the same data points twice.
The constructor of the Cluster
class is implemented by the apply
method in the companion object (for convenience, refer to the Class constructor template section in Appendix A, Basic Concepts):
object Cluster { def apply[T <% Double](c:DblVector):Cluster[T] = new Cluster[T](c) }
At a minimum, a cluster should be able to manage its membership of observations, update its center, and compute the variance or standard deviation of all its member observations:
def += (n:Int): Unit = members.append(n) def moveCenter(xt: XTSeries[Array[T]): Cluster[T] ={ val sums = members.map(xt(_).map(_.toDouble)).toList .transpose .map( _.sum) Cluster[T](sums.map( _ / members.size).toArray) } def stdDev(xt: XTSeries[Array[T]], distance: (DblVector, Array[T]) => Double): Double = { Stats[Double](members.map(xt( _)) .map( distance(center, _)).toArray).stdDev }
The three important methods that define the behavior of a cluster instance are as follows:
+=
: Add a member (index of an observation in the original time series).moveCenter
: Create a new cluster with the existing members and a new centroid computed as the mean of all the observations contained in the cluster.stdDev
: Compute the standard deviation (or density) of all the observations contained in the cluster relative to its center. The distance between each member and the centroid is extracted through a map, and then folded to generate the statistics. The function to compute the distance between the center and an observation is an argument of the method. The default distance is Euclidean.Cluster selection
There are different ways to select the most appropriate cluster when reassigning an observation (updating its membership). In this implementation, we will select the cluster with the larger spread or lowest density. An alternative is to select the cluster with the largest membership.
Let's declare the K-means algorithm class, KMeans
, with its public methods.
The KMeans
class takes the number of clusters, K
, and the maximum number of iterations, maxIters
, as parameters. The implicit conversion of type T
to a Double
is specified by the T <% Double
view bound. The Ordering
class has to be passed implicitly as a parameter because it is required by the sortWith
method in the initialize
and maxBy
methods. The Manifest
method is required to preserve the type erasure for Array[T]
in the JVM:
class KMeans[T <% Double](K: Int, maxIters: Int, distance: (DblVector,Array[T]) => Double)(implicit order: Ordering[T], m: Manifest[T]) extends PipeOperator[XTSeries[Array[T]], List[Cluster[T]]] { def |> : PartialFunction[XTSeries[Array[T]], List[Cluster[T]]] def initialize(xt:XTSeries[Array[T]]): List[Cluster[T]]
As with other data processing units, the extraction of K-means clusters is encapsulated by the pipe operator |>
, so clustering can be integrated into a workflow using dependency injection described in the Dependency injection section in Chapter 2, Hello World!. The initialization of the centroids of each of the K
clusters is performed by the private initialize
method.
The initialization of the cluster centroids is important to ensure fast convergence of K-means. Solutions range from the simple random generation of centroids to the application of genetic algorithms to evaluate the fitness of centroid candidates. We selected an efficient and fast initialization algorithm developed by M. Agha and W. Ashour [4:4].
The steps of the initialization are as follows:
The initialization algorithm is implemented by the private initialize
method:
def initialize(xt:XTSeries[Array[T]]): List[Cluster[T]]={ val stats = statistics(xt) //1 val maxSDevDim = Range(0,stats.size).maxBy (stats( _ ).stdDev)//2 val rankedObs = xt.zipWithIndex .map(x=> (x._1(maxSDevDim), x._2)) //2 .sortWith( _._1 < _._1) //3 val halfSegSize = ((rankedObs.size>>1)/K).floor.toInt //4 val centroids = rankedObs.filter(isContained( _, halfSegSize, rankedObs.size) ).map(n => xt(n._2)) //6 Range(0, K).foldLeft(List[Cluster[T]]())((xs, i) => Cluster[T](centroids(i)) :: xs) //7 }
Let's deconstruct the implementation of the Agha-Ashour algorithm in the initialize
method.
The statistics
function is applied to the input time series to extract the standard deviation for each dimension in the observations set (line 1
). The dimension with the maxSDevDim
maximum variance or standard deviation is computed by using the maxBy
method on a Stats
instance (line 2
). Then, the observations are ranked by the increasing value of the standard deviation, rankedObs
(line 3
).
The ordered sequence of observations is then broken into xt.size/K segments (line 4
) and the indices of the centroids
are selected as the midpoint (or median) observations of those segments using the filtering condition, isContained
:
def isContained(t: (T,Int), hSz: Int, dim: Int): Boolean =
((t._2 % hSz == 0) && (t._2 %(hSz<<1) != 0)
The indices of the centroid in the time series are converted to actual observations using a map
method (line 6
). Finally, the list of clusters is generated using a fold (foldLeft
) method on the range of cluster indices (0, K-1) (line 7
).
The second step in the K-means algorithm is the assignment of the observations to the clusters for which the centroids have been initialized in step 1. This feat is accomplished by the private assignToClusters
method:
def assignToClusters(xt: XTSeries[Array[T]], clusters: List[Cluster[T]], membership: Array[Int]): Int = { xt.toArray .zipWithIndex .filter(x => { //1 val nearestCluster = getNearestCluster(clusters, x._1)//2 val reassigned = nearestCluster != membership(x._2) clusters(nearestCluster) += x._2 //3 membership(x._2) = nearestCluster //4 reassigned }).size }
The core of the assignment of observations to each cluster is the filter on the time series (line 1
). The filter computes the index of the closest cluster and checks whether the observation is to be reassigned (line 2
). The observation at the index x._2
is added to the nearest cluster, clusters(nearestCluster)
(line 3
). The current membership of the observations is then updated (line 4
).
The cluster closest to an observation data is computed by the getNearestCluster
method as follows:
def getNearestCluster(clusters: List[Cluster[T]], x:Array[T]): Int={ clusters.zipWithIndex..foldLeft((Double.MaxValue,0))((p,c) => { val measure = distance(c._1.center, x) if(measure < p._1) (measure, c._2) else p })._2
A fold is used to extract from the list of clusters the cluster that is closest to the observation x
using the distance
metric defined in the K-means constructor.
The final step is to implement the iterative computation of the reconstruction error. In this implementation, the iteration terminates when no more observations are reassigned to different clusters. As with other data processing units, the extraction of K-means clusters is encapsulated by the pipe operator |>, so that clustering can be integrated into a workflow using dependency injection described in the Dependency Injection section in Chapter 2, Hello World!.
The generation of the K clusters is executed by the data transformation |>
:
def |> :PartialFunction[XTSeries[Array[T]], List[Cluster[T]]] = { case xt: XTSeries[Array[T]] if(xt.size>2 && xt(0).size>0) => { val clusters = initialize(xt) //1 if( clusters.isEmpty) List.empty else { val membership = Array.fill(xt.size)(0) val reassigned = assignToClusters(xt,clusters,membership)//2 var newClusters: List[Cluster[T]] = List.empty Range(0, maxIters).find( _ => { newClusters = clusters.map( c => { if( c.size > 0) c.moveCenter(xt, dimension(xt)) else clusters.filter( _.size > 0) .maxBy( _.stdDev(xt, distance)) }) //3 assignToClusters(xt, newClusters, membership) == 0 }) match { case Some(index) => newClusters case None => { … } } //4 } }
As described in the algorithm overview section, the main method initializes the membership for all the observations (line 1
), creates and initializes the clusters, and assigns the observations to clusters using the assignToClusters
method (line 2
). The iteration updates the content of each cluster using the moveCenter
method, by assigning new observations to the cluster with the highest standard deviation (line 3
). The iterative loop exits when no more reassignment is needed (line 4
).
K-means algorithm exit condition
In some rare instances, the algorithm may reassign the same few observations between clusters, preventing its convergence toward a solution in a reasonable time. Therefore, it is recommended to add a maximum number of iterations as an exit condition. If K-means does not converge with the maximum number of iterations, then the cluster centroids need to be reinitialized and the iterative process needs to be executed once again.
The companion object for KMeans
implements the apply
constructor and the computation of the stdDev
standard deviation for each cluster. The default constructor uses the Euclidean distance:
def apply[T <% Double](K: Int, maxIters: Int)(implicit order: Ordering[T], m: Manifest[T]): KMeans[T] = new KMeans[T](K, maxIters, euclidean) def stdDev[T](c: List[Cluster[T]], xt: XTSeries[Array[T]]): List[Double] = c.map( _.stdDev(xt))
The stdDev
method computes the standard deviation of the distances between each data point that belongs to a c
cluster and its centroid.
Note that ordering a trait and Manifest
have to be provided in the apply
constructor because there is no guarantee that such capabilities are provided in runtime by the client code.
A model with a significant number of features (high dimensions) requires a larger number of observations in order to extract robust clusters. K-means clustering with very small datasets, of size less than 50, produces models with high bias and a limited number of clusters that are affected by the order of observations [4:5]. I have been using the following simple empirical rule of thumb for a training set of size n, expected K clusters, and N features: n < K.N.
Whichever empirical rule you follow, such a restriction is particularly an issue for analyzing stocks using historical quotes. Let's consider our examples of using technical analysis to categorize stocks according to their price behavior over a period of 1 year (or approximately 250 trading days). The dimension of the problem is 250 (250 daily closing prices). The number of stocks (observations) would have exceeded several hundred!
There are options to get around this limitation and shrink the number of observations; among them are:
These approaches are workaround solutions at best, used for the sake of this tutorial, but they are not recommended for actual commercial analytical applications. The principal component analysis introduced in the last section of this chapter is one of the most reliable dimension reduction techniques.
The objective is to extract clusters from a set of stock price actions during a period of time between January 1 and Dec 31, 2013 as features. For this test, 127 stocks are randomly selected from the S&P 500 list. The following chart visualizes the behavior of the normalized price of a subset of these 127 stocks:
The key is to select the appropriate features prior to clustering and the time window to operate on. It would make sense to consider the entire historical price over the 252 trading days as a feature. However, the number of observations (stocks) is too limited to use the entire price range. The (SAMPLES = 50)
observations are the stock closing price for each trading session between the 80th and 130th days. The adjusted daily closing prices are normalized using the minimum and maximum values.
First, let's create a simple function to execute the K-means algorithm:
Val MAX_ITERS = 150 def run(K: Int, obs: DblMatrix): Unit = { val kmeans = KMeans[Double](K, MAX_ITERS) //1 val clusters = kmeans |> XTSeries[DblVector](obs) //2 clusters.foreach( _.center.foreach( show( _ ))) //3 clusters.map( _.stdDev(XTSeries[DblVector](obs, euclidean))).foreach( show( _ ) ) //4 }
The KMeans
class is first initialized with a number of clusters, K
, and a maximum number of iterations, MAX_ITERS
(line 1
). These two parameters are domain and problem specific. The clustering algorithm is executed (line 2
) returning a list of clusters. The clusters' centroid information is then displayed (line 3
) and the standard deviation is computed for each of the clusters for a given number of clusters, K
, and observations, obs
(line 4
).
Let's load the data from CSV files using the DataSource
class (refer to the Data extraction section in Appendix A, Basic Concepts):
final val path = "resources/data/chap4/" val extractor = YahooFinancials.adjClose :: List[Array[String] =>Double]() // 5 def symbols = DataSource.listSymbols(path) //6 final val START = 80 final val SAMPLES = 50 val normalize=true val prices = symbols.map(s =>DataSource(s,path,normalize) |> extractor) //7 prices.find(_.isEmpty) match { //8 case Some(noPrice) = { … } case None => { val values = prices. map(x => x(0)) .map(_.drop(START).take(SAMPLES)) args.map(_.toInt) foreach( run(_, values)) //9 } }
As mentioned earlier, the cluster analysis applies to the closing price in the range between the 80th and 130th trading day. The extractor is defined to extract the adjusted closing price for a stock whose price information is retrieved from YahooFinancials
(line 5
). The list of stock symbols is used to extract price information from CSV files located at the path (line 6
). For instance, the ticker symbol for General Electric Corp. is GE and the trading data is located in GE.csv
.
The 50 daily prices for each stock are extracted by an instance of DataSource
(line 7
). The run
method introduced earlier is invoked either for each stock or as soon as K-means fails through an exit condition in the find
method (line 8
). The normalized data values.toArray
for the specific time window is extracted by the combination of calls to drop and take Scala array methods (line 9
).
The first test run is executed with K=3 clusters. The mean (or centroid) vector for each cluster is plotted as follows:
The means vectors of the three clusters are quite distinctive. The top and bottom means 1 and 2 in the chart have the respective standard deviation of 0.34 and 0.27 and share a very similar pattern. The difference between the elements of the 1 and 2 cluster mean vectors is almost constant: 0.37. The cluster with a mean vector 3 represents the group of stocks that behave like the stocks in cluster 2 at the beginning of the time period, and behave like the stocks in cluster 1 towards the end of the time period.
This behavior can be easily explained by the fact that the time window or trading period, the 80th to 130th trading day, correspond to the shift in the monetary policy of the federal reserve in regard to the quantitative easing program. Here is the list of stocks for each of the clusters whose centroid values are displayed on the chart:
Cluster |
List of stocks |
---|---|
Cluster 1 |
AET, AHS, BBBY, BRCM, C, CB, CL, CLX, COH, CVX, CYH, DE, DG, DHI, DO, DUK, EA, EBAY, EXC, EXP, FE, GLW, GPS, IBM, JCP, JNJ, JWN, K, KF, KMI, KO, KRFT, LEN, LINC, LRCX, MSFT, NVMI, THC, XRT |
Cluster 2 |
AA, AAPL, ADBE, ADSK, AFAM, AMZN, AU, BHI, BTU, CAT, CCL, CCMP, COP, CSC, CU, DOW, EMR, ENTG, ETFC, FCX, FDX, FFIV, FISV, FLIR, FLR, FLS, FTR, GLD, GRMN, GT, JCI, QCOM, QQQ, SIL, SLV, SLW |
Cluster 3 |
ADM, ADP, AXP, BA, BBT, BEN, BK, BSX, CA, CBS, CCE, CELG, CHK, CI, CME, CMG, CSCO, CVS, DAL, DD, DNB, EMC, EXPE, F, FDO, FITB, FMC, GCI, GE, GM, GME, GS, HCA, JNPR, JPM, KLAC, LH, LLL, LM, LMT, LNC, LO, MKSI, MU, NEM, TRW, TXN, UNH, WDC, XLF, XLNX, ZNGA |
Let's evaluate the impact of the number of clusters K on the characteristics of each cluster.
We repeat the previous test on the 127 stocks and the same time window with the number of clusters varying from 2 to 15.
The mean (or centroid) vector for each cluster is plotted as follows for K = 2:
The chart of the results of the K-means algorithms with 2 clusters shows that the mean vector for the cluster labeled 2 is similar to the mean vector labeled 3 on the chart with K = 3 clusters. However, the cluster with the mean vector 1 reflects somewhat the aggregation or summation of the mean vectors for the clusters 1 and 3 in the chart K =3. The aggregation effect explains why the standard deviation for the cluster 1, 0.55, is twice as much as the standard deviation for the cluster 2, 0.28.
The mean (or centroid) vector for each cluster is plotted as follows for K = 5:
In this chart, we can assess that the clusters 1 (with the highest mean), 2 (with the lowest mean), and 3 are very similar to the clusters with the same labels in the chart for K =3. The cluster with the mean vector 4 contains stocks whose behaviors are quite similar to those in cluster 3, but in the opposite direction. In other words, the stocks in cluster 3 and 4 reacted in opposite ways following the announcement of the change in the monetary policy.
In the tests with high values of K, the distinction between the different clusters becomes murky, as shown in the following chart for K = 10:
The means for clusters 1, 2, and 3 seen in the first chart for the case K = 3 are still visible. It is fair to assume that these are very likely the most reliable clusters. These clusters happened to have a low standard deviation or high density.
Let's define the density of a cluster Cj with a centroid cj as the inverse of the Euclidean distance between all members of each cluster and its mean (or centroid):
The density of the cluster is plotted against the number of clusters with K = 1 to 13:
As expected, the average density of each cluster increases as K increases. From this experiment, we can draw the simple conclusion that the density of each cluster does not significantly increase in the test runs for K =5 and beyond. You may observe that the density does not always increase as the number of clusters increases (K = 6 to K = 11). The anomaly can be explained by the following three factors:
There are several methodologies to validate the output of a K-means algorithm from purity to mutual information [4:6]. One effective way to validate the output of a clustering algorithm is to label each cluster and run those clusters through a new batch of labeled observations. For example, if during one of these tests you find that one of the clusters CC contains most of the commodity-related stocks, then you can select another commodity-related stock, SC, which is not part of the first batch, and run the entire clustering algorithm again. If SC is contained in CC, then the clustering has performed as expected. If this is the case, you should run a new set of stocks, some of which are commodity related, and measure the number of true positives, true negatives, false positives, and false negatives. The precision, recall, and F1 measures introduced in the Assessing a model section of Chapter 2, Hello World!, confirms whether the tuning parameters and labels you selected for your cluster are indeed correct.
Validation
The quality of the clusters, as measured by the F1 statistics, depends on the labeling of the cluster and the rule (that is, label a cluster with the industry with the highest relative percentage of stocks in the cluster) used to assign a label. This process is very subjective. The only sure way to validate a validation methodology is to evaluate several labeling schemes and select the one that generates the highest F1 statistics.
We reviewed some of the tuning parameters that impact the quality of the results of the K-means clustering. They are as follows:
In some cases, the similarity criterion (that is, Euclidean distance versus cosine value) can have an impact on the cleanness or density of the clusters.
The final and important consideration is the computational complexity of the K-means algorithm. The previous sections of the chapter described some of the performance issues with K-means and possible remedies.
Despite its many benefits, the K-means algorithm does not handle missing data or unobserved features very well. Features that depend on each other indirectly may in fact depend on a common hidden (also known as latent) variable. The expectation-maximization algorithm described in the next section addresses some of these limitations.
The expectation-maximization algorithm was originally introduced to estimate the maximum likelihood in the case of incomplete data [4:7]. It is an iterative method to compute the model features that maximize the likely estimate for observed values, taking into account unobserved values.
The iterative algorithm consists of computing:
The expectation-maximization algorithm is applied to solve clustering problems by assuming that each latent variable follows a Normal or Gaussian distribution. This is similar to the K-means algorithm for which the distance of each data point to the center of each cluster follows a Gaussian distribution [4:8]. Therefore, a set of latent variables is a mixture of Gaussian distributions.
Latent variables Z can be visualized as the behavior (or symptoms) of a model (observed) X for which Z are the root causes of the behavior:
The latent values Z follow a Gaussian distribution. For the statisticians among us, the mathematics of a mixture model is described in the following information box.
The mixture model
If {xi} is a set of observed features associated with latent features {zk}, the probability for the feature xi given zk has a value j:
The probability p is called the base distribution. If we extend to the entire model, θ= {xi, zk}, the conditional probability is defined as follows:
The most widely used mixture model is the Gaussian mixture model that represents the base distribution p as a Normal distribution and the conditional probability as a weighted Normal multivariate distribution:
As far as the implementation is concerned, the expectation-maximization algorithm can be broken down into three stages:
A formal, detailed, but short mathematical formulation of the EM algorithm can be found in S. Borman's tutorial [4:9].
Let's implement the three steps (LL, E-step, and M step) in Scala. The internal calculations of the EM algorithm are a bit complex and overwhelming. You may not benefit much from the details of a specific implementation such as computation of the eigenvalues of the covariance matrix of the expectation of the log likelihood. This implementation hides some complexities by using the Apache Commons Math library package [4:10].
First, let's define convenient internal types:
type EM = MultivariateNormalMixtureExpectationMaximization type EMOutput = List[(Double, DblVector, DblVector)] import scala.collections.JavaConversions._ //1
The constructor of the MultivariateEM
class uses the standard template for machine learning algorithm classes:
PipeOperator
Here is an implementation of the constructor of MultivariateEM:
class MultivariateEM[T <% Double](K: Int) extends PipeOperator[XTSeries[Array[T]], EMOutput]
The Apache Commons Math Java implementation of the EM uses Java container classes that need to be explicitly converted to Scala collections. Those conversions are defined in the JavaConversions
package (line 1
).
The implementation of the EM algorithm in the data transformation |>
operator uses the Apache Commons Math MultivariateNormalMixture
class for the Gaussian mixture model and the MultivariateNormalMixtureExpectationMaximization
class for the EM algorithm:
def |> : PartialFunction[XTSeries[Array[T]], EMOutput] = { case xt: XTSeries[Array[T]] if(xt.size>0 && dimension(xt)>0) =>{ val data: DblMatrix = xt //2 val multivariateEM = new EM(data) val est = MultivariateNormalMixtureExpectationMaximization .estimate(data, K) multivariateEM.fit(est) //3 val newMixture = multivariateEM.getFittedModel //4 val components = newMixture.getComponents.toList //5 components.map(p => (p.getKey.toDouble, p.getValue.getMeans, p.getValue.getStandardDeviations) ))//6 ….
Let's look at the main |>
method of the MultivariateEM
wrapper class. The first step is to convert the time series into a primitive matrix of Double
with observations and historical quotes as rows and the stock symbols as columns (line 2
).
The initial mixture of Gaussian distributions can be provided by the user or can be extracted from the dataset as an estimate (line 3
). The getFittedModel
model triggers the M-step (line 4
).
The Apache library uses Java primitives that need to be converted to Scala types using the package import scala.collection.JavaConversions
. An instance of java.util.List
is converted to scala.collection.immutable.List
using toList
, which invokes the asScalaIterator
method of WrapAsScala
, one of the base traits of JavaConversions
(line 5
).
The <Double, MultivariateNormalDistribution>
key-value pair, returned by the call to getFittedModel
by the Apache math
method, is to be converted to a tuple containing the mean and standard deviation for each cluster (line 6
).
Third-party library exceptions
Scala does not enforce the declaration of exceptions as part of the signature of a method. Therefore, there is no guarantee that all types of exceptions will be caught locally. This problem occurs when exceptions are thrown from a third-party library in two scenarios:
One easy workaround is to leverage the Scala exception-handling mechanism:
Try { .. } match { case Success(results) => … case Failure(exception) => ... }
Let's apply the MultivariateEM
class to the clustering of the same 127 stocks used in evaluating the K-means algorithm.
As discussed in the paragraph related to the curse of dimensionality, the number of stocks (127) to analyze restricts the number of observations to be used by the EM algorithm. A simple option is to filter out some of the noise of the stocks and apply a basic sampling method. The maximum sampling rate is restricted by the frequencies in the spectrum of noises of different types in the historical price of every stock.
Filtering and sampling
The preprocessing of the data using a combination of a simple moving average and fixed interval sampling prior to clustering is very rudimentary in this example. For instance, we cannot assume that the historical quotes of all the stocks share the same noise characteristics. The noise pattern in the quotation of momentum and heavily traded stocks is certainly different from blue-chip securities with a strong ownership, and these stocks are held by large mutual funds.
The sampling rate should take into account the spectrum of frequency of the noise. It should be set as at least twice the frequency of the noise with the lowest frequency.
The object of the test is to evaluate the impact of the sampling rate, samplingRate
, and the number K
of clusters used in the EM algorithm:
val extractor = YahooFinancials.adjClose :: List[Array[String] =>Double]() //1 val period = 8 val samplingRate = 10 val smAv = SimpleMovingAverage[Double](period) //2 val obs = DataSource.listSymbols(path).map(sym => { //3 val xs = DataSource(sym, path, true) |> extractor //2 val values : XTSeries[Double] = XTSeries.|>(xs)).head //4 val filtered = smAv |> values filtered.zipWithIndex //5 .drop(period+1).toArray //6 .filter( _._2%samplingRate==0) .map( _._1) })
The first step is to extract the historical quotes for all the stocks using the same extractor
as in the K-means test case (line 1
).
The symbols of the stocks under consideration are extracted from the name of the files in the path directory. The historical data is contained in the CSV file named path/STOCK_NAME.csv
(line 3
). An implicit conversion is triggered by an assignment of values of the type XTSeries[Double]
(line 4
). The simple moving average algorithm zeroed out the first period values in the smoothed data, filtered
(line 5
). Those null values have to be dropped before applying the sampling (line 6
).
The first test is to execute the EM algorithm with K=3 clusters and a sampling period of 10 on data smoothed by a simple moving average with a period of 8:
MultivariateEM[Double](K) |> XTSeries[DblVector](obs) foreach (…)
The driver prints the key (line 3
), the mean (coordinates of the centroid vector) (line 4
), and the standard deviation for each component (cluster).
The sampling of historical prices of the 127 stocks between January 1, 2013 and December 31, 2013 with a frequency of 0.1 hertz produces 24 data points. The following chart displays the mean or centroid of each of the 3 clusters:
The mean vectors of clusters 2 and 3 have similar patterns, which may suggest that 2 components or clusters could provide a first insight into the similarity within groups of stocks. The following is a chart of the normalized standard deviation per cluster using EM K = 3:
The distribution of the standard deviation along the mean vector of each cluster can be explained by the fact that the price of stocks from a couple of industries went down in synergy, while others went up as a semihomogenous group following the announcement from the Federal Reserve that the monthly quantity of bonds purchased as part of the quantitative easing program would be reduced in the near future.
Relation to K-means
You may wonder what is the relation between EM and K-means as both techniques address the same problem. The K-means algorithm assigns each observation uniquely to one and only one cluster. The EM algorithm assigns an observation based on posterior probability. K-means is a special case of the EM for Gaussian mixtures [4:11].
Online learning is a powerful strategy for training a clustering model when dealing with very large datasets. This strategy has regained interest from scientists lately. The description of online EM is beyond the scope of this tutorial. However, you may need to know that there are several algorithms available for online EM if you ever have to deal with large datasets: batch EM, stepwise EM, incremental EM, and Monte Carlo EM [4:12].