Clustering is a type of machine learning that groups items based on their similarity. In this example, we will cluster the stocks in the Dow Jones Industrial Average by their log returns. Most of the steps in this recipe have already been covered in previous chapters.
First, we will download the end-of-day (EOD) price data for those stocks from Yahoo Finance. Second, we will calculate a square affinity matrix. Finally, we will cluster the stocks with the AffinityPropagation class.
We will download price data for 2011 using the stock symbols of the Dow Jones Industrial Average (DJI). In this example, we are only interested in the closing price:

    # Date range: 2011 to 2012
    start = datetime.datetime(2011, 1, 1)
    end = datetime.datetime(2012, 1, 1)

    # Dow Jones symbols
    symbols = ["AA", "AXP", "BA", "BAC", "CAT", "CSCO", "CVX",
               "DD", "DIS", "GE", "HD", "HPQ", "IBM", "INTC",
               "JNJ", "JPM", "KFT", "KO", "MCD", "MMM", "MRK",
               "MSFT", "PFE", "PG", "T", "TRV", "UTX", "VZ",
               "WMT", "XOM"]

    # One record array of daily quotes per symbol
    quotes = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
              for symbol in symbols]

    # Keep only the close prices: one row per stock, one column per trading day
    close = numpy.array([q.close for q in quotes]).astype(float)

Calculate the similarities between the stocks, using the log returns as the metric. For each pair of stocks, we compute the negative squared Euclidean distance between their log-return series, so that more similar stocks get a larger (less negative) value:
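
    # Daily log returns: differences of consecutive log prices along each row
    logreturns = numpy.diff(numpy.log(close))
    print(logreturns.shape)

    # Squared norm of each stock's log-return series
    logreturns_norms = numpy.sum(logreturns ** 2, axis=1)

    # S[i, j] = -|x_i|^2 - |x_j|^2 + 2 * (x_i . x_j) = -|x_i - x_j|^2
    S = (- logreturns_norms[:, numpy.newaxis]
         - logreturns_norms[numpy.newaxis, :]
         + 2 * numpy.dot(logreturns, logreturns.T))
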
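
The expression for S expands the negative squared Euclidean distance, -|x_i - x_j|^2 = -|x_i|^2 - |x_j|^2 + 2 (x_i . x_j), for all pairs of rows at once. The following sketch checks that identity on a small random array standing in for the log returns. The array, its shape, and the seed are illustrative assumptions, not the recipe's data:

```python
import numpy as np

# Synthetic stand-in for the log returns: 4 "stocks" x 6 "days" (not market data)
rng = np.random.default_rng(0)
x = rng.normal(scale=0.01, size=(4, 6))

# Expansion used in the recipe: -|a|^2 - |b|^2 + 2 a.b
norms = np.sum(x ** 2, axis=1)
S = -norms[:, np.newaxis] - norms[np.newaxis, :] + 2 * np.dot(x, x.T)

# Direct computation: negative squared distance between every pair of rows
pairwise_diff = x[:, np.newaxis, :] - x[np.newaxis, :, :]
S_direct = -np.sum(pairwise_diff ** 2, axis=2)

same = np.allclose(S, S_direct)
diag_is_zero = np.allclose(np.diag(S), 0.0)
print(same, diag_is_zero)
```

The diagonal is zero (up to rounding) because every series has zero distance to itself, and all other entries are negative.
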
Give the AffinityPropagation class the result from the previous step. This class labels the data points (in our case, the stocks) with the appropriate cluster number:
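
    # S is a ready-made similarity matrix; affinity='precomputed' (available in
    # recent scikit-learn releases) makes fit() use it as is
    aff_pro = sklearn.cluster.AffinityPropagation(affinity='precomputed').fit(S)
    labels = aff_pro.labels_

    # Print the cluster number assigned to each stock
    for symbol, label in zip(symbols, labels):
        print('%s in Cluster %d' % (symbol, label))
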
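
Because the Yahoo Finance download depends on an external service, it can help to try the clustering step on data you control first. The following is a minimal sketch under stated assumptions: the two "sectors", the tickers A1 to B3, the noise levels, and the seeds are all made up, and the `affinity="precomputed"` and `random_state` arguments assume a reasonably recent scikit-learn release:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(42)

# Two hypothetical sectors: a shared daily-return pattern plus small noise
n_days = 50
sector_a = rng.normal(scale=0.01, size=n_days)
sector_b = rng.normal(scale=0.01, size=n_days)
returns = np.vstack(
    [sector_a + rng.normal(scale=0.001, size=n_days) for _ in range(3)]
    + [sector_b + rng.normal(scale=0.001, size=n_days) for _ in range(3)]
)
names = ["A1", "A2", "A3", "B1", "B2", "B3"]  # made-up tickers

# Same similarity as in the recipe: negative squared Euclidean distance
norms = np.sum(returns ** 2, axis=1)
S = -norms[:, np.newaxis] - norms[np.newaxis, :] + 2 * np.dot(returns, returns.T)

# Cluster on the precomputed similarities
model = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
labels = model.labels_

# Each cluster is represented by one of its own members, the exemplar
exemplars = [names[i] for i in model.cluster_centers_indices_]

for name, label in zip(names, labels):
    print("%s in Cluster %d" % (name, label))
print("Exemplars:", exemplars)
```

With series this clearly separated, the A and B tickers should not share a cluster. The number of clusters is not passed in; affinity propagation determines it from the similarities and the `preference` setting, which is left at its default here.
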
The complete clustering program is as follows:

    import datetime
    import numpy
    import sklearn.cluster
    # Note: matplotlib.finance was removed in matplotlib 2.2,
    # so this import requires an older release
    from matplotlib import finance

    # 1. Download price data

    # Date range: 2011 to 2012
    start = datetime.datetime(2011, 1, 1)
    end = datetime.datetime(2012, 1, 1)

    # Dow Jones symbols
    symbols = ["AA", "AXP", "BA", "BAC", "CAT", "CSCO", "CVX",
               "DD", "DIS", "GE", "HD", "HPQ", "IBM", "INTC",
               "JNJ", "JPM", "KFT", "KO", "MCD", "MMM", "MRK",
               "MSFT", "PFE", "PG", "T", "TRV", "UTX", "VZ",
               "WMT", "XOM"]

    quotes = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
              for symbol in symbols]

    # One row per stock, one column per trading day
    close = numpy.array([q.close for q in quotes]).astype(float)
    print(close.shape)

    # 2. Calculate affinity matrix
    logreturns = numpy.diff(numpy.log(close))
    print(logreturns.shape)

    logreturns_norms = numpy.sum(logreturns ** 2, axis=1)

    # S[i, j] is the negative squared Euclidean distance between stocks i and j
    S = (- logreturns_norms[:, numpy.newaxis]
         - logreturns_norms[numpy.newaxis, :]
         + 2 * numpy.dot(logreturns, logreturns.T))

    # 3. Cluster using affinity propagation
    # affinity='precomputed' (recent scikit-learn releases) makes fit() use S as is
    aff_pro = sklearn.cluster.AffinityPropagation(affinity='precomputed').fit(S)
    labels = aff_pro.labels_

    for symbol, label in zip(symbols, labels):
        print('%s in Cluster %d' % (symbol, label))

The output with the cluster numbers for each stock is as follows:

    (30, 252)
    (30, 251)
    AA in Cluster 0
    AXP in Cluster 6
    BA in Cluster 6
    BAC in Cluster 1
    CAT in Cluster 6
    CSCO in Cluster 2
    CVX in Cluster 7
    DD in Cluster 6
    DIS in Cluster 6
    GE in Cluster 6
    HD in Cluster 5
    HPQ in Cluster 3
    IBM in Cluster 5
    INTC in Cluster 6
    JNJ in Cluster 5
    JPM in Cluster 4
    KFT in Cluster 5
    KO in Cluster 5
    MCD in Cluster 5
    MMM in Cluster 6
    MRK in Cluster 5
    MSFT in Cluster 5
    PFE in Cluster 7
    PG in Cluster 5
    T in Cluster 5
    TRV in Cluster 5
    UTX in Cluster 6
    VZ in Cluster 5
    WMT in Cluster 5
    XOM in Cluster 7

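
The flat listing is easier to interpret when the symbols are grouped by cluster. The following sketch does that with a dictionary. The labels are transcribed from the output above, and the variable names are only illustrative:

```python
from collections import defaultdict

symbols = ["AA", "AXP", "BA", "BAC", "CAT", "CSCO", "CVX", "DD", "DIS", "GE",
           "HD", "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT", "KO", "MCD", "MMM",
           "MRK", "MSFT", "PFE", "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]

# Cluster numbers copied from the printed output, in the same order as symbols
labels = [0, 6, 6, 1, 6, 2, 7, 6, 6, 6,
          5, 3, 5, 6, 5, 4, 5, 5, 5, 6,
          5, 5, 7, 5, 5, 5, 6, 5, 5, 7]

# Map each cluster number to the list of symbols assigned to it
clusters = defaultdict(list)
for symbol, label in zip(symbols, labels):
    clusters[label].append(symbol)

for label in sorted(clusters):
    print("Cluster %d: %s" % (label, ", ".join(clusters[label])))
```

Grouped this way, the output shows five single-stock clusters (0 to 4), a three-stock cluster (7: CVX, PFE, XOM), and two large clusters (5 and 6).
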
The following table is an overview of the functions we used in this recipe:
| Function | Description |
|---|---|
| sklearn.cluster.AffinityPropagation() | Creates an AffinityPropagation object. |
| sklearn.cluster.AffinityPropagation.fit() | Applies affinity propagation clustering. Unless a precomputed affinity matrix is supplied, one is computed from Euclidean distances. |
| numpy.diff() | Calculates differences of numbers within a NumPy array. If not specified, first-order differences are computed. |
| numpy.log() | Calculates the natural log of elements in a NumPy array. |
| numpy.sum() | Sums the elements of a NumPy array. |
| numpy.dot() | Does matrix multiplication for 2D arrays. Calculates the inner product for 1D arrays. |
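
The NumPy behaviours listed in the table can be seen on a tiny array. The prices below are invented for illustration only:

```python
import numpy as np

# Two made-up price series, three days each
prices = np.array([[100.0, 110.0, 121.0],
                   [50.0, 50.0, 25.0]])

# diff: first-order differences by default, along the last axis
first = np.diff(prices)        # [[10., 11.], [0., -25.]]
second = np.diff(prices, n=2)  # [[1.], [-25.]]

# log + diff: log returns, one column fewer than the price array
log_returns = np.diff(np.log(prices))

# sum: axis=1 sums each row separately
row_sums = np.sum(prices, axis=1)  # [331., 125.]

# dot: inner product for 1D inputs, matrix product for 2D inputs
inner = np.dot(prices[0], prices[1])  # 13525.0
gram = np.dot(prices, prices.T)       # 2 x 2 matrix of row inner products

print(first, second, log_returns.shape, row_sums, inner, gram.shape)
```

The shape change from `prices` to `log_returns` mirrors the (30, 252) and (30, 251) shapes printed by the recipe.
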