Clustering Dow Jones stocks with scikit-learn

Clustering is a type of machine learning algorithm that aims to group items based on their similarities. In this example, we will cluster the stocks in the Dow Jones Industrial Average by their log returns. Most of the steps in this recipe have already been covered in previous chapters.

How to do it...

First, we will download the end-of-day (EOD) price data for those stocks from Yahoo Finance. Second, we will calculate a square affinity matrix. Finally, we will cluster the stocks with the AffinityPropagation class.

  1. Downloading the price data.

    We will download price data for 2011 using the stock symbols of the Dow Jones Industrial Average. In this example, we are only interested in the closing price:

    # 2011 to 2012
    start = datetime.datetime(2011, 1, 1)
    end = datetime.datetime(2012, 1, 1)
    
    # Dow Jones symbols
    symbols = ["AA", "AXP", "BA", "BAC", "CAT", "CSCO", "CVX", "DD", "DIS", "GE", "HD", "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT", "KO", "MCD", "MMM", "MRK", "MSFT", "PFE", "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]
    
    quotes = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
        for symbol in symbols]
    
    # One row of closing prices per stock
    close = numpy.array([q.close for q in quotes]).astype(float)

  2. Calculating the affinity matrix.

    Calculate the similarities between the stocks using their log returns. The similarity of two stocks is the negative squared Euclidean distance between their log return vectors, so stocks with similar returns get a value close to zero:

    logreturns = numpy.diff(numpy.log(close))
    print(logreturns.shape)
    
    # -|a|^2 - |b|^2 + 2 a.b equals -|a - b|^2 for every pair of rows
    logreturns_norms = numpy.sum(logreturns ** 2, axis=1)
    S = - logreturns_norms[:, numpy.newaxis] - logreturns_norms[numpy.newaxis, :] + 2 * numpy.dot(logreturns, logreturns.T)

  3. Clustering the stocks.

    Pass the result from the previous step to the AffinityPropagation class. This class labels the data points (in our case, stocks) with the appropriate cluster number:

    aff_pro = sklearn.cluster.AffinityPropagation().fit(S)
    labels = aff_pro.labels_
    
    for symbol, label in zip(symbols, labels):
        print('%s in Cluster %d' % (symbol, label))
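
To see why the expression in step 2 yields negative squared Euclidean distances, note that -|a|^2 - |b|^2 + 2 a.b = -|a - b|^2. The following is a minimal sketch that checks this identity numerically. The prices are made up for illustration (three hypothetical stocks over five days) and are not real quotes; the name S_check is also introduced only for this example.

```python
import numpy

# Hypothetical closing prices: 3 made-up "stocks" over 5 days (not real data).
close = numpy.array([
    [10.0, 10.5, 10.2, 10.8, 11.0],
    [20.0, 20.4, 20.1, 20.9, 21.3],
    [5.0, 4.8, 5.1, 4.9, 4.7],
])

# Log returns: differences of consecutive log prices along each row.
logreturns = numpy.diff(numpy.log(close))

# Similarity matrix computed with the same expression as in the recipe.
norms = numpy.sum(logreturns ** 2, axis=1)
S = (-norms[:, numpy.newaxis] - norms[numpy.newaxis, :]
     + 2 * numpy.dot(logreturns, logreturns.T))

# Brute-force negative squared Euclidean distances for comparison.
n = len(close)
S_check = numpy.zeros((n, n))
for i in range(n):
    for j in range(n):
        delta = logreturns[i] - logreturns[j]
        S_check[i, j] = -numpy.dot(delta, delta)

print(logreturns.shape)
print(numpy.allclose(S, S_check))
```

Each stock has similarity zero with itself, and all other entries are negative. Note that recent scikit-learn releases interpret the argument of fit as feature vectors unless the estimator is created with affinity='precomputed', so passing S directly may require that option depending on the installed version.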

The complete clustering program is as follows:

import datetime
import numpy
import sklearn.cluster
from matplotlib import finance

# 1. Download price data

# 2011 to 2012
start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2012, 1, 1)

# Dow Jones symbols
symbols = ["AA", "AXP", "BA", "BAC", "CAT",
  "CSCO", "CVX", "DD", "DIS", "GE", "HD",
  "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT",
  "KO", "MCD", "MMM", "MRK", "MSFT", "PFE",
  "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]

quotes = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
    for symbol in symbols]

# One row of closing prices per stock
close = numpy.array([q.close for q in quotes]).astype(float)
print(close.shape)

# 2. Calculate affinity matrix
logreturns = numpy.diff(numpy.log(close))
print(logreturns.shape)

# -|a|^2 - |b|^2 + 2 a.b equals -|a - b|^2 for every pair of rows
logreturns_norms = numpy.sum(logreturns ** 2, axis=1)
S = - logreturns_norms[:, numpy.newaxis] - logreturns_norms[numpy.newaxis, :] + 2 * numpy.dot(logreturns, logreturns.T)

# 3. Cluster using affinity propagation
aff_pro = sklearn.cluster.AffinityPropagation().fit(S)
labels = aff_pro.labels_

for symbol, label in zip(symbols, labels):
    print('%s in Cluster %d' % (symbol, label))

The output with the cluster numbers for each stock is as follows:

(30, 252)
(30, 251)
AA in Cluster 0
AXP in Cluster 6
BA in Cluster 6
BAC in Cluster 1
CAT in Cluster 6
CSCO in Cluster 2
CVX in Cluster 7
DD in Cluster 6
DIS in Cluster 6
GE in Cluster 6
HD in Cluster 5
HPQ in Cluster 3
IBM in Cluster 5
INTC in Cluster 6
JNJ in Cluster 5
JPM in Cluster 4
KFT in Cluster 5
KO in Cluster 5
MCD in Cluster 5
MMM in Cluster 6
MRK in Cluster 5
MSFT in Cluster 5
PFE in Cluster 7
PG in Cluster 5
T in Cluster 5
TRV in Cluster 5
UTX in Cluster 6
VZ in Cluster 5
WMT in Cluster 5
XOM in Cluster 7
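
The per-stock listing is easier to interpret when the symbols are grouped by cluster. The following is a small sketch that does this with the labels copied from the output above; the name clusters is introduced only for this example, and in the real program the labels would come from aff_pro.labels_.

```python
from collections import defaultdict

symbols = ["AA", "AXP", "BA", "BAC", "CAT", "CSCO", "CVX", "DD", "DIS", "GE",
           "HD", "HPQ", "IBM", "INTC", "JNJ", "JPM", "KFT", "KO", "MCD", "MMM",
           "MRK", "MSFT", "PFE", "PG", "T", "TRV", "UTX", "VZ", "WMT", "XOM"]

# Cluster labels copied from the printed output, in the same order as symbols.
labels = [0, 6, 6, 1, 6, 2, 7, 6, 6, 6,
          5, 3, 5, 6, 5, 4, 5, 5, 5, 6,
          5, 5, 7, 5, 5, 5, 6, 5, 5, 7]

# Collect the symbols that share a label.
clusters = defaultdict(list)
for symbol, label in zip(symbols, labels):
    clusters[label].append(symbol)

for label in sorted(clusters):
    print('Cluster %d: %s' % (label, ', '.join(clusters[label])))
```

Grouped this way, the output shows two large clusters (5 and 6), one cluster of three stocks (7), and five stocks that each form a cluster of their own.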

How it works...

The following table is an overview of the functions we used in this recipe:

Function                                   Description
-----------------------------------------  -----------------------------------------
sklearn.cluster.AffinityPropagation()      Creates an AffinityPropagation object.
sklearn.cluster.AffinityPropagation.fit    Computes an affinity matrix from Euclidean distances and applies affinity propagation clustering.
numpy.diff                                 Calculates differences of numbers within a NumPy array. If not specified, first-order differences are computed.
numpy.log                                  Calculates the natural log of elements in a NumPy array.
numpy.sum                                  Sums the elements of a NumPy array.
numpy.dot                                  Does matrix multiplication for 2D arrays. Calculates the inner product for 1D arrays.
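
As a quick illustration of the NumPy functions listed above, here is a minimal sketch on small arrays. The arrays a and m are arbitrary values chosen for this example and have nothing to do with the stock data.

```python
import numpy

a = numpy.array([1.0, 2.0, 4.0, 8.0])
m = numpy.array([[1, 2], [3, 4]])

# diff: first-order differences by default, higher orders via n.
first = numpy.diff(a)
second = numpy.diff(a, n=2)

# log: natural logarithm, applied element by element.
logs = numpy.log(a)

# sum: over all elements, or along one axis of a 2D array.
total = numpy.sum(a)
row_sums = numpy.sum(m, axis=1)

# dot: matrix product for 2D arrays, inner product for 1D arrays.
matrix_product = numpy.dot(m, m)
inner_product = numpy.dot(a, a)

print(first)
print(second)
print(logs)
print(total)
print(row_sums)
print(matrix_product)
print(inner_product)
```

The recipe uses sum with axis=1 in the same way, to get one squared norm per stock, and dot on a 2D array and its transpose to get all pairwise inner products at once.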
