Using LDA for classification

Linear Discriminant Analysis (LDA) fits a linear combination of features that best separates the classes of an outcome variable. LDA is also often used as a dimensionality-reduction preprocessing step. We'll walk through both uses in this recipe.

Getting ready

In this recipe, we will do the following:

  1. Grab stock data from Yahoo.
  2. Rearrange it in a shape we're comfortable with.
  3. Create an LDA object to fit and predict the class labels.
  4. Give an example of how to use LDA for dimensionality reduction.

How to do it…

In this example, we will perform an analysis similar to Altman's Z-score. In his 1968 paper [1], Altman looked at a company's likelihood of defaulting within two years based on several financial ratios. These ratios, which Altman combined into a single score (sketched after the reference below), are described as follows on the Wikipedia page for Altman's Z-score:

T1 = Working Capital / Total Assets. Measures liquid assets in relation to the size of the company.

T2 = Retained Earnings / Total Assets. Measures profitability that reflects the company's age and earning power.

T3 = Earnings Before Interest and Taxes / Total Assets. Measures operating efficiency apart from tax and leveraging factors. It recognizes operating earnings as being important to long-term viability.

T4 = Market Value of Equity / Book Value of Total Liabilities. Adds a market dimension that can show up security price fluctuation as a possible red flag.

T5 = Sales / Total Assets. Standard measure for total asset turnover (varies greatly from industry to industry).

From Wikipedia:

[1]: Altman, Edward I. (September 1968). "Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy". Journal of Finance: 189–209.
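For reference, Altman combined these five ratios into a single discriminant score. The following is a minimal sketch using the commonly quoted weights (the exact coefficients and cut-off values vary slightly between sources, so treat the numbers as illustrative):

>>> def altman_z(t1, t2, t3, t4, t5):
        # classic Z-score: a fixed, hand-derived linear combination of the ratios
        return 1.2 * t1 + 1.4 * t2 + 3.3 * t3 + 0.6 * t4 + 1.0 * t5
>>> round(altman_z(0.1, 0.2, 0.15, 1.5, 0.9), 3)   # hypothetical ratio values
2.695

Scores above roughly 3 were read as safe and scores below roughly 1.8 as distressed, with a grey zone in between. In this recipe, we let LDA learn such a linear combination from the data instead of fixing the weights by hand.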

In this analysis, we'll look at some financial data from Yahoo via pandas. We'll try to predict whether a stock will be higher roughly six months from now (180 trading days in the code that follows), based on the current attributes of the stock. It's obviously nowhere near as refined as Altman's Z-score. Let's start with a basket of auto stocks:

>>> # download daily OHLCV data for a small basket of auto makers
>>> tickers = ["F", "TM", "GM", "TSLA"]
>>> from pandas.io import data as external_data
>>> stock_panel = external_data.DataReader(tickers, "yahoo")
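
Note that in newer pandas releases the remote data readers were split out into the separate pandas-datareader package and the Panel type was removed, so the download step looks slightly different there. A hedged sketch only (the Yahoo endpoint has changed over time and may require an alternative backend such as yfinance):

>>> # sketch for newer pandas: returns a DataFrame with MultiIndexed
>>> # columns rather than a Panel, so the Panel-based reshaping later
>>> # in this recipe would need adjusting
>>> from pandas_datareader import data as external_data
>>> stock_panel = external_data.DataReader(tickers, "yahoo")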

This data structure is a Panel from pandas. It's similar to an OLAP cube or a 3D DataFrame. Let's take a look at the data to get familiar with the closing prices, since those are what we care about when comparing:

>>> stock_df = stock_panel.Close.dropna()
>>> stock_df.plot(figsize=(7, 5))

The following is the output (a plot of the closing prices for each ticker):

OK, so now we need to compare each stock price with its price roughly six months later. If it's higher, we'll code it with 1, and if not, we'll code it with 0.

To do this, we'll shift the DataFrame back by 180 rows (trading days) and compare each price with the shifted, future value:

# this DataFrame indicates whether the stock was higher 180 trading days later
>>> classes = (stock_df.shift(-180) > stock_df).astype(int)
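
To make the shift comparison concrete, here is a toy example with hypothetical prices (not from the stock data). Note that the trailing rows compare against NaN, which evaluates to False, so the most recent 180 rows end up labeled 0 rather than dropped:

>>> import pandas as pd
>>> prices = pd.Series([10, 11, 9, 12, 13])
>>> # is the price higher two steps later?
>>> (prices.shift(-2) > prices).astype(int)
0    0
1    1
2    1
3    0
4    0
dtype: int64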

The next thing we need to do is flatten out the dataset:

>>> # flatten the 3D panel into a 2D frame indexed by (Date, minor)
>>> X = stock_panel.to_frame()
>>> # reshape the class labels to the same (Date, minor) MultiIndex
>>> classes = classes.unstack()
>>> classes = classes.swaplevel(0, 1).sort_index()
>>> classes = classes.to_frame()
>>> classes.index.names = ['Date', 'minor']
>>> # join the labels onto the features and name the label column
>>> data = X.join(classes).dropna()
>>> data.rename(columns={0: 'is_higher'}, inplace=True)
>>> data.head()

The following is the output (the first few rows of the joined DataFrame, with the new is_higher column):
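
If the reshaping above feels opaque, the following toy example (hypothetical numbers; the Panel type only exists in older pandas versions) shows what to_frame() produces, namely one row per (date, ticker) pair with one column per attribute:

>>> import numpy as np
>>> toy = pd.Panel(np.arange(12.).reshape(2, 3, 2),
                   items=['Open', 'Close'],
                   major_axis=pd.date_range('2015-01-01', periods=3),
                   minor_axis=['F', 'TM'])
>>> toy.to_frame()   # 6 rows, one per (date, ticker) pair; columns Open and Close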

OK, so now we need to create matrices for SciPy to consume. To do this, we'll use the patsy library. This is a great library that can be used to create a design matrix in a fashion similar to R:

>>> import patsy
>>> X = patsy.dmatrix("Open + High + Low + Close + Volume + is_higher - 1",
                      data.reset_index(), return_type='dataframe')
>>> X.head()

The following is the output (the first few rows of the patsy design matrix):

patsy is a very powerful package; for example, suppose we want to apply some of the preprocessing from Chapter 1, Premodel Workflow. In patsy, as in R, we can modify the formula in a way that corresponds to modifications in the design matrix. We won't do it here, but to scale a value to mean 0 and standard deviation 1, we could use "scale(Open) + scale(High)" in the formula.
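
As a hedged illustration only (not part of the original recipe), the same dmatrix call with patsy's built-in scale() stateful transform would look something like this:

>>> # sketch: standardize Open and High inside the formula itself
>>> X_scaled = patsy.dmatrix("scale(Open) + scale(High) + Low + Close "
                             "+ Volume + is_higher - 1",
                             data.reset_index(), return_type='dataframe')
>>> X_scaled[['scale(Open)', 'scale(High)']].mean()   # should be close to 0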

Awesome! So, now that we have our dataset, let's fit the LDA object:

>>> import pandas as pd
>>> from sklearn.lda import LDA    # in newer scikit-learn, this lives in sklearn.discriminant_analysis as LinearDiscriminantAnalysis
>>> lda = LDA()
>>> # the design matrix's last column is is_higher; the rest are features
>>> # (.ix was removed from newer pandas; use .iloc there instead)
>>> lda.fit(X.ix[:, :-1], X.ix[:, -1])

Let's see how well the model fits the data it was trained on. It's not too bad, though we will certainly want to tune the parameters and test the model on held-out data (see the sketch after the metrics discussion below):

>>> from sklearn.metrics import classification_report
>>> print(classification_report(X.ix[:, -1].values,
                                lda.predict(X.ix[:, :-1])))
             precision    recall  f1-score   support

        0.0       0.63      0.59      0.61      1895
        1.0       0.60      0.64      0.62      1833

avg / total       0.61      0.61      0.61      3728

These metrics describe how the model fits the data in various ways.

The precision and recall metrics are closely related. In some ways, as shown in the following list, they can be thought of as conditional proportions:

  • For precision: given that the model predicts a positive value, what proportion of those predictions is correct?
  • For recall: given that an example truly belongs to a class, what proportion of such examples did we "select"? I say select because recall is a common metric in search problems. For example, there can be a set of underlying web pages that actually relate to a search term; recall measures the proportion of those pages that is returned.

The f1-score summarizes the relationship between recall and precision; it is their harmonic mean.
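
Since the preceding report was computed on the same data the model was fit on, here is a hedged sketch (not part of the original recipe) of how we might hold out a test set so the report measures out-of-sample performance; note that sklearn.cross_validation is called sklearn.model_selection in newer scikit-learn versions:

>>> from sklearn.cross_validation import train_test_split
>>> # keep 30% of the rows aside as a test set
>>> X_train, X_test, y_train, y_test = train_test_split(
        X.ix[:, :-1], X.ix[:, -1], test_size=0.3, random_state=0)
>>> lda.fit(X_train, y_train)
>>> print(classification_report(y_test, lda.predict(X_test)))

For time-ordered data like this, a chronological split would be even more appropriate, but the random split keeps the sketch simple.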

How it works…

LDA is actually fairly similar to the clustering we did previously. We fit a basic model to the data. Then, once we have the model, we compare the likelihood of each data point under each class and choose the class that is more likely.
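
As a hedged illustration of that decision rule (not in the original recipe), the fitted LDA object exposes the per-class probabilities directly, and predict() simply picks the most probable class:

>>> # posterior probability of each class for the first few rows;
>>> # columns are ordered according to lda.classes_
>>> lda.predict_proba(X.ix[:5, :-1])
>>> # predict() returns the class with the highest posterior probability
>>> lda.predict(X.ix[:5, :-1])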

LDA is actually a simplification of QDA, which we'll talk about in the next chapter. In LDA, we assume that the covariance of each class is the same, but in QDA this assumption is relaxed. Think about the connection between KNN and GMM, and the analogous relationship between LDA and QDA here.
