Linear Discriminant Analysis (LDA) attempts to fit a linear combination of features to predict an outcome variable. LDA is also often used as a preprocessing (dimensionality reduction) step. We'll walk through both uses in this recipe.
In this example, we will perform an analysis similar to Altman's Z-score. In his 1968 paper, Altman looked at a company's likelihood of defaulting within two years based on several financial metrics. The following components are taken from the Wikipedia page on Altman's Z-score:
T1 = Working Capital / Total Assets. Measures liquid assets in relation to the size of the company.
T2 = Retained Earnings / Total Assets. Measures profitability that reflects the company's age and earning power.
T3 = Earnings Before Interest and Taxes / Total Assets. Measures operating efficiency apart from tax and leveraging factors. It recognizes operating earnings as being important to long-term viability.
T4 = Market Value of Equity / Book Value of Total Liabilities. Adds a market dimension that can surface security price fluctuation as a possible red flag.
T5 = Sales / Total Assets. A standard measure of total asset turnover (varies greatly from industry to industry).
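For reference, Altman combined these five ratios linearly with the weights from his original 1968 model (Z = 1.2T1 + 1.4T2 + 3.3T3 + 0.6T4 + 1.0T5). A minimal sketch, with the input ratios below being purely illustrative:

```python
def altman_z(t1, t2, t3, t4, t5):
    """Combine the five ratios using Altman's original (1968) weights."""
    return 1.2 * t1 + 1.4 * t2 + 3.3 * t3 + 0.6 * t4 + 1.0 * t5

# Hypothetical ratios for an illustrative company (not real data).
z = altman_z(t1=0.2, t2=0.3, t3=0.15, t4=1.1, t5=0.9)
print(round(z, 3))  # → 2.715
```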
From Wikipedia:
[1]: Altman, Edward I. (September 1968). "Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy". Journal of Finance: 189–209.
In this analysis, we'll look at some financial data from Yahoo via pandas. We'll try to predict whether a stock's price will be higher in exactly 6 months, based on the current attributes of the stock. It's obviously nowhere near as refined as Altman's Z-score. Let's use a basket of auto stocks:
>>> tickers = ["F", "TM", "GM", "TSLA"]
>>> from pandas.io import data as external_data
>>> # note: in newer pandas, this module has moved to the pandas-datareader package
>>> stock_panel = external_data.DataReader(tickers, "yahoo")
This data structure is a Panel from pandas. It's similar to an OLAP cube or a 3D DataFrame. Let's take a look at the data to get some familiarity with the closing prices, since that's what we care about when comparing:
>>> stock_df = stock_panel.Close.dropna()
>>> stock_df.plot(figsize=(7, 5))
The following is the output:
OK, so now we need to compare each stock's price with its price in 6 months. If it's higher, we'll code it with 1, and if not, with 0.
To do this, we'll just shift the DataFrame back 180 days and compare:
>>> # this DataFrame indicates whether the stock was higher in 180 days
>>> classes = (stock_df.shift(-180) > stock_df).astype(int)
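To see what this shift-and-compare does, here is a tiny self-contained sketch on made-up prices, with two tickers and a 2-row horizon standing in for the 180 days. Note that the rows past the horizon compare against NaN and therefore get 0:

```python
import pandas as pd

# Made-up closing prices for two hypothetical tickers.
prices = pd.DataFrame({"F": [10.0, 11.0, 12.0, 9.0],
                       "TM": [100.0, 90.0, 95.0, 110.0]})

horizon = 2  # stand-in for the 180-day horizon used in the recipe
# 1 where the price 'horizon' rows later is higher, else 0
labels = (prices.shift(-horizon) > prices).astype(int)
print(labels)
```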
The next thing we need to do is flatten out the dataset:
>>> X = stock_panel.to_frame()
>>> classes = classes.unstack()
>>> classes = classes.swaplevel(0, 1).sort_index()
>>> classes = classes.to_frame()
>>> classes.index.names = ['Date', 'minor']
>>> data = X.join(classes).dropna()
>>> data.rename(columns={0: 'is_higher'}, inplace=True)
>>> data.head()
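The unstack/swaplevel steps can be hard to visualize; here is a small sketch on made-up labels showing how a (date × ticker) table becomes a Series indexed by (Date, ticker) pairs, matching the layout that Panel.to_frame produces:

```python
import pandas as pd

# Made-up 0/1 labels indexed by date, one column per hypothetical ticker.
dates = pd.to_datetime(["2020-01-01", "2020-01-02"])
labels = pd.DataFrame({"F": [1, 0], "TM": [0, 1]}, index=dates)

stacked = labels.unstack()                      # MultiIndex: (ticker, date)
stacked = stacked.swaplevel(0, 1).sort_index()  # now (date, ticker)
print(stacked)
```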
The following is the output:
OK, so now we need to create matrices for SciPy. To do this, we'll use the patsy library. This is a great library that can be used to create a design matrix in a fashion similar to R:
>>> import patsy
>>> X = patsy.dmatrix("Open + High + Low + Close + Volume + is_higher - 1",
...                   data.reset_index(), return_type='dataframe')
>>> X.head()
The following is the output:
patsy is a very powerful package. For example, suppose we want to apply some of the preprocessing from Chapter 1, Premodel Workflow. In patsy, it's possible, as in R, to modify the formula in a way that corresponds to modifications in the design matrix. We won't do it here, but if we want to scale the values to mean 0 and standard deviation 1, the formula would be "scale(Open) + scale(High)".
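To illustrate what such scaling does, here is a manual pandas sketch (not patsy itself) on a made-up column; patsy's exact normalization convention may differ slightly, but the idea is the same: subtract the mean and divide by the standard deviation:

```python
import pandas as pd

# Made-up 'Open' prices; scaling standardizes them to mean 0, std 1.
open_prices = pd.Series([10.0, 12.0, 14.0, 16.0])
scaled = (open_prices - open_prices.mean()) / open_prices.std(ddof=0)
print(scaled.mean(), scaled.std(ddof=0))  # roughly 0 and 1
```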
Awesome! So, now that we have our dataset, let's fit the LDA object:
>>> import pandas as pd
>>> from sklearn.lda import LDA
>>> # note: in newer scikit-learn, the import is
>>> # from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
>>> lda = LDA()
>>> lda.fit(X.ix[:, :-1], X.ix[:, -1]);  # .ix is deprecated in newer pandas; use .iloc
We can see that it's not too bad when predicting against the dataset it was trained on. Certainly, we will want to improve this with other parameters and test the model:
>>> from sklearn.metrics import classification_report
>>> print classification_report(X.ix[:, -1].values,
...                             lda.predict(X.ix[:, :-1]))
             precision    recall  f1-score   support

        0.0       0.63      0.59      0.61      1895
        1.0       0.60      0.64      0.62      1833

avg / total       0.61      0.61      0.61      3728
These metrics describe how the model fits the data in various ways.
The precision and recall parameters are fairly similar. In some ways, as shown in the following list, they can be thought of as conditional proportions:
precision: Given that the model predicts a positive value, what proportion of those predictions is correct?
recall: Given that one class is actually true, what proportion did we "select"? I say select because recall is a common metric in search problems. For example, there can be a set of underlying web pages that, in fact, relate to a search term; recall measures the proportion that is returned.
The f1-score parameter attempts to summarize the relationship between recall and precision.
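These conditional proportions can be computed directly from the confusion counts. A minimal sketch with made-up counts (the f1-score is the harmonic mean of the two):

```python
# Made-up confusion counts for the positive class.
tp, fp, fn = 60, 40, 35  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of predicted positives, the fraction that is correct
recall = tp / (tp + fn)     # of actual positives, the fraction we selected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(precision, 3), round(recall, 3), round(f1, 3))  # → 0.6 0.632 0.615
```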
LDA is actually fairly similar to the clustering we did previously. We fit a basic model from the data. Then, once we have the model, we compare the likelihoods of the data under each class and choose the option that's more likely.
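This fit-then-compare-likelihoods loop can be sketched on synthetic data. The example below uses the modern scikit-learn import path rather than the older sklearn.lda module used above; the data and class labels are made up, with the two classes sharing a covariance, as LDA assumes:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
# Two synthetic classes with the same (identity) covariance, different means.
X0 = rng.randn(100, 2)            # class 0 centered at (0, 0)
X1 = rng.randn(100, 2) + [3, 3]   # class 1 centered at (3, 3)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
# predict_proba gives per-class likelihoods; predict picks the larger one.
print(lda.predict([[3.0, 3.0]]), lda.predict_proba([[3.0, 3.0]]).argmax())
```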
LDA is actually a simplification of QDA, which we'll talk about in the next chapter. In LDA, we assume that the covariance of each class is the same, but in QDA, this assumption is relaxed. Think of the relationship between the two as analogous to the connection between KNN and GMM.