Statsmodels ships with quite a lot of sample datasets. The complete list can be found at https://github.com/statsmodels/statsmodels/tree/master/statsmodels/datasets.
In this tutorial, we will concentrate on the copper dataset, which contains information about copper prices, world consumption, and other parameters.
Before we start, we might need to install patsy. The easiest way to find out whether this is necessary is to run the code. If you get errors related to patsy, you will need to execute either of the following two commands:
sudo easy_install patsy
pip install --upgrade patsy
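If you would rather check for patsy before running the recipe, the following minimal sketch uses only the Python standard library. The helper name is_installed is hypothetical and introduced here purely for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# A standard library module is always found.
print("json available:", is_installed("json"))
# patsy is only found if it has been installed.
print("patsy available:", is_installed("patsy"))
```

If the second line reports False, run one of the install commands above.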
In this section, we will see how we can load a dataset from statsmodels as a Pandas DataFrame or Series object.
The function we need to call is load_pandas. Load the data as follows:
import statsmodels.api
data = statsmodels.api.datasets.copper.load_pandas()
This loads the data in a Dataset object, which contains pandas objects.
The Dataset object has an exog attribute, which, when loaded as a pandas object, becomes a DataFrame object with multiple columns. It also has an endog attribute, which in our case contains values for the world consumption of copper.
Perform an ordinary least squares calculation by creating an OLS object and calling its fit method as follows:
x, y = data.exog, data.endog
fit = statsmodels.api.OLS(y, x).fit()
print("Fit params", fit.params)
This should print the result of the fitting procedure, as follows:
Fit params COPPERPRICE 14.222028
INCOMEINDEX 1693.166242
ALUMPRICE -60.638117
INVENTORYINDEX 2515.374903
TIME 183.193035
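To make concrete what such a fit computes: ordinary least squares picks the coefficients that minimize the sum of squared differences between y and the linear combination of the columns of x. The following is a minimal sketch with NumPy on a tiny made-up dataset. It does not use the copper data, and the numbers are invented purely for illustration.

```python
import numpy as np

# Made-up design matrix with two explanatory variables (illustrative only).
x = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0]])

# Build y from known coefficients so the result can be checked by eye.
true_coefficients = np.array([2.0, -1.0])
y = x @ true_coefficients

# Least-squares solution: minimizes ||y - x @ b||^2 over b.
coefficients, residuals, rank, singular_values = np.linalg.lstsq(x, y, rcond=None)
print("Recovered coefficients", coefficients)
```

Like the OLS(y, x) call above, this fits no intercept unless x contains a constant column.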
The results of the OLS fit can be summarized by the summary method as follows:
print(fit.summary())
This will give us the following output for the regression results:
The complete code to load the copper dataset and fit the model is as follows:
import statsmodels.api

# See https://github.com/statsmodels/statsmodels/tree/master/statsmodels/datasets
data = statsmodels.api.datasets.copper.load_pandas()
x, y = data.exog, data.endog
fit = statsmodels.api.OLS(y, x).fit()
print("Fit params", fit.params)
print()
print("Summary")
print()
print(fit.summary())
The data in the Dataset class of statsmodels follows a special format. Among others, this class has the endog and exog attributes. Statsmodels has a load function, which loads data as NumPy arrays. Instead, we used the load_pandas method, which loads data as Pandas objects. We did an OLS fit, which gave us a statistical model relating world copper consumption to the copper price and the other explanatory variables.