Modeling the data

Let's begin modeling with our dataset. We're going to examine the effect that the ZIP code and the number of bedrooms have on the rental price. We'll use two packages here: the first, statsmodels, we introduced in Chapter 1, The Python Machine Learning Ecosystem; the second, patsy, https://patsy.readthedocs.org/en/latest/index.html, is a package that makes working with statsmodels easier. Patsy allows you to use R-style formulas when running a regression. Let's do that now:

import patsy
import statsmodels.api as sm

f = 'rent ~ zip + beds'
y, X = patsy.dmatrices(f, zdf, return_type='dataframe')

results = sm.OLS(y, X).fit()
results.summary()

The preceding code generates the following output:

Note that the preceding output is truncated.

With those few lines of code, we have just run our first machine learning algorithm.

While most people don't tend to think of linear regression as machine learning, that's exactly what it is. Linear regression is a type of supervised machine learning. Supervised, in this context, simply means we provide the output values for our training set.

Let's now unpack what happened there. After our imports, we have two lines that relate to the patsy module. The first line is the formula we will be using. On the left-hand side (before the tilde) is our response, or dependent, variable, rent. On the right-hand side, we have our independent, or predictor, variables, zip and beds. This formula simply means we want to know how the ZIP code and the number of bedrooms will affect the rental price.

Our formula is then passed into patsy.dmatrices() along with our DataFrame, zdf, which contains the corresponding columns. Patsy then returns a DataFrame with our X matrix of predictor variables and a y vector with our response variable. These are passed into sm.OLS(), on which we call .fit() to run our model. Finally, we print out the results of the model.
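To make the patsy step concrete, here is a minimal sketch using a tiny hypothetical DataFrame standing in for zdf (the values are invented for illustration, not taken from the book's dataset):

```python
import pandas as pd
import patsy

# Hypothetical stand-in for zdf, just to show what
# patsy.dmatrices() hands back.
df = pd.DataFrame({
    'rent': [2500, 3200, 1800, 4100],
    'zip': ['07302', '10069', '07302', '10069'],
    'beds': [1, 2, 1, 3],
})

y, X = patsy.dmatrices('rent ~ zip + beds', df, return_type='dataframe')

# Patsy treats the string column zip as categorical: it adds an
# Intercept column, dummy-codes zip while dropping the first level
# as the baseline, and keeps beds as a plain numeric column.
print(X.columns.tolist())   # ['Intercept', 'zip[T.10069]', 'beds']
print(y.columns.tolist())   # ['rent']
```

The zip[T.10069] naming is patsy's treatment-coding convention: one dummy column per non-baseline level of the categorical variable.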

As you can see, there is a lot of information provided in the resulting output. Let's begin by looking at the topmost section. We see that the model included 555 observations, that it has an adjusted R² of .367, and that it is significant with an F-statistic probability of 3.50e-31. What is the significance of this? It means that we have created a model that is able to explain about a third of the variance in price using just bedrooms and ZIP code. Is this a good result? In order to better answer that, let's now look at the center section of the output.

The center section provides us with information on each of the independent variables in our model. From left to right, we see the following: the variable, the variable's coefficient in the model, the standard error, the t-statistic, the p-value for the t-statistic, and a 95% confidence interval.

What does all of this tell us? If we look at the p-value column, we can determine whether our individual variables are statistically significant. Statistically significant in a regression model means that the relationship between an independent variable and a response variable is unlikely to have occurred by chance. Typically, statisticians use a p-value of .05 when determining this. A .05 p-value means that the results we see would occur by chance only 5% of the time if there were truly no relationship. In terms of our output here, the number of bedrooms is clearly significant. What about the ZIP codes?

The first thing to notice here is that our intercept represents the 07302 ZIP code. When modeling a linear regression, an intercept is needed; it is simply where the regression line meets the y axis. With a categorical variable like zip, one level must also be dropped to serve as the baseline, and its effect is absorbed into the intercept. Patsy drops the first level automatically, and since the ZIP codes sort in ascending order, that baseline turns out to be Jersey City's 07302. We can confirm this by examining the data as follows:

X 

The preceding code generates the following output:

Notice that the dummy columns appear in ascending order, and if we look at the sorted ZIP code values in our DataFrame, we see the same ordering with one exception: ZIP 07302 is missing, because it is now our baseline against which all the others will be compared.

Looking at our results output again, we notice that some ZIP codes are highly significant and others are not. Let's look at our old friend, the Lincoln Center neighborhood, or 10069. If you remember, it was the area with the highest rents in our sample. We would expect that it would be significant and have a large positive coefficient when compared to the baseline of Jersey City, and, in fact, it does. The p-value is 0.000, and the coefficient is 4116. This means that you can expect the rent to be significantly higher near Lincoln Center, compared to an equivalent apartment in Jersey City—no surprise there.

Let's now use our model to make a number of forecasts.
