Now, we get to do some modeling! It's best to start simple, so we'll look at linear regression first. Linear regression is the oldest and probably the most fundamental model: a straight line through the data.
The boston dataset is perfect to play around with regression. It has the median home price of several areas in Boston, along with other factors that might impact housing prices, for example, the crime rate.
First, import the datasets module, then we can load the dataset:
>>> from sklearn import datasets
>>> boston = datasets.load_boston()
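As a quick sanity check, boston.data and boston.target are plain NumPy arrays, with one row per neighborhood and one column per feature:
>>> boston.data.shape
(506, 13)
>>> boston.target.shape
(506,)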
Using linear regression in scikit-learn is quite simple; the API is basically the same as the one you're now familiar with from the previous chapter.
First, import the LinearRegression class and instantiate an object:
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
Now, it's as easy as passing the independent and dependent variables to the fit method of LinearRegression:
>>> lr.fit(boston.data, boston.target)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
Now, to get the predictions, do the following:
>>> predictions = lr.predict(boston.data)
It's then probably a good idea to look at how close the predictions are to the actual data. We can use a histogram of the differences between the actual and predicted values to check; these differences are called the residuals.
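For example, here's a minimal sketch of such a residual histogram with matplotlib (the bin count of 50 is an arbitrary choice):
>>> import matplotlib.pyplot as plt
>>> residuals = boston.target - predictions  # actual minus predicted values
>>> plt.hist(residuals, bins=50)
>>> plt.title("Histogram of residuals")
>>> plt.show()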
Let's take a look at the coefficients:
>>> lr.coef_
array([ -1.07170557e-01,   4.63952195e-02,   2.08602395e-02,
         2.68856140e+00,  -1.77957587e+01,   3.80475246e+00,
         7.51061703e-04,  -1.47575880e+00,   3.05655038e-01,
        -1.23293463e-02,  -9.53463555e-01,   9.39251272e-03,
        -5.25466633e-01])
So, going back to the data, we can see which factors have a negative relationship with the outcome, and also the factors that have a positive relationship. For example, and as expected, an increase in the per capita crime rate by town has a negative relationship with the price of a home in Boston. The per capita crime rate is the first coefficient in the regression.
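If you don't want to count positions by hand, one way (a quick sketch) is to line the coefficients up against the feature names stored on the dataset object:
>>> for name, coef in zip(boston.feature_names, lr.coef_):
...     print(name, coef)
The first pair printed is CRIM, the per capita crime rate, with its negative coefficient.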
The basic idea of linear regression is to find the set of coefficients β that satisfy y = Xβ, where X is the data matrix. It's unlikely that, for the given values of X, we will find a set of coefficients that exactly satisfies the equation; an error term gets added if there is an inexact specification or measurement error. Therefore, the equation becomes y = Xβ + ε, where ε is assumed to be normally distributed and independent of the X values. Geometrically, this means that the error terms are perpendicular to X. It's beyond the scope of this book, but it might be worth proving this to yourself.
In order to find the set of betas that map the X values to y, we minimize the error term. This is done by minimizing the residual sum of squares, (y − Xβ)ᵀ(y − Xβ).
This problem can be solved analytically, with the solution being β̂ = (XᵀX)⁻¹Xᵀy.
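To convince yourself, here's a minimal sketch that applies this formula directly with NumPy; a column of ones is prepended to the data so the intercept gets estimated too, and the result should match what scikit-learn found up to floating-point precision:
>>> import numpy as np
>>> X = np.column_stack((np.ones(len(boston.data)), boston.data))  # add intercept column
>>> betas = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(boston.target)
>>> np.allclose(betas[0], lr.intercept_)
True
>>> np.allclose(betas[1:], lr.coef_)
True
In practice, you'd solve this with something like np.linalg.lstsq rather than inverting XᵀX explicitly, since the explicit inverse is numerically less stable.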