Chapter 2. Linear Regression

In this chapter you will learn how to use linear models in regression problems. First, we will examine simple linear regression, which models the relationship between a response variable and a single explanatory variable. Next, we will discuss multiple linear regression, a generalization of simple linear regression that can support more than one explanatory variable. Then, we will discuss polynomial regression, a special case of multiple linear regression that can effectively model nonlinear relationships. Finally, we will discuss how to train our models by finding the values of their parameters that minimize a cost function. We will work through a toy problem to learn how the models and learning algorithms work before discussing an application with a larger dataset.

Simple linear regression

In the previous chapter you learned that training data is used to estimate the parameters of a model in supervised learning problems. Past observations of explanatory variables and their corresponding response variables comprise the training data. The model can be used to predict the value of the response variable for values of the explanatory variable that have not been previously observed. Recall that the goal in regression problems is to predict the value of a continuous response variable. In this chapter, we will examine several example linear regression models. We will discuss the training data, model, learning algorithm, and evaluation metrics for each approach. To start, let's consider simple linear regression. Simple linear regression can be used to model a linear relationship between one response variable and one explanatory variable. Linear regression has been applied to many important scientific and social problems; the example that we will consider is probably not one of them.

Suppose you wish to know the price of a pizza. You might simply look at a menu. This, however, is a machine learning book, so we will use simple linear regression instead to predict the price of a pizza based on an attribute of the pizza that we can observe. Let's model the relationship between the size of a pizza and its price. First, we will write a program with scikit-learn that can predict the price of a pizza given its size. Then, we will discuss how simple linear regression works and how it can be generalized to work with other types of problems. Let's assume that you have recorded the diameters and prices of pizzas that you have previously eaten in your pizza journal. These observations comprise our training data:

Training instance    Diameter (in inches)    Price (in dollars)
1                    6                       7
2                    8                       9
3                    10                      13
4                    14                      17.5
5                    18                      18

We can visualize our training data by plotting it on a graph using matplotlib:

>>> import matplotlib.pyplot as plt
>>> X = [[6], [8], [10], [14],   [18]]
>>> y = [[7], [9], [13], [17.5], [18]]
>>> plt.figure()
>>> plt.title('Pizza price plotted against diameter')
>>> plt.xlabel('Diameter in inches')
>>> plt.ylabel('Price in dollars')
>>> plt.plot(X, y, 'k.')
>>> plt.axis([0, 25, 0, 25])
>>> plt.grid(True)
>>> plt.show()

The preceding script produces the following graph. The diameters of the pizzas are plotted on the x axis and the prices are plotted on the y axis.

[Figure: Pizza price plotted against diameter, a scatter plot of the training data]

We can see from the graph of the training data that there is a positive relationship between the diameter of a pizza and its price, which should be corroborated by our own pizza-eating experience. As the diameter of a pizza increases, its price generally increases too. The following pizza-price predictor program models this relationship using linear regression. Let's review the following program and discuss how linear regression works:

>>> from sklearn.linear_model import LinearRegression
>>> # Training data
>>> X = [[6], [8], [10], [14],   [18]]
>>> y = [[7], [9], [13], [17.5], [18]]
>>> # Create and fit the model
>>> model = LinearRegression()
>>> model.fit(X, y)
>>> print('A 12" pizza should cost: $%.2f' % model.predict([[12]])[0][0])
A 12" pizza should cost: $13.68

Simple linear regression assumes that a linear relationship exists between the response variable and the explanatory variable; it models this relationship with a linear surface called a hyperplane. A hyperplane is a subspace that has one dimension less than the ambient space that contains it. In simple linear regression, there is one dimension for the response variable and another dimension for the explanatory variable, making a total of two dimensions. The regression hyperplane, therefore, has one dimension; a hyperplane with one dimension is a line.

The sklearn.linear_model.LinearRegression class is an estimator. Estimators predict a value based on the observed data. In scikit-learn, all estimators implement the fit() and predict() methods. The former method is used to learn the parameters of a model, and the latter method is used to predict the value of a response variable for an explanatory variable using the learned parameters. It is easy to experiment with different models using scikit-learn because all estimators implement the fit and predict methods.
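Because every estimator shares this interface, trying a different model requires only a small change. The following sketch is illustrative rather than part of the pizza-price predictor; it assumes the X and y lists from the previous script and uses scikit-learn's KNeighborsRegressor, a different regressor chosen only to show that fit() and predict() are called in the same way. Its prediction is simply the average of the prices of the two nearest training instances, the 10-inch and 14-inch pizzas:

>>> from sklearn.neighbors import KNeighborsRegressor
>>> # A different estimator, but the same fit/predict interface
>>> knn = KNeighborsRegressor(n_neighbors=2)
>>> knn.fit(X, y)
>>> print('A 12" pizza should cost: $%.2f' % knn.predict([[12]])[0][0])
A 12" pizza should cost: $15.25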

The fit method of LinearRegression learns the parameters of the following model for simple linear regression:

$$y = \alpha + \beta x$$

In the preceding equation, $y$ is the predicted value of the response variable; in this example, it is the predicted price of the pizza. $x$ is the explanatory variable. The intercept term $\alpha$ and the coefficient $\beta$ are parameters of the model that are learned by the learning algorithm. The line plotted in the following figure models the relationship between the size of a pizza and its price. Using this model, we would expect the price of an 8-inch pizza to be about $9.78, and the price of a 20-inch pizza to be about $21.49.

[Figure: The training data with the fitted regression line]
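We can also inspect the learned parameters directly on the fitted estimator. The following sketch assumes the model fitted in the previous script; scikit-learn stores the learned intercept and coefficients in the intercept_ and coef_ attributes of a fitted LinearRegression, so the prediction can be reproduced by hand:

>>> # Assumes the LinearRegression model fitted in the previous script
>>> alpha = model.intercept_[0]  # the learned intercept term
>>> beta = model.coef_[0][0]     # the learned coefficient for diameter
>>> print('alpha: %.4f, beta: %.4f' % (alpha, beta))
alpha: 1.9655, beta: 0.9763
>>> print('A 12" pizza should cost: $%.2f' % (alpha + beta * 12))
A 12" pizza should cost: $13.68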

Using training data to learn the values of the parameters for simple linear regression that produce the best-fitting model is called ordinary least squares or linear least squares. In this chapter we will discuss methods for approximating the values of the model's parameters and for solving them analytically. First, however, we must define what it means for a model to fit the training data.

Evaluating the fitness of a model with a cost function

Regression lines produced by several sets of parameter values are plotted in the following figure. How can we assess which parameters produced the best-fitting regression line?

[Figure: Several candidate regression lines plotted over the training data]

A cost function, also called a loss function, is used to define and measure the error of a model. The differences between the prices predicted by the model and the observed prices of the pizzas in the training set are called residuals or training errors. Later, we will evaluate a model on a separate set of test data; the differences between the predicted and observed values in the test data are called prediction errors or test errors.

The residuals for our model are indicated by the vertical lines between the points for the training instances and the regression hyperplane in the following plot:

[Figure: The regression line with vertical lines indicating the residuals for each training instance]
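To make these residuals concrete, the following sketch computes them for each training instance. It assumes the model, X, and y from the earlier script; the loop and the output formatting are our own:

>>> # Assumes model, X, and y from the earlier script
>>> for xi, yi in zip(X, y):
...     residual = yi[0] - model.predict([xi])[0][0]  # observed price minus predicted price
...     print('Diameter: %d, residual: %.2f' % (xi[0], residual))
...
Diameter: 6, residual: -0.82
Diameter: 8, residual: -0.78
Diameter: 10, residual: 1.27
Diameter: 14, residual: 1.87
Diameter: 18, residual: -1.54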

We can produce the best pizza-price predictor by minimizing the sum of the squared residuals. That is, our model fits well if the values it predicts for the response variable are close to the observed values for all of the training examples. This measure of the model's fitness is called the residual sum of squares cost function. Formally, this function assesses the fitness of a model by summing the squared residuals for all of our training examples. The residual sum of squares is calculated with the formula in the following equation, where $y_i$ is the observed value and $f(x_i)$ is the predicted value:

$$SS_{\mathrm{res}} = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2$$

Let's compute the residual sum of squares for our model by adding the following two lines to the previous script:

>>> import numpy as np
>>> print('Residual sum of squares: %.2f' % np.sum((model.predict(X) - y) ** 2))
Residual sum of squares: 8.75

Now that we have a cost function, we can find the values of our model's parameters that minimize it.

Solving ordinary least squares for simple linear regression

In this section, we will work through solving ordinary least squares for simple linear regression. Recall that simple linear regression is given by the following equation:

$$y = \alpha + \beta x$$

Also, recall that our goal is to find the values of $\alpha$ and $\beta$ that minimize the cost function. We will solve for $\beta$ first. To do so, we will calculate the variance of $x$ and the covariance of $x$ and $y$.

Variance is a measure of how far a set of values is spread out. If all of the numbers in the set are equal, the variance of the set is zero. A small variance indicates that the numbers are near the mean of the set, while a set containing numbers that are far from the mean and each other will have a large variance. Variance can be calculated using the following equation:

$$\mathrm{var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

In the preceding equation, $\bar{x}$ is the mean of $x$, $x_i$ is the value of $x$ for the $i$th training instance, and $n$ is the number of training instances. Let's calculate the variance of the pizza diameters in our training set:

>>> from __future__ import division
>>> xbar = (6 + 8 + 10 + 14 + 18) / 5
>>> variance = ((6 - xbar)**2 + (8 - xbar)**2 + (10 - xbar)**2 + (14 - xbar)**2 + (18 - xbar)**2) / 4
>>> print(variance)
23.2

NumPy also provides the var function for calculating variance. Setting the ddof keyword parameter to 1 applies Bessel's correction, which computes the sample variance:

>>> import numpy as np
>>> print(np.var([6, 8, 10, 14, 18], ddof=1))
23.2

Covariance is a measure of how much two variables change together. If the values of the variables increase together, their covariance is positive. If one variable tends to increase while the other decreases, their covariance is negative. If there is no linear relationship between the two variables, their covariance will be equal to zero; the variables are linearly uncorrelated but not necessarily independent. Covariance can be calculated using the following formula:

$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

As with variance, $x_i$ is the diameter of the $i$th training instance, $\bar{x}$ is the mean of the diameters, $\bar{y}$ is the mean of the prices, $y_i$ is the price of the $i$th training instance, and $n$ is the number of training instances. Let's calculate the covariance of the diameters and prices of the pizzas in the training set:

>>> xbar = (6 + 8 + 10 + 14 + 18) / 5
>>> ybar = (7 + 9 + 13 + 17.5 + 18) / 5
>>> cov = ((6 - xbar) * (7 - ybar) + (8 - xbar) * (9 - ybar) + (10 - xbar) * (13 - ybar) +
...        (14 - xbar) * (17.5 - ybar) + (18 - xbar) * (18 - ybar)) / 4
>>> print(cov)
22.65
>>> import numpy as np
>>> print(np.cov([6, 8, 10, 14, 18], [7, 9, 13, 17.5, 18])[0][1])
22.65

Now that we have calculated the variance of our explanatory variable and the covariance of the response and explanatory variables, we can solve for $\beta$ using the following formula:

$$\beta = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$$

$$\beta = \frac{22.65}{23.2} \approx 0.9763$$

Having solved for $\beta$, we can solve for $\alpha$ using the following formula:

$$\alpha = \bar{y} - \beta\bar{x}$$

In the preceding formula, $\bar{y}$ is the mean of $y$ and $\bar{x}$ is the mean of $x$. $(\bar{x}, \bar{y})$ are the coordinates of the centroid, a point that the model must pass through. We can use the centroid and the value of $\beta$ to solve for $\alpha$ as follows:

$$\alpha = 12.9 - 0.9763 \times 11.2 \approx 1.9655$$
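As a quick check on these formulas, we can compute both parameters with NumPy. This is a minimal sketch using the same training data; the variable names are our own, and the values should agree with the parameters learned by scikit-learn earlier:

>>> import numpy as np
>>> diameters = np.array([6, 8, 10, 14, 18])
>>> prices = np.array([7, 9, 13, 17.5, 18])
>>> beta = np.cov(diameters, prices)[0][1] / np.var(diameters, ddof=1)  # beta = cov(x, y) / var(x)
>>> alpha = prices.mean() - beta * diameters.mean()                     # alpha = ybar - beta * xbar
>>> print('alpha: %.4f, beta: %.4f' % (alpha, beta))
alpha: 1.9655, beta: 0.9763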

Now that we have solved for the values of the model's parameters that minimize the cost function, we can plug in the diameters of the pizzas and predict their prices. For instance, an 11-inch pizza is expected to cost around $12.70, and an 18-inch pizza is expected to cost around $19.54. Congratulations! You used simple linear regression to predict the price of a pizza.
