Multiple linear regression

We have trained and evaluated a model to predict the price of a pizza. While you are eager to demonstrate the pizza-price predictor to your friends and co-workers, you are concerned by the model's imperfect R-squared score and the embarrassment its predictions could cause you. How can we improve the model?

Recalling your personal pizza-eating experience, you might have some intuitions about other attributes of a pizza that are related to its price. For instance, the price often depends on the number of toppings. Fortunately, your pizza journal describes toppings in detail; let's add the number of toppings to our training data as a second explanatory variable. We cannot proceed with simple linear regression, but we can use multiple linear regression, a generalization of simple linear regression that accommodates multiple explanatory variables. Formally, multiple linear regression is the following model:

$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$

Where simple linear regression uses a single explanatory variable with a single coefficient, multiple linear regression uses a coefficient for each of an arbitrary number of explanatory variables. In vector notation, the model can be written as follows:

$Y = X\beta$

For simple linear regression, this is equivalent to the following:

$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}$

Y is a column vector of the values of the response variable for the training examples. β is a column vector of the values of the model's parameters. X, called the design matrix, is an m × n dimensional matrix of the values of the explanatory variables for the training examples. m is the number of training examples and n is the number of explanatory variables. Let's update our pizza training data to include the number of toppings with the following values:

Training Example    Diameter (in inches)    Number of toppings    Price (in dollars)
1                   6                       2                     7
2                   8                       1                     9
3                   10                      0                     13
4                   14                      2                     17.5
5                   18                      0                     18
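Written out for this training data, with a leading column of ones serving as the constant explanatory variable for the intercept term, the design matrix X and the response vector Y are as follows (these are the same values that appear in the NumPy session later in this section):

$X = \begin{bmatrix} 1 & 6 & 2 \\ 1 & 8 & 1 \\ 1 & 10 & 0 \\ 1 & 14 & 2 \\ 1 & 18 & 0 \end{bmatrix}, \qquad Y = \begin{bmatrix} 7 \\ 9 \\ 13 \\ 17.5 \\ 18 \end{bmatrix}$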

We must also update our test data to include the second explanatory variable, as follows:

Test Instance    Diameter (in inches)    Number of toppings    Price (in dollars)
1                8                       2                     11
2                9                       0                     8.5
3                11                      2                     15
4                16                      2                     18
5                12                      0                     11

Our learning algorithm must estimate the values of three parameters: the coefficients for the two features and the intercept term. While one might be tempted to solve Y = Xβ for β by dividing each side of the equation by X, division by a matrix is not defined. Just as dividing a number by a nonzero number is equivalent to multiplying by its reciprocal, however, we can multiply by the inverse of X to avoid matrix division. Matrix inversion is denoted with a superscript -1. Only square matrices can be inverted, and X is unlikely to be square; the number of training instances would have to equal the number of features for that to be so. Instead, we will multiply X by its transpose to yield a square matrix that can be inverted. Denoted with a superscript T, the transpose of a matrix is formed by turning the rows of the matrix into columns and vice versa, as follows:

$\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}^{T} = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}$
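To make the transpose concrete, here is a minimal NumPy sketch using the example matrix above; note that the product of a matrix's transpose and the matrix itself is always square:

>>> import numpy as np
>>> A = np.array([[1, 2], [3, 4], [5, 6]])
>>> A.T
array([[1, 3, 5],
       [2, 4, 6]])
>>> np.dot(A.T, A).shape
(2, 2)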

To recap, our model is given by the following formula:

$Y = X\beta$

We know the values of Y and X from our training data. We must find the values of β that minimize the cost function. We can solve for β as follows:

$\beta = (X^T X)^{-1} X^T Y$

We can solve for β using NumPy, as follows:

>>> from numpy.linalg import inv
>>> from numpy import dot, transpose
>>> X = [[1, 6, 2], [1, 8, 1], [1, 10, 0], [1, 14, 2], [1, 18, 0]]
>>> y = [[7], [9], [13], [17.5], [18]]
>>> print(dot(inv(dot(transpose(X), X)), dot(transpose(X), y)))
[[ 1.1875    ]
 [ 1.01041667]
 [ 0.39583333]]
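As a quick sanity check (a minimal sketch continuing the session above), we can use the estimated parameters to predict the price of the first test instance, an 8-inch pizza with two toppings, by prepending a 1 for the intercept term:

>>> beta = dot(inv(dot(transpose(X), X)), dot(transpose(X), y))
>>> print(dot([1, 8, 2], beta))
[ 10.0625]

This agrees with the prediction that scikit-learn produces for the same test instance later in this section.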

NumPy also provides a least squares function that can solve for the values of the parameters more compactly:

>>> from numpy.linalg import lstsq
>>> X = [[1, 6, 2], [1, 8, 1], [1, 10, 0], [1, 14, 2], [1, 18, 0]]
>>> y = [[7],    [9],    [13],    [17.5],  [18]]
>>> print(lstsq(X, y)[0])
[[ 1.1875    ]
 [ 1.01041667]
 [ 0.39583333]]
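In addition to the parameter estimates selected here with the [0] index, lstsq also returns the sum of the squared residuals, the rank of X, and the singular values of X.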

Let's update our pizza-price predictor program to use the second explanatory variable, and compare its performance on the test set to that of the simple linear regression model:

>>> from sklearn.linear_model import LinearRegression
>>> X = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
>>> y = [[7],    [9],    [13],    [17.5],  [18]]
>>> model = LinearRegression()
>>> model.fit(X, y)
>>> X_test = [[8, 2], [9, 0], [11, 2], [16, 2], [12, 0]]
>>> y_test = [[11],   [8.5],  [15],    [18],    [11]]
>>> predictions = model.predict(X_test)
>>> for i, prediction in enumerate(predictions):
...     print('Predicted: %s, Target: %s' % (prediction, y_test[i]))
>>> print('R-squared: %.2f' % model.score(X_test, y_test))
Predicted: [ 10.0625], Target: [11]
Predicted: [ 10.28125], Target: [8.5]
Predicted: [ 13.09375], Target: [15]
Predicted: [ 18.14583333], Target: [18]
Predicted: [ 13.3125], Target: [11]
R-squared: 0.77
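The parameters estimated by scikit-learn can be inspected through the fitted model's coef_ and intercept_ attributes; they should agree, up to floating-point precision, with the values we computed from the normal equation:

>>> print(model.coef_)       # approximately [[ 1.01041667  0.39583333]]
>>> print(model.intercept_)  # approximately [ 1.1875]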

It appears that adding the number of toppings as an explanatory variable has improved the performance of our model. In later sections, we will discuss why evaluating a model on a single test set can provide inaccurate estimates of the model's performance, and how we can estimate its performance more accurately by training and testing on many partitions of the data. For now, however, we can accept that the multiple linear regression model performs significantly better than the simple linear regression model. There may be other attributes of pizzas that can be used to explain their prices. What if the relationship between these explanatory variables and the response variable is not linear in the real world? In the next section, we will examine a special case of multiple linear regression that can be used to model nonlinear relationships.
