We have trained and evaluated a model to predict the price of a pizza. While we are eager to demonstrate the pizza-price predictor to friends and co-workers, we are concerned by the model's imperfect R-squared score and the embarrassment its predictions could cause. How can we improve the model?
Recalling your personal pizza-eating experience, you might have some intuitions about other attributes of a pizza that are related to its price. For instance, the price often depends on the number of toppings. Fortunately, your pizza journal describes toppings in detail; let's add the number of toppings to our training data as a second explanatory variable. Simple linear regression cannot accommodate a second explanatory variable, but multiple linear regression, a generalization of simple linear regression, can use an arbitrary number of them. Formally, multiple linear regression is the following model:

y = α + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ
Where simple linear regression uses a single explanatory variable with a single coefficient, multiple linear regression uses a coefficient for each of an arbitrary number of explanatory variables.
For simple linear regression, this is equivalent to the following:

Y = Xβ
Y is a column vector of the values of the response variable for the training examples. β is a column vector of the values of the model's parameters. X, called the design matrix, is an m × n dimensional matrix of the values of the explanatory variables for the training examples, where m is the number of training examples and n is the number of explanatory variables. Let's update our pizza training data to include the number of toppings, with the following values:
| Training Example | Diameter (in inches) | Number of toppings | Price (in dollars) |
|---|---|---|---|
| 1 | 6 | 2 | 7 |
| 2 | 8 | 1 | 9 |
| 3 | 10 | 0 | 13 |
| 4 | 14 | 2 | 17.5 |
| 5 | 18 | 0 | 18 |
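As a concrete sketch (the column of ones and the `hstack` construction here are our own illustration, not part of the text), the training table above can be assembled into a design matrix with NumPy:

```python
import numpy as np

# Explanatory variables from the training table: diameter (inches) and number of toppings.
features = np.array([[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]])

# Prepend a column of ones so the intercept can be estimated as another coefficient.
X = np.hstack([np.ones((features.shape[0], 1)), features])
print(X.shape)  # (5, 3): m = 5 training examples, 3 columns including the intercept
```

The leading column of ones lets the intercept term be learned as just another entry of β.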
We must also update our test data to include the second explanatory variable, as follows:
| Test Instance | Diameter (in inches) | Number of toppings | Price (in dollars) |
|---|---|---|---|
| 1 | 8 | 2 | 11 |
| 2 | 9 | 0 | 8.5 |
| 3 | 11 | 2 | 15 |
| 4 | 16 | 2 | 18 |
| 5 | 12 | 0 | 11 |
Our learning algorithm must estimate the values of three parameters: the coefficients for the two features and the intercept term. While one might be tempted to solve for β by dividing each side of the equation Y = Xβ by X, division by a matrix is impossible. However, just as dividing a number by an integer is equivalent to multiplying by the reciprocal of that integer, we can multiply by the inverse of X to avoid matrix division. Matrix inversion is denoted with a superscript -1. Only square matrices can be inverted, and X is unlikely to be square; the number of training instances would have to equal the number of features for it to be so. Instead, we will multiply X by its transpose to yield a square matrix that can be inverted. Denoted with a superscript T, the transpose of a matrix is formed by turning its rows into columns and vice versa. For example, the transpose of the 3 × 2 matrix [[1, 2], [3, 4], [5, 6]] is the 2 × 3 matrix [[1, 3, 5], [2, 4, 6]].
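To illustrate with NumPy (the example matrix here is our own, chosen for illustration), the `T` attribute of an array returns its transpose:

```python
import numpy as np

# A 3x2 matrix; transposing turns its rows into columns.
A = np.array([[1, 2], [3, 4], [5, 6]])
print(A.T)
# [[1 3 5]
#  [2 4 6]]
print(A.T.shape)  # (2, 3)
```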
To recap, our model is given by the following formula:

Y = Xβ
We know the values of X and Y from our training data. We must find the values of β that minimize the cost function. We can solve for β as follows:

β = (XᵀX)⁻¹XᵀY
We can solve for β using NumPy, as follows:
>>> from numpy.linalg import inv
>>> from numpy import dot, transpose
>>> X = [[1, 6, 2], [1, 8, 1], [1, 10, 0], [1, 14, 2], [1, 18, 0]]
>>> y = [[7], [9], [13], [17.5], [18]]
>>> print(dot(inv(dot(transpose(X), X)), dot(transpose(X), y)))
[[ 1.1875    ]
 [ 1.01041667]
 [ 0.39583333]]
NumPy also provides a least squares function that can solve for the values of the parameters more compactly:
>>> from numpy.linalg import lstsq
>>> X = [[1, 6, 2], [1, 8, 1], [1, 10, 0], [1, 14, 2], [1, 18, 0]]
>>> y = [[7], [9], [13], [17.5], [18]]
>>> print(lstsq(X, y)[0])
[[ 1.1875    ]
 [ 1.01041667]
 [ 0.39583333]]
Let's update our pizza-price predictor program to use the second explanatory variable, and compare its performance on the test set to that of the simple linear regression model:
>>> from sklearn.linear_model import LinearRegression
>>> X = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
>>> y = [[7], [9], [13], [17.5], [18]]
>>> model = LinearRegression()
>>> model.fit(X, y)
>>> X_test = [[8, 2], [9, 0], [11, 2], [16, 2], [12, 0]]
>>> y_test = [[11], [8.5], [15], [18], [11]]
>>> predictions = model.predict(X_test)
>>> for i, prediction in enumerate(predictions):
...     print('Predicted: %s, Target: %s' % (prediction, y_test[i]))
Predicted: [ 10.0625], Target: [11]
Predicted: [ 10.28125], Target: [8.5]
Predicted: [ 13.09375], Target: [15]
Predicted: [ 18.14583333], Target: [18]
Predicted: [ 13.3125], Target: [11]
>>> print('R-squared: %.2f' % model.score(X_test, y_test))
R-squared: 0.77
It appears that adding the number of toppings as an explanatory variable has improved the performance of our model. In later sections, we will discuss why evaluating a model on a single test set can provide inaccurate estimates of the model's performance, and how we can estimate its performance more accurately by training and testing on many partitions of the data. For now, however, we can accept that the multiple linear regression model performs significantly better than the simple linear regression model. There may be other attributes of pizzas that can be used to explain their prices. What if the relationship between these explanatory variables and the response variable is not linear in the real world? In the next section, we will examine a special case of multiple linear regression that can be used to model nonlinear relationships.
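As a rough cross-check (a sketch of our own using the data from this section; it refits a diameter-only model rather than reusing the earlier chapter's model object), we can score both models on the same test set:

```python
from sklearn.linear_model import LinearRegression

# Training and test data from this section: [diameter, number of toppings] -> price.
X_train = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y_train = [7, 9, 13, 17.5, 18]
X_test = [[8, 2], [9, 0], [11, 2], [16, 2], [12, 0]]
y_test = [11, 8.5, 15, 18, 11]

# Simple model: diameter only.
simple = LinearRegression().fit([[x[0]] for x in X_train], y_train)
print('Simple:   %.2f' % simple.score([[x[0]] for x in X_test], y_test))

# Multiple model: diameter and number of toppings.
multiple = LinearRegression().fit(X_train, y_train)
print('Multiple: %.2f' % multiple.score(X_test, y_test))
```

The diameter-only model scores roughly 0.66 on this test set, compared to 0.77 for the multiple regression model.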