At the beginning of this chapter, we mentioned that linear regression can be understood as obtaining the best-fitting straight line through the sample points of our training data. However, we have neither defined the term best-fitting nor have we discussed the different techniques of fitting such a model. In the following subsections, we will fill in the missing pieces of this puzzle using the Ordinary Least Squares (OLS) method (sometimes also called linear least squares) to estimate the parameters of the linear regression line that minimizes the sum of the squared vertical distances (residuals or errors) to the sample points.
Consider our implementation of the ADAptive LInear NEuron (Adaline) from Chapter 2, Training Simple Machine Learning Algorithms for Classification; we remember that the artificial neuron uses a linear activation function. Also, we defined a cost function $J(w)$, which we minimized to learn the weights via optimization algorithms, such as Gradient Descent (GD) and Stochastic Gradient Descent (SGD). This cost function in Adaline is the Sum of Squared Errors (SSE), which is identical to the cost function that we use for OLS:

$$J(w) = \frac{1}{2}\sum_{i}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

Here, $\hat{y}^{(i)}$ is the predicted value (note that the term $\frac{1}{2}$ is just used for convenience to derive the update rule of GD). Essentially, OLS regression can be understood as Adaline without the unit step function, so that we obtain continuous target values instead of the class labels -1 and 1. To demonstrate this, let us take the GD implementation of Adaline from Chapter 2, Training Simple Machine Learning Algorithms for Classification and remove the unit step function to implement our first linear regression model:
class LinearRegressionGD(object):

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of passes over the training set

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            # gradient descent step: move in the opposite
            # direction of the gradient of the SSE cost
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)
If you need a refresher about how the weights are being updated—taking a step into the opposite direction of the gradient—please revisit the Adaptive linear neurons and the convergence of learning section in Chapter 2, Training Simple Machine Learning Algorithms for Classification.
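For reference, the weight update that fit implements in each epoch corresponds to the batch gradient descent rule for the SSE cost:

$$\Delta w_j = \eta \sum_{i}\left(y^{(i)} - \hat{y}^{(i)}\right)x_j^{(i)}, \qquad w_j := w_j + \Delta w_j$$

where $\eta$ is the learning rate and $x_0^{(i)} = 1$, so that the bias unit update reduces to the errors.sum() term in the code.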
To see our LinearRegressionGD regressor in action, let's use the RM (number of rooms) variable from the Housing dataset as the explanatory variable and train a model that can predict MEDV (house prices). Furthermore, we will standardize the variables for better convergence of the GD algorithm. The code is as follows:
>>> X = df[['RM']].values
>>> y = df['MEDV'].values
>>> from sklearn.preprocessing import StandardScaler
>>> sc_x = StandardScaler()
>>> sc_y = StandardScaler()
>>> X_std = sc_x.fit_transform(X)
>>> y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()
>>> lr = LinearRegressionGD()
>>> lr.fit(X_std, y_std)
Notice the workaround regarding y_std, using np.newaxis and flatten. Most transformers in scikit-learn expect data to be stored in two-dimensional arrays. In the previous code example, the use of np.newaxis in y[:, np.newaxis] added a new dimension to the array. Then, after the StandardScaler returned the scaled variable, we converted it back to the original one-dimensional array representation using the flatten() method for our convenience.
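To make this reshaping explicit, here is a minimal, self-contained example on a small toy array (rather than the Housing data):

>>> import numpy as np
>>> a = np.array([1.0, 2.0, 3.0])
>>> a.shape
(3,)
>>> a[:, np.newaxis].shape
(3, 1)
>>> a[:, np.newaxis].flatten().shape
(3,)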
We discussed in Chapter 2, Training Simple Machine Learning Algorithms for Classification, that it is always a good idea to plot the cost as a function of the number of epochs (complete passes over the training dataset) when we are using optimization algorithms, such as gradient descent, to check that the algorithm has converged to a cost minimum (here, a global cost minimum):
>>> sns.reset_orig() # resets matplotlib style
>>> plt.plot(range(1, lr.n_iter+1), lr.cost_)
>>> plt.ylabel('SSE')
>>> plt.xlabel('Epoch')
>>> plt.show()
As we can see in the following plot, the GD algorithm converged after the fifth epoch:
Next, let's visualize how well the linear regression line fits the training data. To do so, we will define a simple helper function that will plot a scatterplot of the training samples and add the regression line:
>>> def lin_regplot(X, y, model):
...     plt.scatter(X, y, c='steelblue', edgecolor='white', s=70)
...     plt.plot(X, model.predict(X), color='black', lw=2)
...     return None
Now, we will use this lin_regplot function to plot the number of rooms against house price:
>>> lin_regplot(X_std, y_std, lr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000s [MEDV] (standardized)')
>>> plt.show()
As we can see in the following plot, the linear regression line reflects the general trend that house prices tend to increase with the number of rooms:
Although this observation makes intuitive sense, the data also tells us that the number of rooms does not explain the house prices very well in many cases. Later in this chapter, we will discuss how to quantify the performance of a regression model. Interestingly, we also observe that several data points lined up at $y=3$, which suggests that the prices may have been clipped. In certain applications, it may also be important to report the predicted outcome variables on their original scale. To scale the predicted price outcome back onto the Price in $1000s axis, we can simply apply the inverse_transform method of the StandardScaler:
>>> # scikit-learn transformers expect 2D arrays,
>>> # hence the [[5.0]] and the reshape below
>>> num_rooms_std = sc_x.transform(np.array([[5.0]]))
>>> price_std = lr.predict(num_rooms_std)
>>> print("Price in $1000s: %.3f" %
...       sc_y.inverse_transform(price_std.reshape(-1, 1)).flatten()[0])
Price in $1000s: 10.840
In this code example, we used the previously trained linear regression model to predict the price of a house with five rooms. According to our model, such a house is worth $10,840.
On a side note, it is also worth mentioning that we technically don't have to update the intercept weight if we are working with standardized variables, since the y-axis intercept is always 0 in those cases. We can quickly confirm this by printing the weights:
>>> print('Slope: %.3f' % lr.w_[1])
Slope: 0.695
>>> print('Intercept: %.3f' % lr.w_[0])
Intercept: -0.000
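This follows from the closed-form expression for the intercept in simple linear regression, $w_0 = \bar{y} - w_1\bar{x}$: standardization makes both sample means zero, so $w_0 = 0 - w_1 \cdot 0 = 0$.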
In the previous section, we implemented a working model for regression analysis; however, in a real-world application, we may be interested in more efficient implementations. For example, many of scikit-learn's estimators for regression make use of the LIBLINEAR library, advanced optimization algorithms, and other code optimizations that work better with unstandardized variables, which is sometimes desirable for certain applications:
>>> from sklearn.linear_model import LinearRegression
>>> slr = LinearRegression()
>>> slr.fit(X, y)
>>> print('Slope: %.3f' % slr.coef_[0])
Slope: 9.102
>>> print('Intercept: %.3f' % slr.intercept_)
Intercept: -34.671
As we can see from executing this code, scikit-learn's LinearRegression model, fitted with the unstandardized RM and MEDV variables, yielded different model coefficients. Let's compare it to our GD implementation by plotting MEDV against RM:
>>> lin_regplot(X, y, slr)
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000s [MEDV]')
>>> plt.show()
Now, when we plot the training data and our fitted model by executing this code, we can see that the overall result looks identical to our GD implementation:
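The two sets of coefficients describe the same fitted line, just on different scales. As a quick sanity check (a small sketch using the scale_ attribute of the fitted StandardScaler objects), we can rescale scikit-learn's slope to the standardized scale, where it should closely match the 0.695 slope learned by our GD implementation:

>>> # slope on the standardized scale: w1 * sigma_x / sigma_y
>>> print('Rescaled slope: %.3f' %
...       (slr.coef_[0] * sc_x.scale_[0] / sc_y.scale_[0]))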
As an alternative to using machine learning libraries, there is also a closed-form solution for solving OLS involving a system of linear equations that can be found in most introductory statistics textbooks:

$$w = \left(X^T X\right)^{-1} X^T y$$
We can implement it in Python as follows:
>>> # add a column vector of "ones" for the intercept term
>>> Xb = np.hstack((np.ones((X.shape[0], 1)), X))
>>> z = np.linalg.inv(np.dot(Xb.T, Xb))
>>> w = np.dot(z, np.dot(Xb.T, y))
>>> print('Slope: %.3f' % w[1])
Slope: 9.102
>>> print('Intercept: %.3f' % w[0])
Intercept: -34.671
The advantage of this method is that it is guaranteed to find the optimal solution analytically. However, if we are working with very large datasets, it can be computationally too expensive to invert the matrix in this formula (sometimes also called the normal equation), or the matrix $X^T X$ may be singular (non-invertible), for example, if features are collinear, which is why we may prefer iterative methods in certain cases.
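In practice, rather than inverting $X^T X$ explicitly, it is numerically safer to hand the problem to a dedicated least-squares solver. As a minimal sketch using NumPy (reusing the Xb design matrix from the previous example), np.linalg.lstsq computes the same OLS solution via an SVD-based routine and also handles singular inputs gracefully:

>>> # solve min ||Xb w - y||^2 without forming the matrix inverse
>>> w, residuals, rank, sv = np.linalg.lstsq(Xb, y, rcond=None)
>>> print('Slope: %.3f' % w[1])
Slope: 9.102
>>> print('Intercept: %.3f' % w[0])
Intercept: -34.671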
If you are interested in more information on how to obtain normal equations, I recommend you take a look at Dr. Stephen Pollock's chapter The Classical Linear Regression Model from his lectures at the University of Leicester, which is available for free at: http://www.le.ac.uk/users/dsgp1/COURSES/MESOMET/ECMETXT/06mesmet.pdf.