Implementing an ordinary least squares linear regression model

At the beginning of this chapter, we mentioned that linear regression can be understood as obtaining the best-fitting straight line through the sample points of our training data. However, we have neither defined the term best-fitting nor have we discussed the different techniques of fitting such a model. In the following subsections, we will fill in the missing pieces of this puzzle using the Ordinary Least Squares (OLS) method (sometimes also called linear least squares) to estimate the parameters of the linear regression line that minimizes the sum of the squared vertical distances (residuals or errors) to the sample points.
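To make the idea of minimizing squared vertical distances concrete, here is a minimal sketch (using a made-up three-point dataset and an arbitrary candidate line, not the Housing data we will work with later) that computes the residuals and their squared sum for one particular choice of slope and intercept; OLS simply searches for the slope and intercept that make this quantity as small as possible:

>>> import numpy as np
>>> x_toy = np.array([1.0, 2.0, 3.0])      # toy explanatory variable
>>> y_toy = np.array([1.0, 2.5, 2.0])      # toy target values
>>> w1, w0 = 0.5, 0.75                     # arbitrary candidate slope and intercept
>>> y_pred = w1 * x_toy + w0               # points on the candidate line
>>> residuals = y_toy - y_pred             # vertical distances (errors)
>>> print(np.sum(residuals**2))            # the quantity that OLS minimizes
0.6875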

Solving regression for regression parameters with gradient descent

Consider our implementation of the ADAptive LInear NEuron (Adaline) from Chapter 2, Training Simple Machine Learning Algorithms for Classification; we remember that the artificial neuron uses a linear activation function. Also, we defined a cost function $J(\mathbf{w})$, which we minimized to learn the weights via optimization algorithms, such as Gradient Descent (GD) and Stochastic Gradient Descent (SGD). This cost function in Adaline is the Sum of Squared Errors (SSE), which is identical to the cost function that we use for OLS:

$$J(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

Here, $\hat{y}$ is the predicted value $\hat{y} = \mathbf{w}^T\mathbf{x}$ (note that the term $\frac{1}{2}$ is just used for convenience to derive the update rule of GD). Essentially, OLS regression can be understood as Adaline without the unit step function so that we obtain continuous target values instead of the class labels -1 and 1. To demonstrate this, let us take the GD implementation of Adaline from Chapter 2, Training Simple Machine Learning Algorithms for Classification and remove the unit step function to implement our first linear regression model:

import numpy as np


class LinearRegressionGD(object):

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of passes over the training set

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            # update rule: step in the opposite direction of the SSE gradient
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)

Note

If you need a refresher about how the weights are being updated—taking a step into the opposite direction of the gradient—please revisit the Adaptive linear neurons and the convergence of learning section in Chapter 2, Training Simple Machine Learning Algorithms for Classification.
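For reference, the weight update performed inside the fit method follows directly from the gradient of the SSE cost function defined above; written out (consistent with the Adaline derivation in Chapter 2), the partial derivative and the resulting update are:

$$\frac{\partial J}{\partial w_j} = -\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)x_j^{(i)}, \qquad \Delta w_j = -\eta\,\frac{\partial J}{\partial w_j} = \eta\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)x_j^{(i)}$$

For the bias unit $w_0$, the corresponding $x_0^{(i)}$ is simply 1, which is why the code accumulates errors.sum().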

To see our LinearRegressionGD regressor in action, let's use the RM (number of rooms) variable from the Housing dataset as the explanatory variable and train a model that can predict MEDV (house prices). Furthermore, we will standardize the variables for better convergence of the GD algorithm. The code is as follows:

>>> X = df[['RM']].values
>>> y = df['MEDV'].values
>>> from sklearn.preprocessing import StandardScaler
>>> sc_x = StandardScaler()
>>> sc_y = StandardScaler()
>>> X_std = sc_x.fit_transform(X)
>>> y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()
>>> lr = LinearRegressionGD()
>>> lr.fit(X_std, y_std)

Notice the workaround regarding y_std, using np.newaxis and flatten. Most transformers in scikit-learn expect data to be stored in two-dimensional arrays. In the previous code example, the use of np.newaxis in y[:, np.newaxis] added a new dimension to the array. Then, after the StandardScaler returned the scaled variable, we converted it back to the original one-dimensional array representation using the flatten() method for our convenience.
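To make the reshaping explicit, here is a minimal sketch on a made-up array (y_toy is just an illustrative name, not part of the Housing example) showing how np.newaxis turns a one-dimensional array into the two-dimensional column that StandardScaler expects, and how flatten brings it back:

>>> import numpy as np
>>> y_toy = np.array([1.0, 2.0, 3.0])
>>> y_toy.shape
(3,)
>>> y_toy[:, np.newaxis].shape
(3, 1)
>>> y_toy[:, np.newaxis].flatten().shape
(3,)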

We discussed in Chapter 2, Training Simple Machine Learning Algorithms for Classification that it is always a good idea to plot the cost as a function of the number of epochs (passes over the training dataset) when we are using optimization algorithms such as gradient descent, to check that the algorithm has converged to a cost minimum (here, a global cost minimum):

>>> sns.reset_orig() # resets matplotlib style
>>> plt.plot(range(1, lr.n_iter+1), lr.cost_)
>>> plt.ylabel('SSE')
>>> plt.xlabel('Epoch')
>>> plt.show()

As we can see in the following plot, the GD algorithm converged after the fifth epoch:

[Figure: SSE plotted against the number of epochs for the LinearRegressionGD model, showing convergence after the fifth epoch]

Next, let's visualize how well the linear regression line fits the training data. To do so, we will define a simple helper function that will plot a scatterplot of the training samples and add the regression line:

>>> def lin_regplot(X, y, model):
...     plt.scatter(X, y, c='steelblue', edgecolor='white', s=70)
...     plt.plot(X, model.predict(X), color='black', lw=2)
...     return None

Now, we will use this lin_regplot function to plot the number of rooms against house price:

>>> lin_regplot(X_std, y_std, lr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000s [MEDV] (standardized)')
>>> plt.show()

As we can see in the following plot, the linear regression line reflects the general trend that house prices tend to increase with the number of rooms:

[Figure: scatterplot of the standardized average number of rooms [RM] versus the standardized price [MEDV] with the fitted linear regression line]

Although this observation makes intuitive sense, the data also tells us that the number of rooms does not explain the house prices very well in many cases. Later in this chapter, we will discuss how to quantify the performance of a regression model. Interestingly, we also observe that several data points lined up at $y = 3$, which suggests that the prices may have been clipped. In certain applications, it may also be important to report the predicted outcome variables on their original scale. To scale the predicted price outcome back onto the Price in $1000s axis, we can simply apply the inverse_transform method of the StandardScaler:

>>> num_rooms_std = sc_x.transform(np.array([[5.0]]))
>>> price_std = lr.predict(num_rooms_std)
>>> print("Price in $1000s: %.3f" % 
...       sc_y.inverse_transform(price_std))
Price in $1000s: 10.840

In this code example, we used the previously trained linear regression model to predict the price of a house with five rooms. According to our model, such a house is worth $10,840.

On a side note, it is also worth mentioning that we technically don't have to update the weight of the intercept if we are working with standardized variables, since the y-axis intercept is always 0 in those cases. We can quickly confirm this by printing the weights:

>>> print('Slope: %.3f' % lr.w_[1])
Slope: 0.695
>>> print('Intercept: %.3f' % lr.w_[0])
Intercept: -0.000
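This is not a coincidence: for simple OLS regression, the optimal intercept can be written in terms of the sample means, $w_0 = \bar{y} - w_1\bar{x}$, so if both variables have been standardized to zero mean ($\bar{x} = \bar{y} = 0$), the intercept is necessarily 0.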

Estimating the coefficient of a regression model via scikit-learn

In the previous section, we implemented a working model for regression analysis; however, in a real-world application, we may be interested in more efficient implementations. For example, many of scikit-learn's estimators for regression make use of the LIBLINEAR library, advanced optimization algorithms, and other code optimizations that work better with unstandardized variables, which is sometimes desirable for certain applications:

>>> from sklearn.linear_model import LinearRegression
>>> slr = LinearRegression()
>>> slr.fit(X, y)
>>> print('Slope: %.3f' % slr.coef_[0])
Slope: 9.102
>>> print('Intercept: %.3f' % slr.intercept_)
Intercept: -34.671

As we can see from executing this code, scikit-learn's LinearRegression model, fitted with the unstandardized RM and MEDV variables, yielded different model coefficients. Let's compare it to our GD implementation by plotting MEDV against RM:

>>> lin_regplot(X, y, slr)
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000s [MEDV]')
>>> plt.show()

Now, when we plot the training data and our fitted model by executing this code, we can see that the overall result looks identical to our GD implementation:

[Figure: scatterplot of the average number of rooms [RM] versus the price [MEDV] on the original scale, with the regression line fitted by scikit-learn's LinearRegression]
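As a quick sanity check, we can also map the slope that our GD model learned on the standardized data back to the original scale and compare it with scikit-learn's coefficients. This is a minimal sketch, assuming the fitted lr, sc_x, and sc_y objects from the previous section are still available (slope_orig and intercept_orig are just illustrative variable names):

>>> # undo the standardization: w1_orig = w1_std * sigma_y / sigma_x
>>> slope_orig = lr.w_[1] * sc_y.scale_[0] / sc_x.scale_[0]
>>> intercept_orig = (sc_y.mean_[0]
...                   - slope_orig * sc_x.mean_[0]
...                   + lr.w_[0] * sc_y.scale_[0])
>>> print('Slope: %.3f' % slope_orig)
>>> print('Intercept: %.3f' % intercept_orig)

The printed values should closely match the coefficients reported by LinearRegression above; small deviations can remain because our GD model was only trained for 20 epochs.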

Note

As an alternative to using machine learning libraries, there is also a closed-form solution for solving OLS involving a system of linear equations that can be found in most introductory statistics textbooks:

$$\mathbf{w} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}$$

We can implement it in Python as follows:

# adding a column vector of "ones" for the intercept term
>>> Xb = np.hstack((np.ones((X.shape[0], 1)), X))
>>> z = np.linalg.inv(np.dot(Xb.T, Xb))     # (X^T X)^-1
>>> w = np.dot(z, np.dot(Xb.T, y))          # (X^T X)^-1 X^T y
>>> print('Slope: %.3f' % w[1])
Slope: 9.102
>>> print('Intercept: %.3f' % w[0])
Intercept: -34.671

The advantage of this method is that it is guaranteed to find the optimal solution analytically. However, if we are working with very large datasets, it can be computationally too expensive to invert the matrix in this formula (sometimes also called the normal equation) or the sample matrix may be singular (non-invertible), which is why we may prefer iterative methods in certain cases.
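If you would like the closed-form solution without explicitly inverting the matrix, a more numerically stable alternative is to use a least-squares solver. The following is a minimal sketch using np.linalg.lstsq (not part of the original example), which also copes with rank-deficient, that is, singular, design matrices:

>>> Xb = np.hstack((np.ones((X.shape[0], 1)), X))
>>> w, residuals, rank, sv = np.linalg.lstsq(Xb, y, rcond=None)
>>> print('Slope: %.3f' % w[1])
>>> print('Intercept: %.3f' % w[0])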

If you are interested in more information on how to obtain normal equations, I recommend you take a look at Dr. Stephen Pollock's chapter The Classical Linear Regression Model from his lectures at the University of Leicester, which is available for free at: http://www.le.ac.uk/users/dsgp1/COURSES/MESOMET/ECMETXT/06mesmet.pdf.
