Let's now go ahead and build a simple linear regression model, see what issues we face, and how they can be resolved to make the model more robust. We will use the advertising data that we used earlier to illustrate correlation.
The following two methods implement linear regression in Python:

- The ols method from the statsmodels.formula.api library
- The scikit-learn package

Let's implement a simple linear regression using the first method and then build upon it to create a multiple linear regression model. We will then also look at how the second method is used to do the same.
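For reference, here is a minimal sketch of the second method, scikit-learn, applied to the same task (the shortened file path here is an assumption; substitute your own location of Advertising.csv):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed location of the Advertising dataset used throughout this section
advert = pd.read_csv('Advertising.csv')

# scikit-learn expects a 2-D array/DataFrame of predictors and a 1-D target
lr = LinearRegression()
lr.fit(advert[['TV']], advert['Sales'])

# intercept_ corresponds to the intercept, coef_ holds the slope(s)
lr.intercept_, lr.coef_

We will stick to the statsmodels approach for the rest of this section, as its summary output is more convenient for inspecting p-values and other statistics.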
Let's first import the Advertising data, as shown:
import pandas as pd
advert = pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Linear Regression/Advertising.csv')
advert.head()
To reiterate, this dataset contains the advertising budgets spent on TV, Radio, and Newspapers for a particular product, and the resulting sales. We expect a positive correlation between such advertising costs and sales. We have already seen that there is a good correlation between TV advertising costs and sales. Let's see whether this relationship holds and, if so, what it looks like. To do that, we write the following code:
import statsmodels.formula.api as smf
model1 = smf.ols(formula='Sales~TV', data=advert).fit()
model1.params
In this code snippet, we have assumed a linear relationship between TV advertising costs and sales, and created a best fit using the least sum of squares method. This snippet will output the values for the model parameters, that is, a and β. The following is the output:
In the notation that we have been using, a is the intercept and β is the slope. Thus, a = 7.03 and β ≈ 0.047.
The equation for the model will be:
Sales = 7.03 + 0.047 × TV
The equation implies that an increase of 100 units in TV advertising costs will increase sales by about 4.7 units (100 × 0.047).
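As a quick sanity check, we can recover these estimates directly: for simple linear regression, the least squares slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. A minimal sketch using NumPy:

import numpy as np

x = advert['TV']
y = advert['Sales']

# Least squares estimates for simple linear regression
beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # slope = cov(x, y) / var(x)
alpha = np.mean(y) - beta * np.mean(x)         # intercept from the means
alpha, beta  # should match model1.params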
If you remember, we learnt that the values of these parameters are estimates and that there is a p-value associated with them. If the p-values are very small, then we can accept that these parameters have non-zero values and are statistically significant in the model. Let's have a look at the p-values for these parameters:
model1.pvalues
As can be seen, the p-values are very small; hence, the parameters are significant.
Let's also check another important indicator of model efficacy: R2. As we saw earlier, there is a ready-made method for this, which can be called with the following line of code:
model1.rsquared
The value comes out to be 0.61.
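Under the hood, R2 is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch to verify the 0.61 by hand, using the model's in-sample predictions:

import numpy as np

# Residual and total sums of squares
rss = np.sum((advert['Sales'] - model1.fittedvalues)**2)
tss = np.sum((advert['Sales'] - advert['Sales'].mean())**2)
1 - rss/tss  # should match model1.rsquared (about 0.61)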
If we want all the important model parameters at one go, we can take a look at the model summary by writing this snippet:
model1.summary()
The result is as follows:
As we can see, the F-statistic for this model is very high and the associated p-value is negligible, suggesting that the parameter estimates for this model were all significant and non-zero.
Let's now predict the values of sales based on the equation we just derived. This can be done using the following snippet:
sales_pred = model1.predict(pd.DataFrame(advert['TV']))
sales_pred
This snippet calculates the predicted sales value for each row based on the model equation using TV costs. One can plot sales_pred against the TV advertising costs to find the line of best fit. Let's do that:
import matplotlib.pyplot as plt
%matplotlib inline
advert.plot(kind='scatter', x='TV', y='Sales')
plt.plot(pd.DataFrame(advert['TV']), sales_pred, c='red', linewidth=2)
We get the following plot as the output. The red line is the line of best fit (obtained from the model). The blue dots represent the actual data present:
Now, let's calculate the RSE term for our prediction using the following code snippet:
import numpy as np
advert['sales_pred'] = 0.047537*advert['TV'] + 7.03
advert['RSE'] = (advert['Sales'] - advert['sales_pred'])**2
RSEd = advert.sum()['RSE']
RSE = np.sqrt(RSEd/198)
salesmean = np.mean(advert['Sales'])
error = RSE/salesmean
RSE, salesmean, error
The output consists of three numbers: the first is RSE = 3.25, the second is salesmean (the mean of actual sales) = 14.02, and the third is their ratio, which is equal to 0.23. Thus, on average, this model's predictions will be off by about 23%, even if the coefficients are correctly estimated. This is a significant amount of error, and we would like to bring it down by some means. Also, the R2 value of 0.61 can be improved upon. One thing we can try is to add more columns to the model as predictors and see whether the results improve.
When linear regression involves more than one predictor variable, it is called multiple linear regression. The nature of the model remains the same (linear), except that there is a separate slope (β) coefficient associated with each of the predictor variables. The model will be represented as follows:
Y = a + β1X1 + β2X2 + … + βpXp
Each βi is estimated using the same least sum of squares method and hence has a p-value associated with its estimate. The smaller the p-value associated with a variable, the greater the significance of that variable to the model. Variables with large p-values should be eliminated from the model, as they aren't good predictors of the output variable.
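For instance, once a model has been fit with statsmodels, the p-values are available as a pandas Series, so weak predictors can be spotted programmatically (the 0.05 cut-off below is a common convention, not a rule):

# p-values of all coefficient estimates in the fitted model
pvals = model1.pvalues

# Predictors whose estimates are not significant at the 5% level
pvals[pvals > 0.05]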
While multiple regression gives us the possibility of using more variables as predictors, and hence can increase the efficiency of the model, it also increases the complexity of the model-building process, as the selection of the variables to be kept and discarded in the model becomes tedious.
With this simple dataset of three predictor variables, there can be seven possible models. They are as follows:

- Sales ~ TV
- Sales ~ Radio
- Sales ~ Newspaper
- Sales ~ TV + Radio
- Sales ~ TV + Newspaper
- Sales ~ Radio + Newspaper
- Sales ~ TV + Radio + Newspaper
For a model with p possible predictor variables, there can be 2^p − 1 possible models; hence, as the number of predictors increases, the selection becomes tedious.
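We can verify this count for our dataset by enumerating the non-empty subsets of predictors with itertools; a minimal sketch:

from itertools import combinations

predictors = ['TV', 'Radio', 'Newspaper']

# Every non-empty subset of the predictors defines one candidate model
models = [subset
          for size in range(1, len(predictors) + 1)
          for subset in combinations(predictors, size)]
len(models)  # 2**3 - 1 = 7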
It would have been a tedious task to choose from so many possible models. Thankfully, there are a few guidelines to filter some of these and then navigate towards the most efficient one. The following are the guidelines:
Based on these guidelines, there are two kinds of approaches to select the predictor variables to go into the final model (a sketch of the first approach follows this list):

- Forward selection: start with no predictors and add variables one at a time, keeping each addition only if it improves the model
- Backward selection (or elimination): start with all the candidate predictors and remove the least significant variables one at a time
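As an illustration, here is a minimal sketch of forward selection. The function name and the choice of adjusted R2 as the selection criterion are our own assumptions; real implementations vary in both the criterion and the stopping rule:

import statsmodels.formula.api as smf

def forward_select(data, target, candidates):
    """Greedy forward selection using adjusted R-squared as the criterion."""
    selected = []
    best_adj_r2 = float('-inf')
    while candidates:
        # Score each remaining candidate when added to the current selection
        scores = []
        for var in candidates:
            formula = target + '~' + '+'.join(selected + [var])
            fit = smf.ols(formula=formula, data=data).fit()
            scores.append((fit.rsquared_adj, var))
        scores.sort(reverse=True)
        if scores[0][0] <= best_adj_r2:
            break  # no remaining variable improves the model; stop
        best_adj_r2, best_var = scores[0]
        selected.append(best_var)
        candidates.remove(best_var)
    return selected, best_adj_r2

# Hypothetical usage on the advertising data
forward_select(advert, 'Sales', ['TV', 'Radio', 'Newspaper'])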
Many statistical programs, including Python's libraries, give us an option to select between the two preceding approaches while implementing a linear regression. The statistical program then implements the linear regression using the selected approach.
For now, let us manually add a few variables and see how it changes the model parameters and efficacy, so that we can get a better glimpse of what goes on behind the curtain when these approaches are implemented by the statistical program.
We have already seen one model assuming a linear relationship between sales and TV advertising costs. We can ignore the other single-variable models (that is, those using only newspaper or radio), as these variables have smaller correlations with sales than TV does. Let us now try to add more variables to the model we already have and see how the parameters and efficacy change.
Let us try adding the newspaper variable to the model using the following code snippet:
import statsmodels.formula.api as smf
model2 = smf.ols(formula='Sales~TV+Newspaper', data=advert).fit()
model2.params
The following are the results:
The p-values for the coefficients are very small, suggesting that all the estimates are significant. The equation for this model will be:
Sales = 5.77 + 0.046 × TV + 0.04 × Newspaper
The values of R2 and adjusted R2 are 0.646 and 0.642, which is just a minor improvement from the value obtained in the earlier model.
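Since statsmodels exposes these quantities on the fitted results, the comparison can be made directly:

# Compare the TV-only model with the TV + Newspaper model
model1.rsquared, model2.rsquared, model2.rsquared_adj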
The values can be predicted using the following snippet:
sales_pred = model2.predict(advert[['TV','Newspaper']])
sales_pred
To calculate the RSE, we modify the snippet a little:
import numpy as np
advert['sales_pred'] = 5.77 + 0.046*advert['TV'] + 0.04*advert['Newspaper']
advert['RSE'] = (advert['Sales'] - advert['sales_pred'])**2
RSEd = advert.sum()['RSE']
RSE = np.sqrt(RSEd/197)
salesmean = np.mean(advert['Sales'])
error = RSE/salesmean
RSE, salesmean, error
The value of RSE comes out to be 3.12 (22%), not very different from the model with only TV. The number 197 comes from the n − p − 1 term in the formula for RSE, where n = 200 and p = 2 for the current model. The following table is the model summary:
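Since we will compute the RSE for every model we try, it can be handy to wrap the calculation in a small helper. The following is a minimal sketch (the function name and interface are our own, not from any library), which reads n and p off the fitted statsmodels model:

import numpy as np

def rse(fitted_model, data, target='Sales'):
    """Residual standard error of a fitted statsmodels OLS model."""
    residuals = data[target] - fitted_model.fittedvalues
    n = len(data)
    p = len(fitted_model.params) - 1  # number of predictors (minus intercept)
    return np.sqrt((residuals**2).sum() / (n - p - 1))

rse(model2, advert)  # should come out close to 3.12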
The F-statistic has decreased compared to the previous model, although its associated p-value is still negligible. As we can see in the adjusted R2 value, the improvement is only marginal. So, adding newspaper didn't improve the model significantly.
Let's try adding radio to the model instead of newspaper. Radio had the second-best correlation with the Sales variable in the correlation matrix we created earlier in the chapter. Thus, one expects a significant improvement in the model upon its addition. Let's see whether that happens:
import statsmodels.formula.api as smf
model3 = smf.ols(formula='Sales~TV+Radio', data=advert).fit()
model3.params
The output parameters and the associated p-values of this model are, as follows:
The model can be represented as the following:
The values can be predicted based on the preceding model using the following snippet:
sales_pred = model3.predict(advert[['TV','Radio']])
sales_pred
The model summary looks similar to the following screenshot:
One thing to observe here is that the R2 value has improved considerably due to the addition of radio to the model. Also, the F-statistic has increased significantly from the last model, indicating a very efficient model.
The RSE can be calculated using the same method described previously. The value for this model comes out to be 1.71 (around 12%), which is much better than the 23% and 22% of the previous models.
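Concretely, using model3's predictions from the preceding snippet and n − p − 1 = 197 degrees of freedom:

import numpy as np

RSE = np.sqrt(((advert['Sales'] - sales_pred)**2).sum() / 197)
error = RSE / advert['Sales'].mean()
RSE, error  # roughly 1.71 and 0.12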
Thus, we can conclude that radio is a great addition to the model: TV and radio advertising costs together describe sales very well, and the model itself is very efficient. But can we improve it a little further by combining all three predictor variables?
The last thing we should try is using all the predictor variables together, with the following code:
import statsmodels.formula.api as smf
model4 = smf.ols(formula='Sales~TV+Radio+Newspaper', data=advert).fit()
model4.params
The estimates of the coefficients and the associated p-values for this model will be as follows:
This time, note the p-values: while the estimates for TV and radio remain significant, the p-value for newspaper is large, suggesting that its estimate is not significant in this model. The equation for this model will be:
The values of sales can be predicted using the following snippet:
sales_pred = model4.predict(advert[['TV','Radio','Newspaper']])
sales_pred
The summary of the model is shown in the following table:
The most striking feature of this model is that the estimate of the coefficients is very similar to that in the previous model. The intercept, coefficient for TV, and the coefficient for Radio are more or less the same. The values of R2 and adj-R2 are also similar to the previous model.
The value of RSE can be calculated in a similar way as before. The value comes out to be 2.57 (18%), which is higher than that of the previous model.
Other things to note about this model are the following:
All of these point in the direction of the model actually becoming a little less efficient on the addition of newspaper to the previous model. What is the reason?
Multi-collinearity is the reason for the sub-optimal performance of the model when newspaper was added. Multi-collinearity refers to correlation between the predictor variables of the model.
These are some of the signs of a common problem encountered during regression called multi-collinearity. Go back a few pages to the correlation matrix that we created for this dataset and you will find that there is a significant correlation of 0.35 between radio and newspaper. This means that the expense on newspaper is related to that on radio. This relationship between the predictor variables increases the variability of the coefficient estimates of the related predictor variables.
The t-statistic for these coefficients is calculated by dividing the coefficient estimate by its variability (the standard error). As this error goes up, the value of the t-statistic goes down and the p-value increases. This increases the chance that the null hypothesis in the hypothesis test associated with the t-statistic will be accepted, which decreases the significance of the variable in the model.
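We can see this relationship directly in statsmodels, where the reported t-statistics are just the coefficient estimates divided by their standard errors:

# t-statistics = coefficient estimates / standard errors
model4.params / model4.bse  # matches model4.tvalues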
Thus, collinearity is an issue that needs to be taken care of. For highly correlated predictor variables, we need to do a deep dive and see whose inclusion makes the model more efficient.
It is a good practice to identify pairs of predictor variables with high correlation using the correlation matrix, and to check each pair's multi-collinearity effect on the model. The culprit variables should be removed from the model. The VIF is a method to tackle this issue.
A fool-proof way to detect this menace called multi-collinearity is a statistic called the Variance Inflation Factor (VIF). It quantifies the rise in the variability of the coefficient estimate of a particular variable caused by high correlation between two or more predictor variables. The VIF needs to be calculated for each of the variables, and if the value is very high for a particular variable, then that predictor needs to be eliminated from the model. Some statistical programs calculate the VIF when fed the option to do so. The following process goes on under the hood for the calculation of VIF: each predictor variable is, in turn, regressed against all the other predictors; if Ri2 is the R2 of this regression for the i-th variable, then its VIF is given by VIFi = 1/(1 − Ri2). A VIF close to 1 means the variable is nearly uncorrelated with the other predictors; larger values indicate increasing collinearity.
Let us write a short snippet to calculate the VIF to understand the calculation better:
model = smf.ols(formula='Newspaper~TV+Radio', data=advert).fit()
rsquared = model.rsquared
VIF = 1/(1 - rsquared)
VIF
This gives the VIF for newspaper. By changing the formula in the snippet, we can calculate the VIF for the other variables. The following are the values:
| Newspaper | Radio | TV |
|---|---|---|
| 1.14 | 1.14 | 1.04 |
Newspaper and radio have almost the same VIF, indicating that they are correlated with just one another and not with TV.
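Rather than editing the formula by hand for each variable, we can loop over the predictors; a minimal sketch:

predictors = ['TV', 'Radio', 'Newspaper']
for var in predictors:
    # Regress each predictor on all the others and compute its VIF
    others = '+'.join(p for p in predictors if p != var)
    rsq = smf.ols(formula=var + '~' + others, data=advert).fit().rsquared
    print(var, round(1/(1 - rsq), 2))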
In this case, radio and newspaper are correlated. However, the model with TV and radio as predictor variables is far superior to the model with TV and newspaper as predictors. Adding all three variables as predictors doesn't improve the model much; in fact, it increases the variability of the estimates and decreases the F-statistic. It will be a decent choice to drop the newspaper variable from the model and pick model 3 as the best candidate for the final model: