Simple linear regression

Before looking at real-world datasets, it is very helpful to train a model on artificially generated data. In an artificial scenario such as this, we know the true output function beforehand, which is rarely the case with real-world data. The exercise shows how our model behaves under the ideal scenario in which all of our assumptions are fully satisfied, and it helps us visualize what a good linear fit looks like. We'll begin by simulating a simple linear regression model. The R snippet that follows creates a data frame with 100 simulated observations of this linear model with a single input feature:

$$y = 1.67\,x_1 - 2.93 + \varepsilon, \qquad \varepsilon \sim N(0, 2^2)$$

Here is the code for the simple linear regression model:

> set.seed(5427395)
> nObs = 100
> x1minrange = 5
> x1maxrange = 25
> x1 = runif(nObs, x1minrange, x1maxrange)
> e = rnorm(nObs, mean = 0, sd = 2.0)
> y = 1.67 * x1 - 2.93 + e
> df = data.frame(y, x1)

For our input feature, we randomly sample points from a uniform distribution, which gives a good spread of data points across the chosen range. Note that our final df data frame is meant to simulate a data frame that we would obtain in practice; as a result, it does not include the error terms e, as these would be unavailable to us in a real-world setting.

When we train a linear model on data such as the data in our data frame, we are essentially hoping to produce a linear model with the same coefficients as the underlying model that generated the data. Put differently, the original coefficients define a population regression line. Here, the population regression line represents the true underlying model of the data. In general, however, we will find ourselves attempting to model a function that is not necessarily linear. In that situation, we can still define the population regression line as the best possible linear approximation to the true function, but a linear regression model will not perform as well as it does when the true function is linear.

Estimating the regression coefficients

For our simple linear regression model, the process of training the model amounts to an estimation of our two regression coefficients from our dataset. As we can see from our previously constructed data frame, our data is effectively a series of observations, each of which is a pair of values (xi, yi) where the first element of the pair is the input feature value and the second element of the pair is its output label. It turns out that for the case of simple linear regression, it is possible to write down two equations that can be used to compute our two regression coefficients. Instead of merely presenting these equations, we'll first take a brief moment to review some very basic statistical quantities that the reader has most likely encountered previously, as they will be featured very shortly.

The mean of a set of values is just the average of these values and is often described as a measure of location, giving a sense of where the values are centered on the scale in which they are measured. In statistical literature, the average value of a random variable is often known as the expectation, so we often find that the mean of a random variable X is denoted as E(X). Another notation that is commonly used is bar notation, where we can represent the notion of taking the average of a variable by placing a bar over that variable. To illustrate this, the following two equations show the mean of the output variable y and input feature x:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
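As a small sketch, the two means can be computed in R either directly from the definition or with the built-in mean() function. The block regenerates the simulated data so that it runs on its own; the names y_bar and x_bar are chosen here purely for illustration.

```r
# Recreate the simulated data frame from earlier in this section
set.seed(5427395)
nObs <- 100
x1 <- runif(nObs, 5, 25)
e <- rnorm(nObs, mean = 0, sd = 2.0)
y <- 1.67 * x1 - 2.93 + e
df <- data.frame(y, x1)

# Mean from the definition: sum of the values divided by their count
y_bar <- sum(df$y) / nrow(df)
x_bar <- sum(df$x1) / nrow(df)

# The built-in mean() gives the same result
all.equal(y_bar, mean(df$y))
all.equal(x_bar, mean(df$x1))
```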

A second very common quantity, which should also be familiar, is the variance of a variable. The variance measures the average squared distance of individual values from the mean. In this way, it is a measure of dispersion: a low variance implies that most of the values are bunched up close to the mean, whereas a higher variance means that the values are more spread out. Note that the definition of variance involves the definition of the mean, and for this reason we'll see the use of the x variable with a bar on it in the following equation, which shows the variance of our input feature x:

$$\mathrm{Var}(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
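The following sketch computes the variance of the simulated input feature as an average squared distance from the mean. One point worth knowing is that R's built-in var() function divides by n - 1 rather than n, so the two results differ by a small factor. The names var_pop and var_sample are illustrative only.

```r
# Recreate the simulated input feature
set.seed(5427395)
x1 <- runif(100, 5, 25)
n <- length(x1)

# Variance as the average squared distance from the mean (divides by n)
var_pop <- sum((x1 - mean(x1))^2) / n

# R's var() divides by n - 1 instead (the sample variance)
var_sample <- var(x1)

# The two differ only by the factor (n - 1) / n
all.equal(var_pop, var_sample * (n - 1) / n)
```

With 100 observations the difference between the two versions is about one percent, and it has no effect on the regression slope, which is a ratio of two quantities scaled in the same way.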

Finally, we'll define the covariance between two random variables, x and y, using the following equation:

$$\mathrm{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

From the previous equation, it should be clear that the variance is a special case of the covariance in which the two variables are the same. The covariance measures how two variables vary together and can be positive or negative; its magnitude depends on the units of the variables, so the sign is the more directly interpretable part. A positive covariance implies a positive correlation; that is, when one variable increases, the other tends to increase as well. A negative covariance suggests the opposite; when one variable increases, the other tends to decrease. When two variables are statistically independent of each other, and hence uncorrelated, their covariance is zero (although a zero covariance does not necessarily imply statistical independence).
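These properties can be checked with a short sketch using R's cov() function. The vectors u and v are a hypothetical example, not part of the simulated dataset, chosen to show a pair of variables with zero covariance that are clearly not independent.

```r
# Recreate the simulated input and output
set.seed(5427395)
x1 <- runif(100, 5, 25)
y <- 1.67 * x1 - 2.93 + rnorm(100, mean = 0, sd = 2.0)

# The covariance of a variable with itself is its variance
all.equal(cov(x1, x1), var(x1))

# y rises with x1 in the simulated model, so the covariance is positive
cov(x1, y) > 0

# Zero covariance without independence: v is fully determined by u,
# but the relationship is symmetric around zero rather than linear
u <- c(-2, -1, 0, 1, 2)
v <- u^2
cov(u, v)
```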

Armed with these basic concepts, we can now present equations for the estimates of the two regression coefficients for the case of simple linear regression:

$$\hat{\beta}_1 = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The first of these coefficients, the slope β1, is computed as the ratio of the covariance between the output and the input feature to the variance of the input feature. Note that if the output were independent of the input feature, the true covariance would be zero and our linear model would therefore be a horizontal line, that is, a line with zero slope. In practice, even when two variables are statistically independent, we will still typically see a small sample covariance because of random variation in the data; thus, if we were to train a linear regression model to describe their relationship, the estimated slope would generally be nonzero. Later, we'll see how significance tests can be used to detect features we should not include in our models.
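As a sketch of these calculations, the two estimates can be computed by hand from the simulated data. The names b0, b1, z, and b1_noise are illustrative, and z is a hypothetical variable generated independently of x1 to show the small nonzero slope described above.

```r
# Recreate the simulated data frame
set.seed(5427395)
nObs <- 100
x1 <- runif(nObs, 5, 25)
e <- rnorm(nObs, mean = 0, sd = 2.0)
y <- 1.67 * x1 - 2.93 + e
df <- data.frame(y, x1)

# Slope: covariance of output and input over the variance of the input.
# cov() and var() both divide by n - 1, so the factor cancels in the ratio.
b1 <- cov(df$x1, df$y) / var(df$x1)

# Intercept: makes the fitted line pass through the point of means
b0 <- mean(df$y) - b1 * mean(df$x1)

c(intercept = b0, slope = b1)

# An output generated independently of x1 still gives a small,
# nonzero sample slope
z <- rnorm(nObs)
b1_noise <- cov(df$x1, z) / var(df$x1)
b1_noise
```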

To implement linear regression in R, it is not necessary to perform these calculations as R provides us with the lm() function, which builds a linear regression model for us. The following code sample uses the df data frame we created previously and calculates the regression coefficients:

> myfit <- lm(y~x1, df)
> myfit

Call:
lm(formula = y ~ x1, data = df)

Coefficients:
(Intercept)           x1  
     -2.380        1.641  

In the first line, we see that a call to the lm() function first specifies a formula and then the data parameter, which in our case is our data frame. For simple linear regression, the formula consists of the name of the output variable, followed by a tilde (~) and then the name of the single input feature. We'll see how to specify more complex formulas when we look at multiple linear regression further along in this chapter. Finally, the output shows us the values of the two regression coefficients. Note that the β0 coefficient is labeled as the intercept, and the β1 coefficient is labeled by the name of the corresponding feature (in this case, x1) in the equation of the linear model. With the estimates shown, the fitted model is:

$$\hat{y} = 1.641\,x_1 - 2.380$$
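Once the model is fitted, the coefficients can also be pulled out programmatically rather than read off the printed output. The sketch below uses coef() and predict(), both standard R functions for lm objects; the input value of 10 in new_obs is a hypothetical value chosen for illustration.

```r
# Recreate the simulated data frame and fit the model
set.seed(5427395)
nObs <- 100
x1 <- runif(nObs, 5, 25)
e <- rnorm(nObs, mean = 0, sd = 2.0)
y <- 1.67 * x1 - 2.93 + e
df <- data.frame(y, x1)
myfit <- lm(y ~ x1, data = df)

# Named vector of estimates: "(Intercept)" is beta0 and "x1" is beta1
betas <- coef(myfit)
betas

# The slope agrees with the covariance-over-variance formula
manual_slope <- cov(df$x1, df$y) / var(df$x1)
all.equal(unname(betas["x1"]), manual_slope)

# Predicted output for a new, hypothetical input value
new_obs <- data.frame(x1 = 10)
predict(myfit, newdata = new_obs)
```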

The following graph shows the population line and the estimated line on the same plot:

[Figure: population line and estimated line on the same plot]
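A plot of this kind can be drawn with base R graphics along the following lines. This is only a sketch; the line types, axis labels, and legend placement are our own choices and are not necessarily those used for the original graph.

```r
# Recreate the simulated data frame and fit the model
set.seed(5427395)
nObs <- 100
x1 <- runif(nObs, 5, 25)
e <- rnorm(nObs, mean = 0, sd = 2.0)
y <- 1.67 * x1 - 2.93 + e
df <- data.frame(y, x1)
myfit <- lm(y ~ x1, data = df)

# Scatter plot of the simulated observations
plot(df$x1, df$y, xlab = "x1", ylab = "y")

# Population line: the true intercept and slope used in the simulation
abline(a = -2.93, b = 1.67, lty = 1)

# Estimated line: abline() reads the coefficients from the lm object
abline(myfit, lty = 2)

legend("topleft", legend = c("Population line", "Estimated line"),
       lty = c(1, 2))
```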

As we can see, the two lines are so close to each other that they are barely distinguishable, showing that the model has estimated the true population line very closely. From Chapter 1, Gearing Up for Predictive Modeling, we know that we can use the mean squared error to formalize how closely our model matches our dataset, as well as how closely it would match an analogous test set. We'll examine this as well as several other metrics of model performance and quality in this chapter, but first we'll generalize our regression model to deal with more than one input feature.
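As a brief sketch of the mean squared error on the training data, we can compare the fitted line with the population line. The names mse_fit and mse_pop are illustrative. Because least squares minimizes the sum of squared residuals on the data it was trained on, the fitted line can never have a higher training error than the population line, even though the population line is the true model.

```r
# Recreate the simulated data frame and fit the model
set.seed(5427395)
nObs <- 100
x1 <- runif(nObs, 5, 25)
e <- rnorm(nObs, mean = 0, sd = 2.0)
y <- 1.67 * x1 - 2.93 + e
df <- data.frame(y, x1)
myfit <- lm(y ~ x1, data = df)

# Mean squared error of the fitted line on the training data
mse_fit <- mean(residuals(myfit)^2)

# Mean squared error of the population line on the same data
mse_pop <- mean((df$y - (1.67 * df$x1 - 2.93))^2)

c(fitted = mse_fit, population = mse_pop)
```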
