Chapter 4. Prediction with R and Tableau Using Regression

In this chapter, we will consider regression from an analytics point of view. We will look at the predictive capabilities and performance of regression algorithms, which is a great starting point for an analytics program. By the end of this chapter, you'll have experience with simple linear regression, multiple linear regression, and k-Nearest Neighbors regression, along with a business-oriented understanding of their actual use cases.

We will focus on preparing, exploring, and modeling the data in R, combined with the visualization power of Tableau in order to express the findings in the data.

Some interesting datasets come from the UCI machine learning datasets, which can be obtained from the following link: https://archive.ics.uci.edu/ml/datasets.html.

During the course of this chapter, we will use datasets that are obtained from the UCI website, in addition to default R datasets.

Getting started with regression

Regression predicts the conditional expected value of a dependent variable from one or more independent variables. A dependent variable is the variable that we want to predict. Examples of a dependent variable could be a number such as price, sales, or weight. An independent variable is a characteristic, or feature, that helps to determine the dependent variable. So, for example, the independent variable of height could help to determine the dependent variable of weight.

Regression analysis can be used in forecasting, time series modeling, and cause and effect relationships.

Simple linear regression

R can help us to build prediction stories with Tableau. Linear regression is a great starting place when you want to predict a number, such as profit, cost, or sales. In simple linear regression, there is only one independent variable x, which predicts a dependent value, y.

Simple linear regression is usually expressed as a line whose slope helps us to make predictions. So, if sales = x and profit = y, what is the slope that allows us to make the prediction? We will create the calculation in R, and then we will repeat it in Tableau. We can also color-code the points so that we can see what is above and what is below the line.

Using lm() to conduct a simple linear regression

What is linear regression? Linear regression has the objective of finding a model that fits a regression line through the data well, whilst minimizing the discrepancy, or error, between the data and the regression line. A good regression model accounts for much of the variability in the data, and its regression line fits the data closely because it minimizes that error. The errors are also known as the residuals, and their overall size is measured as the sum of squared errors, which is sometimes abbreviated to SSE. It is calculated from the deviations of the model's predicted values from the actual empirical values of the data. In practice, a small error amount, or SSE, indicates that the data is a close match to the model.

Note

In order to do regression, we need to measure the y distance of each of the points from a line of best fit and then sum the error margin (that is, the distance to the line).

We are trying to predict the line of best fit between one or many variables from a scatter plot of points of data. To find the line of best fit, we need to calculate two things about the line, both of which we can obtain with the lm() function:

  • We need to calculate the slope of the line, m
  • We also need to calculate the intercept with the y axis, c

So we begin with the equation of the line:

y = mx + c

To get the line, we use the concept of Ordinary Least Squares (OLS). This means that we minimize the sum of the squares of the y-distances between the points and the line. Rearranging this minimization gives us the slope (beta, or m) in terms of the number of points n and the x and y values. The resulting line minimizes the squared error between itself and the points, making it the best linear predictor for all of the points in the training set, and our best guess for future feature vectors.
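As a small sketch of the idea, we can compute m and c directly from the closed-form OLS formulas and compare them with what lm() produces. This illustration assumes the built-in women dataset, which we use later in this chapter:

```r
# OLS by hand for the line y = mx + c, using the built-in women dataset
# (height as the independent variable x, weight as the dependent variable y)
x <- women$height
y <- women$weight

# slope: m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# intercept: c = mean(y) - m * mean(x)
c0 <- mean(y) - m * mean(x)

round(c(slope = m, intercept = c0), 5)
# compare with the coefficients from lm(weight ~ height, data = women)
```

The lm() function performs this same least-squares calculation for us, so the two results should agree.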

Let's start with a simple example in R, where we predict women's weight from their height. If we were articulating this question per Microsoft's Team Data Science Process, we would be stating this as a business question during the business understanding phase. How can we come up with a model that helps us to predict what the women's weight is going to be, dependent on their height?

Using this business question as a basis for further investigation, how do we come up with a model from the data, which we could then use for further analysis? Simple linear regression involves two variables: an independent variable, also known as the predictor variable, and a dependent variable. With only one variable, and no other information, the best prediction is the mean of the sample itself. In other words, when all we have is one variable, the mean is the best predictor of any one observation. The first step is to collect a random sample of data. In R, we are lucky to have sample data that we can use.

To explore linear regression, we will use the women dataset, which is installed by default with R. Until we bring height into the model, the variability of the weights can only be described by the weights themselves, because that is all we have.
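To see why the mean is the best single-number predictor, we can compare the sum of squared errors (SSE) for a few candidate guesses. This is a minimal sketch using the weights from the women dataset; the offsets of 5 are arbitrary:

```r
# With no predictors, the value that minimizes the sum of squared
# errors is the sample mean of the dependent variable.
w <- women$weight
sse <- function(guess) sum((w - guess)^2)

sse(mean(w) - 5)  # larger
sse(mean(w))      # the smallest SSE of the three
sse(mean(w) + 5)  # larger
```

Any guess other than the mean produces a larger SSE, which is exactly the sense in which the mean is the "best" predictor when we have nothing else to go on.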

To conduct the regression, we will use the lm function, which appears as follows:

model <- lm(y ~ x, data=mydata)

To see the women dataset, open up RStudio. When we type a variable name at the console, R prints the contents of that variable. In this example, typing the variable name women will give us the data itself:

> women

The women's heights and weights are printed out to the console, as we can see next:

(Screenshot: the women dataset printed at the R console)

We can visualize the data quite simply in R, using the plot(women) command. The plot command provides a quick and easy way of visualizing the data. Our objective here is simply to see the relationship of the data.

The results appear as follows:

(Screenshot: scatter plot produced by plot(women))

Now that we can see the relationship of the data, we can use the summary command to explore the data further:

summary(women)

This will give us the results, which are given here as follows:

(Screenshot: summary statistics for the women dataset)

Let's look at the results in closer detail:

(Screenshot: the summary output in closer detail)

Next, we can create a model that will use the lm function to create a linear regression model of the data. We will assign the results to a model called linearregressionmodel, as follows:

linearregressionmodel <- lm(weight ~ height, data=women)

What does the model produce? We can use the summary command again, and this will provide some descriptive statistics about the lm model that we have generated. One of the nice, understated features of R is that a variable can store any kind of object. Here we have our variable, linearregressionmodel – note that one name is storing a whole fitted model!

summary(linearregressionmodel)

How does this appear in the R interface? Here is an example:

(Screenshot: summary output for linearregressionmodel)

What do these numbers mean? Let's take a closer look at some of the key numbers.

Coefficients

What are coefficients? The coefficients tell us the expected change in y for each one-unit change in x. Here is how it looks in R:

(Screenshot: the coefficients section of the model summary)

We can see that the values of the coefficients are given as -87.51667 and 3.45000. The first value is the intercept, and the second is the slope: a one-unit change in x, the height, causes a 3.45000-unit change in the expected value of y, the weight.

If we were to write this as an equation, the general model could be written as follows:

y = a + b x

This means that our prediction equation for the linearregressionmodel model is as follows:

weight = -87.52 + (3.45 * height)
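We can check this equation by asking the fitted model for predictions with predict(). The heights 60, 65, and 70 below are just example values, not part of the dataset workflow:

```r
# Predict weight at a few hypothetical heights using the fitted model
fit <- lm(weight ~ height, data = women)
new_heights <- data.frame(height = c(60, 65, 70))
predict(fit, newdata = new_heights)
# each prediction equals -87.51667 + 3.45 * height
```

Note that predict() expects the new data as a data frame whose column name matches the predictor used in the model formula.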

We can get this information another way in R. We can see the coefficients by simply using the variable name linearregressionmodel, which outputs the result as follows:

(Screenshot: coefficients printed by typing linearregressionmodel)

Residual standard error

In the output, the residual standard error is 1.525. It estimates the typical distance of the observed values from the regression line, in the units of the dependent variable (here, weight).
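As a check, the residual standard error can be recomputed by hand from the residuals. This sketch refits the same model used earlier in the chapter:

```r
# Residual standard error = sqrt(SSE / residual degrees of freedom),
# where df = n - 2 for simple linear regression (15 - 2 = 13 here)
linearregressionmodel <- lm(weight ~ height, data = women)
sse <- sum(residuals(linearregressionmodel)^2)
rse <- sqrt(sse / df.residual(linearregressionmodel))
round(rse, 3)  # 1.525
```

This matches the value reported by summary(linearregressionmodel).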
