Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Multiple linear regression

Whenever we have more than one input feature and want to build a linear regression model, we are in the realm of multiple linear regression. The general equation for a multiple linear regression model with k input features is:

Our assumptions about the model and about the error component ε remain the same as with simple linear regression, remembering that as we now have more than one input feature, we assume that these are independent of each other. Instead of using simulated data to demonstrate multiple linear regression, we will analyze two real-world data sets.

Predicting CPU performance

Our first real-world data set was presented by the researchers Dennis F. Kibler, David W. Aha, and Marc K. Albert in a 1989 paper titled Instance-based prediction of real-valued attributes and published in Journal of Computational Intelligence. The data contain the characteristics of different CPU models, such as the cycle time and the amount of cache memory. When deciding between processors, we would like to take all of these things into account, but ideally, we'd like to compare processors on a single numerical scale. For this reason, we often develop programs to benchmark the relative performance of a CPU. Our data set also comes with the published relative performance of our CPUs and our objective will be to use the available CPU characteristics as features to predict this. The data set can be obtained online from the UCI Machine Learning Repository via this link: http://archive.ics.uci.edu/ml/datasets/Computer+Hardware.

Tip

The UCI Machine Learning Repository is a wonderful online resource that hosts a large number of data sets, many of which are often cited by authors of books and tutorials. It is well worth the effort to familiarize yourself with this website and its data sets. A very good way to learn predictive analytics is to practice using the techniques you learn in this book on different data sets, and the UCI repository provides many of these for exactly this purpose.

The machine.data file contains all our data in a comma-separated format, with one line per CPU model. We'll import this in R and label all the columns. Note that there are 10 columns in total, but we don't need the first two for our analysis, as these are just the brand and model name of the CPU. Similarly, the final column is a predicted estimate of the relative performance that was produced by the researchers themselves; our actual output variable, PRP, is in column 9. We'll store the data that we need in a data frame called machine:

> machine <- read.csv("machine.data", header = F)
> names(machine) <- c("VENDOR", "MODEL", "MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "PRP", "ERP")
> machine <- machine[, 3:9]
> head(machine, n = 3)
  MYCT MMIN  MMAX CACH CHMIN CHMAX PRP
1  125  256  6000  256    16   128 198
2   29 8000 32000   32     8    32 269
3   29 8000 32000   32     8    32 220

The data set also comes with the definition of the data columns:

Column name	Definition
`MYCT`	The machine cycle time in nanoseconds
`MMIN`	The minimum main memory in kilobytes
`MMAX`	The maximum main memory in kilobytes
`CACH`	The cache memory in kilobytes
`CHMIN`	The minimum channels in units
`CHMAX`	The maximum channels in units
`PRP`	The published relative performance (our output variable)

The data set contains no missing values, so no observations need to be removed or modified. One thing that we'll notice is that we only have roughly 200 data points, which is generally considered a very small sample. Nonetheless, we will proceed with splitting our data into a training set and a test set, with an 85-15 split, as follows:

> library(caret)
> set.seed(4352345)
> machine_sampling_vector <- createDataPartition(machine$PRP, p = 0.85, list = FALSE)
> machine_train <- machine[machine_sampling_vector,]
> machine_train_features <- machine[, 1:6]
> machine_train_labels <- machine$PRP[machine_sampling_vector]
> machine_test <- machine[-machine_sampling_vector,]
> machine_test_labels <- machine$PRP[-machine_sampling_vector]

Now that we have our data set up and running, we'd usually want to investigate further and check whether some of our assumptions for linear regression are valid. For example, we would like to know whether we have any highly correlated features. To do this, we can construct a correlation matrix with the cor()function and use the findCorrelation() function from the caret package to get suggestions for which features to remove:

> machine_correlations <- cor(machine_train_features)
> findCorrelation(machine_correlations)
integer(0)
> findCorrelation(machine_correlations, cutoff = 0.75)
[1] 3
> cor(machine_train$MMIN, machine_train$MMAX)
[1] 0.7679307

Using the default cutoff of 0.9 for a high degree of correlation, we found that none of our features should be removed. When we reduce this cutoff to 0.75, we see that caret recommends that we remove the third feature (MMAX). As the final line of preceding code shows, the degree of correlation between this feature and MMIN is 0.768. While the value is not very high, it is still high enough to cause us a certain degree of concern that this will affect our model. Intuitively, of course, if we look at the definitions of our input features, we will certainly tend to expect that a model with a relatively high value for the minimum main memory will also be likely to have a relatively high value for the maximum main memory. Linear regression can sometimes still give us a good model with correlated variables, but we would expect to get better results if our variables were uncorrelated. For now, we've decided to keep all our features for this data set.

Predicting the price of used cars

Our second data set is in the cars data frame included in the caret package and was collected by Shonda Kuiper in 2008 from the Kelly Blue Book website, www.kbb.com. This is an online resource to obtain reliable prices for used cars. The data set is comprised of 804 GM cars, all with the model year 2005. It includes a number of car attributes, such as the mileage and engine size as well as the suggested selling price. Many features are binary indicator variables, such as the Buick feature, which represents whether a particular car's make was Buick. The cars were all in excellent condition and less than one-year old when priced, so the car condition is not included as a feature. Our objective for this data set is to build a model that will predict the selling price of a car using the values of these attributes. The definitions of the features are as follows:

Column name	Definition
`Price`	The suggested retail price in USD (our output variable)
`Mileage`	The number of miles the car has been driven
`Cylinder`	The number of cylinders in the car's engine
`Doors`	The number of doors
`Cruise`	The indicator variable representing whether the car has cruise control
`Sound`	The indicator variable representing whether the car has upgraded speakers
`Leather`	The indicator variable representing whether the car has leather seats
`Buick`	The indicator variable representing whether the make of the car is Buick
`Cadillac`	The indicator variable representing whether the make of the car is Cadillac
`Chevy`	The indicator variable representing whether the make of the car is Chevy
`Pontiac`	The indicator variable representing whether the make of the car is Pontiac
`Saab`	The indicator variable representing whether the make of the car is Saab
`Saturn`	The indicator variable representing whether the make of the car is Saturn
`convertible`	The indicator variable representing whether the type of the car is a convertible
`coupe`	The indicator variable representing whether the type of the car is a coupe
`hatchback`	The indicator variable representing whether the type of the car is a hatchback
`sedan`	The indicator variable representing whether the type of the car is a sedan
`wagon`	The indicator variable representing whether the type of the car is a wagon

As with the machine data set, we should investigate the correlation between input features:

> library(caret)
> data(cars)
> cars_cor <- cor(cars_train_features)
> findCorrelation(cars_cor)
integer(0)
> findCorrelation(cars_cor, cutoff = 0.75)
[1] 3
> cor(cars$Doors,cars$coupe)
[1] -0.8254435
> table(cars$coupe,cars$Doors)
   
      2   4
  0  50 614
  1 140   0

Just as with the machine data set, we have a correlation that shows up when we set cutoff to 0.75 in the findCorrelation() function of caret. By directly examining the correlation matrix, we found that there is a relatively high degree of correlation between the Doors feature and the coupe feature. By cross-tabulating these two, we can see why this is the case. If we know that the type of a car is a coupe, then the number of doors is always two. If the car is not a coupe, then it most likely has four doors.

Another problematic aspect of the cars data is that some features are exact linear combinations of other features. This is discovered using the findLinearCombos() function in the caret package:

> findLinearCombos(cars)
$linearCombos
$linearCombos[[1]]
[1] 15  4  8  9 10 11 12 13 14

$linearCombos[[2]]
 [1] 18  4  8  9 10 11 12 13 16 17


$remove
[1] 15 18

Here, we are advised to drop the coupe and wagon columns, which are the 15^th and 18^th features, respectively, because they are exact linear combinations of other features. We will remove both of these from our data frame, thus also eliminating the correlation problem we saw previously.

Next, we'll split our data into training and test sets:

> cars <- cars[,c(-15, -18)]
> set.seed(232455)
> cars_sampling_vector <- createDataPartition(cars$Price, p = 
  0.85, list = FALSE)
> cars_train <- cars[cars_sampling_vector,]
> cars_train_features <- cars[,-1]
> cars_train_labels <- cars$Price[cars_sampling_vector]
> cars_test <- cars[-cars_sampling_vector,]
> cars_test_labels <- cars$Price[-cars_sampling_vector]

Now that we have our data ready, we'll build some models.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Multiple linear regression

Create new playlist

Sign In

Sign Up

Multiple linear regression

Predicting CPU performance

Tip

Predicting the price of used cars

Table of Contents for
Multiple linear regression