Getting started with multiple regression

Simple linear regression summarizes the relationship between an outcome and a single explanatory variable. In real life, however, things are rarely that simple! We are going to use the adult dataset from UCI, which contains census data and is commonly used to predict whether an individual earns above or below fifty thousand dollars a year. The idea is to build a model from the observed characteristics of each adult and use it to predict which side of that threshold their income falls on.

Multiple regression builds a model of the data that can then be used to make predictions. Used as a scoring model for a binary outcome, it produces a value between 0 and 1, which can be interpreted as a probability.

One way to picture this is to imagine modeling the behavior of a coin being tossed in the air. Will the coin land heads or tails? The outcome does not depend on just one thing; in reality it depends on several variables, and the model's score expresses the probability of the result, for example the probability that the coin lands heads. That probability is the predicted number, produced as an additive function of the inputs. The resulting model also gives an indication of the relative impact of each input variable on the output.
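
To make this concrete, here is a minimal R sketch of a score that is built as an additive function of two inputs and then mapped to a probability between 0 and 1. The coefficient values are invented purely for illustration; in the exercise later in this chapter they are estimated from the data by the glm function.

    # Minimal sketch: a score that is an additive function of two inputs,
    # mapped onto the 0-1 range with the logistic function.
    # The coefficient values are made up purely for illustration.
    b0 <- -1.5   # intercept
    b1 <-  0.8   # weight for input x1
    b2 <-  0.3   # weight for input x2

    x1 <- 2.0
    x2 <- 1.0

    linear.score <- b0 + b1 * x1 + b2 * x2   # additive combination of the inputs
    probability  <- plogis(linear.score)     # value between 0 and 1
    probability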

Building our multiple regression model

The first thing we need to do in the model building process is to select a dataset—this should be something that contains a fairly large number of samples (observations). We have selected some datasets here as examples.

Once the dataset has been selected, we need to make sure that it can actually be used to answer the business question we are asking. Since we are making predictions, the training set should have the same shape (the same columns) as the test dataset. A feature is a column that can be used to predict a value of interest.

Once we have built our model, we can test its predictions, see how accurate they are, and score how well the model performs. At the end of this process, we can evaluate whether the model is a good fit or not. That evaluation may lead us to change how we prepare the data, or to amend the algorithm we use, in order to improve the model.

When we train a model, we typically use only the larger portion of the dataset. The remainder is held back so that we can check whether the model predicts values accurately; this held-back portion is the test dataset.
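
As a minimal sketch of this idea, the following snippet splits a data frame into a training portion and a held-back test portion. The built-in mtcars data frame and the 70/30 split are assumptions used purely for illustration; in the exercise below, the UCI adult data already comes as separate training and test files.

    # Minimal sketch: hold back part of a dataset for testing.
    set.seed(42)                                       # make the random split repeatable
    n <- nrow(mtcars)
    train.rows <- sample(1:n, size = round(0.7 * n))   # pick roughly 70% of the rows

    train.part <- mtcars[train.rows, ]                 # larger portion, used for training
    test.part  <- mtcars[-train.rows, ]                # remainder, held back for testing

    nrow(train.part)
    nrow(test.part)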

Supervised learning is distinct from unsupervised learning, which we'll look at later on in this book. In supervised learning, we try to predict either a continuous variable (a number, such as a predicted income for each adult) or a discrete class (such as whether the earning level is above or below fifty thousand dollars). In order to do this, we need two things:

  • The first is features—these will need to be in a form that our machine learning algorithm can process. The mathematical term for this is a vector—so we refer to this as a feature vector.
  • We also need a set of labels. These are generally in text form, but our algorithm needs numbers, so as part of preparing the input we may have to convert them into a set of numbers that it can understand; a minimal sketch of this conversion follows this list.
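
The following minimal sketch shows one way text labels can be turned into numbers in R, using a factor. The label values are invented for illustration; in the exercise below, the same idea is applied to the income column of the adult dataset.

    # Minimal sketch: convert text labels into numeric codes via a factor.
    labels <- c("<=50K", ">50K", "<=50K", ">50K", ">50K")

    label.factor <- factor(labels)      # text labels become a factor with two levels
    levels(label.factor)                # "<=50K" ">50K"

    y <- as.numeric(label.factor) - 1   # 0 for the first level, 1 for the second
    y                                   # 0 1 0 1 1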

Once we have our feature vectors and labels, we can feed them into an algorithm that will attempt to build a model from the input data.

The algorithm is trained on part of our input dataset, producing what we can now refer to as the trained model. It is important to understand that the model can be retrained continually as we discover new things and obtain new data; this feedback cycle is part of what makes machine learning so powerful.

Is the model good or bad? How do we evaluate a regression model?

Confusion matrix

One way of doing this is to build a confusion matrix from the results. A confusion matrix is a simple and precise way of summarizing them: a table that shows the actual classifications against the predicted ones.

It is built with respect to a particular class; in this illustrative example, Iris Versicolor.

Starting from the top of the matrix, we can derive the following counts (assembled into a small table in the sketch after this list):

  • 12 true positives—this means we accurately predicted Iris Versicolor 12 times
  • Three false positives—this means that we labeled Iris Setosa and Iris Virginica incorrectly as Iris Versicolor three times
  • Six false negatives—these are Iris Versicolor samples that were incorrectly marked as one of the other two types
  • Nine true negatives—these are samples of the other classes that were correctly classified as non-Iris Versicolor
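
Putting those four counts together, a minimal sketch of the resulting two-by-two table in R could look like the following; the layout, with predicted classes as rows and actual classes as columns, is an assumption for illustration.

    # Minimal sketch: the confusion matrix implied by the counts above.
    confusion <- matrix(c(12, 3,   # predicted Versicolor: 12 correct, 3 actually other types
                          6,  9),  # predicted other: 6 missed Versicolor, 9 correct
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Predicted = c("Versicolor", "Other"),
                                        Actual    = c("Versicolor", "Other")))
    confusion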

Next, we will work through this scenario in R, and then visualize the results in Tableau.

Prerequisites

The following items are prerequisites for the exercise:

  • R, to run the modeling code
  • Tableau, to visualize the results

Instructions

In order to proceed, you will need to download the data as follows:

  1. Download the CSV file containing the adult UCI data.
  2. You will need to do this for both the training data and the test data.
  3. For the test data, the link is here: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test.
  4. For the training data, the link is here: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data.
  5. Load the CSV files into R, and assign them to the adult.training and adult.test variables. Your code could look like the following segment:
    # Note: the raw UCI files have no header row; you may need header=FALSE when reading them directly
    adult.training <- read.csv("C:/Users/jenst/Downloads/adult.csv")
    adult.test <- read.csv("C:/Users/jenst/Downloads/adulttest.csv")
  6. Let's create a binary response variable called y, which will be our dependent variable. Its length matches the number of records in the training dataset:
    N.obs <- dim(adult.training)[1]    # number of records in the training data
    y <- rep(0, N.obs)
    
    # Assumes the income column is a factor named 'class'; adjust the name to your data
    class <- adult.training$class
    y[class==levels(class)[2]] <- 1
    
  7. Next, we will look at the columns in the dataset, using the summary command:
    summary(adult.training)
    
  8. We can use the names command to obtain the column names:
    names(adult.training)
    
  9. We can also view some of the data in R, using the head command:
    head(adult.training)
    
  10. Now, we will use the glm function in order to create a data model, which we will assign to the adultdatamodel variable:
    ## GLM fit: logistic regression of y on the selected predictors
    adultdatamodel <- glm(y ~ age + educationnum + hoursperweek + workclass + maritalstatus + occupation + relationship + race + sex, data=adult.training, family=binomial("logit"))
    
  11. Once we have obtained the result, we need to check the coefficients. We will store them in the resultstable variable and sort them by p-value:
    resultstable <- summary(adultdatamodel)$coefficients
    sorter <- order(resultstable[,4])    # column 4 holds the p-values
    resultstable <- resultstable[sorter,]
    
  12. Now, we can move on to the test data and generate predictions, which are assigned to the pred variable:
    pred <- predict(adultdatamodel, adult.test, type="response")
    N.test <- length(pred)
    
  13. Next, we will use 0.5 as the threshold for classifying a prediction as positive:
    y.hat <- rep(0, N.test)
    y.hat[pred>=0.5] <- 1
    
  14. We can summarize the results in a table, assigned to the confusion.table variable, in order to compare the true outcomes against the predicted outcomes:
    ## Get the true outcome of the test data (assumes the income column is named 'class' here too)
    outcome <- levels(adult.test$class)
    y.test <- rep(0, N.test)
    y.test[adult.test$class==outcome[2]] <- 1
    
    confusion.table <- table(y.hat, y.test)
    colnames(confusion.table) <- paste("Actual", outcome)
    rownames(confusion.table) <- paste("Predicted", outcome)
    

Once we have our confusion table, we can write it out to a CSV file so that we can visualize it in Tableau.
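
As a minimal sketch of that last step, the following snippet writes the confusion table to a CSV file and also derives an overall accuracy figure from it. The output file name and path are assumptions, so adjust them to your own machine.

    # Minimal sketch: export the confusion table for Tableau and compute accuracy.
    write.csv(as.data.frame(confusion.table),
              "C:/Users/jenst/Downloads/confusion_table.csv",   # assumed output path
              row.names = FALSE)

    # Overall accuracy: correctly classified cases divided by all cases
    accuracy <- sum(diag(confusion.table)) / sum(confusion.table)
    accuracy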
