Simple linear regression summarizes the relationship between an outcome and a single explanatory variable. However, in real life, things are not always so simple! We are going to use the adult dataset from the UCI Machine Learning Repository, which contains census data, with a view to identifying whether adults earn above or below fifty thousand dollars a year. The idea is that we can build a model from observations about adults and use it to predict whether an individual earns above or below that threshold.
Multiple regression builds a model of the data, which is then used to make predictions. The form we use here, logistic regression, is a scoring model: it predicts a value between 0 and 1, which makes it well suited to predicting probabilities.
It's possible to imagine this kind of model as describing the behavior of a coin being tossed in the air. How will the coin land: heads or tails? The outcome does not depend on just one thing; it depends on several variables. The model expresses the result as a score, which is the probability of the outcome, computed as an additive function of the inputs. The resulting model also gives an indication of the relative impact of each input variable on the output.
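As a minimal sketch of this idea, the following uses made-up coefficients (not fitted from any data) to show an additive combination of inputs passed through the logistic function, which squashes the score into the 0 to 1 range:

```r
# Hypothetical coefficients, for illustration only
intercept <- -4.0
b.age     <- 0.05
b.hours   <- 0.04

# A linear, additive combination of the inputs...
linear.score <- intercept + b.age * 40 + b.hours * 45

# ...passed through the logistic function gives a probability in (0, 1)
probability <- plogis(linear.score)  # equivalent to 1 / (1 + exp(-linear.score))
probability
```

The relative sizes of the coefficients hint at the relative impact of each input on the output, which is one of the attractions of this kind of model.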
The first thing we need to do in the model building process is to select a dataset—this should be something that contains a fairly large number of samples (observations). We have selected some datasets here as examples.
Once the dataset has been selected, we want to ensure that we can use it to answer the business question we are asking. We are trying to make predictions on the data, so our training set should have the same shape as the test dataset. A feature is an attribute of an observation that can be used to predict a value in an experiment.
Once we have built our model, we can test its predictions, see how accurate they are, and then rank the efficiency of the model. At the end of this process, we can evaluate the model and determine whether it is a good fit or not. Ultimately, this could mean changing the way we prepare our data, or perhaps amending the algorithm we use, in order to optimize the efficiency of the model.
When we trained our model, we used only the greater proportion of the dataset. The remainder of the dataset can then be used to check whether we can accurately predict a value; this is the test dataset.
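For instance, a random 80/20 split can be sketched as follows. The proportion and the toy data frame are assumptions for illustration; in this chapter's exercise the split is already done for us, since the training and test data ship as two separate files:

```r
set.seed(42)  # for reproducibility

# Toy stand-in data frame with 100 observations
all.data <- data.frame(x = 1:100, y = rnorm(100))

# Hold out 20% of the rows at random as the test set
train.rows   <- sample(nrow(all.data), size = 0.8 * nrow(all.data))
training.set <- all.data[train.rows, ]
test.set     <- all.data[-train.rows, ]

nrow(training.set)  # 80 observations for training
nrow(test.set)      # 20 observations held back for testing
```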
Supervised learning is distinct from unsupervised learning, which we'll look at later on in this book. In the domain of supervised learning, we try to predict either a continuous variable (a number, for example, a predicted income in dollars) or a discrete class of output (for example, whether an earning level is above or below fifty thousand dollars). In order to do this task, we need two things: a set of feature vectors, and the corresponding labels.
Once we have our feature vectors and labels, we can feed them into an algorithm that will attempt to build a model from the input data.
The algorithm is trained on part of our input dataset, the training set, and we can then refer to the trained model. It is important to understand that the model can be continually retrained as we discover new things and get new data; machine learning is so powerful partly because of this feedback cycle.
Is the model good or bad? How do we evaluate a regression model?
One way of doing this is to build a confusion matrix from the results. A confusion matrix is a very simple and precise way of summarizing them: it is a simple table that shows the actual classifications against the predicted ones.
It will be built for a particular class; in this case, the class of adults earning above fifty thousand dollars a year.
Starting from the top, we derive the counts of true positives, false positives, false negatives, and true negatives.
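A minimal sketch, using made-up actual and predicted labels, shows how R's table function produces this layout:

```r
# Hypothetical actual and predicted class labels for six observations
actual    <- c(1, 0, 1, 1, 0, 0)
predicted <- c(1, 0, 0, 1, 0, 1)

# Rows hold the predicted class, columns the actual class
confusion <- table(Predicted = predicted, Actual = actual)
confusion

# The diagonal cells count the correct predictions;
# the off-diagonal cells count the errors
```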
Next, we will work through an example of this scenario in R, and then visualize the results in Tableau.
The following items are prerequisites for the exercise:
In order to proceed, you will need to download the data and load it as follows:
adult.training <- read.csv("C:/Users/jenst/Downloads/adult.csv")
adult.test <- read.csv("C:/Users/jenst/Downloads/adulttest.csv")
Next, we create a binary response vector, y, which will be our dependent variable. It is sized according to the number of records in the training dataset:

N.obs <- dim(adult.training)[1]
y <- rep(0, N.obs)
y[adult.training$class == levels(adult.training$class)[2]] <- 1
We can inspect the training data with the summary command:

summary(adult.training)
We can use the names command to obtain the column names:

names(adult.training)
We can view the first few rows with the head command:

head(adult.training)
Now we use the glm function to create a data model, which we will assign to the adultdatamodel variable:

## GLM fit
adultdatamodel <- glm(y ~ age + educationnum + hoursperweek + workclass + maritalstatus + occupation + relationship + race + sex, family = binomial("logit"), data = adult.training)
resultstable <- summary(adultdatamodel)$coefficients
sorter <- order(resultstable[, 4])  # sort the coefficients by p-value (fourth column)
resultstable <- resultstable[sorter, ]
Next, we compute predicted probabilities for the test data and store them in the pred variable:

pred <- predict(adultdatamodel, adult.test, type = "response")
N.test <- length(pred)

y.hat <- rep(0, N.test)
y.hat[pred >= 0.5] <- 1
## Get the true outcome of the test data
outcome <- levels(adult.test$class)
y.test <- rep(0, N.test)
y.test[adult.test$class == outcome[2]] <- 1
confusion.table <- table(y.hat, y.test)
colnames(confusion.table) <- c(paste("Actual", outcome[1]), outcome[2])
rownames(confusion.table) <- c(paste("Predicted", outcome[1]), outcome[2])
Once we have our confusion table, we can write it out to a CSV file so that we can visualize it in Tableau.
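A sketch of the export step follows. The file name is a hypothetical choice, and a toy confusion table is rebuilt here so the snippet runs on its own; in the exercise you would pass the confusion.table object produced above:

```r
# Toy confusion table standing in for the one built in the exercise
y.hat  <- c(0, 1, 1, 0)
y.test <- c(0, 1, 0, 0)
confusion.table <- table(y.hat, y.test)

# as.data.frame flattens the table into rows of (y.hat, y.test, Freq),
# a shape Tableau can read as a text data source
write.csv(as.data.frame(confusion.table),
          file = "confusion_matrix.csv",
          row.names = FALSE)
```

In Tableau, connect to the CSV as a text file data source and place the predicted and actual classes on rows and columns, with the frequency as the measure.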