In this section, we will investigate how neural networks can be used to solve a real-world regression problem. Once again, we turn to the UCI Machine Learning Repository for our data set. We've chosen to try out the energy efficiency data set available at http://archive.ics.uci.edu/ml/datasets/Energy+efficiency. The prediction task is to use various building characteristics, such as surface area and roof area, in order to predict the energy efficiency of a building, which is expressed in the form of two different metrics—heating load and cooling load.
This is a good example for us to try out as we can demonstrate how neural networks can be used to predict two different outputs with a single network. The full attribute description of the data set is given in the following table:
Column name |
Type |
Definition |
---|---|---|
|
Numerical |
Relative compactness |
|
Numerical |
Surface area |
|
Numerical |
Wall area |
|
Numerical |
Roof area |
|
Numerical |
Overall height |
|
Numerical |
Building orientation (factor) |
|
Numerical |
Glazing area |
|
Numerical |
Glazing area distribution (factor) |
|
Numerical |
Heating load (first output) |
|
Numerical |
Cooling load (second output) |
The data was generated using a simulator called Ecotect. Each observation in the data set corresponds to a simulated building. All the buildings have the same volume, but other attributes that impact their energy efficiency, such as their glazing area, are modified.
The data on the website comes in Microsoft Excel format. To load this into R, we can use the R package xlsx
, which can read and understand Microsoft Excel files:
> library(xlsx) > eneff <- read.xlsx2("ENB2012_data.xlsx", sheetIndex = 1, colClasses = rep("numeric", 10)) > names(eneff) <- c("relCompactness", "surfArea", "wallArea", "roofArea", "height", "orientation", "glazArea", "glazAreaDist", "heatLoad", "coolLoad") > eneff <- eneff[complete.cases(eneff),]
The import adds a number of empty observations at the end of the data frame, so the last line removes these. Now, by referring to the paper in which the data set was presented, we discover that two of our attributes are actually factors. In order for our neural network to work with these, we will need to convert them into dummy variables. To do this, we will use the dummyVars()
function from the caret
package:
> library(caret) > eneff$orientation <- factor(eneff$orientation) > eneff$glazAreaDist <- factor(eneff$glazAreaDist) > dummies <- dummyVars(heatLoad + coolLoad ~ ., data = eneff) > eneff_data <- cbind(as.data.frame(predict(dummies, newdata = eneff)), eneff[,9:10]) > dim(eneff_data) [1] 768 18
The dummyVars()
function takes in a formula and a data frame. From these, it identifies the input features and performs dummy encoding on those that are factors in order to produce new binary columns. There are as many columns created for a factor as there are levels in that factor. Just as with the preProcess()
function that we've been using, we actually obtain the columns themselves after using the predict()
function. Next, we'll do an 80-20 split between the training and test data:
> set.seed(474576) > eneff_sampling_vector <- createDataPartition(eneff_data$heatLoad, p = 0.80, list = FALSE) > eneff_train <- eneff_data[eneff_sampling_vector, 1:16] > eneff_train_outputs <- eneff_data[eneff_sampling_vector, 17:18] > eneff_test <- eneff_data[-eneff_sampling_vector, 1:16] > eneff_test_outputs <- eneff_data[-eneff_sampling_vector, 17:18]
One of the most important preprocessing steps to perform when training neural networks is to scale input features and outputs. One good reason to perform input scaling is in order to avoid saturation which occurs when the optimization procedure reaches a point where the gradient of the error function is very small in absolute value. This is usually the result of very large or very small inputs to the nonlinear neuron activation functions. Saturation causes the optimization procedure to terminate, thinking we have converged.
Depending on the particular neural network implementation, for regression tasks it may also make sense to scale the outputs as some implementations of linear neurons are designed to produce an output in the interval [-1,1]. Scaling can also help convergence. Consequently, we will use caret
to scale all our data dimensions to the unit interval, noting that this has no effect on the binary columns produced earlier:
> eneff_pp <- preProcess(eneff_train, method = c("range")) > eneff_train_pp <- predict(eneff_pp, eneff_train) > eneff_test_pp <- predict(eneff_pp, eneff_test) > eneff_train_out_pp <- preProcess(eneff_train_outputs, method = c("range")) > eneff_train_outputs_pp <- predict(eneff_train_out_pp, eneff_train_outputs) > eneff_test_outputs_pp <- predict(eneff_train_out_pp, eneff_test_outputs)
Several different packages implement neural networks in R, each with their various merits and strengths. For this reason, it helps to be familiar with more than one package and in this chapter we will investigate three of these, the first being neuralnet
:
> library("neuralnet") > n <- names(eneff_data) > f <- as.formula(paste("heatLoad + coolLoad ~", paste(n[!n %in% c("heatLoad", "coolLoad")], collapse = " + "))) > eneff_model <- neuralnet(f, data = cbind(eneff_train_pp, eneff_train_outputs_pp), hidden = 10) > eneff_model Call: neuralnet(formula = f, data = cbind(eneff_train_pp, eneff_train_outputs_pp), hidden = 10) 1 repetition was calculated. Error Reached Threshold Steps 1 0.3339635783 0.009307995429 9998
The neuralnet()
function trains a neural network based on the information provided in its arguments. The first argument we provide is a formula and the format is similar to the formulae that we've seen with the lm()
and glm()
functions in previous chapters. One interesting difference here is that we have specified two outputs, heatLoad
and coolLoad
. Another difference is that currently we are unable to use the dot (.
) notation to imply that all the remaining columns in our data frame can be used as features, so we need to specify them explicitly.
Note that with the formula, we have effectively defined the input and output layers of the neural network and so what remains to be specified is the structure of the hidden layers. This is specified with the hidden
parameter, which either takes in a scalar for a single layer, or a vector of scalars that specify the number of hidden units in each layer, starting from the layer just after the input layer, and ending with the layer just before the output layer.
In the example we saw earlier, we've used a single layer with 10 nodes. We can actually visualize our neural network as the package provides us with the ability to plot the model directly (the numbered circles are dummy bias neurons):
The call to neuralnet()
also allows us to specify what type of activation function we would like to use for our neurons through the parameter string act.fct
. By default, this is set to the logistic activation function and so we have not changed this. Another very important parameter is linear.output
, which can be either TRUE
or FALSE
. This specifies whether we should apply the activation function to the neurons in the output layer. The default value of TRUE
that we used means that we do not apply the activation function and so we can observe a linear output. For regression type problems, this is what is appropriate. This is because, if we were to apply a logistic activation function our output would be bounded in the interval [0,1]. Finally, we can specify a differentiable error function through the err.fct
parameter to use as part of our optimization strategy. As we are doing regression, we use the default value of sse
, which corresponds to the sum of squared error.
As there is a random component in neural network training, namely the initialization of the weights, we may want to specify that we should retrain the same model a number of times in order for us to pick the best possible model that we get (using criteria such as the SSE to rank these). This can be done by specifying an integer value for the rep
parameter. Let's rewrite our original call to explicitly show the default values we are using:
> eneff_model <- neuralnet(f, data = cbind(eneff_train_pp, eneff_train_outputs_pp), hidden = 10, act.fct = "logistic", linear.output = TRUE, err.fct = "sse", rep = 1)
The model's output provides us with some information about the performance of the neural network and what is shown depends on its configuration. As we have specified the SSE as our error metric, the error shown is the SSE that was obtained. The threshold figure is just the value of the partial derivative of the error function when the model stopped training. Essentially, instead of terminating when the gradient is 0 exactly, we specify a very small value below which the error gradient needs to fall before the algorithm terminates. The default value for this is 0.01 and can be changed by supplying a number for the threshold parameter in the neuralnet()
function. Reducing this value will generally result in longer training times. The model output also shows us the number of training steps that were performed. Finally, if we had used the rep
parameter to repeat this process multiple times, we would see a row for each model trained. Our output shows us that we trained only one model.
The package neuralnet
provides us with a neat way to use our model to perform predictions through the compute()
function. Essentially, it provides us with not only the predicted output for a data frame of observations, but also shows us the output values of all the neurons in the model's architecture. To evaluate the performance of the model, we are interested in the outputs of the neural network on our test set:
> test_predictions <- compute(eneff_model, eneff_test_pp)
We can access the predicted outputs of the neural network using the net.result
attribute of the test_predictions
object as follows:
> head(test_predictions$net.result) [,1] [,2] 7 0.38996108769 0.39770348145 8 0.38508402576 0.46726904682 14 0.29555228848 0.24157156896 21 0.49912349400 0.51244876337 23 0.50036257800 0.47436990729 29 0.01133684342 0.01815294595
As this is a regression problem, we would like to be able to use the MSE in order to evaluate the performance of our model on both target outputs. In order to do that, we need to transform our predicted outputs back onto their original scale for a fair assessment to be made. The scaling constants we used on our data are stored in the ranges
attribute of the eneff_train_out_pp
object:
> eneff_train_out_pp$ranges heatLoad coolLoad [1,] 6.01 10.90 [2,] 42.96 48.03
The first row contains the minimum values of the original data, and the second row contains the maximum values. We'll now write a function that will take in a scaled vector and another vector that contains the original minimum and maximum values, and will return the original unscaled vector:
reverse_range_scale <- function(v, ranges) { return( (ranges[2] - ranges[1]) * v + ranges[1] ) }
Next, we'll use this to obtain the unscaled predicted outputs for our test set:
> output_ranges <- eneff_train_out_pp$ranges > test_predictions_unscaled <- sapply(1:2, function(x) reverse_range_scale(test_predictions[,x], output_ranges[,x]))
We can also define a simple function to compute the MSE and use it to check the performance on our two tasks:
mse <- function(y_p, y) { return(mean((y - y_p) ^ 2)) } > mse(test_predictions_unscaled[,1], eneff_test_outputs[,1]) [1] 0.2940468477 > mse(test_predictions_unscaled[,2], eneff_test_outputs[,2]) [1] 1.440127075
These values are very low, indicating that we have very good prediction accuracy. We can also investigate correlation, which is scale independent, and we could have used it on the unscaled outputs as well:
> cor(test_predictions_unscaled[,1], eneff_test_outputs[,1]) [1] 0.9986655316 > cor(test_predictions_unscaled[,2], eneff_test_outputs[,2]) [1] 0.9926735348
These values are extremely high, indicating that we have near-perfect performance, something very rare to see with real-world data. If the accuracy were not this high, we would experiment by making the architecture more complicated. We could, for example, build a model with an additional layer by setting hidden=c(10,5)
so that we would have an additional layer of five neurons before the output layer.