Neural network model

In Chapter 2, Learning Process in Neural Networks, we scaled the data before building the network. On that occasion, we pointed out that it is good practice to normalize the data before training a neural network. Normalization removes the units of measurement, allowing you to easily compare data measured on different scales.

It is not always necessary to normalize numeric data. However, it has been shown that when numeric values are normalized, neural network training is often more efficient and leads to better predictions. In fact, if numeric data are not normalized and two predictors have very different scales, a change in the value of a neural network weight has much more relative influence on the predictor with the higher values.

There are several normalization techniques; in Chapter 2, Learning Process in Neural Networks, we adopted min-max normalization. In this case, we will adopt Z-score normalization. This technique consists of subtracting the column mean from each value in the column, and then dividing the result by the standard deviation of the column. The formula to achieve this is the following:

z = (x − μ) / σ

Here, x is the observed value, μ is the mean of the column, and σ is the standard deviation of the column.

In summary, the Z-score (also called the standard score) represents the number of standard deviations by which the value of an observation or data point lies above or below the mean of what is observed or measured. Values above the mean have positive Z-scores, while values below the mean have negative Z-scores. The Z-score is a dimensionless quantity, obtained by subtracting the population mean from an individual raw score and then dividing the difference by the standard deviation of the population.
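As a quick illustration (the numbers here are made up, not taken from our dataset), the Z-score can be computed by hand in R:

```r
# Z-scores by hand for a small illustrative vector
x <- c(10, 12, 14, 16, 18)
z <- (x - mean(x)) / sd(x)

mean(x)      # 14: the central value gets a Z-score of exactly 0
round(z, 3)  # -1.265 -0.632  0.000  0.632  1.265
```

Note the sign pattern: the two values below the mean get negative scores, the two above get positive ones.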

Before applying the chosen normalization method, you must calculate the mean and standard deviation of each column of the dataset. To do this, we use the apply() function. This function returns a vector, array, or list of values obtained by applying a function to the margins of an array or matrix. Let's understand the meaning of the arguments used.

mean_data <- apply(data[1:6], 2, mean)
sd_data <- apply(data[1:6], 2, sd)

The first line calculates the mean of each variable, while the second line calculates the standard deviation of each variable. Let's see how we used the apply() function. The first argument specifies the dataset to apply the function to; in our case, the dataset named data. In particular, we have only considered the first six numeric variables; the others we will use for other purposes. The second argument is a vector giving the subscripts over which the function will be applied: 1 indicates rows and 2 indicates columns, so we used 2 to work column by column. The third argument is the function to be applied: the mean() function in the first line and the sd() function in the second. The results are shown as follows:

> mean_data
mpg cylinders displacement horsepower weight
23.445918 5.471939 194.411990 104.469388 2977.584184
acceleration
15.541327
> sd_data
mpg cylinders displacement horsepower weight
7.805007 1.705783 104.644004 38.491160 849.402560
acceleration
2.758864
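To see the margin argument of apply() in isolation, here is a tiny self-contained example (the matrix is made up for illustration):

```r
# A 2x3 matrix filled column by column: rows are (1 3 5) and (2 4 6)
m <- matrix(1:6, nrow = 2)

apply(m, 1, sum)  # margin 1 = rows:    9 12
apply(m, 2, sum)  # margin 2 = columns: 3  7 11
```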

To normalize the data, we use the scale() function, which is a generic function whose default method centers and/or scales the columns of a numeric matrix:

data_scaled <- as.data.frame(scale(data[,1:6],center = mean_data, scale = sd_data))

Let's take a look at the data transformed by normalization:

head(data_scaled, n=20)

The results are as follows:

> head(data_scaled, n=20)
mpg cylinders displacement horsepower weight acceleration
1 -0.69774672 1.4820530 1.07591459 0.6632851 0.6197483 -1.28361760
2 -1.08211534 1.4820530 1.48683159 1.5725848 0.8422577 -1.46485160
3 -0.69774672 1.4820530 1.18103289 1.1828849 0.5396921 -1.64608561
4 -0.95399247 1.4820530 1.04724596 1.1828849 0.5361602 -1.28361760
5 -0.82586959 1.4820530 1.02813354 0.9230850 0.5549969 -1.82731962
6 -1.08211534 1.4820530 2.24177212 2.4299245 1.6051468 -2.00855363
7 -1.21023822 1.4820530 2.48067735 3.0014843 1.6204517 -2.37102164
8 -1.21023822 1.4820530 2.34689042 2.8715843 1.5710052 -2.55225565
9 -1.21023822 1.4820530 2.49023356 3.1313843 1.7040399 -2.00855363
10 -1.08211534 1.4820530 1.86907996 2.2220846 1.0270935 -2.55225565
11 -1.08211534 1.4820530 1.80218649 1.7024847 0.6892089 -2.00855363
12 -1.21023822 1.4820530 1.39126949 1.4426848 0.7433646 -2.73348966
13 -1.08211534 1.4820530 1.96464205 1.1828849 0.9223139 -2.18978763
14 -1.21023822 1.4820530 2.49023356 3.1313843 0.1276377 -2.00855363
15 0.07099053 -0.8629108 -0.77799001 -0.2460146 -0.7129531 -0.19621355
16 -0.18525522 0.3095711 0.03428778 -0.2460146 -0.1702187 -0.01497955
17 -0.69774672 0.3095711 0.04384399 -0.1940546 -0.2396793 -0.01497955
18 -0.31337809 0.3095711 0.05340019 -0.5058145 -0.4598340 0.16625446
19 0.45535916 -0.8629108 -0.93088936 -0.4278746 -0.9978592 -0.37744756
20 0.32723628 -0.8629108 -0.93088936 -1.5190342 -1.3451622 1.79736053
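A quick sanity check on a made-up matrix confirms what scale() guarantees: after centering and scaling, every column has mean 0 and standard deviation 1.

```r
# scale() with default arguments centers by the column means and
# scales by the column standard deviations
m <- matrix(c(10, 20, 30, 5, 15, 25), ncol = 2)
m_scaled <- scale(m)

round(colMeans(m_scaled), 10)      # 0 0
round(apply(m_scaled, 2, sd), 10)  # 1 1
```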

Let's now split the data for the training and the test:

index = sample(1:nrow(data),round(0.70*nrow(data)))
train_data <- as.data.frame(data_scaled[index,])
test_data <- as.data.frame(data_scaled[-index,])

In the first line of the code just shown, the dataset is split 70:30, with the intention of using 70 percent of the data at our disposal to train the network and the remaining 30 percent to test it. In the second and third lines, the rows of the data_scaled dataframe are subdivided into two new dataframes, called train_data and test_data. Now we have to build the formula to be submitted to the network:
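Note that sample() draws randomly, so each run produces a different split. A sketch of a reproducible split on an illustrative 10-row dataframe (set.seed() and the toy dataframe are our additions, not part of the chapter's code):

```r
set.seed(123)  # fixing the seed makes sample() reproducible
df <- data.frame(x = 1:10)
index <- sample(1:nrow(df), round(0.70 * nrow(df)))

length(index)                     # 7 rows go to training
nrow(df[-index, , drop = FALSE])  # 3 rows are left for testing
```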

n = names(data_scaled)
f = as.formula(paste("mpg ~", paste(n[!n %in% "mpg"], collapse = " + ")))

In the first line, we recover all the variable names in the data_scaled dataframe using the names() function. In the second line, we build the formula that we will use to train the network. What does this formula represent?

The models fitted by the neuralnet() function are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term. Let's look at the formula we set:

> f
mpg ~ cylinders + displacement + horsepower + weight + acceleration
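The same paste() construction can be checked on a shortened, illustrative name vector:

```r
# Rebuild the formula from a reduced set of names: everything
# except "mpg" becomes a predictor joined by " + "
n <- c("mpg", "cylinders", "displacement")
f <- as.formula(paste("mpg ~", paste(n[!n %in% "mpg"], collapse = " + ")))

deparse(f)  # "mpg ~ cylinders + displacement"
```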

Now we can build and train the network.

In Chapter 3, Deep Learning Using Multilayer Neural Networks, we said that to choose the optimal number of neurons, we need to know that:

  • A small number of neurons will lead to high error for your system, as the predictive factors might be too complex for a small number of neurons to capture
  • A large number of neurons will overfit your training data and not generalize well
  • The number of neurons in each hidden layer should be somewhere between the size of the input layer and the size of the output layer, potentially the mean
  • The number of neurons in each hidden layer shouldn't exceed twice the number of input neurons, as you are probably grossly overfitting at this point
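Expressed in code (a sketch of the guidelines above, not a rule of the neuralnet package; the variable names are ours):

```r
n_inputs  <- 5  # cylinders, displacement, horsepower, weight, acceleration
n_outputs <- 1  # mpg

# Between the input and output layer sizes, potentially the mean
suggested   <- round(mean(c(n_inputs, n_outputs)))  # 3
# Shouldn't exceed twice the number of input neurons
upper_bound <- 2 * n_inputs                         # 10
```

This mean-of-layer-sizes heuristic is what leads us to three hidden neurons below.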

In this case, we have five input variables (cylinders, displacement, horsepower, weight, and acceleration) and one variable output (mpg). We choose to set three neurons in the hidden layer.

net = neuralnet(f,data=train_data,hidden=3,linear.output=TRUE)

The hidden argument accepts a vector with the number of neurons for each hidden layer, while the argument linear.output is used to specify whether we want to do regression (linear.output=TRUE) or classification (linear.output=FALSE).
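For example (a sketch; this alternative network is not built in the chapter), a network with two hidden layers of five and three neurons would be requested by passing a two-element vector:

```r
# hidden is a plain integer vector with one entry per hidden layer
hidden_layers <- c(5, 3)
length(hidden_layers)  # 2 hidden layers

# Hypothetical call, reusing the chapter's f and train_data:
# net2 <- neuralnet(f, data = train_data, hidden = hidden_layers,
#                   linear.output = TRUE)
```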

The algorithm used by neuralnet(), by default, is resilient backpropagation with weight backtracking (rprop+); the variant without weight backtracking (rprop-) and the globally convergent versions, which additionally modify one learning rate, either the learning rate associated with the smallest absolute gradient (sag) or the smallest learning rate (slr) itself, are also available. The neuralnet() function returns an object of class nn. An object of class nn is a list containing at most the components shown in the following table:

Components and their descriptions:

  • call: The matched call.
  • response: Extracted from the data argument.
  • covariate: The variables extracted from the data argument.
  • model.list: A list containing the covariates and the response variables extracted from the formula argument.
  • err.fct: The error function.
  • act.fct: The activation function.
  • data: The data argument.
  • net.result: A list containing the overall result of the neural network for every repetition.
  • weights: A list containing the fitted weights of the neural network for every repetition.
  • generalized.weights: A list containing the generalized weights of the neural network for every repetition.
  • result.matrix: A matrix containing the reached threshold, needed steps, error, AIC and BIC (if computed), and weights for every repetition. Each column represents one repetition.
  • startweights: A list containing the startweights of the neural network for every repetition.

To produce a summary of the results of the model, we use the summary() function:

> summary(net)
Length Class Mode
call 5 -none- call
response 274 -none- numeric
covariate 1370 -none- numeric
model.list 2 -none- list
err.fct 1 -none- function
act.fct 1 -none- function
linear.output 1 -none- logical
data 6 data.frame list
net.result 1 -none- list
weights 1 -none- list
startweights 1 -none- list
generalized.weights 1 -none- list
result.matrix 25 -none- numeric

For each component of the neural network model, three features are displayed:

  • Length: This is component length, that is how many elements of this type are contained in it
  • Class: This contains specific indication on the component class
  • Mode: This is the type of component (numeric, list, function, logical, and so on)

To plot the graphical representation of the model with the weights on each connection, we can use the plot() function. The plot() function is a generic function for the representation of objects in R. Being generic means that it is suitable for different types of objects, from variables to tables to complex function outputs, producing different results: applied to a nominal variable, it produces a bar graph; applied to a cardinal variable, a scatterplot; applied to the same variable tabulated (that is, to its frequency distribution), a histogram; and applied to two variables, one nominal and one cardinal, a boxplot.

plot(net)

The neural network plot is shown in the following graph:

In the previous graph, the black lines (these lines start from input nodes) show the connections between each layer and the weights on each connection, while the blue lines (these lines start from bias nodes which are distinguished by number 1) show the bias term added in each step. The bias can be thought of as the intercept of a linear model.

Though over time we have come to understand a lot about the mechanics underlying neural networks, in many respects the model we have built and trained remains a black box: the fitted weights do not lend themselves to a clear interpretation. We can be satisfied that the training algorithm has converged and that the model is ready to be used.

We can print the weights and biases:

> net$result.matrix
1
error 21.800203210980
reached.threshold 0.009985137179
steps 9378.000000000000
Intercept.to.1layhid1 -1.324633695625
cylinders.to.1layhid1 0.291091600669
displacement.to.1layhid1 -2.243406161080
horsepower.to.1layhid1 0.616083122568
weight.to.1layhid1 1.292334492287
acceleration.to.1layhid1 -0.286145921068
Intercept.to.1layhid2 -41.734205163355
cylinders.to.1layhid2 -5.574494023650
displacement.to.1layhid2 33.629686446649
horsepower.to.1layhid2 -28.185856598271
weight.to.1layhid2 -50.822997942647
acceleration.to.1layhid2 -5.865256284330
Intercept.to.1layhid3 0.297173606203
cylinders.to.1layhid3 0.306910802417
displacement.to.1layhid3 -5.897977831914
horsepower.to.1layhid3 0.379215333054
weight.to.1layhid3 2.651777936654
acceleration.to.1layhid3 -1.035618563747
Intercept.to.mpg -0.578197055155
1layhid.1.to.mpg -3.190914666614
1layhid.2.to.mpg 0.714673177354
1layhid.3.to.mpg 1.958297807266

As can be seen, these are the same values that we can read in the network plot. For example, cylinders.to.1layhid1 = 0.291091600669 is the weight for the connection between the input cylinders and the first node of the hidden layer.

Now we can use the network to make predictions. For this, we had set aside 30 percent of the data in the test_data dataframe. It is time to use it.

predict_net_test <- compute(net,test_data[,2:6])

The compute() function returns the outputs of the trained network for the given covariate data. In our case, we applied it to the test_data dataset, using only columns 2 to 6, which represent the input variables of the network. To evaluate the network's performance, we can use the Mean Squared Error (MSE) as a measure of how far our predictions are from the real data.

MSE.net <- sum((test_data$mpg - predict_net_test$net.result)^2)/nrow(test_data)

Here test_data$mpg is the actual data and predict_net_test$net.result is the predicted data for the target of the analysis. Following is the result:

> MSE.net
[1] 0.2591064572
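The calculation itself is just the average of the squared differences; with made-up numbers:

```r
actual    <- c(1.0, 2.0, 3.0)
predicted <- c(1.1, 1.9, 3.2)

# MSE: sum of squared errors divided by the number of observations
mse <- sum((actual - predicted)^2) / length(actual)
mse  # (0.01 + 0.01 + 0.04) / 3 = 0.02
```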

It looks like a good result, but what do we compare it with? To get an idea of the accuracy of the network prediction, we can build a linear regression model:

Lm_Mod <- lm(mpg~., data=train_data)
summary(Lm_Mod)

We build a linear regression model using the lm() function. This function is used to fit linear models; it can also be used to perform single-stratum analysis of variance and analysis of covariance. To produce a summary of the fitted model, we use the summary() function, which returns the following results:

> summary(Lm_Mod)
Call:
lm(formula = mpg ~ ., data = train_data)
Residuals:
Min 1Q Median 3Q Max
-1.48013031 -0.34128989 -0.04310873 0.27697893 1.77674878
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01457260 0.03268643 0.44583 0.656080
cylinders -0.14056198 0.10067461 -1.39620 0.163809
displacement 0.06316568 0.13405986 0.47118 0.637899
horsepower -0.16993594 0.09180870 -1.85098 0.065273 .
weight -0.59531412 0.09982123 -5.96380 0.0000000077563 ***
acceleration 0.03096675 0.05166132 0.59942 0.549400
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5392526 on 268 degrees of freedom
Multiple R-squared: 0.7183376, Adjusted R-squared: 0.7130827
F-statistic: 136.6987 on 5 and 268 DF, p-value: < 0.00000000000000022204

Now we make the prediction with the linear regression model using the data contained in the test_data dataframe:

predict_lm <- predict(Lm_Mod,test_data)

Finally, we calculate the MSE for the regression model:

MSE.lm <- sum((predict_lm - test_data$mpg)^2)/nrow(test_data)

Following is the result:

> MSE.lm
[1] 0.3124200509

From the comparison between the two models (neural network model versus linear regression model), once again the neural network wins (0.26 versus 0.31).

We now perform a visual comparison by drawing on a graph the actual value versus the predicted value, first for neural network and then for linear regression model:

par(mfrow=c(1,2))

plot(test_data$mpg,predict_net_test$net.result,col='black',main='Real vs predicted for neural network',pch=18,cex=4)
abline(0,1,lwd=5)

plot(test_data$mpg,predict_lm,col='black',main='Real vs predicted for linear regression',pch=18,cex=4)
abline(0,1,lwd=5)

The comparison between the performance of the neural network model (to the left) and the linear regression model (to the right) on the test set is plotted in the following graph:

As we can see, the predictions of the neural network are more concentrated around the line than those of the linear regression model, although the difference is not large.
