Chapter 7. Neural Networks

 

"Forget artificial intelligence – in the brave new world of big data, it's artificial idiocy we should be looking out for."

 
 --Tom Chatfield

I recall that at some meeting circa mid-2012, I was part of a group discussing the results of some analysis or other, when one of the people around the table sounded off with a hint of exasperation mixed with a tinge of fright: this isn't one of those neural networks, is it? Knowing of his past run-ins with, and deep-seated anxiety about, neural networks, I assuaged his fears by making some sarcastic comment that neural networks had basically gone the way of the dinosaur. No one disagreed! Several months later, I was gobsmacked when I attended a local meeting where the discussion focused on, of all things, neural networks and this mysterious deep learning. Machine learning pioneers such as Ng, Hinton, Salakhutdinov, and Bengio have revived neural networks and improved their performance.

Much media hype revolves around these methods, with high-tech companies such as Facebook, Google, and Netflix investing tens, if not hundreds, of millions of dollars. The methods have yielded promising results in voice recognition, image recognition, machine translation, and automation. If self-driving cars ever stop running off the road and into each other, it will likely be because of the methods discussed here.

In this chapter, we will discuss how the methods work, their benefits, and inherent drawbacks so that you can become conversationally competent about them. We will work through a practical business application of a neural network. Finally, we will apply the deep learning methodology in a cloud-based application.

Neural network

Neural network is a fairly broad term that covers a number of related methods, but in our case, we will focus on a feed-forward network that trains with backpropagation. I'm not going to waste our time discussing how the machine learning methodology is similar or dissimilar to how a biological brain works. We only need to start with a working definition of what a neural network is, and I think the Wikipedia entry is a good start:

In machine learning and cognitive science, Artificial Neural Networks (ANNs) are a family of statistical learning models inspired by biological neural networks (the central nervous systems of animals, in particular, the brain) and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. (Source: https://en.wikipedia.org/wiki/Artificial_neural_network)

The motivation or benefit of ANNs is that they allow the modeling of highly complex relationships between inputs/features and response variable(s), especially if the relationships are highly nonlinear. No underlying assumptions are required to create and evaluate the model, and it can be used with qualitative and quantitative responses. If this is the yin, then the yang is the common criticism that the results are a black box, which means that there is no equation with coefficients to examine and share with business partners. In fact, the results are all but uninterpretable. The other criticisms revolve around how the results can differ merely by changing the initial random weights, and that training ANNs is computationally expensive and time-consuming.

The mathematics behind ANNs is not trivial by any measure. However, it is crucial to develop at least a working understanding of what is happening. A good way to build this intuition is to start with a diagram of a simplistic neural network.

In this simple network, the inputs or covariates consist of two nodes or neurons. The neuron labeled 1 represents a constant, or more appropriately, the intercept. X1 represents a quantitative variable. The Ws represent the weights that are multiplied by the input node values. These weighted values become the input to the Hidden Node. You can have multiple hidden nodes, but the principle of what happens in just this one is the same. In the hidden node, H1, the weight * value computations are summed. As the intercept is notated as 1, that input value is simply its weight, W1. Now the magic happens: the summed value is transformed with the Activation function, turning the input signal into an output signal. In this example, as it is the only hidden node, its output is multiplied by W3 and becomes the estimate of Y, our response. This is the feed-forward portion of the algorithm.

[Figure: a simple feed-forward neural network]
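To make the feed-forward pass concrete, here is a minimal sketch in R. The weight values and the input x1 are hypothetical, chosen purely for illustration, and the hidden node uses the sigmoid activation that we will define formally later in this section:

> # feed-forward sketch; the weights and input below are assumed values
> sigmoid <- function(x) {
+   1 / (1 + exp(-x))
+ }
> w1 <- 0.5; w2 <- -0.3; w3 <- 0.8  # hypothetical weights
> x1 <- 2                           # a single quantitative input
> h1 <- sigmoid(w1 * 1 + w2 * x1)   # hidden node: weighted sum, then activation
> y_hat <- w3 * h1                  # output: the estimate of Y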

But wait, there's more! To complete the cycle, or epoch as it is known, backpropagation happens, training the model based on what was learned. To initiate the backpropagation, an error is determined based on a loss function such as sum of squared error or cross-entropy, among others. As the weights, W1 and W2, were set to initial random values in [-1, 1], the initial error may be high. Working backward, the weights are changed so as to minimize the error in the loss function. The following diagram portrays the backpropagation portion:

[Figure: the backpropagation portion of the algorithm]

This completes one epoch. This process continues, using gradient descent (discussed in Chapter 5, More Classification Techniques — K-Nearest Neighbors and Support Vector Machines), until the algorithm converges to the minimum error or a prespecified number of epochs is reached. If we assume that our activation function is simply linear, then in this example we would end up with Y = W3(W1(1) + W2(X1)).
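As a toy illustration of one full epoch under this linear activation, the following sketch feeds a single observation forward, measures the error, and then steps each weight opposite the gradient of a squared-error loss. The target y, the learning rate lr, and the random starting weights are all assumptions made for the example:

> # one epoch on the simple network with a linear activation; values assumed
> set.seed(123)
> x1 <- 2; y <- 1                  # a single observation and its target
> lr <- 0.1                        # learning rate, chosen for illustration
> w <- runif(3, -1, 1)             # W1, W2, W3 start as random values in [-1, 1]
> h1 <- w[1] * 1 + w[2] * x1       # feed-forward: hidden node (linear)
> y_hat <- w[3] * h1               # feed-forward: the estimate of Y
> err <- y - y_hat                 # error under a squared-error loss
> # backpropagation: step opposite the gradient of the loss 0.5 * err^2
> w <- w + lr * c(err * w[3],      # adjustment for W1 (intercept path)
+                 err * w[3] * x1, # adjustment for W2
+                 err * h1)        # adjustment for W3

Repeating the feed-forward and update steps in a loop until err stops shrinking, or until a fixed number of epochs is reached, mimics the convergence behavior just described.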

The networks can get complicated if you add numerous input neurons, multiple neurons in a hidden node, and even multiple hidden nodes. It is important to note that the output from a neuron is connected to all the subsequent neurons, with weights assigned to all of these connections. This greatly increases model complexity. However, adding hidden nodes and increasing the number of neurons in them did not improve the performance of ANNs as we had hoped. Thus, deep learning was developed, which in part relaxes the requirement for all of these neuron connections.

There are a number of activation functions that one can use/try, including a simple linear function or, for a classification problem, the logistic function (Chapter 3, Logistic Regression and Discriminant Analysis). A threshold function can be used, where the output is binary (0 or 1) based on some threshold value, as sketched below. Other common activation functions are the sigmoid and the hyperbolic tangent (tanh).
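As a quick sketch, the threshold function just mentioned can be written in a single line of R; the cutoff of zero is an assumption made for illustration:

> # binary threshold (step) activation; the cutoff of 0 is assumed
> threshold <- function(x, cutoff = 0) ifelse(x >= cutoff, 1, 0)
> threshold(c(-2, 0.5, 3))
[1] 0 1 1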

The sigmoid function used here is the standard logistic function and is bounded between zero and one. (Note that the logit function is the inverse of the sigmoid.) We can plot a sigmoid function in R. We will first create an R function in order to calculate the sigmoid function values:

> sigmoid <- function(x) {
+   1 / (1 + exp(-x))
+ }

Then, it is a simple matter to plot the function over a range of values, say -5 to 5:

> plot(sigmoid, -5, 5)

The output of the preceding command is as follows:

[Figure: the sigmoid function plotted from -5 to 5]

You can also plot the tanh function using base R. Again, let's examine it between -5 and 5:

> x <- seq(-5, 5, by = 0.1)
> y <- tanh(x)  # avoid naming this 't', which would mask base R's t() function
> plot(x, y, type = "l", ylab = "tanh")

The output of the preceding command is as follows:

[Figure: the tanh function plotted from -5 to 5]

This all sounds fascinating, but the ANN almost went the way of disco, as it just did not perform well, especially when attempting deep networks with many hidden layers and neurons. A slow but steady revival came about with the seminal paper by Hinton and Salakhutdinov (2006), in the reformulated and, dare I say, rebranded form of the neural network: deep learning.
