There's more...

Activation functions are used to learn non-linear and complex functional mappings between the inputs and the response variable in an artificial neural network. One thing to keep in mind is that an activation function should be differentiable, so that backpropagation can compute the gradients of the error (loss) with respect to the weights and update the weights to reduce the error. Let's have a look at some of the popular activation functions and their properties:

  • Sigmoid:
    • A sigmoid function ranges between 0 and 1.
    • It is usually used in an output layer of a binary classification problem.
    • It is better than linear activation because its output is bounded to (0, 1) rather than (-inf, inf): it squashes large negative numbers toward 0 and large positive numbers toward 1.
    • Its output is not zero-centered, so the gradients for the downstream weights all share the same sign, producing zig-zag updates that make optimization harder.
    • It has a vanishing gradient problem.
    • It also has slow convergence.

The sigmoid function is defined as follows:
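\sigma(x) = \frac{1}{1 + e^{-x}}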

Here is the graph of the sigmoid function:

  • Tangent Hyperbolic (tanh):
    • The tanh function scales the values between -1 and 1.
    • The gradient for tanh is steeper than it is for sigmoid.
    • Unlike sigmoid, it is centered around zero, which makes optimization easier.
    • It is usually used in hidden layers.
    • It has a vanishing gradient problem.

The tanh function is defined as follows:
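\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}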

The following diagram is the graph of the tanh function:

  • Rectified linear units (ReLU):
    • It is a non-linear function
    • It ranges from 0 to infinity
    • It does not have a vanishing gradient problem
    • It converges faster than sigmoid and tanh
    • It has a dying ReLU problem
    • It is used in hidden layers

The ReLU function is defined as follows:
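f(x) = \max(0, x)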

Here is the graph for the ReLU function:

Now, let's look at the variants of ReLU:

  • Leaky ReLU:
    • It doesn't have the dying ReLU problem, as it has no zero-slope segment
    • Leaky ReLU learns faster than ReLU

Mathematically, the Leaky ReLU function can be defined as follows:
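f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}, \quad \text{where } \alpha \text{ is a small fixed constant (commonly } 0.01\text{)}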

Here is a graphical representation of the Leaky ReLU function:

  • Exponential Linear Unit (ELU):
    • It doesn't have the dying ReLU problem
    • It saturates for large negative values

Mathematically, the ELU function can be defined as follows:
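f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^{x} - 1) & \text{if } x \leq 0 \end{cases}, \quad \text{where } \alpha > 0 \text{ is a hyperparameter}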

Here is a graphical representation of the ELU function:

  • Parametric Rectified Linear Unit (PReLU):
    • PReLU is a type of leaky ReLU where the value of alpha is learned by the network itself during training.

The mathematical definition of the PReLU function is as follows:
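f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}, \quad \text{where } \alpha \text{ is learned during training}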

  • Thresholded Rectified Linear Unit (thresholded ReLU):
    • It is similar to ReLU, but the unit activates only when the input exceeds a threshold value, theta.

The mathematical definition of the thresholded ReLU function is as follows:
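f(x) = \begin{cases} x & \text{if } x > \theta \\ 0 & \text{otherwise} \end{cases}, \quad \text{where } \theta \text{ is the threshold value}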

  • Softmax:
    • It is non-linear.
    • It's usually used in the output layer of a multiclass classification problem.
    • It calculates a probability distribution over n different events (classes): it outputs a value between 0 and 1 for each class, and the probabilities sum to 1.

The mathematical definition of the softmax function is as follows:
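\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}, \quad i = 1, \ldots, K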

Here, K is the number of possible outcomes.
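
To make these definitions concrete, here is a minimal NumPy sketch of the fixed (non-learned) activations discussed above; the function names and the default values of alpha and theta are illustrative choices, not taken from the text:

    import numpy as np

    def sigmoid(x):
        # Squashes inputs into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # Zero-centered; squashes inputs into the range (-1, 1)
        return np.tanh(x)

    def relu(x):
        # Zero for negative inputs, identity for positive inputs
        return np.maximum(0.0, x)

    def leaky_relu(x, alpha=0.01):
        # Small non-zero slope for negative inputs avoids dead units
        return np.where(x > 0, x, alpha * x)

    def elu(x, alpha=1.0):
        # Saturates smoothly toward -alpha for large negative inputs
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

    def thresholded_relu(x, theta=1.0):
        # Passes the input only when it exceeds the threshold theta
        return np.where(x > theta, x, 0.0)

    def softmax(x):
        # Probability distribution over classes; subtract max for numerical stability
        e = np.exp(x - np.max(x))
        return e / np.sum(e)

    x = np.array([-2.0, -0.5, 0.0, 1.5])
    print(relu(x))      # [0.  0.  0.  1.5]
    print(softmax(x))   # four probabilities that sum to 1

PReLU has the same form as leaky_relu, except that alpha is a parameter learned during training rather than a fixed constant, so it is not shown here as a standalone function.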
