Debugging gradient descent with gradient checking

We just learned how gradient descent works and how to code the gradient descent algorithm from scratch for a simple two-layer network. But implementing gradient descent for complex neural network architectures is not a simple task, and debugging such an implementation is even more tedious. Surprisingly, even with a buggy gradient descent implementation, the network will still learn something; however, it will not perform as well as a bug-free implementation.

If the model does not raise any errors and learns something even with a buggy implementation of the gradient descent algorithm, how can we evaluate and ensure that our implementation is correct? This is where the gradient checking algorithm comes in: it helps us validate our implementation of gradient descent by numerically checking the derivative.

Gradient checking is basically used to debug the gradient descent algorithm and to validate that we have a correct implementation.

Okay. So, how does gradient checking work? In gradient checking, we compare the numerical and analytical gradients. Wait! What are numerical and analytical gradients?

The analytical gradient is the gradient we calculate through backpropagation, while the numerical gradient is a numerical approximation to it. Let's explore this with an example. Assume we have a simple square function, $f(x) = x^2$.

The analytical gradient for the preceding function is computed using the power rule as follows:

$$f'(x) = 2x \tag{7}$$

Now, let's see how to approximate the gradient numerically. Instead of using the power rule, we calculate the gradient using the definition of the gradient. We know that the gradient, or slope, of a function basically gives us the steepness of the function.

Thus, the gradient or slope of a function is defined as follows:

$$f'(x) = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$$

For a small value of $\epsilon$, the gradient of the function can be approximated as follows:

$$f'(x) \approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon} \tag{8}$$
We use the preceding equation to approximate the gradient numerically; that is, we calculate the slope of the function manually instead of using the power rule.

Computing the gradient through the power rule (7) and approximating it numerically (8) essentially give us the same value. Let's verify that (7) and (8) give the same value in Python.

Define the square function:

def f(x):
    return x**2

Define the epsilon and input value:

epsilon = 1e-2
x = 3

Compute the analytical gradient:

analytical_gradient = 2*x

print(analytical_gradient)

6

Compute the numerical gradient:

numerical_gradient = (f(x+epsilon) - f(x-epsilon)) / (2*epsilon)

print(numerical_gradient)

6.000000000012662

As you may have noticed, computing the numerical and analytical gradients of the square function gives us essentially the same value, which is 6 when x = 3.
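
Since the numerical gradient is only an approximation, it is worth seeing how the choice of epsilon affects it. The following short sketch reuses f, x, and analytical_gradient from the preceding steps and prints the absolute difference between the numerical and analytical gradients for a few illustrative values of epsilon (for this simple square function, the two-sided approximation is accurate for all of them, and with a very small epsilon, floating-point round-off starts to dominate):

# Illustrative sweep over epsilon values, reusing f, x, and
# analytical_gradient defined above
for eps in [1e-1, 1e-2, 1e-4, 1e-7]:
    numerical = (f(x + eps) - f(x - eps)) / (2 * eps)
    print(eps, abs(numerical - analytical_gradient))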

While backpropagating through the network, we compute the analytical gradients to minimize the cost function. Now, we need to make sure that our computed analytical gradients are correct. To validate them, let's approximate the gradients of the cost function numerically.

The gradient of the cost function $J$ with respect to the weights $\theta$ can be numerically approximated as follows:

$$\frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon} \tag{9}$$

We then check whether the analytical gradients and the approximated numerical gradients are the same; if not, there is an error in our analytical gradient computation. We don't check whether the numerical and analytical gradients are exactly equal, since the numerical gradients are only an approximation; instead, we treat the difference between the analytical and numerical gradients as an error. If the difference is less than or equal to a very small number, say 1e-7, then our implementation is fine. If the difference is greater than 1e-7, then our implementation is wrong.

Instead of calculating the error directly as the difference between the numerical and analytical gradients, we calculate the relative error, defined as the ratio of the norm of the difference between the gradients to the sum of the norms of the gradients:

$$\text{relative error} = \frac{\lVert \text{grad}_{analytical} - \text{grad}_{numerical} \rVert_2}{\lVert \text{grad}_{analytical} \rVert_2 + \lVert \text{grad}_{numerical} \rVert_2} \tag{10}$$

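To see why the relative error is preferable to the raw difference, consider two gradients of large magnitude: an absolute difference that looks alarming can still be negligible relative to the size of the gradients. Here is a quick illustrative check, with values chosen arbitrarily:

import numpy as np

analytical = np.array([1e6, 2e6])
numerical = np.array([1e6 + 0.1, 2e6 - 0.1])

# The absolute difference looks non-trivial (~0.14) ...
abs_diff = np.linalg.norm(analytical - numerical)

# ... but the relative error (~3.2e-8) is well below 1e-7
relative_error = abs_diff / (np.linalg.norm(analytical) + np.linalg.norm(numerical))

print(abs_diff, relative_error)
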
When the value of the relative error is less than or equal to a small threshold value, say 1e-7, our implementation is fine. If the relative error is greater than 1e-7, our implementation is wrong. Now, let's see how to implement the gradient checking algorithm in Python, step by step.

First, we perturb the weights by epsilon; refer to equation (9):

weights_plus = weights + epsilon 
weights_minus = weights - epsilon

Compute J_plus and J_minus; refer to equation (9):

J_plus = forward_prop(x, weights_plus) 
J_minus = forward_prop(x, weights_minus)

Now, we can compute the numerical gradient as given in (9) as follows:

numerical_grad = (J_plus - J_minus) / (2 * epsilon) 

Analytical gradients can be obtained through backpropagation:

analytical_grad = backward_prop(x, weights)

Compute the relative error as given in equation (10):

import numpy as np

numerator = np.linalg.norm(analytical_grad - numerical_grad)
denominator = np.linalg.norm(analytical_grad) + np.linalg.norm(numerical_grad)
relative_error = numerator / denominator

If the relative error is less than a small threshold value, say 1e-7, then our gradient descent implementation is correct; otherwise, it is wrong:

if relative_error < 1e-7:
    print("The gradient is correct!")
else:
    print("The gradient is wrong!")

Thus, with the help of gradient checking, we make sure that our gradient descent algorithm is bug-free.
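
To tie all of these steps together, here is a minimal, self-contained sketch of the whole check. The forward_prop and backward_prop functions below are hypothetical stand-ins, a single-weight linear model with a squared-error cost, rather than the network from earlier in this chapter:

import numpy as np

# Hypothetical stand-in model: prediction = weight * x,
# cost J = (prediction - y)**2
x, y = 2.0, 10.0
weights = np.array([3.0])
epsilon = 1e-5

def forward_prop(x, weights):
    # Cost J for the given weights
    return (weights[0] * x - y) ** 2

def backward_prop(x, weights):
    # Analytical gradient: dJ/dw = 2 * (w*x - y) * x
    return np.array([2 * (weights[0] * x - y) * x])

# Perturb the weights and approximate the gradient numerically, equation (9)
J_plus = forward_prop(x, weights + epsilon)
J_minus = forward_prop(x, weights - epsilon)
numerical_grad = (J_plus - J_minus) / (2 * epsilon)

# Analytical gradient from backpropagation
analytical_grad = backward_prop(x, weights)

# Relative error, equation (10)
numerator = np.linalg.norm(analytical_grad - numerical_grad)
denominator = np.linalg.norm(analytical_grad) + np.linalg.norm(numerical_grad)
relative_error = numerator / denominator

if relative_error < 1e-7:
    print("The gradient is correct!")
else:
    print("The gradient is wrong!")

Running this prints The gradient is correct!, since the analytical gradient of this toy model matches its numerical approximation almost exactly.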
