Ill-conditioning

The condition number of a matrix is the ratio of its largest singular value to its smallest. A matrix is ill-conditioned if this ratio is very large, which usually indicates that the smallest singular value is orders of magnitude smaller than the largest one, and that the rows of the matrix are strongly correlated with each other. Ill-conditioning is a very general problem in optimization; it makes even convex optimization problems difficult to solve. Neural networks generally suffer from this problem, which causes SGD to get stuck: learning becomes very slow in spite of the presence of a strong gradient.

For a dataset with a good condition number, close to 1, the error contours are nearly circular and the negative gradient always points straight at the minimum of the error surface. For a poorly conditioned dataset, the error surface is relatively flat in one or more directions and strongly curved in others. For complex neural networks, it may not be possible to compute the Hessian and analyze the ill-conditioning effect analytically. However, the effect of ill-conditioning can be monitored by plotting the squared gradient norm and $g^T H g$ over training epochs.
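As a quick illustration of the definition, here is a minimal NumPy sketch; the matrix values are made up for the example and chosen so that the rows are nearly linearly dependent:

```python
import numpy as np

# Illustrative matrix with almost linearly dependent rows.
A = np.array([[1.00, 0.99],
              [0.99, 0.98]])

s = np.linalg.svd(A, compute_uv=False)   # singular values, in descending order
cond = s[0] / s[-1]                      # largest / smallest singular value
print(f"singular values: {s}, condition number: {cond:.1f}")

# np.linalg.cond computes the same ratio directly.
print(np.linalg.cond(A))
```

Even though every entry of `A` is of order 1, the near-dependence of the rows drives the smallest singular value toward zero, so the condition number runs into the tens of thousands.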

Let's consider the second-order Taylor-series approximation of the function $f(x)$ that we want to optimize. The second-order Taylor series around the point $x_0$ is given by:

$$f(x) \approx f(x_0) + (x - x_0)^T g + \frac{1}{2}(x - x_0)^T H (x - x_0)$$
where $g$ is the gradient vector and $H$ is the Hessian of $f(x)$ at $x_0$. If $\epsilon$ is the learning rate we are using, then the new point according to gradient descent is $x_0 - \epsilon g$. Substituting this into the Taylor-series expansion, we get:

$$f(x_0 - \epsilon g) \approx f(x_0) - \epsilon\, g^T g + \frac{1}{2}\epsilon^2\, g^T H g$$
Note that if $-\epsilon\, g^T g + \frac{1}{2}\epsilon^2\, g^T H g > 0$, then the value of the function at the new point will be higher than at $x_0$. Also, in the presence of strong gradients we will have a high squared gradient norm, $g^T g$, but if, at the same time, the other quantity, $g^T H g$, grows by an order of magnitude, then we will see $f(x)$ decreasing at a very slow rate. However, if we shrink the learning rate $\epsilon$ at that point, this effect can be nullified to some extent, since $g^T H g$ is multiplied by $\frac{1}{2}\epsilon^2$. As noted earlier, the effect of ill-conditioning can be monitored by plotting the squared gradient norm, $g^T g$, and $g^T H g$ over training epochs. We saw in the Hot plate example how to calculate the gradient norm.
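The following is a minimal sketch, assuming PyTorch, of how both quantities could be tracked during training; the tiny linear model, the random data, and the helper name `grad_diagnostics` are illustrative assumptions, not the book's Hot plate example. It uses the standard Hessian-vector-product trick, differentiating $g^T v$ with $v = g$ held constant, so that $g^T H g$ is obtained without ever forming $H$ explicitly:

```python
import torch

# Toy model and data, purely for illustration.
torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
params = list(model.parameters())

def grad_diagnostics(loss, params):
    """Return ||g||^2, g^T H g, and the gradients for a scalar loss."""
    # First-order gradients, keeping the graph for a second backward pass.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [g.detach() for g in grads]                        # v = g, held constant
    g_sq = sum((g * vi).sum() for g, vi in zip(grads, v))  # g^T g
    # d(g^T v)/d(theta) = H v; with v = g this yields H g without forming H.
    Hg = torch.autograd.grad(g_sq, params)
    gTHg = sum((h * vi).sum() for h, vi in zip(Hg, v))
    return g_sq.item(), gTHg.item(), v

lr = 0.1
for epoch in range(5):
    loss = torch.nn.functional.mse_loss(model(x), y)
    g_sq, gTHg, grads = grad_diagnostics(loss, params)
    with torch.no_grad():                  # plain SGD update, reusing the gradients
        for p, g in zip(params, grads):
            p -= lr * g
    print(f"epoch {epoch}: loss={loss.item():.4f} "
          f"||g||^2={g_sq:.4f} gTHg={gTHg:.4f}")
```

When $\frac{1}{2}\epsilon^2\, g^T H g$ approaches $\epsilon\, g^T g$ in such a plot, the predicted per-step decrease vanishes, which is exactly the slow-learning symptom described above.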
