Digit classification – model building and training

Now, let's go ahead and build our model. We have 10 classes in our dataset, the digits 0-9, and the goal is to classify any input image into one of these classes. Instead of making a hard decision about the input image by saying only which single class it belongs to, we are going to produce a vector of 10 values (because we have 10 classes). This vector will represent the probability of each digit from 0-9 being the correct class for the input image.

For example, suppose we feed the model a specific image. The model might be 70% sure that this image is a 9, 10% sure that it is an 8, and so on. To get such an output, we are going to use softmax regression here, which produces values between 0 and 1 that add up to 1.

A softmax regression has two steps: first we add up the evidence of our input being in certain classes, and then we convert that evidence into probabilities.

To tally up the evidence that a given image is in a particular class, we do a weighted sum of the pixel intensities. The weight is negative if that pixel having a high intensity is evidence against the image being in that class, and positive if it is evidence in favor.

Figure 7 shows the weights one model learned for each of these classes. Red represents negative weights, while blue represents positive weights:

Figure 7: Weights one model learned for each of the MNIST classes

We also add some extra evidence called a bias. Basically, we want to be able to say that some things are more likely independent of the input. The result is that the evidence for a class i, given an input x, is:

evidence_i = Σ_j W_i,j x_j + b_i

Where:

  • W_i,j are the weights for class i
  • b_i is the bias for class i
  • j is an index for summing over the pixels in our input image x
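To make this indexed sum concrete, here is a minimal NumPy sketch (the array names, shapes, and random values are illustrative assumptions, not the book's actual training code) that computes one evidence score per class for a single flattened image:

```python
import numpy as np

# Hypothetical sizes for MNIST: 784 pixels per flattened image, 10 digit classes
num_pixels, num_classes = 784, 10

x = np.random.rand(num_pixels)                # one flattened input image (illustrative)
W = np.random.randn(num_classes, num_pixels)  # W[i, j]: weight of pixel j for class i (illustrative)
b = np.zeros(num_classes)                     # b[i]: bias for class i (illustrative)

# evidence_i = sum over pixels j of W[i, j] * x[j], plus the bias b[i]
evidence = np.array([sum(W[i, j] * x[j] for j in range(num_pixels)) + b[i]
                     for i in range(num_classes)])
print(evidence.shape)  # (10,) -- one evidence score per digit class
```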

We then convert the evidence tallies into our predicted probabilities y using the softmax function:

y = softmax(evidence)

Here, softmax is serving as an activation or link function, shaping the output of our linear function into the form we want, in this case, a probability distribution over 10 cases (because we have 10 possible classes from 0-9). You can think of it as converting tallies of evidence into probabilities of our input being in each class. It's defined as:

softmax(evidence) = normalize(exp(evidence))

If you expand that equation, you get:

softmax(evidence)_i = exp(evidence_i) / Σ_j exp(evidence_j)
But it's often more helpful to think of softmax the first way: exponentiating its inputs and then normalizing them. Exponentiation means that one more unit of evidence increases the weight given to any hypothesis exponentially. And conversely, having one less unit of evidence means that a hypothesis gets a fraction of its earlier weight. No hypothesis ever has zero or negative weight. Softmax then normalizes these weights so that they add up to one, forming a valid probability distribution.
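As a minimal sketch of this exponentiate-then-normalize idea (the max-subtraction is a standard numerical-stability trick and is not part of the definition itself), a softmax function might look like this in NumPy:

```python
import numpy as np

def softmax(evidence):
    # Exponentiate the evidence, then normalize so the outputs sum to 1.
    # Subtracting the max first avoids overflow and does not change the result.
    exps = np.exp(evidence - np.max(evidence))
    return exps / np.sum(exps)

# More evidence for a class translates into a higher probability for that class
probs = softmax(np.array([0.5, 1.2, -0.3, 0.0, 0.1, 0.2, -1.0, 0.4, 0.8, 2.5]))
print(probs.argmax())          # 9 -- the class with the most evidence
print(round(probs.sum(), 6))   # 1.0 -- a valid probability distribution
```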


You can picture our softmax regression as looking something like the following, although with a lot more x's. For each output, we compute a weighted sum of the x's, add a bias, and then apply softmax:

Figure 8: Visualization of softmax regression

If we write that out as equations, we get:

Figure 9: Equation representation of the softmax regression

We can write this procedure in vector notation, turning it into a matrix multiplication and a vector addition. This is very helpful for computational efficiency and readability:

Figure 10: Vectorized representation of the softmax regression equation

More compactly, we can just write:

y = softmax(Wx + b)
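
Assuming the same illustrative shapes as before (W of shape 10 x 784, b of length 10, and a flattened image x of length 784), a minimal NumPy sketch of this one-line model could look like the following; the actual TensorFlow version comes next:

```python
import numpy as np

def softmax(evidence):
    exps = np.exp(evidence - np.max(evidence))  # stability shift; result is unchanged
    return exps / np.sum(exps)

W = np.random.randn(10, 784)  # weights (illustrative values)
b = np.zeros(10)              # biases
x = np.random.rand(784)       # one flattened input image (illustrative)

# The whole model in one line: matrix multiply, vector add, then softmax
y = softmax(W.dot(x) + b)
print(y.shape, round(y.sum(), 6))  # (10,) 1.0
```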

Now, let's turn that into something that TensorFlow can use.
