Chapter 3. Cats Versus Dogs: Transfer Learning in 30 Lines with Keras

Imagine that we want to learn how to play the melodica, a wind instrument in the form of a handheld keyboard. Without a musical background, and the melodica being our very first instrument, it might take us a few months to become proficient at playing it. In contrast, if we were already skilled at playing another instrument, such as the piano, it might just take a few days, given how similar the two instruments are. Taking the learnings from one task and fine tuning them on a similar task is something we often do in real life (as illustrated in Figure 3-1). The more similar the two tasks are, the easier it is to adapt the learning from one task to the other.

We can apply this phenomenon from real life to the world of deep learning. Starting a deep learning project can be relatively quick when using a pretrained model, which reuses the knowledge that it learned during its training and adapts it to the task at hand. This process is known as transfer learning.

In this chapter, we use transfer learning to modify existing models by training our own classifier in minutes using Keras. By the end of this chapter, we will have several tools in our arsenal to create high-accuracy image classifiers for any task.

Adapting Pretrained Models to New Tasks

Before we discuss the process of transfer learning, let’s quickly take a step back and review the primary reasons for the boom in deep learning:

Availability of bigger and better-quality datasets like ImageNet
Better compute available; i.e., faster and cheaper GPUs
Better algorithms (model architecture, optimizer, and training procedure)
Availability of pretrained models that have taken months to train but can be quickly reused

The last point is probably one of the biggest reasons for the widespread adoption of deep learning by the masses. If every training task took a month, not more than a handful of researchers with deep pockets would be working in this area. Thanks to transfer learning, the underappreciated hero of training models, we can now modify an existing model to suit our task in as little as a few minutes.

For example, we saw in Chapter 2 that the pretrained ResNet-50 model, which is trained on ImageNet, can predict feline and canine breeds, among thousands of other categories. So, if we just want to classify between the high-level “cat” and “dog” categories (and not the lower-level breeds), we can begin with the ResNet-50 model and quickly retrain it to classify cats and dogs. All we need to do is show it a dataset with these two categories during training, which should take anywhere from a few minutes to a few hours. In comparison, if we had to train a cat versus dog model without a pretrained model, it could take several hours to days.

From the Creator’s Desk

By Jeremy Howard, cofounder of fast.ai and former chief scientist at Kaggle

Hundreds of thousands of students have studied deep learning through fast.ai. Our goal is to get them up and running as quickly as possible, solving real problems quickly. So what’s the first thing we teach? It’s transfer learning!

Thousands of students have now shared their success stories on our forum (http://forums.fast.ai) describing how with as few as 30 images they have created 100% accurate image classifiers. We’ve also heard from students that have broken academic records in many domains and created commercially valuable models using this simple technique.

Five years ago, I created Enlitic, the first company to focus on deep learning for medicine. As an initial proof of concept, I decided to develop a lung tumor classifier from CT scans. You can probably guess what technique we used...yes, it was transfer learning! In our open source fast.ai library we make transfer learning trivially easy—it’s just three lines of code, and the most important best practices are built in.

A Shallow Dive into Convolutional Neural Networks

We have been using the term “model” to refer to the part of AI that makes our predictions. In deep learning for computer vision, that model is usually a special type of neural network called a CNN. Although we explore CNNs in greater detail later in the book, we look at them very briefly in relation to training them via transfer learning here.

In machine learning, we need to convert data into a set of discernible features and then add a classification algorithm to classify them. It’s the same with CNNs. They consist of two parts: convolutional layers and fully connected layers. The job of the convolutional layers is to take the large number of pixels of an image and convert them into a much smaller representation; that is, features. The fully connected layers convert these features into probabilities. A fully connected layer is really a neural network with hidden layers, as we saw in Chapter 1. In summary, the convolutional layers act as feature extractors, whereas the fully connected layers act as classifiers. Figure 3-2 shows a high-level overview of a CNN.

Figure 3-2. A high-level overview of a convolutional neural network

Imagine that we want to detect a human face. We might want to use a CNN to classify an image and determine whether it contains a face. Such a CNN would be composed of several layers connected one after another. These layers represent mathematical operations. The output of one layer is the input to the next. The first (or the lowermost) layer is the input layer, where the input image is fed. The last (or the topmost) layer is the output layer, which gives the predictions.

The way it works is the image is fed into the CNN and passes through a series of layers, with each performing a mathematical operation and passing the result to the subsequent layer. The resulting output is a list of object classes and their probabilities. For example, categories like ball—65%, grass—20%, and so on. If the output for an image contains a “face” class with a 70% probability, we conclude that there is a 70% likelihood that the image contains a human face.
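To make the idea of "a list of object classes and their probabilities" concrete, here is a minimal sketch of how raw scores from a final layer are commonly turned into probabilities with a softmax. The class names and scores are made up for illustration and do not come from a real network.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes (illustrative values only)
class_names = ["face", "ball", "grass"]
raw_scores = [2.0, 0.5, -1.0]

probabilities = softmax(raw_scores)
best = max(range(len(class_names)), key=lambda i: probabilities[i])

for name, p in zip(class_names, probabilities):
    print(f"{name}: {p:.1%}")
print("Prediction:", class_names[best])
```

The class with the highest probability is reported as the prediction; the remaining probabilities indicate how confident the network is in the alternatives.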

Note

An intuitive (and overly simplified) way to look at CNNs is to see them as a series of filters. As the word filter implies, each layer acts as a sieve of information, letting something “pass through” only if it recognizes it. (If you have heard of high-pass and low-pass filters in electronics, this might seem familiar.) We say that the layer was “activated” for that information. Each layer is activated for visual patterns resembling parts of cats, dogs, cars, and so forth. If a layer does not recognize information (due to what it learned while training), its output is close to zero. CNNs are the “bouncers” of the deep learning world!

In the facial detection example, lower-level layers (Figure 3-3, a; layers that are closer to the input image) are “activated” for simpler shapes; for example, edges and curves. Because these layers activate only for basic shapes, they can be easily reused for a different purpose than face recognition such as detecting a car (every image is composed of edges and curves, after all). Middle-level layers (Figure 3-3 b) are activated for more complex shapes such as eyes, noses, and lips. These layers are not nearly as reusable as the lower-level layers. They might not be as useful for detecting a car, but might still be useful for detecting animals. And higher-level layers (Figure 3-3 c) are activated for even more complex shapes; for example, most of the human face. These layers tend to be more task-specific and thus the least reusable across other image classification problems.

Figure 3-3. (a) Lower-level activations, followed by (b) mid-level activations and (c) upper-layer activations (image source: Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, Lee et al., ICML 2009)

The complexity and power of what a layer can recognize increases as we approach the final layers. Conversely, the reusability of a layer decreases as we get closer to the output. This will become apparent very soon when we look at what these layers learn.

Transfer Learning

If we want to transfer knowledge from one model to another, we want to reuse more of the generic layers (closer to the input) and fewer of the task-specific layers (closer to the output). In other words, we want to remove the last few layers (typically the fully connected layers) so that we can utilize the more generic ones, and add layers that are geared toward our specific classification task. Once training begins, the generic layers (which form the majority of our new model) are kept frozen (i.e., they are unmodifiable), whereas the newly added task-specific layers are allowed to be modified. This is how transfer learning helps quickly train new models. Figure 3-4 illustrates this process for a pretrained model trained for task X adapted to task Y.

Fine Tuning

Basic transfer learning gets us only so far. We usually add only two to three fully connected layers after the generic layers to make the new classifier model. If we want higher accuracy, we must allow more layers to be trained. This means unfreezing some of the layers that would have otherwise been frozen in transfer learning. This is known as fine tuning. Figure 3-5 shows an example where some convolutional layers near the head/top are unfrozen and trained for the task at hand.

It’s obvious that compared to basic transfer learning, more layers are tweaked to our dataset during fine tuning. Because a higher number of layers have adapted to our task compared to transfer learning, we can achieve greater accuracy for our task. The decision on how many layers to fine tune is dependent on the amount of data at hand as well as the similarity of the target task to the original dataset on which the pretrained model was trained.
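The difference between basic transfer learning and fine tuning comes down to which layers are allowed to change. The following sketch illustrates that bookkeeping with a hypothetical `StandInLayer` class that only tracks a `trainable` flag; it is not a real Keras layer, and the layer names and counts are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class StandInLayer:
    """Hypothetical stand-in for a network layer; it only tracks trainability."""
    name: str
    trainable: bool = True

def set_trainable_from(layers, num_unfrozen):
    """Freeze every layer except the last `num_unfrozen` ones."""
    cutoff = len(layers) - num_unfrozen
    for index, layer in enumerate(layers):
        layer.trainable = index >= cutoff

# A toy stack: five "pretrained" convolutional layers plus two new dense layers
layers = [StandInLayer(f"conv_{i}") for i in range(1, 6)]
layers += [StandInLayer("dense_new_1"), StandInLayer("dense_new_2")]

# Basic transfer learning: only the two newly added layers are trained
set_trainable_from(layers, num_unfrozen=2)
transfer_learning = [layer.name for layer in layers if layer.trainable]

# Fine tuning: additionally unfreeze the last two convolutional layers
set_trainable_from(layers, num_unfrozen=4)
fine_tuning = [layer.name for layer in layers if layer.trainable]

print(transfer_learning)
print(fine_tuning)
```

In Keras, the same idea is expressed by setting the `trainable` attribute on each layer of the model before compiling it.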

We often hear data scientists saying, “I fine tuned the model,” which means that they took a pretrained model, removed task-specific layers and added new ones, froze the lower layers, and trained the upper part of the network on the new dataset they had.

Note

In daily lingo, transfer learning and fine tuning are used interchangeably. When spoken, transfer learning is used more as a general concept, whereas fine tuning is referred to as its implementation.

How Much to Fine Tune

How many layers of a CNN should we fine tune? This can be guided by the following two factors:

How much data do we have?: If we have a couple hundred labeled images, it would be difficult to train and test a freshly defined model from scratch (i.e., define a model architecture with random seed weights) because we need a lot more data. The danger of training with such a small amount of data is that these powerful networks might potentially memorize it, leading to undesirable overfitting (which we explore later in the chapter). Instead, we will borrow a pretrained network and fine tune the last few layers. But if we had a million labeled images, it would be feasible to fine tune all layers of the network and, if necessary, train from scratch. So, the amount of task-specific data dictates whether we can fine tune, and how much.
How similar is the data?: If the task-specific data is similar to the data used for the pretrained network, we can fine tune the last few layers. But if our task is identifying different bones in an X-ray image and we want to start out from an ImageNet trained network, the high dissimilarity between regular ImageNet images and X-ray images would require nearly all layers to be trained.

To summarize, Table 3-1 offers an easy-to-follow cheat sheet.

Table 3-1. Cheatsheet for when and how to fine tune
	High similarity among datasets	Low similarity among datasets
Large amount of training data	Fine tune all layers	Train from scratch, or fine tune all layers
Small amount of training data	Fine tune last few layers	Tough luck! Train on a smaller network with heavy data augmentation or somehow get more data
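
As a quick way to internalize Table 3-1, the cheat sheet can be written as a small lookup function. This is only a restatement of the table; the function and parameter names are hypothetical, and what counts as a "large" dataset or a "similar" one remains a judgment call.

```python
def fine_tuning_strategy(large_dataset, similar_to_pretraining_data):
    """Return the Table 3-1 recommendation for the given situation."""
    if large_dataset and similar_to_pretraining_data:
        return "Fine tune all layers"
    if large_dataset:
        return "Train from scratch, or fine tune all layers"
    if similar_to_pretraining_data:
        return "Fine tune last few layers"
    return ("Train on a smaller network with heavy data augmentation, "
            "or get more data")

# Our cats-versus-dogs case: few images, but similar to ImageNet
recommendation = fine_tuning_strategy(large_dataset=False,
                                      similar_to_pretraining_data=True)
print(recommendation)
```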

Enough theory, let’s see it in action.

Building a Custom Classifier in Keras with Transfer Learning

As promised, it’s time to build our state-of-the-art classifier in 30 lines or less. At a high level, we will use the following steps:

Organize the data. Download labeled images of cats and dogs and then divide the images into training and validation folders.
Build the data pipeline. Define a pipeline for reading data, including preprocessing the images (e.g., resizing) and grouping multiple images together into batches.
Augment the data. In the absence of a ton of training images, make small changes (augmentation) like rotation, zooming, and so on to increase variation in training data.
Define the model. Take a pretrained model, remove the last few task-specific layers, and append a new classifier layer. Freeze the weights of original layers (i.e., make them unmodifiable). Select an optimizer algorithm and a metric to track (like accuracy).
Train and test. Train for a few iterations until our validation accuracy is high. Save the model to eventually load as part of any application for predictions.

This will all make sense pretty soon. Let’s explore this process in detail.

Solving the World’s Most Pressing Computer-Vision Problem

In early 2014, Microsoft Research was figuring out how to solve the world’s most pressing problem at the time: “Differentiating cats and dogs.” (Where else would we have gotten the idea for this chapter?) Keep in mind that it was a much more difficult computer-vision problem back then. To facilitate this effort, Microsoft Research released the Asirra (Animal Species Image Recognition for Restricting Access) dataset. The motivation behind the Asirra dataset was to develop a sufficiently challenging CAPTCHA system. More than three million images, labeled by animal shelters throughout the United States, were provided by Petfinder.com to Microsoft Research. When this problem was initially introduced, the highest possible accuracy attained was around 80%. By using deep learning, in just a few weeks, it went to 98%! This (now relatively easy) task shows the power of deep learning.

Organize the Data

It’s essential to understand the distinction between train, validation, and test data. Let’s look at a real-world analogy of a student preparing for standardized exams (e.g., SAT in the US, the Gaokao in China, JEE in India, CSAT in Korea, etc.). The in-class instruction and homework assignments are analogous to the training process. The quizzes, midterms, and other tests in school are the equivalent to the validation—the student is able to take them frequently, assess performance, and make improvements in their study plan. They’re ultimately optimizing for their performance in the final standardized exam for which they get only one chance. The final exam is equivalent to the test set—the student does not get an opportunity to improve here (ignoring the ability to retake the test). This is their one shot at showing what they have learned.

Similarly, our aim is to give the best predictions in the real world. To enable this, we divide our data into three parts: train, validation, and test. A typical distribution would be 80% for train, 10% for validation, and 10% for test. Note that we randomly divide our data into these three sets in order to ensure the least amount of bias that might creep in unknowingly. The final accuracy of the model is determined by the accuracy on the test set, much like the student’s score is determined only on their performance on the standardized exam.
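
A random 80/10/10 split can be sketched in a few lines of plain Python. The filenames below are placeholders, and the function name and seed value are assumptions made for this example; any fixed seed would make the split repeatable.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=12345):
    """Shuffle items reproducibly, then cut them into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test

# Placeholder filenames standing in for a real dataset
filenames = [f"image_{i}.jpg" for i in range(1000)]
train, val, test = split_dataset(filenames)
print(len(train), len(val), len(test))  # 800 100 100
```

Shuffling before cutting is what keeps the three sets free of ordering bias, for example when all cat images happen to be listed before all dog images.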

The model learns from the training data and uses the validation set to evaluate its performance. Machine learning practitioners take this performance as feedback to find opportunities to improve their models on a continuous basis, similar to how students improve their preparation with the help of quizzes. There are several knobs that we can tune to improve performance; for example, the number of layers to train.

In many research competitions (including Kaggle.com), contestants receive a test set that is separate from the data they can use for building the model. This ensures uniformity across the competition when it comes to reporting accuracy. It is up to the contestants to divide the available data into training and validation sets. Similarly, during our experiments in this book, we will continue to divide data in these two sets, keeping in mind that a test dataset is still essential to report real-world numbers.

So why even use a validation set? Data is sometimes difficult to obtain, so why not use all the available samples for training, and then report accuracy on them? Sure, when the model begins to learn, it will gradually give higher accuracy predictions on the training dataset (called training accuracy). But because they are so powerful, deep neural networks can potentially memorize the training data, even resulting in 100% accuracy on the training data sometimes. However, its real-world performance will be quite poor. It’s like if the student knew the questions that would be on the exam before taking it. This is why a validation set, not used to train the model, gives a realistic assessment of the model performance. Even though we might assign 10-15% of the data as a validation set, it will go a long way in guiding us on how good our model really is.
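
The memorization problem can be demonstrated with a deliberately silly "model" that stores every training example in a dictionary. This is a toy sketch, not a neural network: the task (deciding whether a number is even) and the `Memorizer` class are assumptions invented for illustration.

```python
def true_label(x):
    """The underlying rule we would like a model to learn (toy task)."""
    return x % 2 == 0

class Memorizer:
    """A 'model' that memorizes training examples instead of learning a rule."""
    def __init__(self):
        self.seen = {}

    def fit(self, xs, ys):
        self.seen = dict(zip(xs, ys))

    def predict(self, x):
        # For inputs never seen in training, it can only guess (always False)
        return self.seen.get(x, False)

def accuracy(model, xs, ys):
    correct = sum(model.predict(x) == y for x, y in zip(xs, ys))
    return correct / len(xs)

train_x = list(range(0, 100))
val_x = list(range(100, 200))      # never shown during training
train_y = [true_label(x) for x in train_x]
val_y = [true_label(x) for x in val_x]

model = Memorizer()
model.fit(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)
val_acc = accuracy(model, val_x, val_y)
print(train_acc, val_acc)  # 1.0 0.5
```

Reporting only the training accuracy would suggest a perfect model, whereas the validation accuracy reveals that it is no better than a coin flip on new data.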

For the training process, we need to store our dataset in the proper folder structure. We’ll divide the images into two sets: training and validation. For an image file, Keras will automatically assign the name of the class (category) based on its parent folder name. Figure 3-6 depicts the ideal structure to recreate.

Figure 3-6. Example directory structure of the training and validation data for different classes

The following sequence of commands can help download the data and achieve this directory structure:

$ wget https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/download/train.zip
$ unzip train.zip
$ mv train data
$ cd data
$ mkdir train val
$ mkdir train/cat train/dog
$ mkdir val/cat val/dog

The 25,000 files within the data folder are prefixed with “cat” and “dog.” Now, move the files into their respective directories. To keep our initial experiment short, we pick 250 random files per class and place them in training and validation folders. We can increase/decrease this number anytime, to experiment with a trade-off between accuracy and speed:

$ ls | grep cat | sort -R | head -250 | xargs -I {} mv {} train/cat/
$ ls | grep dog | sort -R | head -250 | xargs -I {} mv {} train/dog/
$ ls | grep cat | sort -R | head -250 | xargs -I {} mv {} val/cat/
$ ls | grep dog | sort -R | head -250 | xargs -I {} mv {} val/dog/
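
For readers who prefer to stay in Python, the same file shuffling can be sketched with the standard library. This is an alternative to the shell commands, not a replacement for them. It assumes filenames of the form `cat.<number>.jpg` and `dog.<number>.jpg`; the `organize` function is hypothetical, and the demo runs on empty placeholder files in a temporary folder rather than on real images.

```python
import random
import shutil
import tempfile
from pathlib import Path

def organize(data_dir, per_class, seed=12345):
    """Move `per_class` random files per class into train/ and val/ subfolders."""
    data_dir = Path(data_dir)
    rng = random.Random(seed)
    for class_name in ("cat", "dog"):
        # Assumption: files are named like "cat.0.jpg" or "dog.17.jpg"
        files = sorted(p for p in data_dir.glob(f"{class_name}.*") if p.is_file())
        rng.shuffle(files)
        splits = {"train": files[:per_class],
                  "val": files[per_class:2 * per_class]}
        for split, chosen in splits.items():
            target = data_dir / split / class_name
            target.mkdir(parents=True, exist_ok=True)
            for path in chosen:
                shutil.move(str(path), str(target / path.name))

# Demo on empty placeholder files (10 per class), moving 3 per class per split
demo_dir = Path(tempfile.mkdtemp())
for i in range(10):
    (demo_dir / f"cat.{i}.jpg").touch()
    (demo_dir / f"dog.{i}.jpg").touch()

organize(demo_dir, per_class=3)
counts = {f"{split}/{cls}": len(list((demo_dir / split / cls).iterdir()))
          for split in ("train", "val") for cls in ("cat", "dog")}
print(counts)
```

On the real dataset, the call would be `organize("data", per_class=250)` to mirror the commands above.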

Build the Data Pipeline

To start off with our Python program, we begin by importing the necessary packages:

import math

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Flatten, Dense, Dropout,
                                     GlobalAveragePooling2D)
from tensorflow.keras.applications.mobilenet import MobileNet, preprocess_input

Place the following lines of configuration right after the import statements, which we can modify based on our dataset:

TRAIN_DATA_DIR = 'data/train/'
VALIDATION_DATA_DIR = 'data/val/'
TRAIN_SAMPLES = 500
VALIDATION_SAMPLES = 500
NUM_CLASSES = 2
IMG_WIDTH, IMG_HEIGHT = 224, 224
BATCH_SIZE = 64

Number of Classes

With two classes to distinguish between, we can treat this problem as one of the following:

A binary classification task
A multiclass classification task

Binary classification

As a binary classification task, it’s important to note that “cat versus dog” is really “cat versus not cat.” A dog would be classified as a “not cat” much like a desk or a ball would. For a given image, the model will give a single probability value corresponding to the “cat” class, so the probability of “not cat” is 1 - P(cat). If the probability is higher than 0.5, we predict it as “cat”; otherwise, “not cat.” To keep things simple, we assume that it’s guaranteed that the test set would contain only images of either cats or dogs. Because “cat versus not cat” is a binary classification task, we set the number of classes to 1; that is, “cat.” Anything that cannot be classified as “cat” will be classified as “not cat.”

Tip

Keras processes the input data in the alphabetical order of the folder names. Because “cat” comes before “dog” alphabetically, our first class for prediction is “cat.” For a multiclass task, we can apply the same concept and infer each class identifier (index) based on the folder sort order. Note that the class index starts at 0 for the first class.
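
The alphabetical rule is easy to reproduce by hand. The following sketch derives class indices from folder names the same way the Tip describes; the folder list is typed in directly here rather than read from disk.

```python
# Folder names as they might appear on disk, in arbitrary order
folder_names = ["dog", "cat"]

# Sort alphabetically, then number the classes starting at 0
class_indices = {name: index
                 for index, name in enumerate(sorted(folder_names))}
print(class_indices)  # {'cat': 0, 'dog': 1}
```

In Keras, the generator returned by `flow_from_directory` exposes a mapping of this kind through its `class_indices` attribute, which is worth printing once to confirm which index corresponds to which class.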

Multiclass classification

In a hypothetical world that had only cats and dogs and nothing else, a “not cat” would always be a dog. So the label “not cat” could simply be replaced with the label “dog.” However, in the real world, we have more than two types of objects. As explained before, a ball or a sofa would also be classified as “dog,” which would be incorrect. Hence, for a real-world scenario, treating this as a multiclass classification task instead of a binary classification task is far more useful. As a multiclass classification task, we predict separate probability values for each class, and the highest one is our winner. In the case of “cat versus dog,” we set the number of classes to two. To keep our code reusable for future tasks, we will treat this as a multiclassification task.
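
The two formulations differ mainly in how a prediction is read off the model's output. Here is a minimal sketch of both decision rules; the probability values are invented for illustration, and the function names are hypothetical.

```python
def predict_binary(p_cat, threshold=0.5):
    """Binary case: one probability for 'cat'; the rest belongs to 'not cat'."""
    p_not_cat = 1.0 - p_cat
    label = "cat" if p_cat > threshold else "not cat"
    return label, p_not_cat

def predict_multiclass(probabilities, class_names):
    """Multiclass case: one probability per class; the highest one wins."""
    best = max(range(len(class_names)), key=lambda i: probabilities[i])
    return class_names[best]

binary_label, p_not_cat = predict_binary(0.8)
multiclass_label = predict_multiclass([0.3, 0.7], ["cat", "dog"])
print(binary_label, round(p_not_cat, 2))  # cat 0.2
print(multiclass_label)                   # dog
```

The multiclass version extends to more classes without any change, which is why it keeps the code reusable for future tasks.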

Batch Size

At a high level, the training process includes the following steps:

Make predictions on images (forward pass).
Determine which predictions were incorrect and propagate back the difference between the prediction and the true value (backpropagation).
Rinse and repeat until the predictions become sufficiently accurate.

It’s quite likely that the initial iteration would have accuracy no better than random guessing (about 50% for two balanced classes). Repeating the process several times, however, can yield a highly accurate model (>90%).

The batch size defines how many images are seen by the model at a time. It’s important that each batch has a good variety of images from different classes in order to prevent large fluctuations in the accuracy metric between iterations. A sufficiently large batch size would be necessary for that. However, it’s important not to set the batch size too large; a batch that is too large might not fit in GPU memory, resulting in an “out of memory” crash. Usually, batch sizes are set as powers of 2. A good number to start with is 64 for most problems, and we can play with the number by increasing or decreasing it.
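
To see how the batch size relates to the number of training steps, the following sketch splits 500 samples into batches of 64. The integers stand in for images, and the `batches` helper is a hypothetical function written for this example.

```python
import math

def batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

samples = list(range(500))  # stand-ins for 500 training images
batch_size = 64

steps_per_epoch = math.ceil(len(samples) / batch_size)
batch_sizes = [len(batch) for batch in batches(samples, batch_size)]

print(steps_per_epoch)  # 8
print(batch_sizes)      # [64, 64, 64, 64, 64, 64, 64, 52]
```

Because 500 is not a multiple of 64, the last batch is smaller than the others; rounding up with `math.ceil` makes sure those leftover images are still seen once per pass.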

Data Augmentation

Usually, when we hear deep learning, we associate it with millions of images. So, the 500 images that we have might be too few for real-world training. Deep neural networks are powerful, perhaps a little too powerful for small quantities of data: with a limited set of training images, the network might memorize the training data and show great prediction performance on the training set but poor accuracy on the validation set. In other words, the model has overtrained and does not generalize to previously unseen images. And we definitely don’t want that.

Tip

Often, when we attempt to train a neural network on a small amount of data, the result is a model that performs extremely well on the training data itself but makes rather poor predictions on data that it has not seen before. Such a model would be described as an overfitted model and the problem itself is known as overfitting.

Figure 3-7 illustrates this phenomenon for a distribution of points close to a sine curve (with little noise). The dots represent the training data visible to our network, and the crosses represent the testing data that was not seen during training. On one extreme (underfitting), an unsophisticated model, such as a linear predictor, will not be able to represent the underlying distribution well and a high error rate on both the training data and the test data will result. On the other extreme (overfitting), a powerful model (such as a deep neural network) might have the capacity to memorize the training data, which would result in a really low error on the training data, but still a high error on the testing data. What we want is the happy middle where the training error and the testing error are both modestly low, which ideally ensures that our model will perform just as well in the real world as it does during training.

Figure 3-7. Underfitting, overfitting, and ideal fitting for points close to a sine curve

With great power comes great responsibility. It’s our responsibility to ensure that our powerful deep neural network does not overfit on our data. Overfitting is common when we have little training data. We can reduce this likelihood in a few different ways:

Somehow get more data
Heavily augment existing data
Fine tune fewer layers

There are often situations for which there’s not enough data available. Perhaps we’re working on a niche problem and data is difficult to come by. But there are a few ways that we can artificially augment our dataset for classification:

Rotation: In our example, we might want to rotate the 500 images randomly by 20 degrees in either direction, yielding up to 20,000 possible unique images.
Random Shift: Shift the images slightly to the left, or to the right.
Zoom: Zoom in and out slightly of the image.

By combining rotation, shifting, and zooming, the program can generate an almost infinite number of unique images. This important step is called data augmentation. Data augmentation is useful not only for adding more data, but also for training more robust models for real-world scenarios. For example, not all images have the cat properly centered in the middle or at a perfect 0-degree angle. Keras provides the ImageDataGenerator class, which augments the data while it is being loaded from the directory. To illustrate what data augmentations of images look like, Figure 3-8 showcases example augmentations generated by the imgaug library for a sample image. (Note that we will not be using imgaug for our actual training.)

Figure 3-8. Possible image augmentations generated from a single image
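
To give a feel for what an augmentation does at the pixel level, here is a minimal sketch of a random horizontal shift on a tiny grayscale "image" stored as a list of rows. It is a simplification: vacated pixels are padded with a constant, the shift must be smaller than the image width, and the function names are hypothetical. Real augmentation libraries handle borders and interpolation in their own ways.

```python
import random

def shift_horizontally(image, offset, fill=0):
    """Shift rows right (offset > 0) or left (offset < 0), padding with `fill`."""
    width = len(image[0])
    shifted = []
    for row in image:
        if offset >= 0:
            new_row = [fill] * offset + row[:width - offset]
        else:
            new_row = row[-offset:] + [fill] * (-offset)
        shifted.append(new_row)
    return shifted

def random_shift(image, max_offset, rng):
    """Apply a shift drawn uniformly from [-max_offset, max_offset]."""
    return shift_horizontally(image, rng.randint(-max_offset, max_offset))

image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
rng = random.Random(12345)  # fixed seed so the variations are repeatable

augmented = [random_shift(image, max_offset=1, rng=rng) for _ in range(3)]
for variant in augmented:
    print(variant)
```

Each call produces a slightly different version of the same picture, so the network rarely sees exactly the same input twice.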

Colored images usually have three channels: red, green, and blue. Each channel has an intensity value ranging from 0 to 255. To normalize it (i.e., scale the values down to a small, fixed range), we use the preprocess_input function (which, for MobileNet, maps each pixel value from the 0 to 255 range to between -1 and 1):

train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
                                   rotation_range=20,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   zoom_range=0.2)
val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
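
Pixel normalization itself is simple arithmetic. The sketch below shows two common scaling conventions on individual intensity values; which one a given model's `preprocess_input` applies depends on the model, so treat the choice as something to verify for the network in use. The function names are hypothetical.

```python
def scale_to_unit_range(pixel):
    """Map an intensity in [0, 255] to [0, 1]."""
    return pixel / 255.0

def scale_to_signed_range(pixel):
    """Map an intensity in [0, 255] to [-1, 1]."""
    return pixel / 127.5 - 1.0

for value in (0, 127.5, 255):
    print(value, scale_to_unit_range(value), scale_to_signed_range(value))
```

Whichever convention is used, the same function must be applied at training time and at prediction time; otherwise the model sees inputs in a range it was never trained on.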

Tip

Sometimes knowing the label of a training image can be useful in determining appropriate ways of augmenting it. For example, when training a digit recognizer, you might be okay with augmentation by flipping vertically for an image of the digit “8,” but not for “6” and “9.”

Unlike our training set, we don’t want to augment our validation set. The reason is that with dynamic augmentation, the validation set would keep changing in each iteration, and the resulting accuracy metric would be inconsistent and difficult to compare across other iterations.

It’s time to load the data from its directories. Training one image at a time can be pretty inefficient, so we can batch them into groups. To introduce more randomness during the training process, we’ll keep shuffling the images in each batch. To bring reproducibility during multiple runs of the same program, we’ll give the random number generator a seed value:

train_generator = train_datagen.flow_from_directory(
                        TRAIN_DATA_DIR,
                        target_size=(IMG_WIDTH, IMG_HEIGHT),
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        seed=12345,
                        class_mode='categorical')
validation_generator = val_datagen.flow_from_directory(
                        VALIDATION_DATA_DIR,
                        target_size=(IMG_WIDTH, IMG_HEIGHT),
                        batch_size=BATCH_SIZE,
                        shuffle=False,
                        class_mode='categorical')

Model Definition

Now that the data is taken care of, we come to the most crucial component of our training process: the model. In the code that follows, we reuse a CNN previously trained on the ImageNet dataset (MobileNet in our case), throw away the last few layers, called fully connected layers (i.e., ImageNet-specific classifier layers), and replace them with our own classifier suited to the task at hand.

For transfer learning, we “freeze” the weights of the original model; that is, set those layers as unmodifiable, so only the layers of the new classifier (that we’ll add) can be modified. We use MobileNet here to keep things fast, but this method will work just as well for any neural network. The following lines include a few terms such as Dense, Dropout, and so on. Although we won’t explore them in this chapter, you can find explanations in Appendix A.

def model_maker():
    # Load MobileNet pretrained on ImageNet, without its classifier layers
    base_model = MobileNet(include_top=False,
                           input_shape=(IMG_WIDTH, IMG_HEIGHT, 3))
    for layer in base_model.layers:
        layer.trainable = False  # Freeze the pretrained layers
    # Stack our own small classifier on top of the frozen base
    inputs = Input(shape=(IMG_WIDTH, IMG_HEIGHT, 3))
    custom_model = base_model(inputs)
    custom_model = GlobalAveragePooling2D()(custom_model)
    custom_model = Dense(64, activation='relu')(custom_model)
    custom_model = Dropout(0.5)(custom_model)
    predictions = Dense(NUM_CLASSES, activation='softmax')(custom_model)
    return Model(inputs=inputs, outputs=predictions)

Train the Model

Set Training Parameters

With both the data and model ready, all we have left to do is train the model. This is also known as fitting the model to the data. For training a model, we need to select and modify a few different training parameters.

Loss function: The loss function is the penalty we impose on the model for incorrect predictions during the training process. It is the value of this function that we seek to minimize. For example, in a task to predict house prices, the loss function could be the root-mean-square error.
Optimizer: This is an algorithm that helps minimize the loss function. We use Adam, one of the fastest optimizers out there.
Learning rate: Learning is incremental. The learning rate tells the optimizer how big of a step to take toward the solution; in other words, where the loss is minimum. Take too big of a step, and we end up wildly swinging and overshooting our target. Take too small a step, and it can take a really long time before eventually arriving at the target loss value. It is important to set an optimal learning rate to ensure that we reach our learning goal in a reasonable amount of time. In our example, we set the learning rate at 0.001.
Metric: Choose a metric to judge the performance of the trained model. Accuracy is a good explainable metric, especially when the classes are not imbalanced (i.e., roughly equal amounts of data for each class). Note that this metric is not related to the loss function and is mainly used for reporting and not as feedback for the model.
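
The effect of the learning rate is easiest to see on a toy problem. The sketch below uses plain gradient descent (not Adam) to minimize the one-variable loss x², whose minimum is at x = 0. The loss function, starting point, and step count are assumptions chosen purely for illustration.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize loss(x) = x**2, whose gradient is 2 * x, starting at `start`."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2 * x  # step against the gradient
    return x

too_small = gradient_descent(learning_rate=0.001)
reasonable = gradient_descent(learning_rate=0.1)
too_large = gradient_descent(learning_rate=1.1)

print(f"too small : x = {too_small:.3f}")   # barely moved from 5.0
print(f"reasonable: x = {reasonable:.3f}")  # close to the minimum at 0
print(f"too large : x = {too_large:.3f}")   # overshoots and diverges
```

With the tiny learning rate, 20 steps are not nearly enough to reach the minimum; with the oversized one, every step lands farther from the target than the last.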

In the following piece of code, we create the custom model using the model_maker function that we wrote earlier. We use the parameters described here to customize this model further for our task of cats versus dogs:

model = model_maker()
model.compile(loss='categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              metrics=['acc'])
train_steps = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
validation_steps = math.ceil(VALIDATION_SAMPLES / BATCH_SIZE)
# In TensorFlow 2.x, fit() accepts generators; fit_generator() is deprecated
model.fit(train_generator,
          steps_per_epoch=train_steps,
          epochs=10,
          validation_data=validation_generator,
          validation_steps=validation_steps)

Note

You might have noticed the term epoch in the preceding code. One epoch represents a full pass over the training data, in which the network has seen every example in the dataset once. One epoch usually consists of several minibatches.
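
The relationship between epochs, minibatches, and steps is simple arithmetic. The following sketch uses hypothetical numbers (1,000 samples, a batch size of 32, and 10 epochs) that are not the values used in this chapter; the helper name steps_per_epoch is also an assumption for illustration.

```python
import math

def steps_per_epoch(num_samples, batch_size):
    """Number of minibatches needed to see every sample once."""
    # ceil keeps the last, partially filled batch
    return math.ceil(num_samples / batch_size)

# Hypothetical values for illustration only
num_samples, batch_size, epochs = 1000, 32, 10

steps = steps_per_epoch(num_samples, batch_size)
print(steps)           # 32 minibatches per epoch (31 full ones plus 1 partial)
print(steps * epochs)  # 320 weight updates over the whole training run
```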

Start Training

Run this program and let the magic begin. If you don’t have a GPU, brew a cup of coffee while you wait—it might take 5 to 10 minutes. Or why wait, when you can run the notebooks of this chapter (posted on GitHub) on Colab with a GPU runtime for free?

When complete, notice that there are four statistics: loss and acc on both the training and validation data. We are rooting for val_acc:

> Epoch 1/100 7/7 [====] - 5s - 
loss: 0.6888 - acc: 0.6756 - val_loss: 0.2786 - val_acc: 0.9018
> Epoch 2/100 7/7 [====] - 5s - 
loss: 0.2915 - acc: 0.9019 - val_loss: 0.2022 - val_acc: 0.9220
> Epoch 3/100 7/7 [====] - 4s - 
loss: 0.1851 - acc: 0.9158 - val_loss: 0.1356 - val_acc: 0.9427
> Epoch 4/100 7/7 [====] - 4s - 
loss: 0.1509 - acc: 0.9341 - val_loss: 0.1451 - val_acc: 0.9404
> Epoch 5/100 7/7 [====] - 4s - 
loss: 0.1455 - acc: 0.9464 - val_loss: 0.1637 - val_acc: 0.9381
> Epoch 6/100 7/7 [====] - 4s - 
loss: 0.1366 - acc: 0.9431 - val_loss: 0.2319 - val_acc: 0.9151
> Epoch 7/100 7/7 [====] - 4s - 
loss: 0.0983 - acc: 0.9606 - val_loss: 0.1420 - val_acc: 0.9495
> Epoch 8/100 7/7 [====] - 4s - 
loss: 0.0841 - acc: 0.9731 - val_loss: 0.1423 - val_acc: 0.9518
> Epoch 9/100 7/7 [====] - 4s - 
loss: 0.0714 - acc: 0.9839 - val_loss: 0.1564 - val_acc: 0.9509
> Epoch 10/100 7/7 [====] - 5s - 
loss: 0.0848 - acc: 0.9677 - val_loss: 0.0882 - val_acc: 0.9702

All it took was 5 seconds in the very first epoch to reach 90% accuracy on the validation set, with just 500 training images. Not bad! And by the 10th epoch, we observe about 97% validation accuracy. That’s the power of transfer learning.

Let us take a moment to appreciate what happened here. With just 500 images, we were able to reach a high level of accuracy in a matter of a few seconds and with very little code. In contrast, if we did not have a model previously trained on ImageNet, getting an accurate model might have needed training time anywhere from a couple of hours to a few days, and tons more data.

That’s all the code we need to train a highly accurate classifier on a new image classification problem. Place the data into folders named after each class, and change the corresponding values in the configuration variables. If the task has more than two classes, use categorical_crossentropy as the loss function and softmax as the activation function in the last layer. Table 3-2 summarizes these choices.

Table 3-2. Deciding the loss and activation type based on the task
Classification type	Class mode	Loss	Activation on the last layer
1 or 2 classes	binary	binary_crossentropy	sigmoid
Multiclass, single label	categorical	categorical_crossentropy	softmax
Multiclass, multilabel	categorical	binary_crossentropy	sigmoid
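
The choices in Table 3-2 can also be kept in a small lookup so that the class mode, loss, and last-layer activation always change together. This is only a sketch: the dictionary, the function name, and the three keys are hypothetical names chosen for illustration, and the values are copied from the table.

```python
# Values copied from Table 3-2; the key names are illustrative assumptions
TRAINING_CONFIG = {
    'binary': {
        'class_mode': 'binary',
        'loss': 'binary_crossentropy',
        'activation': 'sigmoid'},
    'multiclass_single_label': {
        'class_mode': 'categorical',
        'loss': 'categorical_crossentropy',
        'activation': 'softmax'},
    'multiclass_multilabel': {
        'class_mode': 'categorical',
        'loss': 'binary_crossentropy',
        'activation': 'sigmoid'},
}

def training_config(classification_type):
    """Return a copy of the settings for the given classification type."""
    return dict(TRAINING_CONFIG[classification_type])

config = training_config('multiclass_single_label')
print(config['loss'], config['activation'])
```

The returned values would then be passed to the data generator (class_mode), to model.compile (loss), and to the last Dense layer (activation).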

Before we forget, save the model that you just trained so that we can use it later:

model.save('model.h5')

Test the Model

Now that we have a trained model, we will want to use it later in our application. We can load this model anytime and classify an image. load_model, as its name suggests, loads the model:

from tensorflow.keras.models import load_model
model = load_model('model.h5')

Now let’s try loading our original sample images and see what results we get:

img_path = '../../sample_images/dog.jpg'
img = image.load_img(img_path, target_size=(224,224))
img_array = image.img_to_array(img)
expanded_img_array = np.expand_dims(img_array, axis=0)
preprocessed_img = preprocess_input(expanded_img_array) # Preprocess the image
prediction = model.predict(preprocessed_img)
print(prediction)
print(validation_generator.class_indices)
[[0.9967706]]
{'dog': 1, 'cat': 0}

Printing the value of the probability, we see that it is approximately 0.997. This is the probability of the given image belonging to class “1,” which is dog. Because the probability is greater than 0.5, the image is predicted as a dog.
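
Turning a single probability into a class name is a matter of comparing it with 0.5 and inverting the class_indices mapping. Here is a minimal sketch; the function name decode_binary_prediction and its threshold parameter are assumptions for illustration, while the probability and the mapping are the values printed above.

```python
def decode_binary_prediction(probability, class_indices, threshold=0.5):
    """Map the probability of class 1 to a class name."""
    # Invert {'dog': 1, 'cat': 0} into {1: 'dog', 0: 'cat'}
    index_to_class = {index: name for name, index in class_indices.items()}
    predicted_index = 1 if probability > threshold else 0
    return index_to_class[predicted_index]

class_indices = {'dog': 1, 'cat': 0}
print(decode_binary_prediction(0.9967706, class_indices))  # dog
```

With the prediction array shown earlier, the probability would be read as prediction[0][0] before being passed in.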

That’s all that we need to train our own classifiers. Throughout this book, you can expect to reuse this code for training with minimal modifications. You can also reuse this code in your own projects. Play with the number of epochs and images, and observe how they affect the accuracy. Also try the code on any other data you can find online. It doesn’t get easier than that!

Analyzing the Results

With our trained model, we can analyze how it’s performing on the validation dataset. Beyond the more straightforward accuracy metrics, looking at the actual images of mispredictions should give an intuition as to whether the example was truly challenging or whether our model is not sophisticated enough.

There are three questions that we want to answer for each category (cat, dog):

Which images are we most confident about being a cat/dog?
Which images are we least confident about being a cat/dog?
Which images have incorrect predictions in spite of being highly confident?

Before we get to that, let’s get predictions over the entire validation dataset. First, we set the pipeline configuration correctly:

# VARIABLES
IMG_WIDTH, IMG_HEIGHT = 224, 224
VALIDATION_DATA_DIR = 'data/val_data/'
VALIDATION_BATCH_SIZE = 64

# DATA GENERATORS
validation_datagen = ImageDataGenerator(
        preprocessing_function=preprocess_input)
validation_generator = validation_datagen.flow_from_directory(
        VALIDATION_DATA_DIR,
        target_size=(IMG_WIDTH, IMG_HEIGHT),
        batch_size=VALIDATION_BATCH_SIZE,
        shuffle=False,
        class_mode='categorical')
ground_truth = validation_generator.classes

Then, we get the predictions:

predictions = model.predict_generator(validation_generator)

To make our analysis easier, we make a dictionary that maps each image index to its prediction and ground truth (the expected prediction):

# prediction_table is a dict with index, prediction, ground truth
prediction_table = {}
for index, val in enumerate(predictions):
    # get argmax index
    index_of_highest_probability = np.argmax(val)
    value_of_highest_probability = val[index_of_highest_probability]
    prediction_table[index] = [value_of_highest_probability,
                               index_of_highest_probability,
                               ground_truth[index]]
assert len(predictions) == len(ground_truth) == len(prediction_table)

For the next two code blocks, we provide boilerplate code, which we reuse regularly throughout the book.

We use a helper function, get_images_with_sorted_probabilities(), to find the images with the highest/lowest probability value for a given category. Additionally, we use another helper function, display(), to output the images as a grid on-screen:

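def display(sorted_indices, message):
    # sorted_indices: list of (image index, [probability, predicted class, ground truth])
    similar_image_paths = []
    distances = []
    for name, value in sorted_indices:
        [probability, predicted_index, gt] = value
        # fnames is assumed to hold the validation filenames in generator
        # order (e.g., validation_generator.filenames)
        similar_image_paths.append(VALIDATION_DATA_DIR + fnames[name])
        distances.append(probability)
    # plot_images is a plotting helper defined elsewhere (not shown here)
    plot_images(similar_image_paths, distances, message)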

This function is defined on the book’s GitHub website (see http://PracticalDeepLearning.ai), at code/chapter-3.
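
For readers who want to follow along without the repository at hand, here is one possible sketch of get_images_with_sorted_probabilities(). It is not the book's implementation; it is an assumption that only mirrors the parameter names used in the calls in this section and the entry layout that display() unpacks ([probability, predicted class, ground truth]). The toy_table values are made up.

```python
def get_images_with_sorted_probabilities(prediction_table,
                                         get_highest_probability,
                                         label,
                                         number_of_items,
                                         only_false_predictions=False):
    """Return up to number_of_items (index, entry) pairs predicted as label."""
    selected = []
    for index, (probability, predicted_index, gt) in prediction_table.items():
        if predicted_index != label:
            continue  # keep only images predicted as the requested class
        if only_false_predictions and predicted_index == gt:
            continue  # keep only mispredictions when asked to
        selected.append((index, [probability, predicted_index, gt]))
    # Highest probability first, or lowest first when the flag is False
    selected.sort(key=lambda item: item[1][0], reverse=get_highest_probability)
    return selected[:number_of_items]

# Made-up table: index -> [probability, predicted class, ground truth]
toy_table = {
    0: [0.99, 1, 1],
    1: [0.55, 1, 0],
    2: [0.80, 1, 1],
    3: [0.97, 0, 0],
}
print(get_images_with_sorted_probabilities(toy_table, True, 1, 2))
print(get_images_with_sorted_probabilities(toy_table, True, 1, 2,
                                           only_false_predictions=True))
```

The first call returns the two most confident dog predictions (indices 0 and 2); the second returns only index 1, the cat that was predicted as a dog.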

Now the fun starts! Which images are we most confident contain dogs? Let’s find images with the highest prediction probability (i.e., closest to 1.0; see Figure 3-9) with the predicted class dog (i.e., 1):

# Most confident predictions of 'dog'
indices = get_images_with_sorted_probabilities(prediction_table,
    get_highest_probability=True, label=1, number_of_items=10,
    only_false_predictions=False)
message = 'Images with the highest probability of containing dogs'
display(indices[:10], message)

Figure 3-9. Images with the highest probability of containing dogs

These images are indeed very dog-like. One of the reasons the probability is so high may be the fact that the images contain multiple dogs, as well as clear, unambiguous views. Now let’s try to find which images we are least confident contain dogs (see Figure 3-10):

# Least confident predictions of 'dog'
indices = get_images_with_sorted_probabilities(prediction_table,
    get_highest_probability=False, label=1, number_of_items=10,
    only_false_predictions=False)
message = 'Images with the lowest probability of containing dogs'
display(indices[:10], message)

Figure 3-10. Images with the lowest probability of containing dogs

To repeat, these are the images our classifier is least sure contain a dog. Most of these predictions sit at the tipping point (i.e., around 0.5 probability) for dog to be the predicted class; the probability of being a cat is only slightly smaller, around 0.49. Compared to the previous set of images, the animals appearing in these images are often smaller and less clear. These images also often result in mispredictions: only 2 of the 10 images were correctly predicted. One possible way to do better here is to train with a larger set of images.

If you are concerned about these misclassifications, worry not. A simple trick to improve the accuracy of the predictions we act on is to require a higher threshold, say 0.75, before accepting the classifier’s result. If the classifier’s confidence in an image’s category falls below the threshold, its result is withheld. In Chapter 5, we look at how to find an optimal threshold.
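
The thresholding trick can be sketched in a few lines. The function name classify_with_threshold is a hypothetical helper, the 0.75 threshold is the example value from the text, and the probabilities below are made up for illustration.

```python
def classify_with_threshold(probabilities, class_names, threshold=0.75):
    """Return the predicted class name, or None if confidence is too low."""
    # Index of the most probable class
    best_index = max(range(len(probabilities)),
                     key=lambda i: probabilities[i])
    if probabilities[best_index] < threshold:
        return None  # withhold the result instead of guessing
    return class_names[best_index]

class_names = ['cat', 'dog']  # index 0 is cat, index 1 is dog
print(classify_with_threshold([0.49, 0.51], class_names))  # None
print(classify_with_threshold([0.03, 0.97], class_names))  # dog
```

The trade-off is coverage: the higher the threshold, the more images the classifier declines to label.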

Speaking of mispredictions, they are obviously expected when the classifier has low confidence (i.e., near 0.5 probability for a two-class problem). But what we don’t want is to mispredict when our classifier is really sure of its predictions. Let’s check which images the classifier is confident contain dogs in spite of them being cats (see Figure 3-11):

# Incorrect predictions of 'dog'
indices = get_images_with_sorted_probabilities(prediction_table,
    get_highest_probability=True, label=1, number_of_items=10,
    only_false_predictions=True)
message = 'Images of cats with the highest probability of containing dogs'
display(indices[:10], message)

Figure 3-11. Images of cats with the highest probability of containing dogs

Hmm…turns out half of these images contain both cats and dogs, and our classifier is correctly predicting the dog category because the dogs are larger than the cats in these images. Thus, it’s not the classifier but the data that is incorrect here. This often happens in large datasets. The remaining half often contains unclear and relatively small objects (but ideally we want lower confidence for these difficult-to-identify images).

Repeating the same set of questions for the cat class, which images are the most cat-like (see Figure 3-12)?

# Most confident predictions of 'cat'
indices = get_images_with_sorted_probabilities(prediction_table,
    get_highest_probability=True, label=0, number_of_items=10,
    only_false_predictions=False)
message = 'Images with the highest probability of containing cats'
display(indices[:10], message)

Figure 3-12. Images with the highest probability of containing cats

Interestingly, many of these have multiple cats. This affirms our previous hypothesis that multiple clear, unambiguous views of cats can give higher probabilities. On the other hand, which images are we most unsure about containing cats (see Figure 3-13)?

# Least confident predictions of 'cat'
indices = get_images_with_sorted_probabilities(prediction_table,
    get_highest_probability=False, label=0, number_of_items=10,
    only_false_predictions=False)
message = 'Images with the lowest probability of containing cats'
display(indices[:10], message)

Figure 3-13. Images with the lowest probability of containing cats

As seen previously, the key object is small, and some of the images are quite unclear, meaning that there is too much contrast in some cases or the object is too bright, something not in line with most of the training images. For example, the camera flash in the eighth (dog.6680) and tenth (dog.1625) images in Figure 3-13 makes the dog difficult to recognize. The sixth image contains a dog in front of a sofa of the same color. Two images contain cages.

Lastly, which images is our classifier mistakenly sure of containing cats (see Figure 3-14)?

# Incorrect predictions of 'cat'
indices = get_images_with_sorted_probabilities(prediction_table,
    get_highest_probability=True, label=0, number_of_items=10,
    only_false_predictions=True)
message = 'Images of dogs with the highest probability of containing cats'
display(indices[:10], message)

Figure 3-14. Images of dogs with the highest probability of containing cats

These mispredictions are what we want to reduce. Some of them are clearly wrong, whereas others are understandably confusing images. The sixth image (dog.4334) in Figure 3-14 seems to be incorrectly labeled as a dog. The seventh and tenth images are difficult to distinguish from the background. The first and tenth lack enough texture to give the classifier enough identification power. And some of the dogs are too small, like the second and fourth.

Going over the various analyses, we can summarize that mispredictions can be caused by low illumination, unclear or difficult-to-distinguish backgrounds, lack of texture, and a small occupied area relative to the image.

Analyzing our predictions is a great way to understand what our model has learned and what it’s bad at, and it highlights opportunities to enhance its predictive power. Increasing the number of training examples and using more robust augmentation will help improve the classification. It’s also important to note that training our model on real-world images (images that look similar to the scenario where our app will be used) will help improve its accuracy drastically. In Chapter 5, we make the classifier more robust.

Summary

In this chapter, we introduced the concept of transfer learning. We reused a pretrained model to build our own cats versus dogs classifier in under 30 lines of code and with barely 500 images, reaching about 97% validation accuracy in a few minutes. By writing this code, we also debunked the myth that we need millions of images and powerful GPUs to train our classifier (though they help).

Hopefully, with these skills, you might be able to finally answer the age-old question of who let the dogs out.

In the next couple of chapters, we use this learning to understand CNNs in more depth and take the model accuracy to the next level.
