Chapter 2. What’s in the Picture: Image Classification with Keras

If you have skimmed through deep learning literature, you might have come across a barrage of academic explanations laced with intimidating mathematics. Don’t worry. We will ease you into practical deep learning with an example of classifying images with just a few lines of code.

In this chapter, we take a closer look at the Keras framework, discuss its place in the deep learning landscape, and then use it to classify a few images using existing state-of-the-art classifiers. We visually investigate how these classifiers operate by using heatmaps. With these heatmaps, we make a fun project in which we classify objects in videos.

Recall from the “Recipe for the Perfect Deep Learning Solution” that we need four ingredients to create our deep learning recipe: hardware, dataset, framework, and model. Let’s see how each of these comes into play in this chapter:

  • We begin with the easy one: hardware. Even an inexpensive laptop would suffice for what we’re doing in this chapter. Alternatively, you can run the code in this chapter by opening the GitHub notebook (see http://PracticalDeepLearning.ai) in Colab. This is just a matter of a few mouse clicks.

  • Because we won’t be training a neural network just yet, we don’t need a dataset (other than a handful of sample photos to test with).

  • Next, we come to the framework. This chapter’s title has Keras in it, so that is what we will be using for now. In fact, we use Keras for our training needs throughout a good part of the book.

  • One way to approach a deep learning problem is to obtain a dataset, write the code to train it, spend a lot of time and energy (both human and electrical) training that model, and then use it for making predictions. But we are not gluttons for punishment. So, we will use a pretrained model instead. After all, the research community has already spent blood, sweat, and tears training and publishing many of the standard models that are now publicly available. We will be reusing one of the more famous models, ResNet-50, the little sibling of ResNet-152, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015.

You will get hands-on with some code in this chapter. As we all know, the best way to learn is by doing. You might be wondering, though, what’s the theory behind this? That comes in later chapters, in which we delve deeper into the nuts and bolts of CNNs using this chapter as a foundation.

Introducing Keras

As Chapter 1 discussed, Keras started in 2015 as an easy-to-use abstraction layer over other libraries, making rapid prototyping possible. This made the learning curve a lot less steep for beginners of deep learning. At the same time, it made deep learning experts more productive by helping them rapidly iterate on experiments. In fact, the majority of the winning teams on Kaggle.com (which hosts data science competitions) have used Keras. Eventually, in 2017, the full implementation of Keras was available directly in TensorFlow, thereby combining the high scalability, performance, and vast ecosystem of TensorFlow with the ease of Keras. On the web, we often see the TensorFlow version of Keras referred to as tf.keras.

In this chapter and Chapter 3, we write all of the code exclusively in Keras. That includes boilerplate functions such as file reading, image manipulation (augmentation), and so on. We do this primarily for ease of learning. From Chapter 5 onward, we begin to gradually use more of the native performant TensorFlow functions directly for more configurability and control.

Predicting an Image’s Category

In layperson’s terms, image classification answers the question: “what object does this image contain?” More specifically, it answers “With what probability does this image contain object X?”, where X comes from a predefined list of object categories. If the probability is higher than a minimum threshold, the image is likely to contain one or more instances of X.

A simple image classification pipeline would consist of the following steps:

  1. Load an image.

  2. Resize it to a predefined size such as 224 x 224 pixels.

  3. Scale the pixel values to the range [0,1] or [–1,1], a.k.a. normalization.

  4. Select a pretrained model.

  5. Run the pretrained model on the image to get a list of category predictions and their respective probabilities.

  6. Display a few of the highest probability categories.

Tip

The GitHub link is provided on the website http://PracticalDeepLearning.ai. Navigate to code/chapter-2, where you will find the Jupyter notebook 1-predict-class.ipynb, which details all of these steps.

We begin by importing all of the necessary modules from the Keras and Python packages:

import tensorflow as tf
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
import numpy as np
import matplotlib.pyplot as plt

Next, we load and display the image that we want to classify (see Figure 2-1):

img_path = "../../sample-images/cat.jpg"
img = image.load_img(img_path, target_size=(224, 224))
plt.imshow(img)
plt.show()
Figure 2-1. Plot showing the contents of the input file

Yup, it’s a cat (although the filename kind of gave it away). And that’s what our model should ideally be predicting.

Before feeding any image to Keras, we want to convert it to a standard format. This is because pretrained models expect the input to be of a specific size. The standardization in our case involves resizing the image to 224 x 224 pixels.

Most deep learning models expect a batch of images as input. But what do we do when we have just one image? We create a batch of one image, of course! That essentially involves making an array consisting of that one object. Another way to look at this is to expand the number of dimensions from three (representing the three channels of the image) to four (the extra one for the length of the array itself).

If that is not clear, consider this scenario: for a batch of 64 images of size 224 x 224 pixels, each containing three channels (RGB), the object representing that batch would have a shape 64 x 224 x 224 x 3. In the code that follows, where we’d be using only one 224 x 224 x 3 image, we’d create a batch of just that image by expanding the dimensions from three to four. The shape of this newly created batch would be 1 x 224 x 224 x 3:

img_array = image.img_to_array(img)
img_batch = np.expand_dims(img_array, axis=0) # Increase the number of dimensions

In machine learning, models perform best when they are fed with data within a consistent range. Ranges typically include [0,1] and [–1,1]. Given that image pixel values are between 0 and 255, running the preprocess_input function from Keras on input images will normalize each pixel to a standard range. Normalization or feature scaling is one of the core steps in preprocessing images to make them suitable for deep learning.
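Applying it to our one-image batch is a single call (the same line appears inside the classify function later in this chapter); preprocess_input applies exactly the transformation that the ResNet-50 weights were trained with, so there is no need to hand-roll our own scaling:

img_preprocessed = preprocess_input(img_batch)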

Now comes the model. We will be using a Convolutional Neural Network (CNN) called ResNet-50. The very first question we should ask is, “Where will I find the model?” Of course, we could hunt for it on the internet to find something that is compatible with our deep learning framework (Keras). But ain’t nobody got time for that! Luckily, Keras loves to make things easy and provides it to us in a single function call. After we call this function for the first time, the model will be downloaded from a remote server and cached locally:

model = tf.keras.applications.resnet50.ResNet50()

When predicting with this model, the results include a probability for each of the 1,000 ImageNet classes. Keras also provides the decode_predictions function, which translates those class IDs into human-readable category names along with their probabilities.
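Putting these pieces together, the prediction step itself is just two lines (again, the same calls appear in the classify function that follows); prediction is a 1 x 1,000 array holding one probability per ImageNet category:

prediction = model.predict(img_preprocessed)
print(decode_predictions(prediction, top=3)[0])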

Now, let’s see the entire code in one handy function:

def classify(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    model = tf.keras.applications.resnet50.ResNet50()
    img_array = image.img_to_array(img)
    img_batch = np.expand_dims(img_array, axis=0)
    img_preprocessed = preprocess_input(img_batch)
    prediction = model.predict(img_preprocessed)
    print(decode_predictions(prediction, top=3)[0])

classify("../../sample-images/cat.jpg")
[('n02123045', 'tabby', 0.50009364),
 ('n02124075', 'Egyptian_cat', 0.21690978),
 ('n02123159', 'tiger_cat', 0.2061722)]

The predicted categories for this image are various types of felines. Why doesn’t it simply predict the word “cat,” instead? The short answer is that the ResNet-50 model was trained on a granular dataset with many categories and does not include the more general “cat.” We investigate this dataset in more detail a little later, but first, let’s load another sample image (see Figure 2-2):

img_path = '../../sample-images/dog.jpg'
img = image.load_img(img_path, target_size=(224, 224))
plt.imshow(img)
plt.show()
Figure 2-2. Plot showing the contents of the file dog.jpg

And, again, we run our handy function from earlier:

classify("../../sample-images/dog.jpg")
[('n02113186', 'Cardigan', 0.809839),
 ('n02113023', 'Pembroke', 0.17665945),
 ('n02110806', 'basenji', 0.0042166105)]

As expected, we get different breeds of canines (and not just the “dog” category). If you are unfamiliar with the Corgi breed of dogs, the word “corgi” literally means “dwarf dog” in Welsh. The Cardigan and Pembroke are subbreeds of the Corgi family, which happen to look pretty similar to one another. It’s no wonder our model thinks that way, too.

Notice the predicted probability of each category. Usually, the prediction with the highest probability is considered the answer. Alternatively, any value over a predefined threshold can be considered as the answer, too. In the dog example, if we set a threshold of 0.5, Cardigan would be our answer.
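As a quick sketch of that thresholding logic (assuming prediction still holds the model output for the dog image from the previous example), we could filter the decoded results like this:

threshold = 0.5
# Each entry is a (wordnet_id, category_name, probability) tuple
results = decode_predictions(prediction, top=5)[0]
confident = [(name, prob) for (_, name, prob) in results if prob >= threshold]
print(confident)  # with a 0.5 threshold, only 'Cardigan' remains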

Figure 2-3. Running the notebook on Google Colab using the browser
Tip

You can follow along with the code in this chapter and execute it interactively without any installations in the browser itself with Google Colab. Simply find the “Run on Colab” link at the top of each notebook on GitHub that you’d like to experiment with. Then, click the “Run Cell” button; this should execute the code within that cell, as shown in Figure 2-3.

Investigating the Model

We got the predictions from our model, great! But what factors led to those predictions? There are a few questions that we need to ask here:

  • What dataset was the model trained on?

  • Are there other models that I can use? How good are they? Where can I get them?

  • Why does my model predict what it predicts?

We look into the answers to each of these questions in this section.

ImageNet Dataset

Let’s investigate the ImageNet dataset on which ResNet-50 was trained. ImageNet, as the name suggests, is a network of images; that is, a dataset of images organized as a network, as demonstrated in Figure 2-4. It is arranged hierarchically (following the WordNet hierarchy) such that each parent node encompasses a collection of images of all the varieties possible within that parent. For example, within the “animal” parent node, there are fish, birds, mammals, invertebrates, and so on. Each category has multiple subcategories, these have sub-subcategories, and so forth. For example, the category “American water spaniel” is eight levels from the root, and the dog category contains 189 total subcategories across five hierarchical levels.

To help you appreciate the wide variety of high-level entities that the ImageNet dataset contains, we developed the treemap shown in Figure 2-5, which also shows the relative proportions of the different categories that make up the dataset.

Figure 2-4. The categories and subcategories in the ImageNet dataset
Figure 2-5. Treemap of ImageNet and its classes

The ImageNet dataset was the basis for the famous ILSVRC, which started in 2010 to benchmark progress in computer vision and challenge researchers to innovate on tasks including object classification. Recall from Chapter 1 that the ImageNet challenge saw submissions that drastically improved in accuracy each year. When it started out, the error rate was nearly 30%. Now it is down to 2.2%, already better than an average human performs at this task. This dataset and challenge are considered the single biggest reasons for the recent advancements in computer vision.

Wait, AI has better-than-human accuracy? If the dataset was created by humans, wouldn’t humans have 100% accuracy? Well, the dataset was created by experts, with each image verified by multiple people. Then-Stanford researcher (and now of Tesla fame) Andrej Karpathy attempted to figure out how well a normal human would fare on ImageNet-1000. It turns out he achieved an accuracy of 94.9%, well short of the 100% we all expected. Andrej painstakingly spent a week going over 1,500 images, spending approximately one minute per image tagging it. How did he misclassify 5.1% of the images? The reasons are a bit subtle:

Fine-grained recognition

For many people, it is really tough to distinguish a Siberian husky from an Alaskan Malamute. Someone who is really familiar with dog breeds can tell them apart because they know to look for the finer-level details that distinguish the two breeds. It turns out that neural networks are capable of learning those finer-level details much more easily than humans.

Category unawareness

Not everyone is aware of all 120 breeds of dogs, and most certainly not each one of the 1,000 classes. But the AI is. After all, it was trained on them.

Note

Similar to ImageNet, human transcribers on speech datasets like Switchboard show an error rate of about 5.1% (coincidentally the same number as the human error rate on ImageNet), a level that speech recognition systems have now reached. It’s clear that humans have a limit, and AI is gradually beating us.

One of the other key reasons for this fast pace of improvement was that researchers were openly sharing models trained on datasets like ImageNet. In the next section, we learn about model reuse in more detail.

Model Zoos

A model zoo is a place where organizations or individuals can publicly upload models that they have built for others to reuse and improve upon. These models can be trained using any framework (e.g., Keras, TensorFlow, MXNet), for any task (classification, detection, etc.), or trained on any dataset (e.g., ImageNet, Street View House Numbers (SVHN)).

The tradition of model zoos started with Caffe, one of the first deep learning frameworks, developed at the University of California, Berkeley. Training a deep learning model from scratch on a multimillion-image database requires weeks of training time and a lot of GPU compute, making it a costly undertaking. The research community recognized this as a bottleneck, and the organizations that participated in the ImageNet competition open sourced their trained models on Caffe’s website. Other frameworks soon followed suit.

When starting out on a new deep learning project, it’s a good idea to first explore whether there’s already a model that performs a similar task and was trained on a similar dataset.

The model zoo in Keras is a collection of various architectures trained using the Keras framework on the ImageNet dataset. We tabulate their details in Table 2-1.

Table 2-1. Architectural details of select pretrained ImageNet models
Model             | Size   | Top-1 accuracy | Top-5 accuracy | Parameters  | Depth
VGG16             | 528 MB | 0.713          | 0.901          | 138,357,544 | 23
VGG19             | 549 MB | 0.713          | 0.900          | 143,667,240 | 26
ResNet-50         | 98 MB  | 0.749          | 0.921          | 25,636,712  | 50
ResNet-101        | 171 MB | 0.764          | 0.928          | 44,707,176  | 101
ResNet-152        | 232 MB | 0.766          | 0.931          | 60,419,944  | 152
InceptionV3       | 92 MB  | 0.779          | 0.937          | 23,851,784  | 159
InceptionResNetV2 | 215 MB | 0.803          | 0.953          | 55,873,736  | 572
NASNetMobile      | 23 MB  | 0.744          | 0.919          | 5,326,716   | -
NASNetLarge       | 343 MB | 0.825          | 0.960          | 88,949,818  | -
MobileNet         | 16 MB  | 0.704          | 0.895          | 4,253,864   | 88
MobileNetV2       | 14 MB  | 0.713          | 0.901          | 3,538,984   | 88

The column “Top-1 accuracy” indicates how many times the best guess was the correct answer, and the column “Top-5 accuracy” indicates how many times at least one out of five guesses were correct. The “Depth” of the network indicates how many layers are present in the network. The “Parameters” column indicates the size of the model; that is, how many individual weights the model has: the more parameters, the “heavier” the model is, and the slower it is to make predictions. In this book, we often use ResNet-50 (the most common architecture cited in research papers for high accuracy) and MobileNet (for a good balance between speed, size, and accuracy).
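To get a feel for how interchangeable these architectures are, here is a small sketch (not from the notebook) that loads two of the models from Table 2-1 and compares their sizes; the weights are downloaded and cached automatically on first use, and exact parameter counts can vary slightly between Keras versions:

import tensorflow as tf

resnet50 = tf.keras.applications.ResNet50(weights="imagenet")
mobilenet = tf.keras.applications.MobileNet(weights="imagenet")

# These counts correspond to the "Parameters" column in Table 2-1
print(resnet50.count_params())   # roughly 25.6 million
print(mobilenet.count_params())  # roughly 4.3 million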

Class Activation Maps

Image saliency, a concept popular in UX research, tries to answer the question “What part of the image are users paying attention to?” This is measured with the help of eye-tracking studies and represented as heatmaps. For example, big, bold fonts or people’s faces usually get more attention than backgrounds. It’s easy to guess how useful these heatmaps would be to designers and advertisers, who can then adapt their content to maximize users’ attention. Taking inspiration from this human version of saliency, wouldn’t it be great to learn which part of the image the neural network is paying attention to? That’s precisely what we will be experimenting with.

In our experiment, we will be overlaying a class activation map (or, colloquially, a heatmap) on top of a video in order to understand what the network pays attention to. The heatmap tells us something like “In this picture, these pixels were responsible for the prediction of the class ‘dog,’” where “dog” was the category with the highest probability. The “hot” pixels are represented with warmer colors such as red, orange, and yellow, whereas the “cold” pixels are represented using blue. The “hotter” a pixel is, the stronger the signal it provides toward the prediction. Figure 2-6 gives us a clearer picture. (If you’re reading the print version, refer to the book’s GitHub repository for the original color image.)

Figure 2-6. Original image of a dog and its generated heatmap
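For the curious, here is a minimal sketch of the idea behind such heatmaps, using the Grad-CAM technique implemented with plain TensorFlow rather than the keras-vis script that the notebook uses. The layer name conv5_block3_out is an assumption about the last convolutional layer of ResNet-50 in recent Keras versions and may need adjusting for yours:

import tensorflow as tf

def grad_cam(model, img_batch, last_conv_layer_name="conv5_block3_out"):
    # Build a model that returns both the last conv feature maps and the predictions
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_output, predictions = grad_model(img_batch)
        top_class = int(tf.argmax(predictions[0]))
        top_score = predictions[:, top_class]
    # Gradient of the winning class score with respect to each feature map
    grads = tape.gradient(top_score, conv_output)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    # Weighted sum of the feature maps, keeping only positive contributions
    heatmap = tf.reduce_sum(conv_output[0] * weights, axis=-1)
    heatmap = tf.maximum(heatmap, 0) / (tf.reduce_max(heatmap) + 1e-8)
    return heatmap.numpy()  # a small 2D map to resize and overlay on the image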

In the GitHub repository (see http://PracticalDeepLearning.ai), navigate to code/chapter-2. There, you’ll find a handy Jupyter notebook, 2-class-activation-map-on-video.ipynb, which describes the following steps:

First, we need to install keras-vis using pip:

$ pip install keras-vis --user

We then run the visualization script on a single image to generate the heatmap for it:

$ python visualization.py --process image --path ../sample-images/dog.jpg

We should see a newly created file called dog-output.jpg that shows a side-by-side view of the original image and its heatmap. As we can see from Figure 2-6, the right half of the image indicates the “areas of heat” along with the correct prediction of a “Cardigan” (i.e., Welsh Corgi).

Next, we want to visualize the heatmap for frames in a video. For that, we need FFmpeg, an open source multimedia framework. You can find the download binary as well as the installation instructions for your operating system at https://www.ffmpeg.org.

We use ffmpeg to split up a video into individual frames (at 25 frames per second) and then run our visualization script on each of those frames. We must first create a directory to store these frames and pass its name as part of the ffmpeg command:

$ mkdir kitchen
$ ffmpeg -i video/kitchen-input.mov -vf fps=25 kitchen/thumb%04d.jpg -hide_banner

We then run the visualization script with the path of the directory containing the frames from the previous step:

$ python visualization.py --process video --path kitchen/

We should see a newly created kitchen-output directory that contains all of the heatmaps for the frames from the input directory.

Finally, compile a video from those frames using ffmpeg:

$ ffmpeg -framerate 25 -i kitchen-output/result-%04d.jpg kitchen-output.mp4

Perfect! The result is the original video playing side by side with a copy that has the heatmap overlaid on it. This is a particularly useful tool for discovering whether the model has learned the correct features or picked up stray artifacts during its training.

Imagine generating heatmaps to analyze the strong points and shortfalls of our trained model or a pretrained model.

You should try this experiment out on your own by shooting a video with your smartphone camera and running the aforementioned scripts on the file. Don’t forget to post your videos on Twitter, tagging @PracticalDLBook!

Tip

Heatmaps are a great way to visually detect bias in the data. The quality of a model’s predictions depends heavily on the data on which it was trained. If the data is biased, that will be reflected in the predictions. A great example of this (although probably an urban legend) is the story in which the US Army wanted to use neural networks to detect enemy tanks camouflaged in trees.1 The researchers who were building the model took photographs: 50% containing camouflaged tanks and 50% with just trees. Model training yielded 100% accuracy. A cause for celebration? That sadly wasn’t the case when the US Army tested it. The model performed very poorly, no better than random guessing. Investigation revealed that the photos with tanks were taken on cloudy (overcast) days and those without tanks on clear, sunny days, so the neural network had learned to look at the sky instead of the tanks. If the researchers had visualized the model using heatmaps, they would have caught the issue pretty early.

As we collect data, we must be vigilant from the outset about potential bias that can pollute our model’s learning. For example, when collecting images to build a food classifier, we should verify that other artifacts, such as plates and utensils, are not being learned as food. Otherwise, the presence of chopsticks might get our food classified as chow mein. The term for this is co-occurrence: food very frequently co-occurs with cutlery. So watch out for these artifacts seeping into your classifier’s training.

Summary

In this chapter, we got a glimpse of the deep learning universe using Keras. It’s an easy-to-use yet powerful framework that we use in the next several chapters. We observed that there is often no need to collect millions of images and use powerful GPUs to train a custom model because we can use a pretrained model to predict the category of an image. By diving deeper into datasets like ImageNet, we learned the kinds of categories these pretrained models can predict. We also learned about finding these models in model zoos that exist for most frameworks.

In Chapter 3, we explore how we can tweak an existing pretrained model to make predictions on classes of input for which it was not originally intended. As with the current chapter, our approach is geared toward obtaining output without needing millions of images and lots of hardware resources to train a classifier.

1 “Artificial Intelligence as a Positive and Negative Factor in Global Risk” by Eliezer Yudkowsky in Global Catastrophic Risks (Oxford University Press).
