Chapter 12. Not Hotdog on iOS with Core ML and Create ML

“I’m a rich,” said Jian-Yang, a newly minted millionaire in an interview with Bloomberg (Figure 12-1). What did he do? He created the Not Hotdog app (Figure 12-2) and made the world “a better place.”

Figure 12-1. Jian-Yang being interviewed by Bloomberg News after Periscope acquires his "Not Hotdog" technology (image source: HBO's Silicon Valley)

To the few of us who may be confused (including a third of the authors of this book), we are making a reference to HBO’s Silicon Valley, a show in which one of the characters is tasked with making SeeFood—the “Shazam for food.” It was meant to classify pictures of food and give recipes and nutritional information. Hilariously, the app ends up being good only for recognizing hot dogs. Anything else would be classified as “Not Hotdog.”

There are a few reasons we chose to reference this fictitious app. It's very much a part of popular culture and something many people can easily relate to. It's an exemplar: easy enough to build, yet powerful enough to see the magic of deep learning in a real-world application. It is also trivially generalizable to recognize more than one class of item.

Figure 12-2. The Not Hotdog app in action (image source: Apple App Store listing for the Not Hotdog app)

In this chapter, we work through a few different approaches to building a Not Hotdog clone. The general outline of the end-to-end process is as follows:

  1. Collect relevant data.

  2. Train the model.

  3. Convert to Core ML.

  4. Build the iOS app.

Table 12-1 presents the different options available for steps 1 through 3. Further along in the chapter, we do a deep dive into each of them.

Table 12-1. Various approaches to getting a model ready for mobile deployment, right from scratch
Data collection

  • Find or collect a dataset

  • Fatkun Chrome browser extension

  • Web scraper using the Bing Image Search API

Training mechanism

  • Web-based GUI: CustomVision.ai, IBM Watson, Clarifai, Google AutoML

  • Create ML

  • Fine-tune using any framework of choice, such as Keras

Model conversion

  • Create ML, CustomVision.ai, and other GUI tools generate .mlmodel directly

  • For Keras models, use Core ML Tools

  • For TensorFlow-trained models, use tf-coreml

Let’s dive right in!

Collecting Data

To begin solving any computer-vision task using deep learning, we first need a dataset of images to train on. In this section, we look at three different approaches to collecting images of the relevant categories, in increasing order of time required, from minutes to days.

Approach 1: Find or Collect a Dataset

The fastest way to get our problem solved is to have an existing dataset in hand. There are tons of publicly available datasets in which a category or subcategory might be relevant to our task. For example, Food-101 (https://www.vision.ee.ethz.ch/datasets_extra/food-101/) from ETH Zurich contains a class of hot dogs. Alternatively, ImageNet contains 1,257 images of hot dogs. We can use a random sample of images from the remaining classes as "Not Hotdog."
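
If we go the Food-101 route, assembling the negative class can be as simple as randomly sampling a few images from each of the other 100 classes. The following is a minimal sketch of that idea; it assumes Food-101 has been extracted into a local food-101/images directory (one subdirectory per class), and the destination path and sample size are illustrative.

import os
import random
import shutil

SOURCE_DIR = "food-101/images"   # one subdirectory per class (assumed layout)
DEST_DIR = "dataset/not_hotdog"  # negative class for our binary classifier
SAMPLES_PER_CLASS = 15           # arbitrary; tune to balance the two classes

os.makedirs(DEST_DIR, exist_ok=True)
random.seed(42)

for class_name in os.listdir(SOURCE_DIR):
    if class_name == "hot_dog":  # skip the positive class
        continue
    class_dir = os.path.join(SOURCE_DIR, class_name)
    if not os.path.isdir(class_dir):
        continue
    images = os.listdir(class_dir)
    for filename in random.sample(images, min(SAMPLES_PER_CLASS, len(images))):
        # Prefix with the class name to avoid filename collisions
        shutil.copy(os.path.join(class_dir, filename),
                    os.path.join(DEST_DIR, f"{class_name}_{filename}"))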

To download images from a particular category, you can use the ImageNet-Utils tool:

  1. Search for the relevant category on the ImageNet website; for example, “Hot dog.”

  2. Note the wnid (WordNet ID) in the URL: http://image-net.org/synset?wnid=n07697537.

  3. Clone the ImageNet-Utils repository:

    $ git clone --recursive https://github.com/tzutalin/ImageNet_Utils.git
  4. Download the images for the particular category by specifying the wnid:

    $ ./downloadutils.py --downloadImages --wnid n07697537

In case we can’t find a dataset, we can also build our own dataset by taking pictures ourselves with a smartphone. It’s essential that we take pictures representative of how our application would be used in the real world. Alternatively, crowdsourcing this problem, like asking friends, family, and coworkers, can generate a diverse dataset. Another approach used by large companies is to hire contractors who are tasked with collecting images. For example, Google Allo released a feature to convert selfies into stickers. To build it, they hired a team of artists to take an image and create the corresponding sticker so that they could train a model on it.

Note

Make sure to check the license under which the images in the dataset are released. It’s best to use images released under permissive licenses such as Creative Commons.

Approach 2: Fatkun Chrome Browser Extension

There are several browser extensions that allow us to batch download multiple images from a website. One such example is Fatkun Batch Download Image, an extension available for the Chrome browser.

We can have the entire dataset ready with the following quick steps:

  1. Add the extension to our browser.

  2. Search for the keyword on either Google or Bing Image Search.

  3. Select the appropriate filter for image licenses in the search settings.

  4. After the page reloads, scroll to the bottom a few times so that more thumbnails are loaded onto the page.

  5. Open the extension and select the "This Tab" option, as demonstrated in Figure 12-3.

    Figure 12-3. Bing Search results for “hot dog”
  6. Notice that all the thumbnails are selected by default. At the top of the screen, click the Toggle button to deselect all of the thumbnails, and then select only the ones we need. We can set the minimum width and height to 224 (most pretrained models take 224x224 as the input size); a quick post-download cleanup for undersized or corrupt files is sketched after these steps.

    Figure 12-4. Selecting images through the Fatkun extension
  7. In the upper-right corner, click Save Image to download all of the selected thumbnails to our computer.
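
Batch downloads inevitably include a few corrupt files and undersized thumbnails. Here is the post-download cleanup sketch mentioned in step 6, using Pillow; the directory path and the 224-pixel threshold are illustrative assumptions, so adjust them to wherever your images actually landed.

import os
from PIL import Image

IMAGE_DIR = "downloads/hot_dog"  # wherever the extension saved the images (assumed)
MIN_SIZE = 224                   # matches the input size of most pretrained models

for filename in os.listdir(IMAGE_DIR):
    path = os.path.join(IMAGE_DIR, filename)
    try:
        with Image.open(path) as img:
            img.verify()               # raises an exception if the file is corrupt
        with Image.open(path) as img:  # reopen; verify() leaves the file unusable
            width, height = img.size
        if width < MIN_SIZE or height < MIN_SIZE:
            os.remove(path)            # too small to be a useful training image
    except Exception:
        os.remove(path)                # unreadable or corrupt file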

Note

Note that the images shown in the screenshots are iconic images (i.e., the main object is in direct focus against a clean background). Chances are that training our model exclusively on such images will cause it not to generalize well to real-world images. For example, if the training images all have a clean white background (as on an ecommerce website), the neural network might incorrectly learn that a white background equals a hot dog. Hence, while performing data collection, ensure that your training images are representative of the real world.

Tip

For the negative class of "Not Hotdog," we want to collect random images, which are abundantly available. Additionally, collect items that look similar to a hot dog but are not; for example, submarine sandwiches, bread, plates, hamburgers, and so on.

If items that commonly co-occur with hot dogs, such as plates of food, tissues, and ketchup bottles or packets, are absent from the negative class, the model can mistakenly learn that those items are themselves hot dogs. So be sure to add them to the negative class as well.

When you install a browser extension like Fatkun, it will request permission to read and modify data on all of the websites you visit. It might be a good idea to disable the extension when you're not using it to download images.

Approach 3: Web Scraper Using Bing Image Search API

For building larger datasets, collecting images with Fatkun can be a tedious process. Additionally, the images returned by the Fatkun browser extension are thumbnails rather than the original full-size images. For large-scale image collection, we can instead use an image search API, such as the Bing Image Search API, which lets us set constraints such as the keyword, image size, and license. Google used to offer an Image Search API, but it was discontinued in 2011.
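
As a rough illustration, the following sketch queries the Bing Image Search API (v7) with the requests library and downloads the results. The endpoint URL, parameter names, and key placeholder reflect the Azure Cognitive Services documentation at the time of writing; check the current documentation (and use your own subscription key) before relying on them.

import os
import requests

SUBSCRIPTION_KEY = "YOUR_BING_API_KEY"  # placeholder; obtained from the Azure portal
SEARCH_URL = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"

headers = {"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY}
params = {
    "q": "hot dog",      # keyword
    "count": 150,        # results per request (the API caps this per call)
    "offset": 0,         # increase in steps of `count` to page through results
    "imageType": "Photo",
    "minWidth": 224,
    "minHeight": 224,
    "license": "Any",    # constrain this further to permissively licensed images
}

os.makedirs("downloads/hot_dog", exist_ok=True)

response = requests.get(SEARCH_URL, headers=headers, params=params)
response.raise_for_status()
results = response.json()

for i, item in enumerate(results.get("value", [])):
    try:
        image = requests.get(item["contentUrl"], timeout=10)
        image.raise_for_status()
        with open(f"downloads/hot_dog/{i:04d}.jpg", "wb") as f:
            f.write(image.content)
    except requests.RequestException:
        continue  # skip URLs that have gone stale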

Bing's Search API is an amalgamation of its AI-based image understanding and traditional information retrieval methods (i.e., using tags from fields such as alt-text, metadata, and captions). We can often end up with irrelevant images because of misleading tags in these fields, so we want to manually review the collected images to make sure that they are actually relevant to our task.

When we have a very large image dataset, it can be a daunting task to have to go through it manually and filter out all of the poor training examples. It’s easier to approach this in an iterative manner, slowly improving the quality of the training dataset with each iteration. Here are the high-level steps:

  1. Create a subset of the training data by manually reviewing a small number of images. For example, if we have 50k images in our original dataset, we might want to manually select around 500 good training examples for the first iteration.

  2. Train the model on those 500 images.

  3. Test the model on the remaining images and get the confidence value for each image.

  4. Among the images with the lowest confidence values (i.e., frequently mispredicted images), review a subset (say, 500) and discard the irrelevant ones. Add the remaining images from this subset to the training set.

  5. Repeat steps 2 through 4 for a few iterations until we are happy with the quality of the model.

This is a form of semisupervised learning.
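
The confidence-ranking part of this loop (steps 3 and 4) is easy to automate. Here is a minimal sketch assuming a Keras classifier with a softmax output (such as the one we train later in this chapter) and an unlabeled/ directory of not-yet-reviewed images; the paths, input size, and review batch size are illustrative.

import os
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("NotHotDog-model.h5")  # trained on the seed set
IMAGE_DIR = "unlabeled"  # images that have not been manually reviewed yet (assumed)
NUM_TO_REVIEW = 500      # size of the next manual review batch

scores = []
for filename in os.listdir(IMAGE_DIR):
    path = os.path.join(IMAGE_DIR, filename)
    img = tf.keras.preprocessing.image.load_img(path, target_size=(224, 224))
    x = tf.keras.preprocessing.image.img_to_array(img)
    x = tf.keras.applications.mobilenet.preprocess_input(x)  # must match training preprocessing
    probs = model.predict(np.expand_dims(x, axis=0), verbose=0)[0]
    scores.append((float(probs.max()), path))  # confidence of the predicted class

# The lowest-confidence predictions are the best candidates for manual review.
scores.sort(key=lambda pair: pair[0])
for confidence, path in scores[:NUM_TO_REVIEW]:
    print(f"{confidence:.3f}\t{path}")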

Tip

You can improve the model accuracy further by reusing the discarded images as negative training examples.

Note

For a large set of images that don’t have labels, you might want to use co-occurrence of other defining text as labels; for example, hashtags, emojis, alt-text, etc.

Facebook built a dataset of 3.5 billion images, using the hashtags on the corresponding posts as weak labels, pretrained a model on it, and eventually fine-tuned that model on the ImageNet dataset. The resulting model beat the state-of-the-art result by 2% (roughly 85% top-1 accuracy).

Now that we have collected our image dataset, let's finally begin training our model.

Training Our Model

Broadly speaking, there are three easy ways to train a model, all of which we have discussed previously. Here, we provide a brief overview of each approach.

Approach 1: Use Web UI-based Tools

As discussed in Chapter 8, there are several tools for building custom models by supplying labeled images and performing training through a web UI. Microsoft's CustomVision.ai, Google AutoML, IBM Watson Visual Recognition, Clarifai, and Baidu EZDL are a few examples. These methods are code-free, and many provide simple drag-and-drop GUIs for training.

Let’s look at how we can have a mobile-friendly model ready in less than five minutes using CustomVision.ai:

  1. Go to http://customvision.ai, and make a new project. Because we want to export the trained model to a mobile phone, select a compact model type. Since our domain is food related, select “Food (Compact),” as shown in Figure 12-5.

    Figure 12-5. Define a new project on CustomVision.ai
  2. Upload the images and assign tags (labels), as depicted in Figure 12-6. Upload at least 30 images per tag.

    Figure 12-6. Uploading images on the CustomVision.ai dashboard. Note that the tags have been populated as Hotdog and Not Hotdog
  3. Click the Train button. A dialog box opens, as shown in Figure 12-7. Fast Training essentially trains only the last few layers, whereas Advanced Training can potentially tune the full network, giving even higher accuracy (and obviously taking more time and money). The Fast Training option should be sufficient for most cases.

    Figure 12-7. Options for training type
  4. In under a minute, a screen should appear, showing the precision and recall for the newly trained model per category, as shown in Figure 12-8. (This should ring a bell because we had discussed precision and recall earlier in the book.)

    Figure 12-8. Precision, Recall, and Average Precision of the newly trained model
  5. Play with the probability threshold to see how it changes the model’s performance. The default 90% threshold achieves pretty good results. The higher the threshold, the more precise the model becomes, but at the expense of reduced recall.

  6. Press the Export button and select the iOS platform (Figure 12-9). Internally, CustomVision.ai converts the model to Core ML (or TensorFlow Lite if you’re exporting for Android).

    Figure 12-9. The model exporter options in CustomVision.ai

And we’re done, all without writing a single line of code! Now let’s look at an even more convenient way of training without coding.

Approach 2: Use Create ML

In 2018, Apple launched Create ML as a way for developers within the Apple ecosystem to train computer-vision models natively. Developers could open a playground and write a few lines of Swift code to train an image classifier. Alternatively, they could import CreateMLUI to display a limited GUI training experience within the playground. It was a good way to get Swift developers to deploy Core ML models without requiring much experience in machine learning.

A year later, at the Apple Worldwide Developers Conference (WWDC) 2019, Apple lowered the barrier even further by announcing the standalone Create ML app on macOS Catalina (10.15). It provides an easy-to-use GUI to train neural networks without needing to write any code at all. Training a neural network simply became a matter of dragging and dropping files into this UI. In addition to supporting the image classifier, they also announced support for object detectors, NLP, sound classification, activity classification (classify activities based on motion sensor data from Apple Watch and iPhone), as well as tabular data (including recommendation systems).

And, it’s fast! Models can be trained in under a minute. This is because it uses transfer learning, so it doesn’t need to train all of the layers in the network. It also supports various data augmentations such as rotations, blur, noise, and so on, and all you need to do is click checkboxes.

Before Create ML came along, it was generally a given that anyone seeking to train a serious neural network in a reasonable amount of time had to own an NVIDIA GPU. Create ML took advantage of the onboard Intel and/or Radeon graphics cards, which allowed faster training on MacBooks without the need to purchase additional hardware. Create ML allows us to train multiple models, from different data sources, all at the same time. It can benefit particularly from powerful hardware such as the Mac Pro or even an external GPU (eGPU).

One major motivation to use Create ML is the size of the models it outputs. A full model can be broken down into a base model (which emits features) and lighter task-specific classification layers. Apple ships the base models into each of its operating systems. So, Create ML just needs to output the task-specific classifier. How small are these models? As little as a few kilobytes (compared to more than 15 MB for a MobileNet model, which is already pretty small for a CNN). This is important in a day and age when more and more app developers are beginning to incorporate deep learning into their apps. The same neural networks do not need to be unnecessarily replicated across several apps consuming valuable storage space.

In short, Create ML is easy, speedy, and tiny. Sounds too good to be true? It turns out that the flip side of full vertical integration is that developers are tied into the Apple ecosystem. Create ML exports only .mlmodel files, which can be used exclusively on Apple operating systems such as iOS, iPadOS, macOS, tvOS, and watchOS. Sadly, Android integration is not yet a reality for Create ML.

In this section, we build the Not Hotdog classifier using Create ML:

  1. Open the Create ML app, click New Document, and select the Image Classifier template from among the several options available (including Sound, Activity, Text, Tabular), as shown in Figure 12-10. Note that this is only available on Xcode 11 (or greater), on macOS 10.15 (or greater).

    Figure 12-10. Choosing a template for a new project
  2. In the next screen, enter a name for the project, and then select Done.

  3. We need to sort the data into the correct directory structure. As Figure 12-11 illustrates, we place images in directories that have the names of their labels. It is useful to have separate train and test datasets with their corresponding directories.

    Figure 12-11. Train and test data in separate directories
  4. Point the UI to the training and test data directories, as shown in Figure 12-12.

    Figure 12-12. Training interface in Create ML
  5. Figure 12-12 shows the UI after you select the train and test data directories. Notice that the validation data was automatically selected by Create ML. Additionally, notice the augmentation options available. It is at this point that we can click the Play button (the right-facing triangle; see Figure 12-13) to start the training process.

    Figure 12-13. Create ML screen that opens after loading train and test data
    Note

    As you experiment, you will quickly notice that each augmentation we add makes the training slower. To set a quick baseline performance metric, we should avoid using augmentations in the first run. Subsequently, we can experiment with adding more augmentations to assess how they affect the quality of the model.

  6. When the training completes, we can see how the model performed on the training data, (auto-selected) validation data, and the test data, as depicted in Figure 12-14. At the bottom of the screen, we can also observe how long the training process took and the size of the final model. 97% test accuracy in under two minutes. And all that with a 17 KB output. Not too shabby.

    Figure 12-14. The Create ML screen after training completes
  7. We’re so close now—we just need to export the final model. Drag the Output button (highlighted in Figure 12-14) to the desktop to create the .mlmodel file.

  8. We can double-click on the newly exported .mlmodel file to inspect the input and output layers, as well as test drive the model by dropping images into it, as shown in Figure 12-15.

    Figure 12-15. The model inspector UI within Xcode

The model is now ready to be plugged into apps on any Apple device.

Note

Create ML uses transfer learning, training only the last few layers. Depending on your use case, the underlying model that Apple provides you might be insufficient to make high-quality predictions. This is because you are unable to train the earlier layers in the model, thereby restricting the potential to which the model can be tuned. For most day-to-day problems, this should not be an issue. However, for very domain-specific applications like X-rays, or very similar-looking objects for which the tiniest of details matter (like distinguishing currency notes), training a full CNN would be a better approach. We look at doing so in the following section.

Approach 3: Fine Tuning Using Keras

By now, we have become experts in using Keras. This option can get us even higher accuracy if we are up for experimentation and are willing to spend more time training the model. Let’s reuse the code from Chapter 3 and modify parameters such as directory and filename, batch size, and number of images. You will find the code on the book’s GitHub website (see http://PracticalDeepLearning.ai) at code/chapter-12/1-keras-custom-classifier-with-transfer-learning.ipynb.

The model training should take a few minutes to complete, depending on the hardware. At the end of training, we should have a NotHotDog-model.h5 file ready on disk.
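
For reference, here is a condensed sketch in the spirit of that notebook: transfer learning on MobileNet with TensorFlow 2.x's built-in Keras. The directory layout, hyperparameters, and classification head shown here are illustrative; the notebook on the book's GitHub site remains the authoritative version.

from tensorflow.keras.applications.mobilenet import MobileNet, preprocess_input
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

TRAIN_DIR = "dataset/train"  # expects train/hot_dog and train/not_hotdog (assumed layout)
IMG_SIZE, BATCH_SIZE = 224, 32

datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
                             validation_split=0.2)
train_gen = datagen.flow_from_directory(TRAIN_DIR, target_size=(IMG_SIZE, IMG_SIZE),
                                        batch_size=BATCH_SIZE, subset="training")
val_gen = datagen.flow_from_directory(TRAIN_DIR, target_size=(IMG_SIZE, IMG_SIZE),
                                      batch_size=BATCH_SIZE, subset="validation")

# Reuse MobileNet's convolutional base and train only a small classification head
base = MobileNet(weights="imagenet", include_top=False,
                 input_shape=(IMG_SIZE, IMG_SIZE, 3))
base.trainable = False

x = GlobalAveragePooling2D()(base.output)
x = Dense(64, activation="relu")(x)
predictions = Dense(2, activation="softmax")(x)

model = Model(inputs=base.input, outputs=predictions)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=10)
model.save("NotHotDog-model.h5")  # filename matches the conversion step that follows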

Model Conversion Using Core ML Tools

As discussed in Chapter 11, there are several ways of converting our models to the Core ML format.

Models generated from CustomVision.ai are directly available in Core ML format, hence no conversion is necessary. For models trained in Keras, Core ML Tools can help convert as follows. Note that because we are using a MobileNet model, which uses a custom layer called relu6, we need to import CustomObjectScope:

import coremltools
import tensorflow.keras as keras
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import CustomObjectScope

# relu6 and DepthwiseConv2D are treated as custom objects by older Keras MobileNet
# implementations, so we load the model inside a CustomObjectScope.
with CustomObjectScope({
        'relu6': keras.applications.mobilenet.relu6,
        'DepthwiseConv2D': keras.applications.mobilenet.DepthwiseConv2D}):
    model = load_model('NotHotDog-model.h5')

coreml_model = coremltools.converters.keras.convert(model)
coreml_model.save('NotHotDog.mlmodel')
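
Before heading to Xcode, it can be reassuring to sanity-check the converted model programmatically (in addition to the Xcode inspector we use later). This small sketch uses coremltools to print the input and output descriptions stored inside the .mlmodel file.

import coremltools

# Load the converted model and print its input/output feature descriptions
mlmodel = coremltools.models.MLModel('NotHotDog.mlmodel')
spec = mlmodel.get_spec()
print(spec.description)

On macOS, mlmodel.predict() can also run the model directly in Python, which is handy for spot-checking a few images before wiring up the app.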

Now that we have a Core ML model ready, all we need to do is build the app.

Building the iOS App

We can use the code from Chapter 11 and simply replace the .mlmodel with the newly generated model file, as demonstrated in Figure 12-16.

Figure 12-16. Loading the .mlmodel into Xcode

Now, compile and run the app and you’re done! Figure 12-17 presents the awesome results.

Figure 12-17. Our app identifying the hot dog

Further Exploration

Can we make this application more interesting? We can build an actual “Shazam for food” by training for all the categories in the Food-101 dataset, which we cover in the next chapter. Additionally, we can enhance the UI compared to the barebones percentages our current app shows. And, to make it viral just like “Not Hotdog,” provide a way to share the classifications to social media platforms.

Summary

In this chapter, we worked through an end-to-end pipeline of collecting data, training and converting a model, and using it in the real world on an iOS device. For each step of the pipeline, we explored a few different options of varying complexity. And we placed the concepts covered in the previous chapters in the context of a real-world application.

And now, like Jian-Yang, go make your millions!
