Chapter 15. Becoming a Maker: Exploring Embedded AI at the Edge

You know how to build a great AI application, but you want more. You don’t want to be limited to just running AI software on some computer; you want to bring it out into the real, physical world. You want to build devices to make things more interactive, to make life easier, to serve humanity, or perhaps just for the fun of it. Maybe you want to build an interactive painting that smiles at you when you look at it. A camera on your door that sounds a loud alarm when an unauthorized person attempts to steal delivered packages. Maybe a robotic arm that sorts recyclables and trash. A device in the woods to prevent wildlife poaching, perhaps? Or a drone that can autonomously survey large areas and identify people in distress during floods. Maybe even a wheelchair that could drive on its own. What you need is a smart electronic device. But how would you build it, what would it cost, and how powerful would it be? In this chapter, we begin to address those questions.

We look at how to implement AI on an embedded device—a device that you might use in a “maker” project. Makers are people with a DIY spirit who use their creativity to build something new. Often starting as amateur hobbyists, makers are fun-loving problem solvers, roboticists, innovators, and sometimes entrepreneurs.

The aim of this chapter is to spark your ability to select the appropriate device for the task (which means not trying to run a heavy GAN on a tiny CPU or get a quadrillion-core GPU to run a “Not Hotdog” classifier), and set it up for the tests as quickly and easily as possible. We do this by exploring a few of the better-known devices out there, and seeing how we can use them to perform inferencing of our model. And finally, we look at how makers around the world are using AI to build robotic projects.

Let’s take our first step and look at the current landscape of embedded AI devices.

Exploring the Landscape of Embedded AI Devices

In this section, we explore a few well-known embedded AI devices, listed in Table 15-1. We talk about their inner workings and the differences between them before we go into testing.

Table 15-1. Device list
Device Description
Raspberry Pi 4 The most famous single-board computer, as of this writing
Intel Movidius NCS2 A USB accelerator using a 16-core Visual Processing Unit (VPU)
Google Coral USB A USB accelerator using a custom Google Application-Specific Integrated Circuit (ASIC)
NVIDIA Jetson Nano A single-board computer using a combination of CPU and 128-core CUDA GPU
PYNQ-Z2 A single-board computer using the combination of CPU and 50k CLB Field-Programmable Gate Array (FPGA)

Instead of saying one is better than the other, we want to learn how to choose a setup for a particular project. We don’t typically see a Ferrari during the morning commute. Smaller, more compact cars are more common, and they get the job done just as well, if not better. Similarly, using a powerful $1,000-plus NVIDIA 2080 Ti GPU might be overkill for a battery-powered drone. Following are some questions we should be asking ourselves to understand which of these edge devices would best suit our needs:

  1. How big is the device (for instance, compared to a coin)?

  2. How much does the device cost? Consider whether you’re on a tight budget.

  3. How fast is the device? Will it process a single FPS or 100 FPS?

  4. How much power (in watts) does the device typically need? For battery-powered projects, this can be essential.

With these questions in mind, let’s explore some of the devices in Figure 15-1 one by one.

Figure 15-1. Family photo of Embedded AI devices; starting at the top, going clockwise: PYNQ-Z2, Arduino UNO R3, Intel Movidius NCS2, Raspberry Pi 4, Google Coral USB Accelerator, NVIDIA Jetson Nano, and a €1 coin for reference in the middle

Raspberry Pi

Because we will be talking about embedded devices for makers, let’s begin with the one universally synonymous with electronic projects: the Raspberry Pi (Figure 15-2). It’s cheap, it’s easy to build on, and it has a huge community.

Figure 15-2. Raspberry Pi 4
Size 85.6 mm x 56.5 mm
Price Starting at $35
Processing unit ARM Cortex-A72 CPU
Power rating 15W

First of all, what is a Raspberry Pi? It’s a “single-board computer,” which simply means that all calculations can be performed on a single printed circuit board (PCB). The fourth version houses a Broadcom SoC (system-on-a-chip) containing (most importantly) an ARM Cortex-A72 quad-core CPU, a VideoCore VI 3D unit, and some video decoders and encoders. It comes with a choice of RAM size, up to 4 GB.

This version is a big step up from the Raspberry Pi 3 in terms of performance, but it does lose a bit of efficiency. This is because the Raspberry Pi 3 uses the Cortex-A53 core, which is the high-efficiency variant, whereas the Cortex-A72 core (used in version 4) is the high-performance variant. We can see this reflected in the power supply recommendation, which was 2.5 amps for the Raspberry Pi 3 and became 3 amps for the Raspberry Pi 4. It is also important to note that the Raspberry Pi 4 has USB 3 ports, whereas the Raspberry Pi 3 has only USB 2 ports. This will prove to be important later in this chapter.

On to some machine learning stuff now. However powerful the Raspberry Pi 4 has become, it still mainly consists of four sequential CPU cores (which now support Out of Order [OoO] execution). It has the VideoCore VI, but there is no TensorFlow version available for this architecture as of this writing. Koichi Nakamura from Idein Inc. has built py-videocore, a Python library for accessing the Quad Processing Units (QPUs; the GPU-like cores of the Raspberry Pi’s SoC). He has accelerated neural networks with it before, but simple acceleration of TensorFlow isn’t possible yet. C++ libraries for accessing these cores are also available. But as you might suspect, these cores are not that powerful, so even when used to accelerate neural networks, they might not yield the desired results. For the sake of simplicity, we will not go into these libraries, because digging into the algorithms is beyond the scope of this book. And, as we will see further on, this might not be necessary at all.

In the end, the Raspberry Pi has proven to be an immensely useful piece of hardware for numerous tasks. It is often used for educational purposes and by makers. You would also be surprised how often the Raspberry Pi shows up in industrial environments (there is a thing called the netPi, which is essentially a Raspberry Pi with a robust enclosure).

Intel Movidius Neural Compute Stick

Let’s now dive straight into one of the reasons the Raspberry Pi is the go-to base for a lot of projects: the Intel Movidius Neural Compute Stick 2 (Figure 15-3), the second generation of a USB accelerator created by Intel.

Figure 15-3. Intel Neural Compute Stick 2
Size 72.5 mm x 27 mm
Price $87.99
Processing unit Myriad X VPU
Power rating 1W

What it does is pretty simple: you give it a neural network and some data, and it does all the calculations necessary for inference. And it’s connected only through USB, so you can just hook it up to your Raspberry Pi, run your inferencing on the USB accelerator, and free up the CPU of your Raspberry Pi to do all the other cool stuff you want it to do.

It is based on the Myriad VPU: the first generation had a Myriad 2 VPU; this one has a Myriad X VPU. The VPU contains 16 SHAVE (Streaming Hybrid Architecture Vector Engine) cores, which are kind of like GPU cores but less tailored to graphics-related operations. It also features on-chip RAM, which is especially useful for neural network inferencing because these networks tend to create a lot of intermediate data as they compute; storing that data right next to the cores reduces access time drastically.

Google Coral USB Accelerator

The Google Coral (Figure 15-4), containing the Google Edge TPU, is the second USB accelerator we will discuss. First, a little explanation on why we are looking at two different USB accelerators. The Intel stick, as mentioned, has a number of SHAVE cores, which have multiple available instructions, like GPU cores, and thus act as a processing unit. The Google Edge TPU, on the other hand, is an ASIC (Application Specific Integrated Circuit), which also does some processing, but it serves a single purpose (hence the “Specific” keyword). An ASIC comes with a few properties inherent to the hardware, some of which are really nice, others less so:

Speed

Because all electronic circuits inside the TPU serve a single purpose, there is no overhead in terms of decoding operations. You pump in input data and weights, and it gives you a result, almost instantly.

Efficiency

All ASICs serve a single purpose, so no energy is wasted on general-purpose overhead. The performance/watt figure for ASICs is usually the highest in the business.

Flexibility

An ASIC can do only what it was designed for; in this case, that would be accelerating TensorFlow Lite neural networks. You will need to stick to the Google Edge TPU compiler and 8-bit .tflite models.

Complexity

Google is, in essence, a software company and it knows how to make things easy to use. That is exactly what it has done here, as well. The Google Coral is incredibly easy to get started with.

So how does the Google Edge TPU work? Information on the Edge TPU itself has not been shared, but information about the Cloud TPU has, so the assumption is that the Edge TPU works in broadly the same way as the Cloud TPU. It has dedicated hardware to do the multiply-add, activation, and pooling operations. All the transistors on the chip have been connected in such a way that it can take weights and input data, and it will calculate the output in a highly parallel fashion. The single biggest part of the chip (apart from the on-chip memory, that is) is a part that does exactly what it sounds like: the “Matrix Multiply Unit.” It uses a rather clever, though not so new, principle called systolic execution. This execution principle helps lower memory bandwidth by storing intermediate results in the processing elements rather than in memory.

Figure 15-4. Google Coral USB Accelerator
Size 65 mm x 30 mm
Price $74.99
Processing unit Google Edge TPU
Power rating 2.5W
Note

What’s the main difference between the two USB accelerators we’ve discussed? The Google Coral is more powerful, but a little less flexible than the Intel Movidius NCS2. That being said, the Coral is far easier to set up and work with, certainly after you have the trained and converted model.

NVIDIA Jetson Nano

On to a different kind of hardware: the NVIDIA Jetson Nano (Figure 15-5), a single-board AI computer. It’s kind of like the Raspberry Pi, but with a 128-core CUDA-enabled Maxwell GPU. The addition of the GPU makes it kind of similar to a Raspberry Pi with an Intel NCS2, but where the NCS2 has 16 cores, this Jetson has 128. What more does it contain? A quad-core ARM Cortex-A57 CPU, the predecessor of the Cortex-A72 in the Raspberry Pi 4 (the Jetson Nano also predates the Raspberry Pi 4 by a few months), and 4 GB of Low-Power Double Data Rate 4 (LPDDR4) RAM, which is, quite conveniently, shared between the CPU and GPU, allowing you to process data on the GPU without copying it.

Figure 15-5. NVIDIA Jetson Nano
Size 100 mm x 79 mm
Price $99.00
Processing unit ARM A57 + 128 core Maxwell GPU
Power rating 10W (Can spike higher under high load)

The thing about this single-board computer is that, as mentioned before, these GPU cores are CUDA enabled, so yes, they can run TensorFlow-GPU or any other GPU-accelerated framework, which makes a big difference compared to a Raspberry Pi, especially when you also plan to train networks. Also, it’s dirt cheap for a GPU. At this point in time, the Jetson Nano is yours for $99, and this includes a rather high-performance ARM CPU and a 128-core GPU. In comparison, the Raspberry Pi 4 with 4 GB of memory is around $55, the Coral USB Accelerator is around $75, and the Movidius NCS2 is about $90. The latter two are not standalone and will need at least an additional Raspberry Pi to actually do something, and the Pi has no GPU that can easily accelerate deep learning applications.

One more note about the Jetson Nano: it can accelerate default TensorFlow 32-bit floating-point operations, but it becomes much more efficient when 16-bit floating-point operations are used, and even more efficient if you use NVIDIA’s own TensorRT framework. Luckily, the company has a nice little open source library called TensorFlow-TensorRT (TF-TRT) that automatically accelerates the supported operations with TensorRT while letting TensorFlow handle the rest. This library offers speedups of around four times compared to plain TensorFlow-GPU. All this makes the Jetson Nano easily the most flexible device of the bunch.

FPGA + PYNQ

Now is the time to fasten your seatbelts, because we are about to take a deep dive into the dark world of electronics. The PYNQ platform (Figure 15-6), based on the Xilinx Zynq family of chips, is, for the most part, a totally different side of the electronics world compared to the other devices discussed in this chapter. If you do a bit of research, you’ll find out it has a dual-core ARM Cortex-A9 CPU, at a whopping 667 MHz. The first thing you’ll think is “Are you serious? That is ridiculous compared to the 1.5 GHz quad-core A72 from the Raspberry Pi?!” And you’d be right: the CPU in this thing is, for the most part, absolutely worthless. But it has something else on board—an FPGA.

Figure 15-6. Xilinx PYNQ-Z2
Size 140 mm x 87 mm
Price $119.00
Processing unit Xilinx Zynq-7020
Power rating 13.8W

FPGAs

To understand what an FPGA is, we need to first look at some familiar concepts. Let’s begin with a CPU: a device that implements a fixed set of operations known as its Instruction Set Architecture (ISA). This set defines everything the CPU can do, and usually contains operations such as “Load Word” (a word is typically a number of bits equal to the datapath size of the CPU, usually 32 or 64 bits), which loads some value into an internal register of the CPU, and “Add,” which, for example, adds up two of the internal registers and stores the result in a third register.

The reason the CPU can do this is because it contains a whole bunch of transistors (think of them as electrical switches if you like) that are hardwired in such a way that the CPU automatically decodes each operation and does whatever that operation was intended to do. A software program is just a really long list of these operations, in a precisely thought-out order. A single-core CPU will take in one operation after another and carry them out, at pretty impressive speeds.

Let’s look at parallelism. Neural networks, as you know, consist mostly of convolutional or linear nodes, which can all be translated to matrix multiplications. If we look at the mathematics behind these operations, we can spot that each output point of a layer can be calculated independently from the other output points, as demonstrated in Figure 15-7.

Figure 15-7. Matrix multiplication

We can now see that all of these operations could actually be done in a parallel fashion. Our single-core CPU won’t be able to do so, because it can do only one operation at a time, but that’s where multicore CPUs come in. A quad-core CPU can do four operations at a time, which means a theoretical speedup of four times. The SHAVE cores in the Intel Movidius NCS2 and the CUDA cores in the Jetson Nano might not be as complex as a CPU core, but they are good enough for these multiplications and additions, and instead of having four of these, the NCS2 has 16, and the Jetson Nano has 128. A bigger GPU like an RTX 2080 Ti even has 4,352 CUDA cores. It’s easy to see now why GPUs are better at performing deep learning tasks compared to CPUs.
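To make this concrete, here’s a tiny NumPy sketch (purely illustrative, not part of the benchmark code) showing that every element of a matrix product depends only on one row and one column of the inputs, which is exactly why all of them could be computed at the same time:

import numpy as np

A = np.random.rand(4, 3)   # think: a batch of flattened input patches
B = np.random.rand(3, 2)   # think: the layer's weights

# Each output element C[i, j] is an independent dot product of row i of A
# and column j of B, so all of them could be handed to separate cores or
# processing elements and computed in parallel.
C = np.empty((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        C[i, j] = np.dot(A[i, :], B[:, j])

assert np.allclose(C, A @ B)   # same result as the optimized library routine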

Let’s get back to FPGAs. Whereas CPUs and GPUs are huge collections of transistors hardwired to carry out a set of instructions, you can think of an FPGA as that same huge collection of transistors, but not wired. You can choose how they are wired, and rewire them whenever you want, making them reconfigurable. You can wire them to be a CPU. You can also find schematics and projects for which people have wired them to be a GPU. But most interesting here is that they can even be wired to the exact same architecture as your deep learning neural network, which would actually make them a physical implementation of your network. The word “wired” is used intentionally here. Often, you’ll find people talking about this configuration with the word “program,” but this can be confusing. What you’re doing is reconfiguring the actual hardware, unlike a CPU or GPU for which you download a program that the hardware can run.

PYNQ platform

We usually call the “wiring” file for an FPGA a bitstream or a bitmap, because what you flash onto the chip is basically just a map of the connections that should be made. As you can imagine, making these bitstreams is quite a lot more complicated than running a Python script. That’s where PYNQ comes in. Its tagline is “Python productivity for Zynq.” You install the PYNQ image, which automatically runs a Jupyter Notebook on the ARM CPU inside the chip, and it comes with a few basic bitstreams; more will likely become available in the future. Within the PYNQ world, these bitstreams are called overlays.

If you go looking for examples, you’ll quickly find one called “BNN-PYNQ,” which has a simple VGG-style, six-layer CNN that can run at 3,000-ish FPS on the PYNQ-Z1 and Z2, and close to 10,000 FPS on the ZCU104, which has the Ultrascale+ version of the Zynq chip onboard. These numbers look pretty insane, but take into account that they are for 32 x 32 pixel images rather than the usual 224 x 224 pixel images, and that this network is “binarized,” which means it has weights and activations of one bit, instead of the 32-bit floating-point values TensorFlow uses by default. To have a better comparison of performance, I tried to recreate a similar network in Keras.

I built the FP32 model and trained it on the CIFAR-10 dataset, which is easily available in Keras via keras.datasets.cifar10. The FP32 model reached a roughly 17% error rate, which is, surprisingly, only 2% better than the binary model. Inference speeds are around 132 FPS on an Intel i9 eight-core CPU. There is, to my knowledge, no easy way of using a binary network efficiently on a CPU—you’ll need to dig into some specific Python packages or some C code to get the most out of the hardware. You could potentially achieve a speedup of three to five times. This would, however, still be far less than a low-end FPGA, and a CPU will usually draw more power in doing so. Of course, everything has a flip side, and for FPGAs that has to be the complexity of the design. Open source frameworks exist, namely the FINN framework, backed by Xilinx, and other manufacturers offer additional frameworks, but none of them come close to the ease of use of software frameworks like TensorFlow. Designing a neural network for an FPGA also involves a lot of electronics knowledge, and thus is beyond the scope of this book.
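For reference, the FP32 Keras baseline mentioned above looked roughly like the following sketch. The exact filter counts, padding, and training settings here are assumptions rather than the precise configuration behind the reported numbers:

from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

# Load and normalize CIFAR-10 (32 x 32 color images, 10 classes).
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

# A small VGG-style CNN with six convolutional layers.
model = Sequential([
    Conv2D(64, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(),
    Conv2D(128, (3, 3), activation='relu'),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D(),
    Conv2D(256, (3, 3), activation='relu'),
    Conv2D(256, (3, 3), activation='relu'),
    Flatten(),
    Dense(512, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=30, batch_size=64,
          validation_data=(x_test, y_test))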

Arduino

Arduino (Figure 15-8) can make your life a lot easier when trying to interact with the real world through sensors and actuators.

An Arduino is a development board built around a microcontroller unit (MCU); in the case of the Arduino UNO, that MCU is the 8-bit Advanced Virtual RISC (AVR) ATMega328p running at 16 MHz (so yeah, let’s not try to run a decent neural network on that). A microcontroller is something everyone has come into contact with: you can find microcontrollers in almost everything that is even slightly electronic, from every screen or TV to your desktop keyboard; heck, even bicycle lights may contain a microcontroller to enable the flashing mode.

These boards come in a variety of shapes, sizes, and performance levels, and although the MCU in the Arduino UNO is pretty low performance, more potent MCUs are available. The ARM Cortex-M series is probably the most well known. Recently, TensorFlow and uTensor (an extremely lightweight machine learning inference framework aimed at MCUs) joined forces to make it possible to run TensorFlow models on these MCUs.

Figure 15-8. Arduino UNO R3
Size 68.6 mm x 53.3 mm
Price Less than $10.00
Processing unit AVR ATMega328p
Power rating 0.3W

Although the Arduino UNO is not really useful for performing lots of calculations, it has many other benefits that make it a mandatory part of the workbench of a maker. It’s cheap, it’s simple, it’s reliable, and it has a huge community surrounding it. Most sensors and actuators have some kind of library and shield for Arduino, which makes it a really easy platform to interact with. The AVR architecture is a kind of dated 8-bit modified Harvard architecture, but it’s really easy to learn, especially with the Arduino framework on top of it. The huge community around Arduino obviously brings a lot of tutorials with it, so just look up any Arduino tutorial for your sensor of choice, and you’ll surely find what you need for your project.

Tip

Using Python’s serial library, you can easily interface with an Arduino using a Universal Asynchronous Receiver/Transmitter (UART) through a USB cable to send commands or data back and forth.
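For instance, a minimal sketch of sending a command to an Arduino and reading its reply could look like this; the port name, baud rate, and the LED_ON command are assumptions that depend on your operating system and on the sketch running on the Arduino:

import time
import serial   # the pyserial package

ser = serial.Serial('/dev/ttyACM0', 9600, timeout=1)   # often COM3 on Windows
time.sleep(2)                    # give the Arduino time to reset after connecting

ser.write(b'LED_ON\n')           # send a command your Arduino sketch understands
response = ser.readline()        # read one line sent back by the Arduino
print(response.decode(errors='ignore').strip())

ser.close()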

A Qualitative Comparison of Embedded AI Devices

Table 15-2 summarizes the platforms we’ve talked about, and Figure 15-9 plots performance versus complexity for each device.

Table 15-2. Pros and cons of tested platforms
Raspberry Pi 4
  Pros:
    • Easy
    • Large community
    • Readily available
    • Cheap
  Cons:
    • Lacks computing power

Intel Movidius NCS2
  Pros:
    • Easy optimizations
    • Supports most platforms
  Cons:
    • Rather expensive for the speedup
    • Not so powerful

Google Coral USB
  Pros:
    • Easy getting started
    • Huge speedup
  Cons:
    • Price versus speedup
    • Supports only .tflite
    • Needs eight-bit quantization

NVIDIA Jetson Nano
  Pros:
    • All-in-one
    • Training is possible
    • Cheap
    • Really easy to achieve decent performance
    • CUDA GPU
  Cons:
    • Performance might still be lacking
    • Advanced optimizations can be complex

PYNQ-Z2
  Pros:
    • Potentially huge power efficiency
    • Actual hardware design
    • Potentially huge performance
    • Even more versatile than the Jetson Nano
  Cons:
    • Massively complex
    • Long design flow

Figure 15-9. Performance versus complexity-to-use (the size of the circle represents price)

In this section, we test some of these platforms by performing 250 classifications of the same image using MobileNetV2 trained on the ImageNet dataset. We use the same image every time because this allows us to eliminate part of the data bottlenecks that can occur in systems that don’t have enough RAM to store all the weights and lots of different images.

OK, let’s find an image first. Who doesn’t like cats? So, let’s use an image of a beautiful cat (Figure 15-10) that is exactly 224 x 224 pixels; that way, we don’t need to scale the image.

Figure 15-10. The image of the cat we will be using for our experiments

It’s always nice to have a benchmark, so let’s run the model on our PC first. Chapter 2 explains how to do this, so let’s go ahead and write a script that will invoke the prediction 250 times and measure how long this takes. You can find the full script on the book’s GitHub website (see http://PracticalDeepLearning.ai) at code/chapter-15.
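In essence, it does something like the following simplified sketch (the image and model file names here are assumptions, and the actual script may differ in its details):

import time
import numpy as np
from keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input, \
    decode_predictions
from keras.preprocessing import image

model = MobileNetV2(weights='imagenet')               # 224 x 224 ImageNet model

img = image.load_img('cat.jpg', target_size=(224, 224))
input_tensor = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
print('input tensor shape :', input_tensor.shape)

# Warmup prediction: the very first call is slower due to lazy initialization.
print(decode_predictions(model.predict(input_tensor), top=1))

start = time.time()
for _ in range(250):
    model.predict(input_tensor)
elapsed = time.time() - start
print('Time[s] : %.3f' % elapsed)
print('FPS     : %.3f' % (250 / elapsed))

model.save('mobilenetv2.h5')                          # reused later for conversion

Running it on the PC gives us our baseline: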

$ python3 benchmark.py
Using TensorFlow backend.
tf version : 1.14.0
keras version : 2.2.4
input tensor shape : (1, 224, 224, 3)
warmup prediction
[('n02124075', 'Egyptian_cat', 0.6629321)]
starting now...
Time[s] : 16.704
FPS     : 14.968
Model saved.

Now that we’ve established our benchmark, it’s time to begin looking at how to run this model on the embedded devices and see whether and how we can take the performance to a useful level.

Hands-On with the Raspberry Pi

We begin with the most well-known and most basic of the previously discussed devices: the Raspberry Pi (RPi for short).

We start by installing Raspbian, a Linux variant made especially for the Raspberry Pi. For the sake of brevity, let’s assume that we have a Raspberry Pi with everything installed, updated, connected, and ready. We can go straight to installing TensorFlow on the Pi, which should work using the pip installer:

$ pip3 install tensorflow
Note

Due to the recent switch to Raspbian Buster, we encountered some problems, which were resolved by using the following piwheel:

$ pip3 install https://www.piwheels.org/simple/tensorflow/tensorflow-1.13.1-cp37-none-linux_armv7l.whl

Installing Keras is straightforward with pip, but don’t forget to install libhdf5-dev, which you will need if you want to load weights into a neural network:

$ pip3 install keras
$ apt install libhdf5-dev

Because installing OpenCV (called in code as cv2) on the RPi can be a burden (especially right after a switch of OS version), we can load the image using PIL instead of OpenCV. This means replacing the import of cv2 with an import for PIL, and changing the image-loading code to use this library:

#input_image = cv2.imread(input_image_path)
#input_image = cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)
import numpy as np   # to convert the PIL image to an array
import PIL.Image     # provided by the Pillow package

input_image = PIL.Image.open(input_image_path)
input_image = np.asarray(input_image)
Tip

Unlike OpenCV, PIL loads images in the RGB format, so a conversion from BGR to RGB is no longer needed.

And that’s it! The exact same script should run on your Raspberry Pi now:

$ python3 benchmark.py
Using TensorFlow backend.
tf version : 1.14.0
keras version : 2.2.4
input tensor shape : (1, 224, 224, 3)
warmup prediction
[('n02124075', 'Egyptian_cat', 0.6629321)]
starting now...
Time[s] : 91.041
FPS     : 2.747

As you can see, it runs a lot slower, dropping to less than three FPS. Time to think about how we can speed this up.

The first thing we could do is have a look at TensorFlow Lite. TensorFlow has, as discussed in Chapter 13, a converter built in, which lets us easily convert any model to a TensorFlow Lite (.tflite) model.

We proceed by writing a small script, very similar to the original benchmark script, which will just set up everything to use the TensorFlow Lite model and run the 250 predictions. This script is, of course, also available on the book’s GitHub website (see http://PracticalDeepLearning.ai).
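The core of that script looks roughly like this (a condensed sketch; it assumes the Keras model was saved earlier as mobilenetv2.h5, a file name chosen purely for illustration):

import time
import numpy as np
import tensorflow as tf

# Convert the saved Keras model to a TensorFlow Lite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model_file('mobilenetv2.h5')
# converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional: post-training quantization
tflite_model = converter.convert()
print('conversion to tflite is done')

# Run the converted model with the TensorFlow Lite interpreter.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Use the preprocessed cat image here; random data is shown only as a placeholder.
input_tensor = np.random.rand(1, 224, 224, 3).astype(np.float32)

start = time.time()
for _ in range(250):
    interpreter.set_tensor(input_details[0]['index'], input_tensor)
    interpreter.invoke()
    predictions = interpreter.get_tensor(output_details[0]['index'])
elapsed = time.time() - start
print('FPS :', 250 / elapsed)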

Let’s go ahead and run this so that we can evaluate our effort to speed things up.

$ python3 benchmark_tflite.py
Using TensorFlow backend.
input tensor shape : (1, 224, 224, 3)
conversion to tflite is done
INFO: Initialized TensorFlow Lite runtime.
[[('n02124075', 'Egyptian_cat', 0.68769807)]]
starting now (tflite)...
Time[s]  : 32.152
FPS      : 7.775
Note

The Google Coral website has 8-bit quantized CPU models available, which you can run using TensorFlow Lite.

For the Raspberry Pi 4, we observed a threefold increase in speed (see Table 15-3). Not bad for a quick try, but we still only achieved 7.8 FPS, which is perfectly fine for a lot of applications. But what if our original plan was to do something on live video, for example? Then this is not enough. Quantizing the model can squeeze out even more performance, but the increase will only be marginal given that the CPU in the Raspberry Pi has not been optimized for 8-bit integer operations.

Table 15-3. Raspberry Pi benchmarks
Setup FPS
Raspberry Pi 4 (tflite, 8-bit) 8.6
Raspberry Pi 4 (tflite) 7.8
Raspberry Pi 3B+ (tflite, 8-bit) 4.2
Raspberry Pi 4 2.7
Raspberry Pi 3B+ 2.1

So how can we speed things up even more so that they become useful for applications like autonomous racecars or automatic drones? We touched on it before: the USB accelerators.

Speeding Up with the Google Coral USB Accelerator

That’s right, these accelerators come in quite handy right now. All of our hardware setup can stay exactly the same; we just need to add a little bit here and there. So let’s find out: how do we get the Google Coral USB Accelerator to run on the Raspberry Pi, and will it speed things up?

First things first. Assuming that we have a USB accelerator on our hands, the first thing we should always do is look for a “Getting Started” guide from the vendor. Check out https://coral.withgoogle.com/, head to Docs>USB Accelerator>Get Started, and you’re rolling in mere minutes.

As of this writing, the Coral USB Accelerator is not yet fully compatible with the Raspberry Pi 4, but after fiddling around with the install.sh script, it is pretty easy to get it running. You simply need to add a section to the install script so that it recognizes the Pi 4, as demonstrated in Figure 15-11.

Figure 15-11. Google Coral install script changes

Now the install.sh script will run correctly. Then, we need to copy a .so file to a new name so that it will also work with Python 3.7. To do this, use the Unix copy command:

$ sudo cp /usr/local/lib/python3.7/dist-packages/edgetpu/swig/_edgetpu_cpp_wrapper.cpython-35m-arm-linux-gnueabihf.so \
          /usr/local/lib/python3.7/dist-packages/edgetpu/swig/_edgetpu_cpp_wrapper.cpython-37m-arm-linux-gnueabihf.so

After you do this, everything should work correctly, and the demonstrations from Google should run.

Now that we have the Coral running, let’s try to benchmark the same MobileNetV2 model again. But this time, it must be the quantized version of it, because Google’s Edge TPU supports only 8-bit integer operations. Google supplied us with the quantized and converted MobileNetV2 model, ready to use with the Edge TPU. We can download it from the Coral website under Resources > See pre-compiled models > MobileNetV2(ImageNet) > Edge TPU Model.

Note

If you want to create, train, quantize, and convert your own model for the Edge TPU, this is a bit more work and cannot be done using Keras. You can find information on how to do this on the Google Coral website.

We have a working USB accelerator and a model to run on it. Now, let’s make a new script to test its performance. The file we are about to make is again very similar to the previous one and borrows heavily from the examples Google supplied with the Coral. And, as always, this file is available for download on the book’s GitHub website (see http://PracticalDeepLearning.ai).
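Conceptually, the inference loop looks something like this rough sketch, which uses the TensorFlow Lite runtime with the Edge TPU delegate; the model file name is an assumption, and Google’s own example code may use the edgetpu Python API instead:

import time
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the precompiled Edge TPU model and attach the Edge TPU delegate.
interpreter = tflite.Interpreter(
    model_path='mobilenet_v2_1.0_224_quant_edgetpu.tflite',
    experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')])
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# The Edge TPU model is 8-bit quantized, so it expects uint8 input
# (use the actual cat image here; random data is only a placeholder).
input_tensor = np.random.randint(0, 256, size=(1, 224, 224, 3), dtype=np.uint8)

start = time.time()
for _ in range(250):
    interpreter.set_tensor(input_details[0]['index'], input_tensor)
    interpreter.invoke()
    predictions = interpreter.get_tensor(output_details[0]['index'])
elapsed = time.time() - start
print('Time[s] : %.3f' % elapsed)
print('FPS     : %.3f' % (250 / elapsed))

And here’s what running the benchmark looks like: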

$ python3 benchmark_edgetpu.py
INFO: Initialized TensorFlow Lite runtime.
warmup prediction
Egyptian cat
0.59375
starting now (Edge TPU)...
Time[s] : 1.042
FPS     : 240.380

Yes, that is correct. It did all 250 classifications already! For the Raspberry Pi 4 with USB3, it took only 1.04 seconds, which translates to 240.38 FPS! Now that’s a speedup. As you can expect with these speeds, live video would be no problem at all.

Lots of precompiled models are available, for all kinds of different purposes, so check them out. You might be able to find a model that suits your needs so that you don’t need to go through the flow (or struggle) to create, train, quantize, and convert your own.

The Raspberry Pi 3 has no USB3 ports, only USB2. We can clearly see in Table 15-4 that this creates a data bottleneck for the Coral.

Table 15-4. Google Coral benchmarks
Setup FPS
i7-7700K + Coral (tflite, 8-bit) 352.1
Raspberry Pi 4 + Coral (tflite, 8-bit) 240.4
Jetson Nano + Coral (tflite, 8-bit) 223.2
RPi3 + Coral (tflite, 8-bit) 75.5

Port to NVIDIA Jetson Nano

That USB accelerator is nice, but what if your project needs a really niche model, trained on some crazy, self-made dataset? As we said earlier, creating a new model for the Coral Edge TPU is quite a bit more work than just creating a Keras model. So, is there an easy way to have a custom model running on the edge with some decent performance? NVIDIA to the rescue! With its Jetson Nano, the company has created a replacement for the Raspberry Pi that has a CUDA-enabled, and rather efficient, GPU on board, which lets you accelerate not only TensorFlow but anything you like.

Again, we must begin by installing all the needed packages. First, you want to download a copy of NVIDIA’s JetPack, which is its version of a Debian-like OS for the Jetson platform. Then, to install TensorFlow and the needed packages, follow the walkthrough that NVIDIA provides in its Jetson documentation.

Note

Note there is a known issue with pip3: “cannot import name ‘main’”.

Here’s the solution:

$ sudo apt install nano
$ sudo nano /usr/bin/pip3

 
Replace : main => __main__
Replace : main() => __main__._main()
Tip

Updating all pip packages can take a long time; if you want to ensure that your Jetson did not freeze, you might want to first install htop:

$ sudo apt install htop
$ htop

This lets you monitor CPU utilization. As long as this is working, your Jetson is, too.

Note

The Jetson Nano tends to run rather hot when compiling, so we recommend using some kind of fan. The development board has a connector, but keep in mind this will feed 5V to the fan, whereas most fans are rated for 12V.

Some 12V fans might do the trick out of the box; others might need a little push to get started. You will need a 40 mm x 40 mm fan. 5V versions are also available.

Unfortunately, the NVIDIA walkthrough is not as easy as the Google one. Chances are you will go through a bit of a struggle with the occasional table-flip moment.

But hang in there; you’ll get there eventually.

Tip

Installing Keras on the Jetson requires SciPy, which in turn requires libatlas-base-dev and gfortran, so install those dependencies first and work your way up:

$ sudo apt install libatlas-base-dev gfortran
$ sudo pip3 install scipy
$ sudo pip3 install keras

After everything is done, we can choose to run the benchmark.py file directly, which will be, because of the GPU, a lot faster than it was on the Raspberry Pi:

$ python3 benchmark.py
Using TensorFlow backend.
tf version : 1.14.0
keras version : 2.2.4
input tensor shape : (1, 224, 224, 3)
warmup prediction
[('n02124075', 'Egyptian_cat', 0.6629321)]
starting now...
Time[s] : 20.520
FPS     : 12.177

This immediately shows the power of the Jetson Nano: it runs almost exactly like any other PC with a GPU. However, 12 FPS is still not huge, so let’s look at how to optimize this.

Tip

You also can attach the Google Coral to your Jetson Nano, which opens up the possibility to run one model on the Coral, and another one on the GPU simultaneously.

The GPU of the Jetson is actually built specifically for 16-bit floating-point operations, so inherently, this will be the best trade-off between precision and performance. As mentioned earlier, NVIDIA has a package called TF-TRT, which facilitates optimization and conversion. However, it is still a bit more complex than the Coral example (keep in mind that Google supplied us with the precompiled model file for the Coral). You’ll need to freeze the Keras model and then create a TF-TRT inference graph. Doing this can take quite a while if you need to find everything lying around on GitHub, so it’s bundled on this book’s GitHub website (see http://PracticalDeepLearning.ai).

You can use the tftrt_helper.py file to optimize your own models for inferencing on the Jetson Nano. All it really does is freeze the Keras model, remove the training nodes, and then use NVIDIA’s TF-TRT Python package (included in TensorFlow Contrib) to optimize the model.
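The essential steps look roughly like this (a condensed sketch assuming the TensorFlow 1.x contrib API that shipped on the Jetson at the time; the variable names and workspace size are illustrative):

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
from keras import backend as K
from keras.applications.mobilenet_v2 import MobileNetV2

model = MobileNetV2(weights='imagenet')
output_names = [out.op.name for out in model.outputs]

# Freeze the Keras model: turn variables into constants and strip
# training-only nodes from the graph.
sess = K.get_session()
frozen_graph = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(), output_names)
frozen_graph = tf.graph_util.remove_training_nodes(frozen_graph)

# Let TF-TRT replace the supported subgraphs with TensorRT engines,
# using 16-bit floating point, which the Nano's GPU is built for.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16')

# trt_graph can now be imported with tf.import_graph_def and run as usual.

With the optimized graph in place, the benchmark looks a lot better: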

$ python3 benchmark_jetson.py
FrozenGraph build.
TF-TRT model ready to rumble!
input tensor shape : (1, 224, 224, 3)
[[('n02124075', 'Egyptian_cat', 0.66293204)]]
starting now (Jetson Nano)...
Time[s] : 5.124
FPS     : 48.834

48 FPS, that’s a speedup of four times with just a few lines of code. Exciting, isn’t it? You can achieve similar speedups on your own custom models, tailored to the specific needs of your project; these performance benchmarks can be found in Table 15-5.

Table 15-5. NVIDIA Jetson Nano benchmarks
Setup FPS
Jetson Nano + Coral (tflite, 8-bit) 223.2
Jetson Nano (TF-TRT, 16-bit) 48.8
Jetson Nano 128CUDA 12.2
Jetson Nano (tflite, 8-bit) 11.3

Comparing the Performance of Edge Devices

Table 15-6 shows a quantitative comparison of several edge devices running the MobileNetV2 model.

Table 15-6. Full benchmarking results
Setup FPS
i7-7700K + Coral (tflite, 8-bit) 352.1
i7-7700K + GTX1080 2560CUDA 304.9
Raspberry Pi 4 + Coral (tflite, 8-bit) 240.4
Jetson Nano + Coral (tflite, 8-bit) 223.2
RPi3 + Coral (tflite, 8-bit) 75.5
Jetson Nano (TF-TRT, 16-bit) 48.8
i7-7700K (tflite, 8-bit) 32.4
i9-9880HQ (2019 MacBook Pro) 15.0
Jetson Nano 128CUDA 12.2
Jetson Nano (tflite, 8-bit) 11.3
i7-4870HQ (2014 MacBook Pro) 11.1
Jetson Nano (tflite, 8-bit) 10.9
Raspberry Pi 4 (tflite, 8-bit) 8.6
Raspberry Pi 4 (tflite) 7.8
RPi3B+ (tflite, 8-bit) 4.2
Raspberry Pi 4 2.7
RPi3B+ 2.1

We can summarize the takeaways from these experiments as follows:

  1. Good optimizations make a big difference. The world of computing optimization is still growing rapidly, and we will see big things coming up in the next few years.

  2. When it comes to computing units for AI, more is always better: more cores mean more parallelism and thus more throughput.

  3. ASIC > FPGA > GPU > CPU in terms of performance/watt.

  4. TensorFlow Lite is awesome, especially for small CPUs. Keep in mind that it is specifically aimed at small CPUs, so using it on x64 machines will not bring much benefit.

Case Studies

This chapter is about makers. And it would be incomplete without showcasing what they make. Let’s look at a few things that have been created by running AI on the edge.

JetBot

NVIDIA has been at the forefront of the deep learning revolution, enabling researchers and developers with powerful hardware and software. And in 2019, they took the next step to enable makers, too, by releasing the Jetson Nano. We already know the power of this hardware, so now it’s time to build something with it—like a DIY miniature car. Luckily, NVIDIA has us covered with JetBot (Figure 15-12), the open source robot that can be controlled by an onboard Jetson Nano.

The JetBot wiki features beginner-friendly step-by-step instructions on building it. Here’s a high-level look:

  1. Buy the parts specified in the Bill of Materials, like the motor, caster, camera, and WiFi antenna. There are about 30 parts, costing around $150, on top of the $99 Jetson Nano.

  2. It’s time to channel our inner MacGyver, get a set of pliers and a cross-tip screwdriver, and assemble the parts. The end result should be something resembling Figure 15-12.

  3. Next, we’d set up the software. This involves flashing the JetBot image onto an SD card, booting the Jetson Nano, connecting it to WiFi, and finally connecting to our JetBot from a web browser.

  4. Finally, we run through the example notebooks provided. These notebooks enable us, with very little code, to not only control the bot from a web browser, but also use its camera to collect training data from streaming video, train a deep learning model right on the device (the Jetson Nano is a mini-GPU after all), and use it for tasks such as avoiding obstacles and following objects such as a person or a ball.

Figure 15-12. NVIDIA JetBot

Makers are already using this base framework to extend the JetBot’s capabilities, such as attaching lidar sensors to it for understanding the environment better, and bringing JetBot to different physical forms like tanks, Roomba vacuum cleaners (JetRoomba), object-following spiders (JetSpider), and more. Because all roads eventually lead to racing cars, the Jetson Nano team later released a similar open source recipe book for a racing car called JetRacer. It’s based on a faster chassis, a higher camera frame rate, and optimizations for inference (with TensorRT) to handle the speed. The end result: JetRacer is already being raced at events like the DIY Robocars meetups.

The slowest step here is…waiting for the hardware to arrive.

Squatting for Metro Tickets

When good health can’t motivate people to exercise, what else would? The answer turns out to be…free train tickets! To raise public awareness for its growing obesity crisis, Mexico City installed ticket machines with cameras at subway stations that give free tickets, but only if you exercise. Ten squats get a free ticket (Figure 15-13). Moscow, too, implemented a similar system at its metro stations, but apparently, officials there had a much higher fitness standard at 30 squats per person.

Inspired by these, we could foresee how to build our own “squat-tracker.” We could achieve this in multiple ways, the simplest being to train our own classifier with the classes “squatting” and “not squatting.” This would clearly involve building a dataset with thousands of pictures of people either squatting or not squatting.

A much better way to accomplish this would involve running PoseNet (which tracks body joints, as we saw in Chapter 10) on, say, Google Coral USB Accelerator. With its 10-plus FPS, we could track how many times the hip points drop low enough and count much more accurately. All we would need is a Raspberry Pi, a Pi Camera, a Coral USB Accelerator, and a public transport operator to provide a ticket printer and endless free tickets, and we could start making our cities fitter than ever.
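As a thought experiment, the counting logic itself could be as simple as the following toy sketch. Here, hip_y_stream stands in for per-frame hip keypoint readings from a pose estimator such as PoseNet (normalized so that a larger value means the hips are lower in the frame), and the thresholds are made-up values that would need tuning:

SQUAT_DOWN_THRESHOLD = 0.7   # assumed values; would need tuning on real data
STAND_UP_THRESHOLD = 0.5

def count_squats(hip_y_stream, target=10):
    """Return True once `target` full squats are seen in the keypoint stream."""
    squats, in_squat = 0, False
    for hip_y in hip_y_stream:            # one hip reading per processed frame
        if not in_squat and hip_y > SQUAT_DOWN_THRESHOLD:
            in_squat = True               # hips dropped low enough
        elif in_squat and hip_y < STAND_UP_THRESHOLD:
            in_squat = False              # back upright: that's one full squat
            squats += 1
            if squats >= target:
                return True               # time to print the free ticket
    return False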

Figure 15-13. Squatting for tickets

Cucumber Sorter

Makoto Koike, the son of cucumber farmers in Japan, observed that his parents spent a lot of time sorting cucumbers after each harvest. The sorting required separating cucumbers into nine categories based on fine-grained analysis of very small features such as minute texture differences, tiny scratches, and prickles, along with larger attributes such as size, thickness, and curvature. The system was complicated, and hiring part-time workers was pretty much a no-go because of how long it would take to train them. He couldn’t sit idly by, watching his parents go through this labor-intensive task for more than eight hours every single day.

One detail we didn’t mention yet is that Mr. Koike was formerly an embedded systems designer at Toyota. He realized that a lot of the visual inspection could be automated using the power of deep learning. He took more than 7,000 pictures of different cucumbers that his mother had manually sorted over a three-month period. He then trained a classifier that could look at a picture of a cucumber and predict with a high degree of accuracy to which category it belonged. But a classifier is not going to do much on its own now, will it?

Using his experience in embedded systems, he deployed a classifier to a Raspberry Pi that was connected to a camera, a conveyor-belt system, and a sorting arm that pushed each cucumber into one of the multiple bins based on the predictions from the Raspberry Pi (Figure 15-14). The Raspberry Pi ran a small cucumber/not-cucumber classifier as a first pass. For images classified as cucumber, the Raspberry Pi would send the image to a server that ran the more sophisticated multiple category cucumber classifier. Keep in mind that this was in 2015, shortly after TensorFlow was released (long before the days of TensorFlow Lite, MobileNet, and the extra power of the current Raspberry Pi 4 or even 3). With a machine like this, his parents could spend more time in actual farming rather than sorting through already harvested cucumbers for days on end.

Figure 15-14. Makoto Koike’s cucumber sorting machine (image source)

Even though the debate on AI taking over jobs appears daily in the news, Mr. Koike’s story truly highlights the real value that AI provides: making humans more productive and augmenting our abilities, along with making his parents proud.

Further Exploration

There are two equally important aspects to focus on when becoming a maker: the hardware and the software. One is incomplete without the other. There are more DIY projects to explore than one can build in a lifetime. But the key is to find some that you’re passionate about, build them end to end, and feel the excitement when they start to actually function. Gradually, you’ll develop the know-how so that the next time you read a project, you can guess how to build it yourself.

The best way to get started is to get hands-on and build more and more projects. Some excellent sources for inspiration include Hackster.io, Hackaday.io, and Instructables.com, which feature a large range of project themes, platforms, and skill levels (from beginner to advanced), often along with the step-by-step instructions.

To reduce the overhead of assembling hardware for beginners, several kits are available in the market. For example, AI kits for the Raspberry Pi include the AIY Vision Kit (Google), the AIY Speech Kit (Google), GoPiGo (Dexter Industries), and the DuckieTown platform (MIT), to name a few. They lower the barriers to getting started with electronic hardware projects, making this new area more approachable and hence usable for everyone, from high schoolers to rapid prototypers.

On the software side of things, running AI faster on low-power devices means making the models more performant. Model compression techniques like quantization and pruning (as mentioned in Chapter 6) can help get us there. To go even faster (at some loss in accuracy), binarized neural networks (BNNs), which represent weights in one bit instead of the usual 32 bits, are a potential solution. Companies like XNOR.ai and tools like Microsoft’s Embedded Learning Library allow us to make such models. For running AI on even lower-power devices like microcontrollers with less than 100 KB of memory, Pete Warden’s book TinyML is an excellent learning resource.

Summary

In this chapter, we introduced some well-known embedded AI devices and looked at how to run a Keras model on some of them. We presented a high-level view of their inner workings and how they achieve the computing capacity needed for running neural networks, and we benchmarked them to develop an intuition about which might be suitable for our projects, depending on size, cost, latency, and power requirements. Going from platforms to robotic projects, we then looked at some of the ways makers around the world are using these devices to build electronic projects. The devices discussed in this chapter are obviously just a handful of those available. New devices will keep popping up as the edge becomes a more desirable place to be for AI.

With the power of edge AI, makers can imagine and bring their dream projects to reality, which might have been considered science fiction just a few years ago. On your journey in this DIY AI world, remember, you are among a tight-knit community of tinkerers with a lot of openness and willingness to share. The forums are active, so if you ever get stuck on a problem, there are helpful souls to call upon. And hopefully, with your project, you can inspire others to become makers themselves.
