15 Deploying to production

This chapter covers

  • Options for deploying PyTorch models
  • Working with the PyTorch JIT
  • Deploying a model server and exporting models
  • Running exported and natively implemented models from C++
  • Running models on mobile

In part 1 of this book, we learned a lot about models; and part 2 left us with a detailed path for creating good models for a particular problem. Now that we have these great models, we need to take them where they can be useful. Maintaining infrastructure for executing inference of deep learning models at scale can be impactful from an architectural as well as a cost standpoint. While PyTorch started off as a framework focused on research, beginning with the 1.0 release, a set of production-oriented features was added that today makes PyTorch an ideal end-to-end platform from research to large-scale production.

What deploying to production means will vary with the use case:

  • Perhaps the most natural deployment for the models we developed in part 2 would be to set up a network service providing access to our models. We’ll do this in two versions using lightweight Python web frameworks: Flask (http://flask.pocoo.org) and Sanic (https://sanicframework.org). The former is arguably one of the most popular of these frameworks, and the latter is similar in spirit but takes advantage of Python’s new async/await support for asynchronous operations for efficiency.

  • We can export our model to a well-standardized format that allows us to ship it using optimized model processors, specialized hardware, or cloud services. For PyTorch models, the Open Neural Network Exchange (ONNX) format fills this role.

  • We may wish to integrate our models into larger applications. For this it would be handy if we were not limited to Python. Thus we will explore using PyTorch models from C++ with the idea that this also is a stepping-stone to any language.

  • Finally, for some things like the image zebraification we saw in chapter 2, it may be nice to run our model on mobile devices. While it is unlikely that you will have a CT module for your mobile, other medical applications like do-it-yourself skin screenings may be more natural, and the user might prefer running on the device versus having their skin sent to a cloud service. Luckily for us, PyTorch has gained mobile support recently, and we will explore that.

As we learn how to implement these use cases, we will use the classifier from chapter 14 as our first example for serving, and then switch to the zebraification model for the other bits of deployment.

15.1 Serving PyTorch models

We’ll begin with what it takes to put our model on a server. Staying true to our hands-on approach, we’ll start with the simplest possible server. Once we have something basic that works, we’ll take a look at its shortfalls and take a stab at resolving them. Finally, we’ll look at what is, at the time of writing, the future. Let’s get something that listens on the network.1

15.1.1 Our model behind a Flask server

Flask is one of the most widely used Python modules. It can be installed using pip:2

pip install Flask

The API can be created by decorating functions.

Listing 15.1 flask_hello_world.py:1

from flask import Flask
app = Flask(__name__)
 
@app.route("/hello")
def hello():
  return "Hello World!"
 
if __name__ == '__main__':
  app.run(host='0.0.0.0', port=8000)

When started, the application will run at port 8000 and expose one route, /hello, that returns the “Hello World” string. At this point, we can augment our Flask server by loading a previously saved model and exposing it through a POST route. We will use the nodule classifier from chapter 14 as an example.

We’ll use Flask’s (somewhat curiously imported) request to get our data. More precisely, request.files contains a dictionary of file objects indexed by field names. We’ll parse the metadata as JSON, and we’ll return a JSON string using Flask’s jsonify helper.

Instead of /hello, we will now expose a /predict route that takes a binary blob (the pixel content of the series) and the related metadata (a JSON object containing a dictionary with shape as a key) as input files provided with a POST request and returns a JSON response with the predicted diagnosis. More precisely, our server takes one sample (rather than a batch) and returns the probability that it is malignant.

In order to get at the data, we read the binary blob and decode it into a one-dimensional array with numpy.frombuffer. We then convert this to a tensor with torch.from_numpy and view it in its actual shape using the shape information from the metadata.

The actual handling of the model is just like in chapter 14: we’ll instantiate LunaModel from chapter 14, load the weights we got from our training, and put the model in eval mode. As we are not training anything, we’ll tell PyTorch that we will not want gradients when running the model by running in a with torch.no_grad() block.

Listing 15.2 flask_server.py:1

import numpy as np
import sys
import os
import torch
from flask import Flask, request, jsonify
import json
 
from p2ch13.model_cls import LunaModel
 
app = Flask(__name__)
 
model = LunaModel()                                                
model.load_state_dict(torch.load(sys.argv[1],
                 map_location='cpu')['model_state'])
model.eval()
 
def run_inference(in_tensor):
  with torch.no_grad():                                           
    # LunaModel takes a batch and outputs a tuple (scores, probs)
    out_tensor = model(in_tensor.unsqueeze(0))[1].squeeze(0)
  probs = out_tensor.tolist()
  out = {'prob_malignant': probs[1]}
  return out
 
@app.route("/predict", methods=["POST"])                          
def predict():
  meta = json.load(request.files['meta'])                         
  blob = request.files['blob'].read()
  in_tensor = torch.from_numpy(np.frombuffer(
    blob, dtype=np.float32))                                      
  in_tensor = in_tensor.view(*meta['shape'])
  out = run_inference(in_tensor)
  return jsonify(out)                                             
 
if __name__ == '__main__':
  app.run(host='0.0.0.0', port=8000)
  print (sys.argv[1])

Sets up our model, loads the weights, and moves to evaluation mode

No autograd for us.

We expect a form submission (HTTP POST) at the “/predict” endpoint.

Our request will have one file called meta.

Converts our data from binary blob to torch

Encodes our response content as JSON

Run the server as follows:

python3 -m p3ch15.flask_server data/part2/models/cls_2019-10-19_15.48.24_final_cls.best.state

We prepared a trivial client at cls_client.py that sends a single example. From the code directory, you can run it as

python3 p3ch15/cls_client.py
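At its core, such a client is a single POST with the two files our server expects. The following is a sketch of what it might look like using the requests library; the random array is a stand-in for a real candidate crop (its shape is an assumption here), and the actual cls_client.py may prepare its input differently.

import json
import numpy as np
import requests
 
series = np.random.rand(1, 32, 48, 48).astype(np.float32)   # stand-in for a real nodule crop
meta = {'shape': list(series.shape)}
 
response = requests.post(
    'http://localhost:8000/predict',
    files={'meta': json.dumps(meta),    # the JSON metadata part
           'blob': series.tobytes()})   # the raw pixel data
print(response.json())                  # something like {'prob_malignant': ...}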

It should tell you that the nodule is very unlikely to be malignant. Clearly, our server takes inputs, runs them through our model, and returns the outputs. So are we done? Not quite. Let’s look at what could be better in the next section.

15.1.2 What we want from deployment

Let’s collect some things we desire for serving models.3 First, we want to support modern protocols and their features. Old-school HTTP is deeply serial, which means when a client wants to send several requests in the same connection, the next requests will only be sent after the previous request has been answered. Not very efficient if you want to send a batch of things. We will partially deliver here--our upgrade to Sanic certainly moves us to a framework that has the ambition to be very efficient.

When using GPUs, it is often much more efficient to batch requests than to process them one by one or fire them in parallel. So next, we have the task of collecting requests from several connections, assembling them into a batch to run on the GPU, and then getting the results back to the respective requesters. This sounds elaborate and (again, when we write this) seems not to be done very often in simple tutorials. That is reason enough for us to do it properly here! Note, though, that until latency induced by the duration of a model run is an issue (in that waiting for our own run is OK; but waiting for the batch that’s running when the request arrives to finish, and then waiting for our run to give results, is prohibitive), there is little reason to run multiple batches on one GPU at a given time. Increasing the maximum batch size will generally be more efficient.

We want to serve several things in parallel. Even with asynchronous serving, we need our model to run efficiently on a second thread--this means we want to escape the (in)famous Python global interpreter lock (GIL) with our model.

We also want to do as little copying as possible. Both from a memory-consumption and a time perspective, copying things over and over is bad. Many HTTP things are encoded in Base64 (a format restricted to 6 bits per byte to encode binary in more or less alphanumeric strings), and--say, for images--decoding that to binary and then again to a tensor and then to the batch is clearly relatively expensive. We will partially deliver on this--we’ll use streaming PUT requests to not allocate Base64 strings and to avoid growing strings by successively appending to them (which is terrible for performance for strings as much as tensors). We say we do not deliver completely because we are not truly minimizing the copying, though.

The last desirable thing for serving is safety. Ideally, we would have safe decoding. We want to guard against both overflows and resource exhaustion. Once we have a fixed-size input tensor, we should be mostly good, as it is hard to crash PyTorch starting from fixed-sized inputs. The stretch to get there, decoding images and the like, is likely more of a headache, and we make no guarantees. Internet security is a large enough field that we will not cover it at all. We should note that neural networks are known to be susceptible to manipulation of the inputs to generate desired but wrong or unforeseen outputs (known as adversarial examples), but this isn’t extremely pertinent to our application, so we’ll skip it here.

Enough talk. Let’s improve on our server.

15.1.3 Request batching

Our second example server will use the Sanic framework (installed via the Python package of the same name). This will give us the ability to serve many requests in parallel using asynchronous processing, so we’ll tick that off our list. While we are at it, we will also implement request batching.


Figure 15.1 Dataflow with request batching

Asynchronous programming can sound scary, and it usually comes with lots of terminology. But what we are doing here is just allowing functions to non-blockingly wait for results of computations or events.4

In order to do request batching, we have to decouple the request handling from running the model. Figure 15.1 shows the flow of the data.

At the top of figure 15.1 are the clients, making requests. One by one, these go through the top half of the request processor. They cause work items to be enqueued with the request information. When a full batch has been queued or the oldest request has waited for a specified maximum time, a model runner takes a batch from the queue, processes it, and attaches the result to the work items. These are then processed one by one by the bottom half of the request processor.

Implementation

We implement this by writing two functions. The model runner function is started at application startup and runs forever. Whenever we need to run the model, it assembles a batch of inputs, runs the model in a second thread (so other things can happen), and returns the result.

The request processor then decodes the request, enqueues inputs, waits for the processing to be completed, and returns the output with the results. In order to appreciate what asynchronous means here, think of the model runner as a wastepaper basket. All the figures we scribble for this chapter can be quickly disposed of to the right of the desk. But every once in a while--either because the basket is full or when it is time to clean up in the evening--we need to take all the collected paper out to the trash can. Similarly, we enqueue new requests, trigger processing if needed, and wait for the results before sending them out as the answer to the request. Figure 15.2 shows our two functions in the blocks we execute uninterrupted before handing back to the event loop.


Figure 15.2 Our asynchronous server consists of three blocks: request processor, model runner, and model execution. These blocks are a bit like functions, but the first two will yield to the event loop in between.

A slight complication relative to this picture is that we have two occasions when we need to process events: if we have accumulated a full batch, we start right away; and when the oldest request reaches the maximum wait time, we also want to run. We solve this by setting a timer for the latter.5

All our interesting code is in a ModelRunner class, as shown in the following listing.

Listing 15.3 request_batching_server.py:32, ModelRunner

class ModelRunner:
  def __init__(self, model_name):
    self.model_name = model_name
    self.queue = []                                    
 
    self.queue_lock = None                             
 
    self.model = get_pretrained_model(self.model_name,
                      map_location=device)             
 
    self.needs_processing = None                       
 
    self.needs_processing_timer = None                 

The queue

This will become our lock.

Loads and instantiates the model. This is the (only) thing we will need to change for switching to the JIT. For now, we import the CycleGAN (with the slight modification of standardizing to 0..1 input and output) from p3ch15/cyclegan.py.

Our signal to run the model

Finally, the timer

ModelRunner first loads our model and takes care of some administrative things. In addition to the model, we also need a few other ingredients. We enter our requests into a queue. This is just a Python list in which we add work items at the back and remove them from the front.

When we modify the queue, we want to prevent other tasks from changing the queue out from under us. To this effect, we introduce a queue_lock that will be an asyncio.Lock provided by the asyncio module. As all asyncio objects we use here need to know the event loop, which is only available after we initialize the application, we temporarily set it to None in the instantiation. While locking like this may not be strictly necessary because our methods do not hand back to the event loop while holding the lock, and operations on the queue are atomic thanks to the GIL, it does explicitly encode our underlying assumption. If we had multiple workers, we would need to look at locking. One caveat: Python’s async locks are not threadsafe. (Sigh.)

ModelRunner waits when it has nothing to do. We need to signal it from RequestProcessor that it should stop slacking off and get to work. This is done via an asyncio.Event called needs_processing. ModelRunner uses the wait() method to wait for the needs_processing event. The RequestProcessor then uses set() to signal, and ModelRunner wakes up and clear()s the event.

Finally, we need a timer to guarantee a maximal wait time. This timer is created when we need it, using app.loop.call_at. It sets the needs_processing event; we just reserve a slot for it now. The event is thus set either directly, when a batch is complete, or when the timer goes off. When we process a batch before the timer goes off, we cancel the timer so we don’t do unnecessary work.
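A sketch of what schedule_processing_if_needed might look like, assuming MAX_BATCH_SIZE and a MAX_WAIT constant defined at the top of the file (the actual implementation in request_batching_server.py may differ in detail):

def schedule_processing_if_needed(self):
    if len(self.queue) >= MAX_BATCH_SIZE:
        self.needs_processing.set()     # a full batch is waiting: run right away
    elif self.queue:                    # otherwise, set a timer for the oldest request
        self.needs_processing_timer = app.loop.call_at(
            self.queue[0]["time"] + MAX_WAIT, self.needs_processing.set)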

From request to queue

Next we need to be able to enqueue requests, the core of the first part of RequestProcessor in figure 15.2 (without the decoding and reencoding). We do this in our first async method, process_input.

Listing 15.4 request_batching_server.py:54

async def process_input(self, input):
  our_task = {"done_event": asyncio.Event(loop=app.loop),   
        "input": input,
        "time": app.loop.time()}
  async with self.queue_lock:                               
    if len(self.queue) >= MAX_QUEUE_SIZE:
      raise HandlingError("I'm too busy", code=503)
    self.queue.append(our_task)
    self.schedule_processing_if_needed()                    
 
  await our_task["done_event"].wait()                       
  return our_task["output"]

Sets up the task data

With the lock, we add our task and ...

... schedule processing. Processing will set needs_processing if we have a full batch. If we don’t and no timer is set, it will set one to when the max wait time is up.

Waits (and hands back to the loop using await) for the processing to finish

We set up a little Python dictionary to hold our task’s information: the input of course, the time it was queued, and a done_event to be set when the task has been processed. The processing adds an output.

Holding the queue lock (conveniently done in an async with block), we add our task to the queue and schedule processing if needed. As a precaution, we error out if the queue has become too large. Then all we have to do is wait for our task to be processed, and return it.

Note It is important to use the loop time (typically a monotonic clock), which may be different from time.time(). Otherwise, we might end up with events scheduled for processing before they have been queued, or no processing at all.

This is all we need for the request processing (except decoding and encoding).

Running batches from the queue

Next, let’s look at the model_runner function on the right side of figure 15.2, which does the model invocation.

Listing 15.5 request_batching_server.py:71, .run_model

async def model_runner(self):
  self.queue_lock = asyncio.Lock(loop=app.loop)
  self.needs_processing = asyncio.Event(loop=app.loop)
  while True:
    await self.needs_processing.wait()                 
    self.needs_processing.clear()
    if self.needs_processing_timer is not None:        
      self.needs_processing_timer.cancel()
      self.needs_processing_timer = None
    async with self.queue_lock:
      # ... line 87
      to_process = self.queue[:MAX_BATCH_SIZE]         
      del self.queue[:len(to_process)]
      self.schedule_processing_if_needed()
    batch = torch.stack([t["input"] for t in to_process], dim=0)
    # we could delete inputs here...
 
    result = await app.loop.run_in_executor(
      None, functools.partial(self.run_model, batch)   
    )
    for t, r in zip(to_process, result):               
      t["output"] = r
      t["done_event"].set()
    del to_process

Waits until there is something to do

Cancels the timer if it is set

Grabs a batch and schedules the running of the next batch, if needed

Runs the model in a separate thread, moving data to the device and then handing over to the model. We continue processing after it is done.

Adds the results to the work-item and sets the ready event

As indicated in figure 15.2, model_runner does some setup and then infinitely loops (but yields to the event loop in between). It is invoked when the app is instantiated, so it can set up queue_lock and the needs_processing event we discussed earlier. Then it goes into the loop, await-ing the needs_processing event.

When an event comes, first we check whether a timer is set and, if so, cancel it, because we’ll be processing things now. Then model_runner grabs a batch from the queue and, if needed, schedules the processing of the next batch. It assembles the batch from the individual tasks and launches a new thread that evaluates the model using asyncio's app.loop.run_in_executor. Finally, it adds the outputs to the tasks and sets done_event.

And that’s basically it. The web framework--roughly looking like Flask with async and await sprinkled in--needs a little wrapper. And we need to start the model_runner function on the event loop. As mentioned earlier, locking the queue really is not necessary if we do not have multiple runners taking from the queue and potentially interrupting each other, but knowing our code will be adapted to other projects, we stay on the safe side to avoid losing requests.
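For orientation, here is a minimal sketch of that wrapper. The helpers decode_image_from_stream and encode_jpeg are hypothetical stand-ins for the decoding and encoding done in request_batching_server.py, which may differ in detail.

import sys
from sanic import Sanic, response
 
app = Sanic(__name__)
runner = ModelRunner(sys.argv[1])
 
@app.route('/image', methods=['PUT'], stream=True)
async def image(request):
    in_tensor = await decode_image_from_stream(request.stream)   # hypothetical helper
    out_tensor = await runner.process_input(in_tensor)           # enqueue and wait for the batch
    return response.raw(encode_jpeg(out_tensor),                 # hypothetical helper
                        content_type='image/jpeg')
 
app.add_task(runner.model_runner())   # starts the infinite model_runner loop on the event loop
app.run(host='0.0.0.0', port=8000)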

We start our server with

python3 -m p3ch15.request_batching_server data/p1ch2/horse2zebra_0.4.0.pth

Now we can test by uploading the image data/p1ch2/horse.jpg and saving the result:

curl -T data/p1ch2/horse.jpg http://localhost:8000/image --output /tmp/res.jpg

Note that this server does get a few things right--it batches requests for the GPU and runs asynchronously--but the model still runs in Python, so the GIL hampers running it in parallel to the request serving in the main thread. The server will also not be safe for potentially hostile environments like the internet. In particular, the decoding of request data seems neither optimal in speed nor completely safe.

In general, it would be nicer if we could have decoding where we pass the request stream to a function along with a preallocated memory chunk, and the function decodes the image from the stream to us. But we do not know of a library that does things this way.

15.2 Exporting models

So far, we have used PyTorch from the Python interpreter. But this is not always desirable: the GIL is still potentially blocking our improved web server. Or we might want to run on embedded systems where Python is too expensive or unavailable. This is when we export our model. There are several ways in which we can play this. We might move away from PyTorch entirely and move to more specialized frameworks. Or we might stay within the PyTorch ecosystem and use the JIT, a just-in-time compiler for a PyTorch-centric subset of Python. Even when we then run the JITed model in Python, we might be after two of its advantages: sometimes the JIT enables nifty optimizations, or--as in the case of our web server--we just want to escape the GIL, which JITed models do. Finally (but we take some time to get there), we might run our model under libtorch, the C++ library PyTorch offers, or with the derived Torch Mobile.

15.2.1 Interoperability beyond PyTorch with ONNX

Sometimes we want to leave the PyTorch ecosystem with our model in hand--for example, to run on embedded hardware with a specialized model deployment pipeline. For this purpose, the Open Neural Network Exchange provides an interoperability format for neural networks and machine learning models (https://onnx.ai). Once exported, the model can be executed using any ONNX-compatible runtime, such as ONNX Runtime,6 provided that the operations in use in our model are supported by the ONNX standard and the target runtime. It is, for example, quite a bit faster on the Raspberry Pi than running PyTorch directly. Beyond traditional hardware, a lot of specialized AI accelerator hardware supports ONNX (https://onnx.ai/supported-tools.html#deployModel).

In a way, a deep learning model is a program with a very specific instruction set, made of granular operations like matrix multiplication, convolution, relu, tanh, and so on. As such, if we can serialize the computation, we can reexecute it in another runtime that understands its low-level operations. ONNX is a standardization of a format describing those operations and their parameters.

Most of the modern deep learning frameworks support serialization of their computations to ONNX, and some of them can load an ONNX file and execute it (although this is not the case for PyTorch). Some low-footprint (“edge”) devices accept an ONNX file as input and generate low-level instructions for the specific device. And some cloud computing providers now make it possible to upload an ONNX file and see it exposed through a REST endpoint.

In order to export a model to ONNX, we need to run a model with a dummy input: the values of the input tensors don’t really matter; what matters is that they are the correct shape and type. By invoking the torch.onnx.export function, PyTorch will trace the computations performed by the model and serialize them into an ONNX file with the provided name:

torch.onnx.export(seg_model, dummy_input, "seg_model.onnx")
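For our segmentation model, the dummy input only needs the right shape and dtype; a sketch (the 1 × 8 × 512 × 512 shape matches what we feed the traced model in listing 15.7):

dummy_input = torch.randn(1, 8, 512, 512)   # the values are irrelevant; only shape and dtype matter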

The resulting ONNX file can now be run in a runtime, compiled to an edge device, or uploaded to a cloud service. It can be used from Python after installing onnxruntime or onnxruntime-gpu and getting a batch as a NumPy array.

Listing 15.6 onnx_example.py

import onnxruntime
 
sess = onnxruntime.InferenceSession("seg_model.onnx")   
input_name = sess.get_inputs()[0].name
pred_onnx, = sess.run(None, {input_name: batch})

The ONNX runtime API uses sessions to define models and then calls the run method with a set of named inputs. This is a somewhat typical setup when dealing with computations defined in static graphs.

Not all TorchScript operators can be represented as standardized ONNX operators. If we export operations foreign to ONNX, we will get errors about unknown aten operators when we try to use the runtime.

15.2.2 PyTorch’s own export: Tracing

When interoperability is not the key, but we need to escape the Python GIL or otherwise export our network, we can use PyTorch’s own representation, called the TorchScript graph. We will see what that is and how the JIT that generates it works in the next section. But let’s give it a spin right here and now.

The simplest way to make a TorchScript model is to trace it. This looks exactly like ONNX exporting. This isn’t surprising, because that is what the ONNX exporter uses under the hood, too. Here we just feed dummy inputs into the model using the torch.jit.trace function. We import UNetWrapper from chapter 13, load the trained parameters, and put the model into evaluation mode.

Before we trace the model, there is one additional caveat: none of the parameters should require gradients, because using the torch.no_grad() context manager is strictly a runtime switch. Even if we trace the model within no_grad but then run it outside, PyTorch will record gradients. If we take a peek ahead at figure 15.4, we see why: after the model has been traced, we ask PyTorch to execute it. But the traced model will have parameters requiring gradients when executing the recorded operations, and they will make everything require gradients. To escape that, we would have to run the traced model in a torch.no_grad context. To spare us this--from experience, it is easy to forget and then be surprised by the lack of performance--we loop through the model parameters and set all of them to not require gradients.

But then all we need to do is call torch.jit.trace.7

Listing 15.7 trace_example.py

import torch
from p2ch13.model_seg import UNetWrapper
 
seg_dict = torch.load('data-unversioned/part2/models/p2ch13/seg_2019-10-20_15.57.21_none.best.state', map_location='cpu')
seg_model = UNetWrapper(in_channels=8, n_classes=1, depth=4, wf=3, padding=True, batch_norm=True, up_mode='upconv')
seg_model.load_state_dict(seg_dict['model_state'])
seg_model.eval()
for p in seg_model.parameters():                             
    p.requires_grad_(False)
 
dummy_input = torch.randn(1, 8, 512, 512)
traced_seg_model = torch.jit.trace(seg_model, dummy_input)   

Sets the parameters to not require gradients

The tracing

The tracing gives us a warning:

TracerWarning: Converting a tensor to a Python index might cause the trace 
to be incorrect. We can't record the data flow of Python values, so this 
value will be treated as a constant in the future. This means the trace 
might not generalize to other inputs!
  return layer[:, :, diff_y:(diff_y + target_size[0]), diff_x:(diff_x + target_size[1])]

This stems from the cropping we do in U-Net, but as long as we only ever plan to feed images of size 512 × 512 into the model, we will be OK. In the next section, we’ll take a closer look at what causes the warning and how to get around the limitation it highlights if we need to. It will also be important when we want to convert models that are more complex than convolutional networks and U-Nets to TorchScript.

We can save the traced model

torch.jit.save(traced_seg_model, 'traced_seg_model.pt')

and load it back without needing anything but the saved file, and then we can call it:

loaded_model = torch.jit.load('traced_seg_model.pt')
prediction = loaded_model(batch)

The PyTorch JIT will keep the model’s state from when we saved it: that we had put it into evaluation mode and that our parameters do not require gradients. If we had not taken care of it beforehand, we would need to use with torch.no_grad(): in the execution.

tip You can run the JITed and exported PyTorch model without keeping the source. However, we always want to establish a workflow where we automatically go from source model to installed JITed model for deployment. If we do not, we will find ourselves in a situation where we would like to tweak something with the model but have lost the ability to modify and regenerate. Always keep the source, Luke!

15.2.3 Our server with a traced model

Now is a good time to iterate our web server to what is, in this case, our final version. We can export the traced CycleGAN model as follows:

python3 p3ch15/cyclegan.py data/p1ch2/horse2zebra_0.4.0.pth data/p3ch15/traced_zebra_model.pt

Now we just need to replace the call to get_pretrained_model with torch.jit.load in our server (and drop the now-unnecessary import of get_pretrained_model). This also means our model runs independent of the GIL--and this is what we wanted our server to achieve here. For your convenience, we have put the small modifications in request_batching_jit_server.py. We can run it with the traced model file path as a command-line argument.
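The change inside ModelRunner is a one-liner; a sketch:

# in ModelRunner.__init__, instead of
#   self.model = get_pretrained_model(self.model_name, map_location=device)
# we load the traced model file given on the command line:
self.model = torch.jit.load(self.model_name, map_location=device)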

Now that we have had a taste of what the JIT can do for us, let’s dive into the details!

15.3 Interacting with the PyTorch JIT

Debuting in PyTorch 1.0, the PyTorch JIT is at the center of quite a few recent innovations around PyTorch, not least of which is providing a rich set of deployment options.

15.3.1 What to expect from moving beyond classic Python/PyTorch

Quite often, Python is said to lack speed. While there is some truth to this, the tensor operations we use in PyTorch usually are in themselves large enough that the Python slowness between them is not a large issue. For small devices like smartphones, the memory overhead that Python brings might be more important. So keep in mind that frequently, the speedup gained by taking Python out of the computation is 10% or less.

Another immediate speedup from not running the model in Python only appears in multithreaded environments, but then it can be significant: because the intermediates are not Python objects, the computation is not affected by the menace of all Python parallelization, the GIL. This is what we had in mind earlier and realized when we used a traced model in our server.

Moving away from the classic PyTorch way of executing one operation before looking at the next gives PyTorch a holistic view of the calculation: that is, it can consider the calculation in its entirety. This opens the door to crucial optimizations and higher-level transformations. Some of those apply mostly to inference, while others can also provide a significant speedup in training.

Let’s use a quick example to give you a taste of why looking at several operations at once can be beneficial. When PyTorch runs a sequence of operations on the GPU, it calls a subprogram (kernel, in CUDA parlance) for each of them. Every kernel reads the input from GPU memory, computes the result, and then stores the result. Thus most of the time is typically spent not computing things, but reading from and writing to memory. This can be improved on by reading only once, computing several operations, and then writing at the very end. This is precisely what the PyTorch JIT fuser does. To give you an idea of how this works, figure 15.3 shows the pointwise computation taking place in a long short-term memory (LSTM; https://en.wikipedia.org/wiki/Long_short-term_memory) cell, a popular building block for recurrent networks.

The details of figure 15.3 are not important to us here, but there are 5 inputs at the top, 2 outputs at the bottom, and 7 intermediate results represented as rounded boxes. By computing all of this in one go in a single CUDA kernel and keeping the intermediates in registers, the JIT reduces the number of memory reads from 12 to 5 and the number of writes from 9 to 2. These are the large gains the JIT gets us: it can reduce the time to train an LSTM network by a factor of four. This seemingly simple trick allows PyTorch to significantly narrow the gap between the speed of LSTM and generalized LSTM cells flexibly defined in PyTorch and that of the rigid but highly optimized LSTM implementation provided by libraries like cuDNN.
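To make this concrete, here is a sketch of the kind of pointwise block figure 15.3 depicts, written as a scripted function; the gate pre-activations i, f, g, o and the previous cell state c are assumed to be given as tensors of the same shape.

@torch.jit.script
def lstm_pointwise(i, f, g, o, c):
    # every operation here is elementwise, so the fuser can combine them into
    # a single kernel instead of paying one memory round trip per operation
    c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
    h_new = torch.sigmoid(o) * torch.tanh(c_new)
    return h_new, c_new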

In summary, the speedup from using the JIT to escape Python is more modest than we might naively expect when we have been told that Python is awfully slow, but avoiding the GIL is a significant win for multithreaded applications. The large speedups in JITed models come from special optimizations that the JIT enables but that are more elaborate than just avoiding Python overhead.


Figure 15.3 LSTM cell pointwise operations. From five inputs at the top, this block computes two outputs at the bottom. The boxes in between are intermediate results that vanilla PyTorch will store in memory but the JIT fuser will just keep in registers.

15.3.2 The dual nature of PyTorch as interface and backend

To understand how moving beyond Python works, it is beneficial to mentally separate PyTorch into several parts. We saw a first glimpse of this in section 1.4. Our PyTorch torch.nn modules--which we first saw in chapter 6 and which have been our main tool for modeling ever since--hold the parameters of our network and are implemented using the functional interface: functions taking and returning tensors. These are implemented as a C++ extension, handed over to the C++-level autograd-enabled layer. (This then hands the actual computation to an internal library called ATen, performing the computation or relying on backends to do so, but this is not important.)

Given that the C++ functions are already there, the PyTorch developers made them into an official API. This is the nucleus of LibTorch, which allows us to write C++ tensor operations that look almost like their Python counterparts. As the torch.nn modules are Python-only by nature, the C++ API mirrors them in a namespace torch::nn that is designed to look a lot like the Python part but is independent.

This would allow us to redo in C++ what we did in Python. But that is not what we want: we want to export the model. Happily, there is another interface to the same functions provided by PyTorch: the PyTorch JIT. The PyTorch JIT provides a “symbolic” representation of the computation. This representation is the TorchScript intermediate representation (TorchScript IR, or sometimes just TorchScript). We mentioned TorchScript in section 15.2.2 when discussing delayed computation. In the following sections, we will see how to get this representation of our Python models and how they can be saved, loaded, and executed. Similar to what we discussed for the regular PyTorch API, the PyTorch JIT functions to load, inspect, and execute TorchScript modules can also be accessed both from Python and from C++.

In summary, we have four ways of calling PyTorch functions, illustrated in figure 15.4: from both C++ and Python, we can either call functions directly or have the JIT as an intermediary. All of these eventually call the C++ LibTorch functions and from there ATen and the computational backend.


Figure 15.4 Many ways of calling into PyTorch

15.3.3 TorchScript

TorchScript is at the center of the deployment options envisioned by PyTorch. As such, it is worth taking a close look at how it works.

There are two straightforward ways to create a TorchScript model: tracing and scripting. We will look at each of them in the following sections. At a very high level, the two work as follows:

In tracing, which we used in section 15.2.2, we execute our usual PyTorch model using sample (random) inputs. The PyTorch JIT has hooks (in the C++ autograd interface) for every function that allow it to record the computation. In a way, it is like saying “Watch how I compute the outputs--now you can do the same.” Given that the JIT only comes into play when PyTorch functions (and also nn.Modules) are called, you can run any Python code while tracing, but the JIT will only notice those bits (and notably be ignorant of control flow). When we use tensor shapes--usually a tuple of integers--the JIT tries to follow what’s going on but may have to give up. This is what gave us the warning when tracing the U-Net.

In scripting, the PyTorch JIT looks at the actual Python code of our computation and compiles it into the TorchScript IR. This means that, while we can be sure that every aspect of our program is captured by the JIT, we are restricted to those parts understood by the compiler. This is like saying “I am telling you how to do it--now you do the same.” Sounds like programming, really.

We are not here for theory, so let’s try tracing and scripting with a very simple function that adds inefficiently over the first dimension:

# In[2]:
def myfn(x):
    y = x[0]
    for i in range(1, x.size(0)):
        y = y + x[i]
    return y

We can trace it:

# In[3]:
inp = torch.randn(5,5)
traced_fn = torch.jit.trace(myfn, inp)
print(traced_fn.code)
 
# Out[3]:
def myfn(x: Tensor) -> Tensor:
  y = torch.select(x, 0, 0)                                                
  y0 = torch.add(y, torch.select(x, 0, 1), alpha=1)                        
  y1 = torch.add(y0, torch.select(x, 0, 2), alpha=1)
  y2 = torch.add(y1, torch.select(x, 0, 3), alpha=1)
  _0 = torch.add(y2, torch.select(x, 0, 4), alpha=1)
  return _0
 
 
TracerWarning: Converting a tensor to a Python index might cause the trace 
to be incorrect. We can't record the data flow of Python values, so this
value will be treated as a constant in the future. This means the
trace might not generalize to other inputs!

Indexing in the first line of our function

Our loop--but completely unrolled and fixed to 1...4 regardless of the size of x

Scary, but so true!

We see the big warning--and indeed, the code has fixed indexing and additions for five rows, and it would not deal as intended with four or six rows.

This is where scripting helps:

# In[4]:
scripted_fn = torch.jit.script(myfn)
print(scripted_fn.code)
 
# Out[4]:
def myfn(x: Tensor) -> Tensor:
  y = torch.select(x, 0, 0)
  _0 = torch.__range_length(1, torch.size(x, 0), 1)     
  y0 = y
  for _1 in range(_0):                                  
    i = torch.__derive_index(_1, 1, 1)
    y0 = torch.add(y0, torch.select(x, 0, i), alpha=1)  
  return y0

PyTorch constructs the range length from the tensor size.

Our for loop--even if we have to take the funny-looking next line to get our index i

Our loop body, which is just a tad more verbose

We can also print the scripted graph, which is closer to the internal representation of TorchScript:

# In[5]:
print(scripted_fn.graph)
 
# Out[5]:
graph(%x.1 : Tensor):
  %10 : bool = prim::Constant[value=1]()               
  %2 : int = prim::Constant[value=0]()
  %5 : int = prim::Constant[value=1]()
  %y.1 : Tensor = aten::select(%x.1, %2, %2)           
  %7 : int = aten::size(%x.1, %2)
  %9 : int = aten::__range_length(%5, %7, %5)          
  %y : Tensor = prim::Loop(%9, %10, %y.1)              
    block0(%11 : int, %y.6 : Tensor):
      %i.1 : int = aten::__derive_index(%11, %5, %5)
      %18 : Tensor = aten::select(%x.1, %2, %i.1)      
      %y.3 : Tensor = aten::add(%y.6, %18, %5)
      -> (%10, %y.3)
  return (%y)

Seems a lot more verbose than we need

The first assignment of y

Constructing the range is recognizable after we see the code.

Our for loop returns the value (y) it calculates.

Body of the for loop: selects a slice, and adds to y

In practice, you would most often use torch.jit.script in the form of a decorator:

@torch.jit.script
def myfn(x):
  ...

You could also do this with a custom trace decorator taking care of the inputs, but this has not caught on.

Although TorchScript (the language) looks like a subset of Python, there are fundamental differences. If we look very closely, we see that PyTorch has added type specifications to the code. This hints at an important difference: TorchScript is statically typed--every value (variable) in the program has one and only one type. Also, the types are limited to those for which the TorchScript IR has a representation. Within the program, the JIT will usually infer the type automatically, but we need to annotate any non-tensor arguments of scripted functions with their types. This is in stark contrast to Python, where we can assign anything to any variable.
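For instance, a non-tensor argument of a scripted function needs an explicit annotation; a minimal sketch:

@torch.jit.script
def add_n_times(x: torch.Tensor, n: int) -> torch.Tensor:
    for _ in range(n):   # n is statically typed as int
        x = x + 1
    return x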

So far, we’ve traced and scripted functions. But we graduated from just using functions in chapter 5 to using modules a long time ago. Sure enough, we can also trace or script models. These will then behave roughly like the modules we know and love. For both tracing and scripting, we pass an instance of Module to torch.jit.trace (with sample inputs) or torch.jit.script (without sample inputs), respectively. This will give us the forward method we are used to. If we want to expose other methods (this only works in scripting) to be called from the outside, we decorate them with @torch.jit.export in the class definition.
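As a small sketch (the toy module is made up for illustration), scripting captures forward as well as any method marked with @torch.jit.export, while tracing only records forward:

class Doubler(torch.nn.Module):
    def forward(self, x):
        return 2 * x
 
    @torch.jit.export                 # callable on the scripted module, too
    def halve(self, x):
        return x / 2
 
scripted = torch.jit.script(Doubler())               # no sample input needed
traced = torch.jit.trace(Doubler(), torch.randn(3))  # sample input needed; only forward is recorded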

When we said that the JITed modules work like they did in Python, this includes the fact that we can use them for training, too. On the flip side, this means we need to set them up for inference (for example, using the torch.no_grad() context) just like our traditional models, to make them do the right thing.

With algorithmically relatively simple models--like the CycleGAN, classification models, and U-Net-based segmentation--we can just trace the model as we did earlier. For more complex models, a nifty property is that we can use scripted or traced functions from other scripted or traced code, and that we can use scripted or traced submodules when constructing and tracing or scripting a module. We can also trace functions that call nn.Modules, but then we need to set all parameters to not require gradients, as the parameters will be constants for the traced model.
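Here is a sketch of that mixing: a scripted submodule with data-dependent control flow used inside a module that we then trace (both toy modules are assumptions for illustration):

class EnsureBatchDim(torch.nn.Module):
    def forward(self, x):
        if x.dim() == 3:            # data-dependent control flow: needs scripting
            x = x.unsqueeze(0)
        return x
 
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.pre = torch.jit.script(EnsureBatchDim())    # scripted submodule
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3)
 
    def forward(self, x):
        return self.conv(self.pre(x))
 
traced = torch.jit.trace(Net(), torch.randn(1, 3, 32, 32))   # the scripted control flow is preserved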

As we have seen tracing already, let’s look at a practical example of scripting in more detail.

15.3.4 Scripting the gaps of traceability

In more complex models, such as those from the Fast R-CNN family for detection or recurrent networks used in natural language processing, the bits with control flow like for loops need to be scripted. Similarly, if we needed the flexibility to handle inputs of other sizes, we would have to script the bit of code the tracer warned us about.

Listing 15.8 From utils/unet.py

class UNetUpBlock(nn.Module):
    ...
    def center_crop(self, layer, target_size):
        _, _, layer_height, layer_width = layer.size()
        diff_y = (layer_height - target_size[0]) // 2
        diff_x = (layer_width - target_size[1]) // 2
        return layer[:, :, diff_y:(diff_y + target_size[0]), diff_x:(diff_x + target_size[1])]                            
 
    def forward(self, x, bridge):
        ...
        crop1 = self.center_crop(bridge, up.shape[2:])
 ...

The tracer warns here.

What happens is that the JIT magically replaces the shape tuple up.shape with a 1D integer tensor with the same information. Now the slicing [2:] and the calculation of diff_x and diff_y are all traceable tensor operations. However, that does not save us, because the slicing then wants Python ints; and there, the reach of the JIT ends, giving us the warning.

But we can solve this issue in a straightforward way: we script center_crop. We slightly change the cut between caller and callee by passing up to the scripted center_crop and extracting the sizes there. Other than that, all we need to do is add the @torch.jit.script decorator. The result is the following code, which makes the U-Net model traceable without warnings.

Listing 15.9 Rewritten excerpt from utils/unet.py

@torch.jit.script
def center_crop(layer, target):                         
    _, _, layer_height, layer_width = layer.size()
    _, _, target_height, target_width = target.size()   
    diff_y = (layer_height - target_height) // 2
    diff_x = (layer_width - target_width) // 2
    return layer[:, :, diff_y:(diff_y + target_height),  diff_x:(diff_x + target_width)]                     
 
class UNetUpBlock(nn.Module):
    ...
 
    def forward(self, x, bridge):
        ...
        crop1 = center_crop(bridge, up)                 
  ...

Changes the signature, taking target instead of target_size

Gets the sizes within the scripted part

The indexing uses the size values we got.

We adapt our call to pass up rather than the size.

Another option we could choose--but that we will not use here--would be to move unscriptable things into custom operators implemented in C++. The TorchVision library does that for some specialty operations in Mask R-CNN models.

15.4 LibTorch: PyTorch in C++

We have seen various ways to export our models, but so far, we have used Python. We’ll now look at how we can forgo Python and work with C++ directly.

Let’s go back to the horse-to-zebra CycleGAN example. We will now take the JITed model from section 15.2.3 and run it from a C++ program.

15.4.1 Running JITed models from C++

The hardest part about deploying PyTorch vision models in C++ is choosing an image library to load the data.8 Here, we go with the very lightweight library CImg (http://cimg.eu). If you are very familiar with OpenCV, you can adapt the code to use that instead; we just felt that CImg is easiest for our exposition.

Running a JITed model is very simple. We’ll first show the image handling; it is not really what we are after, so we will do this very quickly.9

Listing 15.10 cyclegan_jit.cpp

#include "torch/script.h"                                       
#define cimg_use_jpeg
#include "CImg.h"
using namespace cimg_library;
int main(int argc, char **argv) {
  CImg<float> image(argv[2]);                                   
  image = image.resize(227, 227);                               
  // ...here we need to produce an output tensor from input
  CImg<float> out_img(output.data_ptr<float>(), output.size(2), 
                      output.size(3), 1, output.size(1));
  out_img.save(argv[3]);                                        
  return 0;
}

Includes the PyTorch script header and CImg with native JPEG support

Loads and decodes the image into a float array

Resizes to a smaller size

The method data_ptr<float>() gives us a pointer to the tensor storage. With it and the shape information, we can construct the output image.

Saves the image

For the PyTorch side, we include a C++ header torch/script.h. Then we need to set up and include the CImg library. In the main function, we load an image from a file given on the command line and resize it (in CImg). So we now have a 227 × 227 image in the CImg<float> variable image. At the end of the program, we’ll create an out_img of the same type from our (1, 3, 227, 227)-shaped tensor and save it.

Don’t worry about these bits. They are not the PyTorch C++ we want to learn, so we can just take them as is.

The actual computation is straightforward, too. We need to make an input tensor from the image, load our model, and run the input tensor through it.

Listing 15.11 cyclegan_jit.cpp

auto input_ = torch::tensor(
    torch::ArrayRef<float>(image.data(), image.size()));  
  auto input = input_.reshape({1, 3, image.height(),
                   image.width()}).div_(255);             
 
  auto module = torch::jit::load(argv[1]);                
 
  std::vector<torch::jit::IValue> inputs;                 
  inputs.push_back(input);
  auto output_ = module.forward(inputs).toTensor();       
 
  auto output = output_.contiguous().mul_(255);           

Puts the image data into a tensor

Reshapes and rescales to move from CImg conventions to PyTorch’s

Loads the JITed model or function from a file

Packs the input into a (one-element) vector of IValues

Calls the module and extracts the result tensor. For efficiency, the ownership is moved, so if we held on to the IValue, it would be empty afterward.

Makes sure our result is contiguous

Recall from chapter 3 that PyTorch keeps the values of a tensor in a large chunk of memory in a particular order. So does CImg, and we can get a pointer to this memory chunk (as a float array) using image.data() and the number of elements using image.size(). With these two, we can create a somewhat smarter reference: a torch::ArrayRef (which is just shorthand for pointer plus size; PyTorch uses those at the C++ level for data but also for returning sizes without copying). Then we can just pass that to the torch::tensor constructor, just as we would with a list.

tip Sometimes you might want to use the similar-working torch::from_blob instead of torch::tensor. The difference is that tensor will copy the data. If you do not want copying, you can use from_blob, but then you need to take care that the underpinning memory is available during the lifetime of the tensor.

Our tensor is only 1D, so we need to reshape it. Conveniently, CImg uses the same ordering as PyTorch (channel, rows, columns). If not, we would need to adapt the reshaping and permute the axes as we did in chapter 4. As CImg uses a range of 0...255 and we made our model to use 0...1, we divide here and multiply later. This could, of course, be absorbed into the model, but we wanted to reuse our traced model.

A common pitfall to avoid: pre- and postprocessing

When switching from one library to another, it is easy to forget to check that the conversion steps are compatible. They are non-obvious unless we look up the memory layout and scaling convention of PyTorch and the image processing library we use. If we forget, we will be disappointed by not getting the results we anticipate.

Here, the model would go wild because it gets extremely large inputs. However, in the end, the output convention of our model is to give RGB values in the 0..1 range. If we used this directly with CImg, the result would look all black.

Other frameworks have other conventions: for example OpenCV likes to store images as BGR instead of RGB, requiring us to flip the channel dimension. We always want to make sure the input we feed to the model in the deployment is the same as what we fed into it in Python.

 

Loading the traced model is very straightforward using torch::jit::load. Next, we have to deal with an abstraction PyTorch introduces to bridge between Python and C++: we need to wrap our input in an IValue (or several IValues), the generic data type for any value. A function in the JIT is passed a vector of IValues, so we declare that and then push_back our input tensor. This will automatically wrap our tensor into an IValue. We feed this vector of IValues to the forward and get a single one back. We can then unpack the tensor in the resulting IValue with .toTensor.

Here we see a bit about IValues: they have a type (here, Tensor), but they could also be holding int64_ts or doubles or a list of tensors. For example, if we had multiple outputs, we would get an IValue holding a list of tensors, which ultimately stems from the Python calling conventions. When we unpack a tensor from an IValue using .toTensor, the IValue transfers ownership (becomes invalid). But let’s not worry about it; we got a tensor back. Because sometimes the model may return non-contiguous data (with gaps in the storage from chapter 3), but CImg reasonably requires us to provide it with a contiguous block, we call contiguous. It is important that we assign this contiguous tensor to a variable that is in scope until we are done working with the underlying memory. Just like in Python, PyTorch will free memory if it sees that no tensors are using it anymore.

So let’s compile this! On Debian or Ubuntu, you need to install cimg-dev, libjpeg-dev, and libx11-dev to use CImg.

You can download a C++ library of PyTorch from the PyTorch page. But given that we already have PyTorch installed,10 we might as well use that; it comes with all we need for C++. We need to know where our PyTorch installation lives, so open Python and check torch.__file__, which may say /usr/local/lib/python3.7/dist-packages/torch/__init__.py. This means the CMake files we need are in /usr/local/lib/python3.7/dist-packages/torch/share/cmake/.

While using CMake seems like overkill for a single source file project, linking to PyTorch is a bit complex; so we just use the following as a boilerplate CMake file.11

Listing 15.12 CMakeLists.txt

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(cyclegan-jit)                                         
 
find_package(Torch REQUIRED)                                  
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")
 
add_executable(cyclegan-jit cyclegan_jit.cpp)                 
target_link_libraries(cyclegan-jit pthread jpeg X11)          
target_link_libraries(cyclegan-jit "${TORCH_LIBRARIES}")
set_property(TARGET cyclegan-jit PROPERTY CXX_STANDARD 14)

Project name. Replace it with your own here and on the other lines.

We need Torch.

We want to compile an executable named cyclegan-jit from the cyclegan_jit.cpp source file.

Links to the bits required for CImg. CImg itself is all-include, so it does not appear here.

It is best to make a build directory as a subdirectory of where the source code resides and then in it run CMake as12 CMAKE_PREFIX_PATH=/usr/local/lib/python3.7/dist-packages/torch/share/cmake/ cmake .. and finally make. This will build the cyclegan-jit program, which we can then run as follows:

./cyclegan-jit ../traced_zebra_model.pt  ../../data/p1ch2/horse.jpg /tmp/z.jpg

We just ran our PyTorch model without Python. Awesome! If you want to ship your application, you likely want to copy the libraries from /usr/local/lib/python3.7/dist-packages/torch/lib into where your executable is, so that they will always be found.

15.4.2 C++ from the start: The C++ API

The C++ modular API is intended to feel a lot like the Python one. To get a taste, we will translate the CycleGAN generator into a model natively defined in C++, but without the JIT. We do, however, need the pretrained weights, so we’ll save a traced version of the model (and here it is important to trace not a function but the model).

We’ll start with some administrative details: includes and namespaces.

Listing 15.13 cyclegan_cpp_api.cpp

#include <torch/torch.h>   
#define cimg_use_jpeg
#include <CImg.h>
using torch::Tensor;       

Imports the one-stop torch/torch.h header and CImg

Spelling out torch::Tensor can be tedious, so we import the name into the main namespace.

When we look at the source code in the file, we find that ConvTransposed2d is defined ad hoc, when ideally it should be taken from the standard library. The issue here is that the C++ modular API is still under development; and with PyTorch 1.4, the premade ConvTranspose2d module cannot be used in Sequential because it takes an optional second argument.13 Usually we could simply do without Sequential, but we want our model to have the same structure as the Python CycleGAN generator from chapter 2, which uses Sequential.

Next, let’s look at the residual block.

Listing 15.14 Residual block in cyclegan_cpp_api.cpp

struct ResNetBlock : torch::nn::Module {
  torch::nn::Sequential conv_block;
  ResNetBlock(int64_t dim)
      : conv_block(                                   
           torch::nn::ReflectionPad2d(1),
           torch::nn::Conv2d(torch::nn::Conv2dOptions(dim, dim, 3)),
           torch::nn::InstanceNorm2d(
           torch::nn::InstanceNorm2dOptions(dim)),
           torch::nn::ReLU(/*inplace=*/true),
        torch::nn::ReflectionPad2d(1),
           torch::nn::Conv2d(torch::nn::Conv2dOptions(dim, dim, 3)),
           torch::nn::InstanceNorm2d(
           torch::nn::InstanceNorm2dOptions(dim))) {
    register_module("conv_block", conv_block);        
  }
 
  Tensor forward(const Tensor &inp) {
    return inp + conv_block->forward(inp);            
  }
};

Initializes Sequential, including its submodules

Always remember to register the modules you assign, or bad things will happen!

As might be expected, our forward function is pretty simple.

Just as we would in Python, we register a subclass of torch::nn::Module. Our residual block has a sequential conv_block submodule.

And just as we did in Python, we need to initialize our submodules, notably Sequential. We do so in the constructor’s member initializer list. This is similar to how we construct submodules in Python in the __init__ constructor. Unlike Python, C++ does not have the introspection and hooking capabilities that enable redirection of __setattr__ to combine assignment to a member and registration.

Since C++ lacks keyword arguments, specifying parameters would be awkward for modules with many defaults, so modules (like the tensor factory functions) typically take an options argument. Optional keyword arguments in Python correspond to methods of the options object that we can chain. For example, the Python module nn.Conv2d(in_channels, out_channels, kernel_size, stride=2, padding=1) that we need to convert translates to torch::nn::Conv2d(torch::nn::Conv2dOptions(in_channels, out_channels, kernel_size).stride(2).padding(1)). This is a bit more tedious, but you’re reading this because you love C++ and aren’t deterred by the hoops it makes you jump through.
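
As a small standalone illustration (the channel and kernel numbers here are made up, not taken from our generator), the same options-chaining call can be written out as a statement:

auto downconv = torch::nn::Conv2d(
    torch::nn::Conv2dOptions(/*in_channels=*/64, /*out_channels=*/128,
                             /*kernel_size=*/3)
        .stride(2)
        .padding(1));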

We should always take care that registration and assignment to members stay in sync, or things will not work as expected: loading and updating parameters during training happens to the registered module, while the module actually being called is the member--and if those are two different objects, we have a problem. The Python nn.Module class did this synchronization for us behind the scenes, but it is not automatic in C++, and forgetting it will cause many headaches.
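
The common idiom--sketched here with a hypothetical two-layer module, not code from our generator--is to do assignment and registration in one statement, so the member and the registered entry can never drift apart:

struct TwoLayerNetImpl : torch::nn::Module {
  torch::nn::Linear fc1{nullptr}, fc2{nullptr};
  TwoLayerNetImpl(int64_t n_in, int64_t n_hidden, int64_t n_out) {
    // register_module returns the registered module, so assigning its result
    // keeps the member and the registry pointing at the same object
    fc1 = register_module("fc1", torch::nn::Linear(n_in, n_hidden));
    fc2 = register_module("fc2", torch::nn::Linear(n_hidden, n_out));
  }
  torch::Tensor forward(const torch::Tensor &inp) {
    return fc2->forward(torch::relu(fc1->forward(inp)));
  }
};
TORCH_MODULE(TwoLayerNet);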

In contrast to Python, where we call the module itself (and should keep doing so), here we need to call m->forward(...) on our modules. Some modules can also be called directly, but for Sequential, this is not currently the case.

A final comment on calling conventions is in order: tensor arguments should be passed as const Tensor& if the function leaves them unchanged and as plain Tensor if it modifies them.14 Tensors should be returned as Tensor. Wrong argument types such as non-const references (Tensor&) will lead to unparsable compiler errors.
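
A minimal illustration with two hypothetical helper functions (not part of our generator) might look like this:

torch::Tensor scaled(const torch::Tensor &t) {  // argument left unchanged: const Tensor&
  return t * 2;                                 // results are returned by value as Tensor
}

void scale_(torch::Tensor t) {                  // argument modified in place: plain Tensor
  t.mul_(2);                                    // Tensor is a handle to shared data, so the
}                                               // caller sees the change anyway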

In the main generator class, we’ll follow a typical pattern in the C++ API more closely by naming our class ResNetGeneratorImpl and promoting it to a torch module ResNetGenerator using the TORCH_MODULE macro. The background is that we mostly want to handle modules by reference or shared pointer; the wrapper class accomplishes this.

Listing 15.15 ResNetGenerator in cyclegan_cpp_api.cpp

struct ResNetGeneratorImpl : torch::nn::Module {
  torch::nn::Sequential model;
  ResNetGeneratorImpl(int64_t input_nc = 3, int64_t output_nc = 3,
                      int64_t ngf = 64, int64_t n_blocks = 9) {
    TORCH_CHECK(n_blocks >= 0);
    model->push_back(torch::nn::ReflectionPad2d(3));    
    ...                                                 
      model->push_back(torch::nn::Conv2d(
          torch::nn::Conv2dOptions(ngf * mult, ngf * mult * 2, 3)
              .stride(2)
              .padding(1)));                            
    ...
    register_module("model", model);
  }
  Tensor forward(const Tensor &inp) { return model->forward(inp); }
};
 
TORCH_MODULE(ResNetGenerator);                          

Adds modules to the Sequential container in the constructor. This allows us to add a variable number of modules in a for loop.

Spares us from reproducing some tedious things

An example of Options in action

Creates a wrapper ResNetGenerator around our ResNetGeneratorImpl class. As archaic as it seems, the matching names are important here.

That’s it--we’ve defined the perfect C++ analogue of the Python ResNetGenerator model. Now we only need a main function to load parameters and run our model. Loading the image with CImg and converting from image to tensor and tensor back to image are the same as in the previous section. To include some variation, we’ll display the image instead of writing it to disk.

Listing 15.16 cyclegan_cpp_api.cpp main

ResNetGenerator model;                                                    
  ...
  torch::load(model, argv[1]);                                            
  ...
  cimg_library::CImg<float> image(argv[2]);
  image.resize(400, 400);
  auto input_ =
      torch::tensor(torch::ArrayRef<float>(image.data(), image.size()));
  auto input = input_.reshape({1, 3, image.height(), image.width()});
  torch::NoGradGuard no_grad;                                             
 
  model->eval();                                                          
 
  auto output = model->forward(input);                                    
  ...
  cimg_library::CImg<float> out_img(output.data_ptr<float>(),
                    output.size(3), output.size(2),
                    1, output.size(1));
  cimg_library::CImgDisplay disp(out_img, "See a C++ API zebra!");        
  while (!disp.is_closed()) {
    disp.wait();
  }

Instantiates our model

Loads the parameters

Declaring a guard variable is the equivalent of the torch.no_grad() context. You can put it in a { ... } block if you need to limit how long you turn off gradients.

As in Python, eval mode is turned on (for our model, it would not be strictly relevant).

Again, we call forward rather than the model.

When displaying the image, we wait until the display window is closed rather than exiting the program immediately.

The interesting changes are in how we create and run the model. Just as expected, we instantiate the model by declaring a variable of the model type. We load the model using torch::load (here it is important that we wrapped the model). While this looks very familiar to PyTorch practitioners, note that it will work on JIT-saved files rather than Python-serialized state dictionaries.

When running the model, we need the equivalent of with torch.no_grad():. This is provided by instantiating a variable of type NoGradGuard and keeping it in scope for as long as we do not want gradients. Just like in Python, we set the model into evaluation mode by calling model->eval(). This time around, we call model->forward with our input tensor and get a tensor as a result--no JIT is involved, so we do not need IValue packing and unpacking.
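
If we wanted to limit the no-gradients region, we could scope the guard in a block, roughly like this (a sketch using the ResNetGenerator from listing 15.15; the helper function is our own invention):

torch::Tensor zebraify(ResNetGenerator &model, const torch::Tensor &input) {
  torch::Tensor output;
  {
    torch::NoGradGuard no_grad;      // gradients are off only inside this block
    model->eval();
    output = model->forward(input);
  }                                  // guard destroyed here; gradient tracking resumes
  return output;
}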

Phew. Writing this in C++ was a lot of work for the Python fans that we are. We are glad that we only promised to do inference here, but of course LibTorch also offers optimizers, data loaders, and much more. The main reason to use the C++ API is when you want to create models and neither the JIT nor Python is a good fit.

For your convenience, the CMakeLists.txt also contains the instructions for building cyclegan-cpp-api, so building works just like in the previous section.

We can run the program as

./cyclegan_cpp_api ../traced_zebra_model.pt ../../data/p1ch2/horse.jpg

But we knew what the model would be doing, didn’t we?

15.5 Going mobile

As the last variant of deploying a model, we will consider deployment to mobile devices. When we want to bring our models to mobile, we are typically looking at Android and/or iOS. Here, we’ll focus on Android.

The C++ parts of PyTorch--LibTorch--can be compiled for Android, and we could access that from an app written in Java using the Android Java Native Interface (JNI). But we really only need a handful of functions from PyTorch--loading a JITed model, making inputs into tensors and IValues, running them through the model, and getting results back. To save us the trouble of using the JNI, the PyTorch developers wrapped these functions into a small library called PyTorch Mobile.

The stock way of developing apps in Android is to use the Android Studio IDE, and we will be using it, too. But this means there are a few dozen files of administrativa--which also happen to change from one Android version to the next. As such, we focus on the bits that turn one of the Android Studio templates (Java App with Empty Activity) into an app that takes a picture, runs it through our zebra-CycleGAN, and displays the result. Sticking with the theme of the book, we will be efficient with the Android bits (and they can be painful compared with writing PyTorch code) in the example app.

To infuse life into the template, we need to do three things. First, we need to define a UI. To keep things as simple as we can, we use just two elements: a TextView named headline that we can click to take and transform a picture, and an ImageView named image_view to show our picture. We will leave the picture-taking to the camera app (in a real app, you would probably avoid this hand-off and drive the camera yourself for a smoother user experience), because dealing with the camera directly would blur our focus on deploying PyTorch models.15

Then, we need to include PyTorch as a dependency. This is done by editing our app’s build.gradle file and adding pytorch_android and pytorch_android_torchvision.

Listing 15.17 Additions to build.gradle

dependencies {                                                     
  ...
  implementation 'org.pytorch:pytorch_android:1.4.0'               
 
  implementation 'org.pytorch:pytorch_android_torchvision:1.4.0'   
}

The dependencies section is very likely already there. If not, add it at the bottom.

The pytorch_android library gets the core things mentioned in the text.

The helper library pytorch_android_torchvision--perhaps a bit immodestly named when compared to its larger TorchVision sibling--contains a few utilities to convert bitmap objects to tensors, but at the time of writing not much more.

We also need to add our traced model (traced_zebra_model.pt) to the app as an asset.

Finally, we can get to the meat of our shiny app: the Java class derived from Activity that contains our main code. We’ll discuss just an excerpt here. It starts with imports and model setup.

Listing 15.18 MainActivity.java part 1

...
import org.pytorch.IValue;                                                 
import org.pytorch.Module;
import org.pytorch.Tensor;
import org.pytorch.torchvision.TensorImageUtils;
...
public class MainActivity extends AppCompatActivity {
  private org.pytorch.Module model;                                        
 
  @Override
  protected void onCreate(Bundle savedInstanceState) {
    ...
    try {                                                                  
      model = Module.load(assetFilePath(this, "traced_zebra_model.pt"));   
    } catch (IOException e) {
      Log.e("Zebraify", "Error reading assets", e);
      finish();
    }
    ...
  }
  ...
}

Don’t you love imports?

Holds our JITed model

In Java we have to catch the exceptions.

Loads the module from a file

We need some imports from the org.pytorch namespace. In the typical style that is a hallmark of Java, we import IValue, Module, and Tensor, which do what we might expect; and the class org.pytorch.torchvision.TensorImageUtils, which holds utility functions to convert between tensors and images.

First, of course, we need to declare a variable holding our model. Then, when our app is started--in onCreate of our activity--we load the module using the Module.load method, passing the location of the model file as an argument. There is a slight complication, though: an app’s data is shipped as assets that are not easily accessible from the filesystem. For this reason, a utility method called assetFilePath (taken from the PyTorch Android examples) copies the asset to a location in the filesystem. Finally, in Java, we need to catch the exceptions our code throws, unless we want to (and are able to) declare the method we are coding as throwing them in turn.

When we get an image from the camera app using Android’s Intent mechanism, we need to run it through our model and display it. This happens in the onActivityResult event handler.

Listing 15.19 MainActivity.java, part 2

@Override
protected void onActivityResult(int requestCode, int resultCode,
                                Intent data) {
  if (requestCode == REQUEST_IMAGE_CAPTURE &&
      resultCode == RESULT_OK) {                                          
    Bitmap bitmap = (Bitmap) data.getExtras().get("data");
 
    final float[] means = {0.0f, 0.0f, 0.0f};                             
    final float[] stds = {1.0f, 1.0f, 1.0f};
 
    final Tensor inputTensor = TensorImageUtils.bitmapToFloat32Tensor(    
        bitmap, means, stds);
 
    final Tensor outputTensor = model.forward(                            
        IValue.from(inputTensor)).toTensor();
    Bitmap output_bitmap = tensorToBitmap(outputTensor, means, stds,
        Bitmap.Config.RGB_565);                                           
    image_view.setImageBitmap(output_bitmap);
  }
}

This is executed when the camera app takes a picture.

Performs normalization; our images already come in the 0..1 range we need, so we make it a no-op: a shift of 0 and a scaling divisor of 1.

Gets a tensor from a bitmap, combining steps like TorchVision’s ToTensor (converting to a float tensor with entries between 0 and 1) and Normalize

This looks almost like what we did in C++.

tensorToBitmap is our own invention.

Converting the bitmap we get from Android to a tensor is handled by the TensorImageUtils.bitmapToFloat32Tensor function (static method), which takes two float arrays, means and stds, in addition to bitmap. Here we specify the mean and standard deviation of our input data(set), which will then be mapped to have zero mean and unit standard deviation just like TorchVision’s Normalize transform. Android already gives us the images in the 0..1 range that we need to feed into our model, so we specify mean 0 and standard deviation 1 to prevent the normalization from changing our image.

Around the actual call to model.forward, we then do the same IValue wrapping and unwrapping dance that we did when using the JIT in C++, except that our forward takes a single IValue rather than a vector of them. Finally, we need to get back to a bitmap. Here PyTorch will not help us, so we need to define our own tensorToBitmap (and submit the pull request to PyTorch). We spare you the details here, as they are tedious and full of copying (from the tensor to a float[] array to an int[] array containing ARGB values to the bitmap), but it is as it is. It is designed to be the inverse of bitmapToFloat32Tensor.


Figure 15.5 Our CycleGAN zebra app

And that’s all we need to do to get PyTorch onto Android. Using the minimal additions to the code that we left out here to request a picture, we have a Zebraify Android app that looks like what we see in figure 15.5. Well done!16

We should note that we end up with a full version of PyTorch with all ops on Android. This will, in general, also include operations you will not need for a given task, leading to the question of whether we could save some space by leaving them out. It turns out that starting with PyTorch 1.4, you can build a customized version of the PyTorch library that includes only the operations you need (see https://pytorch.org/mobile/android/#custom-build).

15.5.1 Improving efficiency: Model design and quantization

If we want to explore mobile in more detail, our next step is to try to make our models faster. When we wish to reduce the memory and compute footprint of our models, the first thing to look at is streamlining the model itself: that is, computing the same or very similar mappings from inputs to outputs with fewer parameters and operations. This is often called distillation. The details of distillation vary--sometimes we shrink the model by eliminating small or irrelevant weights;17 in other cases, we combine several layers of a net into one (DistilBERT) or even train a completely different, simpler model to reproduce the larger model’s outputs (OpenNMT’s original CTranslate). We mention this because such modifications are likely to be the first step in getting models to run faster.

Another approach is to reduce the footprint of each parameter and operation: instead of expending the usual 32 bits per parameter in the form of a float, we convert our model to work with integers (a typical choice is 8 bits). This is quantization.18

PyTorch does offer quantized tensors for this purpose. They are exposed as a set of scalar types similar to torch.float, torch.double, and torch.long (compare section 3.5). The most common quantized tensor scalar types are torch.quint8 and torch.qint8, representing numbers as unsigned and signed 8-bit integers, respectively. PyTorch uses a separate scalar type here in order to use the dispatch mechanism we briefly looked at in section 3.11.

It might seem surprising that using 8-bit integers instead of 32-bit floating-point numbers works at all; typically there is a slight degradation in results, but not much. Two things seem to contribute. First, if we consider rounding errors as essentially random, and convolutions and linear layers as weighted averages, we may expect rounding errors to typically cancel.19 This allows reducing the relative precision from the more than 20 bits of 32-bit floating-point numbers to the 7 bits that signed 8-bit integers offer. Second, quantization (in contrast to training with 16-bit floating-point numbers) moves from floating point to fixed precision (per tensor or per channel). This means the largest values are resolved to 7-bit precision, while values that are one-eighth of the largest values get only 7 - 3 = 4 bits. But if things like L1 regularization (briefly mentioned in chapter 8) work, we might hope that similar effects allow us to afford less precision for the smaller values in our weights when quantizing. In many cases, they do.
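
To get a feel for the fixed-point representation, here is a minimal sketch using quantized tensors from C++ (an illustration of the mechanism only, not a full model-quantization workflow; the scale and zero point are chosen by hand for data in the 0..1 range):

#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::rand({2, 3});        // float32 values in [0, 1)
  double scale = 1.0 / 255;            // stored value = round(x / scale) + zero_point
  int64_t zero_point = 0;
  auto q = torch::quantize_per_tensor(x, scale, zero_point, c10::kQUInt8);
  std::cout << q.int_repr() << '\n';   // the underlying 8-bit integer storage
  std::cout << (x - q.dequantize()).abs().max() << '\n';  // rounding error, at most scale / 2
  return 0;
}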

Quantization debuted with PyTorch 1.3 and is still a bit rough in terms of supported operations in PyTorch 1.4. It is rapidly maturing, though, and we recommend checking it out if you are serious about computationally efficient deployment.

15.6 Emerging technology: Enterprise serving of PyTorch models

We may ask ourselves whether all the deployment aspects discussed so far should involve as much coding as they do. Sure, it is common enough for someone to code all that. As of early 2020, while we are busy putting the finishing touches on the book, we have great expectations for the near future; but at the same time, we feel that the deployment landscape will have changed significantly by the summer.

Currently, RedisAI (https://github.com/RedisAI/redisai-py), which one of the authors is involved with, is waiting to apply Redis goodness to our models. Just after this book was finalized, PyTorch experimentally released TorchServe (see https://pytorch.org/blog/pytorch-library-updates-new-model-serving-library/#torchserve-experimental).

Similarly, MLflow (https://mlflow.org) is building out more and more support, and Cortex (https://cortex.dev) wants us to use it to deploy models. For the more specific task of information retrieval, there is also EuclidesDB (https://euclidesdb.readthedocs.io/en/latest), an AI-based feature database.

Exciting times, but unfortunately, they do not sync with our writing schedule. We hope to have more to tell in the second edition (or a second book)!

15.7 Conclusion

This concludes our short tour of how to get our models out to where we want to apply them. While the ready-made Torch serving is not quite there yet as we write this, when it arrives you will likely want to export your models through the JIT--so you’ll be glad we went through it here. In the meantime, you now know how to deploy your model to a network service, in a C++ application, or on mobile. We look forward to seeing what you will build!

Hopefully we’ve also delivered on the promise of this book: a working knowledge of deep learning basics, and a level of comfort with the PyTorch library. We hope you’ve enjoyed reading as much as we’ve enjoyed writing.20

15.8 Exercises

As we close out Deep Learning with PyTorch, we have one final exercise for you:

  1. Pick a project that sounds exciting to you. Kaggle is a great place to start looking. Dive in.

You have acquired the skills and learned the tools you need to succeed. We can’t wait to hear what you do next; drop us a line on the book’s forum and let us know!

15.9 Summary

  • We can serve PyTorch models by wrapping them in a Python web server framework such as Flask.

  • By using JITed models, we can avoid the GIL even when calling them from Python, which is a good idea for serving.

  • Request batching and asynchronous processing help use resources efficiently, in particular when inference is done on the GPU.

  • To export models beyond PyTorch, ONNX is a great format. ONNX Runtime provides a backend for many purposes, including the Raspberry Pi.

  • The JIT allows you to export and run arbitrary PyTorch code in C++ or on mobile with little effort.

  • Tracing is the easiest way to get JITed models; you might need to use scripting for some particularly dynamic parts.

  • There also is good support for C++ (and an increasing number of other languages) for running models both JITed and natively.

  • PyTorch Mobile lets us easily integrate JITed models into Android or iOS apps.

  • For mobile deployments, we want to streamline the model architecture and quantize models if possible.

  • A few deployment frameworks are emerging, but a standard isn’t quite visible yet.


1.To play it safe, do not do this on an untrusted network.

2.Or pip3 for Python3. You also might want to run it from a Python virtual environment.

3.One of the earliest public talks discussing the inadequacy of Flask serving for PyTorch models is Christian Perone’s “PyTorch under the Hood,” http://mng.bz/xWdW.

4.Fancy people call these asynchronous function generators or sometimes, more loosely, coroutines: https://en.wikipedia.org/wiki/Coroutine.

5.An alternative might be to forgo the timer and just run whenever the queue is not empty. This would potentially run smaller “first” batches, but the overall performance impact might not be so large for most applications.

6.The code lives at https://github.com/microsoft/onnxruntime, but be sure to read the privacy statement! Currently, building ONNX Runtime yourself will get you a package that does not send things to the mothership.

7.Strictly speaking, this traces the model as a function. Recently, PyTorch gained the ability to preserve more of the module structure using torch.jit.trace_module, but for us, the plain tracing is sufficient.

8.But TorchVision may develop a convenience function for loading images.

9.The code works with PyTorch 1.4 and, hopefully, above. In PyTorch versions before 1.3 you needed data in place of data_ptr.

10.We hope you have not been slacking off about trying out things you read.

11.The code directory has a bit longer version to work around Windows issues.

12.You might have to replace the path with where your PyTorch or LibTorch installation is located. Note that the C++ library can be pickier than the Python one in terms of compatibility: if you are using a CUDA-enabled library, you need to have the matching CUDA headers installed. If you get cryptic error messages about “Caffe2 using CUDA,” CMake found a CUDA-enabled library even though you need a CPU-only version of the library.

13.This is a great improvement over PyTorch 1.3, where we needed to implement custom modules for ReLU, InstanceNorm2d, and others.

14.This is a bit blurry because you can create a new tensor sharing memory with an input and modify it in place, but it’s best to avoid that if possible.

15.We are very proud of the topical metaphor.

16.At the time of writing, PyTorch Mobile is still relatively young, and you may hit rough edges. With PyTorch 1.3, the colors were off on an actual 32-bit ARM phone even though things worked in the emulator. The reason is likely a bug in one of the computational backend functions that are only used on ARM. With PyTorch 1.4 and a newer phone (64-bit ARM), it seemed to work better.

17.Examples include the Lottery Ticket Hypothesis and WaveRNN.

18.In contrast to quantization, (partially) moving to 16-bit floating-point for training is usually called reduced or (if some bits stay 32-bit) mixed-precision training.

19.Fancy people would refer to the Central Limit Theorem here. And indeed, we must take care that the independence (in the statistical sense) of rounding errors is preserved. For example, we usually want zero (a prominent output of ReLU) to be exactly representable. Otherwise, all the zeros would be changed by the exact same quantity in rounding, leading to errors adding up rather than canceling.

20.More, actually; writing books is hard!
