11 Training a classification model to detect suspected tumors

This chapter covers

  • Using PyTorch DataLoaders to load data
  • Implementing a model that performs classification on our CT data
  • Setting up the basic skeleton for our application
  • Logging and displaying metrics

In the previous chapters, we set the stage for our cancer-detection project. We covered medical details of lung cancer, took a look at the main data sources we will use for our project, and transformed our raw CT scans into a PyTorch Dataset instance. Now that we have a dataset, we can easily consume our training data. So let’s do that!

11.1 A foundational model and training loop

We’re going to do two main things in this chapter. We’ll start by building the nodule classification model and training loop that will be the foundation that the rest of part 2 uses to explore the larger project. To do that, we’ll use the Ct and LunaDataset classes we implemented in chapter 10 to feed DataLoader instances. Those instances, in turn, will feed our classification model with data via training and validation loops.

We’ll finish the chapter by using the results from running that training loop to introduce one of the hardest challenges in this part of the book: how to get high-quality results from messy, limited data. In later chapters, we’ll explore the specific ways in which our data is limited, as well as mitigate those limitations.

Let’s recall our high-level roadmap from chapter 9, shown here in figure 11.1. Right now, we’ll work on producing a model capable of performing step 4: classification. As a reminder, we will classify candidates as nodules or non-nodules (we’ll build another classifier to attempt to tell malignant nodules from benign ones in chapter 14). That means we’re going to assign a single, specific label to each sample that we present to the model. In this case, those labels are “nodule” and “non-nodule,” since each sample represents a single candidate.

Figure 11.1 Our end-to-end project to detect lung cancer, with a focus on this chapter’s topic: step 4, classification

Getting an early end-to-end version of a meaningful part of your project is a great milestone to reach. Having something that works well enough for the results to be evaluated analytically lets you move forward with future changes, confident that you are improving your results with each change--or at least that you’re able to set aside any changes and experiments that don’t work out! Expect to have to do a lot of experimentation when working on your own projects. Getting the best results will usually require considerable tinkering and tweaking.

But before we can get to the experimental phase, we must lay our foundation. Let’s see what our part 2 training loop looks like in figure 11.2: it should seem generally familiar, given that we saw a similar set of core steps in chapter 5. Here we will also use a validation set to evaluate our training progress, as discussed in section 5.5.3.

Figure 11.2 The training and validation script we will implement in this chapter

The basic structure of what we’re going to implement is as follows (a condensed code sketch follows the list):

  • Initialize our model and data loading.

  • Loop over a semi-arbitrarily chosen number of epochs.

    • Loop over each batch of training data returned by LunaDataset.
      • The data-loader worker process loads the relevant batch of data in the background.
      • Pass the batch into our classification model to get results.
      • Calculate our loss based on the difference between our predicted results and our ground-truth data.
      • Record metrics about our model’s performance into a temporary data structure.
      • Update the model weights via backpropagation of the error.
    • Loop over each batch of validation data (in a manner very similar to the training loop).
      • Load the relevant batch of validation data (again, in the background worker process).
      • Classify the batch, and compute the loss.
      • Record information about how well the model performed on the validation data.
    • Print out progress and performance information for this epoch.
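
Condensed into code, that structure maps almost directly onto the main method we will build over the course of this chapter. The sketch below is only an outline: the helper methods named here (initTrainDl, initValDl, doTraining, doValidation, and logMetrics) are implemented in the listings that follow.

def main(self):
    train_dl = self.initTrainDl()
    val_dl = self.initValDl()

    for epoch_ndx in range(1, self.cli_args.epochs + 1):
        trnMetrics_t = self.doTraining(epoch_ndx, train_dl)    # inner loop over training batches
        self.logMetrics(epoch_ndx, 'trn', trnMetrics_t)

        valMetrics_t = self.doValidation(epoch_ndx, val_dl)    # inner loop over validation batches
        self.logMetrics(epoch_ndx, 'val', valMetrics_t)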

As we go through the code for the chapter, keep an eye out for two main differences between the code we’re producing here and what we used for a training loop in part 1. First, we’ll put more structure around our program, since the project as a whole is quite a bit more complicated than what we did in earlier chapters. Without that extra structure, the code can get messy quickly. And for this project, we will have our main training application use a number of well-contained functions, and we will further separate code for things like our dataset into self-contained Python modules.

Make sure that for your own projects, you match the level of structure and design to the complexity level of your project. Too little structure, and it will become difficult to perform experiments cleanly, troubleshoot problems, or even describe what you’re doing! Conversely, too much structure means you’re wasting time writing infrastructure that you don’t need and most likely slowing yourself down by having to conform to it after all that plumbing is in place. Plus it can be tempting to spend time on infrastructure as a procrastination tactic, rather than digging into the hard work of making actual progress on your project. Don’t fall into that trap!

The other big difference between this chapter’s code and part 1 will be a focus on collecting a variety of metrics about how training is progressing. Being able to accurately determine the impact of changes on training is impossible without having good metrics logging. Without spoiling the next chapter, we’ll also see how important it is to collect not just metrics, but the right metrics for the job. We’ll lay the infrastructure for tracking those metrics in this chapter, and we’ll exercise that infrastructure by collecting and displaying the loss and percent of samples correctly classified, both overall and per class. That’s enough to get us started, but we’ll cover a more realistic set of metrics in chapter 12.

11.2 The main entry point for our application

One of the big structural differences from earlier training work we’ve done in this book is that part 2 wraps our work in a fully fledged command-line application. It will parse command-line arguments, have a full-featured --help command, and be easy to run in a wide variety of environments. All this will allow us to easily invoke the training routines from both Jupyter and a Bash shell.1

Our application’s functionality will be implemented via a class so that we can instantiate the application and pass it around if we feel the need. This can make testing, debugging, or invocation from other Python programs easier. We can invoke the application without needing to spin up a second OS-level process (we won’t do explicit unit testing in this book, but the structure we create can be helpful for real projects where that kind of testing is appropriate).

One way to take advantage of being able to invoke our training by either function call or OS-level process is to wrap the function invocations into a Jupyter Notebook so the code can easily be called from either the native CLI or the browser.

Listing 11.1 code/p2_run_everything.ipynb

# In[2]:
def run(app, *argv):
    argv = list(argv)
    argv.insert(0, '--num-workers=4')                       
    log.info("Running: {}({!r}).main()".format(app, argv))
 
    app_cls = importstr(*app.rsplit('.', 1))                
    app_cls(argv).main()
 
    log.info("Finished: {}.{!r}).main()".format(app, argv))
 
# In[6]:
run('p2ch11.training.LunaTrainingApp', '--epochs=1')

We assume you have a four-core, eight-thread CPU. Change the 4 if needed.

This is a slightly cleaner call to __import__.

Note The training here assumes that you’re on a workstation that has a four-core, eight-thread CPU, 16 GB of RAM, and a GPU with 8 GB of RAM. Reduce --batch-size if your GPU has less RAM, and --num-workers if you have fewer CPU cores, or less CPU RAM.

Let’s get some semistandard boilerplate code out of the way. We’ll start at the end of the file with a pretty standard if main stanza that instantiates the application object and invokes the main method.

Listing 11.2 training.py:386

if __name__ == '__main__':
  LunaTrainingApp().main()

From there, we can jump back to the top of the file and have a look at the application class and the two functions we just called, __init__ and main. We’ll want to be able to accept command-line arguments, so we’ll use the standard argparse library (https://docs.python.org/3/library/argparse.html) in the application’s __init__ function. Note that we can pass in custom arguments to the initializer, should we wish to do so. The main method will be the primary entry point for the core logic of the application.

Listing 11.3 training.py:31, class LunaTrainingApp

class LunaTrainingApp:
  def __init__(self, sys_argv=None):
    if sys_argv is None:                                                   
       sys_argv = sys.argv[1:]
 
    parser = argparse.ArgumentParser()
    parser.add_argument('--num-workers',
      help='Number of worker processes for background data loading',
      default=8,
      type=int,
    )
    # ... line 63
    self.cli_args = parser.parse_args(sys_argv)
    self.time_str = datetime.datetime.now().strftime('%Y-%m-%d_%H.%M.%S')  
 
  # ... line 137
  def main(self):
    log.info("Starting {}, {}".format(type(self).__name__, self.cli_args))

If the caller doesn’t provide arguments, we get them from the command line.

We’ll use the timestamp to help identify training runs.

This structure is pretty general and could be reused for future projects. In particular, parsing arguments in __init__ allows us to configure the application separately from invoking it.

If you check the code for this chapter on the book’s website or GitHub, you might notice some extra lines mentioning TensorBoard. Ignore those for now; we’ll discuss them in detail later in the chapter, in section 11.9.

11.3 Pretraining setup and initialization

Before we can begin iterating over each batch in our epoch, some initialization work needs to happen. After all, we can’t train a model if we haven’t even instantiated one yet! We need to do two main things, as we can see in figure 11.3. The first, as we just mentioned, is to initialize our model and optimizer; and the second is to initialize our Dataset and DataLoader instances. LunaDataset will define the randomized set of samples that will make up our training epoch, and our DataLoader instance will perform the work of loading the data out of our dataset and providing it to our application.

Figure 11.3 The training and validation script we will implement in this chapter, with a focus on the preloop variable initialization

11.3.1 Initializing the model and optimizer

For this section, we are treating the details of LunaModel as a black box. In section 11.4, we will detail the internal workings. You are welcome to explore changes to the implementation to better meet our goals for the model, although that’s probably best done after finishing at least chapter 12.

Let’s see what our starting point looks like.

Listing 11.4 training.py:31, class LunaTrainingApp

class LunaTrainingApp:
  def __init__(self, sys_argv=None):
    # ... line 70
    self.use_cuda = torch.cuda.is_available()
    self.device = torch.device("cuda" if self.use_cuda else "cpu")
 
    self.model = self.initModel()
    self.optimizer = self.initOptimizer()
 
  def initModel(self):
    model = LunaModel()
    if self.use_cuda:
      log.info("Using CUDA; {} devices.".format(torch.cuda.device_count()))
      if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
      model = model.to(self.device)
    return model
 
  def initOptimizer(self):
    return SGD(self.model.parameters(), lr=0.001, momentum=0.99)

Detects multiple GPUs

Wraps the model

Sends model parameters to the GPU

If the system used for training has more than one GPU, we will use the nn.DataParallel class to distribute the work between all of the GPUs in the system and then collect and resync parameter updates and so on. This is almost entirely transparent in terms of both the model implementation and the code that uses that model.

DataParallel vs. DistributedDataParallel

In this book, we use DataParallel to handle utilizing multiple GPUs. We chose DataParallel because it’s a simple drop-in wrapper around our existing models. It is not the best-performing solution for using multiple GPUs, however, and it is limited to working with the hardware available in a single machine.

PyTorch also provides DistributedDataParallel, which is the recommended wrapper class to use when you need to spread work between more than one GPU or machine. Since the proper setup and configuration are nontrivial, and we suspect that the vast majority of our readers won’t see any benefit from the complexity, we won’t cover DistributedDataParallel in this book. If you wish to learn more, we suggest reading the official documentation: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html.


Assuming that self.use_cuda is true, the call self.model.to(device) moves the model parameters to the GPU, setting up the various convolutions and other calculations to use the GPU for the heavy numerical lifting. It’s important to do so before constructing the optimizer, since, otherwise, the optimizer would be left looking at the CPU-based parameter objects rather than those copied to the GPU.

For our optimizer, we’ll use basic stochastic gradient descent (SGD; https://pytorch.org/docs/stable/optim.html#torch.optim.SGD) with momentum. We first saw this optimizer in chapter 5. Recall from part 1 that many different optimizers are available in PyTorch; while we won’t cover most of them in any detail, the official documentation (https://pytorch.org/docs/stable/optim.html#algorithms) does a good job of linking to the relevant papers.

Using SGD is generally considered a safe place to start when it comes to picking an optimizer; there are some problems that might not work well with SGD, but they’re relatively rare. Similarly, a learning rate of 0.001 and a momentum of 0.9 are pretty safe choices. Empirically, SGD with those values has worked reasonably well for a wide range of projects, and it’s easy to try a learning rate of 0.01 or 0.0001 if things aren’t working well right out of the box.

That’s not to say any of those values is the best for our use case, but trying to find better ones is getting ahead of ourselves. Systematically trying different values for learning rate, momentum, network size, and other similar configuration settings is called a hyperparameter search. There are other, more glaring issues we need to address first in the coming chapters. Once we address those, we can begin to fine-tune these values. As we mentioned in the section “Testing other optimizers” in chapter 5, there are also other, more exotic optimizers we might choose; but other than perhaps swapping torch.optim.SGD for torch.optim.Adam, understanding the trade-offs involved is a topic too advanced for this book.

11.3.2 Care and feeding of data loaders

The LunaDataset class that we built in the last chapter acts as the bridge between whatever Wild West data we have and the somewhat more structured world of tensors that the PyTorch building blocks expect. For example, torch.nn.Conv3d (https://pytorch.org/docs/stable/nn.html#conv3d) expects five-dimensional input: (N, C, D, H, W): number of samples, channels per sample, depth, height, and width. Quite different from the native 3D our CT provides!

You may recall the ct_t.unsqueeze(0) call in LunaDataset.__getitem__ from the last chapter; it provides the fourth dimension, a “channel” for our data. Recall from chapter 4 that an RGB image has three channels, one each for red, green, and blue. Astronomical data could have dozens, one each for various slices of the electromagnetic spectrum--gamma rays, X-rays, ultraviolet light, visible light, infrared, microwaves, and/or radio waves. Since CT scans are single-intensity, our channel dimension is only size 1.

Also recall from part 1 that training on single samples at a time is typically an inefficient use of computing resources, because most processing platforms are capable of more parallel calculations than are required by a model to process a single training or validation sample. The solution is to group sample tuples together into a batch tuple, as in figure 11.4, allowing multiple samples to be processed at the same time. The fifth dimension (N) differentiates multiple samples in the same batch.
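
To make those dimensions concrete, here is a small shape-only illustration; the tensors hold dummy zeros, only the sizes matter, and the batch size of 16 is an arbitrary choice for the example:

>>> import torch
>>> ct_chunk = torch.zeros(32, 48, 48)      # one candidate: depth, height, width
>>> ct_t = ct_chunk.unsqueeze(0)            # add the channel dimension
>>> ct_t.shape
torch.Size([1, 32, 48, 48])
>>> batch = torch.stack([ct_t] * 16)        # roughly what the loader's collation produces
>>> batch.shape
torch.Size([16, 1, 32, 48, 48])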

Figure 11.4 Sample tuples being collated into a single batch tuple inside a data loader

Conveniently, we don’t have to implement any of this batching: the PyTorch DataLoader class will handle all of the collation work for us. We’ve already built the bridge from the CT scans to PyTorch tensors with our LunaDataset class, so all that remains is to plug our dataset into a data loader.

Listing 11.5 training.py:89, LunaTrainingApp.initTrainDl

def initTrainDl(self):
  train_ds = LunaDataset(                    
    val_stride=10,
    isValSet_bool=False,
  )
 
  batch_size = self.cli_args.batch_size
  if self.use_cuda:
    batch_size *= torch.cuda.device_count()
 
  train_dl = DataLoader(                     
    train_ds,
    batch_size=batch_size,                   
    num_workers=self.cli_args.num_workers,
    pin_memory=self.use_cuda,                
  )
 
  return train_dl
 
# ... line 137
def main(self):
  train_dl = self.initTrainDl()
  val_dl = self.initValDl()                

Our custom dataset

An off-the-shelf class

Batching is done automatically.

Pinned memory transfers to GPU quickly.

The validation data loader is very similar to training.

In addition to batching individual samples, data loaders can also provide parallel loading of data by using separate processes and shared memory. All we need to do is specify num_workers=... when instantiating the data loader, and the rest is taken care of behind the scenes. Each worker process produces complete batches as in figure 11.4. This helps make sure hungry GPUs are well fed with data. Our validation_ds and validation_dl instances look similar, except for the obvious isValSet_bool=True.

When we iterate, like for batch_tup in self.train_dl:, we won’t have to wait for each Ct to be loaded, samples to be taken and batched, and so on. Instead, we’ll get the already loaded batch_tup immediately, and a worker process will be freed up in the background to begin loading another batch to use on a later iteration. Using the data-loading features of PyTorch can help speed up most projects, because we can overlap data loading and processing with GPU calculation.

11.4 Our first-pass neural network design

The possible design space for a convolutional neural network capable of detecting tumors is effectively infinite. Luckily, considerable effort has been spent over the past decade or so investigating effective models for image recognition. While these have largely focused on 2D images, the general architecture ideas transfer well to 3D, so there are many tested designs that we can use as a starting point. This helps because although our first network architecture is unlikely to be our best option, right now we are only aiming for “good enough to get us going.”

We will base the network design on what we used in chapter 8. We will have to update the model somewhat because our input data is 3D, and we will add some complicating details, but the overall structure shown in figure 11.5 should feel familiar. Similarly, the work we do for this project will be a good base for your future projects, although the further you get from classification or segmentation projects, the more you’ll have to adapt this base to fit. Let’s dissect this architecture, starting with the four repeated blocks that make up the bulk of the network.

Figure 11.5 The architecture of the LunaModel class consisting of a batch-normalization tail, a four-block backbone, and a head comprised of a linear layer followed by softmax

11.4.1 The core convolutions

Classification models often have a structure that consists of a tail, a backbone (or body), and a head. The tail is the first few layers that process the input to the network. These early layers often have a different structure or organization than the rest of the network, as they must adapt the input to the form expected by the backbone. Here we use a simple batch normalization layer, though often the tail contains convolutional layers as well. Such convolutional layers are often used to aggressively downsample the size of the image; since our image size is already small, we don’t need to do that here.

Next, the backbone of the network typically contains the bulk of the layers, which are usually arranged in series of blocks. Each block has the same (or at least a similar) set of layers, though often the size of the expected input and the number of filters changes from block to block. We will use a block that consists of two 3 × 3 convolutions, each followed by an activation, with a max-pooling operation at the end of the block. We can see this in the expanded view of figure 11.5 labeled Block[block1]. Here’s what the implementation of the block looks like in code.

Listing 11.6 model.py:67, class LunaBlock

class LunaBlock(nn.Module):
  def __init__(self, in_channels, conv_channels):
    super().__init__()

    self.conv1 = nn.Conv3d(
      in_channels, conv_channels, kernel_size=3, padding=1, bias=True,
    )
    self.relu1 = nn.ReLU(inplace=True)
    self.conv2 = nn.Conv3d(
      conv_channels, conv_channels, kernel_size=3, padding=1, bias=True,
    )
    self.relu2 = nn.ReLU(inplace=True)    
 
    self.maxpool = nn.MaxPool3d(2, 2)
 
  def forward(self, input_batch):
    block_out = self.conv1(input_batch)
    block_out = self.relu1(block_out)     
    block_out = self.conv2(block_out)
    block_out = self.relu2(block_out)     
 
    return self.maxpool(block_out)

These could be implemented as calls to the functional API instead.

Finally, the head of the network takes the output from the backbone and converts it into the desired output form. For convolutional networks, this often involves flattening the intermediate output and passing it to a fully connected layer. For some networks, it makes sense to also include a second fully connected layer, although that is usually more appropriate for classification problems in which the imaged objects have more structure (think about cars versus trucks having wheels, lights, grill, doors, and so on) and for projects with a large number of classes. Since we are only doing binary classification, and we don’t seem to need the additional complexity, we have only a single flattening layer.

Using a structure like this can be a good first building block for a convolutional network. There are more complicated designs out there, but for many projects they’re overkill in terms of both implementation complexity and computational demands. It’s a good idea to start simple and add complexity only when there’s a demonstrable need for it.

We can see the convolutions of our block represented in 2D in figure 11.6. Since this is a small portion of a larger image, we ignore padding here. (Note that the ReLU activation function is not shown, as applying it does not change the image sizes.)

Let’s walk through the information flow between our input voxels and a single voxel of output. We want to have a strong sense of how our output will respond when the inputs change. It might be a good idea to review chapter 8, particularly sections 8.1 through 8.3, just to make sure you’re 100% solid on the basic mechanics of convolutions.

Figure 11.6 The convolutional architecture of a LunaModel block consisting of two 3 × 3 convolutions followed by a max pool. The final pixel has a receptive field of 6 × 6.

We’re using 3 × 3 × 3 convolutions in our block. A single 3 × 3 × 3 convolution has a receptive field of 3 × 3 × 3, which is almost tautological. Twenty-seven voxels are fed in, and one comes out.

It gets interesting when we use two 3 × 3 × 3 convolutions stacked back to back. Stacking convolutional layers allows the final output voxel (or pixel) to be influenced by input further away than the size of the convolutional kernel suggests. If that output voxel is fed into another 3 × 3 × 3 kernel as one of the edge voxels, then some of the inputs to the first layer will be outside of the 3 × 3 × 3 area of input to the second. The final output of those two stacked layers has an effective receptive field of 5 × 5 × 5. That means that, taken together, the stacked layers act similarly to a single convolutional layer with a larger kernel size.

Put another way, each 3 × 3 × 3 convolutional layer adds an additional one-voxel-per-edge border to the receptive field. We can see this if we trace the arrows in figure 11.6 backward; our 2 × 2 output has a receptive field of 4 × 4, which in turn has a receptive field of 6 × 6. Two stacked 3 × 3 × 3 layers use fewer parameters than a full 5 × 5 × 5 convolution would (and so are also faster to compute).
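
If you want to check the parameter claim yourself, a quick count makes the difference concrete (the channel count of 8 here is arbitrary, chosen only for illustration):

>>> import torch.nn as nn
>>> def count_params(m):
...   return sum(p.numel() for p in m.parameters())
...
>>> stacked = nn.Sequential(              # two 3x3x3 convolutions back to back
...   nn.Conv3d(8, 8, kernel_size=3, padding=1),
...   nn.Conv3d(8, 8, kernel_size=3, padding=1),
... )
>>> single = nn.Conv3d(8, 8, kernel_size=5, padding=2)   # one 5x5x5 convolution
>>> count_params(stacked), count_params(single)
(3472, 8008)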

The output of our two stacked convolutions is fed into a 2 × 2 × 2 max pool, which means we’re taking a 6 × 6 × 6 effective field, throwing away seven-eighths of the data, and going with the one 5 × 5 × 5 field that produced the largest value.2 Now, those “discarded” input voxels still have a chance to contribute, since the max pool that’s one output voxel over has an overlapping input field, so it’s possible they’ll influence the final output that way.

Note that while we show the receptive field shrinking with each convolutional layer, we’re using padded convolutions, which add a virtual one-pixel border around the image. Doing so keeps our input and output image sizes the same.

The nn.ReLU layers are the same as the ones we looked at in chapter 6. Inputs greater than 0.0 are passed through unchanged, and inputs less than 0.0 are clamped to zero.

This block will be repeated multiple times to form our model’s backbone.

11.4.2 The full model

Let’s take a look at the full model implementation. We’ll skip the block definition, since we just saw that in listing 11.6.

Listing 11.7 model.py:13, class LunaModel

class LunaModel(nn.Module):
  def __init__(self, in_channels=1, conv_channels=8):
    super().__init__()
 
    self.tail_batchnorm = nn.BatchNorm3d(1)                           
 
    self.block1 = LunaBlock(in_channels, conv_channels)               
    self.block2 = LunaBlock(conv_channels, conv_channels * 2)         
    self.block3 = LunaBlock(conv_channels * 2, conv_channels * 4)     
    self.block4 = LunaBlock(conv_channels * 4, conv_channels * 8)     
 
    self.head_linear = nn.Linear(1152, 2)                             
    self.head_softmax = nn.Softmax(dim=1)                             

Tail

Backbone

Head

Here, our tail is relatively simple. We are going to normalize our input using nn.BatchNorm3d, which, as we saw in chapter 8, will shift and scale our input so that it has a mean of 0 and a standard deviation of 1. Thus, the somewhat odd Hounsfield unit (HU) scale that our input is in won’t really be visible to the rest of the network. This is a somewhat arbitrary choice; we know what our input units are, and we know the expected values of the relevant tissues, so we could probably implement a fixed normalization scheme pretty easily. It’s not clear which approach would be better.3

Our backbone is four repeated blocks, with the block implementation pulled out into the separate nn.Module subclass we saw earlier in listing 11.6. Since each block ends with a 2 × 2 × 2 max-pool operation, after four blocks we will have decreased the resolution of the image by a factor of 16 in each dimension. Recall from chapter 10 that our data is returned in chunks that are 32 × 48 × 48, which will become 2 × 3 × 3 by the end of the backbone.
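
We can sanity-check that reduction by pushing a throwaway tensor through four 2 × 2 × 2 max-pool operations; the convolutions and channel growth are ignored here, since they don’t change the spatial sizes:

>>> import torch
>>> import torch.nn as nn
>>> x = torch.zeros(1, 1, 32, 48, 48)     # one single-channel 32 x 48 x 48 chunk
>>> for _ in range(4):
...   x = nn.MaxPool3d(2, 2)(x)
...
>>> x.shape
torch.Size([1, 1, 2, 3, 3])

Since block4 produces conv_channels * 8 = 64 channels, that 2 × 3 × 3 output flattens to the 1,152 input features that head_linear expects (64 × 2 × 3 × 3 = 1,152).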

Finally, our head is just a fully connected layer followed by a call to nn.Softmax. Softmax is a useful function for single-label classification tasks and has a few nice properties: it bounds the output between 0 and 1, it’s relatively insensitive to the absolute range of the inputs (only the relative values of the inputs matter), and it allows our model to express the degree of certainty it has in an answer.

The function itself is relatively simple. Every value from the input is used to exponentiate e, and the resulting series of values is then divided by the sum of all the results of exponentiation. Here’s a simple, nonoptimized softmax implementation in pure Python:

>>> from math import e
>>> logits = [1, -2, 3]
>>> exp = [e ** x for x in logits]
>>> exp
[2.718, 0.135, 20.086]

>>> softmax = [x / sum(exp) for x in exp]
>>> softmax
[0.118, 0.006, 0.876]

Of course, we use the PyTorch version of nn.Softmax for our model, as it natively understands batches and tensors and will perform autograd quickly and as expected.
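
For example, here are the same logits as before, run through the module; dim=1 tells softmax to normalize across the class dimension of each sample in the batch:

>>> import torch
>>> import torch.nn as nn
>>> logits = torch.tensor([[1.0, -2.0, 3.0]])   # a batch containing a single sample
>>> nn.Softmax(dim=1)(logits)
tensor([[0.1185, 0.0059, 0.8756]])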

Complication: Converting from convolution to linear

Continuing on with our model definition, we come to a complication. We can’t just feed the output of self.block4 into a fully connected layer, since that output is a per-sample 2 × 3 × 3 image with 64 channels, and fully connected layers expect a 1D vector as input (well, technically they expect a batch of 1D vectors, which is a 2D array, but the mismatch remains either way). Let’s take a look at the forward method.

Listing 11.8 model.py:50, LunaModel.forward

def forward(self, input_batch):
  bn_output = self.tail_batchnorm(input_batch)

  block_out = self.block1(bn_output)
  block_out = self.block2(block_out)
  block_out = self.block3(block_out)
  block_out = self.block4(block_out)

  conv_flat = block_out.view(
    block_out.size(0),          
    -1,
  )
  linear_output = self.head_linear(conv_flat)

  return linear_output, self.head_softmax(linear_output)

The batch size

Note that before we pass data into a fully connected layer, we must flatten it using the view function. Since that operation is stateless (it has no parameters that govern its behavior), we can simply perform the operation in the forward function. This is somewhat similar to the functional interfaces we discussed in chapter 8. Almost every model that uses convolution and produces classifications, regressions, or other non-image outputs will have a similar component in the head of the network.

For the return value of the forward method, we return both the raw logits and the softmax-produced probabilities. We first hinted at logits in section 7.2.6: they are the numerical values produced by the network prior to being normalized into probabilities by the softmax layer. That might sound a bit complicated, but logits are really just the raw input to the softmax layer. They can be any real value, and the softmax will squash them to the range 0 to 1.

We’ll use the logits when we calculate the nn.CrossEntropyLoss during training,4 and we’ll use the probabilities for when we want to actually classify the samples. This kind of slight difference between what’s used for training and what’s used in production is fairly common, especially when the difference between the two outputs is a simple, stateless function like softmax.
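
Here is a minimal illustration of that split, using made-up logits for a single two-class sample: the loss consumes the raw logits (applying log-softmax internally), while the probabilities are what we would threshold to make an actual nodule/non-nodule call.

>>> import torch
>>> import torch.nn as nn
>>> logits = torch.tensor([[2.0, -1.0]])     # raw model output for one sample
>>> label = torch.tensor([0])                # ground-truth class index
>>> nn.CrossEntropyLoss()(logits, label)     # training consumes the logits
tensor(0.0486)
>>> nn.Softmax(dim=1)(logits)                # classification uses the probabilities
tensor([[0.9526, 0.0474]])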

Initialization

Finally, let’s talk about initializing our network’s parameters. In order to get well-behaved performance out of our model, the network’s weights, biases, and other parameters need to exhibit certain properties. Let’s imagine a degenerate case, where all of the network’s weights are greater than 1 (and we do not have residual connections). In that case, repeated multiplication by those weights would result in layer outputs that became very large as data flowed through the layers of the network. Similarly, weights less than 1 would cause all layer outputs to become smaller and vanish. Similar considerations apply to the gradients in the backward pass.

Many normalization techniques can be used to keep layer outputs well behaved, but one of the simplest is to just make sure the network’s weights are initialized such that intermediate values and gradients become neither unreasonably small nor unreasonably large. As we discussed in chapter 8, PyTorch does not help us as much as it should here, so we need to do some initialization ourselves. We can treat the following _init_weights function as boilerplate, as the exact details aren’t particularly important.

Listing 11.9 model.py:30, LunaModel._init_weights

def _init_weights(self):
  for m in self.modules():
    if type(m) in {
      nn.Linear,
      nn.Conv3d,
    }:
      nn.init.kaiming_normal_(
        m.weight.data, a=0, mode='fan_out', nonlinearity='relu',
      )
      if m.bias is not None:
        fan_in, fan_out = \
          nn.init._calculate_fan_in_and_fan_out(m.weight.data)
        bound = 1 / math.sqrt(fan_out)
        nn.init.normal_(m.bias, -bound, bound)

11.5 Training and validating the model

Now it’s time to take the various pieces we’ve been working with and assemble them into something we can actually execute. This training loop should be familiar--we saw loops like figure 11.7 in chapter 5.

Figure 11.7 The training and validation script we will implement in this chapter, with a focus on the nested loops over each epoch and batches in the epoch

The code is relatively compact (the doTraining function is only 12 statements; it’s longer here due to line-length limitations).

Listing 11.10 training.py:137, LunaTrainingApp.main

def main(self):
  # ... line 143
  for epoch_ndx in range(1, self.cli_args.epochs + 1):
    trnMetrics_t = self.doTraining(epoch_ndx, train_dl)
    self.logMetrics(epoch_ndx, 'trn', trnMetrics_t)
 
# ... line 165
def doTraining(self, epoch_ndx, train_dl):
  self.model.train()
  trnMetrics_g = torch.zeros(                 
    METRICS_SIZE,
    len(train_dl.dataset),
    device=self.device,
  )
 
  batch_iter = enumerateWithEstimate(         
    train_dl,
    "E{} Training".format(epoch_ndx),
    start_ndx=train_dl.num_workers,
  )
  for batch_ndx, batch_tup in batch_iter:
    self.optimizer.zero_grad()                

    loss_var = self.computeBatchLoss(         
      batch_ndx,
      batch_tup,
      train_dl.batch_size,
      trnMetrics_g
    )

    loss_var.backward()                       
    self.optimizer.step()                     

  self.totalTrainingSamples_count += len(train_dl.dataset)

  return trnMetrics_g.to('cpu')

Initializes an empty metrics array

Sets up our batch looping with time estimate

Frees any leftover gradient tensors

We’ll discuss this method in detail in the next section.

Actually updates the model weights

The main differences that we see from the training loops in earlier chapters are as follows:

  • The trnMetrics_g tensor collects detailed per-class metrics during training. For larger projects like ours, this kind of insight can be very nice to have.

  • We don’t directly iterate over the train_dl data loader. We use enumerateWithEstimate to provide an estimated time of completion. This isn’t crucial; it’s just a stylistic choice.

  • The actual loss computation is pushed into the computeBatchLoss method. Again, this isn’t strictly necessary, but code reuse is typically a plus.

We’ll discuss why we’ve wrapped enumerate with additional functionality in section 11.7.2; for now, assume it’s the same as enumerate(train_dl).

The purpose of the trnMetrics_g tensor is to transport information about how the model is behaving on a per-sample basis from the computeBatchLoss function to the logMetrics function. Let’s take a look at computeBatchLoss next. We’ll cover logMetrics after we’re done with the rest of the main training loop.

11.5.1 The computeBatchLoss function

The computeBatchLoss function is called by both the training and validation loops. As the name suggests, it computes the loss over a batch of samples. In addition, the function also computes and records per-sample information about the output the model is producing. This lets us compute things like the percentage of correct answers per class, which allows us to home in on areas where our model is having difficulty.

Of course, the function’s core functionality is around feeding the batch into the model and computing the per-batch loss. We’re using CrossEntropyLoss (https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss), just like in chapter 7. Unpacking the batch tuple, moving the tensors to the GPU, and invoking the model should all feel familiar after that earlier training work.

Listing 11.11 training.py:225, .computeBatchLoss

def computeBatchLoss(self, batch_ndx, batch_tup, batch_size, metrics_g):
  input_t, label_t, _series_list, _center_list = batch_tup

  input_g = input_t.to(self.device, non_blocking=True)
  label_g = label_t.to(self.device, non_blocking=True)

  logits_g, probability_g = self.model(input_g)

  loss_func = nn.CrossEntropyLoss(reduction='none')   
  loss_g = loss_func(
    logits_g,
    label_g[:,1],                                     
  )
  # ... line 238
  return loss_g.mean()                                

reduction='none' gives the loss per sample.

Index of the one-hot-encoded class

Recombines the loss per sample into a single value

Here we are not using the default behavior to get a loss value averaged over the batch. Instead, we get a tensor of loss values, one per sample. This lets us track the individual losses, which means we can aggregate them as we wish (per class, for example). We’ll see that in action in just a moment. For now, we’ll return the mean of those per-sample losses, which is equivalent to the batch loss. In situations where you don’t want to keep statistics per sample, using the loss averaged over the batch is perfectly fine. Whether that’s the case is highly dependent on your project and goals.
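
To see the relationship between the two reductions on a toy batch (the logits and labels here are invented purely for illustration):

>>> import torch
>>> import torch.nn as nn
>>> logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])
>>> labels = torch.tensor([0, 1])
>>> per_sample = nn.CrossEntropyLoss(reduction='none')(logits, labels)
>>> per_sample                               # one loss value per sample
tensor([0.0486, 0.3133])
>>> per_sample.mean()                        # matches the default 'mean' reduction
tensor(0.1809)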

Once that’s done, we’ve fulfilled our obligations to the calling function in terms of what’s required to do backpropagation and weight updates. Before we do that, however, we also want to record our per-sample stats for posterity (and later analysis). We’ll use the metrics_g parameter passed in to accomplish this.

Listing 11.12 training.py:26

METRICS_LABEL_NDX=0                                       
METRICS_PRED_NDX=1
METRICS_LOSS_NDX=2
METRICS_SIZE = 3

  # ... line 225
  def computeBatchLoss(self, batch_ndx, batch_tup, batch_size, metrics_g):
    # ... line 238
    start_ndx = batch_ndx * batch_size
    end_ndx = start_ndx + label_t.size(0)

    metrics_g[METRICS_LABEL_NDX, start_ndx:end_ndx] = \
      label_g[:,1].detach()
    metrics_g[METRICS_PRED_NDX, start_ndx:end_ndx] = \
      probability_g[:,1].detach()
    metrics_g[METRICS_LOSS_NDX, start_ndx:end_ndx] = \
      loss_g.detach()

    return loss_g.mean()                                  

These named array indexes are declared at module-level scope

We use detach since none of our metrics need to hold on to gradients.

Again, this is the loss over the entire batch.

By recording the label, prediction, and loss for each and every training (and later, validation) sample, we have a wealth of detailed information we can use to investigate the behavior of our model. For now, we’re going to focus on compiling per-class statistics, but we could easily use this information to find the sample that is classified the most wrongly and start to investigate why. Again, for some projects, this kind of information will be less interesting, but it’s good to remember that you have these kinds of options available.

11.5.2 The validation loop is similar

The validation loop in figure 11.8 looks very similar to training but is somewhat simplified. The key difference is that validation is read-only. Specifically, the loss value returned is not used, and the weights are not updated.

Figure 11.8 The training and validation script we will implement in this chapter, with a focus on the per-epoch validation loop

Nothing about the model should have changed between the start and end of the function call. In addition, it’s quite a bit faster due to the with torch.no_grad() context manager explicitly informing PyTorch that no gradients need to be computed.

Listing 11.13 training.py:137, LunaTrainingApp.main

def main(self):
  for epoch_ndx in range(1, self.cli_args.epochs + 1):
    # ... line 157
    valMetrics_t = self.doValidation(epoch_ndx, val_dl)
    self.logMetrics(epoch_ndx, 'val', valMetrics_t)

# ... line 203
def doValidation(self, epoch_ndx, val_dl):
  with torch.no_grad():
    self.model.eval()                  
    valMetrics_g = torch.zeros(
      METRICS_SIZE,
      len(val_dl.dataset),
      device=self.device,
    )

    batch_iter = enumerateWithEstimate(
      val_dl,
      "E{} Validation ".format(epoch_ndx),
      start_ndx=val_dl.num_workers,
    )
    for batch_ndx, batch_tup in batch_iter:
      self.computeBatchLoss(
        batch_ndx, batch_tup, val_dl.batch_size, valMetrics_g)
 
  return valMetrics_g.to('cpu')

Turns off training-time behavior

Without needing to update network weights (recall that doing so would violate the entire premise of the validation set; something we never want to do!), we don’t need to use the loss returned from computeBatchLoss, nor do we need to reference the optimizer. All that’s left inside the loop is the call to computeBatchLoss. Note that we are still collecting metrics in valMetrics_g as a side effect of the call, even though we aren’t using the overall per-batch loss returned by computeBatchLoss for anything.

11.6 Outputting performance metrics

The last thing we do per epoch is log our performance metrics for this epoch. As shown in figure 11.9, once we’ve logged metrics, we return to the training loop for the next epoch of training. Logging results and progress as we go is important, since if training goes off the rails (“does not converge” in the parlance of deep learning), we want to notice this is happening and stop spending time training a model that’s not working out. In less catastrophic cases, it’s good to be able to keep an eye on how your model behaves.

Figure 11.9 The training and validation script we will implement in this chapter, with a focus on the metrics logging at the end of each epoch

Earlier, we were collecting results in trnMetrics_g and valMetrics_g for logging progress per epoch. Each of these two tensors now contains everything we need to compute our percent correct and average loss per class for our training and validation runs. Doing this per epoch is a common choice, though somewhat arbitrary. In future chapters, we’ll see how to manipulate the size of our epochs such that we get feedback about training progress at a reasonable rate.

11.6.1 The logMetrics function

Let’s talk about the high-level structure of the logMetrics function. The signature looks like this.

Listing 11.14 training.py:251, LunaTrainingApp.logMetrics

def logMetrics(
    self,
    epoch_ndx,
    mode_str,
    metrics_t,
    classificationThreshold=0.5,
):

We use epoch_ndx purely for display while logging our results. The mode_str argument tells us whether the metrics are for training or validation.

We consume either trnMetrics_t or valMetrics_t, which is passed in as the metrics_t parameter. Recall that both of those inputs are tensors of floating-point values that we filled with data during computeBatchLoss and then transferred back to the CPU right before we returned them from doTraining and doValidation. Both tensors have three rows and as many columns as we have samples (training samples or validation samples, depending). As a reminder, those three rows correspond to the following constants.

Listing 11.15 training.py:26

METRICS_LABEL_NDX=0     
METRICS_PRED_NDX=1
METRICS_LOSS_NDX=2
METRICS_SIZE = 3 

These are declared at module-level scope.

Tensor masking and Boolean indexing

Masked tensors are a common usage pattern that might be opaque if you have not encountered them before. You may be familiar with the NumPy concept called masked arrays; tensor and array masks behave the same way.

If you aren’t familiar with masked arrays, an excellent page in the NumPy documentation (http://mng.bz/XPra) describes the behavior well. PyTorch purposely uses the same syntax and semantics as NumPy.
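
If you haven’t used Boolean indexing before, here is a tiny standalone example of the pattern we’re about to rely on:

>>> import torch
>>> preds = torch.tensor([0.2, 0.9, 0.4, 0.7])
>>> pos_mask = preds > 0.5       # elementwise comparison yields a Boolean tensor
>>> pos_mask
tensor([False,  True, False,  True])
>>> preds[pos_mask]              # indexing with the mask keeps only the masked-in values
tensor([0.9000, 0.7000])
>>> preds[pos_mask].mean()
tensor(0.8000)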

Constructing masks

Next, we’re going to construct masks that will let us limit our metrics to only the nodule or non-nodule (aka positive or negative) samples. We will also count the total samples per class, as well as the number of samples we classified correctly.

Listing 11.16 training.py:264, LunaTrainingApp.logMetrics

negLabel_mask = metrics_t[METRICS_LABEL_NDX] <= classificationThreshold
negPred_mask = metrics_t[METRICS_PRED_NDX] <= classificationThreshold
 
posLabel_mask = ~negLabel_mask
posPred_mask = ~negPred_mask

While we don’t assert it here, we know that all of the values stored in metrics_t[METRICS_LABEL_NDX] belong to the set {0.0, 1.0} since we know that our nodule status labels are simply True or False. By comparing to classificationThreshold, which defaults to 0.5, we get an array of binary values where a True value corresponds to a non-nodule (aka negative) label for the sample in question.

We do a similar comparison to create the negPred_mask, but we must remember that the METRICS_PRED_NDX values are the positive predictions produced by our model and can be any floating-point value between 0.0 and 1.0, inclusive. That doesn’t change our comparison, but it does mean the actual value can be close to 0.5. The positive masks are simply the inverse of the negative masks.

Note While other projects can utilize similar approaches, it’s important to realize that we’re taking some shortcuts that are allowed because this is a binary classification problem. If your next project has more than two classes or has samples that belong to multiple classes at the same time, you’ll have to use more complicated logic to build similar masks.

Next, we use those masks to compute some per-label statistics and store them in a dictionary, metrics_dict.

Listing 11.17 training.py:270, LunaTrainingApp.logMetrics

neg_count = int(negLabel_mask.sum())                            
pos_count = int(posLabel_mask.sum())
 
neg_correct = int((negLabel_mask & negPred_mask).sum())
pos_correct = int((posLabel_mask & posPred_mask).sum())
 
metrics_dict = {}
metrics_dict['loss/all'] = \
  metrics_t[METRICS_LOSS_NDX].mean()
metrics_dict['loss/neg'] = \
  metrics_t[METRICS_LOSS_NDX, negLabel_mask].mean()
metrics_dict['loss/pos'] = \
  metrics_t[METRICS_LOSS_NDX, posLabel_mask].mean()
 
metrics_dict['correct/all'] = (pos_correct + neg_correct) \
  / np.float32(metrics_t.shape[1]) * 100
metrics_dict['correct/neg'] = neg_correct / np.float32(neg_count) * 100
metrics_dict['correct/pos'] = pos_correct / np.float32(pos_count) * 100

Converts to a normal Python integer

Avoids integer division by converting to np.float32

First we compute the average loss over the entire epoch. Since the loss is the single metric that is being minimized during training, we always want to be able to keep track of it. Then we limit the loss averaging to only those samples with a negative label using the negLabel_mask we just made. We do the same with the positive loss. Computing a per-class loss like this can be useful if one class is persistently harder to classify than another, since that knowledge can help drive investigation and improvements.

We’ll close out the calculations with determining the fraction of samples we classified correctly, as well as the fraction correct from each label. Since we will display these numbers as percentages in a moment, we also multiply the values by 100. Similar to the loss, we can use these numbers to help guide our efforts when making improvements. After the calculations, we then log our results with three calls to log.info.

Listing 11.18 training.py:289, LunaTrainingApp.logMetrics

log.info(
  ("E{} {:8} {loss/all:.4f} loss, "
     + "{correct/all:-5.1f}% correct, "
  ).format(
    epoch_ndx,
    mode_str,
    **metrics_dict,
  )
)
log.info(
  ("E{} {:8} {loss/neg:.4f} loss, "
     + "{correct/neg:-5.1f}% correct ({neg_correct:} of {neg_count:})"
  ).format(
    epoch_ndx,
    mode_str + '_neg',
    neg_correct=neg_correct,
    neg_count=neg_count,
    **metrics_dict,
  )
)
log.info(            
  # ... line 319
)

The ‘pos’ logging is similar to the ‘neg’ logging earlier.

The first log has values computed from all of our samples and is tagged /all, while the negative (non-nodule) and positive (nodule) values are tagged /neg and /pos, respectively. We don’t show the third logging statement for positive values here; it’s identical to the second except for swapping neg for pos in all cases.

11.7 Running the training script

Now that we’ve completed the core of the training.py script, we’ll actually start running it. This will initialize and train our model and print statistics about how well the training is going. The idea is to get this kicked off to run in the background while we’re covering the model implementation in detail. Hopefully we’ll have results to look at once we’re done.

We’re running this script from the main code directory; it should have subdirectories called p2ch11, util, and so on. The Python environment used should have all the libraries listed in requirements.txt installed. Once those libraries are ready, we can run:

$ python -m p2ch11.training        
Starting LunaTrainingApp,
    Namespace(batch_size=256, channels=8, epochs=20, layers=3, num_workers=8)
<p2ch11.dsets.LunaDataset object at 0x7fa53a128710>: 495958 training samples
<p2ch11.dsets.LunaDataset object at 0x7fa537325198>: 55107 validation samples
Epoch 1 of 20, 1938/216 batches of size 256
E1 Training ----/1938, starting
E1 Training   16/1938, done at 2018-02-28 20:52:54, 0:02:57
...

This is the command line for Linux/Bash. Windows users will probably need to invoke Python differently, depending on the install method used.

As a reminder, we also provide a Jupyter Notebook that contains invocations of the training application.

Listing 11.19 code/p2_run_everything.ipynb

# In[5]:
run('p2ch11.prepcache.LunaPrepCacheApp')
 
# In[6]:
run('p2ch11.training.LunaTrainingApp', '--epochs=1')

If the first epoch seems to be taking a very long time (more than 10 or 20 minutes), it might be related to needing to prepare the cached data required by LunaDataset. See section 10.5.1 for details about the caching. The exercises for chapter 10 included writing a script to pre-stuff the cache in an efficient manner. We also provide the prepcache.py file to do the same thing; it can be invoked with python -m p2ch11.prepcache. Since we repeat our dsets.py files per chapter, the caching will need to be repeated for every chapter. This is somewhat space and time inefficient, but it means we can keep the code for each chapter much more well contained. For your future projects, we recommend reusing your cache more heavily.

Once training is underway, we want to make sure we’re using the computing resources at hand the way we expect. An easy way to tell if the bottleneck is data loading or computation is to wait a few moments after the script starts to train (look for output like E1 Training 16/1938, done at...) and then check both top and nvidia-smi:

  • If the eight Python worker processes are consuming >80% CPU, then the cache probably needs to be prepared (we know this here because the authors have made sure there aren’t CPU bottlenecks in this project’s implementation; this won’t be generally true).

  • If nvidia-smi reports that GPU-Util is >80%, then you’re saturating your GPU. We’ll discuss some strategies for efficient waiting in section 11.7.2.

The intent is that the GPU is saturated; we want to use as much of that computing power as we can to complete epochs quickly. A single NVIDIA GTX 1080 Ti should complete an epoch in under 15 minutes. Since our model is relatively simple, it doesn’t take a lot of CPU preprocessing for the CPU to be the bottleneck. When working with models with greater depth (or more calculations needed in general), processing each batch will take longer on the GPU, which gives the CPU more time to prepare the next batch of input before the GPU runs out of work.

11.7.1 Needed data for training

If the number of samples is less than 495,958 for training or 55,107 for validation, it might make sense to do some sanity checking to be sure the full data is present and accounted for. For your future projects, make sure your dataset returns the number of samples that you expect.

First, let’s take a look at the basic directory structure of our data-unversioned/ part2/luna directory:

$ ls -1p data-unversioned/part2/luna/
subset0/
subset1/
...
subset9/

Next, let’s make sure we have one .mhd file and one .raw file for each series UID:

$ ls -1p data-unversioned/part2/luna/subset0/
1.3.6.1.4.1.14519.5.2.1.6279.6001.105756658031515062000744821260.mhd
1.3.6.1.4.1.14519.5.2.1.6279.6001.105756658031515062000744821260.raw
1.3.6.1.4.1.14519.5.2.1.6279.6001.108197895896446896160048741492.mhd
1.3.6.1.4.1.14519.5.2.1.6279.6001.108197895896446896160048741492.raw
...

and that we have the overall correct number of files:

$ ls -1 data-unversioned/part2/luna/subset?/* | wc -l
1776
$ ls -1 data-unversioned/part2/luna/subset0/* | wc -l
178
...
$ ls -1 data-unversioned/part2/luna/subset9/* | wc -l
176

If all of these seem right but things still aren’t working, ask on Manning LiveBook (https://livebook.manning.com/book/deep-learning-with-pytorch/chapter-11) and hopefully someone can help get things sorted out.

11.7.2 Interlude: The enumerateWithEstimate function

Working with deep learning involves a lot of waiting. We’re talking about real-world, sitting around, glancing at the clock on the wall, a watched pot never boils (but you could fry an egg on the GPU), straight up boredom.

The only thing worse than sitting and staring at a blinking cursor that hasn’t moved for over an hour is flooding your screen with this:

2020-01-01 10:00:00,056 INFO training batch 1234
2020-01-01 10:00:00,067 INFO training batch 1235
2020-01-01 10:00:00,077 INFO training batch 1236
2020-01-01 10:00:00,087 INFO training batch 1237
...etc...

At least the quietly blinking cursor doesn’t blow out your scrollback buffer!

Fundamentally, while doing all this waiting, we want to answer the question “Do I have time to go refill my water glass?” along with follow-up questions about having time to

  • Brew a cup of coffee
  • Grab dinner
  • Grab dinner in Paris5

To answer these pressing questions, we’re going to use our enumerateWithEstimate function. Usage looks like the following:

>>> for i, _ in enumerateWithEstimate(list(range(234)), "sleeping"):
...   time.sleep(random.random())
...
11:12:41,892 WARNING sleeping ----/234, starting
11:12:44,542 WARNING sleeping    4/234, done at 2020-01-01 11:15:16, 0:02:35
11:12:46,599 WARNING sleeping    8/234, done at 2020-01-01 11:14:59, 0:02:17
11:12:49,534 WARNING sleeping   16/234, done at 2020-01-01 11:14:33, 0:01:51
11:12:58,219 WARNING sleeping   32/234, done at 2020-01-01 11:14:41, 0:01:59
11:13:15,216 WARNING sleeping   64/234, done at 2020-01-01 11:14:43, 0:02:01
11:13:44,233 WARNING sleeping  128/234, done at 2020-01-01 11:14:35, 0:01:53
11:14:40,083 WARNING sleeping ----/234, done at 2020-01-01 11:14:40
>>>

That’s 8 lines of output for over 200 iterations lasting about 2 minutes. Even given the wide variance of random.random(), the function had a pretty decent estimate after 16 iterations (in less than 10 seconds). For loop bodies with more constant timing, the estimates stabilize even more quickly.

In terms of behavior, enumerateWithEstimate is almost identical to the standard enumerate (the differences are things like the fact that our function returns a generator, whereas enumerate returns a specialized <enumerate object at 0x...>).

Listing 11.20 util.py:143, def enumerateWithEstimate

def enumerateWithEstimate(
    iter,
    desc_str,
    start_ndx=0,
    print_ndx=4,
    backoff=None,
    iter_len=None,
):
  for (current_ndx, item) in enumerate(iter):
    yield (current_ndx, item)

However, the side effects (logging, specifically) are what make the function interesting. Rather than get lost in the weeds trying to cover every detail of the implementation, if you’re interested, you can consult the function docstring (https://github.com/deep-learning-with-pytorch/dlwpt-code/blob/master/util/util.py#L143) to get information about the function parameters and desk-check the implementation.

Deep learning projects can be very time intensive. Knowing when something is expected to finish means you can use the time until then wisely, and it can also clue you in that something isn't working properly (or that an approach is unworkable) if the estimated time to completion is much larger than you anticipated.

11.8 Evaluating the model: Getting 99.7% correct means we’re done, right?

Let’s take a look at some (abridged) output from our training script. As a reminder, we’ve run this with the command line python -m p2ch11.training:

E1 Training ----/969, starting
...
E1 LunaTrainingApp
E1 trn      2.4576 loss,  99.7% correct
...
E1 val      0.0172 loss,  99.8% correct
...

After one epoch of training, both the training and validation set show at least 99.7% correct results. That’s an A+! Time for a round of high-fives, or at least a satisfied nod and smile. We just solved cancer! ... Right?

Well, no.

Let’s take a closer (less-abridged) look at that epoch 1 output:

E1 LunaTrainingApp
E1 trn      2.4576 loss,  99.7% correct,
E1 trn_neg  0.1936 loss,  99.9% correct (494289 of 494743)
E1 trn_pos  924.34 loss,   0.2% correct (3 of 1215)
...
E1 val      0.0172 loss,  99.8% correct,
E1 val_neg  0.0025 loss, 100.0% correct (494743 of 494743)
E1 val_pos  5.9768 loss,   0.0% correct (0 of 1215)

On the validation set, we’re getting non-nodules 100% correct, but the actual nodules are 100% wrong. The network is just classifying everything as not-a-nodule! The 99.7% figure simply reflects that only about 0.3% of the samples are nodules.
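
That arithmetic is easy to check. Using the per-class counts from the epoch 1 validation output above, a model that answers "non-nodule" for every single sample lands at almost exactly the accuracy we're seeing:

neg = 494743   # non-nodule samples in the validation set (from the output above)
pos = 1215     # actual nodule samples
print(neg / (neg + pos))   # 0.99755..., i.e. the "99.8% correct" reported for val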

After 10 epochs, the situation is only marginally better:

E10 LunaTrainingApp
E10 trn      0.0024 loss,  99.8% correct
E10 trn_neg  0.0000 loss, 100.0% correct
E10 trn_pos  0.9915 loss,   0.0% correct
E10 val      0.0025 loss,  99.7% correct
E10 val_neg  0.0000 loss, 100.0% correct
E10 val_pos  0.9929 loss,   0.0% correct

The classification output remains the same--none of the nodule (aka positive) samples are correctly identified. It’s interesting that we’re starting to see some decrease in the val_pos loss, however, while not seeing a corresponding increase in the val_neg loss. This implies that the network is learning something. Unfortunately, it’s learning very, very slowly.

Even worse, this particular failure mode is the most dangerous one in the real world! We want to avoid classifying a tumor as an innocuous structure, because doing so could prevent a patient from getting the evaluation and eventual treatment they might need. It’s important to understand the consequences of misclassification for all your projects, as that understanding can have a large impact on how you design, train, and evaluate your model. We’ll discuss this more in the next chapter.

Before we get to that, however, we need to upgrade our tooling to make the results easier to understand. We’re sure you love to squint at columns of numbers as much as anyone, but pictures are worth a thousand words. Let’s graph some of these metrics.

11.9 Graphing training metrics with TensorBoard

We’re going to use a tool called TensorBoard as a quick and easy way to get our training metrics out of our training loop and into some pretty graphs. This will allow us to follow the trends of those metrics, rather than only look at the instantaneous values per epoch. It gets much, much easier to know whether a value is an outlier or just the latest in a trend when you’re looking at a visual representation.

“Hey, wait,” you might be thinking, “isn’t TensorBoard part of the TensorFlow project? What’s it doing here in my PyTorch book?”

Well, yes, it is part of another deep learning framework, but our philosophy is “use what works.” There’s no reason to restrict ourselves by not using a tool just because it’s bundled with another project we’re not using. Both the PyTorch and TensorBoard devs agree, because they collaborated to add official support for TensorBoard into PyTorch. TensorBoard is great, and it’s got some easy-to-use PyTorch APIs that let us hook data from just about anywhere into it for quick and easy display. If you stick with deep learning, you’ll probably be seeing (and using) a lot of TensorBoard.

In fact, if you’ve been running the chapter examples, you should already have some data on disk ready and waiting to be displayed. Let’s see how to run TensorBoard, and look at what it can show us.

11.9.1 Running TensorBoard

By default, our training script will write metrics data to the runs/ subdirectory. If you list the directory content, you might see something like this during your Bash shell session:

$ ls -lA runs/p2ch11/
total 24
drwxrwxr-x 2 elis elis 4096 Sep 15 13:22 2020-01-01_12.55.27-trn-dlwpt/  
drwxrwxr-x 2 elis elis 4096 Sep 15 13:22 2020-01-01_12.55.27-val-dlwpt/  
drwxrwxr-x 2 elis elis 4096 Sep 15 15:14 2020-01-01_13.31.23-trn-dlwpt/  
drwxrwxr-x 2 elis elis 4096 Sep 15 15:14 2020-01-01_13.31.23-val-dlwpt/  

The single-epoch run from earlier

The more recent 10-epoch training run

To get the tensorboard program, install the tensorflow (https://pypi.org/project/tensorflow) Python package. Since we’re not actually going to use TensorFlow proper, it’s fine if you install the default CPU-only package. If you have another version of TensorBoard installed already, using that is fine too. Either make sure the appropriate directory is on your path, or invoke it with ../path/to/tensorboard --logdir runs/. It doesn’t really matter where you invoke it from, as long as you use the --logdir argument to point it at where your data is stored. It’s a good idea to segregate your data into separate folders, as TensorBoard can get a bit unwieldy once you get over 10 or 20 experiments. You’ll have to decide the best way to do that for each project as you go. Don’t be afraid to move data around after the fact if you need to.

Let’s start TensorBoard now:

$ tensorboard --logdir runs/
2020-01-01 12:13:16.163044: I tensorflow/core/platform/cpu_feature_guard.cc:140]
    Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
TensorBoard 1.14.0 at http://localhost:6006/ (Press CTRL+C to quit)

These messages might be different or not present for you; that’s fine.

Once that’s done, you should be able to point your browser at http://localhost:6006 and see the main dashboard.6 Figure 11.10 shows us what that looks like.

Figure 11.10 The main TensorBoard UI, showing a paired set of training and validation runs

Along the top of the browser window, you should see the orange header. The right side of the header has the typical widgets for settings, a link to the GitHub repository, and the like. We can ignore those for now. The left side of the header has items for the data types we’ve provided. You should have at least the following:

  • Scalars (the default tab)

  • Histograms

  • Precision-Recall Curves (shown as PR Curves)

You might see Distributions as well as the second UI tab (to the right of Scalars in figure 11.10). We won’t use or discuss those here. Make sure you’ve selected Scalars by clicking it.

On the left is a set of controls for display options, as well as a list of runs that are present. The smoothing option can be useful if you have particularly noisy data; it will calm things down so that you can pick out the overall trend. The original non-smoothed data will still be visible in the background as a faded line in the same color. Figure 11.11 shows this, although it might be difficult to discern when printed in black and white.

Figure 11.11 The TensorBoard sidebar with Smoothing set to 0.6 and two runs selected for display
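
In case you're curious, the smoothing slider behaves roughly like an exponential moving average over the logged points: each displayed value is a weighted blend of the previous smoothed value and the new raw value, with the slider setting the weight. The following is only our own minimal sketch of that idea, not TensorBoard's actual code (which, among other things, handles the first few points differently):

def smooth(values, weight=0.6):
    # Exponential moving average, roughly what the Smoothing slider does.
    smoothed = []
    last = values[0]
    for v in values:
        last = last * weight + (1 - weight) * v
        smoothed.append(last)
    return smoothed

print(smooth([1.0, 0.2, 0.9, 0.1, 0.8]))  # The noisy series flattens toward its trend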

Depending on how many times you’ve run the training script, you might have multiple runs to select from. With too many runs being rendered, the graphs can get overly noisy, so don’t hesitate to deselect runs that aren’t of interest at the moment.

If you want to permanently remove a run, the data can be deleted from disk while TensorBoard is running. You can do this to get rid of experiments that crashed, had bugs, didn’t converge, or are so old they’re no longer interesting. The number of runs can grow pretty quickly, so it can be helpful to prune it often and to rename runs or move runs that are particularly interesting to a more permanent directory so they don’t get deleted by accident. To remove both the train and validation runs, execute the following (after changing the chapter, date, and time to match the run you want to remove):

$ rm -rf runs/p2ch11/2020-01-01_12.02.15_*

Keep in mind that removing runs will cause the runs that are later in the list to move up, which will result in them being assigned new colors.

OK, let’s get to the point of TensorBoard: the pretty graphs! The main part of the screen should be filled with data from gathering training and validation metrics, as shown in figure 11.12.

Figure 11.12 The main TensorBoard data display area showing us that our results on actual nodules are downright awful

That’s much easier to parse and absorb than E1 trn_pos 924.34 loss, 0.2% correct (3 of 1215)! Although we’re going to save discussion of what these graphs are telling us for section 11.10, now would be a good time to make sure it’s clear what these numbers correspond to from our training program. Take a moment to cross-reference the numbers you get by mousing over the lines with the numbers spit out by training.py during the same training run. You should see a direct correspondence between the Value column of the tooltip and the values printed during training. Once you’re comfortable and confident that you understand exactly what TensorBoard is showing you, let’s move on and discuss how to get these numbers to appear in the first place.

11.9.2 Adding TensorBoard support to the metrics logging function

We are going to use the torch.utils.tensorboard module to write data in a format that TensorBoard will consume. This will allow us to write metrics for this and any other project quickly and easily. TensorBoard supports a mix of NumPy arrays and PyTorch tensors, but since we don’t have any reason to put our data into NumPy arrays, we’ll use PyTorch tensors exclusively.

The first thing we need to do is create our SummaryWriter objects (which we imported from torch.utils.tensorboard). The only parameter we’re going to pass in is log_dir, which we will initialize to something like runs/p2ch11/2020-01-01_12.55.27-trn-dlwpt. We can add a comment argument to our training script to change dlwpt to something more informative; use python -m p2ch11.training --help for more information.

We create two writers, one each for the training and validation runs. Those writers will be reused for every epoch. When the SummaryWriter class gets initialized, it also creates the log_dir directories as a side effect. These directories show up in TensorBoard and can clutter the UI with empty runs if the training script crashes before any data gets written, which can be common when you’re experimenting with something. To avoid writing too many empty junk runs, we wait to instantiate the SummaryWriter objects until we’re ready to write data for the first time. This function is called from logMetrics().

Listing 11.21 training.py:127, .initTensorboardWriters

def initTensorboardWriters(self):
  if self.trn_writer is None:
    log_dir = os.path.join('runs', self.cli_args.tb_prefix, self.time_str)
 
    self.trn_writer = SummaryWriter(
      log_dir=log_dir + '-trn_cls-' + self.cli_args.comment)
    self.val_writer = SummaryWriter(
      log_dir=log_dir + '-val_cls-' + self.cli_args.comment)

If you recall, the first epoch is kind of a mess, with the early output in the training loop being essentially random. When we save the metrics from that first batch, those random results end up skewing things a bit. Recall from figure 11.11 that TensorBoard has smoothing to remove noise from the trend lines, which helps somewhat.

Another approach could be to skip metrics entirely for the first epoch’s training data, although our model trains quickly enough that it’s still useful to see the first epoch’s results. Feel free to change this behavior as you see fit; the rest of part 2 will continue with this pattern of including the first, noisy training epoch.

tip If you end up doing a lot of experiments that result in exceptions or killing the training script relatively quickly, you might be left with a number of junk runs cluttering up your runs/ directory. Don’t be afraid to clean those out!

Writing scalars to TensorBoard

Writing scalars is straightforward. We can take the metrics_dict we’ve already constructed and pass in each key/value pair to the writer.add_scalar method. The torch.utils.tensorboard.SummaryWriter class has the add_scalar method (http://mng.bz/RAqj) with the following signature.

Listing 11.22 PyTorch torch/utils/tensorboard/writer.py:267

def add_scalar(self, tag, scalar_value, global_step=None, walltime=None):
    # ...

The tag parameter tells TensorBoard which graph we’re adding values to, and the scalar_value parameter is our data point’s Y-axis value. The global_step parameter acts as the X-axis value.

Recall that we updated the totalTrainingSamples_count variable inside the doTraining function. We’ll use totalTrainingSamples_count as the X-axis of our TensorBoard plots by passing it in as the global_step parameter. Here’s what that looks like in our code.

Listing 11.23 training.py:323, LunaTrainingApp.logMetrics

for key, value in metrics_dict.items():
  writer.add_scalar(key, value, self.totalTrainingSamples_count)

Note that the slashes in our key names (such as 'loss/all') result in TensorBoard grouping the charts by the substring before the '/'.
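
As a self-contained illustration of that grouping (the directory name and values here are made up for the example, not output from our training run), the following snippet writes a handful of scalars; in the UI it would produce a "loss" group containing three charts and a separate "correct" group containing one:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/p2ch11/demo-trn')   # Hypothetical demo directory
metrics_dict = {                                          # Illustrative values only
    'loss/all': 0.25,
    'loss/neg': 0.01,
    'loss/pos': 4.2,
    'correct/all': 0.997,
}
for key, value in metrics_dict.items():
    # Everything before the '/' becomes the chart group heading in the TensorBoard UI.
    writer.add_scalar(key, value, global_step=512)
writer.close()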

The documentation suggests that we should be passing in the epoch number as the global_step parameter, but that results in some complications. By using the number of training samples presented to the network, we can do things like change the number of samples per epoch and still be able to compare those future graphs to the ones we’re creating now. Saying that a model trains in half the number of epochs is meaningless if each epoch takes four times as long! Keep in mind that this might not be standard practice, however; expect to see a variety of values used for the global step.

11.10 Why isn’t the model learning to detect nodules?

Our model is clearly learning something--the loss trend lines are consistent as epochs increase, and the results are repeatable. There is a disconnect, however, between what the model is learning and what we want it to learn. What’s going on? Let’s use a quick metaphor to illustrate the problem.

Imagine that a professor gives students a final exam consisting of 100 True/False questions. The students have access to previous versions of this professor’s tests going back 30 years, and every time there are only one or two questions with a True answer. The other 98 or 99 are False, every time.

Assuming that the grades aren’t on a curve and instead have a typical scale of 90% correct or better being an A, and so on, it is trivial to get an A+: just mark every question as False! Let’s imagine that this year, there is only one True answer. A student like the one on the left in figure 11.13 who mindlessly marked every answer as False would get a 99% on the final but wouldn’t really demonstrate that they had learned anything (beyond how to cram from old tests, of course). That’s basically what our model is doing right now.

Figure 11.13 A professor giving two students the same grade, despite different levels of knowledge. Question 9 is the only question with an answer of True.

Contrast that with a student like the one on the right who also got 99% of the questions correct, but did so by answering two questions with True. Intuition tells us that the student on the right in figure 11.13 probably has a much better grasp of the material than the all-False student. Finding the one True question while only getting one answer wrong is pretty difficult! Unfortunately, neither our students’ grades nor our model’s grading scheme reflect this gut feeling.

We have a similar situation, where 99.7% of the answers to “Is this candidate a nodule?” are “Nope.” Our model is taking the easy way out and answering False on every question.

Still, if we look back at our model’s numbers more closely, the loss on the training and validation sets is decreasing! The fact that we’re getting any traction at all on the cancer-detection problem should give us hope. It will be the work of the next chapter to realize this potential. We’ll start chapter 12 by introducing some new, relevant terminology, and then we’ll come up with a better grading scheme that doesn’t lend itself to being gamed quite as easily as what we’ve done so far.

11.11 Conclusion

We’ve come a long way in this chapter--we now have a model and a training loop, and we’re able to consume the data we produced in the last chapter. Our metrics are being logged to the console as well as graphed visually.

While our results aren’t usable yet, we’re actually closer than it might seem. In chapter 12, we will improve the metrics we’re using to track our progress, and use them to inform the changes we need to make to get our model producing reasonable results.

11.12 Exercises

  1. Implement a program that iterates through a LunaDataset instance by wrapping it in a DataLoader instance, while timing how long it takes to do so. Compare these times to the times from the exercises in chapter 10. Be aware of the state of the cache when running the script.

    1. What impact does setting num_workers=... to 0, 1, and 2 have?

    2. What are the highest values your machine will support for a given combination of batch_size=... and num_workers=... without running out of memory?

  2. Reverse the sort order of noduleInfo_list. How does that change the behavior of the model after one epoch of training?

  3. Change logMetrics to alter the naming scheme of the runs and keys that are used in TensorBoard.

    1. Experiment with different forward-slash placement for keys passed in to writer.add_scalar.

    2. Have both training and validation runs use the same writer, and add the trn or val string to the name of the key.

    3. Customize the naming of the log directory and keys to suit your taste.

11.13 Summary

  • Data loaders can be used to load data from arbitrary datasets in multiple processes. This allows otherwise-idle CPU resources to be devoted to preparing data to feed to the GPU.

  • Data loaders load multiple samples from a dataset and collate them into a batch. PyTorch models expect to process batches of data, not individual samples.

  • Data loaders can be used to manipulate arbitrary datasets by changing the relative frequency of individual samples. This allows for “after-market” tweaks to a dataset, though it might make more sense to change the dataset implementation directly.

  • We will use PyTorch’s torch.optim.SGD (stochastic gradient descent) optimizer with a learning rate of 0.001 and a momentum of 0.99 for the majority of part 2. These values are also reasonable defaults for many deep learning projects.

  • Our initial model for classification will be very similar to the model we used in chapter 8. This lets us get started with a model that we have reason to believe will be effective. We can revisit the model design if we think it’s the thing preventing our project from performing better.

  • The choice of metrics that we monitor during training is important. It is easy to accidentally pick metrics that are misleading about how the model is performing. Using the overall percentage of samples classified correctly is not useful for our data. Chapter 12 will detail how to evaluate and choose better metrics.

  • TensorBoard can be used to display a wide range of metrics visually. This makes it much easier to consume certain forms of information (particularly trend data) as they change per epoch of training.


1.Any shell, really, but if you’re using a non-Bash shell, you already knew that.

2.Remember that we’re actually working in 3D, despite the 2D figure.

3.Which is why there’s an exercise to experiment with both in the next chapter!

4.There are numerical stability benefits for doing so. Propagating gradients accurately through an exponential calculated using 32-bit floating-point numbers can be problematic.

5.If getting dinner in France doesn’t involve an airport, feel free to substitute “Paris, Texas” to make the joke work; https://en.wikipedia.org/wiki/Paris_(disambiguation).

6.If you’re running training on a different computer from your browser, you’ll need to replace localhost with the appropriate hostname or IP address.
