DataLoaders to load data

In the previous chapters, we set the stage for our cancer-detection project. We covered medical details of lung cancer, took a look at the main data sources we will use for our project, and transformed our raw CT scans into a PyTorch Dataset instance. Now that we have a dataset, we can easily consume our training data. So let’s do that!
We’re going to do two main things in this chapter. We’ll start by building the nodule classification model and training loop that will be the foundation that the rest of part 2 uses to explore the larger project. To do that, we’ll use the Ct and LunaDataset classes we implemented in chapter 10 to feed DataLoader instances. Those instances, in turn, will feed our classification model with data via training and validation loops.
We’ll finish the chapter by using the results from running that training loop to introduce one of the hardest challenges in this part of the book: how to get high-quality results from messy, limited data. In later chapters, we’ll explore the specific ways in which our data is limited, as well as mitigate those limitations.
Let’s recall our high-level roadmap from chapter 9, shown here in figure 11.1. Right now, we’ll work on producing a model capable of performing step 4: classification. As a reminder, we will classify candidates as nodules or non-nodules (we’ll build another classifier to attempt to tell malignant nodules from benign ones in chapter 14). That means we’re going to assign a single, specific label to each sample that we present to the model. In this case, those labels are “nodule” and “non-nodule,” since each sample represents a single candidate.
Getting an early end-to-end version of a meaningful part of your project is a great milestone to reach. Having something that works well enough for the results to be evaluated analytically lets you move forward with future changes, confident that you are improving your results with each change--or at least that you’re able to set aside any changes and experiments that don’t work out! Expect to do a lot of experimentation when working on your own projects. Getting the best results will usually require considerable tinkering and tweaking.
But before we can get to the experimental phase, we must lay our foundation. Let’s see what our part 2 training loop looks like in figure 11.2: it should seem generally familiar, given that we saw a similar set of core steps in chapter 5. Here we will also use a validation set to evaluate our training progress, as discussed in section 5.5.3.
The basic structure of what we’re going to implement is as follows:
Loop over a semi-arbitrarily chosen number of epochs.
LunaDataset.

As we go through the code for the chapter, keep an eye out for two main differences between the code we’re producing here and what we used for a training loop in part 1. First, we’ll put more structure around our program, since the project as a whole is quite a bit more complicated than what we did in earlier chapters. Without that extra structure, the code can get messy quickly. For this project, our main training application will use a number of well-contained functions, and we will further separate code for things like our dataset into self-contained Python modules.
Make sure that for your own projects, you match the level of structure and design to the complexity level of your project. Too little structure, and it will become difficult to perform experiments cleanly, troubleshoot problems, or even describe what you’re doing! Conversely, too much structure means you’re wasting time writing infrastructure that you don’t need and most likely slowing yourself down by having to conform to it after all that plumbing is in place. Plus it can be tempting to spend time on infrastructure as a procrastination tactic, rather than digging into the hard work of making actual progress on your project. Don’t fall into that trap!
The other big difference between this chapter’s code and part 1 will be a focus on collecting a variety of metrics about how training is progressing. Being able to accurately determine the impact of changes on training is impossible without having good metrics logging. Without spoiling the next chapter, we’ll also see how important it is to collect not just metrics, but the right metrics for the job. We’ll lay the infrastructure for tracking those metrics in this chapter, and we’ll exercise that infrastructure by collecting and displaying the loss and percent of samples correctly classified, both overall and per class. That’s enough to get us started, but we’ll cover a more realistic set of metrics in chapter 12.
One of the big structural differences from earlier training work we’ve done in this book is that part 2 wraps our work in a fully fledged command-line application. It will parse command-line arguments, have a full-featured --help command, and be easy to run in a wide variety of environments. All this will allow us to easily invoke the training routines from both Jupyter and a Bash shell.1
Our application’s functionality will be implemented via a class so that we can instantiate the application and pass it around if we feel the need. This can make testing, debugging, or invocation from other Python programs easier. We can invoke the application without needing to spin up a second OS-level process (we won’t do explicit unit testing in this book, but the structure we create can be helpful for real projects where that kind of testing is appropriate).
One way to take advantage of being able to invoke our training by either function call or OS-level process is to wrap the function invocations into a Jupyter Notebook so the code can easily be called from either the native CLI or the browser.
    # In[2]:
    def run(app, *argv):
        argv = list(argv)
        argv.insert(0, '--num-workers=4')  # ❶
        log.info("Running: {}({!r}).main()".format(app, argv))

        app_cls = importstr(*app.rsplit('.', 1))  # ❷
        app_cls(argv).main()

        log.info("Finished: {}({!r}).main()".format(app, argv))

    # In[6]:
    run('p2ch11.training.LunaTrainingApp', '--epochs=1')
❶ We assume you have a four-core, eight-thread CPU. Change the 4 if needed.
❷ This is a slightly cleaner call to __import__.
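The importstr helper itself is not shown in this section. As a minimal sketch, assuming it takes a dotted module path plus an optional attribute name and resolves them with the standard importlib module (the signature and body here are our assumption, not necessarily what the book's utility code does), it could look like this:

```python
import importlib


def importstr(module_str, from_=None):
    """Import a module by dotted path and optionally return one attribute.

    Hypothetical stand-in for the helper used by run(); details are assumed.
    """
    module = importlib.import_module(module_str)
    if from_ is None:
        return module
    return getattr(module, from_)


# 'collections.OrderedDict' splits into ('collections', 'OrderedDict'),
# mirroring how run() splits 'p2ch11.training.LunaTrainingApp'.
app = 'collections.OrderedDict'
cls = importstr(*app.rsplit('.', 1))
print(cls.__name__)
```

Resolving the class from a string is what lets the same run function launch any application class by name from a notebook cell.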
Note The training here assumes that you’re on a workstation that has a four-core, eight-thread CPU, 16 GB of RAM, and a GPU with 8 GB of RAM. Reduce --batch-size if your GPU has less RAM, and --num-workers if you have fewer CPU cores, or less CPU RAM.
Let’s get some semistandard boilerplate code out of the way. We’ll start at the end of the file with a pretty standard if main stanza that instantiates the application object and invokes the main method.
    if __name__ == '__main__':
        LunaTrainingApp().main()
From there, we can jump back to the top of the file and have a look at the application class and the two functions we just called, __init__ and main. We’ll want to be able to accept command-line arguments, so we’ll use the standard argparse library (https://docs.python.org/3/library/argparse.html) in the application’s __init__ function. Note that we can pass in custom arguments to the initializer, should we wish to do so. The main method will be the primary entry point for the core logic of the application.
    import argparse
    import datetime
    import sys

    class LunaTrainingApp:
        def __init__(self, sys_argv=None):
            if sys_argv is None:  # ❶
                sys_argv = sys.argv[1:]

            parser = argparse.ArgumentParser()
            parser.add_argument('--num-workers',
                help='Number of worker processes for background data loading',
                default=8,
                type=int,
            )
            # ... line 63
            self.cli_args = parser.parse_args(sys_argv)
            self.time_str = datetime.datetime.now().strftime('%Y-%m-%d_%H.%M.%S')  # ❷

        # ... line 137
        def main(self):
            log.info("Starting {}, {}".format(type(self).__name__, self.cli_args))
❶ If the caller doesn’t provide arguments, we get them from the command line.
❷ We’ll use the timestamp to help identify training runs.
This structure is pretty general and could be reused for future projects. In particular, parsing arguments in __init__ allows us to configure the application separately from invoking it.
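To make the configure-then-invoke split concrete, here is a minimal sketch of the same pattern on a toy class. The DemoApp name and its single --epochs option are ours, chosen only for illustration.

```python
import argparse
import sys


class DemoApp:
    """Toy application following the parse-in-__init__, run-in-main pattern."""

    def __init__(self, sys_argv=None):
        if sys_argv is None:
            sys_argv = sys.argv[1:]  # fall back to the real command line
        parser = argparse.ArgumentParser()
        parser.add_argument('--epochs', default=1, type=int,
                            help='Number of epochs to train for')
        self.cli_args = parser.parse_args(sys_argv)

    def main(self):
        return "Would train for {} epoch(s)".format(self.cli_args.epochs)


app = DemoApp(['--epochs=3'])  # configured here...
message = app.main()           # ...and invoked separately, in the same process
print(message)
```

Because the argument list is an ordinary parameter, a notebook or another program can build the application with whatever options it likes without touching sys.argv.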
If you check the code for this chapter on the book’s website or GitHub, you might notice some extra lines mentioning TensorBoard. Ignore those for now; we’ll discuss them in detail later in the chapter, in section 11.9.
Before we can begin iterating over each batch in our epoch, some initialization work needs to happen. After all, we can’t train a model if we haven’t even instantiated one yet! We need to do two main things, as we can see in figure 11.3. The first, as we just mentioned, is to initialize our model and optimizer; and the second is to initialize our Dataset and DataLoader instances. LunaDataset will define the randomized set of samples that will make up our training epoch, and our DataLoader instance will perform the work of loading the data out of our dataset and providing it to our application.
For this section, we are treating the details of LunaModel as a black box. In section 11.4, we will detail the internal workings. You are welcome to explore changes to the implementation to better meet our goals for the model, although that’s probably best done after finishing at least chapter 12.
Let’s see what our starting point looks like.
    class LunaTrainingApp:
        def __init__(self, sys_argv=None):
            # ... line 70
            self.use_cuda = torch.cuda.is_available()
            self.device = torch.device("cuda" if self.use_cuda else "cpu")

            self.model = self.initModel()
            self.optimizer = self.initOptimizer()

        def initModel(self):
            model = LunaModel()
            if self.use_cuda:
                log.info("Using CUDA; {} devices.".format(torch.cuda.device_count()))
                if torch.cuda.device_count() > 1:  # ❶
                    model = nn.DataParallel(model)  # ❷
                model = model.to(self.device)  # ❸
            return model

        def initOptimizer(self):
            return SGD(self.model.parameters(), lr=0.001, momentum=0.99)
❸ Sends model parameters to the GPU
If the system used for training has more than one GPU, we will use the nn.DataParallel class to distribute the work between all of the GPUs in the system and then collect and resync parameter updates and so on. This is almost entirely transparent in terms of both the model implementation and the code that uses that model.

Assuming that self.use_cuda is true, the call model.to(self.device) moves the model parameters to the GPU, setting up the various convolutions and other calculations to use the GPU for the heavy numerical lifting. It’s important to do so before constructing the optimizer, since, otherwise, the optimizer would be left looking at the CPU-based parameter objects rather than those copied to the GPU.
For our optimizer, we’ll use basic stochastic gradient descent (SGD; https://pytorch.org/docs/stable/optim.html#torch.optim.SGD) with momentum. We first saw this optimizer in chapter 5. Recall from part 1 that many different optimizers are available in PyTorch; while we won’t cover most of them in any detail, the official documentation (https://pytorch.org/docs/stable/optim.html#algorithms) does a good job of linking to the relevant papers.
Using SGD is generally considered a safe place to start when it comes to picking an optimizer; there are some problems that might not work well with SGD, but they’re relatively rare. Similarly, a learning rate of 0.001 and a momentum of 0.9 are pretty safe choices (note that the initOptimizer listing above passes a higher momentum of 0.99). Empirically, SGD with values in this range has worked reasonably well for a wide range of projects, and it’s easy to try a learning rate of 0.01 or 0.0001 if things aren’t working well right out of the box.
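To see what the learning rate and momentum actually control, here is a minimal pure-Python sketch of the SGD-with-momentum update as we understand it from the torch.optim.SGD documentation (default settings: no dampening, no Nesterov momentum, no weight decay). The function name and the toy problem are ours, and we use a larger learning rate than 0.001 only so that this one-dimensional example settles quickly.

```python
def sgd_momentum_step(param, grad, velocity, lr, momentum):
    """One update: accumulate the gradient into a velocity, then step along it."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


# Minimize f(x) = x ** 2, whose gradient is 2 * x and whose minimum is at 0.
x, v = 5.0, 0.0
for _ in range(300):
    grad = 2.0 * x
    x, v = sgd_momentum_step(x, grad, v, lr=0.1, momentum=0.9)

print(abs(x) < 1e-3)
```

The velocity term is a decaying sum of past gradients, so a momentum closer to 1 remembers older gradients for longer.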
That’s not to say any of those values is the best for our use case, but trying to find better ones is getting ahead of ourselves. Systematically trying different values for learning rate, momentum, network size, and other similar configuration settings is called a hyperparameter search. There are other, more glaring issues we need to address first in the coming chapters. Once we address those, we can begin to fine-tune these values. As we mentioned in the section “Testing other optimizers” in chapter 5, there are also other, more exotic optimizers we might choose; but other than perhaps swapping torch.optim.SGD for torch.optim.Adam, understanding the trade-offs involved is a topic too advanced for this book.
The LunaDataset class that we built in the last chapter acts as the bridge between whatever Wild West data we have and the somewhat more structured world of tensors that the PyTorch building blocks expect. For example, torch.nn.Conv3d (https://pytorch.org/docs/stable/nn.html#conv3d) expects five-dimensional input: (N, C, D, H, W): number of samples, channels per sample, depth, height, and width. Quite different from the native 3D our CT provides!

You may recall the ct_t.unsqueeze(0) call in LunaDataset.__getitem__ from the last chapter; it provides the fourth dimension, a “channel” for our data. Recall from chapter 4 that an RGB image has three channels, one each for red, green, and blue. Astronomical data could have dozens, one each for various slices of the electromagnetic spectrum--gamma rays, X-rays, ultraviolet light, visible light, infrared, microwaves, and/or radio waves. Since CT scans are single-intensity, our channel dimension is only size 1.
Also recall from part 1 that training on single samples at a time is typically an inefficient use of computing resources, because most processing platforms are capable of more parallel calculations than are required by a model to process a single training or validation sample. The solution is to group sample tuples together into a batch tuple, as in figure 11.4, allowing multiple samples to be processed at the same time. The fifth dimension (N) differentiates multiple samples in the same batch.
Conveniently, we don’t have to implement any of this batching: the PyTorch DataLoader class will handle all of the collation work for us. We’ve already built the bridge from the CT scans to PyTorch tensors with our LunaDataset class, so all that remains is to plug our dataset into a data loader.
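For intuition about what collation means, here is a simplified pure-Python sketch in which plain lists stand in for tensors. The collate and iter_batches names are ours; the real DataLoader stacks tensors along a new leading N dimension rather than building lists, so treat this as an analogy only.

```python
def collate(samples):
    """Turn a list of (input, label) sample tuples into one (inputs, labels) batch tuple."""
    inputs, labels = zip(*samples)
    return list(inputs), list(labels)


def iter_batches(dataset, batch_size):
    """Yield collated batches; the last one may hold fewer samples."""
    for start in range(0, len(dataset), batch_size):
        yield collate(dataset[start:start + batch_size])


# Five toy samples: a fake flattened "chunk" of three values plus a class label.
dataset = [([float(i)] * 3, i % 2) for i in range(5)]
batches = list(iter_batches(dataset, batch_size=2))

print(len(batches))   # number of batches produced
print(batches[0][1])  # labels of the first batch
```

Each batch keeps the same tuple layout as a single sample, which is why the training code can unpack a batch the same way it would unpack one sample.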
    def initTrainDl(self):
        train_ds = LunaDataset(  # ❶
            val_stride=10,
            isValSet_bool=False,
        )

        batch_size = self.cli_args.batch_size
        if self.use_cuda:
            batch_size *= torch.cuda.device_count()

        train_dl = DataLoader(  # ❷
            train_ds,
            batch_size=batch_size,  # ❸
            num_workers=self.cli_args.num_workers,
            pin_memory=self.use_cuda,  # ❹
        )

        return train_dl

    # ... line 137
    def main(self):
        train_dl = self.initTrainDl()
        val_dl = self.initValDl()  # ❺
❸ Batching is done automatically.
❹ Pinned memory transfers to GPU quickly.
❺ The validation data loader is very similar to training.
In addition to batching individual samples, data loaders can also provide parallel loading of data by using separate processes and shared memory. All we need to do is specify num_workers=... when instantiating the data loader, and the rest is taken care of behind the scenes. Each worker process produces complete batches as in figure 11.4. This helps make sure hungry GPUs are well fed with data. Our validation_ds and validation_dl instances look similar, except for the obvious isValSet_bool=True.
When we iterate, like for batch_tup in self.train_dl:, we won’t have to wait for each Ct to be loaded, samples to be taken and batched, and so on. Instead, we’ll get the already loaded batch_tup immediately, and a worker process will be freed up in the background to begin loading another batch to use on a later iteration. Using the data-loading features of PyTorch can help speed up most projects, because we can overlap data loading and processing with GPU calculation.
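The overlap between loading and computing can be sketched with the standard library alone. The following is an analogy under our own assumptions, not how DataLoader is implemented: it uses threads from concurrent.futures where PyTorch uses worker processes, and load_batch and prefetching_loader are hypothetical names.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def load_batch(batch_ndx):
    """Stand-in for slow disk reads and preprocessing of one batch."""
    return [batch_ndx * 10 + i for i in range(4)]


def prefetching_loader(n_batches, num_workers=2):
    """Yield batches in order while workers load upcoming ones in the background."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        pending = deque()
        next_ndx = 0
        # Prime the queue with one in-flight batch per worker.
        while next_ndx < min(num_workers, n_batches):
            pending.append(pool.submit(load_batch, next_ndx))
            next_ndx += 1
        while pending:
            batch = pending.popleft().result()  # waits only if not ready yet
            if next_ndx < n_batches:
                # The freed-up worker starts on a later batch right away.
                pending.append(pool.submit(load_batch, next_ndx))
                next_ndx += 1
            yield batch


batches = list(prefetching_loader(n_batches=5))
print(len(batches))
```

While the consumer is busy with one batch, the pool is already working on the next ones, which is the effect we rely on to keep the GPU busy.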
The possible design space for a convolutional neural network capable of detecting tumors is effectively infinite. Luckily, considerable effort has been spent over the past decade or so investigating effective models for image recognition. While these have largely focused on 2D images, the general architecture ideas transfer well to 3D, so there are many tested designs that we can use as a starting point. This helps because although our first network architecture is unlikely to be our best option, right now we are only aiming for “good enough to get us going.”
We will base the network design on what we used in chapter 8. We will have to update the model somewhat because our input data is 3D, and we will add some complicating details, but the overall structure shown in figure 11.5 should feel familiar. Similarly, the work we do for this project will be a good base for your future projects, although the further you get from classification or segmentation projects, the more you’ll have to adapt this base to fit. Let’s dissect this architecture, starting with the four repeated blocks that make up the bulk of the network.
Classification models often have a structure that consists of a tail, a backbone (or body), and a head. The tail is the first few layers that process the input to the network. These early layers often have a different structure or organization than the rest of the network, as they must adapt the input to the form expected by the backbone. Here we use a simple batch normalization layer, though often the tail contains convolutional layers as well. Such convolutional layers are often used to aggressively downsample the size of the image; since our image size is already small, we don’t need to do that here.
Next, the backbone of the network typically contains the bulk of the layers, which are usually arranged in series of blocks. Each block has the same (or at least a similar) set of layers, though often the size of the expected input and the number of filters changes from block to block. We will use a block that consists of two 3 × 3 convolutions, each followed by an activation, with a max-pooling operation at the end of the block. We can see this in the expanded view of figure 11.5 labeled Block[block1]. Here’s what the implementation of the block looks like in code.
    class LunaBlock(nn.Module):
        def __init__(self, in_channels, conv_channels):
            super().__init__()

            self.conv1 = nn.Conv3d(
                in_channels, conv_channels, kernel_size=3, padding=1, bias=True,
            )
            self.relu1 = nn.ReLU(inplace=True)  # ❶
            self.conv2 = nn.Conv3d(
                conv_channels, conv_channels, kernel_size=3, padding=1, bias=True,
            )
            self.relu2 = nn.ReLU(inplace=True)  # ❶

            self.maxpool = nn.MaxPool3d(2, 2)

        def forward(self, input_batch):
            block_out = self.conv1(input_batch)
            block_out = self.relu1(block_out)  # ❶
            block_out = self.conv2(block_out)
            block_out = self.relu2(block_out)  # ❶

            return self.maxpool(block_out)
❶ These could be implemented as calls to the functional API instead.
Finally, the head of the network takes the output from the backbone and converts it into the desired output form. For convolutional networks, this often involves flattening the intermediate output and passing it to a fully connected layer. For some networks, it makes sense to also include a second fully connected layer, although that is usually more appropriate for classification problems in which the imaged objects have more structure (think about cars versus trucks having wheels, lights, grill, doors, and so on) and for projects with a large number of classes. Since we are only doing binary classification, and we don’t seem to need the additional complexity, our head flattens the backbone output and passes it to just a single fully connected layer.
Using a structure like this can be a good first building block for a convolutional network. There are more complicated designs out there, but for many projects they’re overkill in terms of both implementation complexity and computational demands. It’s a good idea to start simple and add complexity only when there’s a demonstrable need for it.
We can see the convolutions of our block represented in 2D in figure 11.6. Since this is a small portion of a larger image, we ignore padding here. (Note that the ReLU activation function is not shown, as applying it does not change the image sizes.)
Let’s walk through the information flow between our input voxels and a single voxel of output. We want to have a strong sense of how our output will respond when the inputs change. It might be a good idea to review chapter 8, particularly sections 8.1 through 8.3, just to make sure you’re 100% solid on the basic mechanics of convolutions.
We’re using 3 × 3 × 3 convolutions in our block. A single 3 × 3 × 3 convolution has a receptive field of 3 × 3 × 3, which is almost tautological. Twenty-seven voxels are fed in, and one comes out.
It gets interesting when we use two 3 × 3 × 3 convolutions stacked back to back. Stacking convolutional layers allows the final output voxel (or pixel) to be influenced by an input further away than the size of the convolutional kernel suggests. If that output voxel is fed into another 3 × 3 × 3 kernel as one of the edge voxels, then some of the inputs to the first layer will be outside of the 3 × 3 × 3 area of input to the second. The final output of those two stacked layers has an effective receptive field of 5 × 5 × 5. That means that when taken together, the stacked layers act similarly to a single convolutional layer with a larger kernel.
Put another way, each 3 × 3 × 3 convolutional layer adds an additional one-voxel-per-edge border to the receptive field. We can see this if we trace the arrows in figure 11.6 backward; our 2 × 2 output has a receptive field of 4 × 4, which in turn has a receptive field of 6 × 6. Two stacked 3 × 3 × 3 layers use fewer parameters than a full 5 × 5 × 5 convolution would (and so are also faster to compute).
The output of our two stacked convolutions is fed into a 2 × 2 × 2 max pool, which means we’re taking a 6 × 6 × 6 effective field, throwing away seven-eighths of the data, and going with the one 5 × 5 × 5 field that produced the largest value.2 Now, those “discarded” input voxels still have a chance to contribute, since the max pool that’s one output voxel over has an overlapping input field, so it’s possible they’ll influence the final output that way.
Note that while we show the receptive field shrinking with each convolutional layer, we’re using padded convolutions, which add a virtual one-pixel border around the image. Doing so keeps our input and output image sizes the same.
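The receptive-field and parameter-count claims are easy to check with a little arithmetic. The sketch below assumes stride-1 layers applied in sequence (which holds inside one block, up to and including the pool's own window) and counts weights only, ignoring biases. The helper names are ours.

```python
def stacked_receptive_field(kernel_sizes):
    """Per-axis receptive field of stride-1 layers applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each layer widens the field by (k - 1) voxels per axis
    return field


def conv3d_weight_count(kernel_size, in_channels, out_channels):
    """Number of weights in a cubic 3D convolution kernel bank (biases excluded)."""
    return kernel_size ** 3 * in_channels * out_channels


two_convs = stacked_receptive_field([3, 3])          # two 3 x 3 x 3 convolutions
with_pool = stacked_receptive_field([3, 3, 2])       # plus the 2 x 2 x 2 max pool

channels = 8
stacked_params = 2 * conv3d_weight_count(3, channels, channels)
single_params = conv3d_weight_count(5, channels, channels)

print(two_convs, with_pool)
print(stacked_params, single_params)
```

With 8 channels in and out, the two stacked layers need 2 × 27 × 64 weights against 125 × 64 for the single larger kernel, for the same 5-voxel reach along each axis.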
The nn.ReLU layers are the same as the ones we looked at in chapter 6. Outputs greater than 0.0 will be left unchanged, and outputs less than 0.0 will be clamped to zero.
This block will be repeated multiple times to form our model’s backbone.
Let’s take a look at the full model implementation. We’ll skip the block definition, since we just saw that in listing 11.6.
    class LunaModel(nn.Module):
        def __init__(self, in_channels=1, conv_channels=8):
            super().__init__()

            self.tail_batchnorm = nn.BatchNorm3d(1)  # ❶

            self.block1 = LunaBlock(in_channels, conv_channels)  # ❷
            self.block2 = LunaBlock(conv_channels, conv_channels * 2)  # ❷
            self.block3 = LunaBlock(conv_channels * 2, conv_channels * 4)  # ❷
            self.block4 = LunaBlock(conv_channels * 4, conv_channels * 8)  # ❷

            self.head_linear = nn.Linear(1152, 2)  # ❸
            self.head_softmax = nn.Softmax(dim=1)  # ❸
Here, our tail is relatively simple. We are going to normalize our input using nn.BatchNorm3d, which, as we saw in chapter 8, will shift and scale our input so that it has a mean of 0 and a standard deviation of 1. Thus, the somewhat odd Hounsfield unit (HU) scale that our input is in won’t really be visible to the rest of the network. This is a somewhat arbitrary choice; we know what our input units are, and we know the expected values of the relevant tissues, so we could probably implement a fixed normalization scheme pretty easily. It’s not clear which approach would be better.3
Our backbone is four repeated blocks, with the block implementation pulled out into the separate nn.Module subclass we saw earlier in listing 11.6. Since each block ends with a 2 × 2 × 2 max-pool operation, after four blocks we will have decreased the resolution of the image 16 times in each dimension. Recall from chapter 10 that our data is returned in chunks that are 32 × 48 × 48, which will become 2 × 3 × 3 by the end of the backbone.
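The shape bookkeeping can be verified in a few lines. This sketch assumes padded convolutions that preserve spatial size, so only the four max pools change the resolution; the helper name is ours.

```python
def backbone_output_shape(input_shape, num_blocks, pool=2):
    """Spatial shape after num_blocks blocks that each end in a stride-2 max pool."""
    shape = tuple(input_shape)
    for _ in range(num_blocks):
        shape = tuple(dim // pool for dim in shape)
    return shape


conv_channels = 8
out_channels = conv_channels * 8  # channel count produced by the fourth block
depth, height, width = backbone_output_shape((32, 48, 48), num_blocks=4)
flat_features = out_channels * depth * height * width

print((depth, height, width))
print(flat_features)
```

The flattened size is where the 1152 passed to nn.Linear in the model listing comes from: 64 channels times 2 × 3 × 3 voxels.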
Finally, our head is just a fully connected layer followed by a call to nn.Softmax. Softmax is a useful function for single-label classification tasks and has a few nice properties: it bounds the output between 0 and 1, it’s relatively insensitive to the absolute range of the inputs (only the relative values of the inputs matter), and it allows our model to express the degree of certainty it has in an answer.

The function itself is relatively simple. Every value from the input is used to exponentiate e, and the resulting series of values is then divided by the sum of all the results of exponentiation. Here’s what it looks like as a simple, nonoptimized softmax implementation in pure Python:
    >>> from math import e
    >>> logits = [1, -2, 3]
    >>> exp = [e ** x for x in logits]
    >>> exp
    [2.718, 0.135, 20.086]
    >>> softmax = [x / sum(exp) for x in exp]
    >>> softmax
    [0.118, 0.006, 0.876]
Of course, we use the PyTorch version of nn.Softmax for our model, as it natively understands batches and tensors and will perform autograd quickly and as expected.
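The claim that only the relative values of the inputs matter can be checked directly. The sketch below is a slightly more careful pure-Python softmax (it subtracts the largest logit before exponentiating, a common trick to avoid overflow that leaves the result unchanged); it is our illustration, not the PyTorch implementation.

```python
import math


def softmax(logits):
    """Softmax over a list of numbers, shifted by the max for numerical safety."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


probs = softmax([1, -2, 3])
shifted = softmax([101, 98, 103])  # same logits with 100 added to each

print([round(p, 3) for p in probs])
print(all(abs(a - b) < 1e-9 for a, b in zip(probs, shifted)))
```

Adding the same constant to every logit leaves the probabilities untouched, and the outputs always sum to 1.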
Continuing on with our model definition, we come to a complication. We can’t just feed the output of self.block4 into a fully connected layer, since that output is a per-sample 2 × 3 × 3 image with 64 channels, and fully connected layers expect a 1D vector as input (well, technically they expect a batch of 1D vectors, which is a 2D array, but the mismatch remains either way). Let’s take a look at the forward method.
    def forward(self, input_batch):
        bn_output = self.tail_batchnorm(input_batch)

        block_out = self.block1(bn_output)
        block_out = self.block2(block_out)
        block_out = self.block3(block_out)
        block_out = self.block4(block_out)

        conv_flat = block_out.view(
            block_out.size(0),  # ❶
            -1,
        )
        linear_output = self.head_linear(conv_flat)

        return linear_output, self.head_softmax(linear_output)
Note that before we pass data into a fully connected layer, we must flatten it using the view function. Since that operation is stateless (it has no parameters that govern its behavior), we can simply perform the operation in the forward function. This is somewhat similar to the functional interfaces we discussed in chapter 8. Almost every model that uses convolution and produces classifications, regressions, or other non-image outputs will have a similar component in the head of the network.
For the return value of the forward method, we return both the raw logits and the softmax-produced probabilities. We first hinted at logits in section 7.2.6: they are the numerical values produced by the network prior to being normalized into probabilities by the softmax layer. That might sound a bit complicated, but logits are really just the raw input to the softmax layer. They can take any real value, and the softmax will squash them to the range 0-1.
We’ll use the logits when we calculate the nn.CrossEntropyLoss during training,4 and we’ll use the probabilities for when we want to actually classify the samples. This kind of slight difference between what’s used for training and what’s used in production is fairly common, especially when the difference between the two outputs is a simple, stateless function like softmax.
Finally, let’s talk about initializing our network’s parameters. In order to get well-behaved performance out of our model, the network’s weights, biases, and other parameters need to exhibit certain properties. Let’s imagine a degenerate case, where all of the network’s weights are greater than 1 (and we do not have residual connections). In that case, repeated multiplication by those weights would result in layer outputs that became very large as data flowed through the layers of the network. Similarly, weights less than 1 would cause all layer outputs to become smaller and vanish. Similar considerations apply to the gradients in the backward pass.
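A toy calculation shows how quickly this degenerate case gets out of hand. The sketch below collapses each layer to a single scalar multiplication, which is a deliberate oversimplification of a real network; the function name and the specific weights are ours.

```python
def propagate(value, weight, num_layers):
    """Pass a value through num_layers 'layers' that each multiply by weight."""
    for _ in range(num_layers):
        value = value * weight
    return value


grown = propagate(1.0, 1.5, 50)   # every weight slightly above 1
shrunk = propagate(1.0, 0.5, 50)  # every weight below 1

print(grown > 1e8)
print(shrunk < 1e-14)
```

After only 50 such layers, the first value has grown by more than eight orders of magnitude and the second has all but vanished, which is why the scale of the initial weights matters.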
Many normalization techniques can be used to keep layer outputs well behaved, but one of the simplest is to just make sure the network’s weights are initialized such that intermediate values and gradients become neither unreasonably small nor unreasonably large. As we discussed in chapter 8, PyTorch does not help us as much as it should here, so we need to do some initialization ourselves. We can treat the following _init_weights function as boilerplate, as the exact details aren’t particularly important.
    def _init_weights(self):
        for m in self.modules():
            if type(m) in {
                nn.Linear,
                nn.Conv3d,
            }:
                nn.init.kaiming_normal_(
                    m.weight.data, a=0, mode='fan_out', nonlinearity='relu',
                )
                if m.bias is not None:
                    fan_in, fan_out = nn.init._calculate_fan_in_and_fan_out(m.weight.data)
                    bound = 1 / math.sqrt(fan_out)
                    nn.init.normal_(m.bias, -bound, bound)
Now it’s time to take the various pieces we’ve been working with and assemble them into something we can actually execute. This training loop should be familiar--we saw loops like figure 11.7 in chapter 5.
The code is relatively compact (the doTraining function is only 12 statements; it’s longer here due to line-length limitations).
    def main(self):
        # ... line 143
        for epoch_ndx in range(1, self.cli_args.epochs + 1):
            trnMetrics_t = self.doTraining(epoch_ndx, train_dl)
            self.logMetrics(epoch_ndx, 'trn', trnMetrics_t)

    # ... line 165
    def doTraining(self, epoch_ndx, train_dl):
        self.model.train()
        trnMetrics_g = torch.zeros(  # ❶
            METRICS_SIZE,
            len(train_dl.dataset),
            device=self.device,
        )

        batch_iter = enumerateWithEstimate(  # ❷
            train_dl,
            "E{} Training".format(epoch_ndx),
            start_ndx=train_dl.num_workers,
        )
        for batch_ndx, batch_tup in batch_iter:
            self.optimizer.zero_grad()  # ❸

            loss_var = self.computeBatchLoss(  # ❹
                batch_ndx,
                batch_tup,
                train_dl.batch_size,
                trnMetrics_g
            )

            loss_var.backward()  # ❺
            self.optimizer.step()  # ❺

        self.totalTrainingSamples_count += len(train_dl.dataset)

        return trnMetrics_g.to('cpu')
❶ Initializes an empty metrics array
❷ Sets up our batch looping with time estimate
❸ Frees any leftover gradient tensors
❹ We’ll discuss this method in detail in the next section.
❺ Actually updates the model weights
The main differences that we see from the training loops in earlier chapters are as follows:
The trnMetrics_g tensor collects detailed per-class metrics during training. For larger projects like ours, this kind of insight can be very nice to have.
We don’t directly iterate over the train_dl data loader. We use enumerateWithEstimate to provide an estimated time of completion. This isn’t crucial; it’s just a stylistic choice.
The actual loss computation is pushed into the computeBatchLoss method. Again, this isn’t strictly necessary, but code reuse is typically a plus.
We’ll discuss why we’ve wrapped enumerate with additional functionality in section 11.7.2; for now, assume it’s the same as enumerate(train_dl).
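To give a feel for what such a wrapper might do, here is a simplified, hypothetical sketch. It is not the book's enumerateWithEstimate: the name enumerate_with_estimate, its print_every and total parameters, and the message format are all ours, and it omits the start_ndx warm-up handling used in the listing.

```python
import time


def enumerate_with_estimate(iterable, desc_str, print_every=2, total=None):
    """Behave like enumerate(), but occasionally print a rough time-remaining estimate."""
    if total is None:
        total = len(iterable)
    start_ts = time.time()
    for ndx, item in enumerate(iterable):
        yield ndx, item
        done = ndx + 1
        if done % print_every == 0 and done < total:
            elapsed = time.time() - start_ts
            remaining = elapsed / done * (total - done)  # assume a steady pace
            print("{}: {}/{} done, about {:.1f}s left".format(
                desc_str, done, total, remaining))


pairs = list(enumerate_with_estimate(['a', 'b', 'c'], "E1 Training"))
print(pairs)
```

From the caller's point of view, the loop body sees exactly the same (index, item) pairs that plain enumerate would produce.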
The purpose of the trnMetrics_g tensor is to transport information about how the model is behaving on a per-sample basis from the computeBatchLoss function to the logMetrics function. Let’s take a look at computeBatchLoss next. We’ll cover logMetrics after we’re done with the rest of the main training loop.
The computeBatchLoss function is called by both the training and validation loops. As the name suggests, it computes the loss over a batch of samples. In addition, the function also computes and records per-sample information about the output the model is producing. This lets us compute things like the percentage of correct answers per class, which allows us to home in on areas where our model is having difficulty.

Of course, the function’s core functionality is around feeding the batch into the model and computing the per-batch loss. We’re using CrossEntropyLoss (https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss), just like in chapter 7. Unpacking the batch tuple, moving the tensors to the GPU, and invoking the model should all feel familiar after that earlier training work.
    def computeBatchLoss(self, batch_ndx, batch_tup, batch_size, metrics_g):
        input_t, label_t, _series_list, _center_list = batch_tup

        input_g = input_t.to(self.device, non_blocking=True)
        label_g = label_t.to(self.device, non_blocking=True)

        logits_g, probability_g = self.model(input_g)

        loss_func = nn.CrossEntropyLoss(reduction='none')  # ❶
        loss_g = loss_func(
            logits_g,
            label_g[:,1],  # ❷
        )
        # ... line 238
        return loss_g.mean()  # ❸
❶ reduction=‘none’ gives the loss per sample.
❷ Index of the one-hot-encoded class
❸ Recombines the loss per sample into a single value
Here we are not using the default behavior to get a loss value averaged over the batch. Instead, we get a tensor of loss values, one per sample. This lets us track the individual losses, which means we can aggregate them as we wish (per class, for example). We’ll see that in action in just a moment. For now, we’ll return the mean of those per-sample losses, which is equivalent to the batch loss. In situations where you don’t want to keep statistics per sample, using the loss averaged over the batch is perfectly fine. Whether that’s the case is highly dependent on your project and goals.
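To make the per-sample idea concrete, here is a minimal, framework-free sketch of what a cross-entropy loss with reduction='none' computes. The helper name cross_entropy_per_sample and the logits are made up for illustration; in the chapter's code the real work is done by nn.CrossEntropyLoss on GPU tensors.

```python
import math

def cross_entropy_per_sample(logits, targets):
    """Return one loss value per sample (the analogue of reduction='none')."""
    losses = []
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the class scores
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        # Cross-entropy is -log(softmax(row)[target])
        losses.append(log_sum_exp - row[target])
    return losses

# Made-up logits for four candidates: [non-nodule score, nodule score]
logits = [[2.0, -1.0], [1.5, 0.5], [0.2, 0.1], [3.0, -2.0]]
targets = [0, 0, 1, 0]   # class index per sample; only one positive

per_sample = cross_entropy_per_sample(logits, targets)
batch_loss = sum(per_sample) / len(per_sample)   # what a mean reduction returns

# Because we kept the individual values, we can also aggregate per class
neg_losses = [loss for loss, t in zip(per_sample, targets) if t == 0]
pos_losses = [loss for loss, t in zip(per_sample, targets) if t == 1]
neg_mean = sum(neg_losses) / len(neg_losses)
pos_mean = sum(pos_losses) / len(pos_losses)

print(f"batch {batch_loss:.4f}, neg {neg_mean:.4f}, pos {pos_mean:.4f}")
```

Averaging the per-sample values recovers the single batch loss, so nothing is lost by asking for the unreduced tensor; we only gain the option to slice it by class.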
Once that’s done, we’ve fulfilled our obligations to the calling function in terms of what’s required to do backpropagation and weight updates. Before we do that, however, we also want to record our per-sample stats for posterity (and later analysis). We’ll use the metrics_g
parameter passed in to accomplish this.
METRICS_LABEL_NDX=0    ❶
METRICS_PRED_NDX=1
METRICS_LOSS_NDX=2
METRICS_SIZE = 3

# ... line 225
def computeBatchLoss(self, batch_ndx, batch_tup, batch_size, metrics_g):
    # ... line 238
    start_ndx = batch_ndx * batch_size
    end_ndx = start_ndx + label_t.size(0)

    metrics_g[METRICS_LABEL_NDX, start_ndx:end_ndx] = label_g[:,1].detach()    ❷
    metrics_g[METRICS_PRED_NDX, start_ndx:end_ndx] = probability_g[:,1].detach()    ❷
    metrics_g[METRICS_LOSS_NDX, start_ndx:end_ndx] = loss_g.detach()    ❷

    return loss_g.mean()    ❸
❶ These named array indexes are declared at module-level scope
❷ We use detach since none of our metrics need to hold on to gradients.
❸ Again, this is the loss over the entire batch.
By recording the label, prediction, and loss for each and every training (and later, validation) sample, we have a wealth of detailed information we can use to investigate the behavior of our model. For now, we’re going to focus on compiling per-class statistics, but we could easily use this information to find the sample that is classified the most wrongly and start to investigate why. Again, for some projects, this kind of information will be less interesting, but it’s good to remember that you have these kinds of options available.
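The indexing is easier to follow with a toy example. The following sketch uses plain Python lists in place of the metrics tensor, and the helper record_batch and all values are hypothetical. It shows how start_ndx and end_ndx place each batch in its own column range, including a short final batch, and how the recorded losses let us look up the worst-classified sample afterward.

```python
METRICS_LABEL_NDX = 0
METRICS_PRED_NDX = 1
METRICS_LOSS_NDX = 2
METRICS_SIZE = 3

def record_batch(metrics, batch_ndx, batch_size, labels, preds, losses):
    """Copy one batch of per-sample values into the matching columns."""
    start_ndx = batch_ndx * batch_size
    end_ndx = start_ndx + len(labels)      # the final batch may be short
    metrics[METRICS_LABEL_NDX][start_ndx:end_ndx] = labels
    metrics[METRICS_PRED_NDX][start_ndx:end_ndx] = preds
    metrics[METRICS_LOSS_NDX][start_ndx:end_ndx] = losses

num_samples, batch_size = 5, 2
metrics = [[0.0] * num_samples for _ in range(METRICS_SIZE)]

# Three batches: two full ones and a short final one (made-up values)
record_batch(metrics, 0, batch_size, [0.0, 0.0], [0.1, 0.2], [0.11, 0.22])
record_batch(metrics, 1, batch_size, [1.0, 0.0], [0.3, 0.6], [1.20, 0.92])
record_batch(metrics, 2, batch_size, [0.0], [0.05], [0.05])

# Index of the sample with the highest loss, i.e. the one classified most wrongly
worst_ndx = max(range(num_samples), key=lambda i: metrics[METRICS_LOSS_NDX][i])
print(worst_ndx, metrics[METRICS_LABEL_NDX][worst_ndx])
```

Using the actual batch length for end_ndx, rather than batch_size, is what keeps the last, partially filled batch from writing past the end of the buffer.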
The validation loop in figure 11.8 looks very similar to training but is somewhat simplified. The key difference is that validation is read-only. Specifically, the loss value returned is not used, and the weights are not updated.
Nothing about the model should have changed between the start and end of the function call. In addition, it’s quite a bit faster due to the with torch.no_grad()
context manager explicitly informing PyTorch that no gradients need to be computed.
def main(self):
    for epoch_ndx in range(1, self.cli_args.epochs + 1):
        # ... line 157
        valMetrics_t = self.doValidation(epoch_ndx, val_dl)
        self.logMetrics(epoch_ndx, 'val', valMetrics_t)

# ... line 203
def doValidation(self, epoch_ndx, val_dl):
    with torch.no_grad():
        self.model.eval()    ❶
        valMetrics_g = torch.zeros(
            METRICS_SIZE,
            len(val_dl.dataset),
            device=self.device,
        )

        batch_iter = enumerateWithEstimate(
            val_dl,
            "E{} Validation ".format(epoch_ndx),
            start_ndx=val_dl.num_workers,
        )
        for batch_ndx, batch_tup in batch_iter:
            self.computeBatchLoss(
                batch_ndx, batch_tup, val_dl.batch_size, valMetrics_g)

    return valMetrics_g.to('cpu')
❶ Turns off training-time behavior
Without needing to update network weights (recall that doing so would violate the entire premise of the validation set; something we never want to do!), we don’t need to use the loss returned from computeBatchLoss
, nor do we need to reference the optimizer. All that’s left inside the loop is the call to computeBatchLoss
. Note that we are still collecting metrics in valMetrics_g
as a side effect of the call, even though we aren’t using the overall per-batch loss returned by computeBatchLoss
for anything.
The last thing we do per epoch is log our performance metrics for this epoch. As shown in figure 11.9, once we’ve logged metrics, we return to the training loop for the next epoch of training. Logging results and progress as we go is important, since if training goes off the rails (“does not converge” in the parlance of deep learning), we want to notice this is happening and stop spending time training a model that’s not working out. In less catastrophic cases, it’s good to be able to keep an eye on how your model behaves.
Earlier, we were collecting results in trnMetrics_g
and valMetrics_g
for logging progress per epoch. Each of these two tensors now contains everything we need to compute our percent correct and average loss per class for our training and validation runs. Doing this per epoch is a common choice, though somewhat arbitrary. In future chapters, we’ll see how to manipulate the size of our epochs such that we get feedback about training progress at a reasonable rate.
Let’s talk about the high-level structure of the logMetrics
function. The signature looks like this.
def logMetrics(
        self,
        epoch_ndx,
        mode_str,
        metrics_t,
        classificationThreshold=0.5,
):
We use epoch_ndx
purely for display while logging our results. The mode_str
argument tells us whether the metrics are for training or validation.
We consume either trnMetrics_t
or valMetrics_t
, which is passed in as the metrics_t
parameter. Recall that both of those inputs are tensors of floating-point values that we filled with data during computeBatchLoss
and then transferred back to the CPU right before we returned them from doTraining
and doValidation
. Both tensors have three rows and as many columns as we have samples (training samples or validation samples, depending). As a reminder, those three rows correspond to the following constants.
METRICS_LABEL_NDX=0 ❶
METRICS_PRED_NDX=1
METRICS_LOSS_NDX=2
METRICS_SIZE = 3
❶ These are declared at module-level scope.
Next, we’re going to construct masks that will let us limit our metrics to only the nodule or non-nodule (aka positive or negative) samples. We will also count the total samples per class, as well as the number of samples we classified correctly.
negLabel_mask = metrics_t[METRICS_LABEL_NDX] <= classificationThreshold
negPred_mask = metrics_t[METRICS_PRED_NDX] <= classificationThreshold

posLabel_mask = ~negLabel_mask
posPred_mask = ~negPred_mask
While we don’t assert
it here, we know that all of the values stored in metrics_t[METRICS_LABEL_NDX]
belong to the set {0.0, 1.0}
since we know that our nodule status labels are simply True
or False
. By comparing to classificationThreshold
, which defaults to 0.5, we get an array of binary values where a True
value corresponds to a non-nodule (aka negative) label for the sample in question.
We do a similar comparison to create the negPred_mask
, but we must remember that the METRICS_PRED_NDX
values are the positive predictions produced by our model and can be any floating-point value between 0.0 and 1.0, inclusive. That doesn’t change our comparison, but it does mean the actual value can be close to 0.5. The positive masks are simply the inverse of the negative masks.
Note While other projects can utilize similar approaches, it’s important to realize that we’re taking some shortcuts that are allowed because this is a binary classification problem. If your next project has more than two classes or has samples that belong to multiple classes at the same time, you’ll have to use more complicated logic to build similar masks.
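As a small worked example of the thresholding just described, the sketch below builds the four masks from plain Python lists. The helper build_masks and the sample values are made up; the chapter's code performs the same comparisons on tensors. Note how a prediction of exactly 0.5 lands on the negative side because the comparison is <=.

```python
def build_masks(labels, preds, threshold=0.5):
    """Return (neg_label, neg_pred, pos_label, pos_pred) boolean lists."""
    neg_label = [label <= threshold for label in labels]
    neg_pred = [pred <= threshold for pred in preds]
    # The positive masks are the element-wise inverse of the negative ones
    pos_label = [not flag for flag in neg_label]
    pos_pred = [not flag for flag in neg_pred]
    return neg_label, neg_pred, pos_label, pos_pred

labels = [0.0, 0.0, 1.0, 0.0, 1.0]       # made-up ground truth
preds = [0.10, 0.50, 0.49, 0.80, 0.51]   # made-up nodule probabilities

neg_label, neg_pred, pos_label, pos_pred = build_masks(labels, preds)

# Counting True values plays the role of mask.sum() on a tensor
neg_correct = sum(l and p for l, p in zip(neg_label, neg_pred))
pos_correct = sum(l and p for l, p in zip(pos_label, pos_pred))
print(neg_pred, neg_correct, pos_correct)
```

Predictions of 0.49 and 0.51 end up in different classes even though the model is nearly undecided on both, which is worth keeping in mind when reading the percent-correct numbers.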
Next, we use those masks to compute some per-label statistics and store them in a dictionary, metrics_dict
.
neg_count = int(negLabel_mask.sum())    ❶
pos_count = int(posLabel_mask.sum())

neg_correct = int((negLabel_mask & negPred_mask).sum())
pos_correct = int((posLabel_mask & posPred_mask).sum())

metrics_dict = {}
metrics_dict['loss/all'] = metrics_t[METRICS_LOSS_NDX].mean()
metrics_dict['loss/neg'] = metrics_t[METRICS_LOSS_NDX, negLabel_mask].mean()
metrics_dict['loss/pos'] = metrics_t[METRICS_LOSS_NDX, posLabel_mask].mean()

metrics_dict['correct/all'] = (pos_correct + neg_correct) / np.float32(metrics_t.shape[1]) * 100    ❷
metrics_dict['correct/neg'] = neg_correct / np.float32(neg_count) * 100
metrics_dict['correct/pos'] = pos_correct / np.float32(pos_count) * 100
❶ Converts to a normal Python integer
❷ Avoids integer division by converting to np.float32
First we compute the average loss over the entire epoch. Since the loss is the single metric that is being minimized during training, we always want to be able to keep track of it. Then we limit the loss averaging to only those samples with a negative label using the negLabel_mask
we just made. We do the same with the positive loss. Computing a per-class loss like this can be useful if one class is persistently harder to classify than another, since that knowledge can help drive investigation and improvements.
We’ll close out the calculations by determining the fraction of samples we classified correctly, as well as the fraction correct for each label. Since we will display these numbers as percentages in a moment, we also multiply the values by 100. Similar to the loss, we can use these numbers to help guide our efforts when making improvements. After the calculations, we then log our results with three calls to log.info
.
log.info(
    ("E{} {:8} {loss/all:.4f} loss, "
        + "{correct/all:-5.1f}% correct, "
    ).format(
        epoch_ndx,
        mode_str,
        **metrics_dict,
    )
)
log.info(
    ("E{} {:8} {loss/neg:.4f} loss, "
        + "{correct/neg:-5.1f}% correct ({neg_correct:} of {neg_count:})"
    ).format(
        epoch_ndx,
        mode_str + '_neg',
        neg_correct=neg_correct,
        neg_count=neg_count,
        **metrics_dict,
    )
)
log.info(    ❶
    # ... line 319
)
❶ The ‘pos’ logging is similar to the ‘neg’ logging earlier.
The first log has values computed from all of our samples and is tagged /all
, while the negative (non-nodule) and positive (nodule) values are tagged /neg
and /pos
, respectively. We don’t show the third logging statement for positive values here; it’s identical to the second except for swapping neg for pos in all cases.
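To see how the dictionary and the format strings fit together, here is a small standalone sketch with made-up counts and losses. The percent helper and its NaN fallback for an empty class are assumptions added for illustration; the chapter's code divides directly. The format strings follow the same style as the log.info calls, mixing positional fields with named fields looked up by their 'group/name' keys.

```python
# Made-up per-class counts and losses for a tiny validation set
neg_count, pos_count = 6, 2
neg_correct, pos_correct = 5, 0

def percent(numerator, denominator):
    # Assumption: report NaN instead of raising when a class has no samples
    return numerator / denominator * 100 if denominator else float('nan')

metrics_dict = {
    'loss/all': 0.8238,
    'loss/neg': 0.1650,
    'loss/pos': 2.8000,
    'correct/all': percent(neg_correct + pos_correct, neg_count + pos_count),
    'correct/neg': percent(neg_correct, neg_count),
    'correct/pos': percent(pos_correct, pos_count),
}

# Positional fields come first; named fields are pulled from metrics_dict
line_all = "E{} {:8} {loss/all:.4f} loss, {correct/all:-5.1f}% correct".format(
    1, 'val', **metrics_dict)
line_pos = ("E{} {:8} {loss/pos:.4f} loss, {correct/pos:-5.1f}% correct "
            "({pos_correct:} of {pos_count:})").format(
    1, 'val_pos', pos_correct=pos_correct, pos_count=pos_count, **metrics_dict)
print(line_all)
print(line_pos)
```

Even in this toy case the overall percentage looks respectable while the positive class is entirely wrong, which is exactly the kind of detail the per-class lines are there to expose.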
Now that we’ve completed the core of the training.py script, we’ll actually start running it. This will initialize and train our model and print statistics about how well the training is going. The idea is to get this kicked off to run in the background while we’re covering the model implementation in detail. Hopefully we’ll have results to look at once we’re done.
We’re running this script from the main code directory; it should have subdirectories called p2ch11, util, and so on. The python
environment used should have all the libraries listed in requirements.txt installed. Once those libraries are ready, we can run:
$ python -m p2ch11.training ❶
Starting LunaTrainingApp,
Namespace(batch_size=256, channels=8, epochs=20, layers=3, num_workers=8)
<p2ch11.dsets.LunaDataset object at 0x7fa53a128710>: 495958 training samples
<p2ch11.dsets.LunaDataset object at 0x7fa537325198>: 55107 validation samples
Epoch 1 of 20, 1938/216 batches of size 256
E1 Training ----/1938, starting
E1 Training 16/1938, done at 2018-02-28 20:52:54, 0:02:57
...
❶ This is the command line for Linux/Bash. Windows users will probably need to invoke Python differently, depending on the install method used.
As a reminder, we also provide a Jupyter Notebook that contains invocations of the training application.
# In[5]:
run('p2ch11.prepcache.LunaPrepCacheApp')

# In[6]:
run('p2ch11.training.LunaTrainingApp', '--epochs=1')
If the first epoch seems to be taking a very long time (more than 10 or 20 minutes), it might be related to needing to prepare the cached data required by LunaDataset
. See section 10.5.1 for details about the caching. The exercises for chapter 10 included writing a script to pre-stuff the cache in an efficient manner. We also provide the prepcache.py file to do the same thing; it can be invoked with python -m p2ch11.prepcache
. Since we repeat our dsets.py files per chapter, the caching will need to be repeated for every chapter. This is somewhat inefficient in both space and time, but it means we can keep the code for each chapter much better contained. For your future projects, we recommend reusing your cache more heavily.
Once training is underway, we want to make sure we’re using the computing resources at hand the way we expect. An easy way to tell if the bottleneck is data loading or computation is to wait a few moments after the script starts to train (look for output like E1 Training 16/7750, done at...
) and then check both top
and nvidia-smi
:
If the eight Python worker processes are consuming >80% CPU, then the cache probably needs to be prepared (we know this here because the authors have made sure there aren’t CPU bottlenecks in this project’s implementation; this won’t be generally true).
If nvidia-smi
reports that GPU-Util
is >80%, then you’re saturating your GPU. We’ll discuss some strategies for efficient waiting in section 11.7.2.
The intent is that the GPU is saturated; we want to use as much of that computing power as we can to complete epochs quickly. A single NVIDIA GTX 1080 Ti should complete an epoch in under 15 minutes. Since our model is relatively simple, it doesn’t take a lot of CPU preprocessing for the CPU to become the bottleneck. When working with models with greater depth (or more needed calculations in general), processing each batch will take longer on the GPU, which increases the amount of CPU processing we can do per batch before the GPU runs out of work and sits idle waiting for the next batch of input.
If the number of samples is less than 495,958 for training or 55,107 for validation, it might make sense to do some sanity checking to be sure the full data is present and accounted for. For your future projects, make sure your dataset returns the number of samples that you expect.
First, let’s take a look at the basic directory structure of our data-unversioned/ part2/luna directory:
$ ls -1p data-unversioned/part2/luna/
subset0/
subset1/
...
subset9/
Next, let’s make sure we have one .mhd file and one .raw file for each series UID
$ ls -1p data-unversioned/part2/luna/subset0/
1.3.6.1.4.1.14519.5.2.1.6279.6001.105756658031515062000744821260.mhd
1.3.6.1.4.1.14519.5.2.1.6279.6001.105756658031515062000744821260.raw
1.3.6.1.4.1.14519.5.2.1.6279.6001.108197895896446896160048741492.mhd
1.3.6.1.4.1.14519.5.2.1.6279.6001.108197895896446896160048741492.raw
...
and that we have the overall correct number of files:
$ ls -1 data-unversioned/part2/luna/subset?/* | wc -l
1776
$ ls -1 data-unversioned/part2/luna/subset0/* | wc -l
178
...
$ ls -1 data-unversioned/part2/luna/subset9/* | wc -l
176
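For readers who are not on a Bash shell, the same checks can be sketched in Python. The function name check_luna_files is hypothetical and not part of the book's code; it only assumes the subsetN directory layout and the .mhd/.raw pairing shown in the listings.

```python
from pathlib import Path

def check_luna_files(root):
    """Return {subset name: (mhd_count, raw_count, unpaired series UIDs)}."""
    report = {}
    for subset_dir in sorted(Path(root).glob('subset?')):
        # Path.stem drops only the final suffix, leaving the dotted series UID
        mhd_uids = {p.stem for p in subset_dir.glob('*.mhd')}
        raw_uids = {p.stem for p in subset_dir.glob('*.raw')}
        # Symmetric difference: UIDs that have only one of the two files
        unpaired = sorted(mhd_uids ^ raw_uids)
        report[subset_dir.name] = (len(mhd_uids), len(raw_uids), unpaired)
    return report
```

Calling check_luna_files('data-unversioned/part2/luna') from the main code directory should return one entry per subset; an empty list of unpaired UIDs in every entry means each series has both of its files.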
If all of these seem right but things still aren’t working, ask on Manning LiveBook (https://livebook.manning.com/book/deep-learning-with-pytorch/chapter-11) and hopefully someone can help get things sorted out.
Working with deep learning involves a lot of waiting. We’re talking about real-world, sitting around, glancing at the clock on the wall, a watched pot never boils (but you could fry an egg on the GPU), straight up boredom.
The only thing worse than sitting and staring at a blinking cursor that hasn’t moved for over an hour is flooding your screen with this:
2020-01-01 10:00:00,056 INFO training batch 1234
2020-01-01 10:00:00,067 INFO training batch 1235
2020-01-01 10:00:00,077 INFO training batch 1236
2020-01-01 10:00:00,087 INFO training batch 1237
...etc...
At least the quietly blinking cursor doesn’t blow out your scrollback buffer!
Fundamentally, while doing all this waiting, we want to answer the question “Do I have time to go refill my water glass?” along with follow-up questions about having time to
Grab dinner in Paris
To answer these pressing questions, we’re going to use our enumerateWithEstimate
function. Usage looks like the following:
>>> for i, _ in enumerateWithEstimate(list(range(234)), "sleeping"):
...     time.sleep(random.random())
...
11:12:41,892 WARNING sleeping ----/234, starting
11:12:44,542 WARNING sleeping 4/234, done at 2020-01-01 11:15:16, 0:02:35
11:12:46,599 WARNING sleeping 8/234, done at 2020-01-01 11:14:59, 0:02:17
11:12:49,534 WARNING sleeping 16/234, done at 2020-01-01 11:14:33, 0:01:51
11:12:58,219 WARNING sleeping 32/234, done at 2020-01-01 11:14:41, 0:01:59
11:13:15,216 WARNING sleeping 64/234, done at 2020-01-01 11:14:43, 0:02:01
11:13:44,233 WARNING sleeping 128/234, done at 2020-01-01 11:14:35, 0:01:53
11:14:40,083 WARNING sleeping ----/234, done at 2020-01-01 11:14:40
>>>
That’s 8 lines of output for over 200 iterations lasting about 2 minutes. Even given the wide variance of random.random()
, the function had a pretty decent estimate after 16 iterations (in less than 10 seconds). For loop bodies with more constant timing, the estimates stabilize even more quickly.
In terms of behavior, enumerateWithEstimate
is almost identical to the standard enumerate
(the differences are things like the fact that our function returns a generator, whereas enumerate
returns a specialized <enumerate object at 0x...>
).
def enumerateWithEstimate(
        iter,
        desc_str,
        start_ndx=0,
        print_ndx=4,
        backoff=None,
        iter_len=None,
):
    for (current_ndx, item) in enumerate(iter):
        yield (current_ndx, item)
However, the side effects (logging, specifically) are what make the function interesting. Rather than get lost in the weeds trying to cover every detail of the implementation, we’ll point you to the function docstring (https://github.com/deep-learning-with-pytorch/dlwpt-code/blob/master/util/util.py#L143), where you can find information about the function parameters and desk-check the implementation if you’re interested.
Deep learning projects can be very time intensive. Knowing when something is expected to finish means you can use your time until then wisely, and it can also clue you in that something isn’t working properly (or an approach is unworkable) if the estimated time to completion is much longer than expected.
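For a feel of how such a helper can work, here is a deliberately simplified sketch. It is not the book's implementation: the name enumerate_with_estimate, the report callback, and the fixed doubling backoff are assumptions, and it leaves out the start_ndx handling entirely. It reports at indexes 4, 8, 16, and so on, extrapolating the finish time from the average time per item so far.

```python
import datetime
import time

def enumerate_with_estimate(iterable, desc_str, print_ndx=4, backoff=2,
                            iter_len=None, report=print):
    """Yield (index, item) like enumerate(), reporting an ETA now and then."""
    if iter_len is None:
        iter_len = len(iterable)
    report(f"{desc_str} ----/{iter_len}, starting")
    start_ts = time.time()
    for current_ndx, item in enumerate(iterable):
        yield current_ndx, item
        if current_ndx == print_ndx:
            # Extrapolate the total duration from the average time per item
            elapsed = time.time() - start_ts
            total = elapsed / (current_ndx + 1) * iter_len
            done_dt = datetime.datetime.fromtimestamp(start_ts + total)
            report(f"{desc_str} {current_ndx:4}/{iter_len}, "
                   f"done at {done_dt:%Y-%m-%d %H:%M:%S}")
            print_ndx *= backoff    # report less and less often
    report(f"{desc_str} ----/{iter_len}, done")
```

Because the reporting interval doubles each time, the number of log lines grows only logarithmically with the number of iterations, which is what keeps the scrollback buffer readable.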
Let’s take a look at some (abridged) output from our training script. As a reminder, we’ve run this with the command line python -m p2ch11.training
:
E1 Training ----/969, starting
...
E1 LunaTrainingApp
E1 trn 2.4576 loss, 99.7% correct
...
E1 val 0.0172 loss, 99.8% correct
...
After one epoch of training, both the training and validation set show at least 99.7% correct results. That’s an A+! Time for a round of high-fives, or at least a satisfied nod and smile. We just solved cancer! ... Right?
Let’s take a closer (less-abridged) look at that epoch 1 output:
E1 LunaTrainingApp
E1 trn 2.4576 loss, 99.7% correct,
E1 trn_neg 0.1936 loss, 99.9% correct (494289 of 494743)
E1 trn_pos 924.34 loss, 0.2% correct (3 of 1215)
...
E1 val 0.0172 loss, 99.8% correct,
E1 val_neg 0.0025 loss, 100.0% correct (494743 of 494743)
E1 val_pos 5.9768 loss, 0.0% correct (0 of 1215)
On the validation set, we’re getting non-nodules 100% correct, but the actual nodules are 100% wrong. The network is just classifying everything as not-a-nodule! The value 99.7% just means only approximately 0.3% of the samples are nodules.
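The arithmetic behind that observation is easy to check. This short sketch takes the training counts from the trn_neg and trn_pos lines of the epoch 1 output and compares the model's overall accuracy with what a classifier that always answers "not a nodule" would score; the variable names are ours.

```python
# Counts taken from the epoch 1 training lines in the log output
neg_correct, neg_count = 494289, 494743
pos_correct, pos_count = 3, 1215
total = neg_count + pos_count

overall = (neg_correct + pos_correct) / total * 100
pos_rate = pos_correct / pos_count * 100

# A model that labels every sample negative gets all negatives right
always_negative = neg_count / total * 100

print(f"overall {overall:.1f}%, positives found {pos_rate:.1f}%, "
      f"always-negative baseline {always_negative:.1f}%")
```

The trivial baseline scores slightly higher than our trained model, which is a strong hint that overall percent correct is the wrong number to watch for this dataset.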
After 10 epochs, the situation is only marginally better:
E10 LunaTrainingApp
E10 trn 0.0024 loss, 99.8% correct
E10 trn_neg 0.0000 loss, 100.0% correct
E10 trn_pos 0.9915 loss, 0.0% correct
E10 val 0.0025 loss, 99.7% correct
E10 val_neg 0.0000 loss, 100.0% correct
E10 val_pos 0.9929 loss, 0.0% correct
The classification output remains the same--none of the nodule (aka positive) samples are correctly identified. It’s interesting that we’re starting to see some decrease in the val_pos
loss, however, while not seeing a corresponding increase in the val_neg
loss. This implies that the network is learning something. Unfortunately, it’s learning very, very slowly.
Even worse, this particular failure mode is the most dangerous in the real world! We want to avoid the situation where we classify a tumor as an innocuous structure, because that would keep a patient from getting the evaluation and eventual treatment they might need. It’s important to understand the consequences of misclassification for all your projects, as that can have a large impact on how you design, train, and evaluate your model. We’ll discuss this more in the next chapter.
Before we get to that, however, we need to upgrade our tooling to make the results easier to understand. We’re sure you love to squint at columns of numbers as much as anyone, but pictures are worth a thousand words. Let’s graph some of these metrics.
We’re going to use a tool called TensorBoard as a quick and easy way to get our training metrics out of our training loop and into some pretty graphs. This will allow us to follow the trends of those metrics, rather than only look at the instantaneous values per epoch. It gets much, much easier to know whether a value is an outlier or just the latest in a trend when you’re looking at a visual representation.
“Hey, wait,” you might be thinking, “isn’t TensorBoard part of the TensorFlow project? What’s it doing here in my PyTorch book?”
Well, yes, it is part of another deep learning framework, but our philosophy is “use what works.” There’s no reason to restrict ourselves by not using a tool just because it’s bundled with another project we’re not using. Both the PyTorch and TensorBoard devs agree, because they collaborated to add official support for TensorBoard into PyTorch. TensorBoard is great, and it’s got some easy-to-use PyTorch APIs that let us hook data from just about anywhere into it for quick and easy display. If you stick with deep learning, you’ll probably be seeing (and using) a lot of TensorBoard.
In fact, if you’ve been running the chapter examples, you should already have some data on disk ready and waiting to be displayed. Let’s see how to run TensorBoard, and look at what it can show us.
By default, our training script will write metrics data to the runs/ subdirectory. If you list the directory content, you might see something like this during your Bash shell session:
$ ls -lA runs/p2ch11/
total 24
drwxrwxr-x 2 elis elis 4096 Sep 15 13:22 2020-01-01_12.55.27-trn-dlwpt/    ❶
drwxrwxr-x 2 elis elis 4096 Sep 15 13:22 2020-01-01_12.55.27-val-dlwpt/    ❶
drwxrwxr-x 2 elis elis 4096 Sep 15 15:14 2020-01-01_13.31.23-trn-dwlpt/    ❷
drwxrwxr-x 2 elis elis 4096 Sep 15 15:14 2020-01-01_13.31.23-val-dwlpt/    ❷
❶ The single-epoch run from earlier
❷ The more recent 10-epoch training run
To get the tensorboard
program, install the tensorflow
(https://pypi.org/project/tensorflow) Python package. Since we’re not actually going to use TensorFlow proper, it’s fine if you install the default CPU-only package. If you have another version of TensorBoard installed already, using that is fine too. Either make sure the appropriate directory is on your path, or invoke it with ../path/to/tensorboard --logdir runs/
. It doesn’t really matter where you invoke it from, as long as you use the --logdir
argument to point it at where your data is stored. It’s a good idea to segregate your data into separate folders, as TensorBoard can get a bit unwieldy once you get over 10 or 20 experiments. You’ll have to decide the best way to do that for each project as you go. Don’t be afraid to move data around after the fact if you need to.
$ tensorboard --logdir runs/
2020-01-01 12:13:16.163044: I tensorflow/core/platform/cpu_feature_guard.cc:140]❶
Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
TensorBoard 1.14.0 at http://localhost:6006/ (Press CTRL+C to quit)
❶ These messages might be different or not present for you; that’s fine.
Once that’s done, you should be able to point your browser at http://localhost:6006 and see the main dashboard. Figure 11.10 shows us what that looks like.
Along the top of the browser window, you should see the orange header. The right side of the header has the typical widgets for settings, a link to the GitHub repository, and the like. We can ignore those for now. The left side of the header has items for the data types we’ve provided. You should have at least the following:
You might see Distributions as well as the second UI tab (to the right of Scalars in figure 11.10). We won’t use or discuss those here. Make sure you’ve selected Scalars by clicking it.
On the left is a set of controls for display options, as well as a list of runs that are present. The smoothing option can be useful if you have particularly noisy data; it will calm things down so that you can pick out the overall trend. The original non-smoothed data will still be visible in the background as a faded line in the same color. Figure 11.11 shows this, although it might be difficult to discern when printed in black and white.
Depending on how many times you’ve run the training script, you might have multiple runs to select from. With too many runs being rendered, the graphs can get overly noisy, so don’t hesitate to deselect runs that aren’t of interest at the moment.
If you want to permanently remove a run, the data can be deleted from disk while TensorBoard is running. You can do this to get rid of experiments that crashed, had bugs, didn’t converge, or are so old they’re no longer interesting. The number of runs can grow pretty quickly, so it can be helpful to prune it often and to rename runs or move runs that are particularly interesting to a more permanent directory so they don’t get deleted by accident. To remove both the train
and validation
runs, execute the following (after changing the chapter, date, and time to match the run you want to remove):
$ rm -rf runs/p2ch11/2020-01-01_12.02.15-*
Keep in mind that removing runs will cause the runs that are later in the list to move up, which will result in them being assigned new colors.
OK, let’s get to the point of TensorBoard: the pretty graphs! The main part of the screen should be filled with data from gathering training and validation metrics, as shown in figure 11.12.
That’s much easier to parse and absorb than E1 trn_pos 924.34 loss, 0.2% correct (3 of 1215)
! Although we’re going to save discussion of what these graphs are telling us for section 11.10, now would be a good time to make sure it’s clear what these numbers correspond to from our training program. Take a moment to cross-reference the numbers you get by mousing over the lines with the numbers spit out by training.py during the same training run. You should see a direct correspondence between the Value column of the tooltip and the values printed during training. Once you’re comfortable and confident that you understand exactly what TensorBoard is showing you, let’s move on and discuss how to get these numbers to appear in the first place.
We are going to use the torch.utils.tensorboard
module to write data in a format that TensorBoard will consume. This will allow us to write metrics for this and any other project quickly and easily. TensorBoard supports a mix of NumPy arrays and PyTorch tensors, but since we don’t have any reason to put our data into NumPy arrays, we’ll use PyTorch tensors exclusively.
The first thing we need to do is create our SummaryWriter
objects (which we imported from torch.utils.tensorboard
). The only parameter we’re going to pass in is log_dir
, which we will initialize to something like runs/p2ch11/2020-01-01_12.55.27-trn-dlwpt
. We can add a comment argument to our training script to change dlwpt
to something more informative; use python -m p2ch11.training --help
for more information.
We create two writers, one each for the training and validation runs. Those writers will be reused for every epoch. When the SummaryWriter
class gets initialized, it also creates the log_dir
directories as a side effect. These directories show up in TensorBoard and can clutter the UI with empty runs if the training script crashes before any data gets written, which can be common when you’re experimenting with something. To avoid writing too many empty junk runs, we wait to instantiate the SummaryWriter
objects until we’re ready to write data for the first time. The initTensorboardWriters function that does this is called from logMetrics()
.
def initTensorboardWriters(self):
    if self.trn_writer is None:
        log_dir = os.path.join('runs', self.cli_args.tb_prefix, self.time_str)

        self.trn_writer = SummaryWriter(
            log_dir=log_dir + '-trn_cls-' + self.cli_args.comment)
        self.val_writer = SummaryWriter(
            log_dir=log_dir + '-val_cls-' + self.cli_args.comment)
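The effect of deferring initialization can be demonstrated without TensorBoard at all. In this sketch, the class LazyRunDirs is a hypothetical stand-in for the training app, and os.makedirs stands in for the directory creation that SummaryWriter performs as a side effect. Nothing appears on disk until the first call that logs metrics.

```python
import os
import tempfile

class LazyRunDirs:
    """Stand-in for an app that creates its run directories on first use."""

    def __init__(self, base_dir, time_str, comment):
        self.prefix = os.path.join(base_dir, time_str)
        self.comment = comment
        self.trn_dir = None   # plays the role of self.trn_writer
        self.val_dir = None   # plays the role of self.val_writer

    def init_writers(self):
        if self.trn_dir is None:
            self.trn_dir = self.prefix + '-trn_cls-' + self.comment
            self.val_dir = self.prefix + '-val_cls-' + self.comment
            # SummaryWriter would create these directories as a side effect
            os.makedirs(self.trn_dir)
            os.makedirs(self.val_dir)

    def log_metrics(self, mode_str):
        self.init_writers()   # nothing touches the disk before this point
        return self.trn_dir if mode_str == 'trn' else self.val_dir

with tempfile.TemporaryDirectory() as base_dir:
    app = LazyRunDirs(base_dir, '2020-01-01_12.55.27', 'dlwpt')
    before = os.listdir(base_dir)           # a crash here leaves no junk run
    app.log_metrics('trn')
    after = sorted(os.listdir(base_dir))    # both run directories now exist
print(before, after)
```

The None check also makes the initialization safe to call on every logging pass, since only the first call has any effect.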
If you recall, the first epoch is kind of a mess, with the early output in the training loop being essentially random. When we save the metrics from that first batch, those random results end up skewing things a bit. Recall from figure 11.11 that TensorBoard has smoothing to remove noise from the trend lines, which helps somewhat.
Another approach could be to skip metrics entirely for the first epoch’s training data, although our model trains quickly enough that it’s still useful to see the first epoch’s results. Feel free to change this behavior as you see fit; the rest of part 2 will continue with this pattern of including the first, noisy training epoch.
tip If you end up doing a lot of experiments that result in exceptions or killing the training script relatively quickly, you might be left with a number of junk runs cluttering up your runs/ directory. Don’t be afraid to clean those out!
Writing scalars is straightforward. We can take the metrics_dict
we’ve already constructed and pass in each key/value pair to the writer.add_scalar
method. The torch.utils.tensorboard.SummaryWriter
class has the add_scalar
method (http://mng.bz/RAqj) with the following signature.
def add_scalar(self, tag, scalar_value, global_step=None, walltime=None):
    # ...
The tag
parameter tells TensorBoard which graph we’re adding values to, and the scalar_value
parameter is our data point’s Y-axis value. The global_step
parameter acts as the X-axis value.
Recall that we updated the totalTrainingSamples_count
variable inside the doTraining
function. We’ll use totalTrainingSamples_count
as the X-axis of our TensorBoard plots by passing it in as the global_step
parameter. Here’s what that looks like in our code.
for key, value in metrics_dict.items():
    writer.add_scalar(key, value, self.totalTrainingSamples_count)
Note that the slashes in our key names (such as 'loss/all'
) result in TensorBoard grouping the charts by the substring before the '/'
.
The documentation suggests that we should be passing in the epoch number as the global_step
parameter, but that results in some complications. By using the number of training samples presented to the network, we can do things like change the number of samples per epoch and still be able to compare those future graphs to the ones we’re creating now. Saying that a model trains in half the number of epochs is meaningless if each epoch takes four times as long! Keep in mind that this might not be standard practice, however; expect to see a variety of values used for the global step.
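Two small calculations illustrate these points. The first groups our metric keys by the part before the slash, mirroring how the charts end up organized. The second compares the X-axis positions of two hypothetical runs, under the assumption that the running sample count grows by the number of training samples each epoch; the quarter-size epoch is invented for the comparison.

```python
from collections import defaultdict

metrics_keys = ['loss/all', 'loss/neg', 'loss/pos',
                'correct/all', 'correct/neg', 'correct/pos']

# Group chart names by the substring before the '/'
groups = defaultdict(list)
for key in metrics_keys:
    prefix, _, name = key.partition('/')
    groups[prefix].append(name)

def steps_after_each_epoch(samples_per_epoch, epochs):
    """global_step values when we pass the running count of training samples."""
    return [samples_per_epoch * epoch_ndx for epoch_ndx in range(1, epochs + 1)]

full_run = steps_after_each_epoch(495958, 2)          # two full-size epochs
small_run = steps_after_each_epoch(495958 // 4, 8)    # eight quarter-size epochs

print(dict(groups))
print(full_run[-1], small_run[-1])
```

Both runs finish at nearly the same X position even though one reports 2 epochs and the other 8, so their curves can be laid over each other and compared directly.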
Our model is clearly learning something--the loss trend lines are consistent as epochs increase, and the results are repeatable. There is a disconnect, however, between what the model is learning and what we want it to learn. What’s going on? Let’s use a quick metaphor to illustrate the problem.
Imagine that a professor gives students a final exam consisting of 100 True/False questions. The students have access to previous versions of this professor’s tests going back 30 years, and every time there are only one or two questions with a True answer. The other 98 or 99 are False, every time.
Assuming that the grades aren’t on a curve and instead have a typical scale of 90% correct or better being an A, and so on, it is trivial to get an A+: just mark every question as False! Let’s imagine that this year, there is only one True answer. A student like the one on the left in figure 11.13 who mindlessly marked every answer as False would get a 99% on the final but wouldn’t really demonstrate that they had learned anything (beyond how to cram from old tests, of course). That’s basically what our model is doing right now.
Contrast that with a student like the one on the right who also got 99% of the questions correct, but did so by answering two questions with True. Intuition tells us that the student on the right in figure 11.13 probably has a much better grasp of the material than the all-False student. Finding the one True question while only getting one answer wrong is pretty difficult! Unfortunately, neither our students’ grades nor our model’s grading scheme reflect this gut feeling.
We have a similar situation, where 99.7% of the answers to “Is this candidate a nodule?” are “Nope.” Our model is taking the easy way out and answering False on every question.
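The arithmetic behind this is easy to check with a small sketch. The label counts below are made up to match the 99.7% figure (3 nodules among 1,000 candidates); they are not the real dataset counts, and the "model" is just a list of constant answers.

```python
# Hypothetical label set: 1,000 candidates, 3 of which are nodules (0.3% positive).
labels = [True] * 3 + [False] * 997

# A "model" that answers "not a nodule" for every candidate.
predictions = [False] * len(labels)

# Count how many answers match the labels.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Count how many actual nodules were flagged as nodules.
nodules_found = sum(p and y for p, y in zip(predictions, labels))

print(f"accuracy: {accuracy:.1%}, nodules found: {nodules_found}")
```

Under these assumed counts, the always-False strategy scores 99.7% while finding none of the nodules, which is why the percentage of correct answers alone tells us so little here.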
Still, if we look back at our model’s numbers more closely, the loss on the training and validation sets is decreasing! The fact that we’re getting any traction at all on the cancer-detection problem should give us hope. It will be the work of the next chapter to realize this potential. We’ll start chapter 12 by introducing some new, relevant terminology, and then we’ll come up with a better grading scheme that doesn’t lend itself to being gamed quite as easily as what we’ve done so far.
We’ve come a long way this chapter--we now have a model and a training loop, and are able to consume the data we produced in the last chapter. Our metrics are being logged to the console as well as graphed visually.
While our results aren’t usable yet, we’re actually closer than it might seem. In chapter 12, we will improve the metrics we’re using to track our progress, and use them to inform the changes we need to make to get our model producing reasonable results.
Implement a program that iterates through a LunaDataset instance by wrapping it in a DataLoader instance, while timing how long it takes to do so. Compare these times to the times from the exercises in chapter 10. Be aware of the state of the cache when running the script.
Reverse the sort order of noduleInfo_list. How does that change the behavior of the model after one epoch of training?
Change logMetrics to alter the naming scheme of the runs and keys that are used in TensorBoard.
Data loaders can be used to load data from arbitrary datasets in multiple processes. This allows otherwise-idle CPU resources to be devoted to preparing data to feed to the GPU.
Data loaders load multiple samples from a dataset and collate them into a batch. PyTorch models expect to process batches of data, not individual samples.
Data loaders can be used to manipulate arbitrary datasets by changing the relative frequency of individual samples. This allows for “after-market” tweaks to a dataset, though it might make more sense to change the dataset implementation directly.
We will use PyTorch’s torch.optim.SGD (stochastic gradient descent) optimizer with a learning rate of 0.001 and a momentum of 0.99 for the majority of part 2. These values are also reasonable defaults for many deep learning projects.
Our initial model for classification will be very similar to the model we used in chapter 8. This lets us get started with a model that we have reason to believe will be effective. We can revisit the model design if we think it’s the thing preventing our project from performing better.
The choice of metrics that we monitor during training is important. It is easy to accidentally pick metrics that are misleading about how the model is performing. Using the overall percentage of samples classified correctly is not useful for our data. Chapter 12 will detail how to evaluate and choose better metrics.
TensorBoard can be used to display a wide range of metrics visually. This makes it much easier to consume certain forms of information (particularly trend data) as they change per epoch of training.
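As a rough illustration of the data-loader and optimizer points in this summary, the following sketch batches a toy dataset and takes a few SGD steps with the learning rate and momentum quoted above. The random tensors, the `toy_` names, and the single linear layer are placeholders assumed for this example; they stand in for the chapter's LunaDataset and classification model and are not the authors' code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 10 samples of 4 features each, with binary labels.
features = torch.randn(10, 4)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
toy_ds = TensorDataset(features, labels)

# The DataLoader collates individual samples into batches.
# num_workers=0 loads in the main process; larger values use worker processes.
toy_dl = DataLoader(toy_ds, batch_size=4, shuffle=False, num_workers=0)

# Record the shape of each input batch the loader produces.
batch_shapes = [tuple(x.shape) for x, _ in toy_dl]
print(batch_shapes)

# Placeholder model, not the chapter's classifier.
toy_model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.001, momentum=0.99)
loss_fn = nn.CrossEntropyLoss()

# One pass over the batches: forward, loss, backward, parameter update.
for x, y in toy_dl:
    optimizer.zero_grad()
    loss = loss_fn(toy_model(x), y)
    loss.backward()
    optimizer.step()
```

With 10 samples and a batch size of 4, the loader yields two full batches and one partial batch of 2, so the model always receives a leading batch dimension even for the leftover samples.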
1. Any shell, really, but if you’re using a non-Bash shell, you already knew that.
2. Remember that we’re actually working in 3D, despite the 2D figure.
3. Which is why there’s an exercise to experiment with both in the next chapter!
4. There are numerical stability benefits for doing so. Propagating gradients accurately through an exponential calculated using 32-bit floating-point numbers can be problematic.
5. If getting dinner in France doesn’t involve an airport, feel free to substitute “Paris, Texas” to make the joke work; https://en.wikipedia.org/wiki/Paris_(disambiguation).
6. If you’re running training on a different computer from your browser, you’ll need to replace localhost with the appropriate hostname or IP address.