Chapter 8. Deep Learning for Medicine

As we saw in the previous chapter, the ability to extract meaningful information from visual datasets can prove useful for analyzing microscopy images. This capability for handling visual data is similarly useful for medical applications. Much of modern medicine requires doctors to critically analyze medical scans. Deep learning tools could potentially make this analysis easier and faster (but perhaps less interpretable).

Let’s learn more. We’ll start by giving you a brief overview of earlier computational techniques for medicine. We’ll discuss some of the limitations of such methods, then we’ll survey the current set of deep learning–powered techniques for medicine. We’ll explain how these new techniques might allow us to bypass some of the fundamental limitations of older techniques. We’ll end the chapter with a discussion of some of the ethical considerations of applying deep learning to medicine.

Computer-Aided Diagnostics

Designing computer-aided diagnostic systems has been a major focus of AI research since the advent of the field. The earliest attempts at this1 used hand-curated knowledge bases. In these systems, expert doctors would be solicited to write down causal inference rules (see, for example, Figure 8-1).

These systems offered basic support for handling uncertainty through “certainty factors” attached to individual rules.

Figure 8-1. MYCIN was an early expert system used to diagnose bacterial infections. This is an example of a MYCIN rule for inference (adapted from the University of Surrey).

These rules were combined using a logical engine. A number of efficient inference techniques were designed that could effectively combine large databases of rules. Such systems were traditionally called “expert systems.”
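To make the idea concrete, here is a minimal sketch of a forward-chaining rule engine of the kind that powered expert systems. The rules and facts are invented for illustration and are far simpler than real MYCIN rules:

# A toy forward-chaining inference engine in the spirit of early expert
# systems. Each rule maps a set of required facts to a conclusion.
# The rules and facts below are invented for illustration only.
rules = [
    ({"gram_negative", "rod_shaped", "anaerobic"}, "likely_bacteroides"),
    ({"fever", "elevated_white_cell_count"}, "likely_infection"),
]

def infer(facts, rules):
    """Repeatedly apply rules until no new conclusions can be drawn."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(infer({"fever", "elevated_white_cell_count"}, rules))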

What Happened to Expert Systems?

Although expert systems achieved some notable successes, the construction of these systems required considerable effort. Rules had to be painstakingly solicited from experts and curated by trained “knowledge engineers.” While some expert systems achieved striking results in limited domains, on the whole they were too brittle to use widely. That said, expert systems had a strong impact on much of computer science, and a host of modern technologies (SQL, XML, Bayesian networks, and more) draw inspiration from expert system research.

If you’re a developer, it’s good to pause and consider this. Although expert systems were once a blindingly hot technology, they currently exist primarily as a historical curiosity. It’s very likely that most of today’s hot technologies will one day end up in the curiosity heap of computer science history. This is a feature, not a bug, of computer science. The field reinvents itself rapidly, so we can trust that the replacements for today’s technologies will tick some crucial boxes that today’s tools can’t. At the same time, as with expert systems, we can rest assured that the algorithmic fundamentals of today’s technology will live on in tomorrow’s tools.

Expert systems for medicine had a good run. Some of them were deployed widely and adopted internationally as well.2 However, these systems failed to achieve significant traction with everyday doctors and nurses. One problem was that they were very finicky and hard to use. They also required their users to be able to pass in patient information in a highly structured format. Given that computers had barely penetrated standard clinics at the time, requiring highly specialized training for doctors and nurses proved to be too big an ask.

Probabilistic Diagnoses with Bayesian Networks

Another major problem with expert system tools was that they could only provide deterministic predictions, which left little room for uncertainty. What if the doctor was seeing a tricky patient whose diagnosis wasn’t clear? For a time, it seemed that if expert systems could be modified to account for uncertainty, they would finally achieve broad success.

This basic insight triggered a host of work on Bayesian networks for clinical diagnoses. (One of the authors of this book spent a year working on such a system as an undergrad.) However, these systems suffered from many of the same limitations as the expert systems. It was still necessary to solicit structural knowledge from doctors, and designers of Bayesian clinical networks faced the additional challenge of soliciting meaningful probabilities from doctors. This process added significant overhead to the process of adoption.

In addition, training a Bayesian network can be complicated: different types of Bayesian networks require different learning algorithms. Contrast this with deep learning, where gradient descent techniques work on almost any network you can construct. Robustness of learning is often what enables widespread adoption. (See Figure 8-2 for a simple example of a Bayesian network.)

Figure 8-2. A simple example of a Bayesian network for inferring whether the grass is wet at a given spot. (Source: Wikimedia.)
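To see how a network like the one in Figure 8-2 supports probabilistic reasoning, here is a minimal sketch of exact inference by enumeration over a rain/sprinkler/wet-grass network. The probability tables are illustrative values chosen for this sketch, not numbers taken from the figure:

# Exact inference by enumeration over a tiny rain/sprinkler/wet-grass
# network. The probability tables are illustrative.
from itertools import product

P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},   # P(sprinkler | rain)
               False: {True: 0.4, False: 0.6}}
P_wet = {(True, True): 0.99, (True, False): 0.9,  # P(wet | sprinkler, rain)
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    p = P_rain[rain] * P_sprinkler[rain][sprinkler]
    p_wet = P_wet[(sprinkler, rain)]
    return p * (p_wet if wet else 1.0 - p_wet)

# P(rain | grass is wet), computed by summing out the sprinkler variable.
numerator = sum(joint(True, s, True) for s in (True, False))
evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(numerator / evidence)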

Ease of Use Drives Adoption

Expert systems and Bayesian networks both failed to win broad adoption. At least part of the reason for this failure was that both these systems had pretty terrible developer experiences. From the developer’s standpoint, designing either a Bayesian network or an expert system required constantly keeping a doctor in the development loop. In addition, the effectiveness of the system depended critically on the ability of the development team to extract valuable insights from doctors.

Contrast this with deep networks. For a given data type (images, molecules, text, etc.) and a given learning task, there is a standard set of metrics at hand. The developer needs only to follow best statistical practices (as taught by this or another book) in order to build a functional system. The dependence on expert knowledge is considerably reduced. This gain in simplicity no doubt accounts for part of the reason deep networks have gained much broader adoption.

Electronic Health Record Data

Traditionally, hospitals maintained paper charts for their patients. These charts would record the tests, medications, and other treatments of the patient, allowing doctors to track the patient’s health with a quick glance at the chart. Unfortunately, paper health records had a host of difficulties associated with them. Transferring records between hospitals required a major amount of work, and it wasn’t easy to index or search paper health record data.

For this reason, there has been a major push over the last few decades in a number of countries to move from paper records to electronic health records (EHRs). In the US, the passage of the Affordable Care Act significantly accelerated this shift, and most major US health providers now store their patient records in EHR systems.

The broad adoption of EHR systems has spurred a boom in research on machine learning systems that work with EHR data. These systems aim to use large datasets of patient records to train models that will be capable of predicting things such as patient outcomes or risks. In many ways, these EHR models are the intellectual successors of the expert systems and Bayesian networks we just learned about. Like these earlier systems, EHR models seek to aid the process of diagnosis. However, while earlier systems sought to aid doctors in making real-time diagnoses, these newer systems content themselves (mostly) with working on the backend.

A number of projects have attempted to learn robust models from EHR data. While there have been some notable successes, learning on EHR data remains challenging for practitioners. Due to privacy concerns, there aren’t many large public EHR datasets available. As a result, only a small group of elite researchers have been able to design these systems thus far. In addition, EHR data tends to be very messy. Since human doctors and nurses manually enter information, most EHR data suffers from missing fields and all sorts of different conventions. Creating robust models that deal with the missing data has proven challenging.

ICD-10 Codes

ICD-10 is a set of “codes” for patient diseases and symptoms. These standard codes have found broad adoption in recent years because they allow insurers and governmental agencies to set standard practices, treatments, and treatment prices for diseases.

The ICD-10 codes “quantize” (make discrete) the high-dimensional continuous space of human disease. By standardizing, they allow doctors to compare and group patients. It’s worth noting that for this reason, such codes will likely prove relevant to developers of EHR systems and models. If you’re designing the data warehouse for a new EHR system, make sure you think about where you’re going to put your codes!
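As a concrete illustration, here is a minimal sketch of how ICD-10 codes might be attached to encounter records. The record layout and helper function are invented for illustration, though the codes shown are real ICD-10 codes:

# A minimal sketch of attaching ICD-10 codes to patient encounter records.
# The record layout is invented for illustration; the codes themselves are
# real ICD-10 codes (E11.9: type 2 diabetes without complications,
# I10: essential hypertension).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Encounter:
    patient_id: str
    date: str
    icd10_codes: List[str] = field(default_factory=list)

visit = Encounter(patient_id="12345", date="2019-03-01",
                  icd10_codes=["E11.9", "I10"])

# Grouping patients by code makes it easy to build cohorts for modeling.
def has_code(encounter, code):
    return code in encounter.icd10_codes

print(has_code(visit, "E11.9"))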

Fast Healthcare Interoperability Resources (FHIR)

The Fast Healthcare Interoperability Resources (FHIR) specification was developed to represent clinical data in a standard and flexible format.3 Recent work from Google demonstrated how raw EHR data can be transformed into FHIR format automatically.4 The use of this format enables the development of standard deep architectures that can be applied to arbitrary EHR data, which means standard open source tools for this data can be used in a plug-and-play fashion. This work is still in early stages, but it represents exciting progress for the field. Although standardization may appear boring at first blush, it’s the foundation for future advances since it means that larger datasets can be worked with productively.
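To give a feel for the format, here is a simplified, abridged sketch of a FHIR-style Condition resource expressed as JSON. This is not a complete or validated FHIR document; consult the FHIR specification for the actual schema:

# A simplified sketch of what a FHIR "Condition" resource looks like as
# JSON. This is abridged and not a complete, validated FHIR document.
import json

condition = {
    "resourceType": "Condition",
    "subject": {"reference": "Patient/12345"},
    "code": {
        "coding": [{
            "system": "http://hl7.org/fhir/sid/icd-10",
            "code": "E11.9",
            "display": "Type 2 diabetes mellitus without complications"
        }]
    },
}

print(json.dumps(condition, indent=2))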

However, this state of affairs is starting to change. Improved tools, both for preprocessing and for learning, have started to enable effective learning to occur on EHR systems. The DeepPatient system trains a denoising autoencoder on patient medical records to create a patient representation, which it then uses to predict patient outcomes.5 In this system, a patient’s record is transformed from a set of unordered textual information into a vector. This strategy of transforming disparate data types into vectors has been widely successful throughout deep learning and seems poised to offer meaningful improvements in EHR systems as well. A number of models based on EHR systems have sprouted in the literature, many of which are starting to incorporate the latest tools of deep learning, such as recurrent networks or reinforcement learning. While models with these latest bells and whistles are still maturing, they’re very exciting and provide pointers to where the field is likely to head over the next few years.
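Here is a minimal Keras sketch of the core idea behind such representations: corrupt each input record, train a network to reconstruct the original, and keep the bottleneck activations as the patient vector. The data is random stand-in data, and this is not the published DeepPatient architecture:

# A minimal denoising autoencoder sketch. Random binary features stand in
# for a featurized EHR matrix of 1,000 "patients" x 500 indicator features.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

X = np.random.binomial(1, 0.05, size=(1000, 500)).astype("float32")

inputs = layers.Input(shape=(500,))
corrupted = layers.Dropout(0.3)(inputs)               # randomly mask inputs during training
hidden = layers.Dense(128, activation="relu")(corrupted)
code = layers.Dense(32, activation="relu")(hidden)    # the learned patient representation
decoded = layers.Dense(128, activation="relu")(code)
outputs = layers.Dense(500, activation="sigmoid")(decoded)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(X, X, epochs=3, batch_size=64)

# The encoder half maps each patient record to a dense vector that a
# downstream classifier could use to predict outcomes.
encoder = Model(inputs, code)
patient_vectors = encoder.predict(X)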

What About Unsupervised Learning?

Through most of this book, we’ve primarily demonstrated supervised learning methods. There’s also a whole class of “unsupervised” learning methods that don’t share the same dependency on supervised training data. We haven’t really introduced unsupervised learning as a concept yet, but the basic idea is that we no longer have labels associated with data points. For example, imagine we have a set of EHR records but no patient outcome data. What can we do?

The simplest answer is that we can cluster the records. For a toy example, imagine we have “twin” patients whose EHR records are identical. It seems reasonable to predict that the outcomes of these two patients will be similar. Unsupervised learning techniques such as k-means or autoencoders implement somewhat more sophisticated forms of this basic intuition. You’ll see a sophisticated example of an unsupervised algorithm later in Chapter 9.
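As a quick illustration, here is a minimal sketch of clustering patient vectors with k-means using scikit-learn; the vectors are random stand-ins for real embeddings:

# Clustering patient vectors without labels, using scikit-learn's k-means
# on random stand-in data (e.g., autoencoder embeddings).
import numpy as np
from sklearn.cluster import KMeans

patient_vectors = np.random.rand(200, 32)

kmeans = KMeans(n_clusters=5, random_state=0).fit(patient_vectors)
print(kmeans.labels_[:10])   # cluster assignment for the first 10 patients

# Patients in the same cluster have similar records, so outcomes observed
# for some members can hint at likely outcomes for the others.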

Unsupervised techniques can yield some compelling insights, but these methods can be hit-or-miss at times. While there have been some compelling use cases, such as DeepPatient, on the whole unsupervised methods are still finicky enough that they have yet to see wide usage. If you’re a researcher, though, working on ways to stabilize unsupervised learning remains a compelling (and challenging) open problem.

The Dangers of Large Patient EHR Databases?

A number of large institutions are moving toward having all their patients in EHR systems. What happens when these large datasets are standardized (perhaps in a format such as FHIR) and made interoperable? On the positive side, it might then be possible to support applications such as searching for patients who have a particular disease phenotype. Such focused search capabilities may help doctors find effective treatments, especially for patients with rare diseases.

However, it doesn’t take much imagination to see how large patient databases could be put to malicious use. For example, insurers could use patient outcome systems to preemptively deny insurance to higher-risk patients, or top surgeons seeking to maintain high patient survival rates could avoid operating on patients that the system marks as high risk. How do we guard against these dangers?

Many of the questions that machine learning systems raise can’t be addressed with the tools of machine learning. Rather, it’s likely that the answers to these questions will rest with legislation that forbids predatory behavior on the part of doctors, insurers, and others.

Do EHRs Really Help Doctors?

While EHRs obviously aid in the design of learning algorithms, there’s less compelling evidence that EHRs actually improve life for doctors. Part of the challenge is that today’s EHRs require significant manual data entry on the part of doctors. For patients, this has created a now-familiar dynamic in which the doctor spends the majority of a consultation looking at the computer rather than at the patient.

This state of affairs has left both patients and doctors unhappy.6 Doctors feel burned out because they spend the majority of their time doing clerical data entry rather than patient care, and patients feel ignored. One hope for the next generation of deep learning–powered systems is that this imbalance could be improved by future products.

Note, however, that there’s a real chance that the next generation of deep learning tools could prove equally unfriendly and unhelpful for doctors. The designers of EHR systems didn’t aim to make unfriendly systems either.

Deep Radiology

Radiology is the science of using medical scans to diagnose disease. There are a variety of different scans that doctors use, such as MRI scans, ultrasounds, X-rays, and CT scans. For each of these, the challenge is to diagnose the state of the patient from the given scan imagery. This looks like a challenge well suited for convolutional learning methods. As we have seen in the previous chapters, deep learning methods are capable of learning sophisticated functions from image data. Much of modern radiology (the mechanical parts at least) consists of classifying and handling complex medical image data. The use of scans has a long and storied history in medicine (see Figure 8-4 for an example of an early X-ray).

In this section, we’ll quickly introduce a number of different types of scans and briefly cover some deep learning applications. Many of these applications are qualitatively similar. They start by obtaining a large enough dataset of scans from a medical institution. These scans are used to train a convolutional architecture (see Figure 8-3). Often, the architecture is a standard VGG or ResNet architecture, but sometimes with some tweaks to the core structure. The trained model often (at least according to perhaps naive statistics) has strong performance on the task in question.

Figure 8-3. Some standard convolutional architectures (VGG-19 and ResNet-34; the number indicates the depth of the network in layers). These architectures are standard for image tasks and are commonly used for medical applications.
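As a rough sketch of how such architectures are typically applied, the following uses a pretrained ResNet50 backbone from Keras (standing in for the VGG/ResNet variants in Figure 8-3) with a small classification head; the input size and number of classes are placeholders:

# A minimal transfer-learning sketch: a pretrained backbone plus a new
# classification head for a hypothetical scan classification task.
import tensorflow as tf

base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False   # start by training only the new classifier head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),   # e.g., 4 severity grades
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])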

These advances have led to some perhaps inflated expectations. Some high-profile AI scientists (most notably Geoff Hinton) have commented that deep learning for radiology will advance so far that it will no longer be worth training new radiologists in the near future.7 Is this actually right? There has been a string of recent advances in which deep learning systems have achieved what appears to be near-human performance. However, these results come with many caveats, and these systems are often brittle in ways that are not yet well understood.

Our opinion is that the risk of direct one-to-one replacement of doctors remains low, but there is a real risk of systematic displacement. What does this mean? New startups are working to invent new business models in which deep learning systems do the large majority of scan analysis, with only a few doctors remaining in the loop.

Is Deep Learning Actually Learning Medicine?

Significant analysis has gone into scrutinizing what deep models actually learn in medical imagery. Unfortunately, in many cases, it looks like the deep models succeed in picking up nonmedical factors in the imagery. For example, the model might implicitly learn to identify the scanning center in which a particular medical scan was taken. Since particular centers are often used for more serious or less serious patients, the model might at first glance look as though it had succeeded in learning useful medicine, but would in fact be generally useless.

What can be done in such cases? The jury is still out on the question, but a couple of early approaches are emerging. The first is to use the growing literature on model interpretability to scrutinize carefully what the model is learning. In Chapter 10, we will delve into a number of methods for model interpretability.

The other approach is to conduct prospective trials deploying the models in clinics. Prospective trials remain the gold standard for testing proposed medical interventions, and it is likely they will remain so for deep learning techniques as well.

X-Ray Scans and CT Scans

Informally, an X-ray scan—radiography, if we’re being precise—involves using X-rays to view some internal structure in the body (Figure 8-4). Computed tomography (CT) scans are a variant of X-ray scans in which the X-ray source and detectors rotate around the object being imaged, allowing for 3D images.

Figure 8-4. The first medical X-ray taken by Wilhelm Röntgen of his wife Anna Bertha Ludwig’s hand. The science of X-rays has come a long way since this first photograph, and there’s a chance that deep learning will take it much further yet!

A common misconception is that X-ray scans are only capable of imaging “hard” objects such as bones. This turns out to be quite false. CT scans are routinely used to image tissues in the body such as the brain (Figure 8-5), and backscatter X-rays are often used in airports to image travelers at security checkpoints. Mammograms use low-energy X-rays to scan breast tissue as well.

Figure 8-5. CT scan of a human brain from bottom to top. Note the capacity of CT scans to provide 3D information. (Source: Wikimedia.)

It’s worth noting that the ionizing radiation used in X-ray scans is linked to an increased risk of cancer, so a common goal is to minimize patients’ radiation exposure by limiting the number of scans required. This risk is more marked for CT scans, which must expose the patient for longer periods in order to gather sufficient data. A wide variety of signal processing algorithms have been designed to reconstruct CT images from fewer measurements. Some exciting recent work has started to use deep learning to further tune this reconstruction process so that even less exposure is required.

However, most applications of deep learning in this space focus on classifying scans. For example, convolutional networks have been used to classify Alzheimer’s disease progression from CT brain images.8 Other work has claimed the ability to diagnose pneumonia from chest X-ray scans at near physician-level accuracy.9 Deep learning has similarly been used to achieve strong classification accuracy on mammography.10

Human-Level Accuracy Is Tricky!

When a paper claims that its system achieves near-human accuracy, it’s worth pausing to consider what that means. Usually, the authors of the paper choose some metric (say, ROC AUC), and a group of external physicians works to annotate the chosen test set for the study. The accuracy of the model on this test set is then compared against that of the “average” physician (often the mean or median of the physician scores).

This is a fairly complex process, and there are a number of ways in which this comparison can go wrong. First, the choice of metric matters: all too commonly, switching metrics changes which system comes out ahead. Good analyses will report several different metrics to ensure that the conclusions are robust to this choice.
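As a small illustration, the following sketch computes several metrics on the same (made-up) predictions with scikit-learn; a real analysis would check that the ranking of systems holds across all of them:

# Checking that conclusions hold across several metrics rather than one.
# y_true and y_score are placeholders for real test labels and model scores.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, accuracy_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

print("ROC AUC:          ", roc_auc_score(y_true, y_score))
print("Average precision:", average_precision_score(y_true, y_score))
print("Accuracy at 0.5:  ", accuracy_score(y_true, y_score > 0.5))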

Another point to note is that there’s considerable variation between doctors themselves. It’s worth checking to make sure that your choice of “average” is robust. A better metric might be to ask whether your algorithm is capable of beating the “best” doctor in the panel.

A third issue is that it can be extremely tricky to make sure that the test set isn’t “polluted.” (See our warning in Chapter 7.) Subtle forms of pollution can occur in which the scans from the same patient accidentally end up in both the training and test sets. If your model has very high accuracy, it’s worth double- and triple-checking for such leakages. All of us have been guilty of making mistakes on these pipelines in the past.

Finally, “human-level accuracy” often doesn’t mean much by itself. As we’ve noted, some expert systems and Bayesian networks achieved human-level accuracy on limited tasks but failed to have a broad impact on medicine. The reason is that doctors perform a whole range of tasks that are tightly interwoven. A given doctor may underperform a deep network at scan reading but still offer a much better diagnosis by drawing on other information. It’s also worth remembering that benchmark tasks are often synthetic and may not match best physician practice. Prospective clinical trials that deploy deep learning systems with consenting patients will be needed to more accurately gauge the effectiveness of these techniques.

Histology

Histology is the study of tissues, typically examined under a microscope. We won’t say too much about it here, because the issues confronting designers of deep histology systems are a subset of those faced in deep microscopy; take a look back at the previous chapter to learn more. We’ll note simply that deep learning models have achieved strong performance on histology studies.

MRI Scans

Magnetic resonance imaging (MRI) is another form of scan commonly used by doctors. Instead of X-rays, it uses strong magnetic fields and radio waves to do its imaging, so MRI avoids ionizing radiation. However, these scans often require patients to lie within a noisy and cramped MRI machine, an experience that may be considerably more unpleasant than an X-ray scan.

Like CT, MRI is capable of assembling 3D images. And as with CT scans, a number of deep learning studies have sought to ease this reconstruction process. Some early studies claim that deep learning techniques can improve on traditional signal processing methods to reconstruct MRI images with reduced scan times. In addition, as with other scanning techniques, a number of studies have sought to use deep networks for classifying, segmenting, and processing MRI images with some strong successes.

Deep Learning for Signal Processing?

For both CT scans and MRI scans, we’ve mentioned in passing that deep networks have been used to help reconstruct images more effectively. Both of these applications are examples of the broader trend of using deep learning in signal processing. We’ve already seen some of this in passing; deep learning methods for super-resolution microscopy also fall within this general framework.

Such work on improving signal processing techniques is very exciting from a fundamental perspective, since signal processing is a highly mathematical and well-developed field. The fact that deep learning offers new directions here is in itself remarkable. However, it’s also worth noting that traditional signal processing algorithms often provide very strong baselines. As a result, unlike with image classification, deep methods don’t yet offer breakthrough accuracy improvements in this area. That said, this is a field of continued and active research, and it wouldn’t be at all surprising if deep signal processing ends up being even more influential than image classification in the long run, given the very wide range of applications for such techniques.

It’s worth noting that there are many other types of scans doctors use. Given the explosion in deep learning applications powered by strong open source tools, it’s a good bet that for each such scan type, there’s a study or two attempting to use deep learning for the task. For example, deep learning has been applied to ultrasounds, electrocardiogram (ECG) scans, skin cancer detection, and more.

Convolutional networks are an extraordinarily powerful tool, because so much human activity revolves around processing complex visual information. In addition, the growth of open source frameworks has meant that researchers worldwide have joined the race to apply deep learning techniques on new types of images. In many ways, this type of research is relatively straightforward (on the computational end, at least), as standard tools can be applied without too much fuss. If you’re reading this while employed at a company, it’s these same properties of deep learning that likely make it interesting to you as a practitioner.

Learning Models as Therapeutics

So far in this chapter, we’ve seen that learning models can be effective assistants to doctors, helping aid the process of diagnosis and scan understanding. However, there’s some exciting evidence that learning models can move past being assistants to doctors to being therapeutic instruments in their own right.

How could this possibly work? One of the greatest powers of deep learning is that it is now feasible for the first time to build practical software that operates on perceptual data. For this reason, machine learning systems could potentially serve as “eyes” and “ears” to differently abled patients. A visual system could help patients with visual impairments more effectively navigate the world. An audio processing system could help patients with hearing impairments more effectively navigate the world. These systems face a number of challenges that other deep models don’t, since they have to operate effectively in real time. All the models we’ve considered so far in this book have been batch systems, suited for deployment on a backend server, not models fit for deployment on a live embedded device. There’s a whole host of challenges in dealing with machine learning in production which we won’t get into here, but we encourage interested readers to dive into the subject more deeply.

We also note that there’s a separate class of software-driven therapeutics that makes use of the powerful effects of modern software on the human brain. A groundswell of recent research has shown that modern software applications such as Facebook, Google, WeChat, and the like can be highly addictive. These apps are designed with bright colors and are intended to hit many of the same centers in our brains as casino slot machines. There’s growing recognition that digital addiction is a real problem facing many patients.11 This is a broad area beyond the scope of this book, but we note that there’s evidence that this power of modern software can be used for good too. Some software apps have been developed that use the psychological effects of modern apps as therapeutic interventions for patients struggling with depression or other conditions.

Diabetic Retinopathy

So far in this chapter, we have discussed applications of deep learning to medicine in a theoretical sense. In this section, we’ll roll up our sleeves and get our hands dirty with a practical example. In particular, we’re going to build a model that assesses the progression of diabetic retinopathy in patients.

Diabetic retinopathy is a condition in which diabetes damages the health of the eyes. It is a major cause of blindness, especially in the developing world. The fundus is the interior area of the eye that’s opposite to the lens. A common strategy for diagnosis of diabetic retinopathy is for doctors to view an image of the patient’s fundus and label it manually. Significant work has gone into “fundus photography,” which develops techniques to capture patient fundus images (see Figure 8-6).

Figure 8-6. An image of a patient fundus from a patient who has undergone scatter laser surgery treatment for diabetic retinopathy. (Source: Wikimedia.)

The learning challenge for diabetic retinopathy is to design an algorithm that can classify a patient’s disease progress given an image of the patient’s fundus. At present, making such predictions requires skilled doctors or technicians. The hope is that a machine learning system could accurately predict disease progression from patient fundus images. This could provide patients with a cheap method of understanding their risk, which they could use before consulting a more expensive expert doctor for a diagnosis.

In addition, unlike EHR data, fundus images don’t contain much sensitive information about patients, which makes it easier to gather large fundus image datasets. For these reasons, a number of machine learning studies and challenges have been conducted on diabetic retinopathy datasets. In particular, Kaggle sponsored a contest aimed at creating good diabetic retinopathy models and put together a dataset of high-resolution fundus images. In the remainder of this section, you will learn how to use DeepChem to build a diabetic retinopathy classifier on the Kaggle Diabetic Retinopathy (DR) dataset.

Obtaining the Kaggle Diabetic Retinopathy Dataset

The terms of the Kaggle challenge prohibit us from mirroring the data directly on the DeepChem servers. For this reason, you will need to download the data manually from Kaggle’s site. You will have to register an account with Kaggle and download the dataset through their API. The full dataset is quite large (80 GB), so you might choose to download a subset of the data if your internet connection can’t handle the full download.

See the GitHub repository associated with this book for more information on downloading this dataset. The image loading functions here require that the training data is structured in a particular directory structure. Details on this directory format are in the GitHub repo.

The first step to working with this data is to preprocess and load the raw data. In particular, we crop each image to focus on its center square containing the retina. We then resize this center square to be of size 512 by 512.
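A minimal sketch of this center-crop-and-resize step, using Pillow, might look like the following; the actual preprocessing code lives in the book’s repository, and the filename shown is hypothetical:

# Center-crop each fundus image to a square and resize it to 512 x 512.
from PIL import Image

def center_crop_and_resize(path, size=512):
    img = Image.open(path)
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size))

# processed = center_crop_and_resize("fundus_0001.jpeg")  # hypothetical filename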

Dealing with High-Resolution Images

Many image datasets in medicine and science will feature very high-resolution images. While it may be tempting to train deep learning models directly on these high-resolution images, this is usually computationally challenging. One problem is that most modern GPUs have limited memory. That means training very high-resolution models may not be feasible on standard hardware. In addition, most image processing systems (for now) expect their input images to have a fixed shape. This means that high-resolution images from different cameras will have to be cropped to fit within standard shapes.

Luckily, it turns out that cropping and resizing images is usually not terribly damaging to the performance of machine learning systems. It’s also common to do more thorough data augmentation, in which a number of perturbed images are automatically generated from each source image. In this particular case study, we performed a few standard data augmentations. We encourage you to dig into the augmentation code since it may prove a useful tool for your own projects.
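For illustration, here is a minimal sketch of simple augmentations (random flips and small rotations) with Pillow; the augmentations actually used in this case study are in the book’s repository:

# Generate randomly perturbed variants of an image for data augmentation.
import random
from PIL import Image

def augment(img):
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_TOP_BOTTOM)
    angle = random.uniform(-15, 15)        # small random rotation
    return img.rotate(angle)

# augmented = [augment(img) for _ in range(4)]  # several variants per source image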

The core data is stored in a set of directories on disk. We use DeepChem’s ImageLoader class to load these images from disk. If you’re interested, you can look through this loading and preprocessing code in detail, but we’ve wrapped it into a convenience helper function. In the style of the MoleculeNet loaders, this function also does a random training, validation, and test split:

train, valid, test = load_images_DR(split='random', seed=123)

Now that we have the data for this learning task, let’s build a convolutional architecture to learn from this dataset. The architecture for this task is fairly standard and resembles other architectures you’ve already seen in this book, so we don’t replicate it here. Here’s the invocation of the object wrapper for the underlying convolutional network:

# Define and build model
model = DRModel(
    n_init_kernel=32,
    batch_size=32,
    learning_rate=1e-5,
    augment=True,
    model_dir='./test_model')

This code sample defines a diabetic retinopathy convolutional network in DeepChem. As we will see later, training this model will take some heavy computation. For that reason, we recommend that you download our pretrained model from the DeepChem website and use that for your early exploration. We have already trained this model on the full Kaggle Diabetic Retinopathy dataset and stored its weights for your convenience. You can use the following commands to download and store the model (note that the first command should be entered on a single line, with no space around the / where the URL wraps):

wget https://s3-us-west-1.amazonaws.com/deepchem.io/featurized_datasets
  /DR_model.tar.gz 
mv DR_model.tar.gz test_model/
cd test_model
tar -zxvf DR_model.tar.gz
cd ..

You can then restore the trained model weights as follows:

model.build()
model.restore(checkpoint="./test_model/model-84384")

We are restoring a particular pretrained “checkpoint” from this model. We provide more details on the restoration process and the full scripts used to achieve it in the code repository associated with this book. With the pretrained model in place, we can compute some basic statistics upon it:

metrics = [
    dc.metrics.Metric(DRAccuracy, mode='classification'),
    dc.metrics.Metric(QuadWeightedKappa, mode='classification')
]

There are a number of metrics that are useful for evaluating diabetic retinopathy models. Here we use DRAccuracy, which is simply the model accuracy (the percentage of labels predicted correctly), and QuadWeightedKappa, the quadratically weighted Cohen’s kappa, a statistic that measures agreement between two raters. The kappa metric is useful here because the diabetic retinopathy learning task is a multiclass learning problem.
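For reference, a quadratically weighted kappa can also be computed directly with scikit-learn; the severity grades below are made up for illustration:

# Quadratically weighted Cohen's kappa on made-up severity grades (0-4).
from sklearn.metrics import cohen_kappa_score

true_grades      = [0, 0, 1, 2, 2, 3, 4, 1, 0, 2]
predicted_grades = [0, 1, 1, 2, 3, 3, 4, 0, 0, 2]

print(cohen_kappa_score(true_grades, predicted_grades, weights="quadratic"))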

Let’s evaluate our pretrained model on the test set with our metrics:

model.evaluate(test, metrics)

This produces the following results:

computed_metrics: [0.9339595787076572]
computed_metrics: [0.8494075470551462]

The basic model gets 93.4% accuracy on our test set. Not bad! (It’s important to note that this isn’t the same as the Kaggle test set—we’ve simply partitioned Kaggle’s training set into train/valid/test sets for our experimentation. You’re welcome to try submitting your trained model to Kaggle for evaluation on their test set, though.) Now, what if you’re interested in training the full model from scratch? This will take about a day or two’s training on a good GPU system, but is straightforward enough to do:

# 'metrics' was defined above; 'cm' is assumed to hold additional metrics
# (such as a confusion matrix), defined in the book's code repository.
for i in range(10):
  model.fit(train, nb_epoch=10)
  model.evaluate(train, metrics)
  model.evaluate(valid, metrics)
  model.evaluate(valid, cm)
  model.evaluate(test, metrics)
  model.evaluate(test, cm)

We train the model for 100 epochs, pausing periodically to print out results from the model. If you’re running this job, we recommend making sure that your machine won’t shut down or go to sleep halfway through the job. There’s nothing as irritating as losing a large job to a sleep screen!

Conclusion

In many ways, the application of machine learning to medicine has the potential to have greater impact than many of the other applications we’ve seen so far. These other applications may have shifted what you do at work, but machine learning healthcare systems will soon change your personal healthcare experiences, along with the experiences of millions if not billions of others. For this reason, it’s worth pausing and thinking through some of the ethical repercussions.

Ethical Considerations

Training data for these systems will likely be biased for the foreseeable future. It’s likely that the training data will be drawn from the medical systems of developed economies, and as a result it’s possible that the models constructed will be considerably less accurate for portions of the world that currently lack robust medical systems.

In addition, gathering data on patients is itself fraught with potential ethical issues. Medicine has a long and troubled history of experimenting without consent, especially with people from marginalized groups. Consider the case of Henrietta Lacks, an African-American cancer patient in 1950s Baltimore. A cell line cultivated from a tissue sample of Ms. Lacks’s tumor (“HeLa”) became a standard biological tool and was used in thousands of research papers, yet none of the proceeds from this research ever reached her family. Ms. Lacks’s physician did not inform the family of the samples he’d taken, or obtain consent. Her family did not learn about the HeLa cell line until the 1970s, when they were contacted by medical researchers seeking to draw additional samples.

How could this situation repeat itself in the deep learning era? The medical records of a patient could possibly be used to train a learning system without the consent of the patient or their family. Or, perhaps more realistically, the patient or the family could be induced to sign away the rights to their data at the bedside in the hopes of a last-minute cure.

There’s something disturbing about these scenarios. None of us would care to learn that our beloved family members’ rights have been violated by institutional medicine or profit-seeking startups. How can we seek to prevent these ethical violations from occurring? If you’re involved in data gathering efforts, pause and ask where the data is coming from. Were all relevant laws appropriately respected? If you’re a scientist or developer at a company or research institution, you will have valuable skills that give you leverage within the organization. If you take a stand, you will influence others in the organization to stand with you. And if your organization refuses to listen, you have valuable skills that will enable you to find a job with an organization that holds itself to high ethical standards.

Job Losses

Most of the fields considered in other chapters of this book are relatively niche scientific disciplines, so there is little potential for advances in them to cause significant job losses. Rather, we can expect job growth in these fields as these relatively niche areas become accessible to a much wider pool of developers and scientists.

Healthcare and medicine are different. Healthcare is one of the largest industries worldwide, with millions of doctors, nurses, technicians, and more serving the needs of the world’s population. What happens as significant fractions of this workforce are confronted with deep learning tools?

Much of medicine is deeply human. Having a trusted primary care provider who you can be sure is looking out for your best interests makes a profound difference to an ill patient. It’s very possible that for many patients, care experience could actually improve as much of the busywork is automated out.

In the US, healthcare reform in 2010 (the Affordable Care Act) accelerated the use of EHR systems throughout the American medical system. Many doctors have reported feeling that these EHR systems are deeply unfriendly, requiring many unnecessary administrative actions. Part of this is due simply to poor software design, worsened by regulatory capture that makes it difficult for healthcare institutions to shift to better alternatives. But some of it is due to limitations of present-day software. Use of deep learning systems to allow for more intelligent information handling could lower the burden on doctors, enabling them to spend more time with patients.

In addition, most countries in the world have healthcare systems that don’t match those in the United States and Europe. The increasing availability of open source tools and accessible datasets will provide governments and entrepreneurs in the rest of the world the tools they need to serve their constituents.

Summary

In this chapter, you’ve learned about the history of applying machine learning methods to problems in medicine. We started by giving you an overview of classical methods such as expert systems and Bayesian networks, then shifted into more modern work on electronic health records and medical scans. We ended the chapter with an in-depth case study on training a classifier that predicts diabetic retinopathy patient progression. We also commented in a number of asides about the challenges that learning systems for healthcare face. We’ll return to some of these challenges in Chapter 10, where we discuss the interpretability of deep learning systems.

1 See Dendral or Mycin on Wikipedia for more information.

2 Asabere, Nana Yaw. “mMes: A Mobile Medical Expert System for Health Institutions in Ghana.” International Journal of Science and Technology no.6. (June 2012). https://pdfs.semanticscholar.org/ed35/ec162c5916f317162e11e390440bdb1b55b2.pdf.

3 Mandel, JC, et al. “SMART on FHIR: A Standards-Based, Interoperable Apps Platform for Electronic Health Records.” https://doi.org/10.1093/jamia/ocv189. 2016.

4 Rajkomar, Alvin et al. “Scalable and Accurate Deep Learning with Electronic Health Records.” NPJ Digital Medicine. https://arxiv.org/pdf/1801.07860.pdf. 2018.

5 Miotto, Riccardo, Li Li, Brian A. Kidd and Joel T. Dudley. “Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records.” https://doi.org/10.1038/srep26094. 2016.

6 Gawande, Atul. “Why Doctors Hate Their Computers.” The New Yorker. https://www.newyorker.com/magazine/2018/11/12/why-doctors-hate-their-computers. 2018.

7 “AI, Radiology and the Future of Work.” The Economist. https://econ.st/2HrRDuz. 2018.

8 Gao, Xiaohong W., Rui Hui, and Zengmin Tian. “Classification of CT Brain Images Based on Deep Learning Networks.” https://doi.org/10.1016/j.cmpb.2016.10.007. 2017.

9 Pranav Rajpurkar et al. “CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning.” https://arxiv.org/pdf/1711.05225.pdf. 2017.

10 Ribli, Dezso et al. “Detecting and Classifying Lesions in Mammograms with Deep Learning.” https://doi.org/10.1038/s41598-018-22437-z. 2018.

11 See Digital Addict on Wikipedia for more information.
