Chapter 5. Biophysical Machine Learning

In this chapter, we will explore how to use deep learning for understanding biophysical systems. In particular, we will explore in depth the problem of predicting how small drug-like molecules bind to a protein of interest in the human body.

This problem is of fundamental interest in drug discovery. Modulating a single protein in a targeted fashion can often have a significant therapeutic impact. The breakthrough cancer drug Imatinib tightly binds with BCR-ABL, for example, which is part of the reason for its efficacy. For other diseases, it can be challenging to find a single protein target with the same efficacy, but the abstraction remains useful nevertheless. There are so many mechanisms at play in the human body that finding an effective mental model can be crucial.

Drugs Don’t Just Target a Single Protein

As we’ve discussed, it can be extraordinarily useful to reduce the problem of designing a drug for a disease to the problem of designing a drug that interacts tightly with a given protein. But it’s extremely important to realize that in reality, any given drug is going to interact with many different subsystems in the body. The study of such multifaceted interactions is broadly called polypharmacology.

At present, computational methods for dealing with polypharmacology are still relatively undeveloped, so the gold standard for testing for polypharmacological effects remains animal and human experimentation. As computational techniques mature, this state of affairs may shift over the next few years.

Our goal therefore is to design learning algorithms that can effectively predict when a given molecule is going to interact with a given protein. How can we do this? For starters, we might borrow some of the techniques from the previous chapter on molecular machine learning and try to create a protein-specific model. Such a model would, given a dataset of molecules that either bind or don’t bind to a given protein, learn to predict for new molecules whether they bind or not. This idea isn’t actually terrible, but requires a good amount of data for the system at hand. Ideally, we’d have an algorithm that could work without a large amount of data for a new protein.
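As a rough sketch of what such a protein-specific model might look like, here is one way to combine the circular fingerprints of the previous chapter with a simple classifier. The SMILES strings and labels below are made-up placeholders, and the featurizer and model choices are just one reasonable option rather than a prescribed recipe:

import numpy as np
import deepchem as dc
from sklearn.ensemble import RandomForestClassifier

# Toy data: SMILES strings for candidate molecules and 0/1 labels indicating
# whether each one binds the protein of interest (placeholders, not real data).
smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]
labels = np.array([0, 0, 1])

# Featurize each molecule with circular (ECFP-style) fingerprints.
featurizer = dc.feat.CircularFingerprint(size=2048)
features = featurizer.featurize(smiles)
dataset = dc.data.NumpyDataset(X=features, y=labels)

# Train a simple protein-specific binding classifier.
model = dc.models.SklearnModel(RandomForestClassifier(n_estimators=100))
model.fit(dataset)

The obvious drawback, as noted above, is that this approach only works when you already have a reasonably large set of measured binders and nonbinders for the particular protein.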

The trick, it turns out, is to use the physics of the protein. As we will discuss in the next section, quite a bit is known about the physical structure of protein molecules. In particular, it’s possible to create snapshots of the 3D state of proteins using modern experimental techniques. These 3D snapshots can be fed into learning algorithms and used to predict binding. It’s also possible to take snapshots (better known as structures) of the interaction of proteins with smaller molecules (often called ligands). If this discussion seems abstract to you right now, don’t worry. You’ll be seeing plenty of practical code in this chapter.

We’ll begin this chapter with a deeper overview of proteins and their function in biology. We will then shift into computer science and introduce some algorithms for featurizing protein systems, transforming biophysical systems into vectors or tensors for use in learning algorithms. In the last part of the chapter, we will work through an in-depth case study on constructing a protein–ligand binding interaction model. For experimentation, we will introduce the PDBBind dataset, which contains a collection of experimentally determined protein–ligand structures. We will demonstrate how to featurize this dataset with DeepChem. We will then build some models, both deep networks and simpler baselines, on these featurized datasets and study their performance.

Why Is It Called Biophysics?

It is often said that all of biology is based on chemistry, and all of chemistry is based on physics. At first glance, biology and physics may seem far removed from one another. But as we will discuss at greater length later in this chapter, physical laws are at the heart of all biological mechanisms. In addition, much of the study of protein structure depends critically on the use of experimental techniques refined in physics. Manipulating nanoscale machines (for that’s what proteins actually are) requires considerable physical sophistication, on both the theoretical and the practical side.

It’s also interesting to note that the deep learning algorithms we will discuss in this chapter bear significant similarities to deep learning architectures used for studying systems from particle physics or for physical simulations. Such topics are outside the scope of this book, but we encourage interested readers to explore them further.

Protein Structures

Proteins are tiny machines that do most of the work in a cell. Despite their small size, they can be very complicated. A typical protein is made of thousands of atoms arranged in precise ways.

To understand any machine, you must know what parts it is made of and how they are put together. You cannot hope to understand a car until you know it has wheels on the bottom, an empty space in the middle to hold passengers, and doors through which the passengers can enter and exit. The same is true of a protein. To understand how it works, you must know exactly how it is put together.

Furthermore, you need to know how it interacts with other molecules. Few machines operate in isolation. A car interacts with the passengers it carries, the road it drives on, and the energy source that allows it to move. This applies to most proteins as well. They act on molecules (for example, to catalyze a chemical reaction), are acted upon by others (for example, to regulate their activity), and draw energy from still others. All these interactions depend on the specific positioning of atoms in the two molecules. To understand them, you must know how the atoms are arranged in 3D space.

Unfortunately, you can’t just look at a protein under a microscope; proteins are far too small for that. Instead, scientists have had to invent complex and ingenious methods for determining the structures of proteins. At present there are three such methods: X-ray crystallography, nuclear magnetic resonance (NMR for short), and cryo-electron microscopy (cryo-EM for short).

X-ray crystallography is the oldest method, and still the most widely used. Roughly 90% of all known protein structures were determined with this method. Crystallography involves growing a crystal of the protein of interest (many molecules of the protein all tightly packed together in a regular repeating pattern). X-rays are then shone on the crystal, the scattered light is measured, and the results are analyzed to work out the structure of the individual molecules. Despite its success, this method has many limitations. It is slow and expensive even in the best cases. Many proteins do not form crystals, making crystallography impossible. Packing the protein into a crystal may alter its structure, so the result might be different from its structure in a living cell. Many proteins are flexible and can take on a range of structures, but crystallography only produces a single unmoving snapshot. But even with these limitations, it is a remarkably powerful and important tool.

NMR is the second most common method. It operates on proteins in solution, so there is no need to grow a crystal. This makes it an important alternative for proteins that cannot be crystallized. Unlike crystallography, which produces a single fixed snapshot, NMR produces an ensemble of structures representing the range of shapes the protein can take on in solution. This is a very important benefit, since it gives information about how the protein can move. Unfortunately, NMR has its own limitations. It requires a highly concentrated solution, so it is mostly limited to small, highly soluble proteins.

In recent years, cryo-EM has emerged as a third option for determining protein structures. It involves rapidly freezing protein molecules, then imaging them with an electron microscope. Each image is far too low-resolution to make out the precise details; but by combining many different images, one can produce a final structure whose resolution is much higher than any individual electron microscope image. After decades of steady improvements to the methods and technologies, cryo-EM has finally begun to approach atomic resolution. Unlike crystallography and NMR, it works for large proteins that do not crystallize. This will probably make it a very important technique in the years to come.

The Protein Data Bank (PDB) is the primary repository for known protein structures. At present it contains over 142,000 structures, like the one in Figure 5-1. That may seem like a lot, but it is far less than we really want. The number of known proteins is orders of magnitude larger, with more being discovered all the time. For any protein you want to study, there is a good chance that its structure is still unknown. And you really want many structures for each protein, not just one. Many proteins can exist in multiple functionally different states (for example, “active” and “inactive” states), so you want to know the structure of each state. If a protein binds to other molecules, you want a separate structure with the protein bound to each one so you can see exactly how they bind. The PDB is a fantastic resource, but the field as a whole is still in its “low data” stage. We have far less data than we want, and a major challenge is figuring out how to make the most of what we have. That is likely to remain true for decades.

Figure 5-1. A crystal structure of the CapD protein from Bacillus anthracis, the anthrax pathogen. Determining the structures of bacterial proteins can be a powerful tool for antibiotic design. More generally, identifying the structure of a therapeutically relevant protein is one of the key steps in modern drug discovery.

Protein Sequences

So far in this chapter, we’ve primarily discussed protein structures, but we haven’t yet said much about what proteins are made of at the atomic level. Proteins are built out of fundamental building blocks called amino acids. Amino acids are a family of molecules that share a common core but have different “side chains” attached (Figure 5-2). These different side chains alter the behavior of the protein.

A protein is a chain of amino acids linked one to the next to the next (Figure 5-3). The start of the amino acid chain is typically referred to as the N-terminus, while the end of the chain is called the C-terminus. Small chains of amino acids are commonly called peptides, while longer chains are called proteins. Peptides are too small to have complex 3D structures, but the structures of proteins can be very complicated.

Figure 5-2. Amino acids are the building blocks of protein structures. This diagram represents the chemical structures of a number of commonly seen amino acids. (Adapted from Wikimedia.)
Figure 5-3. A chain of four amino acids, with the N-terminus on the left and the C-terminus on the right. (Source: Wikipedia.)

It’s worth noting that while most proteins take a rigid shape, there are also intrinsically disordered proteins which have regions that refuse to take rigid shapes (Figure 5-4).

Figure 5-4. A snapshot of the SUMO-1 protein. The central core of the protein has structure, while the N-terminal and C-terminal regions are disordered. Intrinsically disordered proteins such as SUMO-1 are challenging to handle computationally.

In the remainder of this chapter, we will primarily deal with proteins that have rigid, 3D shapes. Dealing with floppy structures with no set shape is still challenging for modern computational techniques.

Can’t We Predict 3D Protein Structure Computationally?

After reading this section, you might wonder why we don’t use algorithms to directly predict the structure of interesting protein molecules rather than depending on complex physical experiments. It’s a good question, and there have in fact been decades of work on the computational prediction of protein structure.

There are two main approaches to predicting protein structures. The first is called homology modeling. Protein sequences and structures are the product of billions of years of evolution. If two proteins are near relatives (the technical term is “homologs”) that only recently diverged from each other, they probably have similar structures. To predict a protein’s structure by homology modeling, you first look for a homolog whose structure is already known, then try to adjust it based on differences between the sequences of the two proteins. Homology modeling works reasonably well for determining the overall shape of a protein, but it often gets details wrong. And of course, it requires that you already know the structure of a homologous protein.

The other main approach is physical modeling. Using knowledge of the laws of physics, you try to explore many different conformations the protein might take on and predict which one will be most stable. This method requires enormous amounts of computing time. Until about a decade ago, it simply was impossible. Even today it is only practical for small, fast-folding proteins. Furthermore, it requires physical approximations to speed up the calculation, and those reduce the accuracy of the result. Physical modeling will often predict the right structure, but not always.

A Short Primer on Protein Binding

We’ve discussed protein structure at some length in this chapter, but we haven’t said much about how proteins interact with other molecules (Figure 5-5). In practice, proteins often bind to small molecules. Sometimes that binding behavior is central to the protein’s function: the main role for a given protein can involve binding to particular molecules. For example, signal transduction in cells often passes messages via the mechanism of a protein binding to another molecule. Other times, the molecule binding to the protein is foreign: possibly a drug we’ve created to manipulate the protein, possibly a toxin that interferes with its function.

Figure 5-5. A signal transduced via a protein embedded in a cell’s membrane. (Source: Wikimedia.)

Understanding the details of how, where, and when molecules bind to proteins is critical to understanding their functions and developing drugs. If we can coopt the signaling mechanisms of cells in the human body, we can induce desired medical responses in the body.

Protein binding involves lots of very specific interactions, which makes it hard to predict computationally. A tiny change in the positions of just a few atoms can determine whether or not a molecule binds to a protein. Furthermore, many proteins are flexible and constantly moving. A protein might be able to bind a molecule when it’s in certain conformations, but not when it’s in others. Binding in turn may cause further changes to a protein’s conformation, and thus to its function.

In the remainder of this chapter, we will use the challenge of understanding protein binding as a motivating computational example. We will delve in depth into current deep learning and machine learning approaches for making predictions about binding events.

Biophysical Featurizations

As we discussed in the previous chapter, one of the crucial steps in applying machine learning to a new domain is figuring out how to transform (or featurize) training data to a format suitable for learning algorithms. We’ve discussed a number of techniques for featurizing individual small molecules. Could we perhaps adapt these techniques for use in biophysical systems?

Unfortunately, the behaviors of biophysical systems are critically constrained by their 3D structures, so the 2D techniques from previous chapters miss crucial information. As a result, we will discuss a pair of new featurization techniques in this chapter. The first featurization technique, the grid featurization, explicitly searches a 3D structure for the presence of critical physical interactions such as hydrogen bonds and salt bridges (more on these later), which are known to play an important role in determining protein structure. The advantage of this technique is that we can rely upon a wealth of known facts about protein physics. The weakness, of course, is that we are bound by known physics and lessen the chance that our algorithms will be able to detect new physics.

The alternative featurization technique is the atomic featurization, which simply provides a processed representation of the 3D positions and identities of all atoms in the system. This makes the challenge for the learning algorithm considerably harder, since it must learn to identify critical physical interactions, but it also makes it feasible for learning algorithms to detect new patterns of interesting behavior.

PDB Files and Their Pitfalls

Protein structures are often stored in PDB files. Such files are simply text files that contain descriptions of the atoms in the structure and their positions in coordinate space relative to one another. Featurization algorithms typically rely on libraries that read in PDB files and store them into in-memory data structures. So far so good, right?

Unfortunately, PDB files are often malformed. The reason lies in the underlying physics. Often, an experiment will fail to have adequate resolution to completely specify a portion of the protein’s structure. Such regions are left unspecified in the PDB file, so it’s common to find that many atoms or even entire substructures of the protein are missing from the core structure.

Libraries such as DeepChem will often attempt to do the “right” thing and algorithmically fill in such missing regions. It’s important to note that this cleanup is only approximate, and there’s still no entirely satisfactory replacement to having an expert human peer at the protein structure (in a suitable viewing program) and point out issues. Hopefully, software tooling to handle these errors will improve over the next few years and the need for expert guidance will lessen.

Grid Featurization

By converting biophysical structures into vectors, we can use machine learning algorithms to make predictions about them. It stands to reason that it would be useful to have a featurization algorithm for processing protein–ligand systems. However, it’s quite nontrivial to devise such an algorithm. Ideally, a featurization technique would need to have significant knowledge about the chemistry of such systems baked into it by design, so it could pull out useful features.

These features might include, for example, counts of noncovalent bonds between the protein and ligand, such as hydrogen bonds or other interactions. (Most protein–ligand systems don’t have covalent bonds between the protein and ligand.)

Luckily for us, DeepChem has such a featurizer available. Its RdkitGridFeaturizer summarizes a set of relevant chemical information into a brief vector for use in learning algorithms. While it’s not necessary to understand the underlying science in depth to use the featurizer, it will still be useful to have a basic understanding of the underlying physics. So, before we dive into a description of what the grid featurizer computes, we will first review some of the pertinent biophysics of macromolecular complexes.

While reading this section, it may be useful to refer back to the discussion of basic chemical interactions in the previous chapter. Ideas such as covalent and noncovalent bonds will pop up quite a bit.

The grid featurizer searches for the presence of such chemical interactions within a given structure and constructs a feature vector that contains counts of these interactions. We will say more about how this is done algorithmically later in the chapter.

Hydrogen bonds

When a hydrogen atom is covalently bonded to a more electronegative atom such as oxygen or nitrogen, the shared electrons spend most of their time closer to the more electronegative atom. This leaves the hydrogen with a net positive charge. If that positively charged hydrogen then gets close to another atom with a net negative charge, they are attracted to each other. That is a hydrogen bond (Figure 5-6).

Figure 5-6. A rendered example of a hydrogen bond. Excess negative charge on the oxygen interacts with excess positive charge on the hydrogen, creating a bonding interaction. (Source: Wikimedia.)

Because hydrogen atoms are so small, they can get very close to other atoms, leading to a strong electrostatic attraction. This makes hydrogen bonds one of the strongest noncovalent interactions. They are a critical form of interaction that often stabilizes molecular systems. For example, water’s unique properties are due in large part to the network of hydrogen bonds that form between water molecules.

The RdkitGridFeaturizer attempts to count the hydrogen bonds present in a structure by checking for pairs of protein/ligand atoms of the right types that are suitably close to one another. This requires applying a cutoff to the distance, which is somewhat arbitrary. In reality there is not a sharp division between atoms being bonded and not bonded. This may lead to some misidentified interactions, but empirically, a simple cutoff tends to work reasonably well.
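To make the idea concrete, here is a minimal NumPy sketch of a distance-cutoff check. The 3.5 Å cutoff and the coordinate arrays are illustrative assumptions; the actual RdkitGridFeaturizer also filters by atom type and other criteria:

import numpy as np

def count_close_pairs(coords_a, coords_b, cutoff=3.5):
    """Count pairs of atoms (one from each array) within `cutoff` angstroms.

    coords_a and coords_b are arrays of shape (N, 3) and (M, 3) holding the
    positions of candidate donor/acceptor atoms from the protein and ligand.
    """
    # Pairwise distance matrix of shape (N, M).
    diffs = coords_a[:, np.newaxis, :] - coords_b[np.newaxis, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return int(np.sum(dists < cutoff))

# Toy example: two protein atoms and two ligand atoms.
protein_atoms = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
ligand_atoms = np.array([[1.0, 1.0, 1.0], [10.0, 0.0, 0.0]])
print(count_close_pairs(protein_atoms, ligand_atoms))  # 1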

Salt bridges

A salt bridge is a noncovalent attraction between two amino acids, where one has a positive charge and the other has a negative charge (see Figure 5-7). It combines both ionic bonding and hydrogen bonding. Although these bonds are relatively weak, they can help stabilize the structure of a protein by providing an interaction between distant amino acids in the protein’s sequence.

Figure 5-7. An illustration of a salt bridge between glutamic acid and lysine. The salt bridge is a combination of an ionic-style electrostatic interaction and a hydrogen bond and serves to stabilize the structure. (Source: Wikimedia.)

The grid featurizer attempts to detect salt bridges by explicitly checking for pairs of amino acids (such as glutamic acid and lysine) that are known to form such interactions, and that are in close physical proximity in the 3D structure of the protein.

Pi-stacking interactions

Pi-stacking interactions are a form of noncovalent interaction between aromatic rings (Figure 5-8). These are flat, ring-shaped structures that appear in many biological molecules, including DNA and RNA. They also appear in the side chains of some amino acids, including phenylalanine, tyrosine, and tryptophan.

Figure 5-8. An aromatic ring in the benzene molecule. Such ring structures are known for their exceptional stability. In addition, aromatic rings have all their atoms lying in a plane. Heterogeneous rings, in contrast, don’t have their atoms occupying the same plane.

Roughly speaking, pi-stacking interactions occur when two aromatic rings “stack” on top of each other. Figure 5-9 shows some of the ways in which two benzene rings can interact. Such stacking interactions, like salt bridges, can help stabilize various macromolecular structures. Importantly, pi-stacking interactions also occur in protein–ligand complexes, since aromatic rings are often found in small molecules. The grid featurizer counts these interactions by detecting the presence of aromatic rings and checking the distances between their centroids and the angles between their planes (a minimal geometric sketch of this test follows Figure 5-9).

Figure 5-9. Various noncovalent aromatic ring interactions. In the displaced interaction, the centers of the two aromatic rings are slightly displaced from one another. In edge-to-face interactions, one aromatic ring’s edge stacks on another’s face. The sandwich configuration has two rings stacked directly, but is less energetically favorable than displaced or edge-to-face interactions since regions with the same charge interact closely.
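Here is that minimal geometric sketch. Each ring is reduced to its centroid and a plane normal, and the distance and angle cutoffs below are illustrative assumptions rather than the exact thresholds used by DeepChem:

import numpy as np

def ring_centroid_and_normal(ring_coords):
    """Return the centroid and a unit normal of a (roughly planar) ring.

    ring_coords is an (N, 3) array of the ring's atom positions.
    """
    centroid = ring_coords.mean(axis=0)
    # Two in-plane vectors give the plane normal via the cross product.
    v1 = ring_coords[1] - ring_coords[0]
    v2 = ring_coords[2] - ring_coords[0]
    normal = np.cross(v1, v2)
    return centroid, normal / np.linalg.norm(normal)

def is_pi_stacked(ring_a, ring_b, dist_cutoff=5.0, angle_cutoff=30.0):
    """Crude test for a parallel (sandwich or displaced) stacking geometry."""
    centroid_a, normal_a = ring_centroid_and_normal(ring_a)
    centroid_b, normal_b = ring_centroid_and_normal(ring_b)
    dist = np.linalg.norm(centroid_a - centroid_b)
    # Angle between the two ring planes, in degrees (0 means parallel).
    cos_angle = np.clip(abs(np.dot(normal_a, normal_b)), 0.0, 1.0)
    angle = np.degrees(np.arccos(cos_angle))
    return dist < dist_cutoff and angle < angle_cutoff

# Toy example: two parallel six-membered rings stacked 3.5 angstroms apart.
theta = np.linspace(0, 2 * np.pi, 6, endpoint=False)
ring_a = np.stack([np.cos(theta), np.sin(theta), np.zeros(6)], axis=1)
ring_b = ring_a + np.array([0.0, 0.0, 3.5])
print(is_pi_stacked(ring_a, ring_b))  # True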

At this point, you might be wondering why this type of interaction is called pi-stacking. The name refers to pi-bonds, a form of covalent chemical bond where the electron orbitals of two covalently bonded atoms overlap. In an aromatic ring, all the atoms in the ring participate in a joint pi-bond. This joint bond accounts for the stability of the aromatic ring and also explains many of its unique chemical properties.

For those readers who aren’t chemists, don’t worry too much if this material doesn’t make too much sense just yet. DeepChem abstracts away many of these implementation details, so you won’t need to worry much about pi-stacking on a regular basis when developing. However, it is useful to know that these interactions exist and play a major role in the underlying chemistry.

Intricate Geometries and Snapshots

In this section, we’ve introduced a number of interactions in terms of static geometric configurations. It’s very important to realize that bonds are dynamic entities, and that in real physical systems, bonds will stretch, break, and re-form with dizzying speed. Keep this in mind, and note that when someone says a salt bridge exists, what they really mean is that in some statistically average sense, a salt bridge is present more often than not at a particular location.

Fingerprints

From the previous chapter, you may recall the use of circular fingerprints. These fingerprints count the number of fragments of a given type in the molecule, then use a hash function to fit these fragment counts into a fixed-length vector. Such fragment counts can be used for 3D molecular complexes as well. Although merely counting the fragments is often insufficient to compute the geometry of the system, the knowledge of present fragments can nevertheless be useful for machine learning systems. This might perhaps be due to the fact that the presence of certain fragments can be strongly indicative of some molecular events.
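As a quick refresher, a circular fingerprint for a single molecule can be computed directly with RDKit. The molecule and parameters below are purely illustrative; the grid featurizer computes its own ECFP and SPLIF variants internally:

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

# An example small molecule (aspirin), specified as a SMILES string.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Hash circular fragments of radius 2 into a fixed-length bit vector.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
fp_array = np.array(fp)
print(fp_array.shape, int(fp_array.sum()))  # (2048,) and the number of set bits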

Some implementation details

To search for chemical features such as hydrogen bonds, the dc.feat.RdkitGridFeaturizer needs to be able to work effectively with the geometry of the molecule. DeepChem uses the RDKit library to load each molecule (both the protein and the ligand) into a common in-memory object. These molecules are then transformed into NumPy arrays that contain the positions of all the atoms in space. For example, a molecule with N atoms can be represented as a NumPy array of shape (N, 3), where each row holds the 3D position of one atom.
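For example, with RDKit, the coordinates of a molecule that already has a 3D conformer (one read from an SDF or PDB file, say) can be pulled into such an array with a couple of lines. The file name here is a hypothetical placeholder:

import numpy as np
from rdkit import Chem

# Load a ligand structure that already carries 3D coordinates.
mol = Chem.MolFromMolFile("ligand.sdf")  # hypothetical file

# GetPositions() returns an (N, 3) NumPy array of atom coordinates in angstroms.
coords = mol.GetConformer().GetPositions()
print(coords.shape)

# The atomic number of each atom, useful alongside the positions.
atomic_nums = np.array([atom.GetAtomicNum() for atom in mol.GetAtoms()])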

Then, performing a (crude) detection of a hydrogen bond simply requires looking at all pairs of atoms that could conceivably form a hydrogen bond (such as oxygen and hydrogen) that are sufficiently close to one another. The same computational strategy is used for detecting other kinds of bonds.  For handling aromatic structures, there’s a bit of special code to detect the presence of aromatic rings in the structure and compute their centroids.

Atomic Featurization

At the end of the previous section, we gave a brief overview of how features such as hydrogen bonds are computed by the RdkitGridFeaturizer. Most operations transform a molecule with N atoms into a NumPy array of shape (N, 3) and then perform a variety of extra computations starting from these arrays.

You can easily imagine that featurization for a given molecule could simply involve computing this (N, 3) array and passing it to a suitable machine learning algorithm. The model could then learn for itself what features were important, rather than relying on a human to select them and code them by hand.

In fact, this turns out to work—with a couple of extra steps. The (N, 3) position array doesn’t distinguish atom types, so you also need to provide another array that lists the atomic number of each atom. As a second implementation-driven note, computing pairwise distances between two position arrays of shape (N, 3) can be very computationally expensive. It’s useful to create “neighbor lists” in a preprocessing step, where the neighbor list maintains a list of neighboring atoms close to any given atom.
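Here is a small sketch of how such a neighbor list might be built with SciPy’s k-d tree. The 4 Å cutoff and the random coordinates are illustrative assumptions; DeepChem’s own featurizers handle this bookkeeping for you:

import numpy as np
from scipy.spatial import cKDTree

def build_neighbor_list(coords, cutoff=4.0):
    """Map each atom index to the indices of atoms within `cutoff` angstroms.

    coords is an (N, 3) NumPy array of atom positions.
    """
    tree = cKDTree(coords)
    pairs = tree.query_pairs(r=cutoff)  # set of (i, j) index pairs with i < j
    neighbors = {i: [] for i in range(len(coords))}
    for i, j in pairs:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return neighbors

# Toy example: 20 random "atoms" in a 10 angstrom box.
coords = np.random.rand(20, 3) * 10.0
neighbor_list = build_neighbor_list(coords)
print(neighbor_list[0])  # indices of atoms near atom 0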

DeepChem provides a dc.feat.ComplexNeighborListFragmentAtomicCoordinates featurizer that handles much of this for you. We will not discuss it further in this chapter, but it’s good to know that it exists as another option.

The PDBBind Case Study

With this introduction in place, let’s start tinkering with some code samples for handling biophysical datasets. We will start by introducing the PDBBind dataset and the problem of binding free energy prediction. We will then provide code examples of how to featurize the PDBBind dataset and demonstrate how to build machine learning models for it. We will end the case study with a discussion of how to evaluate the results.

PDBBind Dataset

The PDBBind dataset contains a large number of biomolecular crystal structures and their binding affinities. There’s a bit of jargon there, so let’s stop and unpack it. A biomolecule is any molecule of biological interest. That includes not just proteins, but also nucleic acids (such as DNA and RNA), lipids, and smaller drug-like molecules. Much of the richness of biomolecular systems results from the interactions of various biomolecules with one another (as we’ve discussed at length already). A binding affinity is an experimental measurement of how strongly two molecules bind to each other to form a complex. If it is energetically favorable to form such a complex, the molecules will spend more time in that configuration than in others.

The PDBBind dataset has gathered structures of a number of biomolecular complexes. The large majority of these are protein–ligand complexes, but the dataset also contains protein–protein, protein–nucleic acid, and nucleic acid–ligand complexes. For our purposes, we will focus on the protein–ligand subset. The full dataset contains close to 15,000 such complexes, with the “refined” and “core” sets containing smaller but cleaner subsets of complexes. Each complex is annotated with an experimental measurement of the binding affinity for the complex. The learning challenge for the PDBBind dataset is to predict the binding affinity for a complex given the protein–ligand structure.

The data for PDBBind is gathered from the Protein Data Bank. Note that the data in the PDB (and consequently PDBBind) is highly heterogeneous! Different research groups have different experimental setups, and there can be high experimental variance between different measurements by different groups. For this reason, we will primarily use the filtered refined subset of the PDBBind dataset for doing our experimental work.

Dynamics Matter!

In this case study, we treat the protein and ligand as a frozen snapshot. Note that this is very unphysical! The protein and ligand are actually in rapid movement, and the ligand will move in and out of the protein’s binding pocket. In addition, the protein may not even have one fixed binding pocket. For some proteins, there are a number of different sites where potential ligands interact.

All these combinations of factors mean that our models will have relatively limited accuracy. If we had more data, it might be possible that strong learning models could learn to account for these factors, but with more limited datasets, it is challenging to do so.

It’s worth keeping these limitations in mind throughout the rest of the chapter. The design of better biophysical deep learning models that accurately account for the thermodynamic behavior of these systems remains a major open problem.

What If You Don’t Have a Structure?

Drug discovery veterans might pause for a minute here. The fact is that it’s typically much harder experimentally to determine the structure of a complex than it is to measure a binding affinity. This makes sense intuitively. A binding affinity is a single number for a given biomolecular complex, while a structure is a rich 3D snapshot. Predicting the binding affinity from the structure might feel a little bit like putting the cart before the horse!

There are a couple of answers to this (fair) accusation. The first is that the problem of determining the binding affinity of a biomolecular system is a physically interesting problem in its own right. Checking that we can accurately predict such binding affinities is a worthy test problem to benchmark our machine learning methods and can serve as a stepping stone to designing deep architectures capable of understanding sophisticated biophysical systems.

The second answer is that we can use existing computational tooling, such as “docking” software, to predict approximate structures for a protein–ligand complex given that we have a structure for the protein in isolation. While predicting a protein structure directly is a formidable challenge, it’s somewhat easier to predict the structure of a protein–ligand complex when you already have the protein structure. So, you might well be able to make useful predictions with the system we will create in this case study by pairing it with a docking engine. Indeed, DeepChem has support for this use case, but we will not delve into this more advanced feature in this chapter.

In general, when doing machine learning, it can be particularly useful to take a look at individual data points or files within a dataset. In the code repository associated with this book, we’ve included the PDB file for a protein–ligand complex called 2D3U. It contains information about the amino acids (also called residues) of the protein. In addition, the PDB file contains the coordinates of each atom in 3D space. The units for these coordinates are angstroms (1 angstrom is 10⁻¹⁰ meters). The origin of this coordinate system is arbitrarily set to help in visualizing the protein, and is often set at the centroid of the protein. We recommend taking a minute to open this file in a text editor and look through it.

Why Is an Amino Acid Called a Residue?

As you spend more time working with biophysical data, you will commonly come across the terminology of an amino acid being called a residue. This refers to the chemistry of how proteins are formed. When two amino acids are joined together in a growing chain, an oxygen and two hydrogens are removed. A “residue” is what remains of an amino acid after this reaction takes place.

It can be very hard to understand the contents of a PDB file, so let’s visualize a protein. We will use the NGLview visualization package, which integrates well with Jupyter notebooks. In the notebook associated with this chapter in the code repository, you will be able to manipulate and interact with the visualized protein. For now, Figure 5-10 shows a visualization of a protein–ligand complex (2D3U) generated within the Jupyter notebook.

Figure 5-10. A visualization of the 2D3U protein–ligand complex from the PDBBind dataset. Note that the protein is represented in cartoon format for ease of visualization, and that the ligand (near the top-right corner) is represented in ball-and-stick format for full detail.
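In a Jupyter notebook, producing a view like this can be as simple as the following sketch. The file path is a hypothetical placeholder, and the representation and selection choices are just one reasonable styling:

import nglview as nv

# Load a protein–ligand complex from a local PDB file and display it.
view = nv.show_structure_file("2d3u_complex.pdb")  # hypothetical path
view.add_representation("cartoon", selection="protein")
view.add_representation("ball+stick", selection="ligand")
view  # as the last expression in a notebook cell, this renders the viewer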

Notice how the ligand rests in a sort of “pocket” in the protein. You can see this more clearly by rotating the visualization to look at it from different sides.

Protein Visualization Tools

Given the importance of visualizing proteins to work with them, there are a number of protein visualization tools available. While NGLview has amazing Jupyter integration, it’s more common to see other tools, such as VMD, PyMOL, or Chimera, in use by professional drug discoverers. Note, however, that these tools are often not fully open source, and may not feature a developer-friendly API. Nevertheless, if you plan to spend significant time working with protein structures, using one of these more established tools is probably still worth the trade-off.

Featurizing the PDBBind Dataset

Let’s start by building an RdkitGridFeaturizer object that we can inspect:

import deepchem as dc
grid_featurizer = dc.feat.RdkitGridFeaturizer(
                     voxel_width=2.0, 
                     feature_types=['hbond', 'salt_bridge', 'pi_stack', 
                                     'cation_pi', 'ecfp', 'splif'], 
                     sanitize=True, flatten=True)

There are a number of options here, so let’s pause and consider what they mean. The sanitize=True flag asks the featurizer to try to clean up any structures it is given. Recall from our earlier discussion that structures are often malformed. The sanitization step will attempt to fix any obvious errors that it detects. Setting flatten=True asks the featurizer to output a one-dimensional feature vector for each input structure.

The feature_types flag sets the types of biophysical and chemical features that the RdkitGridFeaturizer will attempt to detect in input structures. Note the presence of many of the chemical bonds we discussed earlier in the chapter: hydrogen bonds, salt bridges, etc. Finally, the option voxel_width=2.0 sets the size of the voxels making up the grid to 2 angstroms. The RdkitGridFeaturizer converts a protein to a voxelized representation for use in extracting useful features. For each spatial voxel, it counts biophysical features and also computes a local fingerprint vector. The RdkitGridFeaturizer computes two different types of fingerprints, the ECFP and SPLIF fingerprints.

Voxelization

What is voxelization? Broadly put, a voxel is the 3D analogue of a pixel (see Figure 5-11). Just as pixelized representations of images can be extraordinarily useful for handling imaging data, voxelized representations can be critical when working with 3D data.

Figure 5-11. A voxelized representation of a sphere. Note how each voxel represents a spatial cube of input.
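As a minimal illustration of the idea (not DeepChem’s actual implementation), atom coordinates can be binned into voxels by dividing each coordinate by the voxel width and counting how many atoms fall into each cell:

import numpy as np

def voxelize(coords, box_width=16.0, voxel_width=2.0):
    """Count atoms falling in each voxel of a cubic grid centered at the origin.

    coords is an (N, 3) array of atom positions in angstroms. Returns an array
    of shape (D, D, D) where D = box_width / voxel_width.
    """
    dim = int(box_width / voxel_width)
    grid = np.zeros((dim, dim, dim))
    # Shift coordinates so the box corner is at (0, 0, 0), then bin them.
    indices = np.floor((coords + box_width / 2.0) / voxel_width).astype(int)
    for idx in indices:
        if np.all(idx >= 0) and np.all(idx < dim):
            grid[tuple(idx)] += 1
    return grid

coords = np.array([[0.0, 0.0, 0.0], [1.5, -2.0, 3.0], [20.0, 0.0, 0.0]])
grid = voxelize(coords)
print(grid.shape, int(grid.sum()))  # (8, 8, 8) 2; the third atom falls outside the box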

We are now ready to load the PDBBind dataset. We don’t actually need to use the featurizer we just defined: MoleculeNet will take care of it for us. If we include the flag featurizer="grid" when loading the dataset, it will perform grid featurization automatically:

tasks, datasets, transformers = dc.molnet.load_pdbbind(
        featurizer="grid", split="random", subset="core")
train_dataset, valid_dataset, test_dataset = datasets

With this snippet, we’ve loaded and featurized the core subset of PDBBind. On a fast computer, this should run within 10 minutes. Featurizing the refined subset will take a couple of hours on a modern server.

Now that we have the data in hand, let’s start building some machine learning models. We’ll first train a classical model called a random forest:

from sklearn.ensemble import RandomForestRegressor                 
sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)

As an alternative, we will also try building a neural network for predicting protein–ligand binding. We can use the dc.models.MultitaskRegressor class to build an MLP with two hidden layers. We set the widths of the hidden layers to 2,000 and 1,000, respectively, and use 50% dropout to reduce overfitting:

n_features = train_dataset.X.shape[1]
model = dc.models.MultitaskRegressor(
        n_tasks=len(tasks),
        n_features=n_features,
        layer_sizes=[2000, 1000],
        dropouts=0.5,
        learning_rate=0.0003)
model.fit(train_dataset, nb_epoch=250)

Baseline Models

Deep learning models are tricky to optimize correctly at times. It’s easy for even experienced practitioners to make errors when tuning a deep model. For this reason, it’s critical to construct a baseline model that is more robust, even if it perhaps has lower performance.

Random forests are very useful choices for baselines. They often offer strong performance on learning tasks with relatively little tuning. A random forest constructs many “decision tree” models, each using only a random subset of the available features, then combines their individual predictions, by majority vote for classification or by averaging for regression.

Scikit-learn is an invaluable package for constructing simple machine learning baselines. We will use scikit-learn to construct a baseline model in this chapter, using the RdkitGridFeaturizer to featurize complexes as inputs to random forests.

Now that we have a trained model, we can proceed to checking its accuracy. In order to evaluate the accuracy of the model, we have to first define a suitable metric. Let’s use the Pearson R2 score. This is a number between 0 and 1, where 0 indicates no correlation between the true and predicted labels, and 1 indicates perfect correlation:

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)

Let’s now evaluate the accuracy of the models on the training and test datasets according to this metric. The code to do so is again shared between both models:

print("Evaluating model")
train_scores = model.evaluate(train_dataset, [metric], transformers)
test_scores = model.evaluate(test_dataset, [metric], transformers)

print("Train scores")
print(train_scores)

print("Test scores")
print(test_scores)

Many Architectures Can Have Similar Effects

In this section, we provide code examples of how to use an MLP with grid featurization to model protein–ligand structures in DeepChem. It’s important to note that there are a number of alternative deep architectures that have similar effects. There’s been a line of work on using 3D convolutional networks to predict protein–ligand binding interactions using voxel-based featurizations. Other work has used variants of the graph convolutions we saw in the previous chapter to handle macromolecular complexes.

What are the differences between these architectures? So far, it looks like most of them have similar predictive power. We use grid featurization because there’s a tuned implementation in DeepChem, but other models may serve your needs as well. Future versions of DeepChem will likely include additional architectures for this purpose.

For the random forest, this reports a training set score of 0.979 but a test set score of only 0.133. It does an excellent job of reproducing the training data but a very poor job of predicting the test data. Apparently it is overfitting quite badly.

In comparison, the neural network has a training set score of 0.990 and a test set score of 0.359. It does slightly better on the training set and much better on the test set. There clearly is still overfitting going on, but the amount is reduced and the overall ability of the model to predict new data is much higher.

Knowing the correlation coefficient is a powerful first step toward understanding the model we’ve built, but it’s always useful to directly visualize how our predictions correlate with actual experimental data. Figure 5-12 shows the true versus predicted labels for each of the models when run on the test set. We immediately see how the neural network’s predictions are much more closely correlated with the true data than are the random forest’s.

Figure 5-12. True versus predicted labels for the two models when run on the test set.
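A plot like Figure 5-12 takes only a few lines of matplotlib. This sketch assumes the trained model and test_dataset objects from earlier in the chapter, and compares predictions and labels directly in the dataset’s own units:

import matplotlib.pyplot as plt

# Predicted and true labels for the test set, flattened to 1D arrays.
predicted = model.predict(test_dataset).flatten()
true = test_dataset.y.flatten()

plt.scatter(true, predicted)
plt.xlabel("True binding affinity")
plt.ylabel("Predicted binding affinity")
plt.show()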

Conclusion

In this chapter you’ve learned about applying deep learning to biophysical systems, and in particular to the problem of predicting the binding affinity of protein–ligand systems. You might be curious how general the skills you’ve learned are. Could you apply these same models and techniques to understand other biophysical datasets? Let’s do a quick survey and tour.

Protein–protein and protein–DNA systems follow the same basic physics as protein–ligand systems at a high level. The same hydrogen bonds, salt bridges, pi-stacking interactions, and so on play a critical role. Could we just reuse the code from this chapter to analyze such systems? The answer turns out to be a little complicated. Many of the interactions that drive protein–ligand binding are electrostatic, driven by charges, while protein–protein interfaces may be governed more by bulk hydrophobic interactions. We won’t dig deeply into the meaning of these interactions, but they are qualitatively different in character from protein–ligand interactions. This could mean that the RdkitGridFeaturizer wouldn’t do a good job of characterizing them. On the other hand, it’s possible that the atomic convolutional models might do a better job of handling these systems, since much less about the physics of interactions is hardcoded into these deep models.

That said, there remains a significant problem of scale. The atomic convolutional models are quite slow to train and require a great deal of memory. Scaling these models to handle larger protein–protein systems would require additional work on the engineering end. The DeepChem development team is hard at work on these and other challenges, but more time may be required before these efforts reach fruition.

Antibody–antigen interactions are another form of critical biophysical interaction. Antibodies are Y-shaped proteins (see Figure 5-13) that have a variable “antigen-binding site” used to bind antigens. Here, antigens are molecules associated with a particular pathogen. Cells grown in culture can be harnessed to create antibodies that target specific antigens. If all the cells in a culture are clones of a given cell, then the antibodies produced will be identical. Such “monoclonal antibodies” have recently found wide therapeutic use.

Figure 5-13. Diagram of antibody–antigen interaction. (Source: Wikimedia.)

The design of antibodies has primarily been an experimental science until now. Part of this is due to the challenges involved in getting a 3D antibody structure. However, modeling the complex antigen–antibody binding site has also proven a challenge. Some of the techniques we’ve covered in this chapter may find fruitful use in antibody–antigen binding modeling over the next few years.

We’ve also alluded to the importance of dynamics in understanding protein physics. Could we not do deep learning directly on protein simulations to understand which ligands could bind to the protein? In principle yes, but formidable challenges remain. Some companies are actively working on this challenge, but good open source tooling is not yet available.

In Chapter 11, we will return to some biophysical techniques and show how these models can be very useful for drug discovery work. 
