Chapter 9. Generative Models

All the problems we have looked at so far involve, in some way, translating from inputs to outputs. You create a model that takes an input and produces an output. Then you train it on input samples from a dataset, optimizing it to produce the best output for each one.

Generative models are different. Instead of taking a sample as input, they produce a sample as output. You might train the model on a library of photographs of cats, and it would learn to produce new images that look like cats. Or, to give a more relevant example, you might train it on a library of known drug molecules, and it would learn to generate new “drug-like” molecules for use as candidates in a virtual screen. Formally speaking, a generative model is trained on a collection of samples that are drawn from some (possibly unknown, probably very complex) probability distribution. Its job is to produce new samples from that same probability distribution.

In this chapter, we will begin by describing the two most popular types of generative models: variational autoencoders and generative adversarial networks. We will then discuss a few applications of these models in the life sciences, and work through some code examples.

Variational Autoencoders

An autoencoder is a model that tries to make its output equal to its input. You train it on a library of samples and adjust the model parameters so that on every sample the output is as close as possible to the input.

That sounds trivial. Can’t it just learn to pass the input directly through to the output unchanged? If that were actually possible it would indeed be trivial, but autoencoders usually have architectures that make it impossible. Most often this is done by forcing the data to go through a bottleneck, as shown in Figure 9-1. For example, the input and output might each include 1,000 numbers, but in between would be a hidden layer containing only 10 numbers. This forces the model to learn how to compress the input samples. It must represent 1,000 numbers worth of information using only 10 numbers.

Figure 9-1. Structure of a variational autoencoder.

If the model needed to handle arbitrary inputs, that would be impossible. You can’t throw out 99% of the information and still reconstruct the input! But we don’t care about arbitrary inputs, only the specific ones in the training set (and others that resemble them). Of all possible images, far less than 1% look anything like cats. An autoencoder doesn’t need to work for all possible inputs, only ones that are drawn from a specific probability distribution. It needs to learn the “structure” of that distribution, figure out how to represent the distribution using much less information, and then be able to reconstruct the samples based on the compressed information.

Now let’s take the model apart. The middle layer, the one that serves as the bottleneck, is called the latent space of the autoencoder. It is the space of compressed representations of samples. The first half of the autoencoder is called the encoder. Its job is to take samples and convert them to compressed representations. The second half is called the decoder. It takes compressed representations in the latent space and converts them back into the original samples.
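
To make these pieces concrete, here is a minimal sketch of a plain (nonvariational) autoencoder with a 10-number bottleneck, written with Keras. The layer sizes and the choice of Keras are our own illustrative assumptions; the DeepChem model used later in this chapter is considerably more elaborate.

import tensorflow as tf

# Encoder: compress 1,000 input values down to a 10-number latent vector.
inputs = tf.keras.Input(shape=(1000,))
latent = tf.keras.layers.Dense(10, activation='relu')(inputs)

# Decoder: reconstruct the original 1,000 values from the latent vector.
outputs = tf.keras.layers.Dense(1000)(latent)

# Train the output to match the input by minimizing reconstruction error.
autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')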

This gives us our first clue about how autoencoders could be used for generative modeling. The decoder takes vectors in the latent space and converts them into samples, so we could take random vectors in the latent space (picking a random value for each component of the vector) and pass them through the decoder. If everything goes well, the decoder should produce a completely new sample that still resembles the ones it was trained on.

This sort of works, but not very well. The problem is that the encoder may only produce vectors in a small region of the latent space. If we pick a vector anywhere else in the latent space, we may get an output that looks nothing like the training samples. In other words, the decoder has only learned to work for the particular latent vectors produced by the encoder, not for arbitrary ones.

A variational autoencoder (VAE) adds two features to overcome this problem. First, it adds a term to the loss function that forces the latent vectors to follow a specified distribution. Most often they are constrained to have a Gaussian distribution with a mean of 0 and a variance of 1. We don’t leave the encoder free to generate vectors wherever it wants. We force it to generate vectors with a known distribution. That way, if we pick random vectors from that same distribution, we can expect the decoder to work well on them.

Second, during training we add random noise to the latent vector. The encoder converts the input sample to a latent vector, and then we randomly change it a little bit before passing it through the decoder, requiring the output to still be as close as possible to the original sample. This prevents the decoder from being too sensitive to the precise details of the latent vector. If we only change it by a little bit, the output should only change by a little bit.
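
To sketch how these two ideas appear in practice, the encoder can output a mean and a log-variance for each latent dimension; the latent vector is sampled around that mean, and a Kullback-Leibler (KL) divergence term in the loss penalizes any departure from a standard Gaussian. The NumPy snippet below only illustrates the arithmetic (the dimension count and random values are made up); a real VAE computes these same quantities inside the training graph.

import numpy as np

# Hypothetical encoder outputs for one sample, with 10 latent dimensions.
mean = np.random.normal(size=10)
log_var = np.random.normal(size=10)

# Sample the latent vector by adding scaled random noise to the mean.
latent = mean + np.exp(0.5 * log_var) * np.random.normal(size=10)

# KL divergence between N(mean, var) and the standard Gaussian N(0, 1).
kl_loss = -0.5 * np.sum(1 + log_var - mean**2 - np.exp(log_var))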

These changes do a good job of improving the results. VAEs are a popular tool for generative modeling: they produce excellent results on many problems.

Generative Adversarial Networks

A generative adversarial network (GAN) has much in common with a VAE. It uses the same kind of decoder network to convert latent vectors into samples (although in a GAN, it is called the generator instead of the decoder), but it trains that network in a different way: random vectors are passed into the generator, and the outputs are evaluated directly on how well they match the expected distribution. Effectively, you create a loss function to measure how well the generated samples match the training samples, then use that loss function to optimize the model.

That sounds simple for a few seconds, until you think about it and realize it isn’t simple at all. Could you write a loss function to measure how well an image resembles a cat? No, of course not! You wouldn’t know where to begin. So, instead of asking you to come up with that loss function yourself, a GAN learns the loss function from the data.

As shown in Figure 9-2, a GAN consists of two parts. The generator takes random vectors and generates synthetic samples. The second part, called the discriminator, tries to distinguish the generated samples from real training samples. It takes a sample as input and outputs a probability that this is a real training sample. It acts as a loss function for the generator.

Figure 9-2. Structure of a generative adversarial network.

Both parts are trained simultaneously. Random vectors are fed into the generator, and the output is fed into the discriminator. The parameters of the generator are adjusted to make the discriminator’s output as close as possible to 1, while the parameters of the discriminator are adjusted to make its output as close as possible to 0. In addition, real samples from the training set are fed into the discriminator, and its parameters are adjusted to make the output close to 1.
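
The snippet below is a small numeric sketch of those two objectives, using a hand-written binary cross-entropy and made-up discriminator outputs. It is only meant to show which target each network is pushed toward, not to serve as a working GAN.

import numpy as np

def bce(probs, target):
  # Binary cross-entropy between predicted probabilities and a 0/1 target.
  return -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))

# Hypothetical discriminator outputs for one batch.
d_real = np.array([0.9, 0.8, 0.7])  # probabilities assigned to real training samples
d_fake = np.array([0.2, 0.4, 0.1])  # probabilities assigned to generated samples

# Discriminator objective: push real samples toward 1 and generated samples toward 0.
discriminator_loss = bce(d_real, 1.0) + bce(d_fake, 0.0)

# Generator objective: push the discriminator's output on generated samples toward 1.
generator_loss = bce(d_fake, 1.0)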

This is the “adversarial” aspect. You can think of it as a competition between the generator and discriminator. The discriminator is constantly trying to get better at distinguishing real samples from fake ones. The generator is constantly trying to get better at fooling the discriminator.

Like VAEs, GANs are a popular type of generative model that produces good results on many problems. The two types of models have distinct strengths and weaknesses. Very roughly speaking, one might say that GANs tend to produce higher-quality samples, while VAEs tend to produce higher-quality distributions. That is, individual samples generated by a GAN will more closely resemble training samples, while the range of samples generated by a VAE will more closely match the range of training samples. Don’t take that statement too literally, though. It all depends on the particular problem and the details of the model. Also, countless variations on both approaches have been proposed. There are even models that combine a VAE with a GAN to try to get the best features of both. This is still a very active field of research, and new ideas are published frequently.

Applications of Generative Models in the Life Sciences

Now that we’ve introduced the basics of deep generative models, let’s talk about applications. Broadly speaking, generative models bring a few superpowers to the table. First, they allow for a semblance of “creativity”: new samples can be generated according to the learned distribution, providing a powerful complement to the human creative process in drug and protein design. Second, generative models that accurately capture complex systems could help scientists build an understanding of complex biological processes. We’ll discuss these ideas in more depth in this section.

Generating New Ideas for Lead Compounds

A major part of a modern drug discovery effort is coming up with new compounds. This is mostly done manually, with expert human chemists suggesting modifications to core structures. Often, this will involve projecting a picture of the current molecular series on a screen and having a room full of senior chemists suggest modifications to the core structure of the molecule. Some subset of these suggested molecules are actually synthesized and tested, and the process repeats until a suitable molecule is found or the program is dropped. This process has powerful advantages since it can draw upon the deep intuition of expert chemists, who may be able to identify flaws in a potential structure (perhaps it resembles a compound they’ve seen before which caused unexplained liver failure in rats) that may not be easy to identify algorithmically.

At the same time, though, this process is very human-limited. There aren’t that many talented and experienced senior chemists in the world, so the process can’t scale outward. In addition, it makes it very challenging for a pharmaceutical division in a country that has historically lacked drug discovery expertise to bootstrap itself. A generative model of molecular structures could serve to overcome these limitations. If the model were trained on a suitable molecular representation, it might be able to rapidly suggest new alternative compounds. Access to such a model could help improve current processes by suggesting new chemical directions that may have been missed by human designers. It’s worth noting that such design algorithms have serious caveats, though, as we will see a little later in this chapter.

Protein Design

Design of new enzymes and proteins is a major business these days. Engineered enzymes are used widely in modern manufacturing. (There’s a good chance your laundry detergent contains some enzymes!) However, in general, design of new enzymes has proven challenging. Some early work has shown that deep models can have some success at predicting protein function from sequence. It’s not unreasonable at all to envision using deep generative models to suggest new protein sequences that might have desired properties.

The introduction of generative models for this purpose could be even more impactful than for small molecule design. Unlike with small molecules, it can be very tricky for human experts to predict the downstream effects of mutations to a given protein. Using generative models can allow for richer protein design, enabling directions beyond the capability of human experts today.

A Tool for Scientific Discovery

Generative models can be a powerful tool for scientific discovery. For example, having an accurate generative model of a tissue development process could be extremely valuable to developmental biologists or as a tool in basic science. It might be possible to create “synthetic assays” where we can study tissue development in many combinations of environmental conditions by using the generative model to run rapid simulations. This future is still a ways off, since we’d need generative models that work effectively as initial conditions change. This will take some more research beyond the current state of the art. Nevertheless, the vision is exciting because generative modeling could allow for biologists to build effective models of extremely complex developmental and physiological processes and test their hypotheses of how these systems evolve.

The Future of Generative Modeling

Generative models are challenging! The first GANs were only capable of generating blurry images that were barely recognizable as faces. The latest GANs (at the time of writing) can generate images of faces that are more or less indistinguishable from real photographs. It is likely that over the next decade these models will be further refined to allow for generated video. These developments will have profound repercussions for modern societies. For much of the last century, photographs have routinely been used as “proof” of crimes, quality, and more. As generative tools develop, this standard of proof will fall short, since arbitrary images can be convincingly “photoshopped.” This development will pose a major challenge for criminal justice and even international relations.

At the same time, it’s likely that the advent of high-fidelity generative video will trigger a revolution in modern science. Imagine high-quality generative models of embryonic development! It might be feasible to model the effects of CRISPR genetic modifications or understand developmental processes in greater detail than has ever been possible. Improvements in generative models will have effects in other fields of science too. It’s likely that generative modeling will become a powerful tool in physics and climate science, allowing for more powerful simulations of complex systems. However, it’s worth emphasizing that these improvements remain in the future; much basic science must still be done before these models mature into reliable tools.

Working with Generative Models

Now let’s work through a code example. We will train a VAE to generate new molecules. More specifically, it will output SMILES strings. This choice of representation has distinct advantages and disadvantages compared to some of the other representations we have discussed. On the one hand, SMILES strings are very simple to work with. Each one is just a sequence of characters drawn from a fixed alphabet. That allows us to use a very simple model to process them. On the other hand, SMILES strings are required to obey a complex grammar. If the model does not learn all the subtleties of the grammar, then most of the strings it produces will be invalid and not correspond to any molecule.
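
As a quick illustration of that grammar, the following check uses RDKit (the example strings are our own, not drawn from any dataset): “c1ccccc1” is a valid SMILES string for benzene, while dropping the final ring-closure digit leaves the ring unclosed and makes the string unparseable.

from rdkit import Chem

print(Chem.MolFromSmiles('c1ccccc1') is not None)  # True: valid SMILES for benzene
print(Chem.MolFromSmiles('c1ccccc') is not None)   # False: ring 1 is never closed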

The first thing we need is a collection of SMILES strings on which to train the model. Fortunately, MoleculeNet provides us with lots to choose from. For this example, we will use the MUV dataset. The training set includes 74,469 molecules of varying sizes and structures. Let’s begin by loading it:

import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_muv()
train_dataset, valid_dataset, test_dataset = datasets
train_smiles = train_dataset.ids

Next, we need to define the vocabulary our model will work with. What is the list of characters (or “tokens”) that can appear in a string? How long are strings allowed to be? We can determine these from the training data by creating a sorted list of every character that appears in any training molecule:

# Collect every character that appears in any training SMILES string.
tokens = set()
for s in train_smiles:
  tokens = tokens.union(set(s))
tokens = sorted(list(tokens))
# The longest training string determines the maximum sequence length.
max_length = max(len(s) for s in train_smiles)

Now we need to create a model. What sort of architecture should we use for the encoder and decoder? This is an ongoing field of research, and various papers have been published suggesting different models. For this example, we will use DeepChem’s AspuruGuzikAutoEncoder class, which implements a particular published model. It uses a convolutional network for the encoder and a recurrent network for the decoder. You can consult the original paper if you are interested in the details, but they are not necessary to follow the example. Also notice that we use ExponentialDecay for the learning rate. The rate is initially set to 0.001, then decreased by a little bit (multiplied by 0.95) after every epoch. On many problems this helps optimization proceed more smoothly:

from deepchem.models.tensorgraph.optimizers import Adam, ExponentialDecay
from deepchem.models.tensorgraph.models.seqtoseq import AspuruGuzikAutoEncoder
model = AspuruGuzikAutoEncoder(tokens, max_length, model_dir='vae')
batches_per_epoch = len(train_smiles)/model.batch_size
learning_rate = ExponentialDecay(0.001, 0.95, batches_per_epoch)
model.set_optimizer(Adam(learning_rate=learning_rate))

We are now ready to train the model. Instead of using the standard fit() method that takes a Dataset, AspuruGuzikAutoEncoder provides its own fit_sequences() method. It takes a Python generator object that produces sequences of tokens (SMILES strings in our case). Let’s train for 50 epochs:

def generate_sequences(epochs):
  # Yield (input, target) pairs; for an autoencoder the target is the input itself.
  for i in range(epochs):
    for s in train_smiles:
      yield (s, s)

model.fit_sequences(generate_sequences(50))

If everything has gone well, the model should now be able to generate entirely new molecules. We just need to pick random latent vectors and pass them through the decoder. Let’s create a batch of one thousand vectors, each of length 196 (the size of the model’s latent space).

As noted previously, not all outputs will actually be valid SMILES strings. In fact, only a small fraction of them are. Fortunately, we can easily use RDKit to check them and filter out the invalid ones:

import numpy as np
from rdkit import Chem

# Sample 1,000 random latent vectors and decode each one into a sequence of tokens.
predictions = model.predict_from_embeddings(np.random.normal(size=(1000,196)))
molecules = []
for p in predictions:
  smiles = ''.join(p)
  # MolFromSmiles() returns None for strings that are not valid SMILES.
  if Chem.MolFromSmiles(smiles) is not None:
    molecules.append(smiles)
for m in molecules:
  print(m)
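
It is also useful to record what fraction of the decoded strings survive this check; as a rough rule, the higher that fraction, the better the model has learned the SMILES grammar. Using the variables defined above, it is a one-line calculation:

print('Fraction of valid SMILES:', len(molecules) / len(predictions))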

Analyzing the Generative Model’s Output

In addition to the problem of invalid outputs, many of the valid SMILES strings the model produces may correspond to molecules that are not characteristic of drugs. So, we need strategies that let us quickly identify molecules that are not drug-like. These strategies are best explained through a practical example. Let’s assume that this is the list of SMILES strings that came from our generative model:

smiles_list = ['CCCCCCNNNCCOCC',
'O=C(O)C(=O)ON/C=N/CO',
'C/C=N/COCCNSCNCCNN',
'CCCNC(C(=O)O)c1cc(OC(OC)[SH](=O)=O)ccc1N',
'CC1=C2C=CCC(=CC(Br)=CC=C1)C2',
'CCN=NNNC(C)OOCOOOOOCOOO',
'N#CNCCCCCOCCOC1COCNN1CCCCCCCCCCCCCCCCCCCOOOOOSNNCCCCCSCSCCCCCCCCCOCOOOSS',
'CCCC(=O)NC1=C(N)C=COO1',
'CCCSc1cc2nc(C)cnn2c1NC',
'CONCN1N=NN=CC=C1CC1SSS1',
'CCCOc1ccccc1OSNNOCCNCSNCCN',
'C[SH]1CCCN2CCN2C=C1N',
'CC1=C(C#N)N1NCCC1=COOO1',
'CN(NCNNNN)C(=O)CCSCc1ccco1',
'CCCN1CCC1CC=CC1=CC=S1CC=O',
'C/N=C/c1ccccc1',
'Nc1cccooo1',
'CCOc1ccccc1CCCNC(C)c1nccs1',
'CNNNNNNc1nocc1CCNNC(C)C',
'COC1=C(CON)C=C2C1=C(C)c1ccccc12',
'CCOCCCCNN(C)C',
'CCCN1C(=O)CNC1C',
'CCN',
'NCCNCc1cccc2c1C=CC=CC=C2',
'CCCCCN(NNNCNCCCCCCCCCCSCCCCCCCCCCCCCCNCCNCCCCSSCSSSSSSCCCCCCCCCCCCCSCCCCCSC)C(O)OCCN',
'CCCS1=CC=C(C)N(CN)C2NCC2=C1',
'CCNCCCCCCOc1cccc(F)c1',
'NN1O[SH](CCCCO)C12C=C2',
'Cc1cc2cccc3c(CO)cc-3ccc-2c1']

The first step in our analysis will be to examine the molecules and determine whether there are any that we want to discard. We can use some of the facilities in RDKit, which is included as part of DeepChem, to examine the molecules represented by these strings. In order to evaluate the strings, we must first convert them to molecule objects. We can do this using the following list comprehension:

molecules = [Chem.MolFromSmiles(x) for x in smiles_list] 

One factor we may want to examine is the size of the molecules. Molecules with fewer than 10 atoms are unlikely to generate sufficient interaction energy to produce a measurable signal in a biological assay. Conversely, molecules with more than 50 atoms may not be capable of dissolving in water and may create other problems in biological assays. We can get a rough estimate of the sizes of the molecules by calculating the number of non-hydrogen atoms in each molecule. The following code creates a list of the number of atoms in each molecule. For convenience, we sort the array so that we can more easily understand the distribution (if we had a larger list of molecules we would probably want to generate a histogram for this distribution):

print(sorted([x.GetNumAtoms() for x in molecules]))

The results are as follows:

[3, 8, 9, 10, 11, 11, 12, 12, 13, 14, 14, 14, 15, 
16, 16, 16, 17, 17, 17, 17, 18, 19, 19, 20, 20, 22, 24, 69, 80]

We can see that there are four very small molecules as well as two very large ones. We can use another list comprehension to keep only the molecules with more than 10 and fewer than 50 atoms:

good_mol_list = [x for x in molecules if x.GetNumAtoms() > 10
        and x.GetNumAtoms() < 50]
print(len(good_mol_list))
23

This list comprehension reduces our previous list of 29 molecules to 23.

In practice, we can use a number of other calculated properties to evaluate the quality of the generated molecules. Several recent generative model publications use calculated molecular properties to determine which of the generated molecules to retain or discard. One of the more common methods for determining whether molecules are similar to known drugs, or “drug-like,” is known as the quantitative estimate of drug-likeness (QED). The QED metric, which was originally published by Bickerton and coworkers,1 scores molecules by comparing a set of properties calculated for each molecule with distributions of the same properties in marketed drugs. This score ranges between 0 and 1, with values closer to 1 being considered more drug-like.

We can use RDKit to calculate QED values for our remaining molecules and retain only those molecules with QED > 0.5 as follows:

from rdkit.Chem import QED

qed_list = [QED.qed(x) for x in good_mol_list]
final_mol_list = [(a,b) for a,b in
         zip(good_mol_list,qed_list) if b > 0.5]

As our final step, we can visualize the chemical structures of final_mol_list and the corresponding QED scores:

from rdkit.Chem.Draw import MolsToGridImage

MolsToGridImage([x[0] for x in final_mol_list],
                molsPerRow=3, useSVG=True,
                subImgSize=(250, 250),
                legends=[f"{x[1]:.2f}" for x in final_mol_list])

The results are shown in Figure 9-3.

Figure 9-3. Chemical structures of the generated molecules along with their QED scores.

While these structures are valid and have reasonably high QED scores, they still contain functionality that may be chemically unstable. Strategies for identifying and removing problematic molecules like these are discussed in the next section.

Conclusion

While generative models provide an interesting means of producing ideas for new molecules, some key issues still need to be resolved to ensure their general applicability. The first is ensuring that the generated molecules will be chemically stable and that they can be physically synthesized. One current method to assess the quality of molecules produced by a generative model is to observe the fraction of the generated molecules that obey standard rules of chemical valence—in other words, ensuring that each carbon atom has four bonds, each oxygen atom has two bonds, each fluorine atom has one bond, and so on. These factors become especially important when decoding from a latent space with a SMILES representation. While a generative model may have learned the grammar of SMILES, there may be nuances that are still missing.

The fact that a molecule obeys standard rules of valence does not necessarily ensure that it will be chemically stable. In some cases, a generative model may produce molecules containing functional groups that are known to readily decompose. As an example, consider the molecule in Figure 9-4. The functional group highlighted in the circle, a hemiacetal, is known to decompose readily.

Figure 9-4. A molecule containing an unstable group.

In practice, the probability of this molecule existing and being chemically stable is very small. There are dozens of chemical functionalities like this which are known to be unstable or reactive. When synthesizing molecules in a drug discovery project, medicinal chemists know to avoid introducing these functional groups. One way of imparting this sort of “knowledge” to a generative model is to provide a set of filters that can be used to postprocess the model output and remove molecules that may be problematic. In Chapter 11, we will provide a further discussion of some of these filters and how they are used in virtual screening. Many of the same techniques used to identify potentially problematic screening compounds can also be used to evaluate virtual molecules that are created by a generative model.
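
As a small illustration of what such a filter can look like, the sketch below flags molecules containing a hemiacetal-like substructure using an RDKit SMARTS query. The SMARTS pattern is our own rough approximation for illustration only; production filter sets, such as those discussed in Chapter 11, consist of many carefully curated patterns.

from rdkit import Chem

# Rough SMARTS for a hemiacetal: an sp3 carbon bearing both a hydroxyl group
# and an ether oxygen. Illustrative only, not an exhaustive definition.
hemiacetal = Chem.MolFromSmarts('[CX4]([OX2H])[OX2][#6]')

def passes_filter(smiles):
  mol = Chem.MolFromSmiles(smiles)
  if mol is None:
    return False
  return not mol.HasSubstructMatch(hemiacetal)

print(passes_filter('CCOC(O)CC'))  # False: contains a hemiacetal-like group
print(passes_filter('Oc1ccccc1'))  # True: phenol, no hemiacetal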

In order to test the biological activity of a molecule produced by a generative model, that molecule must first be synthesized by a chemist. The science of organic chemical synthesis has a rich history going back more than one hundred years. In this time, chemists have developed thousands of chemical reactions to synthesize drugs and drug-like molecules. The synthesis of a drug-like molecule typically requires somewhere between 5 and 10 chemical reactions, often referred to as “steps.” While some drug-like molecules can be readily synthesized, the synthetic route to more complex drug molecules may require more than 20 steps. Despite more than 50 years of work on automating the planning of organic syntheses, much of the process is still driven by human intuition followed by trial and error.

Fortunately, recent developments in deep learning are providing new ways of planning the synthesis of drug-like molecules. A number of groups have published methods that use deep learning to propose routes that can be used to synthesize molecules. As input, the model is given a molecule, often referred to as a product, and the set of steps that were used to synthesize that molecule. By training with thousands of product molecules and the steps used for synthesis, a deep neural network is able to learn the relationship between product molecules and reaction steps. When presented with a new molecule, the model suggests a set of reactions that could be used to synthesize the molecule. In one test, the synthetic routes produced by these models were presented to human chemists for evaluation. These evaluators felt that the routes generated by the models were comparable in quality to routes generated by human chemists.

The application of deep learning to organic synthesis is a relatively new field. It is hoped that the field will continue to evolve and that these models will become an important tool for organic chemists. One can imagine a day in the not-too-distant future when these synthesis planning capabilities could be paired with robotic automation to create a fully automated platform. However, there are difficulties to overcome.

One potential roadblock in the broad adoption of deep learning in organic synthesis is data availability. The majority of the information used to train these models is in databases which are the property of a small number of organizations. If these organizations decide to only utilize this data for their internal efforts, the field will be left with very few alternatives.

Another factor that may limit the advance of generative models is the quality of the predictive models that are used to drive molecule generation. Regardless of the architecture used to develop a generative model, some function must be used to evaluate the generated molecules and to direct the search for new molecules. In some cases, we may be able to develop reliable predictive models. In other cases, the models may be less reliable. While we can test our models on external validation sets, it is often difficult to determine the scope of a predictive model. This scope, also known as the “domain of applicability,” is the degree to which one can extrapolate outside the molecules on which a model was trained. This applicability domain is not well defined, so it may be difficult to determine how well a model will work on novel molecules produced by a generative model.

Generative models are a relatively new technique, and it will be interesting to see how this field evolves in the coming years. As our ability to use deep learning to predict routes for organic synthesis and build predictive models improves, the power of generative models will continue to grow.

1 Bickerton, G. Richard, et al. “Quantifying the Chemical Beauty of Drugs.” Nature Chemistry 4 (2012): 90–98. https://doi.org/10.1038/nchem.1243.
