Chapter 11. A Virtual Screening Workflow Example

Virtual screening can provide an efficient and cost-effective means of identifying starting points for drug discovery programs. Rather than carrying out an expensive, experimental high-throughput screen (HTS), we can use computational methods to virtually evaluate millions, or even tens of millions, of molecules. Virtual screening methods are often grouped into two categories, structure-based virtual screening and ligand-based virtual screening. 

In a structure-based virtual screen, computational methods are used to identify molecules that will optimally fit into a cavity, known as a binding site, in a protein. The binding of a molecule into the protein binding site can often inhibit the function of the protein. For instance, proteins known as enzymes catalyze a variety of physiological chemical reactions. By identifying and optimizing inhibitors of these enzymatic processes, scientists have been able to develop treatments for a wide range of diseases in oncology, inflammation, infection, and other therapeutic areas.

In a ligand-based virtual screen, we search for molecules that function similarly to one or more known molecules. We may be looking to improve the function of an existing molecule, to avoid pharmacological liabilities associated with a known molecule, or to develop novel intellectual property. A ligand-based virtual screen typically starts with a set of known molecules identified through any of a variety of experimental methods. Computational methods are then used to develop a model based on experimental data, and this model is used to virtually screen a large set of molecules to find new chemical starting points.

In this chapter, we will walk through a practical example of a virtual screening workflow. We will examine the code used to carry out components of the virtual screen as well as the thought process behind decisions made throughout the analysis. In this particular case, we will carry out a ligand-based virtual screen. We will use a set of molecules known to bind to a particular protein, as well as a set of molecules assumed to not bind, to train a convolutional neural network to identify new molecules with the potential to bind to the target.

Preparing a Dataset for Predictive Modeling

As a first step, we will build a graph convolution model to predict the ability of molecules to inhibit a protein known as ERK2. This protein, also known as mitogen-activated protein kinase 1, or MAPK1, plays an important role in the signaling pathways that regulate how cells multiply. ERK2 has been implicated in a number of cancers, and ERK2 inhibitors are currently being tested in clinical trials for non-small-cell lung cancer and melanoma (skin cancer).

We will train the model to distinguish a set of ERK2 active compounds from a set of decoy compounds. The active and decoy compounds are derived from the DUD-E database, which is designed for testing predictive models. In practice, we would typically obtain active and inactive molecules from the scientific literature, or from a database of biologically active molecules such as the ChEMBL database from the European Bioinformatics Institute (EBI). In order to generate the best model, we would like to have decoys with property distributions similar to those of our active compounds. Let’s say this was not the case and the inactive compounds had lower molecular weight than the active compounds. In this case, our classifier might be trained simply to separate low molecular weight compounds from high molecular weight compounds. Such a classifier would have very limited utility in practice.

In order to better understand the dataset, let’s examine a few calculated properties of our active and decoy molecules. To build a reliable model, we need to ensure that the properties of the active molecules are similar to those of the decoy molecules.

First, let’s import the necessary libraries:

from rdkit import Chem             # RDKit libraries for chemistry functions 
from rdkit.Chem import Draw        # Drawing chemical structures 
import pandas as pd                # Dealing with data in tables 
from rdkit.Chem import PandasTools # Manipulating chemical data 
from rdkit.Chem import Descriptors # Calculating molecular descriptors 
from rdkit.Chem import rdmolops    # Additional molecular properties 
import seaborn as sns              # Making graphs 

In this exercise, molecules are represented using SMILES strings. For more information on SMILES, please see Chapter 4. We can now read a SMILES file into a Pandas dataframe and add an RDKit molecule to the dataframe. While the input SMILES file is not technically a CSV file, the Pandas read_CSV() function can read it as long as we specify the delimiter, which in this case is a space:

active_df = pd.read_CSV("mk01/actives_final.ism",header=None,sep=" ")
active_rows,active_cols = active_df.shape
active_df.columns = ["SMILES","ID","ChEMBL_ID"]
active_df["label"] = ["Active"]*active_rows
PandasTools.AddMoleculeColumnToFrame(active_df,"SMILES","Mol")

Let’s define a function to add the calculated properties to a dataframe:

def add_property_columns_to_df(df_in):
df_in["mw"] = [Descriptors.MolWt(mol) for mol in
df_in.Mol]
df_in["logP"] = [Descriptors.MolLogP(mol) for mol in
df_in.Mol]
df_in["charge"] = [rdmolops.GetFormalCharge(mol) for mol
in df_in.Mol]

With this function in hand, we can calculate the molecular weight, LogP, and formal charge of the molecules. These properties encode the size of a molecule, its ability to partition from an oily substance (octanol) to water, and whether the molecule has a positive or negative charge. Once we have these properties we can compare the distributions for the active and decoy sets:

add_property_columns_to_df(active_df)

Let’s look at the first few rows of our dataframe to ensure that the contents of the dataframe match the input file (see Table 11-1):

active_df.head()
Table 11-1. The first few lines of the active_df dataframe.
  SMILES ID ChEMBL_ID label
0 Cn1ccnc1Sc2ccc(cc2Cl)Nc3c4cc(c(cc4ncc3C#N)OCCCN5CCOCC5)OC 168691 CHEMBL318804 Active
1 C[C@@]12[C@@H]([C@@H](CC(O1)n3c4ccccc4c5c3c6n2c7ccccc7c6c8c5C(=O)NC8)NC)OC 86358 CHEMBL162 Active
2 Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3)Cl)Nc4cccc5c4OC(O5)(F)F 575087 CHEMBL576683 Active
3 Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3)Cl)Nc4cccc5c4OCO5 575065 CHEMBL571484 Active
4 Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3)Cl)Nc4cccc5c4CCC5 575047 CHEMBL568937 Active

Now let’s do the same thing with the decoy molecules:

decoy_df = pd.read_CSV("mk01/decoys_final.ism",header=None,sep=" ")
decoy_df.columns = ["SMILES","ID"]
decoy_rows, decoy_cols = decoy_df.shape
decoy_df["label"] = ["Decoy"]*decoy_rows
PandasTools.AddMoleculeColumnToFrame(decoy_df,"SMILES","Mol")
add_property_columns_to_df(decoy_df)

In order to build a model, we need a single dataframe with the active and decoy molecules. We can use the Pandas append function to add the two dataframes and create a new dataframe called tmp_df:

tmp_df = active_df.append(decoy_df)

With properties calculated for both the active and decoy sets, we can compare the properties of the two sets of molecules. To do the comparison, we will use violin plots. A violin plot is analogous to a boxplot. The violin plot provides a mirrored, horizontal view of a frequency distribution. Ideally, we would like to see similar distributions for the active and decoy sets. The results are shown in Figure 11-1:

sns.violinplot(tmp_df["label"],tmp_df["mw"])
Figure 11-1. Violin plots of molecular weight for the active and decoy sets.

An examination of these plots shows that the molecular weight distributions for the two sets are roughly equivalent. The decoy set has more low molecular weight molecules, but the center of the distribution, shown as a box in the middle of each violin plot, is in a similar location in both plots.

We can use violin plots to perform a similar comparison of the LogP distributions (Figure 11-2). Again, we can see that the distributions are similar, with a few more of the decoy molecules at the lower end of the distribution:

sns.violinplot(tmp_df["label"],tmp_df["logP"])
Violin plots of LogP for the active and decoy sets.
Figure 11-2. Violin plots of LogP for the active and decoy sets.

Finally, we perform the same comparison with the formal charges of the molecules (Figure 11-3):

sns.violinplot(new_tmp_df["label"],new_tmp_df["charge"])
Violin plots of formal charge for the active and decoy sets.
Figure 11-3. Violin plots of formal charge for the active and decoy sets.

In this case, we see a significant difference. All of the active molecules are neutral, having charges of 0, while some of the decoys are charged, with charges of +1 or –1. Let see what fraction of the decoy molecules are charged. We can do this by creating a new dataframe with just the charged molecules:

charged = decoy_df[decoy_df["charge"] != 0]

A Pandas dataframe has a property, shape, that returns the number of rows and columns in the dataframe. As such, element [0] in the shape property will be the number of rows. Let’s divide the number of rows in our dataframe of charged molecules by the total number of rows in the decoy dataframe:

charged.shape[0]/decoy_df.shape[0]

This returns 0.162. As we saw in the violin plot, approximately 16% of the decoy molecules are charged. This appears to be because the active and decoy sets were not prepared in a consistent fashion. We can address this problem by modifying the chemical structures of the decoy molecules to neutralize their charges. Fortunately we can do this easily with the NeutraliseCharges() function from the RDKit Cookbook:

from neutralize import NeutraliseCharges

In order to avoid confusion, we create a new dataframe with the SMILES stings, IDs, and labels for the decoys:

revised_decoy_df = decoy_df[["SMILES","ID","label"]].copy()

With this new dataframe in hand, we can replace the original SMILES strings with the strings for the neutral forms of the molecules. The NeutraliseCharges function returns two values. The first is the SMILES string for the neutral form of the molecule and the second is a Boolean variable indicating whether the molecule was changed. In the following code, we only need the SMILES string, so we use the first element of the tuple returned by NeutraliseCharges.

revised_decoy_df["SMILES"] = [NeutraliseCharges(x)[0] for x
in revised_decoy_df["SMILES"]]

Once we’ve replaced the SMILES strings, we can add a molecule column to our new dataframe and calculate the properties again:

PandasTools.AddMoleculeColumnToFrame(revised_decoy_df,"SMILES","Mol")
add_property_columns_to_df(revised_decoy_df)

We can then append the dataframe with the active molecules to the one with the revised, neutral decoys:

new_tmp_df = active_df.append(revised_decoy_df)

Next, we can generate a new boxplot to compare the charge distributions of the active molecules with those of our neutralized decoys (Figure 11-4):

sns.violinplot(new_tmp_df["label"],new_tmp_df["charge"])
Violin plots of the charge distribution for our revised decoy set.
Figure 11-4. Violin plots of the charge distribution for our revised decoy set.

An examination of the plots shows that there are now very few charged molecules in the decoy set. We can use the same technique we used earlier to create a dataframe with only the charged molecules. We then use this dataframe to determine the number of charged molecules remaining in the set:

charged = revised_decoy_df[revised_decoy_df["charge"] != 0]
charged.shape[0]/revised_decoy_df.shape[0]

The result now is 0.003. We have reduced the fraction of charged molecules from 16% to 0.3%. We can now be confident that our active and decoy sets are reasonably well balanced.

In order to use these datasets with DeepChem, we need to write the molecules out as a CSV file containing for each molecule the SMILES string, ID, Name, and an integer value indicating whether the compounds are active (labeled as 1) or inactive (labeled as 0):

active_df["is_active"] = [1] * active_df.shape[0]
revised_decoy_df["is_active"] = [0] * revised_decoy_df.shape[0]
combined_df = active_df.append(revised_decoy_df)[["SMILES","ID","is_active"]]
combined_df.head()

The first five lines are shown in Table 11-2.

Table 11-2. The first few lines of our new combined dataframe
  SMILES ID is_active
0 Cn1 ccnc1Sc2ccc(cc2Cl}Nc3c4cc(c(cc4ncc3C#N}OCCCN5CCOCC5)OC 168691 1
1 C[C@@]12[C@@H]([C@@H](CC(O1)n3c4ccccc4c5c3c6n2c7ccccc7c6c8c5C(=O)NC8)NC)OC 86358 1
2 Cc1cnc(nc1c2cc([nH]c2)C(=0) N[C@H](CO)c3cccc(c3}Cl}Nc4cccc5c4OC(O5)(F)F 575087 1
3 CCc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3}Cl}Nc4cccc5c4OCO5 575065 1
4 Cc1cnc(nc1c2cc([nH]c2)C(=0) N[C@H](CO)c3cccc(c3}Cl}Nc4cccc5c4CCC5 575047 1

Our final step in this section is to save our new combined_df as a CSV file. The index=False option causes Pandas to not include the row number in the first column:

combined_df.to_csv("dude_erk1_mk01.CSV",index=False)

Training a Predictive Model

Now that we have taken care of formatting, we can use this data to train a graph convolution model. First, we need to import the necessary libraries. Some of these libraries were imported in the first section, but let’s assume we are starting with the CSV file we created in the previous section:

import deepchem as dc                      # DeepChem libraries
from deepchem.models import GraphConvModel # Graph convolutions 
import numpy as np                         # NumPy for numeric operations 
import sys                                 # Error handling 
import pandas as pd                        # Data table manipulation 
import seaborn as sns                      # Seaborn library for plotting 
from rdkit.Chem import PandasTools         # Chemical structures in Pandas 

Now let’s define a function to create a GraphConvModel. In this case, we will be creating a classification model. Since we will apply the model later on a different dataset, it’s a good idea to create a directory in which to store the model. You will need to change the directory to something accessible on your filesystem:

def generate_graph_conv_model():
batch_size = 128
model = GraphConvModel(1, batch_size=batch_size,
mode='classification',
model_dir="/tmp/mk01/model_dir")
return model

In order to train the model, we first read in the CSV file we created in the previous section:

dataset_file = "dude_erk2_mk01.CSV"
tasks = ["is_active"]
featurizer = dc.feat.ConvMolFeaturizer()
loader = dc.data.CSVLoader(tasks=tasks,
smiles_field="SMILES",
featurizer=featurizer)
dataset = loader.featurize(dataset_file, shard_size=8192)

Now that we have the dataset loaded, let’s build a model. We will create training and test sets to evaluate the model’s performance. In this case, we will use the RandomSplitter (DeepChem offers a number of other splitters too, such as the ScaffoldSplitter, which divides the dataset by chemical scaffold, and the ButinaSplitter, which first clusters the data then splits the dataset so that different clusters end up in the training and test sets):

splitter = dc.splits.RandomSplitter()

With the dataset split, we can train a model on the training set and test that model on the validation set. At this point, we need to define some metrics and evaluate the performance of our model. In this case, our dataset is unbalanced: we have a small number of active compounds and a large number of inactive compounds. Given this difference, we need to use a metric that reflects the performance on unbalanced datasets. One metric that is appropriate for datasets like this is the Matthews correlation coefficient (MCC):

metrics = [
dc.metrics.Metric(dc.metrics.matthews_corrcoef, np.mean,
mode="classification")]

In order to evaluate the performance of our model, we will perform 10 folds of cross validation, where we train a model on the training set and validate on the validation set:

training_score_list = []
validation_score_list = []
transformers = []
cv_folds = 10
for i in range(0, cv_folds):
model = generate_graph_conv_model()
res = splitter.train_valid_test_split(dataset)
train_dataset, valid_dataset, test_dataset = res
model.fit(train_dataset)
train_scores = model.evaluate(train_dataset, metrics,
transformers)
training_score_list.append(
train_scores["mean-matthews_corrcoef"])
validation_scores = model.evaluate(valid_dataset,
metrics,
transformers)
validation_score_list.append(
validation_scores["mean-matthews_corrcoef"])
print(training_score_list)
print(validation_score_list)

To visualize the performance of our model on the training and test data, we can make boxplots. The results are shown in Figure 11-5:

sns.boxplot(
["training"] * cv_folds + ["validation"] * cv_folds,
training_score_list + validation_score_list)
Boxplots of scores for the training and validation sets.
Figure 11-5. Boxplots of scores for the training and validation sets.

The plots indicate that, as expected, the performance on the training set is superior to that on the validation set. However, the performance on the validation set is still quite good. At this point, we can be confident in the performance of our model.

It is also useful to visualize the results of our model. In order to do this, we will generate a set of predictions for a validation set:

pred = [x.flatten() for x in model.predict(valid_dataset)]

To make processing easier, we’ll create a Pandas dataframe with the predictions:

pred_df = pd.DataFrame(pred,columns=["neg","pos"])

We can easily add the activity class (1 = active, 0 = inactive) and the SMILES strings for our predicted molecules to the dataframe:

pred_df["active"] = [int(x) for x in valid_dataset.y]
pred_df["SMILES"] = valid_dataset.ids

It’s always a good idea to look at the first few lines of the dataframe to ensure that the data makes sense. Table 11-3 shows the results.

Table 11-3. The first few lines of the dataframe containing the predictions
  neg pos active SMILES
0 0.906081 0.093919 1 Cn1ccnc1Sc2ccc(cc2Cl)Nc3c4cc(c(cc4ncc3C#N)OCCC...
1 0.042446 0.957554 1 Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3...
2 0.134508 0.865492 1 Cc1cccc(c1)[C@@H](CO)NC(=O)c2cc(c[nH]2)c3c(cnc...
3 0.036508 0.963492 1 Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3ccccc3)...
4 0.940717 0.059283 1 c1c2c([nH]c1Br)C(=O)NCC/C2=C/3C(=O)N=C(N3)N

Creating boxplots enables us to compare the predicted values for the active and inactive molecules (see Figure 11-6).

sns.boxplot(pred_df.active,pred_df.pos)
Positive scores for the predicted molecules.
Figure 11-6. Positive scores for the predicted molecules.

The performance of our model is very good: we can see a clear separation between the active and inactive molecules. When building a predictive model it is often important to examine inactive molecules that are predicted as active (false positives) as well as active molecules that are predicted as inactive (false negatives). It appears that only one of our active molecules received a low positive score. In order to look more closely, we will create a new dataframe containing all of the active molecules with a positive score < 0.5):

false_negative_df = pred_df.query("active == 1 & pos < 0.5").copy()

To inspect the chemical structures of the molecules in our dataframe, we use the PandasTools module from RDKit:

PandasTools.AddMoleculeColumnToFrame(false_negative_df,
"SMILES", "Mol")

Let’s look at the new dataframe (Figure 11-7):

false_negative_df
False negative predictions.
Figure 11-7. False negative predictions.

In order to fully take advantage of the information in this dataframe, we need to have some knowledge of medicinal chemistry. It is often informative to look at the chemical structures of the false negative molecules and compare these with the chemical structures of the true positive molecules. This may provide some insight into the reasons that molecules were not predicted correctly. Often it may be the case that the false negative molecules are not similar to any of the true positive molecules. In this case, it may be worth performing additional literature searches to increase the diversity of the molecules in the training set.

We can use a similar approach to examine the false positive molecules, which are inactive but received a positive score > 0.5 (see Figure 11-8). Again, comparison with the chemical structures of the true positive molecules may be informative:

false_positive_df = pred_df.query(
    "active == 0 & pos > 0.5").copy()
PandasTools.AddMoleculeColumnToFrame(false_positive_df,
                                     "SMILES", "Mol")
false_positive_df
A false positive molecule.
Figure 11-8. A false positive molecule.

During the model training phase, our objective was to evaluate the performance of our model. As such, we trained the model on a portion of the data and validated the model on the remainder. Now that we have evaluated the performance, we want to generate the most effective model. In order to do this, we will train the model on all of the data:

model.fit(dataset)

This gives us an accuracy score of 91%. Finally, we save the model to disk so that we can use it to make future predictions:

model.save()

Preparing a Dataset for Model Prediction

Now that we’ve created a predictive model, we can apply this model to a new set of molecules. In many cases, we will build a predictive model based on literature data, then apply that model to a set of molecules that we want to screen. The molecules we want to screen may come from an internal database or from a commercially available screening collection. As an example, we will use the predictive model we created to screen a small sample of 100,000 compounds from the ZINC database, a collection of more than 1 billion commercially available molecules.

One potential source of difficulty when carrying out a virtual screen is the presence of molecules that have the potential to interfere with biological assays. Over the last 25 years, many groups within the scientific community have developed sets of rules to identify potentially reactive or problematic molecules. Several of these rule sets, which are encoded as SMARTS strings, have been collected by the group that curates the ChEMBL database. These rule sets have been made available through a Python script called rd_filters.py. In this example, we will use rd_filters.py to identify potentially problematic molecules in our set of 100,000 molecules from the ZINC database.

The rd_filters.py script and associated data files are available on our GitHub repository.

The available modes and arguments for the script can be obtained by calling it with the -h flag.

rd_filters.py -h
Usage:
rd_filters.py $ filter --in INPUT_FILE --prefix PREFIX [--rules RULES_FILE_NAME] 
[--alerts ALERT_FILE_NAME][--np NUM_CORES]
rd_filters.py $ template --out TEMPLATE_FILE [--rules RULES_FILE_NAME]
Options:
--in INPUT_FILE input file name
--prefix PREFIX prefix for output file names
--rules RULES_FILE_NAME name of the rules JSON file
--alerts ALERTS_FILE_NAME name of the structural alerts file
--np NUM_CORES the number of cpu cores to use (default is all)
--out TEMPLATE_FILE parameter template file name

To call the script on our input file, which is called zinc_100k.smi, we can specify the input file and a prefix for output filenames. The filter argument calls the script in “filter” mode, where it identifies potentially problematic molecules. The --prefix argument indicates that the output file names will start with the prefix zinc.

rd_filters.py filter --in zinc_100k.smi --prefix zinc
using 24 cores
Using alerts from Inpharmatica
Wrote SMILES for molecules passing filters to zinc.smi
Wrote detailed data to zinc.CSV
68752 of 100000 passed filters 68.8%
Elapsed time 15.89 seconds

The output indicates the following:

  • The script is running on 24 cores. It runs in parallel across multiple cores, and the number of cores can be selected with the -np flag.

  • The script is using the “Inpharmatica” set of rules. This rule set covers a large range of chemical functionality that has been shown to be problematic in biological assays. In addition to the Inpharmaticia set, the script has seven other rule sets available. Please see the rd_filters.py documentation for more information.

  • SMILES strings for the molecules passing the filters were written to a file called zinc.smi. We will use this as the input when we use the predictive model.

  • Detailed information on which compounds triggered particular structural alerts was written to a file called zinc.CSV.

  • 69% of the molecules passed the filters, and 31% were considered problematic.

It is informative to take a look at the reasons why 31% of the molecules were rejected. This can let us know whether we need to adjust any of the filters. We will use a bit of Python code to look at the first few lines of output, shown in Table 11-4.

import pandas as pd
df = pd.read_CSV("zinc.CSV")
df.head()
Table 11-4. The first few lines of the dataframe created from zinc.CSV
  SMILES NAME FILTER MW LogP HBD
0 CN(CCO)C[C@@H](O)Cn1cnc2c1c(=O)n(C)c(=O)n2C ZINC000000000843 Filter82_pyridinium >0 311.342 –2.2813 2
1 O=c1[nH]c(=O)n([C@@H]2C[C@@H](O)[C@H](CO)O2)cc1Br ZINC000000001063 Filter82_pyridinium >0 307.100 –1.0602 3
2 Cn1c2ncn(CC(=O)N3CCOCC3)c2c(=O)n(C)c1=O ZINC000000003942 Filter82_pyridinium >0 307.310 –1.7075 0
3 CN1C(=O)C[C@H](N2CCN(C(=O)CN3CCCC3)CC2)C1=O ZINC000000036436 OK 308.382 –1.0163 0
4 CC(=O)NC[C@H](O)[C@H]1O[C@H]2OC(C)(C)O[C@H]2[C... ZINC 000000041101 OK 302.327 -1.1355 3

The dataframe has six columns:

SMILES
the SMILES strings for each molecule.
NAME
the molecule NAME, as listed in the input file.
FILTER
the reason the molecule was rejected, or “OK” if the molecule was not rejected.
MW
the molecular weight of the molecule. By default, molecules with molecular weight greater than 500 are rejected.
LogP
the calculated octanol/water partition coefficient of the molecule. By default, molecules with LogP greater than five are rejected.
HBD
the number of hydrogen bond donors. By default, molecules with more than 5 hydrogen bond donors are rejected.

We can use the Counter class from the Python collections library to identify which filters were responsible for removing the largest numbers of molecules (see Table 11-5):

from collections import Counter
count_list = list(Counter(df.FILTER).items())
count_df = pd.DataFrame(count_list,columns=["Rule","Count"])
count_df.sort_values("Count",inplace=True,ascending=False)
count_df.head()
Table 11-5. Counts of the number of molecules selected by the top 5 filters
  Rule Count
1 OK 69156
6 Filter41_12_dicarbonyl > O 19330
0 Filter82_pyridinium > O 7761
10 Filter93_acetyl_urea > O 1541
11 Filter78_bicyclic_lmide > O 825


The first line in the table, labeled as “OK,” indicates the number of molecules that were not eliminated by any of the filters. From this, we can see that 69,156 of the molecules in our input passed all of the filters. The largest number of molecules (19,330) were rejected because they contained a 1,2-dicarbonyl group. Molecules of this type may react and form covalent bonds with protein residues such as serine and cysteine. We can find the SMARTS pattern used to identify these molecules by looking for the string “Filter41_12_dicarbonyl” in the filter_collection.CSV file that is part of the rd_filters.py distribution. The SMARTS pattern is “*C(=O)C(=O)*”, which represents:

  • Any atom, connected to
  • carbon double bonded to oxygen, connected to
  • carbon double bonded to oxygen, connected to
  • any atom.

It is always good to look at the data and ensure that everything is working as expected. We can use the highlightAtomLists argument to RDKit’s MolsToGridImage() function to highlight the 1,2-dicarbonyl functionality (see Figure 11-9):

from rdkit import Chem
from rdkit.Chem import Draw

mol_list = [Chem.MolFromSmiles(x) for x in smiles_list]
dicarbonyl = Chem.MolFromSmarts('*C(=O)C(=O)*')
match_list = [mol.GetSubstructMatch(dicarbonyl) for mol in
              mol_list]
Draw.MolsToGridImage(mol_list,
                     highlightAtomLists=match_list,
                     molsPerRow=5)

We can see that the molecules do indeed have dicarbonyl groups, as highlighted in the figure. If we wanted to, we could similarly evaluate other filters. At this point, however, we can be satisfied with the results of the filtering. We have removed the problematic molecules from the set we plan to use for our virtual screen. We can now use this set, which is in the file zinc.smi, in the next step of this exercise.

Molecules containing a 1,2-dicarbonyl group.
Figure 11-9. Molecules containing a 1,2-dicarbonyl group.

Applying a Predictive Model

The GraphConvMdel we created can now be used to search the set of commercially available compounds we just filtered. Applying the model requires a few steps:

  1. Load the model from disk.

  2. Create a featurizer.

  3. Read and featurize the molecules that will run through the model.

  4. Examine the scores for the predictions.

  5. Examine the chemical structures of the top predicted molecules.

  6. Cluster the selected molecules.

  7. Write the selected molecules from each cluster to a CSV file.

We begin by importing the necessary libraries:

import deepchem as dc                           # DeepChem libraries
import pandas as pd                             # Pandas for tables
from rdkit.Chem import PandasTools, Draw        # Chemistry in Pandas
from rdkit import DataStructs                   # For fingerprint handling
from rdkit.ML.Cluster import Butina             # Cluster molecules
from rdkit.Chem import rdMolDescriptors as rdmd # Descriptors
import seaborn as sns                           # Plotting

and loading the model we generated earlier:

model = dc.models.TensorGraph.load_from_dir(""/tmp/mk01/model_dir"")

To generate predictions from our model, we first need to featurize the molecules we plan to use to generate predictions. We do this by instantiating a DeepChem ConvMolFeaturizer:

featurizer = dc.feat.ConvMolFeaturizer()

In order to featurize the molecules, we need to transform our SMILES file into a CSV file. In order to create a DeepChem featurizer we also require an activity column, so we add one, then write the file to CSV:

df = pd.read_CSV("zinc.smi",sep=" ",header=None)
df.columns=["SMILES","Name"]
rows,cols = df.shape
# Just add add a dummy column to keep the featurizer happy
df["Val"] = [0] * rows 

As before, we should look at the first few lines of the file (shown in Table 11-6) to make sure everything is as we had expect:

df.head()
Table 11-6. The first few lines of the input file
  SMILES Name Val
0 CN1C(=O)C[C@H](N2CCN(C(=O)CN3CCCC3)CC2)C1=O ZINC000000036436 0
1 CC(=O)NC[C@H](O)[C@H]1O[C@H]2OC(C)(C)O[C@H]2[C@@H]1NC(C)=O ZINC000000041101 0
2 C1CN(c2nc(-c3nn[nH]n3)nc(N3CCOCC3)n2)CCO1 ZINC000000054542 0
3 OCCN(CCO)c1nc(Cl)nc(N(CCO)CCO)n1 ZINC000000109481 0
4 COC(=O)c1ccc(S(=O)(=O)N(CCO)CCO)n1C ZINC000000119782 0

Note that the Val column is just a placeholder to keep the DeepChem featurizer happy. The file looks good, so we will write it as a CSV file to use as input for DeepChem. We use the index=False argument to prevent Pandas from writing the row numbers as the first column:

infile_name = "zinc_filtered.CSV"
df.to_CSV(infile_name,index=False)

We can use DeepChem to read this CSV file with a loader and featurize the molecules we plan to predict:

loader = dc.data.CSVLoader(tasks=['Val'],
                           smiles_field="SMILES",
                           featurizer=featurizer)
dataset = loader.featurize(infile_name, shard_size=8192)

The featurized molecules can be used to generate predictions with the model:

pred = model.predict(dataset)

For convenience,we will put the predictions into a Pandas dataframe:

pred_df = pd.DataFrame([x.flatten() for x in pred],
columns=["Neg", "Pos"]

The distribution plot, available in the Seaborn library, provides a nice overview of the distribution of scores. Unfortunately, in virtual screening, there are no clear rules for defining an activity cutoff. Often the best strategy is to look at the distribution of scores, then select a set of the top-scoring molecules. If we look at the plot in Figure 11-10, we can see that there are only a small number of molecules with scores above 0.3. We can use this value as a preliminary cutoff for molecules that we may want to screen experimentally.

Distribution plot of the scores for the predicted molecules.
Figure 11-10. Distribution plot of the scores for the predicted molecules.

We can join the dataframe with the scores to the dataframe with the SMILES strings. This gives us the ability to view the chemical structures of the top-scoring molecules:

combo_df = df.join(pred_df, how="outer")
combo_df.sort_values("Pos", inplace=True, ascending=False)

As we saw earlier, adding a molecule column to the dataframe enables us to look at the chemical structures of the hits (see Figure 11-11).

Chemical structures of the top-scoring molecules.
Figure 11-11. Chemical structures of the top-scoring molecules.

Based on what we see here, it looks like many of the hits are similar. Let’s look at a few more molecules (Figure 11-12):

Draw.MolsToGridImage(combo_df.Mol[:10], molsPerRow=5,
                    legends=["%.2f" % x for x in combo_df.Pos[:10]])
Structure grid with top-scoring hits. Values below the structures are model scores.
Figure 11-12. Structure grid with top-scoring hits. Values below the structures are model scores.

Indeed, many of the molecules are very similar and might end up being redundant in our screen. One way to be more efficient would be to cluster the molecules and only screen the highest-scoring molecule in each cluster. RDKit has an implementation of the Butina clustering method, one of the most highly used methods in cheminformatics. In the Butina clustering method, we group molecules based on their chemical similarity, which is calculated using a comparison of bit vectors (arrays of 1 and 0), also known as chemical fingerprints that represent the presence or absence of patterns of connected atoms in a molecule. These bit vectors are typically compared using a metric known as the Tanimoto coefficient, which is defined as:

Tanimoto=ABAB

The numerator of the equation is the intersection, or the number of bits that are 1 in both bit vectors A and B. The denominator is the number of bits that are 1 in either vector A or vector B. The Tanimoto coefficient can range between 0, indicating that the molecules have no patterns of atoms in common, and 1, indicating that all of the patterns contained in molecule A are also contained in molecule B. As an example,we can consider the bit vectors shown in Figure 11-13. The intersection of the two vectors is 3 bits, while the union is 5. The Tanimoto coefficient is then 3/5, or 0.6. Note that the example shown here has been simplified for demonstration purposes. In practice, these bit vectors can contain hundreds or even thousands of bits.

Calculating a Tanimoto coefficient.
Figure 11-13. Calculating a Tanimoto coefficient.

A small amount of code is necessary to cluster a set of molecules. The only parameter required for Butina clustering is the cluster cutoff. If the Tanimoto similarity of two molecules is greater than the cutoff, the molecules are put into the same cluster. If the similarity is less than the cutoff, the molecules are put into different clusters:

def butina_cluster(mol_list, cutoff=0.35):
    fp_list = [
        rdmd.GetMorganFingerprintAsBitVect(m, 3, nBits=2048)
        for m in mol_list]
    dists = []
    nfps = len(fp_list)
    for i in range(1, nfps):
        sims = DataStructs.BulkTanimotoSimilarity(
            fp_list[i], fp_list[:i])
        dists.extend([1 - x for x in sims])
    mol_clusters = Butina.ClusterData(dists, nfps, cutoff,
                                      isDistData=True)
    cluster_id_list = [0] * nfps
    for idx, cluster in enumerate(mol_clusters, 1):
        for member in cluster:
            cluster_id_list[member] = idx
    return cluster_id_list

Before clustering, we will create a new dataframe with only the 100 top-scoring molecules. Since combo_df is already sorted, we only have to use the head function to select the first 100 rows in the dataframe:

best_100_df = combo_df.head(100).copy()

We can then create a new column containing the cluster identifier for each compound:

best_100_df["Cluster"] = butina_cluster(best_100_df.Mol)
best_100_df.head()

As always, it’s good to take a look and make sure everything worked. We now see that in addition to the SMILES string, molecule name, and predicted values, we also have a cluster identifier (see Figure 11-14).

The first few rows of the clustered dataset.
Figure 11-14. The first few rows of the clustered dataset.

We can use the Pandas unique function to determine that we have 55 unique clusters:

len(best_100_df.Cluster.unique())

Ultimately, we would like to purchase these compounds and screen them experimentally. In order to do this, we need to save a CSV file listing the molecules we plan to purchase. The drop_duplicates function can be used to select one molecule per cluster. By default, the function starts from the top of the table and removes rows with values that have already been seen:

best_cluster_rep_df = best_100_df.drop_duplicates("Cluster")

Just to make sure that this operation worked, let’s use the shape parameter to get the number of rows and columns in the new dataframe:

best_cluster_rep_df.shape

Finally, we can write out a CSV file with the molecules we want to screen:

best_cluster_rep_df.to_CSV("best_cluster_represenatives.CSV")

Conclusion

At this point, we have followed all the steps of a ligand-based virtual screening workflow. We used deep learning to build a classification model that was capable of distinguishing active from inactive molecules. The process began with evaluating our training data and ensuring that the molecular weight, LogP, and charge distributions were balanced between the active and decoy sets. Once we’d made the necessary adjustments to the chemical structures of the decoy molecules, we were ready to build a model.

The first step in building the model was generating a set of chemical features for the molecules being used. We used the DeepChem GraphConv featurizer to generate a set of appropriate chemical features. These features were then used to build a graph convolution model, which was subsequently used to predict the activity of a set of commercially available molecules. In order to avoid molecules that could be problematic in biological assays, we used a set of computational rules encoded as SMARTS patterns to identify molecules containing chemical functionality previously known to interfere with assays or create subsequent liabilities.

With our list of desired molecules in hand, we are in a position to test these molecules in biological assays. Typically the next step in our workflow would be to obtain samples of the chemical compounds for testing. If the molecules came from a corporate compound collection, a robotic system would collect the samples and prepare them for testing. If the molecules were purchased from commercial sources, additional weighing and dilution with buffered water or another solvent would be necessary.

Once the samples are prepared, they are tested in biological assays. These assays can cover a wide range of endpoints, ranging from inhibiting bacterial growth to preventing the proliferation of cancer cells. While the testing of these molecules is the final step in our virtual screening exercise, it is far from the end of the road for a drug discovery project. Once we have run the initial biological assay on the molecules we identified through virtual screening, we analyze the results of the screen. If we find experimentally active molecules, we will typically identify and test other similar molecules that will enable us to understand the relationships between different parts of the molecule and the biological activity that we are measuring. This optimization process often involves the synthesis and testing of hundreds or even thousands of molecules to identify those with the desired combination of safety and biological activity.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset