Virtual screening can provide an efficient and cost-effective means of identifying starting points for drug discovery programs. Rather than carrying out an expensive, experimental high-throughput screen (HTS), we can use computational methods to virtually evaluate millions, or even tens of millions, of molecules. Virtual screening methods are often grouped into two categories, structure-based virtual screening and ligand-based virtual screening.
In a structure-based virtual screen, computational methods are used to identify molecules that will optimally fit into a cavity, known as a binding site, in a protein. The binding of a molecule into the protein binding site can often inhibit the function of the protein. For instance, proteins known as enzymes catalyze a variety of physiological chemical reactions. By identifying and optimizing inhibitors of these enzymatic processes, scientists have been able to develop treatments for a wide range of diseases in oncology, inflammation, infection, and other therapeutic areas.
In a ligand-based virtual screen, we search for molecules that function similarly to one or more known molecules. We may be looking to improve the function of an existing molecule, to avoid pharmacological liabilities associated with a known molecule, or to develop novel intellectual property. A ligand-based virtual screen typically starts with a set of known molecules identified through any of a variety of experimental methods. Computational methods are then used to develop a model based on experimental data, and this model is used to virtually screen a large set of molecules to find new chemical starting points.
In this chapter, we will walk through a practical example of a virtual screening workflow. We will examine the code used to carry out components of the virtual screen as well as the thought process behind decisions made throughout the analysis. In this particular case, we will carry out a ligand-based virtual screen. We will use a set of molecules known to bind to a particular protein, as well as a set of molecules assumed to not bind, to train a convolutional neural network to identify new molecules with the potential to bind to the target.
As a first step, we will build a graph convolution model to predict the ability of molecules to inhibit a protein known as ERK2. This protein, also known as mitogen-activated protein kinase 1, or MAPK1, plays an important role in the signaling pathways that regulate how cells multiply. ERK2 has been implicated in a number of cancers, and ERK2 inhibitors are currently being tested in clinical trials for non-small-cell lung cancer and melanoma (skin cancer).
We will train the model to distinguish a set of ERK2 active compounds from a set of decoy compounds. The active and decoy compounds are derived from the DUD-E database, which is designed for testing predictive models. In practice, we would typically obtain active and inactive molecules from the scientific literature, or from a database of biologically active molecules such as the ChEMBL database from the European Bioinformatics Institute (EBI). In order to generate the best model, we would like to have decoys with property distributions similar to those of our active compounds. Let’s say this was not the case and the inactive compounds had lower molecular weight than the active compounds. In this case, our classifier might be trained simply to separate low molecular weight compounds from high molecular weight compounds. Such a classifier would have very limited utility in practice.
In order to better understand the dataset, let’s examine a few calculated properties of our active and decoy molecules. To build a reliable model, we need to ensure that the properties of the active molecules are similar to those of the decoy molecules.
First, let’s import the necessary libraries:
from
rdkit
import
Chem
# RDKit libraries for chemistry functions
from
rdkit.Chem
import
Draw
# Drawing chemical structures
import
pandas
as
pd
# Dealing with data in tables
from
rdkit.Chem
import
PandasTools
# Manipulating chemical data
from
rdkit.Chem
import
Descriptors
# Calculating molecular descriptors
from
rdkit.Chem
import
rdmolops
# Additional molecular properties
import
seaborn
as
sns
# Making graphs
In this exercise, molecules are represented using SMILES strings. For more information on SMILES, please see Chapter 4. We can now read a SMILES file into a Pandas dataframe and add an RDKit molecule to the dataframe. While the input SMILES file is not technically a CSV file, the Pandas read_CSV()
function can read it as long as we specify the delimiter, which in this case is a space:
active_df
=
pd
.
read_CSV
(
"
mk01/actives_final.ism
"
,
header
=
None
,
sep
=
"
"
)
active_rows
,
active_cols
=
active_df
.
shape
active_df
.
columns
=
[
"
SMILES
"
,
"
ID
"
,
"
ChEMBL_ID
"
]
active_df
[
"
label
"
]
=
[
"
Active
"
]
*
active_rows
PandasTools
.
AddMoleculeColumnToFrame
(
active_df
,
"
SMILES
"
,
"
Mol
"
)
Let’s define a function to add the calculated properties to a dataframe:
def
add_property_columns_to_df
(
df_in
):
df_in
[
"mw"
]
=
[
Descriptors
.
MolWt
(
mol
)
for
mol
in
df_in
.
Mol
]
df_in
[
"logP"
]
=
[
Descriptors
.
MolLogP
(
mol
)
for
mol
in
df_in
.
Mol
]
df_in
[
"charge"
]
=
[
rdmolops
.
GetFormalCharge
(
mol
)
for
mol
in
df_in
.
Mol
]
With this function in hand, we can calculate the molecular weight, LogP, and formal charge of the molecules. These properties encode the size of a molecule, its ability to partition from an oily substance (octanol) to water, and whether the molecule has a positive or negative charge. Once we have these properties we can compare the distributions for the active and decoy sets:
add_property_columns_to_df
(
active_df
)
Let’s look at the first few rows of our dataframe to ensure that the contents of the dataframe match the input file (see Table 11-1):
active_df
.
head
(
)
SMILES | ID | ChEMBL_ID | label | |
---|---|---|---|---|
0 | Cn1ccnc1Sc2ccc(cc2Cl)Nc3c4cc(c(cc4ncc3C#N)OCCCN5CCOCC5)OC | 168691 | CHEMBL318804 | Active |
1 | C[C@@]12[C@@H]([C@@H](CC(O1)n3c4ccccc4c5c3c6n2c7ccccc7c6c8c5C(=O)NC8)NC)OC | 86358 | CHEMBL162 | Active |
2 | Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3)Cl)Nc4cccc5c4OC(O5)(F)F | 575087 | CHEMBL576683 | Active |
3 | Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3)Cl)Nc4cccc5c4OCO5 | 575065 | CHEMBL571484 | Active |
4 | Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3)Cl)Nc4cccc5c4CCC5 | 575047 | CHEMBL568937 | Active |
Now let’s do the same thing with the decoy molecules:
decoy_df
=
pd
.
read_CSV
(
"
mk01/decoys_final.ism
"
,
header
=
None
,
sep
=
"
"
)
decoy_df
.
columns
=
[
"
SMILES
"
,
"
ID
"
]
decoy_rows
,
decoy_cols
=
decoy_df
.
shape
decoy_df
[
"
label
"
]
=
[
"
Decoy
"
]
*
decoy_rows
PandasTools
.
AddMoleculeColumnToFrame
(
decoy_df
,
"
SMILES
"
,
"
Mol
"
)
add_property_columns_to_df
(
decoy_df
)
In order to build a model, we need a single dataframe with the active and decoy molecules. We can use the Pandas append
function to add the two dataframes and create a new dataframe called tmp_df
:
tmp_df
=
active_df
.
append
(
decoy_df
)
With properties calculated for both the active and decoy sets, we can compare the properties of the two sets of molecules. To do the comparison, we will use violin plots. A violin plot is analogous to a boxplot. The violin plot provides a mirrored, horizontal view of a frequency distribution. Ideally, we would like to see similar distributions for the active and decoy sets. The results are shown in Figure 11-1:
sns
.
violinplot
(
tmp_df
[
"
label
"
]
,
tmp_df
[
"
mw
"
]
)
An examination of these plots shows that the molecular weight distributions for the two sets are roughly equivalent. The decoy set has more low molecular weight molecules, but the center of the distribution, shown as a box in the middle of each violin plot, is in a similar location in both plots.
We can use violin plots to perform a similar comparison of the LogP distributions (Figure 11-2). Again, we can see that the distributions are similar, with a few more of the decoy molecules at the lower end of the distribution:
sns
.
violinplot
(
tmp_df
[
"
label
"
]
,
tmp_df
[
"
logP
"
]
)
Finally, we perform the same comparison with the formal charges of the molecules (Figure 11-3):
sns
.
violinplot
(
new_tmp_df
[
"
label
"
]
,
new_tmp_df
[
"
charge
"
]
)
In this case, we see a significant difference. All of the active molecules are neutral, having charges of 0, while some of the decoys are charged, with charges of +1 or –1. Let see what fraction of the decoy molecules are charged. We can do this by creating a new dataframe with just the charged molecules:
charged
=
decoy_df
[
decoy_df
[
"
charge
"
]
!=
0
]
A Pandas dataframe has a property, shape
, that returns the number of rows and columns in the dataframe. As such, element [0]
in the shape
property will be the number of rows. Let’s divide the number of rows in our dataframe of charged molecules by the total number of rows in the decoy dataframe:
charged
.
shape
[
0
]
/
decoy_df
.
shape
[
0
]
This returns 0.162. As we saw in the violin plot, approximately 16% of the decoy molecules are charged. This appears to be because the active and decoy sets were not prepared in a consistent fashion. We can address this problem by modifying the chemical structures of the decoy molecules to neutralize their charges. Fortunately we can do this easily with the NeutraliseCharges()
function from the RDKit Cookbook:
from
neutralize
import
NeutraliseCharges
In order to avoid confusion, we create a new dataframe with the SMILES stings, IDs, and labels for the decoys:
revised_decoy_df
=
decoy_df
[
[
"
SMILES
"
,
"
ID
"
,
"
label
"
]
]
.
copy
(
)
With this new dataframe in hand, we can replace the original SMILES strings with the strings for the neutral forms of the molecules. The NeutraliseCharges
function returns two values. The first is the SMILES string for the neutral form of the molecule and the second is a Boolean variable indicating whether the molecule was changed. In the following code, we only need the SMILES string, so we use the first element of the tuple returned by NeutraliseCharges
.
revised_decoy_df
[
"SMILES"
]
=
[
NeutraliseCharges
(
x
)[
0
]
for
x
in
revised_decoy_df
[
"SMILES"
]]
Once we’ve replaced the SMILES strings, we can add a molecule column to our new dataframe and calculate the properties again:
PandasTools
.
AddMoleculeColumnToFrame
(
revised_decoy_df
,
"
SMILES
"
,
"
Mol
"
)
add_property_columns_to_df
(
revised_decoy_df
)
We can then append the dataframe with the active molecules to the one with the revised, neutral decoys:
new_tmp_df
=
active_df
.
append
(
revised_decoy_df
)
Next, we can generate a new boxplot to compare the charge distributions of the active molecules with those of our neutralized decoys (Figure 11-4):
sns
.
violinplot
(
new_tmp_df
[
"
label
"
]
,
new_tmp_df
[
"
charge
"
]
)
An examination of the plots shows that there are now very few charged molecules in the decoy set. We can use the same technique we used earlier to create a dataframe with only the charged molecules. We then use this dataframe to determine the number of charged molecules remaining in the set:
charged
=
revised_decoy_df
[
revised_decoy_df
[
"
charge
"
]
!=
0
]
charged
.
shape
[
0
]
/
revised_decoy_df
.
shape
[
0
]
The result now is 0.003. We have reduced the fraction of charged molecules from 16% to 0.3%. We can now be confident that our active and decoy sets are reasonably well balanced.
In order to use these datasets with DeepChem, we need to write the molecules out as a CSV file containing for each molecule the SMILES string, ID, Name, and an integer value indicating whether the compounds are active (labeled as 1) or inactive (labeled as 0):
active_df
[
"
is_active
"
]
=
[
1
]
*
active_df
.
shape
[
0
]
revised_decoy_df
[
"
is_active
"
]
=
[
0
]
*
revised_decoy_df
.
shape
[
0
]
combined_df
=
active_df
.
append
(
revised_decoy_df
)
[
[
"
SMILES
"
,
"
ID
"
,
"
is_active
"
]
]
combined_df
.
head
(
)
The first five lines are shown in Table 11-2.
SMILES | ID | is_active | |
---|---|---|---|
0 | Cn1 ccnc1Sc2ccc(cc2Cl}Nc3c4cc(c(cc4ncc3C#N}OCCCN5CCOCC5)OC | 168691 | 1 |
1 | C[C@@]12[C@@H]([C@@H](CC(O1)n3c4ccccc4c5c3c6n2c7ccccc7c6c8c5C(=O)NC8)NC)OC | 86358 | 1 |
2 | Cc1cnc(nc1c2cc([nH]c2)C(=0) N[C@H](CO)c3cccc(c3}Cl}Nc4cccc5c4OC(O5)(F)F | 575087 | 1 |
3 | CCc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3}Cl}Nc4cccc5c4OCO5 | 575065 | 1 |
4 | Cc1cnc(nc1c2cc([nH]c2)C(=0) N[C@H](CO)c3cccc(c3}Cl}Nc4cccc5c4CCC5 | 575047 | 1 |
Our final step in this section is to save our new combined_df
as a CSV file. The index=False
option causes Pandas to not include the row number in the first column:
combined_df
.
to_csv
(
"
dude_erk1_mk01.CSV
"
,
index
=
False
)
Now that we have taken care of formatting, we can use this data to train a graph convolution model. First, we need to import the necessary libraries. Some of these libraries were imported in the first section, but let’s assume we are starting with the CSV file we created in the previous section:
import
deepchem
as
dc
# DeepChem libraries
from
deepchem.models
import
GraphConvModel
# Graph convolutions
import
numpy
as
np
# NumPy for numeric operations
import
sys
# Error handling
import
pandas
as
pd
# Data table manipulation
import
seaborn
as
sns
# Seaborn library for plotting
from
rdkit.Chem
import
PandasTools
# Chemical structures in Pandas
Now let’s define a function to create a GraphConvModel
. In this case, we will be creating a classification model. Since we will apply the model later on a different dataset, it’s a good idea to create a directory in which to store the model. You will need to change the directory to something accessible on your filesystem:
def
generate_graph_conv_model
():
batch_size
=
128
model
=
GraphConvModel
(
1
,
batch_size
=
batch_size
,
mode
=
'classification'
,
model_dir
=
"/tmp/mk01/model_dir"
)
return
model
In order to train the model, we first read in the CSV file we created in the previous section:
dataset_file
=
"dude_erk2_mk01.CSV"
tasks
=
[
"is_active"
]
featurizer
=
dc
.
feat
.
ConvMolFeaturizer
()
loader
=
dc
.
data
.
CSVLoader
(
tasks
=
tasks
,
smiles_field
=
"SMILES"
,
featurizer
=
featurizer
)
dataset
=
loader
.
featurize
(
dataset_file
,
shard_size
=
8192
)
Now that we have the dataset loaded, let’s build a model. We will create training and test sets to evaluate the model’s performance. In this case, we will use the RandomSplitter
(DeepChem offers a number of other splitters too, such as the ScaffoldSplitter
, which divides the dataset by chemical scaffold, and the ButinaSplitter
, which first clusters the data then splits the dataset so that different clusters end up in the training and test sets):
splitter
=
dc
.
splits
.
RandomSplitter
(
)
With the dataset split, we can train a model on the training set and test that model on the validation set. At this point, we need to define some metrics and evaluate the performance of our model. In this case, our dataset is unbalanced: we have a small number of active compounds and a large number of inactive compounds. Given this difference, we need to use a metric that reflects the performance on unbalanced datasets. One metric that is appropriate for datasets like this is the Matthews correlation coefficient (MCC):
metrics
=
[
dc
.
metrics
.
Metric
(
dc
.
metrics
.
matthews_corrcoef
,
np
.
mean
,
mode
=
"classification"
)]
In order to evaluate the performance of our model, we will perform 10 folds of cross validation, where we train a model on the training set and validate on the validation set:
training_score_list
=
[]
validation_score_list
=
[]
transformers
=
[]
cv_folds
=
10
for
i
in
range
(
0
,
cv_folds
):
model
=
generate_graph_conv_model
()
res
=
splitter
.
train_valid_test_split
(
dataset
)
train_dataset
,
valid_dataset
,
test_dataset
=
res
model
.
fit
(
train_dataset
)
train_scores
=
model
.
evaluate
(
train_dataset
,
metrics
,
transformers
)
training_score_list
.
append
(
train_scores
[
"mean-matthews_corrcoef"
])
validation_scores
=
model
.
evaluate
(
valid_dataset
,
metrics
,
transformers
)
validation_score_list
.
append
(
validation_scores
[
"mean-matthews_corrcoef"
])
(
training_score_list
)
(
validation_score_list
)
To visualize the performance of our model on the training and test data, we can make boxplots. The results are shown in Figure 11-5:
sns
.
boxplot
(
[
"training"
]
*
cv_folds
+
[
"validation"
]
*
cv_folds
,
training_score_list
+
validation_score_list
)
The plots indicate that, as expected, the performance on the training set is superior to that on the validation set. However, the performance on the validation set is still quite good. At this point, we can be confident in the performance of our model.
It is also useful to visualize the results of our model. In order to do this, we will generate a set of predictions for a validation set:
pred
=
[
x
.
flatten
(
)
for
x
in
model
.
predict
(
valid_dataset
)
]
To make processing easier, we’ll create a Pandas dataframe with the predictions:
pred_df
=
pd
.
DataFrame
(
pred
,
columns
=
[
"
neg
"
,
"
pos
"
]
)
We can easily add the activity class (1 = active, 0 = inactive) and the SMILES strings for our predicted molecules to the dataframe:
pred_df
[
"
active
"
]
=
[
int
(
x
)
for
x
in
valid_dataset
.
y
]
pred_df
[
"
SMILES
"
]
=
valid_dataset
.
ids
It’s always a good idea to look at the first few lines of the dataframe to ensure that the data makes sense. Table 11-3 shows the results.
neg | pos | active | SMILES | |
---|---|---|---|---|
0 | 0.906081 | 0.093919 | 1 | Cn1ccnc1Sc2ccc(cc2Cl)Nc3c4cc(c(cc4ncc3C#N)OCCC... |
1 | 0.042446 | 0.957554 | 1 | Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3cccc(c3... |
2 | 0.134508 | 0.865492 | 1 | Cc1cccc(c1)[C@@H](CO)NC(=O)c2cc(c[nH]2)c3c(cnc... |
3 | 0.036508 | 0.963492 | 1 | Cc1cnc(nc1c2cc([nH]c2)C(=O)N[C@H](CO)c3ccccc3)... |
4 | 0.940717 | 0.059283 | 1 | c1c2c([nH]c1Br)C(=O)NCC/C2=C/3C(=O)N=C(N3)N |
Creating boxplots enables us to compare the predicted values for the active and inactive molecules (see Figure 11-6).
sns
.
boxplot
(
pred_df
.
active
,
pred_df
.
pos
)
The performance of our model is very good: we can see a clear separation between the active and inactive molecules. When building a predictive model it is often important to examine inactive molecules that are predicted as active (false positives) as well as active molecules that are predicted as inactive (false negatives). It appears that only one of our active molecules received a low positive score. In order to look more closely, we will create a new dataframe containing all of the active molecules with a positive score < 0.5):
false_negative_df
=
pred_df
.
query
(
"
active == 1 & pos < 0.5
"
)
.
copy
(
)
To inspect the chemical structures of the molecules in our dataframe, we use the PandasTools
module from RDKit:
PandasTools
.
AddMoleculeColumnToFrame
(
false_negative_df
,
"SMILES"
,
"Mol"
)
Let’s look at the new dataframe (Figure 11-7):
false_negative_df
In order to fully take advantage of the information in this dataframe, we need to have some knowledge of medicinal chemistry. It is often informative to look at the chemical structures of the false negative molecules and compare these with the chemical structures of the true positive molecules. This may provide some insight into the reasons that molecules were not predicted correctly. Often it may be the case that the false negative molecules are not similar to any of the true positive molecules. In this case, it may be worth performing additional literature searches to increase the diversity of the molecules in the training set.
We can use a similar approach to examine the false positive molecules, which are inactive but received a positive score > 0.5 (see Figure 11-8). Again, comparison with the chemical structures of the true positive molecules may be informative:
false_positive_df
=
pred_df
.
query
(
"active == 0 & pos > 0.5"
)
.
copy
()
PandasTools
.
AddMoleculeColumnToFrame
(
false_positive_df
,
"SMILES"
,
"Mol"
)
false_positive_df
During the model training phase, our objective was to evaluate the performance of our model. As such, we trained the model on a portion of the data and validated the model on the remainder. Now that we have evaluated the performance, we want to generate the most effective model. In order to do this, we will train the model on all of the data:
model
.
fit
(
dataset
)
This gives us an accuracy score of 91%. Finally, we save the model to disk so that we can use it to make future predictions:
model
.
save
(
)
Now that we’ve created a predictive model, we can apply this model to a new set of molecules. In many cases, we will build a predictive model based on literature data, then apply that model to a set of molecules that we want to screen. The molecules we want to screen may come from an internal database or from a commercially available screening collection. As an example, we will use the predictive model we created to screen a small sample of 100,000 compounds from the ZINC database, a collection of more than 1 billion commercially available molecules.
One potential source of difficulty when carrying out a virtual screen is the presence of molecules that have the potential to interfere with biological assays. Over the last 25 years, many groups within the scientific community have developed sets of rules to identify potentially reactive or problematic molecules. Several of these rule sets, which are encoded as SMARTS strings, have been collected by the group that curates the ChEMBL database. These rule sets have been made available through a Python script called rd_filters.py. In this example, we will use rd_filters.py to identify potentially problematic molecules in our set of 100,000 molecules from the ZINC database.
The rd_filters.py script and associated data files are available on our GitHub repository.
The available modes and arguments for the script can be obtained by calling it with the -h
flag.
rd_filters.py -h
Usage:
rd_filters.py $ filter --in INPUT_FILE --prefix PREFIX [--rules RULES_FILE_NAME]
[--alerts ALERT_FILE_NAME][--np NUM_CORES]
rd_filters.py $ template --out TEMPLATE_FILE [--rules RULES_FILE_NAME]
Options:
--in INPUT_FILE input file name
--prefix PREFIX prefix for output file names
--rules RULES_FILE_NAME name of the rules JSON file
--alerts ALERTS_FILE_NAME name of the structural alerts file
--np NUM_CORES the number of cpu cores to use (default is all)
--out TEMPLATE_FILE parameter template file name
To call the script on our input file, which is called zinc_100k.smi, we can specify the input file and a prefix for output filenames. The filter
argument calls the script in “filter” mode, where it identifies potentially problematic molecules. The --prefix
argument indicates that the output file names will start with the prefix zinc.
rd_filters
.
py
filter
-
-
in
zinc_100k
.
smi
-
-
prefix
zinc
using 24 cores
Using alerts from Inpharmatica
Wrote SMILES for molecules passing filters to zinc.smi
Wrote detailed data to zinc.CSV
68752 of 100000 passed filters 68.8%
Elapsed time 15.89 seconds
The output indicates the following:
The script is running on 24 cores. It runs in parallel across multiple cores, and the number of cores can be selected with the -np
flag.
The script is using the “Inpharmatica” set of rules. This rule set covers a large range of chemical functionality that has been shown to be problematic in biological assays. In addition to the Inpharmaticia set, the script has seven other rule sets available. Please see the rd_filters.py documentation for more information.
SMILES strings for the molecules passing the filters were written to a file called zinc.smi. We will use this as the input when we use the predictive model.
Detailed information on which compounds triggered particular structural alerts was written to a file called zinc.CSV.
69% of the molecules passed the filters, and 31% were considered problematic.
It is informative to take a look at the reasons why 31% of the molecules were rejected. This can let us know whether we need to adjust any of the filters. We will use a bit of Python code to look at the first few lines of output, shown in Table 11-4.
import
pandas
as
pd
df
=
pd
.
read_CSV
(
"zinc.CSV"
)
df
.
head
()
SMILES | NAME | FILTER | MW | LogP | HBD | |
---|---|---|---|---|---|---|
0 | CN(CCO)C[C@@H](O)Cn1cnc2c1c(=O)n(C)c(=O)n2C | ZINC000000000843 | Filter82_pyridinium >0 | 311.342 | –2.2813 | 2 |
1 | O=c1[nH]c(=O)n([C@@H]2C[C@@H](O)[C@H](CO)O2)cc1Br | ZINC000000001063 | Filter82_pyridinium >0 | 307.100 | –1.0602 | 3 |
2 | Cn1c2ncn(CC(=O)N3CCOCC3)c2c(=O)n(C)c1=O | ZINC000000003942 | Filter82_pyridinium >0 | 307.310 | –1.7075 | 0 |
3 | CN1C(=O)C[C@H](N2CCN(C(=O)CN3CCCC3)CC2)C1=O | ZINC000000036436 | OK | 308.382 | –1.0163 | 0 |
4 | CC(=O)NC[C@H](O)[C@H]1O[C@H]2OC(C)(C)O[C@H]2[C... | ZINC 000000041101 | OK | 302.327 | -1.1355 | 3 |
The dataframe has six columns:
We can use the Counter
class from the Python collections
library to identify which filters were responsible for removing the largest numbers of molecules (see Table 11-5):
from
collections
import
Counter
count_list
=
list
(
Counter
(
df
.
FILTER
)
.
items
(
)
)
count_df
=
pd
.
DataFrame
(
count_list
,
columns
=
[
"
Rule
"
,
"
Count
"
]
)
count_df
.
sort_values
(
"
Count
"
,
inplace
=
True
,
ascending
=
False
)
count_df
.
head
(
)
Rule | Count | |
---|---|---|
1 | OK | 69156 |
6 | Filter41_12_dicarbonyl > O | 19330 |
0 | Filter82_pyridinium > O | 7761 |
10 | Filter93_acetyl_urea > O | 1541 |
11 | Filter78_bicyclic_lmide > O | 825 |
The first line in the table, labeled as “OK,” indicates the number of molecules that were not eliminated by any of the filters. From this, we can see that 69,156 of the molecules in our input passed all of the filters. The largest number of molecules (19,330) were rejected because they contained a 1,2-dicarbonyl group. Molecules of this type may react and form covalent bonds with protein residues such as serine and cysteine. We can find the SMARTS pattern used to identify these molecules by looking for the string “Filter41_12_dicarbonyl” in the filter_collection.CSV file that is part of the rd_filters.py distribution. The SMARTS pattern is “*C(=O)C(=O)*”, which represents:
It is always good to look at the data and ensure that everything is working as expected. We can use the highlightAtomLists
argument to RDKit’s MolsToGridImage()
function to highlight the 1,2-dicarbonyl functionality (see Figure 11-9):
from
rdkit
import
Chem
from
rdkit.Chem
import
Draw
mol_list
=
[
Chem
.
MolFromSmiles
(
x
)
for
x
in
smiles_list
]
dicarbonyl
=
Chem
.
MolFromSmarts
(
'*C(=O)C(=O)*'
)
match_list
=
[
mol
.
GetSubstructMatch
(
dicarbonyl
)
for
mol
in
mol_list
]
Draw
.
MolsToGridImage
(
mol_list
,
highlightAtomLists
=
match_list
,
molsPerRow
=
5
)
We can see that the molecules do indeed have dicarbonyl groups, as highlighted in the figure. If we wanted to, we could similarly evaluate other filters. At this point, however, we can be satisfied with the results of the filtering. We have removed the problematic molecules from the set we plan to use for our virtual screen. We can now use this set, which is in the file zinc.smi, in the next step of this exercise.
The GraphConvMdel
we created can now be used to search the set of commercially available compounds we just filtered. Applying the model requires a few steps:
Load the model from disk.
Create a featurizer.
Read and featurize the molecules that will run through the model.
Examine the scores for the predictions.
Examine the chemical structures of the top predicted molecules.
Write the selected molecules from each cluster to a CSV file.
We begin by importing the necessary libraries:
import
deepchem
as
dc
# DeepChem libraries
import
pandas
as
pd
# Pandas for tables
from
rdkit.Chem
import
PandasTools
,
Draw
# Chemistry in Pandas
from
rdkit
import
DataStructs
# For fingerprint handling
from
rdkit.ML.Cluster
import
Butina
# Cluster molecules
from
rdkit.Chem
import
rdMolDescriptors
as
rdmd
# Descriptors
import
seaborn
as
sns
# Plotting
and loading the model we generated earlier:
model
=
dc
.
models
.
TensorGraph
.
load_from_dir
(
"
"
/
tmp
/
mk01
/
model_dir
"
"
)
To generate predictions from our model, we first need to featurize the molecules we plan to use to generate predictions. We do this by instantiating a DeepChem ConvMolFeaturizer
:
featurizer
=
dc
.
feat
.
ConvMolFeaturizer
(
)
In order to featurize the molecules, we need to transform our SMILES file into a CSV file. In order to create a DeepChem featurizer we also require an activity column, so we add one, then write the file to CSV:
df
=
pd
.
read_CSV
(
"zinc.smi"
,
sep
=
" "
,
header
=
None
)
df
.
columns
=
[
"SMILES"
,
"Name"
]
rows
,
cols
=
df
.
shape
# Just add add a dummy column to keep the featurizer happy
df
[
"Val"
]
=
[
0
]
*
rows
As before, we should look at the first few lines of the file (shown in Table 11-6) to make sure everything is as we had expect:
df
.
head
(
)
SMILES | Name | Val | |
---|---|---|---|
0 | CN1C(=O)C[C@H](N2CCN(C(=O)CN3CCCC3)CC2)C1=O | ZINC000000036436 | 0 |
1 | CC(=O)NC[C@H](O)[C@H]1O[C@H]2OC(C)(C)O[C@H]2[C@@H]1NC(C)=O | ZINC000000041101 | 0 |
2 | C1CN(c2nc(-c3nn[nH]n3)nc(N3CCOCC3)n2)CCO1 | ZINC000000054542 | 0 |
3 | OCCN(CCO)c1nc(Cl)nc(N(CCO)CCO)n1 | ZINC000000109481 | 0 |
4 | COC(=O)c1ccc(S(=O)(=O)N(CCO)CCO)n1C | ZINC000000119782 | 0 |
Note that the Val column is just a placeholder to keep the DeepChem featurizer happy. The file looks good, so we will write it as a CSV file to use as input for DeepChem. We use the index=False
argument to prevent Pandas from writing the row numbers as the first column:
infile_name
=
"
zinc_filtered.CSV
"
df
.
to_CSV
(
infile_name
,
index
=
False
)
We can use DeepChem to read this CSV file with a loader and featurize the molecules we plan to predict:
loader
=
dc
.
data
.
CSVLoader
(
tasks
=
[
'Val'
],
smiles_field
=
"SMILES"
,
featurizer
=
featurizer
)
dataset
=
loader
.
featurize
(
infile_name
,
shard_size
=
8192
)
The featurized molecules can be used to generate predictions with the model:
pred
=
model
.
predict
(
dataset
)
For convenience,we will put the predictions into a Pandas dataframe:
pred_df
=
pd
.
DataFrame
([
x
.
flatten
()
for
x
in
pred
],
columns
=
[
"Neg"
,
"Pos"
]
The distribution plot, available in the Seaborn library, provides a nice overview of the distribution of scores. Unfortunately, in virtual screening, there are no clear rules for defining an activity cutoff. Often the best strategy is to look at the distribution of scores, then select a set of the top-scoring molecules. If we look at the plot in Figure 11-10, we can see that there are only a small number of molecules with scores above 0.3. We can use this value as a preliminary cutoff for molecules that we may want to screen experimentally.
We can join the dataframe with the scores to the dataframe with the SMILES strings. This gives us the ability to view the chemical structures of the top-scoring molecules:
combo_df
=
df
.
join
(
pred_df
,
how
=
"outer"
)
combo_df
.
sort_values
(
"Pos"
,
inplace
=
True
,
ascending
=
False
)
As we saw earlier, adding a molecule column to the dataframe enables us to look at the chemical structures of the hits (see Figure 11-11).
Based on what we see here, it looks like many of the hits are similar. Let’s look at a few more molecules (Figure 11-12):
Draw
.
MolsToGridImage
(
combo_df
.
Mol
[:
10
],
molsPerRow
=
5
,
legends
=
[
"
%.2f
"
%
x
for
x
in
combo_df
.
Pos
[:
10
]])
Indeed, many of the molecules are very similar and might end up being redundant in our screen. One way to be more efficient would be to cluster the molecules and only screen the highest-scoring molecule in each cluster. RDKit has an implementation of the Butina clustering method, one of the most highly used methods in cheminformatics. In the Butina clustering method, we group molecules based on their chemical similarity, which is calculated using a comparison of bit vectors (arrays of 1 and 0), also known as chemical fingerprints that represent the presence or absence of patterns of connected atoms in a molecule. These bit vectors are typically compared using a metric known as the Tanimoto coefficient, which is defined as:
The numerator of the equation is the intersection, or the number of bits that are 1 in both bit vectors A and B. The denominator is the number of bits that are 1 in either vector A or vector B. The Tanimoto coefficient can range between 0, indicating that the molecules have no patterns of atoms in common, and 1, indicating that all of the patterns contained in molecule A are also contained in molecule B. As an example,we can consider the bit vectors shown in Figure 11-13. The intersection of the two vectors is 3 bits, while the union is 5. The Tanimoto coefficient is then 3/5, or 0.6. Note that the example shown here has been simplified for demonstration purposes. In practice, these bit vectors can contain hundreds or even thousands of bits.
A small amount of code is necessary to cluster a set of molecules. The only parameter required for Butina clustering is the cluster cutoff. If the Tanimoto similarity of two molecules is greater than the cutoff, the molecules are put into the same cluster. If the similarity is less than the cutoff, the molecules are put into different clusters:
def
butina_cluster
(
mol_list
,
cutoff
=
0.35
):
fp_list
=
[
rdmd
.
GetMorganFingerprintAsBitVect
(
m
,
3
,
nBits
=
2048
)
for
m
in
mol_list
]
dists
=
[]
nfps
=
len
(
fp_list
)
for
i
in
range
(
1
,
nfps
):
sims
=
DataStructs
.
BulkTanimotoSimilarity
(
fp_list
[
i
],
fp_list
[:
i
])
dists
.
extend
([
1
-
x
for
x
in
sims
])
mol_clusters
=
Butina
.
ClusterData
(
dists
,
nfps
,
cutoff
,
isDistData
=
True
)
cluster_id_list
=
[
0
]
*
nfps
for
idx
,
cluster
in
enumerate
(
mol_clusters
,
1
):
for
member
in
cluster
:
cluster_id_list
[
member
]
=
idx
return
cluster_id_list
Before clustering, we will create a new dataframe with only the 100 top-scoring molecules. Since combo_df
is already sorted, we only have to use the head
function to select the first 100 rows in the dataframe:
best_100_df
=
combo_df
.
head
(
100
)
.
copy
(
)
We can then create a new column containing the cluster identifier for each compound:
best_100_df
[
"Cluster"
]
=
butina_cluster
(
best_100_df
.
Mol
)
best_100_df
.
head
()
As always, it’s good to take a look and make sure everything worked. We now see that in addition to the SMILES string, molecule name, and predicted values, we also have a cluster identifier (see Figure 11-14).
We can use the Pandas unique
function to determine that we have 55 unique clusters:
len
(
best_100_df
.
Cluster
.
unique
(
)
)
Ultimately, we would like to purchase these compounds and screen them experimentally. In order to do this, we need to save a CSV file listing the molecules we plan to purchase. The drop_duplicates
function can be used to select one molecule per cluster. By default, the function starts from the top of the table and removes rows with values that have already been seen:
best_cluster_rep_df
=
best_100_df
.
drop_duplicates
(
"Cluster"
)
Just to make sure that this operation worked, let’s use the shape
parameter to get the number of rows and columns in the new dataframe:
best_cluster_rep_df
.
shape
Finally, we can write out a CSV file with the molecules we want to screen:
best_cluster_rep_df
.
to_CSV
(
"best_cluster_represenatives.CSV"
)
At this point, we have followed all the steps of a ligand-based virtual screening workflow. We used deep learning to build a classification model that was capable of distinguishing active from inactive molecules. The process began with evaluating our training data and ensuring that the molecular weight, LogP, and charge distributions were balanced between the active and decoy sets. Once we’d made the necessary adjustments to the chemical structures of the decoy molecules, we were ready to build a model.
The first step in building the model was generating a set of chemical features for the molecules being used. We used the DeepChem GraphConv
featurizer to generate a set of appropriate chemical features. These features were then used to build a graph convolution model, which was subsequently used to predict the activity of a set of commercially available molecules. In order to avoid molecules that could be problematic in biological assays, we used a set of computational rules encoded as SMARTS patterns to identify molecules containing chemical functionality previously known to interfere with assays or create subsequent liabilities.
With our list of desired molecules in hand, we are in a position to test these molecules in biological assays. Typically the next step in our workflow would be to obtain samples of the chemical compounds for testing. If the molecules came from a corporate compound collection, a robotic system would collect the samples and prepare them for testing. If the molecules were purchased from commercial sources, additional weighing and dilution with buffered water or another solvent would be necessary.
Once the samples are prepared, they are tested in biological assays. These assays can cover a wide range of endpoints, ranging from inhibiting bacterial growth to preventing the proliferation of cancer cells. While the testing of these molecules is the final step in our virtual screening exercise, it is far from the end of the road for a drug discovery project. Once we have run the initial biological assay on the molecules we identified through virtual screening, we analyze the results of the screen. If we find experimentally active molecules, we will typically identify and test other similar molecules that will enable us to understand the relationships between different parts of the molecule and the biological activity that we are measuring. This optimization process often involves the synthesis and testing of hundreds or even thousands of molecules to identify those with the desired combination of safety and biological activity.