2
Applications of Chemoinformatics in Drug Discovery

Valerie J. Gillet

Information School, University of Sheffield, Regent Court, 211 Portobello, Sheffield, S1 4DP, UK

2.1 Significance and Background

The term chemoinformatics first appeared in the literature 20 years ago following the introduction of automation techniques within the pharmaceutical industry for the synthesis and testing of compounds in drug discovery. The use of combinatorial chemistry and high throughput screening techniques resulted in a vast increase in the volumes of structural and biological data available to guide decision making in drug discovery and led to the following definition of chemoinformatics:

The mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation [1].

Chemoinformatics is now recognised as a discipline, albeit one that falls at the intersection of other disciplines such as chemistry and computer science. It is also a discipline with fuzzy boundaries; for example, the distinction between computational chemistry and chemoinformatics is not always clear. A feature of chemoinformatics today is that it typically involves the analysis of large datasets of compounds. That said, many of the techniques embodied within chemoinformatics have much earlier origins [2], starting with the representation of chemical structures in databases more than 50 years ago. Much of the early activity in chemoinformatics was based on proprietary data and commercial or proprietary software. However, more recently the availability of very large public data sources such as ChEMBL and PubChem, together with the increasing number of open source software tools [3], means that chemoinformatics techniques are now more mainstream and accessible within academia than was the case previously. Furthermore, the techniques have now extended beyond drug and agrochemical discovery to a much wider range of chemistry domains including the food industry and materials design.

This chapter provides an overview of chemoinformatics techniques that are commonly applied in early stage drug discovery. Following a discussion of some basic foundations relating to structure representation and search, the main focus will be on virtual screening. Virtual screening is the computational equivalent of biological screening and is used to prioritise compounds for experimental testing.

2.2 Computer Representation of Chemical Structures

The common language of organic chemistry is the two‐dimensional (2D) structure diagram as shown in Figure 2.1. Most chemical information systems, including web‐based systems, include user‐friendly structure drawing programs that enable graphical input of query structures and allow high quality images to be produced. For example, the standalone ChemDraw package was used to produce the image of aspirin shown in Figure 2.1 and the JSME molecular editor allows molecular editing within web browsers [4]. While these programs enable structures to be drawn on a computer, the images themselves have little value for chemoinformatics applications since they do not contain chemical meaning; they have to be converted to other forms for computer processing.

Image described by caption.

Figure 2.1 Structure diagram of aspirin.

A widely used method for the representation of chemical structures is the line notation, of which the SMILES (Simplified Molecular Input Line Entry System) notation [5] is most common. In SMILES, 2D structures are represented by linear strings of alphanumeric (that is, textual) characters. SMILES strings are based on a small number of simple rules, they are relatively easy to understand, and they are compact in terms of data storage. For these reasons, they are widely used for transferring compounds between systems and users, and for entering structures into chemoinformatics applications. In SMILES, atoms are represented by their atomic symbols. Upper case characters are used to represent aliphatic atoms (C, N, S, O, etc.) and lower case are used for aromatic atoms. Hydrogen atoms are implicit. Single bonds are inferred between adjacent atoms, as are aromatic bonds; double bonds are represented as ‘=’; and triple bonds as ‘#’. Additional rules enable stereochemistry to be encoded. A SMILES string can be constructed by ‘walking through’ the bonds in a structure diagram visiting each atom once. Rings are encoded by ‘breaking’ one of the ring bonds and attaching a pair of integer values, one to each of the two atoms of the broken bond. Branch points, where more than one path can be followed, are encoded using parentheses. A SMILES representation of aspirin is shown in Table 2.1, along with InChI and InChIKey notations, which are described below.

Table 2.1 Line notation representations for aspirin.

SMILES    OC(=O)c1ccccc1OC(=O)C
InChI     InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
InChIKey  BSYNRYMUTXBXSQ-UHFFFAOYSA-N
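
The conversions shown in Table 2.1 can be reproduced in a few lines of code. The following is a minimal sketch assuming the open‐source RDKit toolkit (Python); the exact canonical SMILES string written out may vary between toolkit versions.

from rdkit import Chem

# Parse the aspirin SMILES from Table 2.1 into a molecule object.
mol = Chem.MolFromSmiles("OC(=O)c1ccccc1OC(=O)C")

print(Chem.MolToSmiles(mol))    # canonical SMILES, e.g. CC(=O)Oc1ccccc1C(=O)O
print(Chem.MolToInchi(mol))     # InChI=1S/C9H8O4/...
print(Chem.MolToInchiKey(mol))  # BSYNRYMUTXBXSQ-UHFFFAOYSA-N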

2.3 Database Searching

One of the fundamental concepts in chemoinformatics is the representation of chemical structures as graphs. While molecular editors and line notations allow easy input and sharing of molecules, the storage and retrieval of structures from databases is based on the mapping of a chemical structure to a mathematical graph. A graph consists of nodes that are connected by edges. Graph theory is a branch of mathematics in which many well‐established algorithms exist that can be used to facilitate analysis and searches of databases of compounds. In a molecular graph, the nodes correspond to atoms and the edges to bonds. The nodes and edges can have properties associated with them, for example, atom type and bond type. Molecular graphs are represented as connection tables, which essentially list the atoms and bonds contained in the structure. A number of different connection table formats exist that vary in the way in which the information is stored. A common format is the MDL Molfile, which consists of a header section, an atom block, with one row for each atom, and a bond block, in which each bond in the structure is listed once only [6]. The Molfile for aspirin is shown in Figure 2.2 with the different blocks of information highlighted. Connection tables can also be aggregated into a single file to allow sets of molecules to be exchanged or input to applications.

Image described by caption.

Figure 2.2 The MDL Molfile for aspirin. The third line of the header indicates that there are 13 atoms and 13 bonds. The atom block lists each atom showing the x, y and z coordinates (in this case the z coordinate for all atoms is zero, indicating that the connection table has been derived from a 2D representation) and the atom type. The remaining fields for the atom are all set to zero here; however, these can be used to store additional information such as stereochemistry and charge. The bond block lists each bond once; the first row indicates that atom 1 is bonded to atom 2 by a double bond (shown by the label 2).

Chemical databases can be searched in different ways. An exact structure search is used to retrieve information about a specific compound from a database, for example, to look for a synthetic route or to see if a compound is available to purchase from a chemical supplier. An exact structure search is also required during the registration process for new compounds; before adding a new compound to a database it is necessary to ensure that it is not already present. In graph theory terms, an exact structure search can be solved using a graph isomorphism algorithm, that is, an algorithm for determining if two graphs are the same. While this may appear simple at first glance, it is actually a difficult problem to solve since the atoms in a connection table can be presented in any order. Similarly, for a SMILES string, the walk through a structure can begin at any atom, so that typically there are many valid SMILES representations for a given molecule. For example, alternative SMILES representations for aspirin to that shown in Table 2.1 are c1cccc(OC(=O)C)c1C(=O)O and CC(=O)Oc1ccccc1C(=O)O.

In principle, exact matching could be achieved by renumbering the connection table of each database structure in all possible ways and testing each for identity with the query molecule. However, this approach is computationally infeasible since there are N! different ways of numbering a connection table consisting of N atoms. By way of example, there are more than 6 × 10⁹ different numberings for aspirin, which consists of 13 heavy atoms. Instead, a canonical representation is generated, which is a unique ordering of the atoms in a molecular graph. This can be achieved using the Morgan algorithm [7] or a variant thereof. The Morgan algorithm is an iterative process that involves assigning connectivity values to the atoms in order to differentiate them, as illustrated in Figure 2.3. In the first iteration, each atom is assigned a value according to the number of non‐hydrogen atoms it is connected to, that is, to its number of neighbouring atoms. In subsequent iterations, each atom is assigned the sum of the connectivity values of its neighbours. The iterations continue until the number of different connectivity values is at a maximum. The atoms are then numbered as follows: the atom with the highest connectivity value is assigned as the first atom, its neighbours are then numbered in decreasing order of connectivity values and so on. If a tie occurs, then additional properties of the atoms are taken into account.
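
The iterative refinement at the heart of the Morgan algorithm is easy to sketch in code. The following pure‐Python sketch works on a hypothetical adjacency list of heavy atoms and stops when a further iteration no longer increases the number of distinct values; the property‐based tie‐breaking used in full implementations is omitted.

def morgan_values(adjacency):
    # Initial value for each atom: its number of heavy-atom neighbours.
    values = [len(nbrs) for nbrs in adjacency]
    n_distinct = len(set(values))
    while True:
        # Each atom receives the sum of its neighbours' current values.
        new_values = [sum(values[j] for j in nbrs) for nbrs in adjacency]
        if len(set(new_values)) <= n_distinct:
            return values  # no further differentiation: stop
        values, n_distinct = new_values, len(set(new_values))

# Example: 2-methylbutane; atom 1 is the branch point.
adjacency = [[1], [0, 2, 4], [1, 3], [2], [1]]
print(morgan_values(adjacency))  # [1, 3, 2, 1, 1]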

Image described by caption.

Figure 2.3 Illustration of the use of the Morgan algorithm to produce a canonical representation.

A substructure search is the process of identifying all molecules that contain a particular substructural fragment, for example, the dopamine substructure as shown in Figure 2.4. In graph theory, this is the problem of subgraph isomorphism, that is, determining if one graph is contained within another. Efficient algorithms for subgraph isomorphism exist and have been adapted to substructure search [2], but for large databases they are too slow to be used in isolation. Therefore, most substructure search procedures consist of two stages. The first is a screening step, in which fast bit string operations are used to eliminate compounds that cannot possibly match the query structure. Those compounds that remain then undergo the more time consuming subgraph isomorphism search. Compound screening is based on what have become known as molecular fingerprints. These are binary vectors where each position in the vector is associated with a molecular fragment and is set to ‘1’ if the fragment is present within a structure and to ‘0’ otherwise. There are two types of molecular fingerprints: dictionary‐based and hash‐based. In dictionary‐based fingerprints, a dictionary of fragments is pre‐compiled with each fragment mapped to a particular bit in the bit string. There is, therefore, a one‐to‐one mapping between the fragment and bit position, as shown in Figure 2.5a. An example of dictionary‐based fingerprints is MACCS [8]. In hashed fingerprints, the fragments are generated algorithmically, for example, to include all paths of atoms up to a given length. A series of hashing algorithms is then applied, each of which generates a number corresponding to a bit in the bit string. Thus, there is a many‐to‐many mapping between fragments and bit positions, as shown in Figure 2.5b.

Skeletal formulas of query, adrenaline, mefeclorazine, olmidine, fenoldopam, apomorphine, and morphine.

Figure 2.4 Example of dopamine as a substructural query along with some compounds that contain the substructure, with the substructure highlighted in bold.

Image described by caption.

Figure 2.5 Illustration of 2D fingerprints. (a) A dictionary‐based fingerprint where each bit in the fingerprint represents a particular substructural fragment. A bit is set to ‘1’ (shown as solid) if the fragment is present in the structure and to ‘0’ (shown as unfilled) if it is absent. (b) A hashed fingerprint where all linear paths up to a specified length are extracted from the structure. Each fragment is then processed by a number of different hashing algorithms and the resulting bits are set. Each fragment gives rise to a number of bits in the fingerprint and bit collisions can occur (shown by the * symbol), where a given bit is set by more than one fragment.

For a database search, fingerprints are pre‐generated for all the compounds in the database. A fingerprint is generated for the query at run time and only those compounds whose fingerprints have bits set for all of the ‘1’ bits in the query are passed forward to the subgraph isomorphism search. Both types of fingerprints have been shown to be effective for screening out the vast majority of compounds that do not match a query.
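
The two‐stage procedure can be illustrated with a short sketch, again assuming RDKit; the MACCS keys stand in for the screening fingerprints and the three database molecules are arbitrary examples. Production systems use fingerprints and file structures engineered specifically for screening at scale.

from rdkit import Chem
from rdkit.Chem import MACCSkeys

query = Chem.MolFromSmiles("OC(=O)c1ccccc1")  # benzoic acid substructure
db = [Chem.MolFromSmiles(s) for s in
      ["OC(=O)c1ccccc1OC(=O)C",   # aspirin
       "CCO",                     # ethanol
       "OC(=O)c1ccc(O)cc1"]]      # 4-hydroxybenzoic acid

q_fp = MACCSkeys.GenMACCSKeys(query)
for mol in db:
    fp = MACCSkeys.GenMACCSKeys(mol)
    # Screen: every bit set in the query fingerprint must also be set in the
    # molecule's fingerprint; only survivors undergo the subgraph search.
    if (q_fp & fp) == q_fp and mol.HasSubstructMatch(query):
        print(Chem.MolToSmiles(mol))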

The final search technique in common use in chemoinformatics is similarity searching. This forms one of the family of methods used for virtual screening and so is discussed below.

2.4 Practical Issues on Representation

The analogy between chemical structures and mathematical graphs is extremely useful and forms the basis of many chemoinformatics applications; however, it is not perfect. This is due to the complexities associated with chemical representation. Particular issues relate to tautomerism, aromaticity, ionisation state and mesomerism, whereby a given structure may be represented (and exist) in different forms [9], as illustrated in Figure 2.6. In some instances, it may be beneficial to recognise different forms as the same structure, for example, for registration in databases. The usual way this is handled is to apply somewhat arbitrary ‘business rules’ whereby a structure is converted to a standard form for storage, with the same rules being applied to a query structure [10]. The particular set of rules that is applied may differ from one system to another [9b], which can cause problems when the output of one system or program is used as the input to another. In other cases, it may be desirable to treat different forms as different structures due to their different properties. For example, the shift of a hydrogen from one atom to another can change an atom from a hydrogen bond donor to an acceptor, which can affect how a molecule may interact with a receptor. Hence for some applications it is more appropriate to enumerate the different protonation states and tautomeric forms of each compound. These issues remain the topic of much debate.

Image described by caption.

Figure 2.6 Different tautomers of 4‐pyrimidone (top) and alternate forms of the nitro group.

The complexities associated with the representation of chemical compounds are such that careful handling of structures is required to ensure the effective operation of chemoinformatics applications. As a consequence, structure cleaning is a common step in any chemoinformatics workflow and a typical data curation workflow is described by Fourches et al. [11].

SMILES is a proprietary system and the published algorithm for generating canonical SMILES is incomplete. The result is that several different canonicalisation algorithms have been developed that differ slightly from one another so that there is no single standard [12]. To overcome this issue, a more recent line notation called InChI has been developed as a non‐proprietary, unique standard chemical identifier. A hierarchical approach is followed that enables chemical structures to be encoded at different levels of granularity. The highest layer encodes connectivity, with subsequent layers encoding more detailed information such as charge, stereochemistry, isotopic information, hydrogens, etc. The hierarchical approach enables different mesomeric forms of a structure to be given a single unique identifier. The InChIKey is a condensed digital representation of an InChI developed for web‐based searching [13].

2.5 Virtual Screening

Virtual screening refers to the application of computational (in‐silico) tools to predict the properties of compounds in order to prioritise them for experimental testing. Virtual screening is usually applied in the lead generation phase of drug discovery with the aim of identifying compounds with a desired biological activity. These compounds would then be taken forward to lead optimisation with the aim of improving their properties, for example, their potency and ADME (absorption, distribution, metabolism and excretion) properties.

A wide range of different virtual screening techniques is available with the method of choice depending on the information available about the biological target of interest [2, 14]. Ligand‐based virtual screening refers to techniques that are used when the three‐dimensional (3D) structure of the protein target is unknown. When one or more active compounds are available, for example, a competitor compound or a compound in the literature, then similarity searching can be used to identify compounds that are structurally similar to the known active. When multiple active compounds are known then it may be possible to build a pharmacophore that represents the spatial arrangement of functional groups required for a molecule to bind to a receptor. The pharmacophore can then be used to query a database to find other molecules that may exhibit the pharmacophore. If both active and inactive compounds are known, for example, following a round of experimental testing, then machine learning techniques can be used to develop a structure–activity relationship model for making predictions about unknown compounds. Structure‐based virtual screening techniques are used when the 3D structure of the target is known, with the most common technique being protein–ligand docking. Each of these techniques is described in more detail below.

2.6 Ligand‐Based Virtual Screening

2.6.1 Similarity Searching

Perhaps the simplest virtual screening method is that of similarity searching, first developed in the 1980s [15]. Given a molecule of interest, for example, one that is known to exhibit a desired biological activity, it can be used as a query molecule to rank order, or prioritise, compounds in a database that are most similar to it. The premise on which similarity searching is based is the similar property principle [16], which states that molecules that are structurally similar are likely to exhibit similar activities. Thus, the top ranking compounds should have an increased likelihood of exhibiting similar activity to the query compared to a set of compounds selected at random. While there are exceptions to this rule, where a small change in structure leads to a large change in potency (a phenomenon that has become known as an activity cliff [17]), the similar property principle is the cornerstone of all medicinal chemistry.

Similarity searching requires a way of quantifying the similarity between the query and each compound in the database. A similarity measure consists of three components: a set of molecular descriptors, an optional weighting scheme, whereby some of the descriptors can be given more emphasis than others, and a similarity coefficient, which is used to quantify the similarity based on the molecular descriptors and their weights (if applicable).

Many different molecular descriptors have been developed for similarity searching. They can be divided into whole molecule descriptors, 2D fingerprints and descriptors derived from a 3D representation. Whole molecule properties are typically represented by real valued numbers or integers and represent physicochemical properties, such as log P and molecular weight, or counts of features such as the numbers of hydrogen bond donors and acceptors. Whole molecule descriptors can also include topological descriptors, which are derived from the 2D graph representation of a molecule. A topological index typically captures some element of the size and shape of a molecule in a single number. For example, the Wiener index is the sum of the topological (through‐bond) distances between all pairs of atoms within a structure and gives a measure of the amount of branching in a molecule.

A single property is unlikely to be sufficiently discriminating to be useful in similarity searching and so whole molecule descriptors are usually combined into a vector of values. Some pre‐processing of the descriptors is usually required. Firstly, standardisation should be used to place all the descriptors on to the same scale and ensure that subsequent calculations are not dominated by descriptors that operate over a greater range of values. For example, the typical molecular weight for a drug‐like molecule is in the order of a few hundred Daltons, whereas, according to Lipinski rules, the log P of a compound intended for oral absorption should be <5 [18]. Standardisation can be achieved using the Z score, in which a value is transformed such that the mean value of the dataset is at zero and ±1 represents one standard deviation from the mean. Another common approach is to scale the descriptors to fall in the range zero to one. It is also usual to remove highly correlated variables and variables that show limited variance across the dataset. More sophisticated data reduction techniques can also be used, such as a principal components analysis, which transforms the original descriptors into a smaller number of orthogonal descriptors that are linear combinations of the originals.
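
As a brief illustration, Z‐score standardisation reduces to a one‐line array operation. A minimal sketch assuming NumPy, with a small hypothetical descriptor matrix:

import numpy as np

# Hypothetical descriptors: rows are molecules, columns are
# [molecular weight, log P, hydrogen bond donor count].
X = np.array([[180.2, 1.2, 1.0],
              [305.4, 3.8, 2.0],
              [451.6, 4.9, 0.0]])

# Z score: zero mean and unit standard deviation per descriptor (column).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.round(2))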

Despite being developed for substructure search, 2D fingerprints have proved to be very effective at identifying molecules that are structurally similar and are the most commonly used descriptors for similarity searching. In addition to the dictionary‐based and hashed fingerprints described above, other variants include atom pair descriptors and circular fingerprints. An atom pair consists of two atoms, each encoded according to properties such as element type and number of pi electrons, together with the shortest through‐bond distance between them [19]. A molecule is represented by all of its constituent atom pairs, which can be mapped to a binary vector. In circular fingerprints, each atom is represented by its element type together with its neighbouring atoms, see Figure 2.7. The radius/diameter that is used to capture the neighbours can be varied, for example, ECFP2 descriptors use a diameter of two and include the nearest neighbours only, whereas ECFP4 descriptors also encode the neighbours of the nearest neighbours at diameter 4 [20]. As for atom pairs, circular fingerprints can be mapped to a binary vector. Recent comparisons of different 2D fingerprints for similarity searching are provided by Riniker and Landrum [21] and O'Boyle and Sayle [22].

Image described by caption.

Figure 2.7 An illustration of circular fingerprints at different radii. The overlapping shaded areas of decreasing colour intensity correspond to fingerprints of radius 0, 1 and 2 (or diameter 0, 2 and 4).

The role of weighting schemes in similarity searching has been much less studied than either the types of descriptors used or the similarity coefficient. Perhaps the most studied approach has been to count the number of occurrences of each fragment when constructing a fingerprint, rather than simply recording the presence or absence of fragments. Although some evidence exists that the use of weighting schemes can be effective in some situations [23], the most common approach to similarity searching is to ignore weighting schemes altogether.

Similarity coefficients provide a way of quantifying the similarity between a pair of molecules based on their molecular descriptors. For binary fingerprints, the most common similarity coefficient is the Tanimoto coefficient, which is defined as

$$S_{AB} = \frac{c}{a + b - c} \tag{2.1}$$

where c is the number of bits set to ‘1’ in common, a is the number of bits set to ‘1’ in molecule A and b is the number of bits set to ‘1’ in molecule B. For binary fingerprints, the Tanimoto coefficient ranges from 1 (when the fingerprints are identical) to 0 (when there are no bits in common). Thus, the molecules in a database are ranked on decreasing similarity value to the query. Note that whereas identical structures will give rise to identical fingerprints the opposite is not true; identical fingerprints do not necessarily imply identical structures.
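
A complete similarity search can therefore be sketched in a few lines, assuming RDKit. Morgan (circular) fingerprints of radius 2, comparable to ECFP4, are used as the descriptors; the database SMILES are arbitrary illustrative choices.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

query = Chem.MolFromSmiles("OC(=O)c1ccccc1OC(=O)C")  # aspirin
db_smiles = ["OC(=O)c1ccccc1O",     # salicylic acid
             "CC(=O)Nc1ccc(O)cc1",  # paracetamol
             "CCO"]                 # ethanol

q_fp = AllChem.GetMorganFingerprintAsBitVect(query, 2, nBits=2048)
scores = []
for smi in db_smiles:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    scores.append((DataStructs.TanimotoSimilarity(q_fp, fp), smi))

# Rank the database on decreasing Tanimoto similarity to the query.
for score, smi in sorted(scores, reverse=True):
    print(f"{score:.2f}  {smi}")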

For descriptors that are based on continuous values such as physicochemical properties, it is more usual to calculate the distance between molecules using, for example, Euclidean distance, which is defined as

$$D_{AB} = \sqrt{\sum_{i=1}^{N} \left(x_{i,A} - x_{i,B}\right)^{2}} \tag{2.2}$$

where $x_{i,A}$ and $x_{i,B}$ are the values of descriptor i in molecules A and B, respectively, and there are N different descriptors. In virtual screening, the molecules in a database will then be ranked on increasing distance to the query molecule.
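
A minimal sketch of distance‐based ranking, assuming NumPy and descriptors that have already been standardised as described above:

import numpy as np

# Hypothetical standardised (Z-scored) descriptors for three molecules.
X = np.array([[ 0.5, -1.2,  0.3],
              [-0.1,  0.4, -0.8],
              [ 1.3,  0.9,  1.1]])
q = np.array([0.4, -1.0, 0.2])  # standardised query descriptors

# Euclidean distance of each database molecule to the query (Eq. (2.2)).
dists = np.sqrt(((X - q) ** 2).sum(axis=1))
print(np.argsort(dists))  # database indices, nearest first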

2.6.2 Scaffold Hopping

2D fingerprints have proved to be surprisingly effective in similarity searching applications, but evidence suggests that they are good at finding close analogues and less effective at finding compounds that belong to different chemical series. Moving from one chemical series to another is often referred to as scaffold hopping [24]. There are a number of reasons why scaffold hopping can be beneficial. One is to enable new intellectual property (IP) to be established, by moving away from the patent coverage associated with the query compound. Another reason might be to replace some parts of the query structure that give rise to undesirable properties, for example, unwanted side effects. A third reason might be related to compound synthesis; for example, the query compound may have some characteristics that make it unsuitable for scale‐up as a drug compound.

A number of descriptors have been developed that aim to focus on features of the molecules that could be responsible for receptor binding with less emphasis given to the exact connectivity or skeleton of a structure. For example, Similog keys [25] use an atom typing scheme in which key atoms are described according to the presence or absence of four properties: hydrogen bond donor, hydrogen bond acceptor, bulkiness and electropositivity. Atom triplets (sets of three atoms) consisting of atom types combined with the through bond distances between the atoms form DABE keys, see Figure 2.8. A molecule descriptor is then constructed by counting all possible DABE keys contained within a structure. CATS descriptors are related but are based on atom pairs with the individual atoms represented according to property rather than element type [26].

Image described by caption.

Figure 2.8 Similog keys. The red atom is assigned as a hydrogen bond acceptor; the green atom is both donor and acceptor and the blue atom is assigned as bulky.

Another approach is that of reduced graphs. A reduced graph is an abstract representation of a molecule in which groups of atoms are replaced by nodes and edges are formed between the nodes in order to retain the topology of the original structure. There are many ways in which chemical graphs can be reduced. For drug discovery applications the aim is usually to capture groups of atoms according to their functionality so that compounds with similar bioactivity are identified as similar, irrespective of their exact chemical scaffolds. Thus, typical node definitions include hydrogen bond donor groups, hydrogen bond acceptors, aromatic rings, etc. See Figure 2.9 for an example. Reduced graphs have been used in a number of applications including similarity searching, identifying structure–activity relationships and as a way of browsing the content of clusters of compounds [27].

Image described by caption.

Figure 2.9 A chemical structure and its reduced graph representation. Nodes have been identified as two fused aromatic rings with the hydrogen bond donor character in red (the fusion is shown by a ‘=’ in the reduced graph), an aromatic ring with no hydrogen bonding feature (orange), an acyclic linker node (blue) and an acyclic node with both hydrogen bond donor and hydrogen bond acceptor characteristics (green).

2.6.3 3D Similarity Search

Given that drug–receptor binding is a 3D event, there has been considerable interest in developing similarity methods that are based on 3D representations of molecules [28]. 3D similarity searching can be divided into alignment‐free methods and those that require the molecules to be superimposed prior to calculating their similarity. Alignment‐free methods typically involve abstracting the 3D features of a molecule, such as interatomic distances and/or angles, into a vector representation that is independent of the orientation of the molecule in 3D space. This then allows descriptors generated from different molecules to be compared directly. An example of this approach is ultrafast shape recognition (USR) in which a molecule is represented by 12 descriptors, which are statistical moments derived from the atomic distances within a molecule [29]. Four reference locations are generated from the atomic coordinates and include the geometric centre of the molecule together with atoms at the extremes. The distances of all atoms to each reference point are collected as a histogram from which the average, the standard deviation and the skewness are extracted to give the 12 descriptors.
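
RDKit includes an implementation of the USR moments, so the method can be sketched briefly as follows. Note that the embedding step generates a single, arbitrary low‐energy conformer rather than the ensemble that would be used in practice.

from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors

def usr_descriptor(smiles):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=42)  # generate one 3D conformer
    return rdMolDescriptors.GetUSR(mol)        # the 12 USR moments

d1 = usr_descriptor("OC(=O)c1ccccc1OC(=O)C")  # aspirin
d2 = usr_descriptor("OC(=O)c1ccccc1O")        # salicylic acid
print(rdMolDescriptors.GetUSRScore(d1, d2))   # 1 = identical shape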

Perhaps the most well‐known of the alignment‐based methods is the ROCS (rapid overlay of chemical structures) program [30]. The ROCS algorithm compares molecules by measuring the common volume they occupy and is based on earlier work by Grant, Gallardo and Pickup in which atomic volumes are represented by Gaussian functions [31]. The use of Gaussians enables the overlap volume to be calculated rapidly using analytical methods. A Tanimoto‐like function is used to convert the volume overlap to a similarity score as follows:

$$S_{AB} = \frac{V_c}{V_a + V_b - V_c} \tag{2.3}$$

where $V_c$ is the common overlap volume, $V_a$ is the volume of molecule A and $V_b$ is the volume of molecule B. Although the calculation for a given superposition is relatively fast, the method requires that the superposition of the pair of molecules that maximises the overlap is found prior to calculating the similarity; hence ROCS is typically slower in operation than alignment‐free methods. In addition to shape‐based similarity, ROCS also has a chemistry‐based alignment method known as ‘colour’, which takes account of the properties of atoms; for example, atoms that can act as hydrogen bond donors or acceptors. The combination of shape and chemistry matching is known as ComboROCS.

Although 3D similarity methods are appealing conceptually, the issue of conformational flexibility makes them considerably more complex than 2D methods. Most methods handle conformational flexibility by pre‐computing an ensemble of conformers for each molecule. The aim is to sample conformational space at a resolution that is sufficient to include all low energy conformations but not so exhaustive that excessive numbers of conformers are produced, thereby greatly increasing the computational time required to process them [24c]. Typical sampling strategies are based on a threshold strain energy, root mean square deviation of atom positions or simply on the maximum number of conformers permitted. Ideally the sampling method will result in something similar to the bioactive conformation of a molecule being represented; however, it may be that the bound conformation is not close enough to an energy minimum for this to occur. Effective conformational sampling remains a challenging area for 3D methods [32].

2.6.4 Pharmacophore‐Based Virtual Screening

A pharmacophore is the spatial arrangement of functional groups, or features, required for a small molecule to bind to a receptor [33]. A pharmacophore is usually expressed according to features that describe the types of interaction made, rather than specific functional groups. For example, typical features are hydrogen bond donors, hydrogen bond acceptors and hydrophobic and aromatic groups. Figure 2.10 shows a set of pharmacophore features for a series of CDK2 ligands. In this example, the alignment was generated from a series of ligands for which 3D structures of the protein–ligand complexes are available by superimposing the complexes according to the binding site atoms of the proteins and extracting the ligands. The pharmacophoric features were then determined as those features that occur in all of the ligands at the same location within the binding site. These include hydrophobic features that are centred on the rings and donor and acceptor features shown as projections from the relevant heavy atoms.

Image described by caption.

Figure 2.10 A set of CDK2 inhibitors aligned to show their common pharmacophoric features.

Pharmacophore methods are usually used when the 3D structure of the target is unknown. The aim is to explore the conformational space of a set of active compounds while simultaneously aligning them such that similar features in the different compounds are overlaid. This process is carried out in the absence of the 3D structure of the receptor. The underlying assumption is often made that all of the active compounds bind to the target in the same way. It is important that the actives used to build a pharmacophore are chosen carefully. For example, they should be sufficiently diverse that spurious features are not identified, that is, features that occur in all of the molecules but that are not involved in binding. Also, ideally at least one of the actives should be reasonably rigid to avoid finding promising‐looking overlays that do not correspond to the binding conformations. A number of pharmacophore identification methods have been developed [34], with their performance being assessed on the ability to reproduce known binding poses [35].

The difficulties associated with pharmacophore identification are such that it is rarely possible to determine the true pharmacophore directly and, therefore, most pharmacophore generation programs will generate a set of plausible hypotheses that should then be evaluated. This can be done by using a holdout set, that is, omitting some of the active compounds when generating the hypotheses and choosing a hypothesis that is consistent with those left out. Once a pharmacophore has been identified and validated, it can then be used to search a database to identify those compounds that match the pharmacophore. As this is a 3D technique, it is important that conformational flexibility is taken into account when searching the database. It is also important that the same feature definitions are used when searching as were used to identify the pharmacophore. These may sound like obvious conditions, but they are sometimes overlooked. As for other 3D methods, it is important that protonation and tautomer states are considered appropriately since these can determine whether a given atom can act as a hydrogen bond donor or a hydrogen bond acceptor.

2.6.5 Structure–Activity Relationship Modelling

When both active and inactive compounds are available, machine learning techniques can be used to learn a model of activity that can then be applied to make predictions about previously unseen compounds. Quantitative structure–activity relationship (QSAR) modelling dates back to the work of Hansch, who correlated biological activity with physicochemical properties using multiple linear regression. This approach was extended later by Free and Wilson to the use of structural properties as descriptors rather than physicochemical properties. While these approaches are still in use they are generally restricted to building local models relevant to a small structurally homogeneous dataset, as seen in lead optimisation [36].

For virtual screening applications, the aim is to use machine learning techniques to build structure‐activity models that can be applied to a wide variety of drug‐like molecules that are structurally heterogeneous [37]. The models are developed using training data that is typically labelled as active and inactive, with the focus being on classifying test compounds as active or inactive, or rank‐ordering compounds on their probability of being active. The popularity of this approach has grown in recent years due to a number of factors including the rapid growth in publicly available data about compounds and their properties and the development of a wide variety of sophisticated nonlinear machine learning algorithms.

The first application of machine learning to biological screening data was the use of substructural analysis (SSA) [38]. In SSA, each molecule in a training set of active and inactive compounds is characterised by a set of binary descriptors such as 2D fingerprints. A set of weights is then calculated, one for each fragment (or bit position) in the fingerprint. A fragment's weight reflects the probability that a molecule containing that fragment will be active; for example, the weight may be the fraction of the actives in the training set that contain the fragment. A previously unseen compound is scored by combining the weights of the fragments it contains. The resulting score represents the probability that the test compound will be active and SSA scores can be used to rank‐order a database of compounds. SSA is closely related to the naïve Bayesian classifier that has become popular in chemoinformatics recently. Other commonly used machine learning methods include k‐nearest neighbours (kNN), decision trees, random forests and support vector machines.

kNN is conceptually very easy to understand. Each test set compound is compared with all training set compounds to find its k nearest neighbours. The class membership of a test compound is then predicted to be the majority class amongst the k nearest neighbours, as shown in Figure 2.11. Implementing kNN requires a definition of similarity (that is, a set of descriptors and a similarity coefficient) and the value of k to be specified; k is usually chosen as an odd number in order to avoid ties.
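
A minimal kNN sketch using scikit‐learn is shown below; the descriptor matrix and activity labels are random placeholders standing in for real training data.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.random((100, 16))    # 100 molecules, 16 descriptors
y_train = rng.integers(0, 2, 100)  # 1 = active, 0 = inactive

knn = KNeighborsClassifier(n_neighbors=3)  # odd k avoids ties
knn.fit(X_train, y_train)

X_test = rng.random((5, 16))
print(knn.predict(X_test))  # majority class among the 3 nearest neighbours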

Image described by caption.

Figure 2.11 Example of kNN. The training examples are shown in red (representing active compounds) and green (representing inactive compounds). The test instance is shown in blue with its three nearest neighbours identified. The test compound is assigned as inactive since this is the majority class membership of its three nearest neighbours.

A decision tree consists of a set of rules that is used to associate specific features or descriptor values with a classification label. For example, each rule may correspond to the presence or absence of a particular feature or a particular range of descriptor values. A decision tree is constructed using training data, which is progressively split into subsets with the aim of separating the two classes. At each decision point, the descriptor or variable that gives the best split is chosen, where best is determined using some measure of the purity of a split. The same procedure is then applied to each of the subsets that are produced. One way of determining the best split is to use entropy, which measures the extent of disorder in a set. If N molecules are distributed over a set of classes, with class i containing $n_i$ molecules, the entropy is

$$E = -\sum_{i} \frac{n_i}{N} \log_2 \frac{n_i}{N} \tag{2.4}$$

The aim is to choose the split that minimises entropy. Once a tree has been constructed a test compound is classified by ‘dropping’ it down the tree until a terminal, or leaf, node is reached. The test compound is then assigned to the class according to the distribution of training examples in the node; for example, it is assigned to the majority class represented in the leaf node.

Decision trees generate models that are easy to interpret, but they are prone to overfitting and they are sensitive to small changes in the training data. These factors can limit their performance as predictors. One way of improving performance is to use an ensemble of decision trees. Random Forest is one such approach where many classification trees are grown using different random samples of the training data. The randomisation occurs at two levels: first, a random subset of the features is available at each node and, second, each tree is based on a bootstrap sample of the data whereby, for a set of N molecules, N are chosen at random from the data with replacement. Bootstrapping means that each molecule may be chosen zero, one or more than one time. The trained ensemble of trees, the forest, is then used to make predictions on test data. A test compound is passed through each tree with its classification being determined by the majority vote. The proportion of votes cast for a particular class can be used as a measure of the confidence of the prediction.
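
The following sketch, again with scikit‐learn and random placeholder data, trains an ensemble and uses the vote proportions as prediction confidences.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.random((200, 16))
y_train = rng.integers(0, 2, 200)

# Each tree sees a bootstrap sample; a random subset of features ("sqrt")
# is considered at each split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X_train, y_train)

X_test = rng.random((3, 16))
print(rf.predict(X_test))        # majority vote across the forest
print(rf.predict_proba(X_test))  # proportion of votes per class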

Support vector machines (SVMs) have been widely used in chemoinformatics, especially for classification problems. The aim is to find the best separation between two classes of compounds such that each class lies on the opposite side of a hyperplane, as shown in Figure 2.12. There is usually an infinite number of hyperplanes that could be constructed and in SVMs the hyperplane that maximises the margin between the two classes is chosen, the aim being to maximise the margin while minimising the number of misclassified samples. It is not usually possible to find a linear separation of the compound classes using the original descriptor space and so what is known as the ‘kernel trick’ is used to transform the original descriptors into a higher dimensional space where they are linearly separable. The radial basis function has been shown to perform well and is therefore widely used [39].
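
A corresponding SVM sketch with scikit‐learn's radial basis function kernel, using the same kind of placeholder data:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((200, 16))
y_train = rng.integers(0, 2, 200)

# The RBF kernel performs the implicit projection into a higher dimensional
# space (the 'kernel trick'); C controls the trade-off between margin width
# and misclassification.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print(svm.predict(rng.random((3, 16))))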

Graph of support vector machine with scattered stars and dots (left) and graph with 3 parallel lines separating stars and dots (right). A rightward arrow in between 2 graphs is labeled projection into higher dimensional space.

Figure 2.12 Support vector machine.

2.7 Protein–Ligand Docking

When the 3D structure of the protein target is known, the most widely used virtual screening technique is protein–ligand docking. The Protein Databank (PDB) is a public repository of 3D structures of macromolecules including proteins, DNA and RNA, the majority of which have been determined using experimental techniques such as X‐ray crystallography, cryoelectron microscopy or NMR techniques [40]. In the context of virtual screening, the aim of docking is to evaluate small molecules, or ligands, within a database and prioritise those that are most likely to bind to the protein [41]. In order to bind, a small molecule should have complementary steric and electrostatic properties to a binding site on the protein. For each ligand, docking involves identifying the bound conformation of the ligand, determining the relative orientation of the ligand within the protein binding site and then estimating the binding energy of the ligand, so that the database of ligands can be rank‐ordered [14a]. Protein–ligand docking is a complex problem, especially accurate scoring, and, as for other virtual screening methods, the aim is to reduce the number of false negatives in the prioritised compounds rather than selecting high affinity compounds directly.

Many docking programs have been developed that adopt different approaches to the steps outlined above. The conformational space of the ligands can be explored by pre‐computing ensembles of conformers, as described above in 3D similarity methods, with each conformer docked to the protein as a rigid body, as in DOCK [42]. The alternative approach is to explore the conformational space of the ligand simultaneously with the docking procedure, that is, at the same time as exploring different orientations of the ligand within the binding site. This can be achieved using stochastic methods such as simulated annealing or genetic algorithms as in GOLD [43] and AutoDock [44] or by using an incremental construction method whereby a ligand is broken into fragments and reconstructed within the binding site with conformational space explored in a stepwise manner [45]. Identifying the binding pose also requires some degree of protein flexibility to be taken into account and most docking programs now allow movement of the protein sidechains within the binding pocket. However, handling full protein flexibility is limited to molecular dynamics simulations, which are far too computationally demanding to be used in virtual screening.

Scoring functions can be divided into force‐field, empirical and knowledge‐based functions [41b]. Force‐field methods use classical molecular mechanics to estimate binding energies directly by summing the contributions of non‐bonded terms, such as the electrostatic and van der Waals energy terms, between all atoms in the two molecules in the complex. Empirical scoring functions adopt a QSAR‐type approach in which the weights associated with various energy terms are fitted to experimental data using, for example, multiple linear regression. The typical terms include hydrogen bonding interactions, hydrophobic contact terms, desolvation and entropy terms, with the experimental data consisting of measured binding affinities for protein–ligand complexes. Empirical scoring functions are straightforward to calculate but their accuracy is dependent on the relationship between the training data used to derive the function and the data used for virtual screening. Knowledge‐based scoring functions are based on statistical data extracted from protein–ligand complexes with the underlying assumption being that frequently occurring interatomic distances represent favourable contacts. Pairwise energy potentials are derived from known complexes and a given binding pose is scored by summing the contributions for individual atom pairs.

There are many examples of the successful applications of protein–ligand docking in the literature; see, for example, [41c]. However, considerable care needs to be exercised when setting up a docking‐based virtual screening protocol. This includes careful selection and preparation of both the protein target and the database of ligands [41a and c]. First it is necessary to select the most suitable protein structure. Ideally this should be a structure determined using X‐ray crystallography and be of high resolution. Consideration should also be given to which conformation(s) of the protein to use; for example, there may be multiple 3D structures of the protein in different conformations with and without ligands being bound. When a protein–ligand complex is available, the binding site is already known; in the absence of a complex, various programs are available to predict the binding site [41b]. It has been shown, however, that dockings based on structures for which protein–ligand complexes exist generally perform better than those based on apo (unbound) protein structures. A key factor in this is the ability of a binding site to adapt its shape on ligand binding, a process known as induced fit [46].

Preparation of the protein includes adding hydrogen atoms (since these are not present in structures extracted from the PDB) and specifying the correct protonation and tautomeric states of the binding site residues. This is particularly important since they determine the interactions that can be made with a ligand. Consideration also needs to be given to water molecules; that is, whether or not these should be eliminated from the binding site or if key water molecules should be conserved. The ligands also require careful handling. If they originate as 2D structures then they must first be converted to 3D structures. For chiral molecules, the stereochemistry should be handled appropriately and, if not specified, then all enantiomers should be enumerated. The most effective way of handling the different protonation and tautomeric states of ligands is to enumerate the different possibilities. There is a growing number of publicly available datasets that are available for docking; for example, the ZINC database now consists of over 100 million commercially available compounds that are already prepared for docking [47]. Finally, if a rigid docking algorithm is to be used then low energy conformers should be pre‐generated [48].

Given the complexities of setting up a docking run, it can be beneficial to run retrospective tests using known ligands and a carefully selected set of decoys. Such a test should be able to reproduce the known binding modes for protein–ligand complexes, as shown in Figure 2.13, and should also result in the known actives being ranked above the decoy compounds. If this cannot be achieved for retrospective data then it is unlikely that the docking setup will be effective in a prospective virtual screen.

Image described by caption.

Figure 2.13 (a) Protein‐ligand complex 4LYN, a cyclin‐dependent kinase 2 complex. Image generated using the NGL Viewer [49]. (b) Poses generated by the docking program GOLD are superimposed on the X‐ray crystal pose, which is shown in pink.

As indicated above, the accurate prediction of binding affinity remains a challenge. Scoring functions are generally considered to be successful in predicting the correct binding pose for a small molecule but they are less effective at predicting relative binding affinities. Therefore, post‐docking analysis procedures are often used to select the final compounds for experimental screening rather than relying on the docking score alone. This can involve visual inspection as well as the use of automated methods, for example, to prioritise preferred interaction patterns [14a]. There continues to be a steady stream of incremental improvements to scoring and other aspects of docking, such as the ability to handle new interactions terms, metal coordination and improved handling of water [41c].

2.8 Evaluating Virtual Screening Methods

Given the wide variety of different descriptors and methods available for virtual screening along with the parameterisation required by many methods, it is important to evaluate virtual screening performance in order to establish the best protocol. Performance can be evaluated using retrospective data, that is, data for which the correct answers are already known. For methods that generate a ranking of molecules, such as similarity searching and protein–ligand docking, the usual measures include the enrichment factor and area under the curve (AUC). The enrichment factor determines the increase in actives in the top few positions of the ranked list relative to their being distributed evenly throughout the ranked list. The AUC takes account of the distribution of actives over the entire list, with the value 1 indicating that all of the actives are ranked at the top and a value of 0.5 indicating an even distribution throughout the list. A number of variants of the AUC have been developed that give greater weight to the earlier part of the ranked list since this is most important for virtual screening, where the aim is to select a small subset of the available compounds.
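
Both measures are straightforward to compute from a ranked list. The sketch below assumes NumPy and scikit‐learn; the scores and labels are random placeholders and enrichment_factor is a hypothetical helper written for this illustration.

import numpy as np
from sklearn.metrics import roc_auc_score

def enrichment_factor(labels_ranked, fraction=0.01):
    # labels_ranked: 1/0 activity labels sorted by decreasing score.
    n_top = max(1, int(len(labels_ranked) * fraction))
    hit_rate_top = labels_ranked[:n_top].sum() / n_top
    hit_rate_all = labels_ranked.sum() / len(labels_ranked)
    return hit_rate_top / hit_rate_all

rng = np.random.default_rng(0)
scores = rng.random(1000)          # virtual screening scores
labels = rng.integers(0, 2, 1000)  # 1 = active, 0 = inactive
order = np.argsort(-scores)        # rank on decreasing score

print(enrichment_factor(labels[order], fraction=0.01))
print(roc_auc_score(labels, scores))  # 0.5 = random, 1.0 = perfect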

For machine learning methods, the usual approach is to divide the available data into two sets: a training set that is used to build the model and an external test set that is used to assess its performance. The training data can be further split into training and validation data, with the validation set used to tune the parameters of the model. The performance of the model is calculated on the external test set. A key consideration for machine learning methods is that predictions made by the resulting models should only be considered reliable for compounds that have similar characteristics to the training data, a concept that has become known as the applicability domain of a model [50]. For classification problems, performance is typically assessed using elements of the confusion matrix (Table 2.2), where predictions are assigned as true positives (TPs), false positives (FPs), true negatives (TNs) and false negatives (FNs).

Table 2.2 A confusion matrix.

                        Predicted class
                        Positive         Negative
Actual class  Positive  True positive    False negative
              Negative  False positive   True negative

A variety of different measures can be calculated from the confusion matrix. For example, sensitivity, or the TP rate, is the proportion of the positive class that is correctly predicted (TP/(TP + FN)) and specificity, or the TN rate, is the proportion of the negative class that is predicted as negative (TN/(FP + TN)). Other measures include accuracy, the F1 score and the Matthews correlation coefficient (MCC).
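
These quantities can be computed directly from the confusion matrix, for example with scikit‐learn, as in the following sketch with placeholder predictions.

from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # actual classes (1 = active)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # predicted classes

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity, matthews_corrcoef(y_true, y_pred))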

2.9 Case Studies of Virtual Screening

Many successful applications of virtual screening have now been reported in the literature and have been summarised in various reviews; see, for example, [51]. Two examples are provided here.

2.9.1 Ligand‐Based Virtual Screening: Discovery of LXR Agonists

Temml et al. [52] used a combination of pharmacophore modelling and 3D shape matching to identify selective agonists of liver X receptor (LXR) as potential regulators of cholesterol metabolism. The study involved a set of six pharmacophore hypotheses that had been generated in an earlier study using Discovery Studio and had been experimentally validated. The active compounds found using these pharmacophores had shown general LXR activity but had not been tested for selectivity. Therefore, a test set of compounds was assembled and sorted on their LXR subtype selectivity. Conformers were generated for these compounds and the resulting library was screened against the six pharmacophore models. Three of the models successfully identified a significant number of highly selective compounds. The most selective of the test compounds was also used as a query in a shape‐based search using the ROCS program with the ComboScore, which combines shape and chemistry features. The ROCS search was conducted against the Specs library consisting of around 200 000 compounds. A maximum of 400 conformers was generated for each compound in the database using the OMEGA program. The top 500 molecules found using ROCS were then filtered using the three selective pharmacophores. For the pharmacophore matching, a new conformer database was generated using the same software suite that had been used to generate the original hypotheses. The pharmacophore filtering reduced the compound set to 56, of which 10 were selected for biological testing. The combined pharmacophore and shape‐based virtual screening procedure identified some selective LXR agonists and was therefore deemed to be a success.

2.9.2 Combined Ligand‐ and Structure‐Based Approaches: Pim‐1 Kinase Inhibitors

Pim‐1 has been found to be involved in a number of signalling pathways and is implicated in multiple human cancers. Ren et al. [53] used a combination of ligand‐ and structure‐based virtual screening methods in a hierarchical virtual screening procedure and were successful in identifying novel inhibitors of Pim‐1. The virtual screening methods they used were SVM modelling, pharmacophore modelling and protein–ligand docking.

An SVM model was built using a set of 500 known Pim‐1 inhibitors and a set of 37 000 presumed inactive compounds. The model was based on a set of 50 molecular descriptors consisting of topological descriptors and physicochemical properties. The SVM model was trained using cross‐validation and evaluated on an independent test set. The resulting model showed high accuracy on both the training set and the test set and was then used as the first virtual screen in the hierarchy to screen a database of 20 million compounds; 56 500 compounds were retained.

The pharmacophore model was built using eight Pim‐1 inhibitors. Four hypotheses were generated and the top scoring one, consisting of one hydrogen bond acceptor, one hydrogen bond donor and one hydrophobic aromatic feature, was validated on a test set consisting of known actives and inactives. A high percentage (83%) of the actives mapped to the hypothesis. The molecules output by the SVM classifier were mapped to the pharmacophore hypothesis and a subset of 10 700 compounds was retained. These were then docked to the protein binding site using the GOLD docking program. The docking protocol was optimised prior to running the virtual screening. A high resolution protein–ligand complex was chosen. Seven compounds that were known to bind to the protein were docked to it with the docking parameters adjusted, including the choice of scoring function, until the known binding poses were reproduced. The 935 top scoring compounds found by GOLD were inspected visually and 47 were chosen for experimental testing in an in vitro biochemical assay. Five compounds were identified as novel Pim‐1 inhibitors.

2.10 Conclusions

This chapter has provided an introduction to the fundamental concepts that underpin many chemoinformatics applications. The starting point was to consider representation issues as a foundation before describing basic search techniques. The main focus was then on the variety of virtual screening techniques that are routinely used in identifying lead compounds. Although these were presented as distinct approaches, the application examples highlighted above show that many drug discovery programs follow a cascaded protocol in which several in silico methods are applied in sequence, as sketched below. This is especially the case when very large compound libraries are explored: the less computationally demanding methods are applied first to reduce the number of compounds that undergo more expensive analysis such as docking.
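
Such a cascade can be expressed very simply in code. In the hypothetical sketch below, each stage is a pass/fail predicate ordered from cheapest to most expensive, and each stage sees only the survivors of the previous one; the stage names are illustrative placeholders.

from typing import Callable, Iterable, List

Filter = Callable[[str], bool]  # each stage takes a SMILES and returns pass/fail

def cascade(library: Iterable[str], stages: List[Filter]) -> List[str]:
    """Run a virtual screening cascade: expensive stages process far fewer
    compounds because cheap stages have already thinned the library."""
    survivors = list(library)
    for stage in stages:
        survivors = [smiles for smiles in survivors if stage(smiles)]
    return survivors

# Hypothetical stages, ordered by increasing computational cost, e.g.:
# hits = cascade(library, [property_filter, fingerprint_model,
#                          pharmacophore_match, docking_filter])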

The focus here has been on lead generation, but it should be acknowledged that chemoinformatics techniques impact all stages of the drug discovery pipeline, and there are many topics that have not been discussed, for example computational filtering and library design, chemical space exploration including diversity analysis and clustering, toxicity prediction, reaction searching, de novo design and the assessment of synthetic accessibility, to name a few. Another limitation is the focus on the traditional single‐target approach, whereas drug discovery is now shifting towards multitarget strategies; for example, machine learning methods are being developed for multitask prediction, where the aim is to predict the response for a number of different endpoints using deep learning [54]. What is clear is that, with the growing volumes of data, chemoinformatics techniques are needed more than ever to guide decision making in drug discovery.

References

  1. Brown, F.K. (1998). Chemoinformatics: what is it and how does it impact drug discovery? In: Annual Reports in Medicinal Chemistry, vol. 33 (ed. J.A. Bristol), 375–384. Academic Press.
  2. Willett, P. (2011). Wiley Interdiscip. Rev.: Comput. Mol. Sci. 1: 46–56.
  3. (a) Gonzalez‐Medina, M., Naveja, J.J., Sanchez‐Cruz, N., and Medina‐Franco, J.L. (2017). RSC Adv. 7: 54153–54163. (b) Kooistra, A.J., Vass, M., McGuire, R. et al. (2018). ChemMedChem 13: 614–626. (c) Humbeck, L. and Koch, O. (2017). ACS Chem. Biol. 12: 23–35.
  4. Bienfait, B. and Ertl, P. (2013). J. Cheminf. 5: 24.
  5. (a) Weininger, D. (1988). J. Chem. Inf. Comput. Sci. 28: 31–36. (b) Weininger, D., Weininger, A., and Weininger, J.L. (1989). J. Chem. Inf. Comput. Sci. 29: 97–101.
  6. Dalby, A., Nourse, J.G., Hounshell, W.D. et al. (1992). J. Chem. Inf. Comput. Sci. 32: 244–255.
  7. Morgan, H.L. (1965). J. Chem. Doc. 5: 107–113.
  8. Durant, J.L., Leland, B.A., Henry, D.R., and Nourse, J.G. (2002). J. Chem. Inf. Comput. Sci. 42: 1273–1280.
  9. (a) Sayle, R.A. (2010). J. Comput. Aided Mol. Des. 24: 485–496. (b) Warr, W.A. (2010). J. Comput. Aided Mol. Des. 24: 497–520.
  10. Hersey, A., Chambers, J., Bellis, L. et al. (2015). Drug Discovery Today: Technol. 14: 17–24.
  11. (a) Fourches, D., Muratov, E., and Tropsha, A. (2010). J. Chem. Inf. Model. 50: 1189–1204. (b) Fourches, D., Muratov, E., and Tropsha, A. (2016). J. Chem. Inf. Model. 56: 1243–1252.
  12. O'Boyle, N. (2012). J. Cheminf. 4.
  13. Warr, W.A. (2015). J. Comput. Aided Mol. Des. 29: 681–694.
  14. (a) Rognan, D. (2017). Pharmacol. Ther. 175: 47–66. (b) Vogt, M. and Bajorath, J. (2012). Bioorg. Med. Chem. 20: 5317–5323.
  15. (a) Willett, P. (2014). Mol. Inf. 33: 403–413. (b) Maggiora, G., Vogt, M., Stumpfe, D., and Bajorath, J. (2014). J. Med. Chem. 57: 3186–3204.
  16. Johnson, M.A. and Maggiora, G.M. (1990). Concepts and Applications of Molecular Similarity. New York: Wiley.
  17. Maggiora, G.M. (2006). J. Chem. Inf. Model. 46: 1535.
  18. Lipinski, C.A., Lombardo, F., Dominy, B.W., and Feeney, P.J. (1997). Adv. Drug Delivery Rev. 23: 3–25.
  19. Carhart, R.E., Smith, D.H., and Venkataraghavan, R. (1985). J. Chem. Inf. Comput. Sci. 25: 64–73.
  20. (a) Rogers, D., Brown, R.D., and Hahn, M. (2005). J. Biomol. Screen. 10: 682–686. (b) Rogers, D. and Hahn, M. (2010). J. Chem. Inf. Model. 50: 742–754.
  21. Riniker, S. and Landrum, G.A. (2013). J. Cheminf. 5.
  22. O'Boyle, N.M. and Sayle, R.A. (2016). J. Cheminf. 8: 14.
  23. Arif, S.M., Holliday, J.D., and Willett, P. (2009). J. Comput. Aided Mol. Des. 23: 655.
  24. (a) Brown, N. (2014). Mol. Inf. 33: 458–462. (b) Hu, Y., Stumpfe, D., and Bajorath, J. (2017). J. Med. Chem. 60: 1238–1246. (c) Schuffenhauer, A. (2012). Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2: 842–867.
  25. Schuffenhauer, A., Floersheim, P., Acklin, P., and Jacoby, E. (2003). J. Chem. Inf. Comput. Sci. 43: 391–405.
  26. Reutlinger, M., Koch, C.P., Reker, D. et al. (2013). Mol. Inf. 32: 133–138.
  27. Birchall, K. and Gillet, V.J. (2011). Reduced graphs and their applications in chemoinformatics. In: Chemoinformatics and Computational Chemical Biology, vol. 672 (ed. J. Bajorath), 197–212.
  28. (a) Finn, P.W. and Morris, G.M. (2013). Wiley Interdiscip. Rev.: Comput. Mol. Sci. 3: 226–241. (b) Nicholls, A., McGaughey, G.B., Sheridan, R.P. et al. (2010). J. Med. Chem. 53: 3862–3886. (c) Shin, W.‐H., Zhu, X., Bures, M.G., and Kihara, D. (2015). Molecules 20: 12841–12862.
  29. Ballester, P.J. (2011). Future Med. Chem. 3: 65–78.
  30. Rush, T.S., Grant, J.A., Mosyak, L., and Nicholls, A. (2005). J. Med. Chem. 48: 1489–1495.
  31. Grant, J.A., Gallardo, M.A., and Pickup, B.T. (1996). J. Comput. Chem. 17: 1653–1666.
  32. (a) Scior, T., Bender, A., Tresadern, G. et al. (2012). J. Chem. Inf. Model. 52: 867–881. (b) Hawkins, P.C.D. (2017). J. Chem. Inf. Model. 57: 1747–1756.
  33. (a) Leach, A.R., Gillet, V.J., Lewis, R.A., and Taylor, R. (2010). J. Med. Chem. 53: 539–558. (b) Guner, O.F. and Bowen, J.P. (2014). J. Chem. Inf. Model. 54: 1269–1283. (c) Caporuscio, F. and Tafi, A. (2011). Curr. Med. Chem. 18: 2543–2553. (d) Langer, T. (2010). Mol. Inf. 29: 470–475.
  34. Vuorinen, A. and Schuster, D. (2015). Methods 71: 113–134.
  35. (a) Braga, R.C. and Andrade, C.H. (2013). Curr. Top. Med. Chem. 13: 1127–1138. (b) Sanders, M.P.A., Barbosa, A.J.M., Zarzycka, B. et al. (2012). J. Chem. Inf. Model. 52: 1607–1620. (c) Giangreco, I., Cosgrove, D.A., and Packer, M.J. (2013). J. Chem. Inf. Model. 53: 852–866.
  36. (a) Lewis, R.A. and Wood, D. (2014). Wiley Interdiscip. Rev.: Comput. Mol. Sci. 4: 505–522. (b) Cherkasov, A., Muratov, E.N., Fourches, D. et al. (2014). J. Med. Chem. 57: 4977–5010.
  37. (a) Lavecchia, A. (2015). Drug Discovery Today 20: 318–331. (b) Mitchell, J.B.O. (2014). Wiley Interdiscip. Rev.: Comput. Mol. Sci. 4: 468–481.
  38. Cramer, R.D., Redl, G., and Berkoff, C.E. (1974). J. Med. Chem. 17: 533–535.
  39. Yao, X.J., Panaye, A., Doucet, J.P. et al. (2004). J. Chem. Inf. Comput. Sci. 44: 1257–1266.
  40. Berman, H.M., Westbrook, J., Feng, Z. et al. (2000). Nucleic Acids Res. 28: 235–242.
  41. (a) Forli, S. (2015). Molecules 20: 18732–18758. (b) Ferreira, L.G., dos Santos, R.N., Oliva, G., and Andricopulo, A.D. (2015). Molecules 20: 13384–13421. (c) Irwin, J.J. and Shoichet, B.K. (2016). J. Med. Chem. 59: 4103–4120.
  42. Ewing, T.J.A., Makino, S., Skillman, A.G., and Kuntz, I.D. (2001). J. Comput. Aided Mol. Des. 15: 411–428.
  43. Jones, G., Willett, P., Glen, R.C. et al. (1997). J. Mol. Biol. 267: 727–748.
  44. Morris, G.M., Huey, R., Lindstrom, W. et al. (2009). J. Comput. Chem. 30: 2785–2791.
  45. Kramer, B., Metz, G., Rarey, M., and Lengauer, T. (1999). Med. Chem. Res. 9: 463–478.
  46. McGovern, S.L. and Shoichet, B.K. (2003). J. Med. Chem. 46: 2895–2907.
  47. Sterling, T. and Irwin, J.J. (2015). J. Chem. Inf. Model. 55: 2324–2337.
  48. Ebejer, J.P., Morris, G.M., and Deane, C.M. (2012). J. Chem. Inf. Model. 52: 1146–1158.
  49. Rose, A.S., Bradley, A.R., Valasatava, Y. et al. (2016). Web‐based molecular graphics for large complexes. In: Proceedings of the 21st International Conference on Web3D Technology (Web3D '16), 185–186. ACM. https://doi.org/10.1145/2945292.2945324.
  50. Hanser, T., Barber, C., Marchaland, J.F., and Werner, S. (2016). SAR QSAR Environ. Res. 27: 893–909.
  51. (a) Lavecchia, A. and Di Giovanni, C. (2013). Curr. Med. Chem. 20: 2839–2860. (b) Danishuddin, M. and Khan, A.U. (2015). Methods 71: 135–145. (c) Kumar, A. and Zhang, K.Y.J. (2015). Methods 71: 26–37.
  52. Temml, V., Voss, C.V., Dirsch, V.M., and Schuster, D. (2014). J. Chem. Inf. Model. 54: 367–371.
  53. Ren, J.‐X., Li, L.‐L., Zheng, R.‐L. et al. (2011). J. Chem. Inf. Model. 51: 1364–1375.
  54. (a) Ekins, S. (2016). Pharm. Res. 33: 2594–2603. (b) Gawehn, E., Hiss, J.A., and Schneider, G. (2016). Mol. Inf. 35: 3–14. (c) Ramsundar, B., Liu, B., Wu, Z. et al. (2017). J. Chem. Inf. Model. 57: 2068–2076. (d) Zhang, L., Tan, J., Han, D., and Zhu, H. (2017). Drug Discovery Today 22: 1680–1685.

Further Reading

  1. Bajorath, J. (2014). Chemoinformatics for Drug Discovery. Wiley.
  2. Brown, N. (2016). In Silico Medicinal Chemistry: Computational Methods to Support Drug Design. The Royal Society of Chemistry.
  3. Engel, T. and Gasteiger, J. (eds.) (2018). Chemoinformatics. Basic Concepts and Methods. Weinheim: Wiley‐VCH.
  4. Leach, A.R. and Gillet, V.J. (2007). An Introduction to Chemoinformatics. Dordrecht: Springer.
  5. Sotriffer, C. (ed.) (2011). Virtual Screening: Principle, Challenges, and Practical Guidelines. Weinheim: Wiley‐VCH.