Linlin Zhao1 and Hao Zhu1,2
1Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA
2Department of Chemistry, Rutgers University, Camden, NJ, USA
In the past decade, along with the vibrant and rapid development of chemical synthesis and biological screening technologies, immense amounts of data have been generated daily, and most of these data are available to the public [1, 2]. Biological data generated from high-throughput screening (HTS) of large chemical libraries contain rich toxicology information that has the potential to be integrated into toxicity research [3]. Currently available toxicity data exist both in structured formats (e.g., deposited into PubChem and other data-sharing web portals) and as unstructured data (papers, laboratory reports, toxicity website information, etc.). These data, even in structured formats, have become so large and complex that it is difficult to process and use them with traditional database management, data processing, and modeling tools.
The term “big data” describes this phenomenon, which can be viewed as the biggest change in modern toxicology. Originally, the focus of big data was the development of advanced data storage and handling techniques, such as cloud-based computing or high-speed heterogeneous computational environments [4]. The concept of big data is gaining increasing recognition in clinical studies and other research areas driven by biological data, including the field of toxicology [1–3, 5]. The daily updated toxicology big data landscape consists of large-scale datasets from various sources, which contain an enormous number of chemical toxicity endpoints. The terms volume (scale of data), velocity (growth of data), and variety (diversity of sources) have been used to characterize big data and fit the current toxicity big data landscape. Besides these characteristics, owing to the nature of experimental protocols, one should also be aware of the veracity (uncertainty of data) of these datasets. Thus, the “four V's” (namely, volume, velocity, variety, and veracity) of big data also represent the relevant challenges of modern toxicology research (Figure 11.1) [6]. The volume and velocity of data-driven research in toxicology have been well recognized by multiple screening and data-sharing projects, which are described in the next section of this chapter. However, the use of all these available data sources is still based on traditional computational approaches (e.g., quantitative structure–activity relationships, QSAR), and the issues of variety and veracity have rarely been considered in previous studies. This chapter also highlights the recent progress of data-driven research in answering the above challenges.
Since the NIH Roadmap for medical research was launched in 2004 [7], several molecular library screening centers have been funded [8] and several HTS projects have been performed to experimentally test large chemical libraries. The recent data generation efforts in the area of toxicology are the Toxicity Forecaster (ToxCast) program initiated by the US Environmental Protection Agency (EPA) [9] and Toxicology in the 21st Century (Tox21), which was launched by the National Toxicology Program (NTP), the National Institutes of Health (NIH) Chemical Genomics Center (NCGC), and the EPA [10–12]. The direct results of these experimental screening efforts, especially HTS, are the toxicity data currently publicly available, as summarized in Table 11.1. These databases constitute the current toxicity big data landscape, which is described in detail in the following.
Table 11.1 Publicly available toxicity data resources (as of 10/22/2016)
Database | Size | Data description | Accessibility |
PubChem [13, 14] | Over 60 million compounds, over 1 million bioassays, over 13 billion data points | Toxicity, genomics, and literature data | https://pubchem.ncbi.nlm.nih.gov/ |
ChEMBL [15, 16] | Over 1 million compounds, over 13 million data points | Drugs, drug-like small molecules, and their bioactivity data | https://www.ebi.ac.uk/chembl/ |
ACToR [17, 18] | Over 800,000 compounds, over 500,000 assays | Both in vitro and in vivo toxicity data | https://actor.epa.gov |
TOXNET [19] | Multiple sub-databases contain over 300,000 compounds and related toxicity data | Toxicity, diseases, and literature data | https://toxnet.nlm.nih.gov/ |
REACH [20] | 816,048 studies for 9800 substances and 3600 study types | Data submitted in EU chemical legislation, machine-readable | http://www.reach.lu/mmp/online/website/menu_hori/homepage/index_EN.html |
SEURAT-1 [21, 22] | Over 5500 cosmetic-type compounds in the current COSMOS database web portal | Animal toxicity data | http://www.seurat-1.eu/ |
ISSTOX [23] | Five sub-databases | (i) Long-term carcinogenicity bioassays in rodents (rat and mouse) (ISSCAN); (ii) In vitro Salmonella typhimurium mutagenesis (Ames test) (ISSSTY); (iii) In vivo mutagenesis (micronucleus test) (ISSMIC); (iv) Cell transformation assays (ISSCTA); (v) Mutagenicity and carcinogenicity of biocides (ISSBIOC) | https://www.dimdi.de/static/en/index.html |
HESS DB [24] | Repeated-dose toxicity data for over 2000 compounds; toxicity mechanism information for about 80 compounds was added recently | Highly curated results of repeated-dose toxicity tests | http://www.nite.go.jp/en/chem/qsar/hess-e.html |
RepDOSE [25] | 364 chemicals investigated in 1018 studies which resulted in 6002 specific effects | Repeat-dose study data for dog, mouse, and rat | http://fraunhofer-repdose.de/ |
GEO [26] | Over 4000 sub-datasets | Microarray, next-generation sequencing and other forms of high-throughput functional genomics data submitted by the research community | https://www.ncbi.nlm.nih.gov/geo/ |
Open TG-GATEs [27] | 170 compounds at multiple dosages and time points | Toxicity data, gene expression data, and metadata | http://toxico.nibiohn.go.jp/english/ |
CEBS [28] | About 10,000 toxicity bioassays from various sources | Gene expression data | http://www.niehs.nih.gov/research/resources/databases/cebs/index.cfm |
CTD [29–31] | Over 13,000 compounds, over 32,000 genes, over 6000 diseases | Compound, gene, and disease relationships | https://ctdbase.org/ |
DrugMatrix [32] | About 600 drug molecules and 10,000 genes | Gene expression data | https://ntp.niehs.nih.gov/drugmatrix/index.html |
Connectivity Map [33] | About 1300 compounds and 7000 genes | Gene expression data | http://portals.broadinstitute.org/cmap/ |
The well-known chemical/biological data-sharing projects were not initially designed for toxicants but contain considerable amounts of toxicity-relevant data. For example, PubChem is a public repository for chemical structures and their biological data, including the toxicity data from the screening centers described above [13, 14]. Figure 11.2 shows the yearly increase in the number of PubChem compounds and bioassays. The number of PubChem compounds increased from 19 million in September 2008 [34] to over 60 million in September 2015 [35]. During the same period, the number of bioassays used to test these compounds increased from 1197 in September 2008 [34] to over 1 million in September 2015, resulting in over five terabytes of data [35]. Another large reservoir of bioassay data is the ChEMBL database, a manually curated chemical database maintained by the European Bioinformatics Institute (EBI) of the European Molecular Biology Laboratory (EMBL) [15]. The EBI's goal is to provide freely available data and bioinformatics services to the scientific community. As part of this goal, the ChEMBL database was constructed for experimental data on both chemical toxicity and absorption, distribution, metabolism, and excretion (ADME) properties. ChEMBL version 22 (ChEMBL_22) was released in 2016; it contains over 1 million compounds and over 14 million data points [16].
Databases designed specifically to share toxicity data have also made considerable progress. The Aggregated Computational Toxicology Online Resource (ACToR) portal, developed by the US EPA's National Center for Computational Toxicology (NCCT), aggregates toxicity data for over 800,000 compounds from thousands of public sources, which include HTS, chemical exposure, sustainable chemistry (chemical structures and physicochemical properties), and virtual tissue data [17, 18]. Similarly, TOXNET, developed by the National Library of Medicine's (NLM) Division of Specialized Information Services (SIS), contains 16 separate sub-databases of various toxicity categories for thousands of diverse chemicals. By grouping these databases together, TOXNET allows users to access all toxicity data for target compounds from one query form [19]. Another prominent example is the European Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) legislation, which aims to collect comprehensive safety information for all substances available on the European market. The current REACH database contains more than 9000 unique substances and about 800,000 safety assessment documents [20].
Some other toxicity resources were designed to provide data for one or several specific toxicity categories. The Safety Evaluation Ultimately Replacing Animal Testing (SEURAT-1) program was launched in 2011 by the European Commission and the European cosmetics trade association, Cosmetics Europe, and is composed of six complementary research projects [21, 22]. Among them, the COSMOS project was dedicated to the development of methods to predict the safety of cosmetic ingredients [36]. The COSMOS database web portal contains over 5500 unique cosmetic-relevant compounds with their respective in vivo toxicity data. Another European institute, the Istituto Superiore di Sanità (ISS), developed the ISSTOX database [23], which includes five sub-databases, each devoted to a different toxicological endpoint. There are also several databases containing multidose toxicity data, such as the Hazard Evaluation Support System Integrated Platform Database (HESS DB) [24] and RepDOSE [25]. HESS DB is a knowledge database containing toxicity data for compounds tested at various doses/concentrations, together with relevant toxicity mechanism information. The RepDOSE database consists of toxicity data for about 400 chemicals investigated in about 1000 multidose studies [25].
Another type of data source is toxicogenomics (TGx) studies, which generate enormous amounts of “omics” data that are meant to predict toxicity or genetic susceptibility induced by chemicals. Many modern in vitro toxicity studies have shown that genomic responses are sensitive to toxicants and that these findings can be translated into biomarkers useful for chemical toxicity assessments [37, 38]. Several publicly available TGx databases have been initiated, such as the Gene Expression Omnibus (GEO) [26], the Japanese Toxicogenomics Project (TGP) database [39], the Chemical Effects in Biological Systems (CEBS) database [28], and the Comparative Toxicogenomics Database (CTD) [29–31]. GEO is a public repository that archives diverse forms of high-throughput functional genomic datasets, and it contains a large quantity of gene expression data that can be used in computational toxicology studies [26]. TGP data are available to the public through the Open TG-GATEs database [27], which includes toxicology data for 170 compounds. The CEBS database [28], developed by the NIEHS, is now the public repository for all NTP conventional toxicology and carcinogenicity data as well as NCGC HTS data, along with the CTD, which aims to promote comparative studies of genes and proteins across species [29–31]. DrugMatrix [32] and the Connectivity Map [33] are similar efforts in toxicogenomics data curation, but with more specific research goals.
The massive amount of chemical toxicity data available to research communities presents a great opportunity for modern toxicology research. However, in the current big data era, traditional computational approaches are not suitable for dealing with big data resources. Most typical big data research in toxicology has taken place over the past five years; along with the fast-growing data, this field has been assisted by supercomputers with powerful computational capabilities.
With all these screening efforts, especially the HTS projects mentioned above, a significant number of “popular” compounds (e.g., well-known toxicants) have been tested multiple times. For example, Table 11.2 shows 20 toxicants obtained from the Integrated Risk Information System (IRIS) database. On the basis of search results on PubChem, these toxicants have been tested in hundreds of PubChem bioassays. As shown in Table 11.2, hexachlorophene (CAS 70-30-4), a disinfectant banned from the market, showed active responses in 366 bioassays. Other toxicants have similarly rich response information in PubChem (Table 11.2). The rich biological data for these toxicants can be viewed as response profiles, which are useful for computational toxicology studies.
Table 11.2 Twenty human toxicants with their relevant PubChem bioassay responses
Chemicals | CAS | No. of active responses | No. of inactive responses |
Hexachlorophene | 70-30-4 | 366 | 721 |
Captan | 133-06-2 | 212 | 198 |
Phenylmercuric acetate | 62-38-4 | 203 | 115 |
Chlordecone (Kepone) | 143-50-0 | 166 | 442 |
Hydroquinone | 123-31-9 | 121 | 450 |
2-Chloroacetophenone | 532-27-4 | 78 | 191 |
Pentachlorophenol | 87-86-5 | 78 | 679 |
Tributyltin oxide (TBTO) | 56-35-9 | 77 | 37 |
Endosulfan | 115-29-7 | 75 | 189 |
Hexachlorocyclopentadiene (HCCPD) | 77-47-4 | 69 | 203 |
Propachlor | 1918-16-7 | 69 | 379 |
Sodium diethyldithiocarbamate | 148-18-5 | 68 | 77 |
p,p′-Dichlorodiphenyl dichloroethane (DDD) | 72-54-8 | 66 | 170 |
Acetaldehyde | 75-07-0 | 63 | 173 |
Phenol | 108-95-2 | 62 | 369 |
Bisphenol A | 80-05-7 | 61 | 739 |
p,p′-Dichlorodiphenyl trichloroethane (DDT) | 50-29-3 | 61 | 279 |
Naphthalene | 91-20-3 | 61 | 697 |
Chlorobenzilate | 510-15-6 | 59 | 62 |
α-Hexachlorocyclohexane (α-HCH) | 319-84-6 | 58 | 619 |
Some pioneering studies used bioassay data to profile toxicants at an early stage. For example, a major goal of the ToxCast project is to use hundreds of bioassays [40–48] to profile compounds whose animal toxicity has been tested, as recorded in the Toxicity Reference Database (ToxRefDB) [49–51]. The profiling studies of ToxCast tried to develop predictive models for toxic compounds using a set of in vitro assays and/or in silico predicted results, for example, ToxPi [52]. The disadvantage of this type of study is that the selection of biological data for profiling is arbitrary and the predictions are limited to in-house data.
In the current big data era, all the public toxicity data (e.g., those shown in Table 11.1) can be used for profiling toxicants. In 2014, we developed an automatic virtual profiling tool to evaluate potential animal toxicants using all PubChem bioassay data [53]. The core of this study is a scoring system to evaluate the relationship between PubChem bioassays and animal toxicity. The top-ranked bioassays were used to profile the compounds of interest and make predictions of potential toxicants. Recently, a similar effort was reported by Helal et al. [54], who used PubChem bioassays to create bioprofiles for more than 300,000 chemicals and showed that these bioprofiles can be used in toxicity model development. Meanwhile, bioassayR, which is distributed as an R package, can be used for the simultaneous analysis of large amounts of biological data, especially those obtained from HTS [55]. Using bioassayR, bioassays with the same targets can be clustered and compound–target information can be generated for further analysis.
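The profile-comparison idea behind such tools can be illustrated with a minimal sketch. The similarity metric and assay IDs below are illustrative assumptions, not the actual scoring system of the cited tools:

```python
def biosimilarity(profile_a, profile_b):
    """Tanimoto-like similarity over bioassays tested in both compounds.

    A profile maps bioassay IDs to 1 (active) or 0 (inactive); untested
    assays are simply absent.  Illustrative metric only.
    """
    shared = set(profile_a) & set(profile_b)          # assays tested in both
    both_active = sum(1 for a in shared
                      if profile_a[a] == 1 and profile_b[a] == 1)
    either_active = sum(1 for a in shared
                        if profile_a[a] == 1 or profile_b[a] == 1)
    return both_active / either_active if either_active else 0.0

# Hypothetical PubChem-style response profiles for two compounds
compound_x = {"AID1": 1, "AID2": 1, "AID3": 0, "AID4": 1}
compound_y = {"AID1": 1, "AID2": 0, "AID3": 0, "AID5": 1}
print(biosimilarity(compound_x, compound_y))  # 1 shared active / 2 active in either -> 0.5
```

Restricting the comparison to commonly tested assays is one simple way to cope with the sparse, unevenly tested nature of public bioassay data.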
The direct benefit of these toxicant profiling efforts is to reveal new chemical toxicity mechanisms by analyzing the relevant biological data. Recently, we developed a virtual Adverse Outcome Pathway (vAOP) of oxidative stress by profiling hepatotoxicants [56]. In this study, four PubChem assays obtained from bioprofiles of hepatotoxicants were used to create a vAOP. If a new compound contains the initial chemical features described in this vAOP and shows active responses in any of these four assays, it will be predicted to cause liver damage in vivo through the induction of oxidative stress. The ToxCast project generated several similar analyses [57–59]. For example, it studied estrogen receptor (ER)-binding potentials by profiling ER binders using diverse in vitro assays [57, 58] and gene expression data [59]. The AOP generated from these assay results can not only predict potential ER binders but also illustrate the relevant toxicity mechanisms.
Traditional read-across approaches in computational toxicology, which are widely used to fill data gaps for new compounds without relevant toxicity data, are usually based on chemical similarity searches [60, 61] or QSAR predictions. The basic hypothesis of this type of study is that “similar compounds have similar bioactivities,” which is not suitable for most in vivo chemical toxicity phenomena with complicated toxicity mechanisms. Using chemical similarity alone to justify read-across is error-prone, especially when chemically similar compounds show dissimilar toxicity. For example, “activity cliffs” (i.e., similar compounds having different toxicity) result in prediction errors in many toxicity modeling studies [62–67]. Thus, in the big data era, the use of biological data in addition to chemical structural information adds extra strength to the read-across process.
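The “similar compounds have similar bioactivities” hypothesis, and the activity-cliff failure mode, can be sketched as a nearest-neighbor read-across over toy fingerprints. All compounds, fingerprints, and labels below are hypothetical:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def read_across(query_fp, training):
    """Assign the query the toxicity label of its most chemically similar neighbor.

    `training` maps compound names to (fingerprint, label) pairs.  An
    activity cliff -- a near neighbor with the opposite label -- would be
    silently mispredicted by this purely chemical scheme.
    """
    name, (fp, label) = max(training.items(),
                            key=lambda kv: tanimoto(query_fp, kv[1][0]))
    return label, name, tanimoto(query_fp, fp)

# Toy fingerprints; real studies would use hashed structural fingerprints
training = {
    "neighbor_toxic":   ({1, 2, 3, 4}, "toxic"),
    "distant_nontoxic": ({1, 5, 6},    "nontoxic"),
}
label, neighbor, sim = read_across({1, 2, 3}, training)
print(label, neighbor, sim)  # toxic neighbor_toxic 0.75
```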
In some early studies, integrating biological data as biological descriptors into the QSAR modeling procedure was found to be beneficial to the resulting toxicity models. For example, we developed enhanced acute toxicity models by integrating external biological data as extra descriptors [68, 69]. The resulting hybrid models, based on the combination of chemical and biological descriptors, showed better performance than traditional QSAR models. Research studies were carried out by Low et al., who concluded that read-across studies should be based on both chemical and biological data [70]. Similar results were reported by Garcia-Serna et al. [71], who stated that combining chemical and biological data could enhance the ability to assess the toxicity of small molecules with higher confidence than using chemical data alone.
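A hybrid chemical–biological descriptor vector of the kind used in such models can be sketched as follows. The descriptor values, assay IDs, and the naive imputation choice are illustrative assumptions, not those of the cited studies:

```python
def hybrid_vector(chem_desc, bio_profile, assay_order, missing=0.5):
    """Concatenate chemical descriptors with a fixed-order bioassay profile.

    chem_desc:   numeric chemical descriptors (here, hypothetical logP and MW)
    bio_profile: dict of assay ID -> 1 (active) / 0 (inactive)
    assay_order: fixed assay ordering shared by all compounds in the set
    Untested assays are imputed with `missing` -- a naive placeholder choice.
    """
    return list(chem_desc) + [bio_profile.get(a, missing) for a in assay_order]

assays = ["AID1", "AID2", "AID3"]
x = hybrid_vector([2.1, 178.2], {"AID1": 1, "AID3": 0}, assays)
print(x)  # [2.1, 178.2, 1, 0.5, 0]
```

Vectors built this way can feed any standard QSAR learner, so the chemical and biological evidence is weighed jointly rather than in separate models.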
Recent read-across studies have used various sources containing massive amounts of toxicity data. Kleinstreuer et al. [72] used the US EPA's ToxCast dataset to perform read-across for a uterotrophic database collected from a large number of historical reports; the purpose of this study was to evaluate the estrogenic endocrine disruption potential of new compounds. Another example was given by Luechtefeld et al. [5, 20, 73–75], who collected all the available historical REACH toxicity data for 9801 compounds. They performed read-across studies of REACH compounds for their acute oral toxicity [73], Draize eye irritation testing results [74], and skin sensitization activity [75]. A review of using big data to perform read-across of chemical toxicity [5] and two other strategy papers [76, 77] were also published. Recently, to fill data gaps, we developed a new read-across portal based on large-scale public chemical and biological data, the Chemical In vitro–In vivo Profiling (CIIPro) portal [78], which can automatically extract biological data from public resources (i.e., PubChem) for compounds of interest. The read-across analysis based on biosimilarity, which is defined on the extracted biological data of target compounds, showed higher predictivity for estrogen receptor binding agents [79].
Chemical toxicity data are rapidly increasing not only in structured data formats (e.g., the databases shown in Table 11.1), but also in various types of text documents such as scientific articles, patents, industry reports, and media reports, which can be classified as unstructured toxicity data. The use of unstructured toxicity data has motivated the development of text mining approaches in computational toxicology. Text mining refers to the automatic process of deriving high-quality information from text. It usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output [80]. In computational toxicology, challenges often arise in the step of structuring the input text. For example, chemicals are usually referenced in documents in different ways: common names, systematic nomenclatures, database identifiers (e.g., CAS numbers), InChI strings, or even structures shown in images. Thus, text mining studies of chemicals of specific interest (e.g., animal toxicants) depend strongly on named entity recognition (NER), which was reviewed by Vazquez et al. [81]. Furthermore, a commonly used term can have different meanings in different contexts. For example, CYP can appear in the name of a gene, protein, mutation, drug, or adverse event [82–85]. As part of biomedical science, the application of text mining in computational toxicology has included pathway extraction and reasoning as well as pharmacogenomics [86]. Text mining approaches contribute to computational toxicology studies by bringing useful knowledge from the literature, either extracted or curated, together with in-house biological datasets to identify relationships between genes, pathways, drugs, environmental contaminants, and diseases [87, 88]. Several studies have applied text mining methods within toxicogenomics studies [89–91].
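As a small illustration of chemical NER, CAS Registry Numbers can be recognized in free text with a pattern and then verified with the standard CAS check-digit rule; the example sentence is invented, but the checksum rule itself is the published one:

```python
import re

# CAS Registry Number: 2-7 digits, 2 digits, then a single check digit
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")

def valid_cas(cas):
    """Verify a CAS number's check digit.

    The check digit equals the sum of the preceding digits, weighted
    1, 2, 3, ... from the rightmost digit leftward, taken modulo 10.
    """
    m = CAS_PATTERN.fullmatch(cas)
    if not m:
        return False
    digits = (m.group(1) + m.group(2))[::-1]            # rightmost digit first
    checksum = sum((i + 1) * int(d) for i, d in enumerate(digits)) % 10
    return checksum == int(m.group(3))

text = "Hexachlorophene (CAS 70-30-4) and bisphenol A (80-05-7) were tested."
candidates = ["-".join(groups) for groups in CAS_PATTERN.findall(text)]
print([c for c in candidates if valid_cas(c)])  # ['70-30-4', '80-05-7']
```

The checksum step matters in practice: pattern matching alone also picks up dates and other dash-separated numbers, most of which fail the check-digit test.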
The rise of big data heralds a profound change in the way toxicologists perform their research. The big data era brings not only big progress but also big challenges [1, 92, 93]. Although there are some preliminary studies, as described above, which successfully apply big data sources in computational toxicology, the urgent need for new approaches in this area is described in the following.
Experimental error is inevitable in public data sources. It is understandable that the quality of data may vary depending on the nature of the experimental protocols. Currently, the usefulness of public data sources is questionable owing to a lack of necessary quality control [94]. A general concern has been raised regarding irreproducible experimental data [95–97], which is relatively common in complicated biological testing (e.g., animal models). There is also a golden rule in computational modeling studies, the “trash in, trash out” principle [5]. For this reason, the veracity of big data, represented by the potential data quality issues of public data resources, is a critical factor that affects all relevant studies. Many studies [98–100] have tried to address incorrect chemical structure information. However, studies that automatically correct biological data errors are rare [101].
Although the current data growth (i.e., the velocity of big data) is exceptional and many data are available for well-known toxicants, missing data (i.e., the lack of necessary toxicity data for target compounds) are still a common issue. As described above, read-across studies can be used to fill the data gap in some cases. However, good read-across practice is possible only when an “unknown” compound can obtain reliable predictions from its nearest neighbors [5]. For the “outliers” that are excluded because they fall outside the applicability domain (AD) of available models [102, 103], extra experimental testing is still necessary. For this reason, a well-defined and applicable AD is critical for any chemical risk assessment study. Currently, the AD is normally defined by the chemical similarity between the test set and modeling set compounds. To make the AD more applicable in big data studies, new methods need to be developed, such as the biosimilarity confidence that we recently reported [78].
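A chemical-similarity AD of this conventional kind can be sketched as flagging query compounds whose nearest modeling-set neighbor falls below a similarity cutoff; the fingerprints and the 0.6 cutoff here are arbitrary illustrations, not recommended values:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between fingerprints given as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def in_domain(query_fp, modeling_fps, cutoff=0.6):
    """True if the query is close enough to at least one modeling-set compound.

    Out-of-domain queries should be routed to experimental testing rather
    than predicted.  The 0.6 cutoff is arbitrary, chosen for illustration.
    """
    nearest = max((tanimoto(query_fp, fp) for fp in modeling_fps), default=0.0)
    return nearest >= cutoff

modeling_set = [{1, 2, 3, 4}, {5, 6, 7}]
print(in_domain({1, 2, 3}, modeling_set))  # nearest neighbor at 0.75 -> True
print(in_domain({8, 9}, modeling_set))     # no near neighbor -> False
```

A biosimilarity-based confidence would replace the chemical fingerprints with bioassay response profiles while keeping the same nearest-neighbor logic.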
Toxicology research becomes more complicated when various types of data (i.e., the variety of big data) are used in a single study. This is the ultimate challenge of computational toxicology, and new computational approaches are needed to realize this goal. In the sections above, we described hybrid models and new computational approaches that use various types of toxicity data (e.g., vAOP studies). However, this type of work is far from final success. The current bioinformatics and cheminformatics modeling approaches and data analysis methods developed in the past decade are not suited to the requirements of big data analysis.
Big data research will be one of the major efforts of modern toxicology in the future. With all these challenges, there is an urgent need for novel techniques in data mining/generation, curation, and analysis to fulfill the requirements of big data research in computational toxicology. The recent progress in computational toxicology described in this book can be viewed as leading in this direction. The success of data-driven studies will assist toxicologists by highlighting the value of publicly available toxicity data and providing guidance for future experimental testing.