Chapter 11
Big Data in Computational Toxicology: Challenges and Opportunities

Linlin Zhao1 and Hao Zhu1,2

1Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA

2Department of Chemistry, Rutgers University, Camden, NJ, USA

11.1 Big Data Scenario of Computational Toxicology

In the past decade, along with the rapid development of chemical synthesis and biological screening technologies, immense amounts of data have been generated daily, and most of these data are available to the public [1, 2]. Biological data generated from high-throughput screening (HTS) of large chemical libraries contain rich toxicology information that has the potential to be integrated into toxicity research [3]. Currently available toxicity data exist both in structured formats (e.g., deposited into PubChem and other data-sharing web portals) and as unstructured data (papers, laboratory reports, toxicity website information, etc.). These data, even in structured formats, have become so large and complex that it is difficult to process and use them with traditional database management, data processing, and modeling tools.

Figure 11.1 The “four V's” of big data can be used to describe the properties of these fast growing chemical toxicity data.

The term “big data” describes this phenomenon, which can be viewed as the biggest change in modern toxicology. Originally, the focus of big data was the development of advanced data storage and handling techniques, such as cloud-based computing or high-speed heterogeneous computational environments [4]. The concept of big data is gaining increasing recognition in clinical studies and other research areas driven by biological data, including the field of toxicology [1–3, 5]. The daily updated toxicology big data landscape consists of large-scale datasets from various sources, which contain an enormous number of chemical toxicity endpoints. The terms volume (scale of data), velocity (growth of data), and variety (diversity of sources) have been used to characterize big data, and they fit the current toxicity big data landscape. Besides these characteristics, owing to the nature of experimental protocols, one should also be aware of the veracity (uncertainty of data) of these datasets. Thus, the “four V's” (namely, volume, velocity, variety, and veracity) of big data also represent the relevant challenges of modern toxicology research (Figure 11.1) [6]. The volume and velocity of data-driven toxicology research have been well recognized by multiple screening and data-sharing projects, which are described in the next section of this chapter. However, the use of all these available data sources is still based on traditional computational approaches (e.g., quantitative structure–activity relationships, QSAR), and the issues of variety and veracity were rarely considered in most previous studies. This chapter also highlights recent progress in data-driven research to address these challenges.

11.2 Fast-Growing Chemical Toxicity Data

Since the NIH Roadmap for Medical Research was launched in 2004 [7], several molecular library screening centers have been funded [8] and several HTS projects have been performed to experimentally test large chemical libraries. Recent data generation efforts in the area of toxicology include the Toxicity Forecaster (ToxCast) program initiated by the US Environmental Protection Agency (EPA) [9] and Toxicity Testing in the Twenty-First Century (Tox21), which was launched by the National Toxicology Program (NTP), the National Institutes of Health (NIH) Chemical Genomics Center (NCGC), and the EPA [10–12]. The direct results of these experimental screening efforts, especially HTS, are the toxicity data that are currently publicly available, as summarized in Table 11.1. These databases constitute the current toxicity big data landscape, which is described in detail in the following.

Table 11.1 Publicly available toxicity data resources (as of 10/22/2016)

Database | Size | Data description | Accessibility
PubChem [13, 14] | Over 60 million compounds, over 1 million bioassays, over 13 billion data points | Toxicity, genomics, and literature data | https://pubchem.ncbi.nlm.nih.gov/
ChEMBL [15, 16] | Over 1 million compounds, over 13 million data points | Drugs, drug-like small molecules, and their bioactivity data | https://www.ebi.ac.uk/chembl/
ACToR [17, 18] | Over 800,000 compounds, over 500,000 assays | Both in vitro and in vivo toxicity data | https://actor.epa.gov
TOXNET [19] | Multiple sub-databases containing over 300,000 compounds and related toxicity data | Toxicity, diseases, and literature data | https://toxnet.nlm.nih.gov/
REACH [20] | 816,048 studies for 9800 substances and 3600 study types | Data submitted in EU chemical legislation, machine-readable | http://www.reach.lu/mmp/online/website/menu_hori/homepage/index_EN.html
SEURAT-1 [21, 22] | Over 5500 cosmetic-type compounds in the current COSMOS database web portal | Animal toxicity data | http://www.seurat-1.eu/
ISSTOX [23] | Five sub-databases | (i) Long-term carcinogenicity bioassay on rodents (rat and mouse) (ISSCAN); (ii) In vitro Salmonella typhimurium mutagenesis (Ames test) (ISSSTY); (iii) In vivo mutagenesis (micronucleus test) (ISSMIC); (iv) Cell transformation assays (ISSCTA); (v) Mutagenicity and carcinogenicity of biocides (ISSBIOC) | https://www.dimdi.de/static/en/index.html
HESS DB [24] | Repeated-dose toxicity studies of over 2000 compounds; toxicity mechanism information for about 80 compounds was added recently | Highly curated results of repeat-dose toxicity tests | http://www.nite.go.jp/en/chem/qsar/hess-e.html
RepDOSE [25] | 364 chemicals investigated in 1018 studies, which resulted in 6002 specific effects | Repeat-dose study data for dog, mouse, and rat | http://fraunhofer-repdose.de/
GEO [26] | Over 4000 sub-datasets | Microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community | https://www.ncbi.nlm.nih.gov/geo/
Open TG-GATEs [27] | 170 compounds at multiple dosages and time points | Toxicity data, gene expression data, and metadata | http://toxico.nibiohn.go.jp/english/
CEBS [28] | About 10,000 toxicity bioassays from various sources | Gene expression data | http://www.niehs.nih.gov/research/resources/databases/cebs/index.cfm
CTD [29–31] | Over 13,000 compounds, over 32,000 genes, over 6000 diseases | Compound, gene, and disease relationships | https://ctdbase.org/
DrugMatrix [32] | About 600 drug molecules and 10,000 genes | Gene expression data | https://ntp.niehs.nih.gov/drugmatrix/index.html
Connectivity Map [33] | About 1300 compounds and 7000 genes | Gene expression data | http://portals.broadinstitute.org/cmap/

The well-known chemical/biological data-sharing projects were not initially designed for toxicants but contain considerable amounts of toxicity-relevant data. For example, PubChem is a public repository for chemical structures and their biological data, including the toxicity data from the screening centers described above [13, 14]. Figure 11.2 shows the yearly increase in the number of PubChem compounds and bioassays. The number of PubChem compounds increased from 19 million in September 2008 [34] to over 60 million in September 2015 [35]. During the same period, the number of bioassays used to test these compounds increased from 1197 [34] to over 1 million, resulting in over five terabytes of data [35]. Another large reservoir of bioassay data is the ChEMBL database, a manually curated chemical database maintained by the European Bioinformatics Institute (EBI) of the European Molecular Biology Laboratory (EMBL) [15]. The EBI's goal is to provide freely available data and bioinformatics services to the scientific community. As part of this goal, the ChEMBL database was constructed for experimental data on both chemical toxicity and absorption, distribution, metabolism, and excretion (ADME) properties. ChEMBL version 22 (ChEMBL_22), released in 2016, contains over 1 million compounds and over 14 million data points [16].
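The velocity implied by these PubChem figures can be made concrete with a little arithmetic. The sketch below is only an illustration: it treats the quoted values as exact (the text gives "over 60 million" and "over 1 million", so the results are lower bounds) and assumes steady year-on-year growth across the seven yearly intervals between September 2008 and September 2015.

```python
# Figures quoted in the text: PubChem counts in September 2008 and September 2015.
compounds_2008, compounds_2015 = 19_000_000, 60_000_000
bioassays_2008, bioassays_2015 = 1_197, 1_000_000
years = 2015 - 2008  # seven yearly intervals between the two snapshots


def fold_increase(start, end):
    """Return how many times larger `end` is than `start`."""
    return end / start


def annual_growth_rate(start, end, n_years):
    """Compound annual growth rate, assuming steady year-on-year growth."""
    return (end / start) ** (1.0 / n_years) - 1.0


compound_fold = fold_increase(compounds_2008, compounds_2015)
bioassay_fold = fold_increase(bioassays_2008, bioassays_2015)
compound_rate = annual_growth_rate(compounds_2008, compounds_2015, years)
bioassay_rate = annual_growth_rate(bioassays_2008, bioassays_2015, years)

print(f"compounds: {compound_fold:.1f}-fold, {compound_rate:.0%} per year")
print(f"bioassays: {bioassay_fold:.0f}-fold, {bioassay_rate:.0%} per year")
```

Under these assumptions the bioassay count grows far faster than the compound count, which is consistent with the emphasis placed on HTS data in this section.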

Figure 11.2 Increase in the number of compounds and bioassays recorded in PubChem within eight years (from September 2008 to September 2015).

Databases designed specifically to share toxicity data have also made considerable progress. The Aggregated Computational Toxicology Online Resource (ACToR) portal, which was developed by the US EPA's National Center for Computational Toxicology (NCCT) program, aggregates toxicity data for over 800,000 compounds from thousands of public sources, which include HTS, chemical exposure, sustainable chemistry (chemical structures and physicochemical properties), and virtual tissue data [17, 18]. Similarly, TOXNET, which was developed by the National Library of Medicine's (NLM) Division of Specialized Information Services (SIS), contains 16 separate sub-databases of various toxicity categories for thousands of diverse chemicals. By grouping these databases together, TOXNET allows users to access all toxicity data for target compounds from one query form [19]. Another well-known example is the European Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) legislation, which aims to collect comprehensive safety information for all substances available on the European market. The current REACH database contains more than 9000 unique substances and about 800,000 safety assessment documents [20].

Some other toxicity resources were designed to provide data for one or several specific toxicity categories. The Safety Evaluation Ultimately Replacing Animal Testing (SEURAT-1) initiative was launched in 2011 by the European Commission and the European cosmetics trade association, Cosmetics Europe, and it is composed of six complementary research projects [21, 22]. Among them, the COSMOS project was dedicated to the development of methods to predict the safety of cosmetic ingredients [36]. The COSMOS database web portal contains over 5500 unique cosmetic-relevant compounds with their respective in vivo toxicity data. Another European institute, the Istituto Superiore di Sanità (ISS), developed the ISSTOX database [23], which includes five sub-databases, each related to a different toxicological endpoint. There are several databases containing multidose toxicity data, such as the Hazard Evaluation Support System Integrated Platform Database (HESS DB) [24] and RepDOSE [25]. HESS DB is a knowledge database containing toxicity data for compounds tested at various doses/concentrations, together with relevant toxicity mechanism information. The RepDOSE database consists of toxicity data for 364 chemicals investigated in 1018 multidose studies, as listed in Table 11.1 [25].

Another type of data source is toxicogenomics (TGx) studies, which generate enormous amounts of “omics” data that are meant to predict toxicity or genetic susceptibility induced by chemicals. Many modern in vitro toxicity studies now show that genomic information is sensitive to toxicants and that these findings can be translated into biomarkers that are useful for chemical toxicity assessments [37, 38]. Several publicly available TGx databases have been initiated, such as the Gene Expression Omnibus (GEO) [26], the Japanese Toxicogenomics Project (TGP) database [39], the Chemical Effects in Biological Systems (CEBS) database [28], and the Comparative Toxicogenomics Database (CTD) [29–31]. GEO is a public repository that archives diverse forms of high-throughput functional genomic datasets, and it contains a large quantity of gene expression data that can be used in computational toxicology studies [26]. TGP data are available to the public through the Open TG-GATEs database [27], which includes toxicology data for 170 compounds. The CEBS database [28], developed by the NIEHS, is now the public repository for all NTP conventional toxicology and carcinogenicity data as well as NCGC HTS data, along with the CTD, which aims to promote comparative studies of genes and proteins across species [29–31]. The DrugMatrix [32] and the Connectivity Map [33] are similar efforts in toxicogenomics data curation, but with more specific research goals.

11.3 The Use of Big Data Approaches in Modern Computational Toxicology

The massive chemical toxicity data available to research communities present a great opportunity for modern toxicology research. However, in the current big data era, traditional computational approaches are not suitable for dealing with big data resources. This section describes typical big data research in toxicology that has taken place over the past five years; along with the fast-growing data, this field has been assisted by “super” computers with powerful computational ability.

11.3.1 Profiling the Toxicants with Massive Biological Data

With all these screening efforts, especially the HTS projects, as mentioned above, a significant number of “popular” compounds (e.g., well-known toxicants) have been tested multiple times. For example, Table 11.2 shows 20 toxicants obtained from the Integrated Risk Information System (IRIS) database. On the basis of the search result on PubChem, these toxicants were reported to be tested in hundreds of PubChem bioassays. As shown in Table 11.2, hexachlorophene (CAS 70-30-4), a disinfectant banned from the market, showed active responses in 366 bioassays. Other toxicants have similarly rich response information in PubChem (Table 11.2). The rich biological data for these toxicants can be viewed as response profiles which are useful for computational toxicology studies.

Table 11.2 Twenty human toxicants with their relevant PubChem bioassay responses

Chemicals | CAS | No. of active responses | No. of inactive responses
Hexachlorophene | 70-30-4 | 366 | 721
Captan | 133-06-2 | 212 | 198
Phenylmercuric acetate | 62-38-4 | 203 | 115
Chlordecone (Kepone) | 143-50-0 | 166 | 442
Hydroquinone | 123-31-9 | 121 | 450
2-Chloroacetophenone | 532-27-4 | 78 | 191
Pentachlorophenol | 87-86-5 | 78 | 679
Tributyltin oxide (TBTO) | 56-35-9 | 77 | 37
Endosulfan | 115-29-7 | 75 | 189
Hexachlorocyclopentadiene (HCCPD) | 77-47-4 | 69 | 203
Propachlor | 1918-16-7 | 69 | 379
Sodium diethyldithiocarbamate | 148-18-5 | 68 | 77
p,p′-Dichlorodiphenyl dichloroethane (DDD) | 72-54-8 | 66 | 170
Acetaldehyde | 75-07-0 | 63 | 173
Phenol | 108-95-2 | 62 | 369
Bisphenol A | 80-05-7 | 61 | 739
p,p′-Dichlorodiphenyl trichloroethane (DDT) | 50-29-3 | 61 | 279
Naphthalene | 91-20-3 | 61 | 697
Chlorobenzilate | 510-15-6 | 59 | 62
α-Hexachlorocyclohexane (α-HCH) | 319-84-6 | 58 | 619
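As a small illustration of how the counts in Table 11.2 can be read as response profiles, the following sketch computes the total number of tested bioassays and the fraction of active responses for a few rows copied from the table. The helper name `active_fraction` and the choice of rows are illustrative assumptions, not part of the original analysis.

```python
# (chemical, active responses, inactive responses) copied from Table 11.2
rows = [
    ("Hexachlorophene", 366, 721),
    ("Captan", 212, 198),
    ("Tributyltin oxide (TBTO)", 77, 37),
    ("Bisphenol A", 61, 739),
    ("Naphthalene", 61, 697),
]


def active_fraction(active, inactive):
    """Fraction of tested bioassays in which the compound was active."""
    return active / (active + inactive)


# Build (name, total tested, active fraction) and sort by active fraction.
summary = sorted(
    ((name, act + inact, active_fraction(act, inact)) for name, act, inact in rows),
    key=lambda item: item[2],
    reverse=True,
)

for name, total, fraction in summary:
    print(f"{name}: tested in {total} bioassays, {fraction:.1%} active")
```

Sorting by active fraction rather than by the raw number of active responses separates compounds that are frequently tested from compounds that are frequently active, which are different properties of a response profile.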

Some pioneering studies used bioassay data to profile toxicants at an early stage. For example, the major goal of the ToxCast project is to use hundreds of bioassays [40–48] to profile compounds that have been tested for their animal toxicity, as recorded in the Toxicity Reference Database (ToxRefDB) [49–51]. The ToxCast profiling studies tried to develop predictive models for toxic compounds using a set of in vitro assays and/or in silico predicted results, for example, ToxPi [52]. The disadvantage of this type of study is that the selection of biological data for profiling and prediction is arbitrary and limited to in-house data.

In the current big data era, all the public toxicity data (e.g., shown in Table 11.1) can be used for profiling toxicants. In 2014, we developed an automatic virtual profiling tool to evaluate potential animal toxicants using all PubChem bioassay data [53]. The core of this study is a scoring system to evaluate the relationship between PubChem bioassays and animal toxicity. The top-ranked bioassays were used to profile the compounds of interest and make predictions of potential toxicants. Recently, a similar effort was reported by Helal et al. [54]. They used PubChem bioassays to create bioprofiles for more than 300,000 chemicals and showed that these bioprofiles can be used in toxicity model development. Meanwhile, bioassayR, which is distributed as an R® package, can be used for the simultaneous analysis of large amounts of biological data, especially those obtained from HTS [55]. Using bioassayR, bioassays with the same targets can be clustered, and compound–target information can also be generated for further analysis.
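The idea of ranking bioassays by their relationship to animal toxicity can be sketched in a few lines. The example below is not the scoring system of [53]; it is a minimal stand-in that assumes a simple agreement score (the fraction of tested compounds whose assay outcome matches the in vivo label) and uses invented compound names, assay names, and outcomes.

```python
# Hypothetical toy data: 1 = active, 0 = inactive; untested pairs are absent.
bioprofiles = {
    "cmpd_A": {"assay_1": 1, "assay_2": 0, "assay_3": 1},
    "cmpd_B": {"assay_1": 1, "assay_2": 1},
    "cmpd_C": {"assay_1": 0, "assay_2": 1, "assay_3": 0},
    "cmpd_D": {"assay_1": 0, "assay_3": 0},
}
# Hypothetical in vivo labels: 1 = animal toxicant, 0 = non-toxicant.
animal_toxic = {"cmpd_A": 1, "cmpd_B": 1, "cmpd_C": 0, "cmpd_D": 0}


def assay_agreement(assay, profiles, labels):
    """Return (agreement fraction, number of tested compounds) for one assay."""
    tested = [(p[assay], labels[c]) for c, p in profiles.items() if assay in p]
    if not tested:
        return 0.0, 0
    matches = sum(1 for outcome, label in tested if outcome == label)
    return matches / len(tested), len(tested)


def rank_assays(profiles, labels, min_tested=2):
    """Rank assays by agreement, then by coverage; skip sparsely tested assays."""
    assays = sorted({a for p in profiles.values() for a in p})
    scored = []
    for assay in assays:
        score, n_tested = assay_agreement(assay, profiles, labels)
        if n_tested >= min_tested:
            scored.append((assay, score, n_tested))
    return sorted(scored, key=lambda item: (-item[1], -item[2], item[0]))


ranking = rank_assays(bioprofiles, animal_toxic)
for assay, score, n_tested in ranking:
    print(f"{assay}: agreement {score:.2f} over {n_tested} compounds")
```

The `min_tested` cutoff is an assumed safeguard: with sparse public data, an assay tested on very few labeled compounds can reach a perfect score by chance.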

The direct benefit of these toxicant profiling efforts is to provide new chemical toxicity mechanisms by analyzing the relevant biological data. Recently, we developed the virtual Adverse Outcome Pathway (vAOP) of oxidative stress by profiling hepatotoxicants [56]. In this study, four PubChem assays obtained from bioprofiles of hepatotoxicants were used to create a vAOP. If a new compound contains the initial chemical features described in this vAOP and shows active responses in any of these four assays, it will be predicted to cause liver damage in vivo through inducing oxidative stress. The ToxCast project generated several similar analyses [57–59]. For example, they studied estrogen receptor (ER)-binding potentials by profiling ER binders by diverse in vitro assays [57, 58] and gene expression data [59]. The AOP generated by these assay results can not only predict potential ER binders but also illustrate the relevant toxicity mechanisms.
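The decision rule described for the vAOP (initiating chemical features plus an active response in any of the four assays) can be written as a short predicate. This is a sketch of the rule as stated in the text only: the assay identifiers are placeholders, and how the chemical features are detected is outside its scope and represented by a boolean flag.

```python
# Placeholder identifiers standing in for the four PubChem assays of the vAOP.
VAOP_ASSAYS = ("vaop_assay_1", "vaop_assay_2", "vaop_assay_3", "vaop_assay_4")


def predict_hepatotoxicant(has_initiating_features, assay_outcomes,
                           vaop_assays=VAOP_ASSAYS):
    """Apply the two-step rule: chemical features AND at least one active assay.

    `assay_outcomes` maps assay identifiers to 1 (active) or 0 (inactive);
    untested assays are simply missing and never count as active.
    """
    if not has_initiating_features:
        return False
    return any(assay_outcomes.get(assay) == 1 for assay in vaop_assays)


# A compound with the features and one active vAOP assay is flagged.
flagged = predict_hepatotoxicant(True, {"vaop_assay_2": 1, "vaop_assay_3": 0})
# Without the chemical features, assay activity alone is not sufficient.
not_flagged = predict_hepatotoxicant(False, {"vaop_assay_2": 1})
print(flagged, not_flagged)
```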

11.3.2 Read-Across Study to Fill Data Gap

Traditional read-across approaches in computational toxicology, which have been widely used to fill data gaps for new compounds without relevant toxicity data, are usually based on chemical similarity searches [60, 61] or QSAR predictions. The basic hypothesis of this type of study is that “similar compounds have similar bioactivities,” which is not suitable for most in vivo chemical toxicity phenomena with complicated toxicity mechanisms. Using chemical similarity alone to justify read-across is error-prone, especially when chemically similar compounds show dissimilar toxicity. For example, “activity cliffs” (i.e., similar compounds having different toxicity) result in prediction errors in many toxicity modeling studies [62–67]. Thus, in the big data era, the use of biological data in addition to chemical structural information adds extra strength to the read-across process.
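A minimal sketch of similarity-based read-across, and of how an activity cliff undermines it, is given below. It assumes structural fingerprints represented as sets of "on" bit indices and the Tanimoto coefficient as the similarity measure; the fingerprints, labels, and the 0.6 cutoff are invented for illustration.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


# Hypothetical fingerprints and toxicity labels (1 = toxic, 0 = non-toxic).
fingerprints = {
    "cmpd_1": {1, 2, 3, 4, 5},
    "cmpd_2": {1, 2, 3, 4, 6},
    "cmpd_3": {7, 8, 9},
}
toxic = {"cmpd_1": 1, "cmpd_2": 0, "cmpd_3": 0}


def read_across(query_fp, reference_fps, labels):
    """Assign the label of the most similar reference compound to the query."""
    nearest = max(reference_fps, key=lambda name: tanimoto(query_fp, reference_fps[name]))
    return nearest, tanimoto(query_fp, reference_fps[nearest]), labels[nearest]


def activity_cliffs(reference_fps, labels, min_similarity=0.6):
    """List pairs that are structurally similar but carry different labels."""
    return [
        (a, b)
        for a, b in combinations(sorted(reference_fps), 2)
        if labels[a] != labels[b]
        and tanimoto(reference_fps[a], reference_fps[b]) >= min_similarity
    ]


neighbor, similarity, predicted = read_across({1, 2, 3, 4, 5, 10}, fingerprints, toxic)
cliffs = activity_cliffs(fingerprints, toxic)
print(neighbor, round(similarity, 2), predicted, cliffs)
```

In this toy set the query is predicted toxic from `cmpd_1`, yet `cmpd_1` and `cmpd_2` form an activity cliff, so a slightly different query could receive the opposite label from an almost equally similar neighbor.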

In some early studies, integrating biological data as biological descriptors into the QSAR modeling procedure was found to be beneficial to the resultant toxicity models. For example, we developed enhanced acute toxicity models by integrating external biological data as extra descriptors [68, 69]. The resulting hybrid models, based on the combination of chemical and biological descriptors, showed better performance than traditional QSAR models. Similar studies were carried out by Low et al., who concluded that read-across should be based on both chemical and biological data [70]. Similar results were reported by Garcia-Serna et al. [71], who stated that combining chemical and biological data could enhance the ability to assess the toxicity of small molecules with higher confidence than using chemical data alone.

Recent read-across studies have used various sources with massive amounts of toxicity data. Kleinstreuer et al. [72] used the US EPA's ToxCast dataset to perform read-across for a uterotrophic database collected from a large number of historical reports. The purpose of this study was to evaluate estrogenic endocrine disruption of new compounds. Another example was given by Luechtefeld et al. [5, 20, 73–75], who collected all the available historic REACH toxicity data for 9801 compounds. They performed a read-across study of REACH compounds for their acute oral toxicity [73], Draize eye irritation testing results [74], and skin sensitization activity [75]. A review of using big data to perform read-across of chemical toxicity [5] and two other strategy papers [76, 77] were also published. Recently, to fill data gaps, we developed a new read-across portal using public large-scale chemical and biological data, called the Chemical In vitro–In vivo Profiling (CIIPro) portal [78], which can automatically extract biological data from public resources (i.e., PubChem) for compounds of interest. The read-across analysis based on biosimilarity, which was defined on the extracted biological data of target compounds, showed higher predictivity for estrogen receptor binding agents [79].
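Read-across on biological rather than chemical similarity can be sketched as follows. The biosimilarity used here is an assumed, simplified definition (the fraction of shared assays with the same outcome, together with the number of shared assays) and is not the formula implemented in the CIIPro portal; all profiles and labels are invented.

```python
def biosimilarity(profile_a, profile_b):
    """Return (agreement fraction, number of shared assays) for two bioprofiles."""
    shared = set(profile_a) & set(profile_b)
    if not shared:
        return 0.0, 0
    agree = sum(1 for assay in shared if profile_a[assay] == profile_b[assay])
    return agree / len(shared), len(shared)


def bio_read_across(query, reference_profiles, labels, min_shared=2):
    """Predict from the most biosimilar reference with enough shared assays."""
    candidates = []
    for name, profile in reference_profiles.items():
        similarity, n_shared = biosimilarity(query, profile)
        if n_shared >= min_shared:
            candidates.append((similarity, n_shared, name))
    if not candidates:
        return None  # no reference compound shares enough assays with the query
    similarity, n_shared, name = max(candidates)
    return name, similarity, labels[name]


# Hypothetical bioprofiles: 1 = active, 0 = inactive; untested assays are absent.
references = {
    "ref_1": {"a1": 1, "a2": 1, "a3": 0},
    "ref_2": {"a1": 0, "a2": 0, "a3": 0},
    "ref_3": {"a4": 1},
}
labels = {"ref_1": 1, "ref_2": 0, "ref_3": 1}

result = bio_read_across({"a1": 1, "a2": 1, "a3": 1, "a4": 1}, references, labels)
print(result)
```

Returning the number of shared assays alongside the similarity matters with sparse public data, because a perfect agreement over a single shared assay carries little evidence.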

11.3.3 Unstructured Data Curation

Chemical toxicity data are rapidly increasing not only in structured data formats (e.g., the databases shown in Table 11.1), but also in various types of text documents such as scientific articles, patents, industry reports, and media reports, which can be classified as unstructured toxicity data. The use of unstructured toxicity data has motivated the development of text mining approaches in computational toxicity. Text mining refers to the automatic process of deriving high-quality information from text. It usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output [80]. In computational toxicity, the main challenges come from the step of structuring the input text. For example, in documents, chemicals are referenced in different ways: common names, systematic nomenclature, database identifiers (e.g., CAS numbers), InChI strings, or even structures shown in images. Thus, text mining studies of chemicals of specific interest (e.g., animal toxicants) depend strongly on named entity recognition (NER), which was reviewed by Vazquez et al. [81]. Furthermore, a commonly used term can have different meanings in different contexts. For example, CYP can be used in the names of a gene, protein, mutation, drug, or adverse event [82–85]. As a part of biomedical science, text mining in computational toxicity has been applied to pathway extraction and reasoning and to pharmacogenomics [86]. Text mining approaches contribute to computational toxicity studies by bringing useful knowledge from the literature, either extracted or curated, together with in-house biological datasets to identify relationships between genes, pathways, drugs, environmental contaminants, and diseases [87, 88]. There have been several studies applying text mining methods within toxicogenomics studies [89–91].
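The step of structuring input text can be illustrated with the simplest kind of chemical entity recognition: finding CAS numbers and mapping known names to a single identifier. The sketch below uses only the Python standard library; the synonym dictionary is a hypothetical three-entry example (with CAS numbers taken from Table 11.2), whereas real NER systems rely on far larger lexicons and statistical models. The check-digit rule is the standard one for CAS Registry Numbers.

```python
import re

# A CAS number has 2-7 digits, 2 digits, and 1 check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")

# Hypothetical synonym dictionary; the CAS numbers are those listed in Table 11.2.
SYNONYMS = {
    "hexachlorophene": "70-30-4",
    "bisphenol a": "80-05-7",
    "ddt": "50-29-3",
}


def is_valid_cas(cas):
    """Check the CAS check digit: position-weighted digit sum (from the right) mod 10."""
    match = CAS_PATTERN.fullmatch(cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    weighted = sum(i * int(d) for i, d in enumerate(reversed(digits), start=1))
    return weighted % 10 == int(match.group(3))


def find_cas_numbers(text):
    """Return CAS-like strings in the text that pass the check-digit test."""
    return [m.group(0) for m in CAS_PATTERN.finditer(text) if is_valid_cas(m.group(0))]


def normalize_mentions(text):
    """Map CAS numbers and known names found in the text to a set of CAS numbers."""
    found = set(find_cas_numbers(text))
    for synonym, cas in SYNONYMS.items():
        if re.search(r"\b" + re.escape(synonym) + r"\b", text, flags=re.IGNORECASE):
            found.add(cas)
    return found


sample = ("DDT and bisphenol A were tested; hexachlorophene (CAS 70-30-4) "
          "was active, unlike 12-34-5.")
mentions = normalize_mentions(sample)
print(sorted(mentions))
```

Validating the check digit filters out strings that merely look like CAS numbers, such as date ranges or sample codes, before any lookup is attempted.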

11.4 Challenges of Big Data Research in Computational Toxicology and Relevant Forecasts

The rise of big data heralds a profound change in the way that toxicologists perform their research. The big data era brings not only big progress but also big challenges [1, 92, 93]. Although some preliminary studies, as described above, have successfully applied big data sources in computational toxicology, new approaches are urgently needed in this area, as described in the following.

Experimental error is inevitable in public data sources. It is understandable that the quality of data may vary depending on the nature of the experimental protocols. Currently, the usefulness of public data sources is questionable owing to a lack of necessary quality control [94]. A general worry has been raised regarding irreproducible experimental data [95–97], which are relatively common in complicated biological testing (e.g., animal models). There is also a golden rule in computational modeling studies, which is the “trash in, trash out” principle [5]. For this reason, the veracity of big data, represented by the potential data quality of public data resources, is a critical issue that affects all relevant studies. Many studies [98–100] have tried to address incorrect chemical structure information. However, studies that automatically correct biological data errors are rare [101].

Although the current data growth (i.e., the velocity of big data) is exceptional and many data are available for well-known toxicants, missing data (i.e., a lack of necessary toxicity data for target compounds) is still a common issue. As described above, read-across studies can be used to fill the data gap in some cases. However, good read-across practice is possible only when an “unknown” compound has reliable predictions from its nearest neighbors [5]. For the “outliers” that are excluded because they are outside the applicability domain (AD) of available models [102, 103], extra experimental testing is still necessary. For this reason, a well-defined and applicable AD is critical for any chemical risk assessment study. Currently, the AD is normally defined by chemical similarity between the test set and modeling set compounds. To make the AD more applicable in big data studies, new methods need to be developed, such as the biosimilarity confidence that we have recently reported [78].
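A similarity-based applicability domain of the kind described here can be sketched as a simple threshold test. The example assumes fingerprints represented as sets of on-bits, the Tanimoto coefficient as the similarity measure, and an arbitrary cutoff of 0.5; published AD definitions differ in all three choices.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


def in_applicability_domain(query_fp, modeling_fps, threshold=0.5):
    """Return (inside AD?, best similarity) for a query against the modeling set.

    The query is inside the AD when its nearest modeling-set neighbor is at
    least `threshold` similar; the threshold value is an assumption.
    """
    best = max((tanimoto(query_fp, fp) for fp in modeling_fps), default=0.0)
    return best >= threshold, best


# Hypothetical modeling set fingerprints.
modeling_set = [{1, 2, 3, 4}, {5, 6, 7}]

inside, best_inside = in_applicability_domain({1, 2, 3, 9}, modeling_set)
outside, best_outside = in_applicability_domain({10, 11}, modeling_set)
print(inside, best_inside, outside, best_outside)
```

Compounds that fail the test are the “outliers” for which a prediction should be withheld and experimental testing considered.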

Toxicology research becomes more complicated when various types of data (i.e., the variety of big data) are used in a single study. This is the ultimate challenge of computational toxicology, and new computational approaches are needed to meet it. In the above section, we described hybrid models and new computational approaches that use various types of toxicity data in the computational toxicity field (e.g., vAOP studies). However, this type of work is far from achieving final success. The current bioinformatics and cheminformatics modeling approaches and data analysis methods that have been developed in the past decade do not meet the requirements of big data analysis.

Big data research will be one of the major efforts of modern toxicology in the future. With all these challenges, there is an urgent need for novel techniques in data mining/generation, curation, and analysis to fulfill the requirements of big data research in computational toxicity. The recent progress in computational toxicology described in this book can be viewed as leading in this direction. The success of data-driven studies will assist toxicologists by highlighting the value of the publicly available toxicity data and providing guidance for future experimental testing.

References

  1. 1 Marx, V. (2013) Biology: the big challenges of big data. Nature, 498, 255–260.
  2. 2 Swarup, V. and Geschwind, D.H. (2013) Alzheimer's disease: from big data to mechanism. Nature, 500, 34–35.
  3. 3 Zhu, H., Zhang, J., Kim, M.T. et al. (2014) Big data in chemical toxicity research: the use of high-throughput screening assays to identify potential toxicants. Chem. Res. Toxicol., 27, 1643–1651.
  4. 4 Schadt, E.E., Linderman, M.D., Sorenson, J. et al. (2011) Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology. Nat. Rev. Genet., 12, 224. doi: 10.1038/nrg2857-c2
  5. 5 Hartung, T. (2016) Making big sense from big data in toxicology by read-across. ALTEX, 33, 83–93. doi: 10.14573/altex.1603091
  6. 6 McAfee, A. and Brynjolfsson, E. (2012) “Big data.” The management revolution. Harvard Business Rev., 90, 61–67. doi: 10.1007/s12599-013-0249-5
  7. 7 Zerhouni, E. (2003) The NIH Roadmap. Science, 302, 63–72.
  8. 8 Austin, C.P., Brady, L.S., Insel, T.R., and Collins, F.S. (2004) NIH molecular libraries initiative. Science, 306, 1138–1139.
  9. 9 Dix, D.J., Houck, K.A., Martin, M.T. et al. (2007) The toxcast program for prioritizing toxicity testing of environmental chemicals. Toxicol. Sci., 95, 5–12. doi: 10.1093/toxsci/kfl103
  10. 10 Collins, F.S., Gray, G.M., and Bucher, J.R. (2008) Toxicology: transforming environmental health protection. Science, 319, 906–907. doi: 10.1126/science.1154619
  11. 11 Hukkanen, R.R., Halpern, W.G., and Vidal, J.D. (2016) Regulatory forum opinion piece: review of FDA draft guidance testicular toxicity – evaluation during Drug Development Guidance for Industry. Toxicol. Pathol., 44, 927–930. doi: 10.1177/0192623316656416
  12. 12 Shukla, S.J., Huang, R., Austin, C.P., and Xia, M. (2010) The future of toxicity testing: a focus on in vitro methods using a quantitative high throughput screening platform. Drug Discov. Today, 15, 997–1007.
  13. 13 Wang, Y., Xiao, J., Suzek, T.O. et al. (2009) PubChem : a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res., 37, W623–W633. doi: 10.1093/nar/gkp456
  14. 14 Wang, Y., Bolton, E., Dracheva, S. et al. (2009) An overview of the PubChem BioAssay resource. Nucleic Acids Res., 38, D255–D266. doi: 10.1093/nar/gkp965
  15. 15 Gaulton, A., Bellis, L.J., Bento, A.P. et al. (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res., 40, D1100–D1107. doi: 10.1093/nar/gkr777
  16. 16 Bento, A.P., Gaulton, A., Hersey, A. et al. (2014) The ChEMBL bioactivity database: an update. Nucleic Acids Res., 42, D1083–D1090. doi: 10.1093/nar/gkt1031
  17. 17 Judson, R., Richard, A., Dix, D. et al. (2008) ACToR – aggregated computational toxicology resource. Toxicol. Appl. Pharmacol., 233, 7–13. doi: 10.1016/j.taap.2007.12.037
  18. 18 Judson, R.S., Martin, M.T., Egeghy, P. et al. (2012) Aggregating data for computational toxicology applications: the U.S. environmental protection agency (EPA) aggregated computational toxicology resource (ACToR) system. Int. J. Mol. Sci., 13, 1805–1831. doi: 10.3390/ijms13021805
  19. 19 Fonger, G.C., Stroup, D., Thomas, P.L., and Wexler, P. (2000) TOXNET: a computerized collection of toxicological and environmental health information. Toxicol. Ind. Health., 16, 4–6. doi: 10.1177/074823370001600101
  20. 20 Luechtefeld, T., Maertens, A., Russo, D.P. et al. (2016) Global analysis of publicly available safety data for 9,801 substances registered under REACH from 2008-2014. ALTEX, 33, 95–109. doi: 10.14573/altex.1510052
  21. 21 Vinken, M., Pauwels, M., Ates, G. et al. (2012) Screening of repeated dose toxicity data present in SCC(NF)P/SCCS safety evaluations of cosmetic ingredients. Arch. Toxicol., 86, 405–412. doi: 10.1007/s00204-011-0769-z
  22. 22 Gocht, T., Berggren, E., Ahr, H.J. et al. (2015) The SEURAT-1 approach towards animal free human safety assessment. ALTEX, 32, 9–24. doi: 10.14573/altex.1408041
  23. 23 Benigni, R., Battistelli, C.L., Bossa, C. et al. (2013) New perspectives in toxicological information management, and the role of ISSTOX databases in assessing chemical mutagenicity and carcinogenicity. Mutagenesis, 28, 401–409. doi: 10.1093/mutage/get016
  24. 24 Sakuratani, Y., Zhang, H.Q., Nishikawa, S. et al. (2013) Hazard evaluation support system (HESS) for predicting repeated dose toxicity using toxicological categories. SAR QSAR Environ. Res., 24, 351–363.
  25. 25 Bitsch, A., Jacobi, S., Melber, C. et al. (2006) REPDOSE: a database on repeated dose toxicity studies of commercial chemicals – a multifunctional tool. Regul. Toxicol. Pharmacol., 46, 202–210. doi: 10.1016/j.yrtph.2006.05.013
  26. 26 B. Tesar, Gene Expression Omnibus, (2013) 1–10. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE69034 (accessed 20 Aug, 2017).
  27. 27 Igarashi, Y., Nakatsu, N., Yamashita, T. et al. (2014) Open TG-GATEs: a large-scale toxicogenomics database. Nucleic Acids Res., gku955. doi: 10.1093/nar/gku955
  28. 28 Waters, M.D., Boorman, G., Bushel, P. et al. (2003) Systems toxicology and the chemical effects in biological systems (CEBS) knowledge base. Environ. Health Perspect., 111, 811–824. doi: 10.1289/txg.5971
  29. 29 Mattingly, C.J., Colby, G.T., Forrest, J.N., and Boyer, J.L. (2003) The comparative toxicogenomics database (CTD). Environ. Health Perspect., 111, 793. doi: 10.1289/ehp.6028
  30. 30 Davis, A.P., Grondin, C.J., Lennon-Hopkins, K. et al. (2015) The comparative toxicogenomics database's 10th year anniversary: update 2015. Nucleic Acids Res., 43, D914–D920. doi: 10.1093/nar/gku935
  31. 31 Davis, A.P., Murphy, C.G., Johnson, R. et al. (2013) The comparative toxicogenomics database: update 2013. Nucleic Acids Res., 41, 1104–1114. doi: 10.1093/nar/gks994
  32. 32 Ganter, B., Snyder, R.D., Halbert, D.N., and Lee, M.D. (2006) Toxicogenomics in drug discovery and development: mechanistic analysis of compound/class-dependent effects using the DrugMatrix® database. Pharmacogenomics, 1025–1044.
  33. 33 Lamb, J., Crawford, E.D., Peck, D. et al. (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 313, 1929–1935. doi: 10.1126/science.1132939
  34. 34 Sayers, E.W., Barrett, T., Benson, D.A. et al. (2009) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 37, D5–D15. doi: 10.1093/nar/gkn741
  35. 35 Acland, A., Agarwala, R., Barrett, T. et al. (2014) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 42, D7–D17. doi: 10.1093/nar/gkt1146
  36. 36 Yang, C., Ambrosio, M., Arvidson, K. et al. (2013) Development of new COSMOS oRepeatDose and non-cancer threshold of toxicological concern (TTC) databases to support alternative testing methods for cosmetics related chemicals. Toxicol. Lett., 221, S80. doi: 10.1016/j.toxlet.2013.05.082
  37. 37 McHale, C.M., Zhang, L., Hubbard, A.E., and Smith, M.T. (2010) Toxicogenomic profiling of chemically exposed humans in risk assessment. Mutat. Res., 705, 172–183. doi: 10.1016/j.mrrev.2010.04.001
  38. 38 Blaauboer, B.J., Boekelheide, K., Clewell, H.J. et al. (2012) The use of biomarkers of toxicity for integrating in vitro hazard estimates into risk assessment for humans. ALTEX, 29, 411–425. doi: 10.14573/altex.2012.4.411
  39. 39 Uehara, T., Ono, A., Maruyama, T. et al. (2010) The Japanese toxicogenomics project: application of toxicogenomics. Mol. Nutr. Food Res., 54, 218–227. doi: 10.1002/mnfr.200900169
  40. 40 Judson, R.S., Houck, K.A., Kavlock, R.J. et al. (2010) In vitro screening of environmental chemicals for targeted testing prioritization: the ToxCast project. Environ. Health Perspect., 118, 485–492. doi: 10.1289/ehp.0901392
  41. 41 Reif, D.M., Martin, M.T., Tan, S.W. et al. (2010) Endocrine profiling and prioritization of environmental chemicals using ToxCast data. Environ. Health Perspect., 118, 1714–1720. doi: 10.1289/ehp.1002180
  42. 42 Martin, M.T., Knudsen, T.B., Reif, D.M. et al. (2011) Predictive model of rat reproductive toxicity from ToxCast high throughput screening. Biol. Reprod., 85, 327–339. doi: 10.1095/biolreprod.111.090977
  43. 43 Sipes, N.S., Martin, M.T., Reif, D.M. et al. (2011) Predictive models of prenatal developmental toxicity from ToxCast high-throughput screening data. Toxicol. Sci., 124, 109–127. doi: 10.1093/toxsci/kfr220
  44. 44 Kavlock, R., Chandler, K., Houck, K. et al. (2012) Update on EPA's ToxCast program: providing high throughput decision support tools for chemical risk management. Chem. Res. Toxicol., 25, 1287–1302. doi: 10.1021/tx3000939
  45. 45 Kleinstreuer, N.C., Dix, D.J., Houck, K.A. et al. (2013) In vitro perturbations of targets in cancer hallmark processes predict rodent chemical carcinogenesis. Toxicol. Sci., 131, 40–55. doi: 10.1093/toxsci/kfs285
  46. 46 Rotroff, D.M., Dix, D.J. et al. (2013) Using in vitro high throughput screening assays to identify potential endocrine-disrupting chemicals. Environ. Health Perspect., 121, 7–14.
  47. 47 Sipes, N.S., Martin, M.T., Kothiya, P. et al. (2013) Profiling 976 ToxCast chemicals across 331 enzymatic and receptor signaling assays. Chem. Res. Toxicol., 26, 878–895. doi: 10.1021/tx400021f
  48. 48 Tice, R.R., Austin, C.P., Kavlock, R.J., and Bucher, J.R. (2013) Improving the human hazard characterization of chemicals: a Tox21 update. Environ. Health Perspect., 121, 756–765. doi: 10.1289/ehp.1205784
  49. 49 Martin, M.T., Judson, R.S., Reif, D.M. et al. (2009) Profiling chemicals based on chronic toxicity results from the U.S. EPA ToxRef database. Environ. Health Perspect., 117, 392–399. doi: 10.1289/ehp.0800074
  50. 50 Martin, M.T., Mendez, E., Corum, D.G. et al. (2009) Profiling the reproductive toxicity of chemicals from multigeneration studies in the toxicity reference database. Toxicol. Sci., 110, 181–190. doi: 10.1093/toxsci/kfp080
  51. 51 Knudsen, T.B., Martin, M.T., Kavlock, R.J. et al. (2009) Profiling the activity of environmental chemicals in prenatal developmental toxicity studies using the U.S. EPA's ToxRefDB. Reprod. Toxicol., 28, 209–219. doi: 10.1016/j.reprotox.2009.03.016
  52. 52 Reif, D.M., Sypa, M., Lock, E.F. et al. (2013) ToxPi GUI: an interactive visualization tool for transparent integration of data from diverse sources of evidence. Bioinformatics, 29, 402–403. doi: 10.1093/bioinformatics/bts686
  53. 53 Zhang, J., Hsieh, J.H., and Zhu, H. (2014) Profiling animal toxicants by automatically mining public bioassay data: a big data approach for computational toxicology. PLoS One, 9, e99863. doi: 10.1371/journal.pone.0099863
  54. 54 Helal, K.Y., Maciejewski, M., Gregori-Puigjane, E. et al. (2016) Public domain HTS fingerprints: design and evaluation of compound bioactivity profiles from PubChem's bioassay repository. J. Chem. Inf. Model., 56, 390–398. doi: 10.1021/acs.jcim.5b00498
  55. 55 Backman, T.W.H. and Girke, T. (2016) bioassayR: cross-target analysis of small molecule bioactivity. J. Chem. Inf. Model., 56, 1237–1242. doi: 10.1021/acs.jcim.6b00109
  56. 56 Kim, M.T., Huang, R., Sedykh, A. et al. (2016) Mechanism profiling of hepatotoxicity caused by oxidative stress using antioxidant response element reporter gene assay models and big data. Environ. Health Perspect., 124, 634–641.
  57. 57 Judson, R.S., Magpantay, F.M., Chickarmane, V. et al. (2015) Integrated model of chemical perturbations of a biological pathway using 18 in vitro high-throughput screening assays for the estrogen receptor. Toxicol. Sci., 148, 137–154. doi: 10.1093/toxsci/kfv168
  58. 58 Judson, R., Houck, K., Martin, M. et al. (2016) Analysis of the effects of cell stress and cytotoxicity on in vitro assay activity across a diverse chemical and assay space. Toxicol. Sci., 152, 323–339. doi: 10.1093/toxsci/kfw092
  59. 59 Ryan, N., Chorley, B., Tice, R.R. et al. (2016) Moving toward integrating gene expression profiling into high-throughput testing: a gene expression biomarker accurately predicts estrogen receptor α modulation in a microarray compendium. Toxicol. Sci., 151, 88–103. doi: 10.1093/toxsci/kfw026
  60. 60 Patlewicz, G., Ball, N., Becker, R.A. et al. (2014) Read-across approaches – misconceptions, promises and challenges ahead. ALTEX, 31, 387–396. doi: 10.14573/altex.1410071
  61. 61 Schultz, T.W., Amcoff, P., Berggren, E. et al. (2015) A strategy for structuring and reporting a read-across prediction of toxicity. Regul. Toxicol. Pharmacol., 72, 586–601. doi: 10.1016/j.yrtph.2015.05.016
  62. 62 Guha, R. and Van Drie, J.H. (2008) Structure–activity landscape index: identifying and quantifying activity cliffs. J. Chem. Inf. Model., 48, 646–658. doi: 10.1021/ci7004093
  63. 63 Johnson, S.R. (2008) The trouble with QSAR (or how I learned to stop worrying and embrace fallacy). J. Chem. Inf. Model., 48, 25–26. doi: 10.1021/ci700332k
  64. 64 Bajorath, J., Peltason, L., Wawer, M. et al. (2009) Navigating structure–activity landscapes. Drug Discov. Today, 14, 698–705. doi: 10.1016/j.drudis.2009.04.003
  65. 65 Medina-Franco, J.L., Martínez-Mayorga, K., Bender, A. et al. (2009) Characterization of activity landscapes using 2D and 3D similarity methods: consensus activity cliffs. J. Chem. Inf. Model., 49, 477–491. doi: 10.1021/ci800379q
  66. 66 Hu, X., Hu, Y., Vogt, M. et al. (2012) MMP-cliffs: systematic identification of activity cliffs on the basis of matched molecular pairs. J. Chem. Inf. Model., 52, 1138–1145. doi: 10.1021/ci3001138
  67. 67 Stumpfe, D. and Bajorath, J. (2012) Exploring activity cliffs in medicinal chemistry. J. Med. Chem., 55, 2932–2942. doi: 10.1021/jm201706b
  68. 68 Sedykh, A., Zhu, H., Tang, H. et al. (2011) Use of in vitro HTS-derived concentration-response data as biological descriptors improves the accuracy of QSAR models of in vivo toxicity. Environ. Health Perspect., 119, 364–370.
  69. 69 Zhu, H., Ye, L., Richard, A. et al. (2009) A novel two-step hierarchical quantitative structure–activity relationship modeling work flow for predicting acute toxicity of chemicals in rodents. Environ. Health Perspect., 117, 1257–1264. doi: 10.1289/ehp.0800471
  70. 70 Low, Y., Sedykh, A., Fourches, D. et al. (2013) Integrative chemical–biological read-across approach for chemical hazard classification. Chem. Res. Toxicol., 26, 1199–1208. doi: 10.1021/tx400110f
  71. 71 Garcia-Serna, R., Vidal, D., Remez, N., and Mestres, J. (2015) Large-scale predictive drug safety: from structural alerts to biological mechanisms. Chem. Res. Toxicol., 28, 1875–1887. doi: 10.1021/acs.chemrestox.5b00260
  72. 72 Kleinstreuer, N.C., Ceger, P.C., Allen, D.G. et al. (2016) A curated database of rodent uterotrophic bioactivity. Environ. Health Perspect., 124, 556–562.
  73. 73 Luechtefeld, T., Maertens, A., Russo, D.P. et al. (2016) Analysis of public oral toxicity data from REACH registrations 2008–2014. ALTEX, 33, 111–122. doi: 10.14573/altex.1510054
  74. 74 Luechtefeld, T., Maertens, A., Russo, D.P. et al. (2016) Analysis of Draize eye irritation testing and its prediction by mining publicly available 2008–2014 REACH data. ALTEX, 33, 123–134. doi: 10.14573/altex.1510053
  75. 75 Luechtefeld, T., Maertens, A., Russo, D.P. et al. (2016) Analysis of publically available skin sensitization data from REACH registrations 2008–2014. ALTEX, 33, 135–148. doi: 10.14573/altex.1510055
  76. 76 Ball, N., Cronin, M.T.D., Shen, J. et al. (2016) Toward good read-across practice (GRAP) guidance. ALTEX, 33, 149–166. doi: 10.14573/altex.1601251
  77. 77 Zhu, H., Bouhifd, M., Donley, E. et al. (2016) Supporting read-across using biological data. ALTEX, 33, 167–182. doi: 10.14573/altex.1601252
  78. 78 Russo, D.P., Kim, M., Wang, W. et al. (2017) CIIPro: a new read-across portal to fill data gaps using public large scale chemical and biological data. Bioinformatics, 33, 464–466.
  79. 79 Ribay, K., Kim, M.T., Wang, W. et al. (2016) Predictive modeling of estrogen receptor binding agents using advanced cheminformatics tools and massive public data. Front. Environ. Sci., 4, 1–9. doi: 10.3389/fenvs.2016.00012
  80. 80 Tan, A.-H. (1999) Text mining: the state of the art and the challenges. Proc. PAKDD 1999 Work. Knowl. Discov. Adv. Databases, 8, 65–70. doi: 10.1.1.132.6973
  81. 81 Vazquez, M., Krallinger, M., Leitner, F., and Valencia, A. (2011) Text mining for drugs and chemical compounds: methods, tools and applications. Mol. Inform., 30, 506–519. doi: 10.1002/minf.201100005
  82. 82 Pelkonen, O., Mäenpää, J., Taavitsainen, P. et al. (1998) Inhibition and induction of human cytochrome P450 (CYP) enzymes. Xenobiotica, 28, 1203–1253. doi: 10.1080/004982598238886
  83. 83 Ingelman-Sundberg, M. (2001) Genetic susceptibility to adverse effects of drugs and environmental toxicants: the role of the CYP family of enzymes. Mutat. Res. Mol. Mech. Mutagen., 482, 11–19. doi: 10.1016/S0027-5107(01)00205-6
  84. 84 Honkakoski, P. and Negishi, M. (2000) Regulation of cytochrome P-450 (CYP) genes by nuclear receptors. Biochem. J., 347, 321–337. doi: 10.1042/0264-6021:3470321
  85. 85 Nebert, D.W. and Russell, D.W. (2002) Clinical importance of the cytochromes P450. Lancet, 360, 1155–1162. doi: 10.1016/S0140-6736(02)11203-7
  86. 86 Gonzalez, G.H., Tahsin, T., Goodale, B.C. et al. (2016) Recent advances and emerging applications in text and data mining for biomedical discovery. Brief. Bioinform., 17, 33–42. doi: 10.1093/bib/bbv087
  87. 87 Krallinger, M., Erhardt, R.A.A. et al. (2005) Text mining approaches in molecular biology and biomedicine. Drug Discov. Today, 10, 439–445. doi: 10.1016/S1359-6446(05)03376-3
  88. 88 Kavlock, R.J., Ankley, G., Blancato, J. et al. (2008) Computational toxicology – a state of the science mini review. Toxicol. Sci., 103, 14–27. doi: 10.1093/toxsci/kfm297
  89. 89 Davis, A.P., Wiegers, T.C., Johnson, R.J. et al. (2013) Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database. PLoS One, 8, e58201. doi: 10.1371/journal.pone.0058201
  90. 90 Chung, M.H., Wang, Y., Tang, H. et al. (2015) Asymmetric author-topic model for knowledge discovering of big data in toxicogenomics. Front. Pharmacol., 6, 1–7. doi: 10.3389/fphar.2015.00081
  91. 91 Lee, M., Liu, Z., Kelly, R., and Tong, W. (2014) Of text and gene – using text mining methods to uncover hidden knowledge in toxicogenomics. BMC Syst. Biol., 8, 93. doi: 10.1186/s12918-014-0093-3
  92. 92 Bizer, C., Boncz, P., Brodie, M.L., and Erling, O. (2011) The meaningful use of big data: four perspectives – four challenges. ACM SIGMOD Rec., 40, 56–60. doi: 10.1145/2094114.2094129
  93. 93 Coveney, P.V., Dougherty, E.R., and Highfield, R.R. (2016) Big data need big theory too. Philos. Trans. Royal Soc. A, 374, 20160153. doi: 10.1098/rsta.2016.0153
  94. 94 Williams, A.J. and Ekins, S. (2011) A quality alert and call for improved curation of public chemistry databases. Drug Discov. Today, 16, 747–750. doi: 10.1016/j.drudis.2011.07.007
  95. 95 Prinz, F., Schlange, T., and Asadullah, K. (2011) Believe it or not: how much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov., 10, 712. doi: 10.1038/nrd3439-c1
  96. 96 Ioannidis, J.P.A., Allison, D.B., Ball, C.A. et al. (2009) Repeatability of published microarray gene expression analyses. Nat. Genet., 41, 149–155. doi: 10.1038/ng.295
  97. 97 Bell, A.W., Deutsch, E.W., Au, C.E. et al. (2009) A HUPO test sample study reveals common problems in mass spectrometry-based proteomics. Nat. Methods, 6, 423–430. doi: 10.1038/nmeth.1333
  98. 98 Fourches, D., Muratov, E., and Tropsha, A. (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J. Chem. Inf. Model., 50, 1189–1204. doi: 10.1021/ci100176x
  99. 99 Young, D., Martin, T., Venkatapathy, R., and Harten, P. (2008) Are the chemical structures in your QSAR correct? QSAR Comb. Sci., 27, 1337–1345. doi: 10.1002/qsar.200810084
  100. 100 Fourches, D., Muratov, E., and Tropsha, A. (2015) Curation of chemogenomics data. Nat. Chem. Biol., 11, 535. doi: 10.1038/nchembio.1881
  101. 101 Zhao, L., Wang, W., Sedykh, A. et al. (2017) Experimental errors in QSAR modeling sets: what we can do and what we cannot do. ACS Omega, 2, 2805–2812.
  102. 102 Jaworska, J., Nikolova-Jeliazkova, N., and Aldenberg, T. (2005) Review of methods for QSAR applicability domain estimation by the training set. ATLA, 33, 445–459. https://www.ncbi.nlm.nih.gov/pubmed/16268757 (accessed 27 September 2017).
  103. 103 Tetko, I.V., Sushko, I., Pandey, A.K. et al. (2008) Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection. J. Chem. Inf. Model., 48, 1733–1746. doi: 10.1021/ci800151m