Linlin Zhao1 and Hao Zhu1,2
1Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA
2Department of Chemistry, Rutgers University, Camden, NJ, USA
In the past decade, along with the vibrant and rapid development of chemical synthesis and biological screening technologies, immense amounts of data have been generated daily, and most of these data are available to the public [1, 2]. Biological data generated from high-throughput screening (HTS) of large chemical libraries contain rich toxicology information that has the potential to be integrated into toxicity research [3]. Currently available toxicity data exist both in structured formats (e.g., deposited into PubChem and other data-sharing web portals) and as unstructured data (papers, laboratory reports, toxicity website information, etc.). These data, even in structured formats, have become so large and complex that it is difficult to process and use them with traditional database management, data processing, and modeling tools.
The term “big data” describes this phenomenon, which can be viewed as the biggest change in modern toxicology. Originally, the focus of big data was the development of advanced data storage and handling techniques, such as cloud-based computing or high-speed heterogeneous computational environments [4]. The concept of big data is gaining increasing recognition in clinical studies and other research areas driven by biological data, including the field of toxicology [1–3, 5]. The daily updated toxicology big data landscape consists of large-scale datasets from various sources, which contain an enormous number of chemical toxicity endpoints. The terms volume (scale of data), velocity (growth of data), and variety (diversity of sources) have been used to characterize big data and fit the current toxicity big data landscape. Besides these characteristics, owing to the nature of experimental protocols, one should also be aware of the veracity (uncertainty of data) of these datasets. Thus, the “four V's” (namely, volume, velocity, variety, and veracity) of big data also represent the relevant challenges of modern toxicology research (Figure 11.1) [6]. The volume and velocity of data-driven research in toxicology have been well recognized by multiple screening and data-sharing projects, which are described in the next section of this chapter. However, the use of all these available data sources is still based on traditional computational approaches (e.g., quantitative structure–activity relationships, QSAR), and the issues of variety and veracity have rarely been considered in previous studies. This chapter also highlights the recent progress of data-driven research in answering the above challenges.
Since the NIH Roadmap for medical research was launched in 2004 [7], several molecular library screening centers have been funded [8] and several HTS projects have been performed to experimentally test large chemical libraries. The recent data generation efforts in the area of toxicology are the Toxicity Forecaster (ToxCast) program initiated by the US Environmental Protection Agency (EPA) [9] and Toxicology in the 21st Century (Tox21), which was launched by the National Toxicology Program (NTP), the National Institutes of Health (NIH) Chemical Genomics Center (NCGC), and the EPA [10–12]. The direct results of these experimental screening efforts, especially HTS, are the toxicity data currently publicly available, as summarized in Table 11.1. These databases constitute the current toxicity big data landscape, which is described in detail in the following.
Table 11.1 Publicly available toxicity data resources (as of 10/22/2016)
Database | Size | Data description | Accessibility |
PubChem [13, 14] | Over 60 million compounds, over 1 million bioassays, over 13 billion data points | Toxicity, genomics, and literature data | https://pubchem.ncbi.nlm.nih.gov/ |
ChEMBL [15, 16] | Over 1 million compounds, over 13 million data points | Drugs, drug-like small molecules, and their bioactivity data | https://www.ebi.ac.uk/chembl/ |
ACToR [17, 18] | Over 800,000 compounds, over 500,000 assays | Both in vitro and in vivo toxicity data | https://actor.epa.gov |
TOXNET [19] | Multiple sub-databases contain over 300,000 compounds and related toxicity data | Toxicity, diseases, and literature data | https://toxnet.nlm.nih.gov/ |
REACH [20] | 816,048 studies for 9800 substances and 3600 study types | Data submitted in EU chemical legislation, machine-readable | http://www.reach.lu/mmp/online/website/menu_hori/homepage/index_EN.html |
SEURAT-1 [21, 22] | Over 5500 cosmetic-type compounds in the current COSMOS database web portal | Animal toxicity data | http://www.seurat-1.eu/ |
ISSTOX [23] | Five sub-databases | (i) Long-term carcinogenicity bioassays in rodents (rat and mouse) (ISSCAN); (ii) In vitro Salmonella typhimurium mutagenesis (Ames test) (ISSSTY); (iii) In vivo mutagenesis (micronucleus test) (ISSMIC); (iv) Cell transformation assays (ISSCTA); (v) Mutagenicity and carcinogenicity of biocides (ISSBIOC) | https://www.dimdi.de/static/en/index.html |
HESS DB [24] | Repeated-dose toxicity data for over 2000 compounds; toxicity mechanism information for about 80 compounds was added recently | Highly curated results of repeated-dose toxicity tests | http://www.nite.go.jp/en/chem/qsar/hess-e.html |
RepDOSE [25] | 364 chemicals investigated in 1018 studies which resulted in 6002 specific effects | Repeat-dose study data for dog, mouse, and rat | http://fraunhofer-repdose.de/ |
GEO [26] | Over 4000 sub-datasets | Microarray, next-generation sequencing and other forms of high-throughput functional genomics data submitted by the research community | https://www.ncbi.nlm.nih.gov/geo/ |
Open TG-GATEs [27] | 170 compounds at multiple dosages and time points | Toxicity data, gene expression data, and metadata | http://toxico.nibiohn.go.jp/english/ |
CEBS [28] | About 10,000 toxicity bioassays from various sources | Gene expression data | http://www.niehs.nih.gov/research/resources/databases/cebs/index.cfm |
CTD [29–31] | Over 13,000 compounds, over 32,000 genes, over 6000 diseases | Compound, gene, and disease relationships | https://ctdbase.org/ |
DrugMatrix [32] | About 600 drug molecules and 10,000 genes | Gene expression data | https://ntp.niehs.nih.gov/drugmatrix/index.html |
Connectivity Map [33] | About 1300 compounds and 7000 genes | Gene expression data | http://portals.broadinstitute.org/cmap/ |
The well-known chemical/biological data-sharing projects were not initially designed for toxicants but contain considerable amounts of toxicity-relevant data. For example, PubChem is a public repository for chemical structures and their biological data, including the toxicity data from the screening centers described above [13, 14]. Figure 11.2 shows the yearly increase in the number of PubChem compounds and bioassays. The number of PubChem compounds increased from 19 million in September 2008 [34] to over 60 million in September 2015 [35]. During the same period, the number of bioassays used to test these compounds increased from 1197 in September 2008 [34] to over 1 million in September 2015, resulting in over five terabytes of data [35]. Another large reservoir of bioassay data is the ChEMBL database, a manually curated chemical database maintained by the European Bioinformatics Institute (EBI) of the European Molecular Biology Laboratory (EMBL) [15]. The EBI's goal is to provide freely available data and bioinformatics services to the scientific community. As part of this goal, the ChEMBL database was constructed for experimental data on both chemical toxicity and absorption, distribution, metabolism, and excretion (ADME) properties. ChEMBL version 22 (ChEMBL_22) was released in 2016; it contains over 1 million compounds and over 14 million data points [16].
Databases designed specifically to share toxicity data have also made considerable progress. The Aggregated Computational Toxicology Online Resource (ACToR) portal, developed by the US EPA's National Center for Computational Toxicology (NCCT), aggregates toxicity data for over 800,000 compounds from thousands of public sources, which include HTS, chemical exposure, sustainable chemistry (chemical structures and physicochemical properties), and virtual tissue data [17, 18]. Similarly, TOXNET, developed by the National Library of Medicine's (NLM) Division of Specialized Information Services (SIS), contains 16 separate sub-databases of various toxicity categories for thousands of diverse chemicals. By grouping these databases together, TOXNET allows users to access all toxicity data for target compounds from one query form [19]. Another prominent example is the European Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) legislation, which aims to collect comprehensive safety information for all substances available on the European market. The current REACH database contains more than 9000 unique substances and about 800,000 safety assessment documents [20].
Some other toxicity resources were designed to provide data for one or several specific toxicity categories. The Safety Evaluation Ultimately Replacing Animal Testing (SEURAT-1) program was launched in 2011 by the European Commission and the European cosmetics trade association, Cosmetics Europe, and is composed of six complementary research projects [21, 22]. Among them, the COSMOS project was dedicated to the development of methods to predict the safety of cosmetic ingredients [36]. The COSMOS database web portal contains over 5500 unique cosmetic-relevant compounds with their respective in vivo toxicity data. Another European institute, the Istituto Superiore di Sanità (ISS), developed the ISSTOX database [23], which includes five sub-databases, each devoted to a different toxicological endpoint. There are also several databases containing multidose toxicity data, such as the Hazard Evaluation Support System Integrated Platform Database (HESS DB) [24] and RepDOSE [25]. HESS DB is a knowledge database containing toxicity data for compounds tested at various doses/concentrations, together with relevant toxicity mechanism information. The RepDOSE database consists of toxicity data for about 400 chemicals investigated in about 1000 multidose studies [25].
Another type of data source is toxicogenomics (TGx) studies, which generate enormous amounts of “omics” data that are meant to predict toxicity or genetic susceptibility induced by chemicals. Many modern in vitro toxicity studies have shown that genomic responses are sensitive to toxicants and that these findings can be translated into biomarkers useful for chemical toxicity assessments [37, 38]. Several publicly available TGx databases have been initiated, such as the Gene Expression Omnibus (GEO) [26], the Japanese Toxicogenomics Project (TGP) database [39], the Chemical Effects in Biological Systems (CEBS) database [28], and the Comparative Toxicogenomics Database (CTD) [29–31]. GEO is a public repository that archives diverse forms of high-throughput functional genomic datasets, and it contains a large quantity of gene expression data that can be used in computational toxicology studies [26]. TGP data are available to the public through the Open TG-GATEs database [27], which includes toxicology data for 170 compounds. The CEBS database [28], developed by the NIEHS, is now the public repository for all NTP conventional toxicology and carcinogenicity data as well as NCGC HTS data, along with the CTD, which aims to promote comparative studies of genes and proteins across species [29–31]. DrugMatrix [32] and the Connectivity Map [33] are similar efforts in toxicogenomics data curation, but with more specific research goals.
The massive amount of chemical toxicity data available to research communities presents a great opportunity for modern toxicology research. However, in the current big data era, traditional computational approaches are not suitable for dealing with big data resources. Most typical big data research in toxicology has taken place over the past five years; along with the fast-growing data, this field has been assisted by supercomputers with powerful computational capabilities.
With all these screening efforts, especially the HTS projects mentioned above, a significant number of “popular” compounds (e.g., well-known toxicants) have been tested multiple times. For example, Table 11.2 shows 20 toxicants obtained from the Integrated Risk Information System (IRIS) database. On the basis of search results on PubChem, these toxicants have been tested in hundreds of PubChem bioassays. As shown in Table 11.2, hexachlorophene (CAS 70-30-4), a disinfectant banned from the market, showed active responses in 366 bioassays. Other toxicants have similarly rich response information in PubChem (Table 11.2). The rich biological data for these toxicants can be viewed as response profiles, which are useful for computational toxicology studies.
Table 11.2 Twenty human toxicants with their relevant PubChem bioassay responses
Chemicals | CAS | No. of active responses | No. of inactive responses |
Hexachlorophene | 70-30-4 | 366 | 721 |
Captan | 133-06-2 | 212 | 198 |
Phenylmercuric acetate | 62-38-4 | 203 | 115 |
Chlordecone (Kepone) | 143-50-0 | 166 | 442 |
Hydroquinone | 123-31-9 | 121 | 450 |
2-Chloroacetophenone | 532-27-4 | 78 | 191 |
Pentachlorophenol | 87-86-5 | 78 | 679 |
Tributyltin oxide (TBTO) | 56-35-9 | 77 | 37 |
Endosulfan | 115-29-7 | 75 | 189 |
Hexachlorocyclopentadiene (HCCPD) | 77-47-4 | 69 | 203 |
Propachlor | 1918-16-7 | 69 | 379 |
Sodium diethyldithiocarbamate | 148-18-5 | 68 | 77 |
p,p′-Dichlorodiphenyl dichloroethane (DDD) | 72-54-8 | 66 | 170 |
Acetaldehyde | 75-07-0 | 63 | 173 |
Phenol | 108-95-2 | 62 | 369 |
Bisphenol A | 80-05-7 | 61 | 739 |
p,p′-Dichlorodiphenyl trichloroethane (DDT) | 50-29-3 | 61 | 279 |
Naphthalene | 91-20-3 | 61 | 697 |
Chlorobenzilate | 510-15-6 | 59 | 62 |
α-Hexachlorocyclohexane (α-HCH) | 319-84-6 | 58 | 619 |
Some pioneering studies used bioassay data to profile toxicants at an early stage. For example, a major goal of the ToxCast project is to use hundreds of bioassays [40–48] to profile compounds whose animal toxicity has been tested, as recorded in the Toxicity Reference Database (ToxRefDB) [49–51]. The profiling studies of ToxCast tried to develop predictive models for toxic compounds using a set of in vitro assays and/or in silico predicted results, for example, ToxPi [52]. The disadvantage of this type of study is that the selection of biological data for profiling is arbitrary and the predictions are limited to in-house data.
In the current big data era, all the public toxicity data (e.g., those shown in Table 11.1) can be used for profiling toxicants. In 2014, we developed an automatic virtual profiling tool to evaluate potential animal toxicants using all PubChem bioassay data [53]. The core of this study is a scoring system to evaluate the relationship between PubChem bioassays and animal toxicity. The top-ranked bioassays were used to profile the compounds of interest and make predictions of potential toxicants. Recently, a similar effort was reported by Helal et al. [54], who used PubChem bioassays to create bioprofiles for more than 300,000 chemicals and showed that these bioprofiles can be used in toxicity model development. Meanwhile, bioassayR, which is distributed as an R package, can be used for the simultaneous analysis of large amounts of biological data, especially those obtained from HTS [55]. Using bioassayR, bioassays with the same targets can be clustered and compound–target information can be generated for further analysis.
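The profile-comparison idea behind such tools can be illustrated with a minimal sketch. The similarity metric and assay IDs below are illustrative assumptions, not the actual scoring system of the cited tools:

```python
def biosimilarity(profile_a, profile_b):
    """Tanimoto-like similarity over bioassays tested in both compounds.

    A profile maps bioassay IDs to 1 (active) or 0 (inactive); untested
    assays are simply absent.  Illustrative metric only.
    """
    shared = set(profile_a) & set(profile_b)          # assays tested in both
    both_active = sum(1 for a in shared
                      if profile_a[a] == 1 and profile_b[a] == 1)
    either_active = sum(1 for a in shared
                        if profile_a[a] == 1 or profile_b[a] == 1)
    return both_active / either_active if either_active else 0.0

# Hypothetical PubChem-style response profiles for two compounds
compound_x = {"AID1": 1, "AID2": 1, "AID3": 0, "AID4": 1}
compound_y = {"AID1": 1, "AID2": 0, "AID3": 0, "AID5": 1}
print(biosimilarity(compound_x, compound_y))  # 1 shared active / 2 active in either -> 0.5
```

Restricting the comparison to commonly tested assays is one simple way to cope with the sparse, unevenly tested nature of public bioassay data.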
The direct benefit of these toxicant profiling efforts is to reveal new chemical toxicity mechanisms by analyzing the relevant biological data. Recently, we developed a virtual Adverse Outcome Pathway (vAOP) of oxidative stress by profiling hepatotoxicants [56]. In this study, four PubChem assays obtained from bioprofiles of hepatotoxicants were used to create a vAOP. If a new compound contains the initial chemical features described in this vAOP and shows active responses in any of these four assays, it will be predicted to cause liver damage in vivo through the induction of oxidative stress. The ToxCast project generated several similar analyses [57–59]. For example, it studied estrogen receptor (ER)-binding potentials by profiling ER binders using diverse in vitro assays [57, 58] and gene expression data [59]. The AOP generated from these assay results can not only predict potential ER binders but also illustrate the relevant toxicity mechanisms.
Traditional read-across approaches in computational toxicology, which are widely used to fill data gaps for new compounds without relevant toxicity data, are usually based on chemical similarity searches [60, 61] or QSAR predictions. The basic hypothesis of this type of study is that “similar compounds have similar bioactivities,” which is not suitable for most in vivo chemical toxicity phenomena with complicated toxicity mechanisms. Using chemical similarity alone to justify read-across is error-prone, especially when chemically similar compounds show dissimilar toxicity. For example, “activity cliffs” (i.e., similar compounds having different toxicity) result in prediction errors in many toxicity modeling studies [62–67]. Thus, in the big data era, the use of biological data in addition to chemical structural information adds extra strength to the read-across process.
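The “similar compounds have similar bioactivities” hypothesis, and the activity-cliff failure mode, can be sketched as a nearest-neighbor read-across over toy fingerprints. All compounds, fingerprints, and labels below are hypothetical:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def read_across(query_fp, training):
    """Assign the query the toxicity label of its most chemically similar neighbor.

    `training` maps compound names to (fingerprint, label) pairs.  An
    activity cliff -- a near neighbor with the opposite label -- would be
    silently mispredicted by this purely chemical scheme.
    """
    name, (fp, label) = max(training.items(),
                            key=lambda kv: tanimoto(query_fp, kv[1][0]))
    return label, name, tanimoto(query_fp, fp)

# Toy fingerprints; real studies would use hashed structural fingerprints
training = {
    "neighbor_toxic":   ({1, 2, 3, 4}, "toxic"),
    "distant_nontoxic": ({1, 5, 6},    "nontoxic"),
}
label, neighbor, sim = read_across({1, 2, 3}, training)
print(label, neighbor, sim)  # toxic neighbor_toxic 0.75
```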
In some early studies, integrating biological data as biological descriptors into the QSAR modeling procedure was found to be beneficial to the resulting toxicity models. For example, we developed enhanced acute toxicity models by integrating external biological data as extra descriptors [68, 69]. The resulting hybrid models, based on the combination of chemical and biological descriptors, showed better performance than traditional QSAR models. Research studies were carried out by Low et al., who concluded that read-across studies should be based on both chemical and biological data [70]. Similar results were reported by Garcia-Serna et al. [71], who stated that combining chemical and biological data could enhance the ability to assess the toxicity of small molecules with higher confidence than using chemical data alone.
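A hybrid chemical–biological descriptor vector of the kind used in such models can be sketched as follows. The descriptor values, assay IDs, and the naive imputation choice are illustrative assumptions, not those of the cited studies:

```python
def hybrid_vector(chem_desc, bio_profile, assay_order, missing=0.5):
    """Concatenate chemical descriptors with a fixed-order bioassay profile.

    chem_desc:   numeric chemical descriptors (here, hypothetical logP and MW)
    bio_profile: dict of assay ID -> 1 (active) / 0 (inactive)
    assay_order: fixed assay ordering shared by all compounds in the set
    Untested assays are imputed with `missing` -- a naive placeholder choice.
    """
    return list(chem_desc) + [bio_profile.get(a, missing) for a in assay_order]

assays = ["AID1", "AID2", "AID3"]
x = hybrid_vector([2.1, 178.2], {"AID1": 1, "AID3": 0}, assays)
print(x)  # [2.1, 178.2, 1, 0.5, 0]
```

Vectors built this way can feed any standard QSAR learner, so the chemical and biological evidence is weighed jointly rather than in separate models.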
Recent read-across studies have used various sources containing massive amounts of toxicity data. Kleinstreuer et al. [72] used the US EPA's ToxCast dataset to perform read-across for a uterotrophic database collected from a large number of historical reports; the purpose of this study was to evaluate the estrogenic endocrine disruption potential of new compounds. Another example was given by Luechtefeld et al. [5, 20, 73–75], who collected all the available historical REACH toxicity data for 9801 compounds. They performed read-across studies of REACH compounds for their acute oral toxicity [73], Draize eye irritation testing results [74], and skin sensitization activity [75]. A review of using big data to perform read-across of chemical toxicity [5] and two other strategy papers [76, 77] were also published. Recently, to fill data gaps, we developed a new read-across portal based on large-scale public chemical and biological data, the Chemical In vitro–In vivo Profiling (CIIPro) portal [78], which can automatically extract biological data from public resources (i.e., PubChem) for compounds of interest. The read-across analysis based on biosimilarity, which is defined on the extracted biological data of target compounds, showed higher predictivity for estrogen receptor binding agents [79].
Chemical toxicity data are rapidly increasing not only in structured data formats (e.g., the databases shown in Table 11.1), but also in various types of text documents such as scientific articles, patents, industry reports, and media reports, which can be classified as unstructured toxicity data. The use of unstructured toxicity data has motivated the development of text mining approaches in computational toxicology. Text mining refers to the automatic process of deriving high-quality information from text. It usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output [80]. In computational toxicology, challenges often arise in the step of structuring the input text. For example, chemicals are usually referenced in documents in different ways: common names, systematic nomenclatures, database identifiers (e.g., CAS numbers), InChI strings, or even structures shown in images. Thus, text mining studies of chemicals of specific interest (e.g., animal toxicants) depend strongly on named entity recognition (NER), which was reviewed by Vazquez et al. [81]. Furthermore, a commonly used term can have different meanings in different contexts. For example, CYP can appear in the name of a gene, protein, mutation, drug, or adverse event [82–85]. As part of biomedical science, the application of text mining in computational toxicology has included pathway extraction and reasoning as well as pharmacogenomics [86]. Text mining approaches contribute to computational toxicology studies by bringing useful knowledge from the literature, either extracted or curated, together with in-house biological datasets to identify relationships between genes, pathways, drugs, environmental contaminants, and diseases [87, 88]. Several studies have applied text mining methods within toxicogenomics studies [89–91].
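As a small illustration of chemical NER, CAS Registry Numbers can be recognized in free text with a pattern and then verified with the standard CAS check-digit rule; the example sentence is invented, but the checksum rule itself is the published one:

```python
import re

# CAS Registry Number: 2-7 digits, 2 digits, then a single check digit
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")

def valid_cas(cas):
    """Verify a CAS number's check digit.

    The check digit equals the sum of the preceding digits, weighted
    1, 2, 3, ... from the rightmost digit leftward, taken modulo 10.
    """
    m = CAS_PATTERN.fullmatch(cas)
    if not m:
        return False
    digits = (m.group(1) + m.group(2))[::-1]            # rightmost digit first
    checksum = sum((i + 1) * int(d) for i, d in enumerate(digits)) % 10
    return checksum == int(m.group(3))

text = "Hexachlorophene (CAS 70-30-4) and bisphenol A (80-05-7) were tested."
candidates = ["-".join(groups) for groups in CAS_PATTERN.findall(text)]
print([c for c in candidates if valid_cas(c)])  # ['70-30-4', '80-05-7']
```

The checksum step matters in practice: pattern matching alone also picks up dates and other dash-separated numbers, most of which fail the check-digit test.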
The rise of big data heralds a profound change in the way toxicologists perform their research. The big data era brings not only big progress but also big challenges [1, 92, 93]. Although there are some preliminary studies, as described above, which successfully apply big data sources in computational toxicology, the urgent need for new approaches in this area is described in the following.
Experimental error is inevitable in public data sources. It is understandable that the quality of data may vary depending on the nature of the experimental protocols. Currently, the usefulness of public data sources is questionable owing to a lack of necessary quality control [94]. A general concern has been raised regarding irreproducible experimental data [95–97], which is relatively common in complicated biological testing (e.g., animal models). There is also a golden rule in computational modeling studies, the “trash in, trash out” principle [5]. For this reason, the veracity of big data, represented by the potential data quality issues of public data resources, is a critical factor that affects all relevant studies. Many studies [98–100] have tried to address incorrect chemical structure information. However, studies that automatically correct biological data errors are rare [101].
Although the current data growth (i.e., the velocity of big data) is exceptional and many data are available for well-known toxicants, missing data (i.e., the lack of necessary toxicity data for target compounds) are still a common issue. As described above, read-across studies can be used to fill the data gap in some cases. However, good read-across practice is possible only when an “unknown” compound can obtain reliable predictions from its nearest neighbors [5]. For the “outliers” that are excluded because they fall outside the applicability domain (AD) of available models [102, 103], extra experimental testing is still necessary. For this reason, a well-defined and applicable AD is critical for any chemical risk assessment study. Currently, the AD is normally defined by the chemical similarity between the test set and modeling set compounds. To make the AD more applicable in big data studies, new methods need to be developed, such as the biosimilarity confidence that we recently reported [78].
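A chemical-similarity AD of this conventional kind can be sketched as flagging query compounds whose nearest modeling-set neighbor falls below a similarity cutoff; the fingerprints and the 0.6 cutoff here are arbitrary illustrations, not recommended values:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between fingerprints given as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def in_domain(query_fp, modeling_fps, cutoff=0.6):
    """True if the query is close enough to at least one modeling-set compound.

    Out-of-domain queries should be routed to experimental testing rather
    than predicted.  The 0.6 cutoff is arbitrary, chosen for illustration.
    """
    nearest = max((tanimoto(query_fp, fp) for fp in modeling_fps), default=0.0)
    return nearest >= cutoff

modeling_set = [{1, 2, 3, 4}, {5, 6, 7}]
print(in_domain({1, 2, 3}, modeling_set))  # nearest neighbor at 0.75 -> True
print(in_domain({8, 9}, modeling_set))     # no near neighbor -> False
```

A biosimilarity-based confidence would replace the chemical fingerprints with bioassay response profiles while keeping the same nearest-neighbor logic.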
Toxicology research becomes more complicated when various types of data (i.e., the variety of big data) are used in a single study. This is the ultimate challenge of computational toxicology, and new computational approaches are needed to realize this goal. In the sections above, we described hybrid models and new computational approaches that use various types of toxicity data (e.g., vAOP studies). However, this type of work is far from final success. The current bioinformatics and cheminformatics modeling approaches and data analysis methods developed in the past decade are not suited to the requirements of big data analysis.
Big data research will be one of the major efforts of modern toxicology in the future. With all these challenges, there is an urgent need for novel techniques in data mining/generation, curation, and analysis to fulfill the requirements of big data research in computational toxicology. The recent progress in computational toxicology described in this book can be viewed as leading in this direction. The success of data-driven studies will assist toxicologists by highlighting the value of publicly available toxicity data and providing guidance for future experimental testing.