CHAPTER 2

Challenges Faced in the Integration of Biological Information

Su Yun Chung and John C. Wooley

Biologists, in attempting to answer a specific biological question, now frequently choose their direction and select their experimental strategies by way of an initial computational analysis. Computers and computer tools are naturally used to collect and analyze the results from the largely automated instruments used in the biological sciences. However, far more pervasive than this type of requirement, the very nature of the intellectual discovery process requires access to the latest version of the worldwide collection of data, and the fundamental tools of bioinformatics now are increasingly part of the experimental methods themselves. A driving force for life science discovery is turning complex, heterogeneous data into useful, organized information and ultimately into systematized knowledge. This endeavor is simply the classic pathway for all science, Data ⇒ Information ⇒ Knowledge ⇒ Discovery, which earlier in the history of biology required only brainpower and pencil and paper but now requires sophisticated computational technology.

In this chapter, we consider the challenges of information integration in biology from the perspective of researchers using information technology as an integral part of their discovery processes. We also discuss why information integration is so important for the future of biology and why and how the obstacles in biology differ substantially from those in the commercial sector—that is, from the expectations of traditional business integration. In this context, we address features specific to the biological systems and their research approaches. We then discuss the burning issues and unmet needs facing information integration in the life sciences. Specifically, data integration, meta-data specification, data provenance and data quality, ontology, and Web presentations are discussed in subsequent sections. These are the fundamental problems that need to be solved by the bioinformatics community so that modern information technology can have a deeper impact on the progress of biological discovery. This chapter raises the challenges rather than trying to establish specific, ideal solutions for the issues involved.

2.1 THE LIFE SCIENCE DISCOVERY PROCESS

In the last half of the 20th century, a highly focused, hypothesis-driven approach known as reductionist molecular biology gave scientists the tools to identify and characterize molecules and cells, the fundamental building blocks of living systems. To understand how molecules, and ultimately cells, function in tissues, organs, organisms, and populations, biologists now generally recognize that as a community they not only have to continue reductionist strategies for the further elucidation of the structure and function of individual components, but they also have to adopt a systems-level approach in biology. Systems analysis demands not just knowledge of the parts—genes, proteins, and other macromolecular entities—but also knowledge of the connections among these molecular parts and how they work together. In other words, the pendulum of bioscience is now swinging away from reductionist approaches and toward synthetic approaches characteristic of systems biology and of an integrated biology capable of quantitative and/or detailed qualitative predictions. A synthetic or integrated view of biology obviously will depend critically on information integration from a variety of data sources. For example, neuroinformatics includes the anatomical and physiological features of the nervous system, and it must interact with the molecular biological databases to facilitate connections between the nervous system and molecular details at the level of genes and proteins.1 In phylogenetics and evolutionary biology, comparative genomics is making a new impact on evolutionary studies. Over the past two decades, research in evolutionary biology has come to depend on sequence comparisons at the gene and protein level, and in the future, it will depend more and more on tracking not just DNA sequences but how entire genomes evolve over time [1]. In ecology there is an opportunity ultimately to study the sequences of all genomes involved in an entire ecological community. We believe integrative bioinformatics will be the backbone of 21st-century life sciences research.

Research discovery and synthesis will be driven by the complex information arising intrinsically from biology itself and from the diversity and heterogeneity of experimental observations. The database and computing activities will need to be integrated to yield a cohesive information infrastructure underlying all of biology. A conceptual example of how biological research has increasingly come to depend on the integration of experimental procedures and computation activities is illustrated in Figure 2.1. A typical research project may start with a collection of known or unknown genomic sequences (see Genomics in Figure 2.1). For unknown sequences, one may conduct a database search for similar sequences or use various gene-finding computer algorithms or genome comparisons to predict the putative genes. To probe expression profiles of these genes/sequences, high-density microarray gene expression experiments may be carried out. The analysis of expression profiles of up to 100,000 genes can be conducted experimentally, but this requires powerful computational correlation tools. Typically, the first level of experimental data stream output for a microarray experiment (laboratory information management system [LIMS] output) is a list of genes/sequences/identification numbers and their expression profile. Patterns or correlations within the massive data points are not obvious by manual inspection. Different computational clustering algorithms are used simultaneously to reduce the data complexity and to sort out relationships among genes/sequences according to their expression levels or changes in expression levels.


FIGURE 2.1 Information-driven discovery.

These clustering techniques, however, have to deal with a high-dimensional data space in which the possibility of correlation by chance is high; a set of genes that cluster together does not necessarily participate in a common biological process. To back up the clustering results, one may proceed to proteomics (see Figure 2.1) to connect the gene expression results with available protein expression patterns, known protein structures and functions, and protein–protein interaction data. Ultimately, the entire collection of interrelated macromolecular information may be considered in the context of systems biology (see Figure 2.1), which includes analyses of protein or metabolic pathways, regulatory networks, and other, more complex cellular processes. The connections and interactions among areas of genomics, gene expression profiles, proteomics, and systems biology depend on the integration of experimental procedures with database searches and the application of computational algorithms and analysis tools.
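To make the clustering step concrete, the following is a minimal sketch that groups a small synthetic expression matrix with average-linkage hierarchical clustering. The data, the distance metric, the number of clusters, and the use of SciPy are illustrative assumptions, not a description of any particular microarray pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic expression matrix: rows are genes, columns are conditions.
# A real LIMS output would carry gene identifiers alongside these values.
rng = np.random.default_rng(0)
expression = rng.normal(size=(100, 8))      # 100 genes x 8 conditions

# Average-linkage hierarchical clustering on correlation distance,
# a common first pass for grouping putatively co-expressed genes.
tree = linkage(expression, method="average", metric="correlation")

# Cut the tree into an arbitrarily chosen 5 clusters.
labels = fcluster(tree, t=5, criterion="maxclust")
for cluster_id in range(1, 6):
    members = np.where(labels == cluster_id)[0]
    print(f"cluster {cluster_id}: {len(members)} genes")
```

As the text emphasizes, genes grouped this way are only candidates for shared biology; the clustering itself carries no functional evidence.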

As one moves up in the degree of complexity of the biological processes under study, our understanding at each level depends in a significant way on the levels beneath it. In every step, database searches and computational analysis of the data are an integral part of the discovery process. As we choose complex systems for study, experimentally generated data must be combined with data derived from databases and with computationally derived models or simulations for the best interpretation. On the other hand, modeling and simulation of protein-protein interactions, protein pathways, genetic regulatory networks, biochemical and cellular processes, and normal and disease physiological states are in their infancy and need more experimental observations to fill in missing quantitative details before they mature. In this close interaction, the boundaries between experimentally generated data and computationally generated data are blurring. Accelerating progress now requires multidisciplinary teams conducting integrated approaches, and in silico discovery, that is, experiments carried out in a computer, is fully complementary to traditional wet-laboratory experiments. One could say that an information infrastructure, coupled with continued advances in experimental methods, will facilitate computing an understanding of biology.

2.2 AN INFORMATION INTEGRATION ENVIRONMENT FOR LIFE SCIENCE DISCOVERY

Biological data sources represent the collective research efforts and products of the life science communities throughout the world. The growth of the Internet and the availability of biological data sources on the Web have opened up a tremendous opportunity for biologists to ask questions and solve problems in unprecedented ways. To harness these community resources and assemble all available information to investigate specific biological problems, biologists must be able to find, extract, merge, and synthesize information from multiple, disparate sources. Convergence of biology, computer science, and information technology (IT) will accelerate this multidisciplinary endeavor. The basic needs are:

1. On-demand access to and retrieval of the most up-to-date biological data and the ability to perform complex queries across multiple heterogeneous databases to find the most relevant information

2. Access to the best-of-breed analytical tools and algorithms for extraction of useful information from the massive volume and diversity of biological data

3. A robust information integration infrastructure that connects various computational steps involving database queries, computational algorithms, and application software

This multidisciplinary approach demands close collaboration and clear understanding between people with extremely different domain knowledge and skill sets. The IT professionals provide the knowledge of syntactic aspects of data, databases, and algorithms, such as how to search, access, and retrieve relevant information, manage and maintain robust databases, develop information integration systems, model biological objects, and support a user-friendly graphical interface that allows the end user to view and analyze the data. The biologists provide knowledge of biological data, semantic aspects of databases, and scientific algorithms. Interpreting biological relationships requires an understanding of the biological meaning of the data beyond the physical file or table layout. Particularly, the effective usage of scientific algorithms or analytical tools (e.g., sequence alignment, protein structure prediction, and other analysis software) depends on having a working knowledge of the computer programs and of biochemistry, molecular biology, and other scientific disciplines. Before we can discuss biological information integration, we need first to consider the specific nature of biological data and data sources.

2.3 THE NATURE OF BIOLOGICAL DATA

The advent of automated and high-throughput technologies in biological research and the progress of the genome projects have led to an ever-increasing rate of data acquisition and exponential growth in data volume. However, the most striking feature of data in the life sciences is not its volume but its diversity and variability.

2.3.1 Diversity

Biological data sets are intrinsically complex and are organized in loose hierarchies that reflect our understanding of complex living systems, ranging from genes and proteins, to protein-protein interactions, biochemical pathways and regulatory networks, to cells and tissues, organisms and populations, and finally the ecosystems on Earth. This hierarchy spans many orders of magnitude in time and space and poses challenges in informatics, modeling, and simulation equivalent to or beyond those of any other scientific endeavor. A notional description of the vast scale of complexity, population, time, and space in biological systems is given in Figure 2.2 [2]. Reflecting the complexity of biological systems, the types of biological data are highly diverse. They range from the plain text of laboratory records and literature publications, nucleic acid and protein sequences, three-dimensional atomic structures of molecules, and biomedical images at different levels of resolution, to various experimental outputs from technologies as diverse as microarray chips, gels, light and electron microscopy, nuclear magnetic resonance (NMR), and mass spectrometry. The abscissa (horizontal axis) in Figure 2.2 shows time scales ranging from femtoseconds to eons, representing processes in living systems from chemical and biochemical reactions, to cellular events, to evolution. The ordinate (vertical axis) shows the numerical scale: the number of atoms involved in molecular biology, the number of macromolecules in cellular biology, the number of cells in physiological biology, and the number of organisms in population biology. The third dimension, indicated by rectangles, illustrates the hierarchical nature of biology from subcellular structures to ecosystems. The fourth dimension, indicated by ovals, represents the current state of computational biology in modeling and simulation of biological systems.


FIGURE 2.2 Notional representation of the vast and complex biological world.

2.3.2 Variability

Individuals and species vary tremendously, so naturally biological data do as well. For example, the structure and function of organs vary across age and gender, between normal and various disease states, and across species. Essentially all features of biology exhibit some degree of variability. Biological research is in an expanding phase, and many fields of biology are still at a developing stage. Data for these systems are incomplete and very often inconsistent. This presents a great challenge in modeling biological objects.

2.4 DATA SOURCES IN LIFE SCIENCE

In response to current advances in technology and research scope, massive amounts of data are routinely deposited in public and private databases. In parallel, there is a proliferation of computational algorithms and analysis tools for data analysis and visualization. Because most databases are accompanied by specific computational algorithms or tools for analysis and presentation, and vice versa, we use the term data source to refer to a database, a computational analysis tool, or both. There are more than 1000 life science data sources scattered over the Internet (see the Biocatalog and the Public Catalog of Databases), and these data sources vary widely in scope and content. Finding the right data sources alone can be a challenge. Searching for relevant information largely relies on Web information retrieval systems or on published catalog services. Each January, the journal Nucleic Acids Research publishes an update of its molecular biology database collection; the current issue lists 335 molecular biology databases alone [3]. Various Web sites provide catalogs of and links to biological data sources (see the Biocatalog and the Public Catalog of Databases cited previously). In addition to the public sources, there are numerous private, proprietary data sources created by biotechnology or pharmaceutical companies.

The scope of the public data sources ranges from comprehensive, multidisciplinary community informatics centers, supported by public funds and sustained by teams of specialists, to small boutique data sources maintained by individual investigators. The content of databases varies greatly, reflecting the broad disciplines and sub-disciplines across the life sciences, from molecular biology and cell biology, to medicine and clinical trials, to ecology and biodiversity. A sampling of public biological databases is given in the Appendix.

2.4.1 Biological Databases Are Autonomous

Biological data sources represent a loose collection of autonomous Web sites, each with its own governing body and infrastructure. These sites vary in almost every possible respect, including computer platform, access methods, and data management system. Much of the available biological data exists in legacy systems with no structured information management. These data sources are inconsistent at the semantic level, and more often than not, there is no adequate attendant meta-data specification. Until recently, biological databases were not designed for interoperability [4].

2.4.2 Biological Databases Are Heterogeneous in Data Formats

Data elements in public or proprietary databases are stored in heterogeneous data formats ranging from simple files to fully structured database systems that are often ad hoc, application-specific, or vendor-specific. For example, scientific literature, images, and other free-text documents are commonly stored in unstructured or semi-structured formats (plain text files, HTML or XML files, binary files). Genomic, microarray gene expression, and proteomic data are routinely stored in conventional spreadsheet programs or in structured relational databases (Oracle, Sybase, DB2, Informix). Major data depository centers have implemented various data formats for their operations; the National Center for Biotechnology Information (NCBI) has adopted the highly nested data system ASN.1 (Abstract Syntax Notation One) for the general storage of gene, protein, and genomic information [5]; the United States Department of Agriculture (USDA) Plant Genome Data and Information Center has adopted the object-oriented ACeDB (A C. elegans Data Base) data management system and interface [6].

2.4.3 Biological Data Sources Are Dynamic

In response to the advance of biological research and technology, the overall features of biological data sources are subject to continuous change, including changes in data content and data schema. New databases spring up at a rapid rate, and older databases disappear.

2.4.4 Computational Analysis Tools Require Specific Input/Output Formats and Broad Domain Knowledge

Computational software packages often require specific input and output data formats and graphic display of results, which pose serious compatibility and interoperability issues. The output of one program is not readily suitable as direct input for the next program or for a subsequent database search. Development of a standard data exchange format such as XML will alleviate some of the interoperability issues.

Understanding application semantics and the proper usage of computer software is a major challenge. Currently, there are more than 500 software packages or analysis tools for molecular biology alone (reviewed in the Biocatalog at the European Bioinformatics Institute [EBI] Web site given previously). These programs are extremely diverse, ranging from nucleic acid and protein sequence analysis, genome comparison, protein structure prediction, biochemical pathway and genetic network analysis, and construction of phylogenetic trees, to modeling and simulation of biological systems and processes. These programs, developed to solve specific biological problems, rely on knowledge from other domains, such as computer science, applied mathematics, statistics, chemistry, and physics. For example, protein folding can be approached using ab initio prediction based on first principles (physics) or using knowledge-based (computer science) threading methods [7]. Many of these software packages, particularly those available through academic institutions, lack adequate documentation describing the algorithm, functionality, and constraints of the program. Given the multidisciplinary nature and the scope of domain knowledge, proper usage of a scientific analysis program requires significant (human) expertise. It is a daunting task for end users to choose and evaluate the proper software programs for their analyses and then to understand and interpret the results.

2.5 CHALLENGES IN INFORMATION INTEGRATION

With the expansion of the biological data sources available across the World Wide Web, integration is a new, major challenge facing researchers and institutions that wish to explore these rich deposits of information. Data integration is an ongoing, active area in the commercial world. However, information integration in biology must consider the characteristics of the biological data and data sources discussed in the previous two sections (2.3 and 2.4): (1) diverse data are stored in autonomous data sources that are heterogeneous in data formats, data management systems, data schemas, and semantics; (2) analysis of biological data requires both database query activities and proper usage of computational analysis tools; (3) the relevant knowledge spans a broad spectrum of domains that cut across traditional biological disciplines.

For a typical research project, a user must be able to merge data derived from multiple, diverse, heterogeneous sources freely and readily. As illustrated in Figure 2.3, the LIMS output from microarray gene expression experiments must be interpreted and analyzed in the context of the information and tools available across the Internet, including genomic data, literature, clinical data, analysis algorithms, etc. In many cases, data retrieved from several databases may be selected, filtered, and transformed to prepare input data sets for particular analytic algorithms or applications. The output of one program may be submitted as input to another program and/or to another database search. The integration process involves an intricate network of multiple computational steps and data flow. Information integration in biology faces challenges at the technology level for data integration architectures and at the semantic level for meta-data specification, maintenance of data provenance and accuracy, ontology development for knowledge sharing and reuse, and Web presentations for communication and collaboration.


FIGURE 2.3 Integration of experimental data, data derived from multiple database queries, and applications of scientific algorithms and computational analysis tools (Refer to the Appendix for the definitions of acronyms).

2.5.1 Data Integration

First-generation bioinformatics solutions for data integration employ a series of non-interoperable and non-scalable quick fixes to translate data from one format into another. This means writing programs, usually in a scripting language such as Perl, to access, parse, extract, and transform the necessary data for particular applications. Writing a translation program requires intensive coding effort and knowledge of the data and structures of the source databases. These ad hoc point-to-point solutions are very inefficient and are not scalable to the large number of data sources to be integrated. This is dubbed the N² factor because pairwise connection of N data sources requires N(N − 1)/2 translation programs. If one particular data source changes its format, all of the programs involved with that data source must be updated. Updates are inevitable because changes in Web page services and schemas are very common for biological data sources.
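The arithmetic behind the N² factor is easy to make concrete. The sketch below simply counts the translators needed for point-to-point connection versus translation into a single shared format (a hypothetical hub model used only for comparison):

```python
def pairwise_translators(n: int) -> int:
    """Point-to-point translators needed to connect n data sources pairwise."""
    return n * (n - 1) // 2

def hub_translators(n: int) -> int:
    """Translators needed if every source maps to one common data model."""
    return n

for n in (5, 20, 100, 1000):
    print(f"{n:>5} sources: {pairwise_translators(n):>7} pairwise "
          f"vs {hub_translators(n):>4} to a common model")
```

For 1000 sources the pairwise count is 499,500 programs, which is why the quick-fix approach collapses as the number of sources grows.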

The second generation of data integration solutions provides a more structured environment for code reuse and for flexible, scalable, robust integration. Over the past decade, enormous effort and progress have been made on many data integration systems. They can be roughly divided into three major categories according to access and architecture: the data warehousing approach, the distributed or federated approach, and the mediator approach. Regardless of the category, the following fundamental functions or features are desirable for a robust data integration system:

1. Accessing and retrieving relevant data from a broad range of disparate data sources

2. Transforming the retrieved data into a designated data model for integration

3. Providing a rich common data model for abstracting retrieved data and presenting integrated data objects to the end user applications

4. Providing a high-level expressive language to compose complex queries across multiple data sources and to facilitate data manipulation, transformation, and integration tasks

5. Managing query optimization and other complex issues

The Data Warehouse Approach

The data warehouse approach assembles data sources into a centralized system with a global data schema and an indexing system for integration and navigation. The data warehouse world is dominated by relational database management systems (RDBMS), which offer the advantage of a mature and widely accepted database technology and a high-level standard query language (SQL) [8]. These systems have proven very successful in the commercial enterprise, health care, and government sectors for resource management tasks such as payroll, inventory, and records. They require reliable operation and maintenance, and the underlying databases operate in a controlled environment, are fairly stable, and are structured. Biological data sources are very different: they are much more dynamic and unpredictable, and few of the public biological data sources use structured database management systems. Given the sheer volume of data and the broad range of biological databases, it would require substantial effort to develop any monolithic data warehouse encompassing diverse biological information such as sequence and structure and the various functions of biochemical pathways and genetic polymorphisms. As the number of databases in a data warehouse grows, the costs of storage, maintenance, and updating become prohibitive. A data warehouse has the advantage that the data are readily accessed without Internet delays or network bandwidth limitations. Rigorous data cleansing to remove potential errors, duplications, and semantic inconsistencies can be performed before data enter the warehouse. Thus, limited data warehouses are popular solutions in the life sciences for data mining of large databases, in which carefully prepared data sets are critical for success [9].
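To make the warehouse idea concrete, the toy sketch below loads records from two hypothetical sources into one global relational schema using Python's built-in sqlite3 module. Everything here (the schema, the source names, the records) is invented for illustration; a production warehouse would add cleansing, indexing, and update pipelines.

```python
import sqlite3

# One global schema for data pulled from several (hypothetical) sources.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE gene (
        accession  TEXT PRIMARY KEY,
        symbol     TEXT,
        organism   TEXT,
        source_db  TEXT,   -- which upstream database the row came from
        loaded_on  TEXT
    )
""")

# Records as they might look after extraction and cleansing from two sources.
rows = [
    ("X00001", "tp53", "Homo sapiens",  "source_A", "2002-06-01"),
    ("Y00002", "cdc2", "S. cerevisiae", "source_B", "2002-06-01"),
]
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?, ?)", rows)

# A warehouse query runs locally, with no network delay to the original sites.
for row in conn.execute("SELECT symbol, organism FROM gene WHERE source_db = 'source_A'"):
    print(row)
```

The trade-off the text describes is visible even in this toy: the query is fast and local, but every upstream change must be re-extracted and reloaded into the central schema.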

The Federation Approach

The distributed or federated integration approaches do not require a centralized persistent database, and thus the underlying data sources remain autonomous. Federated systems maintain a common data model and rely on schema mapping to translate heterogeneous source database schemas into the target schema for integration. A data dictionary is used to manage the various schema components. In the life science arena, in which schema changes in data sources are frequent, the maintenance of a common schema for integration can be costly in large federated systems. As database technology progresses from relational toward object-oriented technology [10], many distributed integration solutions employ object-oriented paradigms to encapsulate the heterogeneity of the underlying life science data sources. These systems typically rely on client–server architectures and software platforms or interfaces such as the Common Object Request Broker Architecture (CORBA), an open standard from the Object Management Group (OMG) that facilitates interoperation of disparate components [11, 12].

The Mediator Approach

The most flexible data integration designs adopt a mediator approach that introduces an intermediate processing layer to decouple the underlying heterogeneous distributed data sources and the client layer of end users and applications. The mediator layer is a collection of software components performing the task of data integration. The concept was first introduced by Wiederhold to provide flexible modular solutions for integration of large information systems with multiple knowledge domains [13, 14].

Most database mediator systems use a wrapper layer to handle the tasks of data access, data retrieval, and data translation. The wrappers access specified data sources, extract selected data, and translate source data formats into the common data model designated for the integration system.

The mediator layer performs the core function of data transformation and integration and communicates with the wrappers and the user application layer. The integration system provides an internal common data model for abstraction of incoming data derived from heterogeneous data sources. Thus, the internal data model must be sufficiently rich to accommodate the various data formats of existing biological data sources, which may include unstructured text files, semi-structured XML and HTML files, and structured relational, object-oriented, and nested complex data models. In addition, the internal data model facilitates structuring integrated biological objects for presentation to the user application layer. The flat, tabular form of the relational model encounters severe difficulty in modeling complex and hierarchical biological systems and concepts. XML and other object-oriented models are more natural for modeling biological systems and are gaining popularity in the community.

In addition to the core integration function, the mediator layer also provides services such as filtering, managing meta-data, and resolving semantic inconsistency in source databases. Ideally, instead of relying on low-level programming efforts, a full integration system supports a high-level query language for data transformation and manipulation. This would greatly facilitate the composition of complex queries across multiple data sources and the management of architecture layers and software components.

The advantages of the mediator approach are its flexibility, scalability, and modularity. The heterogeneity and dynamic nature of the data sources are isolated from the end-user applications. Wrappers can readily handle data source schema changes. New data sources can be added to the system by simply adding new wrappers. Scientific analytical tools are simply treated as data sources via wrappers and can be seamlessly integrated with database queries. This approach is most suitable for scientific investigations that need to access the most up-to-date data and issue queries against multiple heterogeneous data sources on demand.
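The division of labor between wrappers and a mediator can be sketched in a few lines of Python. Everything in this sketch (the source formats, the simplified common record model, the keyword filtering) is hypothetical and meant only to show where each responsibility sits in the architecture described above.

```python
from typing import Dict, List

Record = Dict[str, str]   # the mediator's common data model, radically simplified

class FlatFileWrapper:
    """Wraps a tab-delimited source and translates rows into the common model."""
    def __init__(self, lines: List[str]):
        self.lines = lines
    def fetch(self, keyword: str) -> List[Record]:
        out = []
        for line in self.lines:
            acc, desc = line.rstrip("\n").split("\t")
            if keyword.lower() in desc.lower():
                out.append({"id": acc, "description": desc, "origin": "flatfile_source"})
        return out

class DictWrapper:
    """Wraps an already-structured (e.g., relational or XML) source."""
    def __init__(self, table: Dict[str, str]):
        self.table = table
    def fetch(self, keyword: str) -> List[Record]:
        return [{"id": k, "description": v, "origin": "structured_source"}
                for k, v in self.table.items() if keyword.lower() in v.lower()]

class Mediator:
    """Fans a query out to all wrappers and integrates the translated results."""
    def __init__(self, wrappers):
        self.wrappers = wrappers
    def query(self, keyword: str) -> List[Record]:
        merged: List[Record] = []
        for wrapper in self.wrappers:
            merged.extend(wrapper.fetch(keyword))   # the sources stay autonomous
        return merged

mediator = Mediator([
    FlatFileWrapper(["P001\tputative kinase\n", "P002\tmembrane transporter\n"]),
    DictWrapper({"Q100": "serine/threonine kinase", "Q200": "ribosomal protein"}),
])
print(mediator.query("kinase"))
```

Adding a new source means adding one wrapper; the mediator and the user-facing query are untouched, which is the modularity argument made above.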

There are many flavors of mediator approaches in life science domains, which differ in database technologies, implementations, internal data models, and query languages. The Kleisli system provides an internal, nested, complex data model and a high-power query and transformation language for data integration [15–17]. The K2 system shares many design principles with Kleisli in supporting a complex data model, but it adopts more object-oriented features [18, 19] (see Chapter 8). The Object-Protocol Model (OPM) supports a rich object model and a global schema for data integration [20, 21]. The IBM DiscoveryLink middleware system is rooted in relational database technology and supports full SQL3 [22, 23] (see Chapter 11). The Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS) system provides a global ontology to facilitate queries across multiple data sources [24, 25] (see Chapter 7). The Stanford-IBM Manager of Multiple Information Sources (TSIMMIS) is a mediation system for information integration with its own data model, the Object-Exchange Model (OEM), and query language [26].

2.5.2 Meta-Data Specification

Meta-data is data describing data, that is, data that provides documentation on other data managed within an application or environment.

In a structured database environment, the meta-data are formally included in the data schema and type definition. However, few of the biomedical databases use commercial, structured database management systems. The majority of biological data are stored and managed in collections of flat files in which the structure and meaning of the data are not well documented. Furthermore, most biological data are presented to the end users as loosely structured Web pages, even with those databases that have underlying structured database management systems (DBMS).

Many biological data sources provide keyword-search querying interfaces with which a user can input specified Boolean combinations of search terms to access the underlying data. Formulating effective Boolean queries requires domain expertise and knowledge of the contents and structure of the databases. Without meta-data specification, users are likely to formulate queries that return no answers or return an excessively large number of irrelevant answers. In such unstructured or semi-structured data access environments, the introduction of meta-data into databases across the Web would be important for information gathering and would enhance the user's ability to capture the relevant information independent of data format.
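As a small illustration of how meta-data sharpens a keyword query, the sketch below issues a broad and a field-qualified Boolean query against PubMed. It assumes the Biopython package (Bio.Entrez) and network access to NCBI's E-utilities; the query terms themselves are arbitrary examples, not recommendations.

```python
from Bio import Entrez

Entrez.email = "your.address@example.org"   # NCBI asks callers to identify themselves

# An unqualified query: matches the keywords anywhere in each record.
broad = Entrez.read(Entrez.esearch(db="pubmed", term="p53 microarray"))

# A field-qualified Boolean query: the bracketed qualifiers are exactly the kind
# of meta-data that lets a search target the intended part of each record.
narrow = Entrez.read(Entrez.esearch(
    db="pubmed",
    term='p53[Title] AND microarray[Title] AND "gene expression"[MeSH Terms]',
))

print(broad["Count"], "hits without field qualifiers")
print(narrow["Count"], "hits with field qualifiers")
```

The contrast in hit counts is the practical face of the problem described above: without knowing which fields exist and what they mean, a user cannot scope a query effectively.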

The need for adequate meta-data specification for scientific analytical algorithms and software tools is particularly acute. Very little attention has been given to meta-data specification in existing programs, especially those available in the public domain from academic institutions. In general, they lack adequate documentation on algorithms, data formats, functionality, and constraints. This can lead to misunderstanding of computational tools by end users. For example, sequence comparison programs are the most commonly used tools for searching databases for similar sequences, and there are many such programs in the public and private domains. The Basic Local Alignment Search Tool (BLAST) uses heuristic approximation algorithms to search databases for related sequences [27]. BLAST has the advantage of speed in searching very large databases and is widely, sometimes indiscriminately, used in the molecular biology community. The BLAST program trades sensitivity for speed and may not be the best choice for all purposes. The Smith–Waterman dynamic programming algorithm, which guarantees an optimal local sequence alignment, is more sensitive in finding distantly related sequences [28]. However, it requires substantially more computation and is much slower (50-fold or more). Recently, a number of other programs have been developed using hidden Markov models, Bayesian statistics, and neural networks for pattern matching [29]. In addition to algorithmic differences, these programs vary in accuracy, statistical scoring systems, sensitivity, and performance. Without adequate meta-data specification, it is a challenge for users to choose the most appropriate program for their application, let alone to use the optimal parameters, interpret the results properly, and evaluate the statistical significance of the search results.
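For readers who want to see the algorithmic contrast concretely, the following is a bare-bones Smith–Waterman local alignment score in Python. The match, mismatch, and gap values are arbitrary illustrations; real tools add affine gap penalties, substitution matrices, and careful statistics.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between sequences a and b (dynamic programming)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0,                  # a local alignment may restart anywhere
                          diag,
                          H[i - 1][j] + gap,  # gap in b
                          H[i][j - 1] + gap)  # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("ACACACTA", "AGCACACA"))
```

The full dynamic programming table grows with the product of the two sequence lengths, which is exactly why the exhaustive method is so much slower than BLAST's heuristic seeding on database-scale searches.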

In summary, with the current proliferation of biological data sources over the Internet and new data sources constantly springing up around the world, there is an urgent need for better meta-data specification to enhance our ability to find relevant information across the Web, to understand the semantics of scientific application tools, and to integrate information. Ultimately, the communication and sharing of biological data will follow the concept and development of the Semantic Web [30].2 The Resource Description Framework (RDF) and its schema, developed as part of the Semantic Web effort, offer a general model for meta-data applications so that data sources on the Web can be linked and understood by both humans and computers.3
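The RDF idea itself is small: every statement about a resource is a (subject, predicate, object) triple. The fragment below models a few invented meta-data statements that way in plain Python, without committing to any particular vocabulary or RDF library; all names and prefixes are hypothetical.

```python
# Minimal triple store: each meta-data statement is (subject, predicate, object).
triples = [
    ("db:record/AF0001", "meta:derivedFrom",   "db:GenBank"),
    ("db:record/AF0001", "meta:analyzedWith",  "tool:BLASTP"),
    ("db:record/AF0001", "meta:organism",      "Homo sapiens"),
    ("tool:BLASTP",      "meta:algorithmType", "heuristic local alignment"),
]

def match(s=None, p=None, o=None):
    """Return every triple compatible with the given pattern (None is a wildcard)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# 'Which records were analyzed with which tools?'
print(match(p="meta:analyzedWith"))
```

Because statements from different sites share the same triple form, machines can merge and query them without knowing each site's file layout, which is the point of the RDF model.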

2.5.3 Data Provenance and Data Accuracy

As databases move to the next stage of development, more and more secondary databases with value-added annotations will be developed. Many data providers will also become data consumers. Data provenance and data accuracy become major issues as the boundaries among primary data generated experimentally, data generated through the application of scientific analysis programs, and data derived from database searches become blurred. When users find and examine a set of data from a given database, they have to be concerned about where the data came from and how the data were generated.

One example of this type of difficulty can be seen with the genome annotation pipeline. The raw experimental output of DNA sequences needs to be characterized and analyzed to turn into useful information. This may involve the application of sequence comparison programs or a sequence similarity search against existing sequence databases to find similar sequences that have been studied in other species to infer functions. For genes/sequences with unknown function, gene prediction programs can be used to identify open reading frames, to translate DNA sequences into protein sequences, and to characterize promoter and regulatory sequence motifs. For genes/sequences that are known, database searches may be performed to retrieve relevant information from other databases for protein structure and protein family classification, genetic polymorphism and disease, literature references, and so on. The annotation process involves computational filtering, transforming, and manipulating of data, and it frequently requires human efforts in correction and curation.
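Computationally, an annotation pipeline of the kind just described is a chain of searches and programs whose outputs feed the next step. The skeleton below is purely illustrative: every function is a stand-in stub for a real tool or database search, and the names and return values are invented.

```python
# Every function below is a stub standing in for a real tool or database search.
def similarity_search(seq):             # e.g., a BLAST-like database search
    return [{"accession": "P99999", "function": "putative kinase"}] if "ATG" in seq else []

def predict_open_reading_frames(seq):   # e.g., a gene-prediction program
    return [(i, i + 9) for i in range(len(seq) - 9) if seq[i:i + 3] == "ATG"][:3]

def cross_reference(db_name, seq):      # e.g., a lookup in another database
    return [f"{db_name}:dummy-record"]

def annotate(dna_sequence: str) -> dict:
    """Hypothetical pipeline: similarity search, then prediction, then lookups."""
    annotation = {"sequence": dna_sequence, "evidence": []}
    hits = similarity_search(dna_sequence)
    if hits:
        annotation["putative_function"] = hits[0]["function"]
        annotation["evidence"].append({"step": "similarity_search", "hits": hits})
    else:
        annotation["predicted_orfs"] = predict_open_reading_frames(dna_sequence)
        annotation["evidence"].append({"step": "gene_prediction"})
    for db in ("protein_structure_db", "polymorphism_db", "literature_db"):
        annotation["evidence"].append({"step": f"lookup:{db}",
                                       "records": cross_reference(db, dna_sequence)})
    return annotation

print(annotate("GGATGCCATTGAAC")["evidence"][0]["step"])
```

Even this toy keeps an evidence list alongside the annotation, which is the habit that makes the provenance questions raised below answerable at all.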

Thus, most curated databases contain data that have been processed with specific scientific analysis programs or extracted from other databases. Describing the provenance of some piece of data is a complex issue. These annotated databases offer rich information and have enormous value, yet they often fail to keep an adequate description of the provenance of the data they contain [31].

With increasingly annotated content, databases become interdependent. Errors caused by data acquisition and handling in one database can be propagated quickly into other databases, or data updated in one database may not be immediately propagated to other related databases. At the same time, differences in annotations of the same object may arise in different databases because of the application of different scientific algorithms or because of different interpretations of results.

Scientific analysis programs are well known to be extremely sensitive to input datasets and the parameters used in computation. For example, a common practice in annotation of an unknown sequence is to infer that similar sequences share common biochemical function or a common ancestor in evolution. The use of different algorithms and different cut-off values for similarity could potentially yield different results for remotely related sequences. Other forms of evidence are required to resolve the inconsistency. This type of biological reasoning also points to another problem. Biological conclusions derived by inference in one database will be propagated and may no longer be reliable after numerous transitive assertions.

Data provenance touches on the issues of data accuracy and reliability. It is critical that databases provide meta-data specifications on how their data are generated and derived. This has to be as rigorous as the traditional standards for experimental data, for which the experimental methods, conditions, and materials are provided. Similarly, computationally generated data should be documented with the computational conditions involved, including algorithms, input data sets, parameters, constraints, and so on.
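In practice, the documentation called for here can be as simple as a structured provenance record attached to each derived data item. The fields below are one possible layout, assumed for illustration rather than drawn from any standard, and every value is a placeholder.

```python
import json

provenance = {
    "derived_item": "predicted function of sequence AF000123 (hypothetical accession)",
    "generated_by": {
        "program": "similarity search tool",     # e.g., a BLAST-family program
        "version": "2.x",
        "parameters": {"substitution_matrix": "BLOSUM62", "e_value_cutoff": 1e-5},
    },
    "inputs": [
        {"database": "protein sequence database", "release": "release recorded here"},
        {"query": "AF000123"},
    ],
    "derived_from_databases": ["upstream curated database (name and version)"],
    "curated_by": "curator initials and date of manual review",
    "date_generated": "2002-06-01",
}

print(json.dumps(provenance, indent=2))
```

Recording the algorithm, parameters, and input releases is the computational analogue of reporting experimental methods, conditions, and materials.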

2.5.4 Ontology

On top of the syntactic heterogeneity of data sources, one of the major stumbling blocks in information integration is at the semantic level. In naming and terminology alone, there are inconsistencies across different databases and within the same database. In the major literature database MEDLINE, multiple aliases for genes are the norm, rather than the exception. There are cases in which the same name refers to different genes that share no relationship with each other. Even the term gene itself has different meanings in different databases, largely because it has different meanings in various scientific disciplines; the geneticists, the molecular biologists, and the ecologists have different concepts at some levels about genes.

The naming confusion partly stems from the isolated, widely disseminated nature of life science research. At the height of molecular cloning of genes in the 1980s and 1990s, the research group that cloned a new gene had the privilege of naming it. Very often, laboratories working on very different organisms or biological systems independently cloned genes that turned out to encode the same protein. Consequently, various names for the same gene have proliferated in the published scientific literature and in databases. Biological scientists have grown accustomed to the naming differences. This becomes an ontology issue when information and knowledge are represented in electronic form, because of the necessity of communication between humans and computers and between computers. For the biological sciences community, the idea and the use of the term ontology are relatively new, and they generate controversy and confusion in discussions.

What Is an Ontology?

The term ontology was originally a philosophical term that referred to “the subject of existence.” The computer science community borrowed the term ontology to refer to a “specification of a conceptualization” for knowledge sharing in artificial intelligence [32]. An ontology is defined as a description of concepts and relationships that exist among the concepts for a particular domain of knowledge. In the world of structured information and databases, ontologies in life science provide controlled vocabularies for terminology as well as specifying object classes, relations, and functions. Ontologies are essential for knowledge sharing and communications across diverse scientific disciplines.

Throughout the history of the field, the biology community has made a continuous effort to strive for consensus in classifications and nomenclatures. The Linnaean system for naming species and organisms in taxonomy is one of the oldest ontologies. The nomenclature committees of the International Union of Pure and Applied Chemistry (IUPAC) and the International Union of Biochemistry and Molecular Biology (IUBMB) make recommendations on organic, biochemical, and molecular biology nomenclature, symbols, and terminology. The National Library of Medicine's Medical Subject Headings (MeSH) provides the most comprehensive controlled vocabulary for biomedical literature and clinical records. SNOMED International, a division of the College of American Pathologists, oversees the development and maintenance of a comprehensive, multi-axial controlled terminology for medicine and clinical information known as the Systematized Nomenclature of Medicine (SNOMED).

Development of standards is and always has been complex and contentious because getting agreement is a long and slow process. The computer and IT communities dealt with software standards long before the life science community did. Recently, the Object Management Group (OMG), a well-established organization in the IT community, formed a Life Sciences Research (LSR) group to improve communication and interoperability among computational resources in the life sciences.4 The LSR group uses the OMG technology adoption process to standardize models and interfaces for software tools, services, frameworks, and components in life sciences research.

Because of its longer history and its diverse scientific disciplines and constituents, developing standards in the life science community is harder than doing so in the information technology community. Besides the great breadth of academic and research communities in the life sciences, some fields of biology are a century or more older than molecular biology. Thus, the problems are both sociological and technological. Standardization further requires a certain amount of stability and certainty in the knowledge content of the field. In contrast, the level, extent, and nature of biological knowledge are still extensively, even profoundly, dynamic. The meaning attached to a term may change over time as new facts related to that term are discovered. So far, attempts to standardize gene names alone have met a tremendous amount of resistance across different biological communities. The HUGO Gene Nomenclature Committee (HGNC), led by the Human Genome Organization (HUGO), has made tremendous progress in standardizing human gene names with the support of the mammalian genetics community [33]. However, the attempt to expand the naming standard across other species has turned out to be more difficult [34]. Researchers working on different organisms or in different fields have their own established naming usages, and it takes effort to convert to a new set of standards.

An ontology is domain-knowledge specific and context dependent. For example, the term vector differs (not surprisingly or problematically) in meaning between its usage in biology and in the physical sciences, as in a mathematical vector. However, within biology, the specific meaning of a term also can be quite different: molecular biologists use vector to mean a vehicle, as in a cloning vector, whereas parasitologists use vector to refer to an organism that acts as an agent in the transmission of disease. Thus, the development of ontologies is a community effort, and a successful ontology must have wide endorsement and participation from its users. The ecological and biodiversity communities have made major efforts in developing meta-data standards, a common taxonomy, and a structured vocabulary for their Web sites with the help of the National Science Foundation and other government agencies [35].5 The molecular biology community encompasses a much more diverse collection of sub-disciplines, and for researchers in the molecular biology domain, reaching a community-wide consensus is much harder. To circumvent these issues, there is a flurry of grassroots movements to develop ontologies in specific areas of research such as sequence analysis, gene expression, protein pathways, and so on [36].6 These group or consortium efforts usually adopt a use-case and open-source approach for community input. The ontologies are not meant to be mandatory; instead, they serve as a reference framework for further development. For example, one of the major efforts in molecular biology is the Gene Ontology (GO) consortium, which stems from the annotation projects for the fly genome and the human genome. Its goal is to design a set of structured, controlled vocabularies to describe genes and gene products in organisms [37]. Currently, the GO consortium is focused on building three ontologies, for molecular function, biological process, and cellular component, respectively. These ontologies will greatly facilitate queries across genetic and genome databases. The GO consortium started with a core group from the genome databases for the fruit fly (FlyBase), budding yeast (the Saccharomyces Genome Database, SGD), and mouse (the Mouse Genome Database, MGD), and it is gaining momentum with growing participation from other genome databases. With such a grassroots approach, interactions between different domain ontologies are critical for future development. For example, a brain ontology will inevitably relate to ontologies of other anatomical structures and, at the molecular level, will share ontologies for genes and proteins [38]. A sample collection of ontology resources in life science is listed in the Appendix.
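A structured vocabulary of the GO sort boils down to terms plus typed relationships between them. The toy fragment below, with invented identifiers and a hierarchy only loosely patterned on GO's is_a relationship, shows why such structure makes queries across independently annotated databases possible.

```python
# Each term: identifier -> (name, list of parent identifiers via "is_a").
ontology = {
    "T:0001": ("molecular function", []),
    "T:0002": ("catalytic activity", ["T:0001"]),
    "T:0003": ("kinase activity", ["T:0002"]),
    "T:0004": ("protein kinase activity", ["T:0003"]),
}

# Gene products annotated (hypothetically) in two different databases.
annotations = {
    "flydb:geneA":   ["T:0004"],
    "yeastdb:geneB": ["T:0003"],
}

def ancestors(term_id):
    """All terms reachable by following is_a links upward."""
    seen, stack = set(), [term_id]
    while stack:
        current = stack.pop()
        for parent in ontology[current][1]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Query across both databases: which gene products have any kind of kinase activity?
query_term = "T:0003"
matches = [g for g, terms in annotations.items()
           if any(t == query_term or query_term in ancestors(t) for t in terms)]
print(matches)   # both genes match, despite being annotated at different depths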

A consistent vocabulary is critical in querying across multiple data sources. However, given the diverse domains of knowledge and specialization of scientific disciplines, it is not foreseeable that in the near future a global, common ontology covering broad biological disciplines will be developed. Instead, in biomedical research alone, there will be multiple ontologies for genomes, gene expression, proteomes, and so on. Semantic interoperability is an active area of research in computer science [39]. Information integration across multiple biological disciplines and sub-disciplines would depend on the close collaborations of domain experts and IT professionals to develop algorithms and flexible approaches to bridge the gaps among multiple biological ontologies.

2.5.5 Web Presentations

Much of the biological data is delivered to end users via the Web. Currently, biological Web sites resemble a collection of rival medieval city-states, each with its own design, access methods, query interface, services, and data presentation format [40]. Much of the data retrieval effort in information integration relies on brittle screen-scraping methods to parse and extract data from HTML files. In an attempt to reduce redundancy and share effort, an open-source movement in the bioinformatics community has begun to share scripts for parsing HTML files from popular data sources such as GenBank reports [3], Swiss-Prot reports [41], and so forth.

Recently, the biological IT community has been picking up momentum in adopting the emerging XML technology for biological Web services and for the exchange of data. Many online databases already make their data available in XML format.7 Semi-structured XML supports user-defined tags to hold data, and thus an XML document contains both data and meta-data. The ability of data sources to exchange information in an XML document depends strictly on their sharing a special document known as a Document Type Definition (DTD), which defines the terms (the names of tags) and their data types for the XML document [42]. DTDs therefore serve as data schemas and can be viewed as very primitive ontologies, in which a DTD defines a set of terms but not the relationships between them. XML will ease some of the incompatibility problems of data sources, such as differences in data format. However, semantic interoperability and consistency remain a serious challenge. Given the autonomous nature of life science Web sites, one can envision that the name space of DTDs alone could easily create an alphabet soup of confusing terminology, as encountered in the naming of genes. Recently, there has been a proliferation of XML-based markup languages to represent models of biological objects and to facilitate information exchange within specific research areas, such as the microarray and gene expression markup language,8 the systems biology markup language,9 and the biopolymer markup language.10 Many of these are available through the XML open standards organization.11 However, we caution that the development of such documents must be compatible with existing biological ontologies or be undertaken as a concerted community effort.
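A concrete, entirely invented miniature example makes the point about DTDs: the XML record below carries its tags as meta-data, and the embedded DTD names those tags, but nothing says what a gene element means or how it relates to another site's gene element. The snippet uses only Python's standard library parser, which reads (but does not validate against) the DTD.

```python
import xml.etree.ElementTree as ET

# A tiny, invented record; the embedded DTD defines tag names and nesting only.
document = """<?xml version="1.0"?>
<!DOCTYPE gene [
  <!ELEMENT gene (symbol, organism, product)>
  <!ELEMENT symbol   (#PCDATA)>
  <!ELEMENT organism (#PCDATA)>
  <!ELEMENT product  (#PCDATA)>
]>
<gene>
  <symbol>cdc2</symbol>
  <organism>Schizosaccharomyces pombe</organism>
  <product>cyclin-dependent kinase</product>
</gene>
"""

root = ET.fromstring(document)
# The tags travel with the data, so a consumer can pull fields by name...
print(root.findtext("symbol"), "-", root.findtext("product"))
# ...but nothing in the DTD says whether another site's <gene> means the same thing.
```

This is the sense in which XML solves the syntactic half of the exchange problem while leaving the semantic half, the subject of the ontology discussion above, untouched.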

CONCLUSION

IT professionals and biologists have to work together to address the challenges presented by the inherent complexity and the vast scales of time and space covered by the life sciences. The opportunities for biological science research in the 21st century require a robust, comprehensive information integration infrastructure underlying all aspects of research. As discussed in the previous sections, substantial progress has been made in data integration at the technical and architectural levels. However, data integration at the semantic level remains a major challenge. Before we can seize any of these opportunities, the biology and bioinformatics communities have to overcome the current limitations in meta-data specification, maintenance of data provenance and data quality, consistent semantics and ontologies, and Web presentation. Ultimately, the life science community must embrace the concept of the Semantic Web [30] as a web of data that is understandable by both computers and people. The bio-ontology efforts for the life sciences represent one important step toward this goal. The brave, early efforts to build computational solutions for biological information integration are discussed in subsequent chapters of this book.

REFERENCES

[1] Pennisi, E. Genome Data Shake Tree of Life. Science. 1998;280(no. 5364):672–674.

[2] Wooley, J.C. Trends in Computational Biology. Journal of Computational Biology. 1999;6(no. 3–4):459–474.

[3] Baxevanis, A.D. The Molecular Biology Database Collection: 2002 Update. Nucleic Acids Research. 2002;30(no. 1):1–12.

[4] Karp, P.D. Database Links are a Foundation for Interoperability. Trends in Biotechnology. 1996;14(no. 7):273–279.

[5] Wheeler, D.L., Church, D.M., Lash, A.E., et al. Database Resources of the National Center of Biotechnology Information: 2002 Update. Nucleic Acids Research. 2002;30(no. 1):13–16.

[6] Thierry-Mieg, J., Durbin, R. Syntactic Definitions for the ACeDB Data Base Manager, AceDB—A C. elegans Database. 1992. http://www.genome.cornell.edu/acedocs/syntax.html

[7] Head-Gordon, T., Wooley, J.C. Computational Challenges in Structural Genomics. IBM Systems Journal. 2001;40(no. 2):265–296.

[8] Ullman, J.D., Widom, J. A First Course in Database Systems. Upper Saddle River, NJ: Prentice Hall; 1997.

[9] Resnick, R. Simplified Data Mining. Drug Discovery and Development. October 2000:51–52.

[10] Cattell, R.G.G. Object Data Management: Object-Oriented and Extended Relational Database Systems, rev. ed. Reading, MA: Addison-Wesley; 1994.

[11] Jungfer, K., Cameron, G., Flores, T. EBI: CORBA and the EBI Databases. In: Letovsky S., ed. Bioinformatics: Databases and Systems. Norwell, MA: Kluwer Academic Publishers; 1999:245–254.

[12] Siepel, A.C., Tolopko, A.N., Farmer, A.D., et al. An Integration Platform for Heterogeneous Bioinformatics Software Components. IBM Systems Journal. 2001;40(no. 2):570–591.

[13] Wiederhold, G. Mediators in the Architecture of Future Information Systems. IEEE Computer. 1992;25(no. 3):38–49.

[14] Wiederhold, G., Genesereth, M. The Conceptual Basis for Mediation Services. IEEE Expert, Intelligent Systems and Their Applications. 1997;12(no. 5):38–47.

[15] Davidson, S., Overton, C., Tannen, V., et al. BioKleisli: A Digital Library for Biomedical Researchers. International Journal of Digital Libraries. 1997;1(no. 1):36–53.

[16] Wong, L. Kleisli, A Functional Query System. Journal of Functional Programming. 2000;10(no. 1):19–56.

[17] Chung, S.Y., Wong, L. Kleisli: A New Tool for Data Integration in Biology. Trends in Biotechnology. 1999;17:351–355.

[18] Crabtree, J., Harker, S., Tannen, V., The Information Integration System K2. 1998. http://db.cis.upenn.edu/K2/K2.doc

[19] Davidson, S.B., Crabtree, J., Brunk, B.P., et al. K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources. IBM Systems Journal. 2001;40(no. 2):512–531.

[20] Chen, I-M.A., Markowitz, V.M. An Overview of the Object-Protocol Model (OPM) and OPM Data Management Tools. Information Systems. 1995;20(no. 5):393–418.

[21] Chen, I-M.A., Kosky, A.S., Markowitz, V.M., et al. Constructing and Maintaining Scientific Database Views in the Framework of the Object-Protocol Model. In: Proceedings of the Ninth International Conference on Scientific and Statistical Database Management. New York: IEEE; 1997:237–248.

[22] Haas, L.M., Schwartz, P.M., Kodali, P., et al. DiscoveryLink: A System for Integrated Access to Life Science Data Sources. IBM Systems Journal. 2001;40(no. 2):489–511.

[23] Haas, L.M., Miller, R.J., Niswonger, B., et al. Transforming Heterogeneous Data With Database Middleware: Beyond Integration. IEEE Data Engineering Bulletin. 1999;22(no. 1):31–36.

[24] Paton, N.W., Stevens, R., Baker, P., et al. Query Processing in the TAMBIS Bioinformatics Source Integration System. In: Proceedings of the 11th International Conference on Scientific and Statistical Database Management. New York: IEEE; 1999:138–147.

[25] Stevens, R., Baker, P., Bechhofer, S., et al. TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources. Bioinformatics. 2000;16(no. 2):184–186.

[26] Papakonstantinou, Y., Garcia-Molina, H., Widom, J. Object Exchange Across Heterogeneous Information Sources. In: Proceedings of the IEEE Conference on Data Engineering. New York: IEEE; 1995:251–260.

[27] Altschul, S.F., Gish, W., Miller, W., et al. Basic Local Alignment Search Tool. Journal of Molecular Biology. 1990;215(no. 3):403–410.

[28] Smith, T.F., Waterman, M.S. Identification of the Common Molecular Subsequences. Journal of Molecular Biology. 1981;147(no. 1):195–197.

[29] Mount, D.W. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press; 2001.

[30] Berners-Lee, T., Hendler, J., Lassila, O. The Semantic Web. Scientific American. May 2001;278(no. 5):35–43.

[31] Buneman, P., Khanna, S., Tan, W-C. Why and Where: A Characterization of Data Provenance. In: Van den Bussche, J., Vianu, V., eds. Proceedings of the Eighth International Conference on Database Theory (ICDT). Heidelberg, Germany: Springer-Verlag; 2001:316–330.

[32] Gruber, T.R. A Translation Approach to Portable Ontology Specification. Knowledge Acquisition. 1993;5(no. 2):199–220.

[33] Wain, H.M., Lush, M., Ducluzeau, F., et al. Genew: The Human Gene Nomenclature Database. Nucleic Acids Research. 2002;30(no. 1):169–171.

[34] Pearson, H. Biology’s Name Game. Nature. 2001;411(no. 6838):631–632.

[35] Edwards, J.L., Lane, M.A., Nielsen, E.S. Interoperability of Biodiversity Databases: Biodiversity Information on Every Desk. Science. 2000;289(no. 5488):2312–2314.

[36] Oliver, D.E., Rubin, D.L., Stuart, J.M., et al. Ontology Development for a Pharmacogenetics Knowledge Base. In: Pacific Symposium on Biocomputing. Singapore: World Scientific; 2002:65–76.

[37] Ashburner, M., Ball, C.A., Blake, J.A., et al. Gene Ontology: Tool for the Unification of Biology. Nature Genetics. 2000;25(no. 1):25–29.

[38] Gupta, A., Ludäscher, B., Martone, M.E., Knowledge-Based Integration of Neuroscience Data Source, Proceedings of the 12th International Conference on Scientific and Statistical Database Management (SSDBM). New York: IEEE; 2000;39–52.

[39] Mitra, P., Wiederhold, G., Kersten, M., A Graph-Oriented Model for Articulation of Ontology Interdependencies, Proceedings of the Conference on Extending Database Technology (EDBT). Heidelberg, Germany: Springer-Verlag; 2000;86–100.

[40] Stein, L. Creating a Bioinformatics Nation. Nature. 2002;417(no. 6885):119–120.

[41] Bairoch, A., Apweiler, R. The SWISS-PROT Protein Sequence Database and Its Supplement TrEMBL in 2000. Nucleic Acids Research. 2000;28(no. 1):45–48.

[42] Ray, E.T. Learning XML: Guide to Creating Self-Describing Data. Sebastopol, CA: O’Reilly; 2001.


1For information about neuroinformatics, refer to the Human Brain Project at the National Institute of Mental Health (http://www.nimh.nih.gov/neuroinformatics/abs.cfm).

2See also http://www.w3.org/2001/sw.

3The RDF Schema is given and discussed at http://www.w3.org/RDF/overview.html and http://www.w3.org/DesignIssues/Semantic.html.

4This is discussed on the OMG Web site: http://lsr.omg.org. OMG is an open-membership, not-for-profit consortium that produces and maintains computer industry specifications for interoperable enterprise application.

5See also http://www.nbii.gov/disciplines/systematics.html, a general systematics site, and http://www.fgdc.gov, for geographic data.

6See the work by the gene expression ontology working group at http://www.mged.org.

7See the Distributed Annotation System, http://www.biodas.org, and the Protein Information Resource, http://nbrfa.georgetown.edu/pir/databases/pir/xml.

8The Micro Array and Gene Expression (MAGE) markup language is being developed by the Microarray Gene Expression Data Society (see http://www.mged.org/Workgroups/mage.html).

9The Systems Biology Workbench (SBW) is a modular framework designed to facilitate data exchange by enabling different tools to interact with each other (see http://www.cds.caltech.edu/erato).

10The Biopolymer Markup Language (BioML) is an XML encoding schema for the annotation of protein and nucleic acid sequence (see http://www.bioml.com).

11OASIS is an international, not-for-profit consortium that designs and develops industry standard specifications for interoperability based on XML.
