CHAPTER 1

Introduction

Zoé Lacroix and Terence Critchlow

1.1 OVERVIEW

Bioinformatics and the management of scientific data are critical to supporting life science discovery. As computational models of proteins, cells, and organisms become increasingly realistic, much biology research will migrate from the wet lab to the computer. Successfully accomplishing the transition to biology in silico, however, requires access to a huge amount of information from across the research community. Much of this information is currently available from publicly accessible data sources, and more is being added daily. Unfortunately, scientists are not currently able to easily identify and exploit this information because of the variety of semantics, interfaces, and data formats used by the underlying data sources. Providing biologists, geneticists, and medical researchers with integrated access to all of the information they need, in a consistent format, requires overcoming a large number of technical, social, and political challenges.

As a first step in helping to understand these issues, the book provides an overview of the state of the art of data integration and interoperability in genomics. This is accomplished through a detailed presentation of systems currently in use and under development as part of bioinformatics efforts at several organizations from both industry and academia. While each system is presented as a stand-alone chapter, the same questions are answered in each description. By highlighting a variety of systems, we hope not only to expose the different alternatives that are actively being explored, but more importantly, to give insight into the strengths and weaknesses of each approach. Given that an ideal bioinformatics environment remains an unattainable dream, compromises need to be made in the development of any real-world system. Understanding the tradeoffs inherent in different approaches, and combining that knowledge with specific organizational needs, is the best way to determine which alternative is most appropriate for a given situation.

Because we hope this book will be useful to both computer scientists and life scientists with varying degrees of familiarity with bioinformatics, three introductory chapters put the discussion in context and establish a shared vocabulary. The challenges faced by this developing technology for the integration of biological information are presented in Chapter 2. The complexity of use cases and the variety of techniques needed to support them are exposed in Chapter 3, which also discusses the translation from specification to design, including the most common issues raised when performing this transformation in the life sciences domain. The difficulty of direct communication between demanding users and developers is examined in Chapter 4, in which examples highlight the problems involved in directly transferring existing data management approaches to bioinformatics systems. These chapters describe the nuances that differentiate real-world bioinformatics from technology transferred from other domains. Although these nuances may be skeptically viewed as mere justifications for reworking solved problems, they are important because bioinformatics occurs in the real world, complete with its ugly realities, not in an abstract environment where convenient assumptions can simplify problems.

These introductory chapters are followed by the heart of this book: descriptions of eight distinct bioinformatics systems. These systems are the result of collaborative efforts between the database community and the genomics community to develop technology that supports scientists in the process of scientific discovery. Systems such as Kleisli (Chapter 6) were developed in the early stages of bioinformatics and matured through the meetings on the Interconnection of Molecular Biology Databases (the first of the series was organized at Stanford University in the San Francisco Bay Area, August 9–12, 1994). Others, such as DiscoveryLink (Chapter 11), are recent efforts to adapt sophisticated data management technology to specific challenges facing bioinformatics. Each chapter has been written by the primary contributor(s) to the system being described. This perspective provides valuable insight into the specific problem each system addresses, why its particular architecture was chosen, its strengths, and any weaknesses it may have. To provide an overall summary, the advantages and disadvantages of each approach are summarized and contrasted in Chapter 13.

1.2 PROBLEM AND SCOPE

In the last decade, biologists have experienced a fundamental revolution: from traditional research and development (R&D), which consisted of discovering and understanding genes, metabolic pathways, and cellular mechanisms, to large-scale, computer-based R&D that simulates disease, physiology, molecular mechanisms, and pharmacology [1]. This represents a shift away from life science's empirical roots, in which research was an iterative and intuitive process. Today it is systematic and predictive, with genomics, informatics, automation, and miniaturization all playing a role [2]. This fusion of biology and information science is expected to continue and expand for the foreseeable future. The first consequence of this revolution is the explosion of available data that biomolecular researchers must harness and exploit. For example, an average pharmaceutical company currently uses information from at least 40 databases [1], each containing large amounts of data (e.g., as of June 2002, GenBank [3, 4] provided access to 20,649,000,000 bases in 17,471,000 sequences) that can be analyzed using a variety of complex tools such as FASTA [5], BLAST [6], and LASSAP [7].

Over the past several years, bioinformatics has become both an all-encompassing term for everything relating to computer science and biology, and a very trendy one.1 There are a variety of reasons for this, including: (1) as computational biology evolves and expands, the need for solutions to the data integration problems it faces increases; (2) the media are beginning to understand the implications of the genomics revolution that has been under way for the last 15 or more years; (3) recent headlines and debates surrounding the cloning of animals and humans have drawn public attention to the field; and (4) to appear cutting edge, many companies have relabeled their work as bioinformatics, and similarly, many people have become bioinformaticians instead of geneticists, biologists, or computer scientists. As these events have occurred, the generally accepted meaning of the word bioinformatics has grown from its original definition of managing genomics data to include topics as diverse as patient record keeping, molecular simulations of protein sequences, cell- and organism-level simulations, experimental data analysis, and analysis of journal articles. A recent definition from the National Institutes of Health (NIH) phrases it this way:

Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. [8]

This definition could be rephrased as: Bioinformatics is the design and development of computer-based technology that supports life science. Under this definition, bioinformatics tools and systems perform a diverse range of functions, including data collection, data mining, data analysis, data management, data integration, simulation, statistics, and visualization. Computer-aided technology directly supporting medical applications is excluded from this definition and is referred to as medical informatics. This book does not attempt to authoritatively describe the full range of topics in this field. Instead, it focuses on genomics data integration, access, and interoperability, as these areas form the cornerstone of the field. However, most of the presented approaches are generic integration systems that can be used in many similar scientific contexts.

This emphasis is in line with the original focus of bioinformatics, which was on the creation and maintenance of data repositories (flat files or databases) to store biological information, such as nucleotide and amino acid sequences. The development of these repositories mostly involved schema design issues (data organization) and the development of interfaces whereby scientists could access, submit, and revise data. Little or no effort was devoted to traditional data management issues such as storage, indexing, query languages, optimization, or maintenance. The number of publicly available scientific data repositories has grown at an exponential rate, to the point where, in 2000, there were thousands of public biomolecular data sources. In 2003, Baxevanis listed 372 key databases in molecular biology alone [9]. Because these sources were developed independently, the data they contain are represented in a wide variety of formats, are annotated using a variety of methods, and may or may not be supported by a database management system.

1.3 BIOLOGICAL DATA INTEGRATION

Data integration issues have stymied computer scientists and geneticists alike for the last 20 years, yet successfully overcoming them is critical to the success of genomics research as it transitions from a wet-lab activity to an electronic one in which data drive increasingly complicated research performed on computers. This research is motivated by scientists striving to understand not only the data they have generated but, more importantly, the information implicit in these data, such as relationships between individual components. Only through this understanding will scientists be able to successfully model and simulate entire genomes, cells, and ultimately entire organisms.

Whereas the need for a solution is obvious, the underlying data integration issues are not as clear. Chapter 4 goes into detail about the specific computer science problems and how they are subtly different from those encountered in other areas of computer science. Many of the problems facing genomics data integration are related to data semantics, that is, the meaning of the data represented in a data source, and to the differences in semantics across a set of sources. These differences can require addressing issues surrounding concept identification, data transformation, and concept overloading. Concept identification and resolution have two components: identifying when data contained in different data sources refer to the same object, and reconciling conflicting information found in these sources. Addressing these issues should begin by identifying which abstract concepts are represented in each data source. Once shared concepts have been identified, conflicting information can be easily located. As a simple example, two sources may have different values for an attribute that is supposed to be the same. One of the wrinkles that genomics adds to the reconciliation process is that there may not be a "right" answer. Intuitively, a sequence representing the same gene should be identical in two different data sources; in practice, however, there may be legitimate differences between the sources, and these differences need to be preserved in the integrated view. This makes a seemingly simple query, "return the sequence associated with this gene," more complex than it first appears.
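
A minimal Python sketch, using hypothetical in-memory sources and invented sequence values in place of real remote databases, illustrates one way an integrated view can answer such a query without silently choosing among conflicting answers:

```python
# A sketch of conflict-preserving integration; the source names, gene
# identifier, and sequence values below are hypothetical.

source_a = {"geneX": "ATGGATTTATCTGCT"}  # invented sequence
source_b = {"geneX": "ATGGATTTATCTGCC"}  # same gene, one base differs

def integrated_sequences(gene, sources):
    """Return every (source, sequence) pair for a gene, conflicts intact."""
    return [(name, data[gene]) for name, data in sources.items()
            if gene in data]

results = integrated_sequences("geneX", {"A": source_a, "B": source_b})
if len({seq for _, seq in results}) > 1:
    # More than one legitimate answer exists; report all of them with
    # their provenance instead of discarding the differences.
    for source, seq in results:
        print(f"{source}: {seq}")
```

Preserving provenance in this way leaves the choice between conflicting values to the scientist, who has the domain knowledge needed to interpret them.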

In cases where the differences are the result of alternative data formats, data transformations may be applied to map the data to a consistent format. Whereas applying a mapping may be simple from a technical perspective, determining what the mapping is and when to apply it relies on a detailed representation of the concepts and appropriate domain knowledge. For example, the translation of a protein sequence from a single-character representation to a three-character representation defines a corresponding mapping between the two representations. Not all transformations are easy to perform, and some may not be invertible. Furthermore, because of concept overloading, it is often difficult to determine whether or not two abstract concepts really have the same meaning, and to figure out what to do if they do not. For example, although two data sources may both represent genes as DNA sequences, one may include sequences that are postulated to be genes, whereas the other may include only sequences that are known to code for proteins. Whether or not this distinction is important depends on the specific application and the semantics that the unified view is supporting. The number of subtly distinct concepts used in genomics, and the use of the same name to refer to multiple variants, makes overcoming these conflicts difficult.
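
To make the protein sequence example concrete, the following Python sketch implements the single-character to three-character mapping. The translation table is the standard IUPAC amino acid code; the function names are illustrative rather than drawn from any system described in this book. Because the table is one-to-one, this particular transformation happens to be invertible, unlike many of the transformations a real integration system must handle:

```python
# Standard IUPAC one-letter to three-letter amino acid codes; the
# helper functions are hypothetical names for illustration only.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
# The inverse table exists because the mapping is one-to-one.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}

def to_three_letter(sequence):
    """Map a one-letter protein sequence, e.g., 'MKV' -> 'MetLysVal'."""
    return "".join(ONE_TO_THREE[residue] for residue in sequence.upper())

def to_one_letter(sequence):
    """Invert the mapping, assuming well-formed input: 'MetLysVal' -> 'MKV'."""
    chunks = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    return "".join(THREE_TO_ONE[chunk] for chunk in chunks)

assert to_one_letter(to_three_letter("MKV")) == "MKV"
```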

Unfortunately, the semantics of biological data are usually hard to define precisely because they are not explicitly stated but are implicitly embedded in the database design. The reason is simple: at a given time, within a single research community, common definitions of various terms are often well understood and have precise meanings. As a result, the semantics of a data source are usually understood by those within that community without needing to be explicitly defined. However, genomics (much less all of biology or life science) is not a single, consistent scientific domain; it is composed of dozens of smaller, focused research communities. This would not be a significant issue if researchers only accessed data from within a single domain, but that is not usually the case. Typically, researchers require integrated access to data from multiple domains, which requires resolving terms that have slightly different meanings across the communities. This is further complicated by the observations that the specific community whose terminology is used by a data source is usually not explicitly identified and that the terminology evolves over time. For many of the larger, community data sources, the domain is obvious (the Protein Data Bank (PDB) handles protein structure information, the Swiss-Prot protein sequence database provides protein sequence information and useful annotations, etc.), but the terminology used may not be current and can reflect a combination of definitions from multiple domains. The terminology used in smaller data sources, such as a Drosophila database, is typically selected based on a specific usage model. Because this model can involve concepts from several different domains, the data source will use whatever definitions are most intuitive, mixing the domains as needed.

Biology also presents three challenges for data integration that are common in evolving scientific domains but not typically found elsewhere. The first is the sheer number of available data sources and the inherent heterogeneity of their contents. The World Wide Web has become the preferred approach for disseminating scientific data among researchers, and as a result, literally hundreds of small data sources have appeared over the past 10 years. These sources are typically a "labor of love" for a small number of people. As a result, they often lack the support and resources to provide detailed documentation and to respond to community requests in a timely manner. Furthermore, if the principal supporter leaves, the site usually becomes completely unsupported. Some of these sources contain data from a single lab or project, whereas others are the definitive repositories for very specific types of information (e.g., for a specific genetic mutation). Not only do these sources complicate the concept identification issue previously mentioned (because they use highly specialized data semantics), but their number makes it infeasible to incorporate all of them into a consistent repository.

Second, the data formats and data access methods (associated interfaces) change regularly. Many data providers extend or update their data formats approximately every six months, and they modify their interfaces with the same frequency. These changes are an attempt to keep up with the scientific evolution occurring in the community at large. However, a change in a data source representation can have a dramatic impact on systems that integrate that source, causing the integration to fail on the new format or, worse, introducing subtle errors into the systems. As a result of this problem, bioinformatics infrastructures need to be more flexible than systems developed for more static domains.
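
One common defensive tactic, sketched below in Python for a hypothetical tab-separated source (the column names and layout are invented for illustration), is to validate every record against the expected layout so that a provider-side format change fails loudly at parse time rather than silently corrupting downstream integration:

```python
# Hypothetical flat-file wrapper: records are assumed to be
# tab-separated with a fixed, known column layout.
EXPECTED_COLUMNS = ["accession", "organism", "sequence"]  # assumed layout

def parse_record(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(EXPECTED_COLUMNS):
        # Refuse to guess: the provider has likely revised the format.
        raise ValueError(
            f"expected {len(EXPECTED_COLUMNS)} fields, got {len(fields)}; "
            "the source format may have changed"
        )
    return dict(zip(EXPECTED_COLUMNS, fields))

record = parse_record("P12345\tDrosophila melanogaster\tATGGCC")
print(record["accession"])
```

Failing fast does not by itself make an integration flexible, but it converts the subtle errors described above into visible ones that can prompt an update of the wrapper for that source.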

Third, the data and related analysis are becoming increasingly complex. As the nature of genomics research evolves from a predominantly wet-lab activity into knowledge-based analysis, the scientists’ need for access to the wide variety of available information increases dramatically. To address this need, information needs to be brought together from various heterogeneous data sources and presented to researchers in ways that allow them to answer their questions. This means providing access not only to the sequence data that is commonly stored in data sources today, but also to multimedia information such as expression data, expression pathway data, and simulation results. Furthermore, this information needs to be available for a large number of organisms under a variety of conditions.

1.4 DEVELOPING A BIOLOGICAL DATA INTEGRATION SYSTEM

The development of a biological data integration and management system has to overcome the difficulties outlined in Section 1.3. However, there is no obvious best approach to doing this, and thus each of the systems presented in this book addresses these issues differently. Furthermore, comparing and contrasting these systems is extremely difficult, particularly without a good understanding of how they were developed. This is because the goals of each system are subtly different, as reflected by the system requirements defined at the outset of the design process. Understanding the development environment and motivation behind the initial system constraints is critical to understanding the tradeoffs that were made later in the design process and the reasons why.

1.4.1 Specifications

The design of a system starts with collecting requirements that express, among other things:

• Who the users of the system will be

• What functionality the system is expected to have

• How this functionality is to be viewed by the users

• The performance goals for the system

System requirements (or specifications) describe the desired system and can be seen as a contract agreed upon by the target users (or their surrogates) and the developers. Furthermore, these requirements can be used to determine if a delivered system performs properly.

The user profile is a concise description of who the target users for a system are and what knowledge and experience they can be assumed to have. Specifying the user profile involves agreeing on the level of computer literacy expected of users (e.g., Are there programmers helping the scientists access the data? Are the users expected to know any programming language?), the type of interface the users will have (e.g., Will there be a visual interface? A user customizable interface?), the security issues that need to be addressed, and a multitude of other concerns.

Once the user profile is defined, the tasks the system is supposed to perform must be analyzed. This analysis consists of listing all the tasks the system is expected to perform, typically through use cases, and involves answering questions such as: What are the sources the system is expected to integrate? Will the system allow users to express queries? If so, in what form and how complex will they be? Will the system incorporate scientific applications? Will it allow users to navigate scientific objects?

Finally, technical issues must be agreed upon. These issues include the platforms the system is expected to work on (e.g., UNIX, Microsoft Windows, Macintosh), its scalability (i.e., the amount of data it can handle, the number of queries it can simultaneously support, and the number of data sources that can be integrated), and its expected efficiency with respect to data storage size, communication overhead, and data integration overhead.

Collecting such requirements is traditional in every engineering task. However, in established engineering areas there are often intermediaries who initially evaluate the need for new technology and significantly facilitate the definition of system specifications. Unfortunately, this is not the case in the life sciences. Although technology is required to address complex user needs, scientists generally communicate their needs directly to the system designers. While communication between specialists in different domains is inherently difficult, bioinformatics faces an additional challenge: the speed at which the underlying science is evolving. A common result is that both scientists and developers become frustrated: scientists because systems cannot keep up with their ever-changing requirements, and developers because the requirements keep changing on them. The only way to overcome this problem is to have an intermediary between the specialists. A common goal can be formulated and achieved by forging a bridge between the communities and accurately representing the requirements and constraints of both sides.

1.4.2 Translating Specifications into a Technical Approach

Once the specifications have been agreed upon, they can be translated into a set of approaches. This can be thought of as an optimization problem in which the hard constraints define a feasibility region, and the goal is to minimize the cost of the system while maximizing its usefulness and staying within that region. Each attribute in the system description can be mapped to a dimension. Existing data management approaches can then be mapped to overlapping regions in this space. Once the optimal location has been identified, these approaches can be used as a starting point for the implementation.

Obviously, this problem is not always formally specified, but considering it in this way provides insight into the appropriate choices. For example, in the dimension of storage costs, two alternatives can be considered: materializing the data and not materializing it. The materialized approach collects data from various sources and loads them into a single system. This approach is often closely related to a data warehousing approach and is favored when the specifications include characteristics such as data curation, infrequent data updates, high reliability, and high levels of security. The non-materialized approach integrates all the resources by collecting the requested data from the distributed data sources at query execution time. Thus, if the specifications require up-to-date data or the ability to easily include new resources in the integration, a non-materialized approach would be more appropriate.
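
The distinction between the two alternatives can be sketched in a few lines of Python; the classes and in-memory sources below are hypothetical stand-ins for remote databases, not the architecture of any system in this book:

```python
# Materialized vs. non-materialized integration, in miniature.
source_x = {"geneA": "record from X"}
source_y = {"geneA": "record from Y"}

class Warehouse:  # materialized approach
    def __init__(self, sources):
        # Data are copied once, at load time; later updates in the
        # sources are invisible until the warehouse is refreshed.
        self.store = {}
        for name, data in sources.items():
            for key, value in data.items():
                self.store.setdefault(key, []).append((name, value))

    def query(self, key):
        return self.store.get(key, [])

class Mediator:  # non-materialized approach
    def __init__(self, sources):
        self.sources = sources  # only references are kept

    def query(self, key):
        # Each query goes to the live sources, so results are current.
        return [(name, data[key]) for name, data in self.sources.items()
                if key in data]

sources = {"X": source_x, "Y": source_y}
warehouse = Warehouse(sources)
mediator = Mediator(sources)

source_x["geneA"] = "updated record from X"
print(warehouse.query("geneA"))  # still returns the stale, loaded copy
print(mediator.query("geneA"))   # reflects the update immediately
```

In practice the choice is rarely this binary; many systems cache or partially materialize data, trading freshness for performance along the spectrum between these two extremes.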

1.4.3 Development Process

The system development implements the approaches identified in Section 1.4.2, possibly extending them to meet specific constraints. System development is often an iterative process in which the following steps are repeatedly performed as capabilities are added to the system:

• Code design: describing the various software components/objects and their respective capabilities

• Implementation: actually writing the code and getting it to execute properly

• Testing: evaluating the implementation, identifying and correcting bugs

• Deployment: transferring the code to a set of users

The formal deployment of a system often includes an analysis of the tests and training the users. The final phases are the system migration and the operational process. More information on managing a programming project can be found in Managing a Programming Project—Processes and People [10].

1.4.4 Evaluation of the System

Two systems may have the same specifications and follow the same approach yet end up with radically different implementations. The eight systems presented in this book (Chapters 5 through 12) follow various approaches, and their design and implementation choices lead to vastly different systems. Rather than detailing every design and implementation decision, these chapters focus on the main characteristics of each system, providing insight into the vast array of tradeoffs that are possible while still developing feasible systems.

There are several metrics by which a system can be evaluated. One of the most obvious is whether or not it meets its requirements. However, once the specifications are satisfied, there are many characteristics that reflect a system’s performance. Although similar criteria may be used to compare two systems that have the same specifications, these same criteria may be misleading when the specifications differ. As a result, evaluating systems typically requires insight into the system design and implementation and information on users’ satisfaction. Although such a difficult task is beyond the scope of this book, in Chapter 13 we outline a set of criteria that can be considered a starting point for such an evaluation.

REFERENCES

[1] Peitsch, M. From Genome to Protein Space. Presentation at the Fifth Annual Symposium in Bioinformatics, Singapore, October 2000.

[2] Valenta, D. Trends in Bioinformatics: An Update. Presentation at the Fifth Annual Symposium in Bioinformatics, Singapore, October 2000.

[3] Benson, D., Karsch-Mizrachi, I., Lipman, D., et al. GenBank. Nucleic Acids Research. 2003;31(no. 1):23–27. www.ncbi.nlm.nih.gov/Genbank

[4] Growth of GenBank. 2003. http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

[5] Pearson, W., Lipman, D. Improved Tools for Biological Sequence Comparison. Proceedings of the National Academy of Sciences of the United States of America. 1988;85(no. 8):2444–2448.

[6] Altschul, S., Gish, W., Miller, W., et al. Basic Local Alignment Search Tool. Journal of Molecular Biology. 1990;215(no. 3):403–410. http://www.ncbi.nlm.nih.gov/BLAST

[7] Glemet, E., Codani, J.-J. LASSAP: A Large Scale Sequence Comparison Package. Bioinformatics. 1997;13(no. 2):137–143.

[8] NCBI. Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources, A Science Primer. 2002. http://www4.ncbi.nlm.nih.gov/About/primer/bioinformatics.html

[9] Baxevanis, A. The Molecular Biology Database Collection: 2003 Update. Nucleic Acids Research. 2003;31(no. 1):1–12. http://nar.oupjournals.org/cgi/content/full/31/1/1

[10] Metzger, P., Boddie, J. Managing a Programming Project—Processes and People. Upper Saddle River, NJ: Prentice Hall; 1996.


1 This sentence claims that computer science relates to biology; whenever one refers to this relationship, one uses the term bioinformatics.
