Concluding Remarks

As the first book focusing on management systems for biological data, this material offers a detailed introduction to the variety of problems and issues facing data integration and presents numerous systems designed to address them. The major issue these systems are trying to address is the large number of distributed, semantically disparate data sources that need to be combined into a useful and usable system for geneticists and biologists to perform their research. This issue is complicated by the variety of data formats, inconsistent semantics, and custom interfaces supported by these sources, as well as by the highly dynamic nature of these characteristics and of the data themselves. Ideally, a data integration system would provide consistent access to all of the data and tools needed by scientists. However, no single system meets this ideal for all users. This final section provides a brief summary and a peek into the future of bioinformatics.

SUMMARY

The introductory chapters establish a terminology shared by computer scientists and life scientists. They focus on the different steps in the design of a system and highlight the differences between the problems faced by those in bioinformatics and those faced in other facets of these respective disciplines. At first glance, these differences may seem insignificant, but understanding them is the first step in understanding the realities of the environment in which bioinformatics solutions must work. The desire to simplify this environment is common among people starting out in this domain, but overcoming it is critical to successfully addressing the problems being faced. Many of the challenges in bioinformatics are derived from the inherent complexity of the domain, and failure to embrace this complexity results in approaches that, while acceptable in theory, are not workable in the real, complex world in which bioinformatics solutions must be applied.

Once a common background has been established, the following chapters present several bioinformatics systems that are currently in use. The wide variety of systems described in this book provides significant insight into the complexity of performing data integration in the rapidly changing domain of genomics. The fact that these systems are still evolving indicates that none of these approaches has yet led to an ideal solution for all applications. This is a testament to the difficulty of creating a bioinformatics solution that addresses the needs of all users. Most of these systems evolved independently, and many began as attempts at addressing specific challenges facing scientists in a particular organization. The challenges a given solution focuses on are generally the most important problems facing the associated organization or its customers. Each system presented here has met its original goals but has encountered new challenges as the scope of its usage evolved.

As discussed in Chapter 13, evaluating a system requires detailed knowledge of the environment in which the system will be deployed. Part of the reason no single approach is clearly better than another is that the bioinformatics community places conflicting goals on systems. As a simple example, notice that although providing a semantically consistent view of the data greatly improves the usability of the system, it also places practical limits on the number of data sources to which the system can provide access. This is because each data source provides its own unique semantics for the data it contains, and an expert is required to perform the mapping from these semantics to the global ones. However, the more sources to which a system provides access, the more valuable it is in general. As scalability and semantic consistency are mutually exclusive goals, a system can excel in only one of them, providing at best marginal performance in the other. Whether such a system is better than another depends on the users’ values. This example illustrates only one of the tradeoffs bioinformatics systems must balance. Because of these conflicting constraints, it is currently impossible for a single system to provide the bioinformatics solution that meets every scientist’s needs. Although this is a discouraging realization, it is not a situation unique to bioinformatics. Indeed, it appears to be a characteristic of any rapidly evolving scientific domain, and as such, the techniques used by bioinformaticians are more generally applicable than typically thought.

LOOKING TOWARD THE FUTURE

As one becomes familiar with the problems facing bioinformatics and the approaches being pursued to address them, it is easy to become disenchanted. The problems are daunting, and there is no clear path that will lead to a unifying solution. Some issues, such as query optimization and data caching, are just now being investigated seriously in this context. Other issues appear as the result of applying existing technology in new ways and of developing new technology. Indeed, sometimes it feels as if we are moving in the wrong direction: As it becomes increasingly easy to distribute data via the Web, the number and heterogeneity of data sources containing information relevant to scientists keep increasing. Unfortunately, a lack of community standards results in each source publishing its own distinct semantics and interfaces. The number of tools available to researchers, and their complexity, continues to increase without significant progress in making them interoperable. Multimedia data are becoming more common as genomics research continues to move onto computers and out of the wet lab, which causes problems for data integration systems that expect textual data. Large-scale data are also becoming more common as access to powerful computers and related infrastructures increases. This changes the value of bandwidth and requires rethinking many assumptions about the underlying data. Grid technology is emerging and will likely soon allow data and computation to be spread transparently among a large number of machines. How this technology will be used is not entirely clear, but it will likely have a significant impact on computational biology.

While each of these issues raises significant data integration and access challenges, they also provide new opportunities to solve existing bioinformatics problems and, in turn, to advance the state of genomics research. For example, grid technology may be able to minimize the impact of large data sets by moving the computation to the place where the data resides. Thus, there is still hope that we will achieve the goal of providing scientists with intuitive access to all the relevant data they need.

One of the more promising emerging trends is an effort to define data semantics precisely through ontologies. A possible, although not necessarily probable, result of this effort is a single unifying ontology that is able to identify accurately the information contained in all data sources. Having this global ontology would allow mappings between related concepts to be easily identified, and thus would greatly reduce the burden placed on integration systems. Unfortunately, this vision may take decades to be realized, if it happens at all. The major reason for this is that life science is an inherently complex domain, and there is a lot of information that is not yet understood. Thus, the ability to correctly define the semantic relationships between these complex concepts is severely limited by this lack of comprehension. Because of this difficulty, the ontologies currently being developed are generally small and define semantic concepts only for a specific sub-community of life science. The creation and adoption of these smaller ontologies are likely to occur over the next few years. Although a less-than-ideal solution, these ontologies could be extremely useful to bioinformatics by reducing the number of semantic definitions that need to be integrated.

Integrating data from multiple resources also raises challenging issues related to data provenance, data ownership, data quality, privacy, and security, which will need to be addressed in the near future. Indeed, integrated data are often composed of several data items, each coming from a different resource. Tracking data provenance is critical to scientific applications because it enables users to know where each data item comes from. This knowledge is relevant to data ownership and quality. For example, when exploiting data, it is important to give credit to the researcher who generated or annotated them. In addition, data provenance may affect the expected quality of the data (e.g., when they are not curated or validated) and thus the way they should be exploited. But if scientific integration systems evolve to track data provenance, they might also make it possible to reconstitute the original datasets, which raises privacy and security issues as scientific discovery will need to integrate more and more clinical data. Biological integration systems may have to comply with regulations such as the privacy provisions and the security standards for electronic health information of the U.S. federal Health Insurance Portability and Accountability Act of 1996 (HIPAA).
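
As a rough illustration of per-item provenance tracking, the following Python sketch attaches a provenance tag to every field of an integrated record. It is a minimal sketch under stated assumptions, not the design of any system described in this book: the class names, the helper functions, and the sources "SourceA" and "SourceB" are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    source: str      # name of the originating resource (hypothetical)
    record_id: str   # identifier of the item within that resource
    curated: bool    # whether the source curated or validated the item


@dataclass(frozen=True)
class TrackedValue:
    value: str
    provenance: Provenance


def sources_to_credit(record):
    """Return the sorted names of every resource that contributed a field."""
    return sorted({tv.provenance.source for tv in record.values()})


def uncurated_fields(record):
    """Return the sorted names of fields whose values were not curated."""
    return sorted(
        name for name, tv in record.items() if not tv.provenance.curated
    )


# An integrated record assembled from two made-up sources.
record = {
    "gene_symbol": TrackedValue(
        "GENE1", Provenance("SourceA", "A-001", curated=True)
    ),
    "function": TrackedValue(
        "putative kinase", Provenance("SourceB", "B-042", curated=False)
    ),
}

print(sources_to_credit(record))
print(uncurated_fields(record))
```

Keeping the tag on each value rather than on the record as a whole lets a user see both whom to credit and which fields deserve extra caution. The same tags are also what would let someone trace an integrated record back to its original datasets, which is the privacy concern raised above.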

Which trends will continue and impact the bioinformatics community as a whole has yet to be seen. The only thing certain is that bioinformatics will continue to be an exciting and evolving discipline for years to come. As comprehensive as we have tried to be, this book provides only an introduction to the fascinating world of bioinformatics data integration. Furthermore, while the challenges outlined herein are daunting, addressing them is only the first step the evolving, multidisciplinary field of bioinformatics must take. Once these challenges have been overcome, there is still a huge amount of work to be done to use that information effectively to understand the mechanics of life. Despite the tremendous amount of work still to do, the path is fascinating and the rewards for successfully unraveling the mysteries of the genome are unparalleled. We hope that this book has provided not just insight into the challenges currently being addressed in bioinformatics, but also inspiration to help overcome them.
