Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

CHAPTER 5

SRS: An Integration Platform for Databanks and Analysis Tools in Bioinformatics

Thure Etzold, Howard Harris and Simon Beaulah

The Sequence Retrieval System (SRS) approach to data integration has evolved over many years to address the needs of researchers in the life sciences to query, retrieve, and analyze complex, ever increasing, and changing biological data. SRS follows a federation approach to data integration, leaving the underlying data sources in their original formats. For example, Genbank [1] is used in flat file format; the Genome Ontology (GO) [2] is used in either XML format or as relational tables stored in MySQL [3]. Databanks generated and provided by the major technologies available are integrated through meta-data, which is provided for the majority of the common public data sources. SRS customers use this functionality to integrate their own in-house data, such as gene expression databases, with third-party data such as Incyte LifeSeq Foundation data [4], and public data such as EMBL [5] and Swiss-Prot [6].

Databanks in SRS can be queried and analyzed via a Web interface or through a variety of application programming interfaces (APIs) as described in Section 5.8. Using one of a variety of query forms, the user can search a single or a combination of databanks. Search results can be further analyzed using a suite of tools like BLAST [7] and FASTA [8] for sequence similarity searching. SRS provides support for about 200 tools including a major part of EMBOSS [9]. This is further described in Section 5.7.

Meta-data is at the heart of SRS. Each data source is fully described, including the type and structure of data, relationships to other data sources, how the data should be indexed or presented to users, and how it can be mapped to external object models. SRS uses a meta-data only approach, which is based on its internal programming language, Icarus. Administrators can customize SRS by editing Icarus files or through the use of a graphical user interface. No access programs or wrappers need to be written by programmers as they do with other integration systems like DiscoveryLink (Chapter 11) and Kleisli (Chapter 6). An exception is the set of syntactic and semantic rules that need to be composed for the integration of flat file databanks. The result of the meta-data approach is a flexible and modular system that has adapted to all the changes and developments in bioinformatics over the past 10 years. Many approaches to data integration have been proposed over this time to address the needs described in this book, but SRS has surpassed them all to provide the only proven and widely used flexible data integration environment.

SRS aims to remain independent of the technology used for data storage. Extensible markup language (XML), flat files, and relational databases bring with them a range of benefits and problems that often create a particular mind-set for the people who use and maintain them. Flat file databanks are the “dinosaurs” in this field, albeit very successful in defying extinction. Flat file data are compact and generally very flexible to work with. They are mostly semi-structured and are presented in a vast variety of formats, which in their multitude make parser writing an almost impossible task. This is further described in Section 5.1. XML is an elegant way of representing data and is ideally suited for transferring information between tools and applications (e.g., communicating genomic data to a genome browser). XML offers great flexibility, which can present formidable challenges for integration. How SRS meets these challenges is described in Section 5.2. Relational databases create a world of tables, columns, and relationships, providing a structured and maintainable data store. However, the Structured Query Language (SQL) is not a common skill of most researchers, and the scientific concepts researchers wish to analyze are often lost somewhere in the ever-growing database schema. For almost all relational databases in molecular biology, a bespoke interface had to be built. Section 5.3 covers this technology. SRS supports these three technologies (flat file, XML, and relational databases) and can map all data into flexible and extendable object models as described in Section 5.6. Using the object loader, users can define their own views of the data to display, for example, gene expression data with information from GenBank and from InterPro [10]. This type of view combines XML, relational, and flat file data seamlessly and is completely in the control of the user. Section 5.6 also describes how data can be exported as XML to other applications in a standard or customized way.

Providing access to all data sources is only the first step of data integration. The relationships between the different data sources are represented in SRS and are used to form an interconnected set of data sources referred to as the SRS Universe. Mapping of attributes and approaches to semantic integration is addressed in Section 5.5. When a new data source is added to the SRS Universe, relationships to already integrated resources can be defined. This allows the SRS administrator and a domain expert to combine their knowledge to implement an SRS Universe that reflects the intricacies of their own data alongside a tried and tested public data SRS Universe. In this way, SRS provides the flexibility and extensibility that is required in a changing environment.

Figure 5.1 gives an overview of the SRS architecture. It shows the three data source types: XML, relational, and flat file databanks. The output of analysis tools is treated in the same way as flat file databanks. SRS provides specific technology to deal with each data source type. For example, flat file databanks use the token server, whereas relational databanks are integrated through object relational mapping and SQL generation modules. On top of these technologies are services such as the query service and the object loader. They can be applied to all data sources in a transparent way. The APIs can be used by programmers to make use of these services. The SRS Web server gives an example of such an application. Meta-data plays an important role in SRS. All data sources and analysis tools are fully described (e.g., file location, ftp source address, format) and all SRS modules are configured through meta-data. A visual editor can be used to access, modify, and create all SRS meta-data. All of the components are described in this book, if only briefly. Unfortunately, the scope of this book does not allow a complete description of them all.

FIGURE 5.1 The SRS architecture.

The SRS server at the European Bioinformatics Institute (EBI) [11] has provided genomic and related data to the European bioinformatics community since 1994. It now serves more than 4 million hits per month, returning results in seconds, and supporting thousands of researchers. The EBI SRS server has approximately 200 data sources integrated with many analysis tools. It also links to an access page with other major academic SRS servers and gives access to the freely available SRS meta-definition files for (currently) more than 700 public databases (see also Krell and Etzold’s article “Data banks” [12]). SRS is extensively used in large pharmaceutical and biotech companies and is the basis of the Celera Discovery System [13], Incyte LifeSeq Foundation distribution, Affymetrix NetAffyx portal [14], and Thomson Derwent Geneseq portal [15]. The SRS community of academic and commercial companies makes SRS the most widely used life science integration product.

Note: Throughout this text the words databank, database, and library are used in a seemingly interchangeable way. To clarify, this database is used to refer to the sum of data and the actual system within which it is stored, databank to refer to the data only, and library to refer to the representation of a databank or database within SRS.

5.1 INTEGRATING FLAT FILE DATABANKS

Before the advent of XML, almost all data collections in molecular biology were available as sets of text files, also called flat file databanks. New databanks are now generally available in XML and, increasingly, flat file databanks can be obtained in an alternative XML format. However, flat files continue to be the only available form for many data collections and will stay an important source of information for years to come.

Overall, flat file databanks have a simple structure and usually consist of a single stream of entries represented in a text format with a special syntax. The entries can be very rich, containing comprehensive information about a protein, a DNA sequence, or a tertiary 3D structure. The formats for these flat file databanks vary greatly and are only rarely shared. Once defined, individual formats will change continuously to reflect the growth of complexity in the associated content. Hundreds of these formats have been created, which makes parsing data in molecular biology a highly daunting task. SRS meets this challenge by providing tools that make writing parsers easy and very maintainable. Parsers to disseminate flat file entries, or token servers, are written in Icarus, the internal programming language for SRS.

5.1.1 The SRS Token Server

SRS has a unique approach for parsing data sources that has proved effective for supporting many hundreds of different formats. With traditional approaches, a parser would be written as a program. This program would then be run over the data source and would return with a structure, such as a parse tree that contains the data items to be extracted from the source. In the context of structured data retrieval, the problem with that approach is that depending on the task (e.g., indexing or displaying) different information must be extracted from the input stream. For instance, for data display, the entire description field must be extracted, but to index the description field needs to be broken up into separate words.

A new approach was devised called token server. A token server can be associated with a single entity of the input stream, such as a databank entry, and it responds to requests for individual tokens. Each token type is associated with a name (e.g., description Line) that can be used within this request. The token server parses only upon request (lazy parsing), but it keeps all parsed tokens in a cache so repeated requests can be answered by a quick look-up into the cache.

A token server must be fed with a list of syntactic and optionally semantic rules. The syntactic rules are organized in a hierarchic manner. For a given databank there is usually a rule to parse out the entire entry from the databank, rules to extract the data fields within that entry, and rules to process individual data fields. The parsed information can be extracted on each level as tokens (e.g., the entire entry, the data fields, and individual words within an individual data field). Semantic rules can transform the information in the flat file. For example, amino acid names in three-letter code can be translated to one-letter code or a particular deoxyribonucleic acid (DNA) mutation can be classified as a missense mutation.

Figure 5.2 shows an example of how two different identifiers for protein mutations can be transformed into four different tokens using a combination of syntactic and semantic rules.

FIGURE 5.2 The SRS token server applied on mutation identifiers.

The following example of Icarus code defines rules for the tokens AaChange, ProteinChangePos, and AaMutType for the first variation of the mutation key Leu39Arg.

The rules are specified in Icarus, the internal programming language of SRS. Icarus is in many respects similar to Perl [16]. It is interpreted and object-oriented with a rich set of functionality. Icarus extends Perl with its ability to define formal rule sets for parsing. Within SRS it is also used extensively to define and manipulate the SRS meta-data.

The previous code example contains seven rule definitions. Each starts with a name, followed by a colon and then the actual rule, which is enclosed within ∼ characters. These rules are specified in a variant of the Extended Backus Naur Form (EBNF) [17] and contain symbols such as literals, regular expressions (delimited by / characters), and references to other rules. In addition they can have commands (delimited by { and }), which are applied either after the match of the entire rule (command at the beginning of the rule) or after matching a symbol (command directly after the symbol). In the example three different functions are called within commands. $In specifies the input tokens to which the rule can be applied. For example, AaChange specifies as input the token table Key, which is produced by the rule Key. $Out specifies that the rule will create a token table with the current rule. The command $Wrt writes a string into the token table opened by the current rule. For example, the $Wrt command following the reference to the In rule in the rule Key will write the line matched by In into the token table Key. Using commands $In and $Out, rules can be chained by feeding each other with the output token tables they produce. For example, rule AaChange processes the tokens in token table Key and provides the input for rule AaMutType. Lazy parsing means that only the rules necessary to produce a token table will be activated. To retrieve the Key tokens, the rules Key and Entry need to be processed. To obtain AaMutType, the rules AaMutType, AaChange, Key, and Entry are invoked. Production AaChange uses an associative list to convert three-letter amino acid codes to their one-letter equivalents. AaMutType is a semantic rule that uses standard mutation descriptions provided by AaChange to determine whether a mutation is a simple substitution or leads to termination of the translation frame by introducing a stop codon.

Advantages of the token server approach are:

It is easy to write a parser where the overall complexity can be divided into layers: entry, fields, and individual field contents.

The parser is very robust; a problem parsing a particular data field will not break the overall parsing process.

A rule set consists of simple rules that can be easily maintained.

Lazy parsing allows adding rules that will only be used in special circumstances or by only a few individuals.

Lazy parsing allows alternative ways of parsing to be specified (e.g., retrieval of author names as encoded in the databank or converted to a standard format).

The parser can perform reformatting tasks on the output (e.g., insertion of hypertext links).

5.1.2 Subentry Libraries

Flat file databanks are often described as being semi-structured. This stems from the lack of a formal description of the contents, which may just be mentioned briefly in a readme file or user manual. While individual entries in a flat file databank describe a real world object such as a gene or protein, it is often possible to discover entities within these entries that are worth querying and retrieving as independent entities.

Consider a nucleotide sequence entry in EMBL or GenBank that describes an entire genome, or a large part of it, encoding hundreds of genes. It contains for every gene, or coding sequence, a sub-entity, or sub-entry, which can look like the one shown in Figure 5.3.

FIGURE 5.3 A Protein Coding Sequence (CDS) Feature in EMBL.

With SRS these sub-entries can be parsed, indexed, and retrieved as separate entities. There is still a tight association to the parent entry, but a separate databank of sub-entries is created effectively next to the databank of parent entries. Sequence features have a special property in that they are contained within the sequence of the parent entry. The exact location of that sequence can be specified in the sub-entry as shown in Figure 5.3 as a complex join statement following the CDS keyword. SRS uses this information to retrieve the sub-sequence of the sequence feature as part of the sub-entry.

Many other flat file libraries have sub-entries (e.g., literature citations and comments). In addition, sequence feature tables of the sequence databanks are parsed to produce a new sub-entry type counter, which is a list of counts of each feature type within an entry. Indexing these allows the scientist to make highly specific queries such as “all Swiss-Prot entries with exactly seven trans-membrane segments.”

5.2 INTEGRATION OF XML DATABASES

XML is becoming increasingly important within the bioinformatics community. There are several good reasons for using XML as a medium for the storage and transmission of bioinformatics data.

Because XML has a universally recognized format built on a stable foundation [18], it has become the primary means of exchanging information over the Internet.

A variety of tools make it relatively easy to manage XML data and transform it into other formats (e.g., an extensible stylesheet language transformation (XSLT) [19] style sheet may be used to transform XML data to hypertext markup language (HTML) format for display in a Web browser).

By allowing users to create their own syntax (element and attribute names) and structure (hierarchical parent-child relationships between elements), XML gives database designers great freedom to transform their mental models of an information system into a concrete form.

However, people conceptualize information in very different ways, particularly in a complex field like bioinformatics. This makes it difficult, if not impossible, to create widely accepted XML standards for bioinformatics data. Furthermore, different organizations are interested in different aspects and constellations of the bioinformatics data universe, which includes DNA sequences, proteins, structures, expressed sequence tags (ESTs), transcripts, metabolic pathways, patents, mutations, publications, and so forth. If all of these data types were incorporated into a single format, it would be extremely complex and unwieldy.

For these reasons, many companies and organizations have given up the quest for a universal XML standard for bioinformatics data. Instead, they have created their own standards, which are often customized versions of existing standards, optimized for use in internal applications. SRS has remained neutral in the standards war by striving to develop flexible tools that support all the existing and emerging bioinformatics XML formats.

5.2.1 What Makes XML Unique?

Data formatting in XML is similar to data formatting in flat files. Figure 5.4 shows how the EMBL flat file data in Figure 5.3 might appear if rendered in XML format. The key features that make XML formats different from flat file formats are as follows. Figure 5.4 illustrates both types of data encapsulation (the only piece of data expressed as an attribute value is the feature ID, CDS).

FIGURE 5.4 EMBL flat file data rendered as XML.

1. XML uses two distinct kinds of tags for wrapping data: elements and attributes. There is no hard-and-fast rule for what kinds of data should be encapsulated in attributes rather than in elements. In general, attributes tend to be used for short pieces of data that have a one-to-one relationship with the data in the parent element, such as IDs and classifications.

2. There are two types of syntax that can be used for XML elements.

a. Normal syntax encloses the data belonging to an element between a start tag (e.g.,<join>) and an end tag (e.g., </join>).

b. Empty syntax may be used for elements that either have no data content or have content that may be stored efficiently in attribute values. The InterPro format created by the EBI uses empty db_xref elements for specifying references to external databases:

3. Some XML elements (e.g., feature_list) are used as structural components that define hierarchical relationships between other elements but contain no data of their own.

4. Empty elements may be used as structure-only elements to delimit entries (or sub-entries). To support both normal and empty element entry delimiters, the SRS XML parser must have two different types of behavior.

a. For entries delimited by start and end tags, entry processing terminates when the end tag is found.

b. For entries delimited by empty element tags, there is no end tag, so entry processing terminates when the start tag of the next entry is found or when the end of the file is reached.

5. XML allows users to define shorthand expressions to represent commonly used strings. These expressions are called general entities. For example, the entity &spdb; could stand for the name of a database (e.g., SwissProt-Release). When an XML parser encounters a general entity reference like &spdb; in an attribute value or element content, it must replace the reference with the replacement text.

6. Some commonly used characters have special meaning in XML.

a. Less thans [<] and greater thans [>] are used in markup tags.

b. Apostrophes [’] and quotation marks [”] are used to delimit attribute values.

c. Ampersands [&] are used to specify general entity references.

If these characters occur within XML attribute values or element content, they can create ambiguities for an XML parser, so they must be handled with care.

7. XML data may also be encapsulated in CDATA sections that may appear wherever character data may appear. Inside CDATA sections, less thans and ampersands are treated as literals (i.e., they do not need to be replaced with entity references).

5.2.2 How Are XML Databanks Integrated into SRS?

XML is fully integrated into the SRS universe of databanks, and it is relatively easy to incorporate XML libraries into an SRS installation. The only prerequisite is a document type definition (DTD) that accurately describes the structure of the XML. If a DTD does not exist, a utility such as Michael Kay’s DTDGenerator [20] can be used to create one.

The first step in the configuration process is to run an SRS utility, which analyzes the DTD and creates templates for all the meta-data objects needed to define the new library. The user must then edit the resulting object definitions. Initially, the user must supply all of the extra information needed to perform the basic indexing and loading tasks. The next step is to register the new databank with SRS and index the library. Once the library has been indexed, all the standard library operations become available.

If the new XML library contains sub-entry libraries or takes advantage of any special indexing or loading features, the administrator must perform additional editing to define the sub-entry libraries or to activate these features. Integrating an XML library into SRS is easier than integrating a flat file library because SRS does most of the work of creating the library meta-information. Also, the use of a built-in generic XML parser eliminates the need for writing a library-specific parser.

5.2.3 Overview of XML Support Features

Support for Complex DTDs

A DTD is a set of declarations that defines the syntax and structure of a particular class of XML documents. DTDs may consist of an internal subset (inside an XML document) and/or any number of external subsets in separate files. External subsets may be invoked recursively from within other external subsets. DTDs may also incorporate INCLUDE and IGNORE blocks (conditional sections) containing different sets of declarations to be used in different applications, and these blocks may be activated or deactivated using variables called parameter entities. Thus, DTDs can be quite complex.

The SRS utility used to parse DTD files employs a sophisticated algorithm to process external DTDs recursively in accordance with the guidelines laid down in the World Wide Web Consortium‘s XML Version 1.0 Recommendation [18]. This ensures that if a DTD includes multiple declarations of the same general entity or default attribute value, the correct values are used in generating the SRS meta-information. The utility also supports the use of parameter entities and correctly processes conditional sections.

Support for Indexing and Querying

SRS provides several powerful features to give users control over the way XML data is indexed and queried. Micro-parsing allows users to pre-process data before it is written to an index field. For example, suppose the data contains an author element that uses initials-first formatting (e.g., <author>J. K. Rowling</author>), but the user would like to index this in initials-last format (e.g., Rowling, J. K.). The indexing metaphor for the author element would refer to an Icarus syntax file containing a production to transform the data. Micro-parsing allows users to apply the same types of syntactic and semantic rules used for flat file parsing to the contents of individual XML tags.

Splitting allows users to subdivide input data strings containing lists into their component index values. For example, suppose the data contains an authors element containing a list of author names separated by commas and white space (e.g., <authors>J. K. Rowling, William Shakespeare, Stephen King </authors>), but the user would like to index this list as three separate author names. The indexing metaphor for the authors element would include a split attribute (e.g., split: [, ]) specifying a regular expression containing a list of the characters used to separate individual items in the list.

Conditional indexing allows users to process meta-data specified within the XML stream. For example, the Bioinformatic Sequence Markup Language (BSML) format [21] uses an XML element called Attribute as a container for three different types of data: version, source, and organism. The name attribute is a meta-data field that identifies the type of data contained in the associated content attribute.

Conditional indexing may be used to channel the data contained in the three content attributes into three separate index fields designed to hold version, source, and organism data.

SRS provides solid support for indexing and querying subentry libraries. In some XML formats, a single type of element is used in more than one sub-entry library, and the element may have a different meaning in each library. To index the data contained in these elements into the correct set of target index fields, SRS allows users to create separate fields and indexing metaphors for each unique instance of the element. The indexing metaphors use a special path attribute to determine which sub-entry library the element currently being processed belongs to so that the data can be indexed into the correct field. Conversely, SRS also allows users to index data from a single element or attribute into multiple index fields.

5.2.4 How Does SRS Meet the Challenges of XML?

Problems with managing XML data can be divided into two main categories: syntactical/semantic (microscopic) and structural (macroscopic). Data formatting varies widely between standard XML formats, and pre-processing is often required before data can be indexed or loaded. Table 5.1 describes several common syntactical problems and explains how SRS solves them.

TABLE 5.1

Syntactical problems and SRS solutions.

Problem	SRS Solution
Fields may include special characters (e.g., colons, square brackets, dashes, and asterisks) that can interfere with SRS query syntax.	Use micro-parsing to purify data fields during indexing.
Fields may contain data whose type depends on the value of another (meta-data) field.	Use conditional indexing to index data into different fields based on the value in a condition (meta-data) field.
Fields may contain characters that require special handling in XML (e.g., less thans, greater thans, apostrophes, quotation marks, and ampersands).	Use micro-parsing to replace problematic characters with pre-defined character entity references.
Entity references (both pre-defined and user-defined) must be replaced before data is indexed or loaded. Entity references may include markup.	SRS provides sophisticated entity replacement functionality.
A single element may be used in two or more subentry libraries to contain different types of information.	Users can create separate fields and indexing metaphors for each instance of an element used in a different subentry library. The indexing metaphors use a path attribute to index the correct data into the correct fields.
Mixed content elements are difficult to parse because content belonging to the parent element is interspersed with content belonging to child elements.	SRS provides two special loading commands (xsl: copy-of and xsl: value-of) that emulate useful features found in the extensible stylesheet language transformations (XSLT) language [19].
Fields may contain lists of values that must be separated into individual values.	Indexing metaphors can include a split attribute to split a string into sub-strings using a set of separator characters contained in a regular expression.

XML formats, like flat file formats, can be large, complex, and unwieldy, making data access difficult and inefficient. Table 5.2 describes several common structural problems that occur in XML formats used in bioinformatics. SRS provides solutions to some of these problems, but some can only be solved by restructuring the data.

TABLE 5.2

Structural problems and SRS solutions.

Problem	SRS Solution
Some libraries use large numbers of structure-only tags. Tags can take up a lot of disk space without providing much useful information.	No solution; inherent in XML.
Excessive numbers of tags make a format difficult to understand and manage.	The SRS utility that generates library definition files uses intelligent parsing to eliminate structure-only elements from the set of metadata objects that are included in the library definition file.
Deep nesting and large numbers of sub-entries slow down querying and loading performance.	The SRS loading algorithm builds a document object model (DOM) object for each entry. This approach provides both optimal performance and highly reliable handling of sub-entries. It also provides some special loading commands that improve performance for particular types of loading tasks.
Content belonging to a single entity may be spread across multiple files.	No solution; inherent in certain XML formats. Data should be restructured.
A single entity may appear repeatedly in multiple files.	No solution; inherent in certain XML formats. Data should be restructured.
Excessive data redundancy slows down performance.	No solution; inherent in certain XML formats. Data should be restructured.

5.3 INTEGRATING RELATIONAL DATABASES

5.3.1 Whole Schema Integration

For relational databases, a schema organizes the data defining the data entities and their relationships to each other. Because individual entities can only be modeled as flat tables, real world concepts such as genes or metabolic pathways often use many tables to store the information faithfully. Conversely, to make full use of this data, the whole schema needs to be made available to the user. The user must be able to query one or more tables and then collect the necessary data from all related tables. For example, in a relational database storing genes, the user may query the author table for Lee, which will return the set of genes published by the author Lee. Behind the scenes the results from the query in the author table need to be related to other data (e.g., accession code, keyword, references, sequence), which is stored in other tables. The data is represented as a whole and must be assembled from many different tables before it is presented to the user.

The problem of mapping a table structure into a more complex object structure has been addressed before by object relational mapping techniques. Traditional approaches start with a class description of the objects to be stored and then generate the relational schema from the class information. This is in conflict with the SRS approach of integrating existing schemas where often an object model has not yet been defined. The overwhelming majority of the schemas relevant to life science informatics (LSI) have been obtained by more traditional methods, such as entity-relationship (ER) modeling, and not by object-relational modeling.

The SRS approach is to use a semi-automated process to define object-relational mapping on top of an existing schema. This is achieved by selecting a table manually to be the hub table, or the table containing values equivalent to an object ID (usually an accession number or unique ID), and other tables that can be defined to belong to the object. Using the hub table, the selection of tables, and foreign key relationships, SRS can automatically create an object model, which is introduced to the system as a dynamic type. The resulting object can then be queried and retrieved as a fixed entity, much like an entry in a flat file or XML databank.

When a relational databank is integrated, no indexing on the SRS side needs to be done. SRS will generate SQL statements for querying and retrieval of objects that will emulate the same behavior as users expect when dealing with flat file and XML databanks.

5.3.2 Capturing the Relational Schema

SRS Relational includes a Java program, schemaXML, which uses a standard Java Database Connectivity (JDBC) [22] interface to capture the relational database schema, including all the tables, columns, keys, and foreign key relationships. The program schemaXML passes this information to SRS providing the base information to integrate the relational databank. All further meta-data for customization can be added by editing this schema information using a graphical interface or by direct manipulation of Icarus files. This provides a much simpler solution than would be required by writing individual integration programs for each relational databank to be integrated. If the schema changes the program, schemaXML just needs to be re-run and the edits reapplied. A tool is being developed that reapplies edits to an updated version of the original schema definition.

5.3.3 Selecting a Hub Table

SRS Relational is based on the concept of hub tables, which are used, conceptually, to relate relational database tables to data objects. Hub tables are central points of interest in a relational schema and must contain a unique name (typically a primary key) that can be used as an entry ID (e.g., an accession code in a sequence database). Using foreign key relationships, all data held in surrounding tables can be linked directly or indirectly back to the hub table and entry ID using table joins. All tables that belong to a hub table must be directly or indirectly linked with it. In cases where these links are not apparent from the schema information retrieved by schemaXML, they can be set manually within the visual administration interface.

Figure 5.5 shows a section of the relational schema that is used to maintain the GO databank in the MySQL relational database management system (RDBMS).

FIGURE 5.5 Visual representation of part of the GO term schema within the SRS Visual Administration Tool. The table term is selected as a hub table. Individual lines between the tables represent foreign key relationships.

The term table is clearly the central point of interest in the schema with related tables surrounding it. It would be selected by the SRS administrator as a hub table for use in SRS. In other databases, the hub table selection may be less clear. For example, a Laboratory Information Management System (LIMS) database has many concepts such as sample, project, and experiment, each with its own collection of related tables. In these cases, multiple hub tables can be selected and associated to separate SRS libraries by the SRS administrator.

5.3.4 Generation of SQL

SRS sees the relational schema as a graph with tables as nodes and foreign key relationships as edges. The hub table is at the center of this graph. An idealized form of such a graph is shown in Figure 5.6. To translate an SRS query into SQL it maps the predicates to the appropriate columns and then to tables in the graph, and a shortest path is derived to relate these predicate queries to rows in the hub table using joins. An example is shown for three predicates in Figure 5.6 (A), which are all linked to the hub table. The SQL query will return a number of rows in the hub table, which are processed to create a list of entry IDs. To retrieve particular entries, a search path is again used, this time radiating out from the hub table and including the required tables using joins. See Figure 5.6 (B).

FIGURE 5.6 Hub table data access.

5.3.5 Restricting Access to Parts of the Schema

Once the relational schema held on the SRS side has been configured to define a library, it can be further modified to restrict or allow access to the tables within the schema. Individual tables can be hidden from SRS so general access to the data is not available. In addition, the SRS access permissions can also be used to control access to the whole or sections of the schema (when using multiple hub tables). It is also possible to modify, add, and remove links between tables without altering the original database schema.

5.3.6 Query Performance to Relational Databases

During the development and use of SRS Relational a number of performance optimizations have been added. A few of these are outlined as follows.

SRS is case insensitive, and it is well known that case insensitive queries in relational databases can be expensive. Therefore, if all the values in a column are known to be in the same case, this can be indicated within the metadescription of the schema and used to reduce the querying time significantly.

When required to do many table joins with One-to-Many (1:N) relationships within a single SQL query, the creation of the result table will suffer from combinatorial explosion. The definition of a table link contains a junction option, which, if turned on, will generate multiple small queries that can be run simultaneously and joined externally. This provides significant performance improvements for the object assembly process.

For text and pattern searches it is possible to make use of text indices produced by the relational database. This results in much faster searches for text-based queries, such as keywords or author name.

All query results are cached in a user-owned space. This is inexpensive because the query result is represented as a simple list of entry IDs. To repeat a query, it can be looked up in the cache and retrieved. The cache speeds up reinspection of queries, combining them with other queries, or displaying individual chunks of the result list in a Web interface.

5.3.7 Viewing Entries from a Relational Databank

As mentioned previously, the selection of a hub table and associated tables is used to build an object model automatically. The resulting object can be displayed as an XML stream, which is, however, inconvenient for the user. One option for presenting the object in a human, readable form is to apply XSLT to the XML output. Another more convenient one is to use the mechanism SRS provides to present objects to the Web by a layout description. Section 5.6 will describe how data from relational databanks can be combined with data from flat file or XML databanks into a single data structure.

5.3.8 Summary

Consistent with the SRS philosophy, relational databanks can be added through a meta-data only approach. With the exception of defining the HTML representation of the entry data, the entire process of creating and editing the meta-data can be done through mouse clicks in the visual administration interface of SRS.

Not all the options in the configuration have been described here, including setting up sub-entry libraries, automatic handling of binary data (such as images and Microsoft Office documents), and table cloning to handle recursive and conditional relationships between tables.

SRS uses a simple interface class to interact with the relational systems. For speed and efficiency the C/C++ interfaces are preferred over JDBC. At present the following Relational Database Management Systems (RDBMS) are supported:

Oracle [23]

MySQL [24]

Microsoft SQLServer [25]

DB2[26]

Relational databanks offer a lot of functionality, which needs to be matched by any system that mediates access to them. The meta-data approach of SRS Relational has proven to provide the flexibility to cope with new user requirements to exploit this functionality.

5.4 THE SRS QUERY LANGUAGE

SRS has its own query language. It supports string comparison including wildcards or regular expressions, numeric range queries, Boolean operators, and the unique link operators (see Section 5.5). Queries always return sets of entries or lists of entry IDs. Sets obtained by querying all databanks can be sorted using various criteria. The query language has no provision to specify sorting. Instead it is invoked using a method of the result set object that has been obtained by evaluating a query. To extract information from entries of result sets, further methods are available. These methods can retrieve the entire entry as a text or XML stream, retrieve individual token or field values, or load entries into data structures using predefined object loaders (see Section 5.6).

5.4.1 SRS Fields

A query predicate must refer to an SRS field, which has been assigned to the fields in the databank before query time. A query into the Swiss-Prot description field with the the word “kinase” looks as follows in the SRS query language:

This denotes a string search enclosed in [ and ]. The databank name swissprot is followed by the field name description. The search term kinase follows the delimiter:. Because the description field is shared with the databank EMBL, the query can be extended to search both Swiss-Prot and EMBL, which then have to be enclosed in curly braces:

Importantly, SRS fields are entities outside a given library definition, which must be mapped onto each field in a library. Whenever possible, the same SRS field is mapped to equivalent fields in different libraries. Through that mechanism each SRS library has a list of associated SRS fields that can be used for searching. Whenever the user selects multiple libraries for searching at the same time, it is possible to find out all the SRS fields that these have in common and represent them in a query form. SRS fields are an important mechanism to integrate heterogeneous databanks with different, but overlapping, content, and they also provide an important simplification because no knowledge of the internal structure of a given databank is required when retrieving and using the list of SRS fields.

A special SRS field exists with the name AllText. It is shared by all databanks and refers to all the text fields in all databanks. Through the use of this field, full-text queries can be specified.

5.5 LINKING DATABANKS

A common theme in databanks in molecular biology is that they all have explicit cross-references to other databanks. Especially now, in the postgenomic era in which many known proteins can be linked to a genome location and where results from gene expression and proteomics experiments can be used to understand how these proteins are regulated within the cell, individual data items have very limited value if they are not connected to other databanks. SRS supports and makes use of explicit cross-references in three ways:

Hypertext links

Indexed links

Composite structures (see Section 5.6)

Hypertext links are the simplest mechanism and are ubiquitous on the Web. They are inserted into the appropriate places when displaying information to the user. Linking in this form can be operated on single entries and is very convenient and easy to understand. These links are easy to set up for an SRS Web server. Definitions can be shared among libraries and include options like displaying a link only if it contains a valid reference to an existing entry.

More powerful is the use of indexed links. A simple example of a query using indexed links is “give me all entries in Swiss-Prot that are linked to EMBL.” SRS has a general capability to index links based on explicit or even implicit cross-reference information. Link indices are built using information from one side only. All links, once indexed, become bi-directional. An SRS server with many libraries and links can be seen as a graph where nodes are libraries and the edges are the links between them. Figure 5.7 shows such a graph for a comparatively small installation.

FIGURE 5.7 An SRS library network.

In this graph it is possible to link databanks that are not directly connected. For any pair of databanks, the shortest route can be determined and carried out by a multi-step linking process. SRS knows the topology of a given installation and can therefore always determine and execute this shortest path. If the shortest path is not what is desired, this can be specified explicitly within an SRS query language statement.

5.5.1 Constructing Links

Links can be constructed by identifying two SRS fields (see Section 5.4), each from one of the two SRS libraries to be linked that contain identical field values. For instance, to create a link between Swiss-Prot and EMBL, you would select the accession field from EMBL and the data reference (DR) field from Swiss-Prot with explicit cross-references to other databanks. Another example is to link Swiss-Prot and Enzyme [27], which can be defined by the ID field from Enzyme and the description field of Swiss-Prot. The description field of Swiss-Prot carries one or more Enzyme IDs if the protein in question is known to have an enzymatic function. For flat file and XML databanks, link indices must be built to make the link queryable. Indices can be built by comparing existing indices or by parsing the databank defined to contain the cross-reference information.

Links between and from relational databanks are defined in the same way. However, no indices need to be built. A link query can be executed by querying the information provided in the table structure.

5.5.2 The Link Operators

Link operators are unique to the SRS query language. The two link operators, < and >, allow sets of data from different databanks to be combined. Figure 5.8 shows two databanks, A and B, in which some entries in A have cross-references to entries in B. These cross-references are processed to build link indices, which provide the basis for the link operation. Figure 5.8 also shows the results of two link queries between sets A and B, using the operators < and >.

FIGURE 5.8 The SRS link operators.

By combining predicate queries with link operators it is possible to perform complicated cross-databank queries such as “retrieve all proteins in Swiss-Prot with calcium binding sites for which their tertiary structure is known with a resolution better than 2 Angstrom.”

Another important use of the link operator is to convert sub-entries (e.g., sequence features) into entries and vice versa. With this link it is possible to search in EMBL all human CDS features (i.e., all sequence features describing coding sequences or all human DNA sequences that have CDS features).

5.6 THE OBJECT LOADER

The SRS object loader is a technology originally designed to transform semi-structured text data into well-defined data structures that can be accessed in a programmatic way. The object loader processes data according to a loader specification, or a class definition, which, for all of its attributes, specifies how it can obtained from the text file.

The following example shows such a loader for the example of the mutation data in Figure 5.2.

This definition needs no information on how the required information is to be parsed out of the flat file. Only the name of the token is needed to make the association.

The object loader has been extended to support, in addition to flat file databanks, XML and relational databanks. A variety of ways have been added to specify the origin of the original data to be loaded. This includes using the SRS field abstraction, an XPath-like syntax for XML files (XPath is the XML Path language used for addressing parts of an XML document) [28] or pairs of table and row names for relational databanks. A single loader can be defined for a broad range of databanks. For example, a single sequence loader can be specified for all databanks with sequence information, and the original format can be flat file, XML, or relational.

In Section 5.8 an example is described for accessing the loaded objects within a client program.

5.6.1 Creating Complex and Nested Objects

The loader specification supports other useful features, such as class inheritance, and supports various value types like string, integer, and real values and various types of lists.

Using token indices (TINs), a feature of the token server that allows iteration over lists of complex structures inside a text entry, object classes can be nested to an arbitrary degree. Object loaders can build a structure to reflect the entry subentry structure used for indexing, but can have deeper levels of nesting. A good example is an EMBL entry, which contains a list of sequence feature objects, each of which contains a list of qualifier value and name pairs (see Figure 5.3 for an example of an EMBL sequence feature).

5.6.2 Support for Loading from XML Databanks

SRS provides several features to give users control over loading from XML databanks. The utility that generates library definition files from an XML document type definition can generate two types of loaders, a flat loader based on the traditional main library/sub-entry library structure or a special structured loader, which is a collection of separate loaders for each element that mimics the tree structure of the original XML document. The structured loader makes it easier to load data from libraries that are heavily nested, and it is particularly useful for tasks like writing SRS page layout modules for HTML display.

SRS provides special support for random access of sub-entities within XML files (e.g., individual entries, sub-entries, fields, or collections of fields) or for loading high-level structures without some of the data nested within them (e.g., main entries without sub-entries). In addition, SRS offers two access mechanisms that mimic functionality available in XSLT. Figure 5.9 shows a mixed content element called prolog that has child elements child_1 and child_2 interspersed with its content.

FIGURE 5.9 Mixed content sample data.

The xsl: value-of functionality allows users to extract and concatenate all of the character data content of the prolog element and its child elements. A field loaded using the xsl: value-of keyword word would contain the following text: “To be, or not to be, that is the question.” Note that none of the attribute values are included. The xsl: copy-of functionality allows users to extract XML tree fragments, complete with markup. A field loaded using the xsl: copy-of keyword word would contain the entire XML fragment shown in Figure 5.9. If neither of these special keywords were used, the prolog field would only include the content of the prolog element: “To be, to be, the question.”

5.6.3 Using Links to Create Composite Structures

An important feature of the object loader is that it can perform links to retrieve single attributes or entire data objects from another linked library. Assuming that the Mutation databank is linked to Swiss-Prot, an example is adding an attribute to the Mutation loader with the description line from a linked Swiss-Prot entry. The following line instructs the object loader to link to Swiss-Prot using the shortest path and to extract the description token from the linked entry.

Another possibility would be, rather than extracting a single token, to attach an entire object as defined by an already existing loader class for Swiss-Prot, in this case the loader SeqSimple.

As information about a certain real-world object is scattered across many databanks, object loaders can provide an extremely valuable foundation for writing programs to display or disseminate these real-world objects. It is possible to define and design these objects freely and in a second step decide where the individual information pieces can be retrieved for their assembly.

5.6.4 Exporting Objects to XML

SRS allows users to export data assembled by the object loader to a generic XML format or to any of the standard XML formats. When converting data to a generic format, SRS creates a well-formed XML document with an accompanying DTD. This functionality can be invoked from the SRS Web interface or from one of the APIs (see Section 5.8). This functionality is provided for every object loader specification by default. If users wish to convert data to a specific format, a public standard, or a format the user invented, they must use a set of XML print metaphor objects that represent and describe the elements and attributes in the target format.

The process for creating XML print metaphors is similar to the process for creating an XML databank definition file. Before a new set of print metaphors can be generated, the user must obtain an accurate DTD for the target XML format. An SRS utility analyzes the DTD and creates print metaphor object templates for all of the XML elements and attributes in the target format. The user must then edit the resulting file to identify data sources for each element and attribute in the target format. The new SRS Visual Administration Tool includes a graphical user interface (GUI) that greatly simplifies the process of editing print metaphors.

Data objects can also be exported to a target XML format using an XSLT style sheet. This process is slightly less convenient than using print metaphors because it involves an extra conversion step. The user must first export the data to the generic XML format, then invoke an XSLT style sheet that converts the generic format to the target format.

XML print metaphors can also be used to transform data from any source into an XML format that is compatible with Microsoft’s Office Web Components (OWC) [29]. This technology allows data to be displayed and manipulated using either an Excel spreadsheet or a pivot table embedded in the SRS browser. The pivot table component allows the user to do sophisticated sorting and grouping operations on the data. Both components have an “Export to Excel” button that allows the data to be easily saved to an Excel workbook file.

5.7 SCIENTIFIC ANALYSIS TOOLS

A key feature of SRS is its ability to integrate and use scientific analysis tools that can be applied to user data or to data resulting from database queries. The results generated by these tools can be stored, in turn, in tool-specific databanks, which can then be treated like any other SRS databank. The difference in these databanks is that they are user owned and constitute part of the user session with SRS.

All tools that can be integrated fulfill the following requirements:

It can be launched with a UNIX command line.

It receives input through command line argument or input files.

It writes output to files or to the standard output device.

In bioinformatics, hundreds of tools can be found with these properties. They include BLAST, FASTA for sequence similarity searching, or Clustal [30] for multiple sequence alignment. A selection of these can be combined within an automated annotation pipeline to predict all genes for a genome or, for all proteins derived from these genes, the protein function annotation. Pipelines like this, together with their output, can be integrated as a single tool into SRS. Currently SRS supports about 200 tools, including BLAST, FASTA, stackPACK [31], and the majority of the tools in EMBOSS.

A tool can be added to SRS through meta-data by defining the SRS library with the syntax and data fields of the tool output, information about all tool input options, validation rules to test a parameter set specified by the user, pre-defined parameter sets, association to a data type, and so forth.

SRS has a growing set of pre-defined data types, such as protein sequence, which can be extended by the administrator. These data types can be associated with databanks that contain data of this type, tools that take it as input, and tools that produce it as output. This information can be used to build user interfaces that know which tools apply to which databases or workflows that feed tools with outputs of other tools.

5.7.1 Processing of Input and Output

Many tools require some pre-processing steps like setting up the run-time environment or conversion of the input sequence to a format they recognize, and post-processing such as cleanup of additional output or preserving input data values. All of these can be specified as part of the tool definition or by using pre-defined hooks for shell scripting.

Output can be processed at many levels, depending on the detail required. A simple text view of the output is enough for some applications, but where the results can be parsed for object loaders, this is much preferred. A key decision is the level at which an entry in the output should be returned by a later query. The entire output is usually a single entry for simple analysis of a sequence, but for search tools like BLAST it is preferable to represent each hit in the sequence databases as a separate entry so these can be linked to the source data. This seriously complicates the task of developing a parser as the entry information is split in several sections of a file, which can be several megabytes in size, but the increased flexibility more than justifies the extra effort.

An important implication of parsing and indexing tool outputs is that the respective tool libraries can become part of the SRS Universe if link information exists. For instance, all outputs from sequence similarity search tools can be linked to the sequence databank searched. Assuming that the search databank is connected to the SRS Universe, questions like “How many proteins from a certain protein family or metabolic pathway were found?” can be asked. Links also can be used to compare results obtained by different search tools; for instance, through a single SRS query a list of hits that were found by both FASTA and BLAST can be obtained.

5.7.2 Batch Queues

Batch queues allow the administrator to specify where and when analyses can be run. Once batch queuing is enabled, it is possible to associate a tool with one or more queues with different characteristics. SRS provides support for several popular batch queuing systems such as LSF [32], the Network Queuing System (NQS) [33], the Distributed Queuing System (DQS) [34], or the SUN Grid Engine [35].

If a tool associated with a batch queue is launched, the job is submitted to this batch queue and the Web interface (see Section 5.8, Interfaces to SRS) reports the command line and provides a link to the job status page. This page displays the full list of batch runs. Selecting a completed run will bring up the results. When an application has not been assigned to a batch queue it will be run interactively.

5.8 INTERFACES TO SRS

Several interfaces to SRS exist, which provide full access to all its functions. They include:

Creating and managing a user session

Performing queries over the databanks

Sorting query result sets

Launching analysis tools

Accessing meta-information

The Web interface is implemented as a Common Gateway Interface (CGI), a stateless server that is invoked for every request. However, using the APIs of SRS Objects it is possible to write stateful and multi-threaded servers.

5.8.1 The Web Interface

The most popular access to SRS is through a Web interface. With it the user creates a session that can be temporary or permanent. Within the session the results of many user actions are stored. These include queries, tool launches, and creations of views. The Web interface provides several query forms, one of which is the highly customizable canned query form that allows administrators to set up intuitive forms that enable an inexperienced user to launch even complex queries.

5.8.2 SRS Objects

SRS Objects is a package of object-oriented interfaces to SRS. It is designed for software developers who want to access the functionality of SRS from within their own object-oriented application. SRS Objects includes four language-specific APIs, which are:

C++

Java

Perl

Python

SRS Objects also includes the SRS Common Object Request Broker Architecture (CORBA) Server, compliant with the CORBA 2.4 specification, which is generally referred to as SRSCS.

The C++ API represents the foundation both for the other three APIs (generated automatically from the C++ declarations using the public domain SWIG package) and SRSCS, whose interfaces and operations wrap the C++ API classes and methods. As a consequence, in terms of SRS interaction, the four APIs and SRS CORBA Server are almost identical and provide the same types and method signatures.

The package SRS Objects provides the following major functionalities:

Creation of temporary or permanent SRS sessions and interaction with them

Access to meta-information about the installed databank groups, databanks, tools, links, etc.

Querying of databanks using the SRS query language

Accessing databank entries in a variety of ways

Launching of analysis tools and managing their results

Use and dynamic creation of the SRS object loaders

Working with the SRS Objects manager system to create and use dynamic types

In addition, SRS Objects abstracts from the developer tasks such as the initialization of the SRS system, SRS memory management, and SRS error handling.

Central to SRS Objects, as in the Web server, is the session object. It must always be created at the beginning of the program. As in the Web server, the session is associated to a directory where the results of the user actions are stored. This means the Web client and a program written with SRS Objects can share a session and its contents.

The following program example in Perl illustrates the use of SRS Objects. It starts by creating a session object, then queries all Swiss-Prot entries with kinase in the description field, and finally prints a few attributes for each entry in the result.

5.8.3 SOAP and Web Services

Currently SRSCS is the only client server interface to SRS. The others are in-process APIs and require the client application to be run on the same computer as the SRS server. CORBA is well suited for client server applications on the same local area network (LAN), but it is of much more limited use across the Internet or an intranet. The simple object access protocol (SOAP) and the Web Services standard are much better suited for this type of application and are also very compatible with SRS functionality. A Web Services interface, which will provide the same functionality as the existing SRS Objects APIs is currently being built.

5.9 AUTOMATED SERVER MAINTENANACE WITH SRS PRISMA

SRS Prisma is an extension package for SRS that can assist a site administrator with the sometimes onerous task of keeping the flat files, XML files, and indices for installed libraries as up to date as possible. This is done by comparing the status of the local files and indices with the corresponding data files at an appropriate remote FTP site. Any files or indices found to be out of date are replaced by downloading new data and/or by rebuilding the appropriate indices. In addition, SRS Prisma can be used as a more general data management tool, carrying out tasks such as reformatting newly downloaded data files, or creating new data files from existing SRS data files and indices. SRS Prisma can be used on an ad hoc basis by the administrator, but it is also ideal for daily scheduling to ensure that all databases are kept as up to date as possible. To assist the administrator in monitoring the completion status of any update processes, Prisma creates a complete archive of Web reports from up to 31 days prior, including easy-to-use graphical views.

In a situation where many databases need to be updated and where a large range of tasks is involved (from downloading, to indexing, to data reformatting), Prisma will determine the minimum number of tasks to be carried out and the dependencies between these tasks. For example, the building of a link index requires the to and from indices to be up to date. In such a case, the link task would be delayed until any required rebuilding of the to and from indices was complete. In the event that any of the required tasks fails, the architecture employed by Prisma ensures that any other tasks that do not depend on failed tasks are completed. The Prisma job will finish when all the tasks that can be completed have been done. For instance, if the download phase fails for SWISSNEW, the downloading and indexing of other databases should be unaffected. Other important features of SRS Prisma are as follows:

Prisma allows a failed job to be re-run from the point at which it failed, thereby minimizing the repetition of tasks, which can be time-consuming and processor-intensive. For example, if the download phase for a particular databank has failed due to a transient external problem (e.g., a problem accessing the relevant FTP site), the Prisma job can be re-run once this problem has been resolved. In such a case only the failed tasks and those dependent on them will be run.

Tasks can be carried out in parallel to optimize performance on multiple processor machines. This type of parallelization includes indexing/merging and downloading. If a databank consists of several files that can be indexed in parallel, Prisma will interleave downloading and indexing of these files.

Offline processing of downloads and indexing ensures that during the updating job the SRS server continues to function in an uninterrupted way. The new databanks and indices are only moved online after completion of the entire job.

Staged Prisma runs allow controlled and automated decision making to ensure robustness and minimized maintenance. This allows Prisma to bring the update job to an end even if individual tasks fail.

An integral part of Prisma is a facility to check the quality of all integrated databanks. Every day every databank is checked for configuration errors, compliance of flat file data to the rules of the token server, the validity of the schema information that SRS holds for relational databanks, and so forth.

Prisma is normally set up to run every night. It provides extensive reporting for the jobs it ran during the night and all the quality checks. Prisma archives the reports within a sliding window of 31 days.

Apart from relational databanks that can be accessed through a network connection, flat files and XML sources must be on the same LAN as the SRS server. This is expensive because the storage must be provided, but it guarantees stability and speed. SRS Prisma can keep all local copies of the databanks current in a completely automated fashion, checking every day the integrity of the system and the consistency of each databank and tool.

5.10 CONCLUSION

SRS can integrate the main sources of structured or semi-structured data, flat file databanks, XML files, relational databanks, and analysis tools. It provides technology to access these data, but also to transform them to a common mind-set. Data from the different sources will look and behave in exactly the same way, effectively shielding users from the complexities of the underlying data sources. This is also true for developers who use SRS APIs to write custom programs. SRS forms a truly scalable data and analysis tool integration platform onto which developers can build new databases, analysis tools, user views, and canned queries to tailor the environment to the needs of their company or institution.

Using bi-directional and high-speed links, SRS transforms the multitude of integrated databanks into a network, which paves the way for the full exploration of the relationships between the data sources (e.g., through cross-databank queries). The different sources can be combined using object loaders, which are able to build data objects by extracting data fields from the entire network.

The federated approach to integration, in combination with the use of metadata, means that data can be maintained in its original format. This is important so there is no data loss due to normalization or reformatting. Object loaders can be designed either to provide standardized access to diverse data sources or to extract information transparently from across the entire databank network. SRS, therefore, is both capable of supporting the native structure of databanks and abstractions or unified versions. It supports data in their native format, but it also supports standards derived from them or imposed onto them.

SRS does not improve the data it integrates, nor does it create a super schema over the data, but with its linking capability and object loaders, it provides the perfect framework for the semantic integration of the data sources in bioinformatics. The inherent flexibility and extensibility of SRS means that bioinformaticians can use SRS as a solid foundation for development where they can incorporate their own data and knowledge of the scientific domain to provide a truly comprehensive view of genomic data.

REFERENCES

[1] Benson, D., Karsch-Mizrachi, I., Lipman, D., et al, GenBank. Nucleic Acids Research. 2000;1(no. 28):15–18. http://www.ncbi.nlm.nih.gov/Genbank

The Gene Ontology consortium, http://www.geneontology.com.

[3] MySQL open source database, http://www.mysql.com.

[4] Incyte LifeSeq Foundation database. http://www.incite.com/sequence.foundation.shtml.

[5] Stoesser, G., Baker, W., van den Broek, A., et al. The EMBL Nucleotide Sequence Database. Nucleic Acids Research. 2002;30(no. 1):21–26.

[6] O’Donovan, C., Martin, M.J., Gattiker, A., et al. High-Quality Protein Knowledge Resources: SWISS-PROT and TrEMBL. Briefings in Bioinformatics. 2002;3(no. 3):275–284.

[7] Octoberhttp://www.ncbi.nlm.nih.gov/BLAST. Altschul, S., Gish, W., Miller, W., et al. Basic Local Alignment Search Tool. Journal of Molecular Biology. 1990;215(no. 3):403–410.

[8] Pearson, W.R. Flexible Sequence Similarity Searching with the FASTA3 Program Package. Methods in Molecular Biology. 2000;132:185–219.

[9] Rice, P., Longden, I., Bleasby, A., EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics. 2000;16(no. 6):276–277. http://www.emboss.org

[10] Mulder, N.J., Apweiler, R., Attwood, T.K., et al, InterPro: An Integrated Documentation Resource for Protein Families, Domains and Functional Sites. Briefings in Bioinformatics. 2002;3(no. 3):225–235. http://www.ebi.ac.uk/interpro/

[11] Zodobnov, E.M., Lopez, R., Apweile, R., et al, The EBI SRS Server: New Features. Bioinformatics. 2002;18(no. 8):1149–1150. http://www.srs.ebi.ac.uk

[12] Kreil, D.P., Ezold, T. DATABANKS: A Catalogue Database of Molecular Biology Databases. Trends in Biochemical Sciences. 1999;24(no. 4):155–157.

[13] Celera Discovery System, http://www.celeradiscoverysystem.com.

[14] NetAffx Analysis Center from Affymetrix. http://www.netaffx.com.

[15] Thomson Derwent GENESEQ portal, http://www.derwent.com/geneseqweb/.

[16] Perl, http://www.perl.org, http://www.perl.com.

[17] Jensen, K., Wirth, N. Pascal User Manual and Report, second edition. Heidelberg, Germany: Springer-Verlag; 1974.

[18] October 6 Bray, T., Paoli, J., Sperberg-McQueen, C.M., et al, Extensible Markup Language (XML) 1.0: World Wide Web Consortium (W3C) Recommendation. 2nd edition. 2000. http://www.w3.org/TR/REC

[19] November 16 Clark, J., XSL Transformations (XSLT) Version 1.0: World Wide Web Consortium (W3C) Recommendation. 1999. http://www.w3.org/TR/xslt

[20] Michael Kay’s DTDGenerator Utility. http://users.iclway.co.uk/mhkay/saxon/saxon5-5-l/dtdgen.html.

[21] LabBook, Inc. BSML XML format, http://www.labbook.com.

[22] SUN Microsystems. Java Database Connectivity (JDBC). http://java.sun.com/products/jdbc.

[23] Oracle database, http://www.oracle.com.

[24] MySQL, http://www.mysql.com.

[25] Microsoft SQL Server. http://www.Microsoft.com/sql/.

[26] IBM. DB2 Database Software, http://www-3.ibm.com/softward/data/db2/.

[27] Bairoch, A., The ENZYME Database in 2000. Nucleic Acids Research. 2000;28(no. 7):304–305. http://www.expasy.ch/enzyme/

[28] November 16 Clark, J., DeRose, S., XML Path Language (XPath) Version 1.0: World Wide Web Consortium (W3C) Recommendation. 1999. http://www.w3.org/TR/xpath

[29] . The Office Web Components Add Analysis Tools to Your Web Page. Microsoft: ***, 2003. http://office.microsoft.com/Assistance/2000/owebcoml.aspx

[30] Thompson, J.D., Gibson, T.J., Plewniak, F., et al. The ClustalX Windows Interface: Flexible Strategies for Multiple Sequence Alignment Aided by Quality Analysis Tools. Nucleic Acids Research. 1997;24:4876–4882.

[31] Miller, R.T., Christoffels, A.G., Gopalakrishnan, C., et al. A Comprehensive Approach to Clustering of Expressed Human Gene Sequence: The Sequence Tag Alignment and Consensus Knowledge Base. Genome Research. 1999;9(no. 11):1143–1155.

[32] Platform LSF. http://www.platform.com/products/wm/LSF/.

[33] Network Queuing System (NQS). http://umbc7.umbc.edu/nqs/nqsmain.html.

[34] Distributed Queuing System (DQS). http://www.scri.fsu.edu/∼pasko/dqs.html.

[35] Sun ONE Grid Engine, http://www.sun.com/software/gridware/.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 5: SRS: An Integration Platform for Databanks and Analysis Tools in Bioinformatics

Create new playlist

Sign In

Sign Up