Chapter 13. Building Biological Databases

Since the advent of the World Wide Web, biological databases have become a vital part of the biological literature. Knowing how to find information in and download information from the central biological data repositories is as important a skill for researchers now as traditional literature searching. Major online data resources, such as the Protein Data Bank and GenBank, are expertly designed to provide information to users who have no understanding of how the underlying databases function, and to allow the deposition of data to a central repository by people who wouldn't know how to, or want to, build their own private databases.

However, as web databases become more integral to sharing information within the scientific community, it is likely that more people will want to develop their own databases and allow their colleagues to access their data directly. Even something as simple as a web site for a research group can be improved greatly and made easier to maintain with databases. In this chapter, we introduce some elementary database terminology and give an example of how to set up a database for a simple data set.

If you're relatively new to the world of computers and software, you're not going to be able to read this chapter and proceed directly to setting up your own database. What we hope to give you is an idea of the steps involved in developing a database: designing a data model, choosing a database management system (DBMS), implementing your data model, and developing a user-friendly frontend to your database. What this chapter will give you is a general understanding of the issues in database development. That understanding will help you to move forward, whether on your own or with the help of a database expert.

You don't need to understand what makes a database tick in order to use it. However, providing access via the Web to data you generate is becoming more and more important in the biology community, and to do that you have to have at least a rudimentary knowledge of how databases work. Even if you've got enough money lying around the lab to spring for your own Oracle administrator, you still need to speak the language.

Types of Databases

There are two types of database management systems: flat file indexing systems and relational DBMSs. A third type, the object-oriented DBMS, is beginning to increase in popularity. Choosing to use a flat file indexing system or a relational database system is an important decision that will have long-range implications for the capacity and usefulness of your database.

Flat File Databases

Flat file databases are the easiest type of database for nonexperts to understand. A flat file database isn't truly a database, it's simply an ordered collection of similar files, usually (but not always) conforming to a standard format for their content. The emphasis in formatting data for a flat file database is at the character level; that is, at the level of how the data would appear if it were printed on a page.

A collection of flat files is analogous to having a large filing cabinet full of pieces of paper. Flat file databases are made useful by ordering and indexing. A collection of flat files on a computer filesystem can be ordered and stored in labeled folders exactly the same way as a collection of printed papers are ordered in a file cabinet drawer (Figure 13-1). When we suggested, in an earlier chapter, using the hierarchical nature of your filesystem and a sensible file-naming scheme to keep track of your files, what we were essentially encouraging you to do is to develop a rudimentary flat file database of your work. Creating a database means you can remember the rules of the database rather than the locations of individual files and so find your way around more easily.

The relationship of a flat file to a flat file database

Figure 13-1. The relationship of a flat file to a flat file database

Flat file databases are often made searchable by indexing. An index pulls out a particular attribute from a file and pairs the attribute value in the index with a filename and location. It's analogous to a book index, which for example tells you where in a book you will find the word "genome." Like book indexes, database indexes need to be carefully designed so that they point to a word only when it occurs in an informative context. Database indexes take note of context by separately indexing different fields within the file. The word "cytochrome" occurring in the Molecule Name field in a protein structure file is likely to be far more significant to the user than the same word occurring only in the file remarks. In the first context, finding the word "cytochrome" guarantees the file contains information for some kind of cytochrome molecule. In the second context, the word can appear as part of an article title or a comment about intermolecular interactions, even though the structure in the file actually belongs to a different molecule. If multiple indexes for a file are created, you can then search a particular index file based on keywords, which is less cumbersome than searching all the actual files in the database file by file.

There's nothing inherently bad about flat file databases. They do organize data in a sensible way, and with the proper indexing they can be made extensively searchable. However, as flat file collections grow larger and larger, working with them becomes inefficient. An index is one-dimensional, so it is difficult (though not impossible) to make connections between attributes within an indexed flat file database.

Flat file databases in biology

Many of the popular biological databases began as flat file databases, and it's because of their legacy that many of the programs and software packages we discussed in previous chapters have strict internal requirements for the line format of input data.

For example, the PDB began by using flat files in the well-known PDB format. The format of these flat files was designed to be read easily by FORTRAN programs, and in fact has its roots in the time when computer input data was encoded on punch cards. When there were just a few protein structure files, maintaining this database and accessing it was no problem. The PDB did not grow beyond a few hundred files until 1990, nearly 20 years after its inception.

As PDB growth increased in the 1990s, new solutions for storing data needed to be found. In practical terms, the full listing of the database was starting to be so long that, if a user entered a directory containing all the available PDB files and tried to list filenames, it could take several seconds to even produce a file list. Reading the contents of large directories slows down even simple Unix tools such as ls, and it is even more of a problem for computer programs that might repeatedly read a directory. At first, the PDB was split into subdirectories based on the letters of the PDB code. But as the database approached 8,000 entries, even that began to prove too cumbersome.

The PDB now uses an object-oriented database backend (the part of the operation that resides on the PDB servers and that users don't see) to support database queries and file access. However, files are still made available in the legacy PDB format, so that users can continue to work with software that was developed long before the PDB was modernized.

Beyond the PDB, flat file databases are still widely used by biologists. Many users of biological sequence data store and access sequences locally using the S equence Retrieval System (SRS), a flat file indexing system designed with biological data in mind.

Relational Databases

Like flat file databases, relational databases are just a way of collecting all the information about something and storing it in a computer. In a flat file database, all the information about the thing is stored in one big structured text file. In a relational database, the information is stored in a collection of tables.

The flat file that describes a protein structure is like a bound book. There are chapters about the origin of the sample, how the data was collected, the sequence, the secondary structure, and the positions of the atoms.

In a relational database, the information in each chapter is put into separate tables, and instead of having its own book, each protein has its own set of tables. So, there are tables of experimental conditions, secondary structure elements, atomic positions, etc. All these tables are labeled with the identity of the protein they describe, so that connections can be made between them, but they aren't bound together like a book. The form of the tables follows rules that are uniform across the database, so you can access all the tables about atomic positions or all the chapters about experimental conditions at once, just as easily as you can access all the tables about a particular protein.

If you're interested in only one particular protein, it's not at all inconvenient to go to the library (the PDB), look the book up in the catalog, and read it straight through. The librarian can pick a few items of information out of the book (such as the name of the protein, the author who deposited it, etc.) and put them in an index (like a card catalog) that will help you find where the book is on the shelf.

But what if you're interested in getting the secondary structure chapter out of every book in the protein library? You have to go to the library, take down every book from the shelf, photocopy the secondary structure chapter, and then convert that information into a form that you can easily analyze.

A relational database management system (RDBMS) allows you to view all of the protein structure data in the database as a whole. You can "look" at the database from many different "angles," and extract only the information you need, without actually photocopying a particular chapter out of each book. Since each separate item of information about the protein is stored in its own separate table in the database, the RDBMS can assemble any kind of book about proteins you want, on the fly. If you want a book about hemoglobin, no problem. Even better, it is just as easy for the RDBMS to make you a book about the secondary structures of all proteins in the database.

All you need to do is figure out how to structure the right query to get back what you want from the database. If you want a book about hemoglobin, you can tell the RDBMS "if protein name equals hemoglobin then give me all information about this protein." If you want a book that describes only the secondary structure of each hemoglobin entry in the database, you can tell the RDBMS "if protein name equals hemoglobin then give me the secondary structure table about this protein."

How tables are organized

Data in a relational database table is organized in rows, with each row representing one record in the database. A row may contain several separate pieces of information (fields). Each field in the database must contain one distinct piece of information. It can't consist of a set or list that can be further broken into parts.

The tables in a relational database aren't just glorified flat files, though they may look that way if you print them out. Rows are synonymous with records, not with 80 characters on a line. Fields in each row aren't limited by a number of characters; they end where the value in the field ends. The job of the RDBMS is to make connections between related tables by rapidly finding the common elements that establish those relationships.

You can get an idea of the difference between data organized into tables and character-formatted flat file data by comparing the two types of protein structure datafiles available from the PDB. The standard PDB file is ordered into a series of 80 character lines. Each line is labeled, but especially in the header, the information associated with a label is quite heterogeneous. For example:

REMARK  1                                                         4HHB  14
REMARK  1 REFERENCE 1                                             4HHB  15
REMARK  1  AUTH   M.F.PERUTZ,S.S.HASNAIN,P.J.DUKE,J.L.SESSLER,    4HHB  16
REMARK  1  AUTH 2 J.E.HAHN                                        4HHB  17
REMARK  1  TITL   STEREOCHEMISTRY OF IRON IN DEOXYHAEMOGLOBIN     4HHB  18
REMARK  1  REF    NATURE                        V. 295   535 1982 4HHB  19 
REMARK  1  REFN   ASTM NATUAS  UK ISSN 0028-0836             006  4HHB  20
REMARK  1 REFERENCE 2                                             4HHB  21
REMARK  1  AUTH   G.FERMI,M.F.PERUTZ                              4HHB  22
REMARK  1  REF    HAEMOGLOBIN AND MYOGLOBIN.    V.   2       1981 4HHB  23
REMARK  1  REF  2 ATLAS OF MOLECULAR                              4HHB  24
REMARK  1  REF  3 STRUCTURES IN BIOLOGY                           4HHB  25
REMARK  1  PUBL   OXFORD UNIVERSITY PRESS                         4HHB  26
REMARK  1  REFN              ISBN 0-19-854706-4              986  4HHB  27

In the PDB reference records shown here, you can see that entries in each row aren't distinct pieces of information, nor are the rows uniform. Sometimes there are four author names on one line; sometimes there are two. Sometimes there are three title lines; sometimes there is only one. This can cause difficulties in parsing, or reading the header with a computer program.

Compare this to an mmCIF file. mmCIF is a new data standard for results of X-ray crystallography experiments. Protein structures have been available from the PDB in mmCIF format since the management of the PDB changed in 1999.

Before you see any data in the mmCIF file, you see what looks almost like a series of commands in a computer program, lines that describe how the data in the file is to be read. Then you'll see tables of data. Here's an example:

loop_
_citation.id
_citation.coordinate_linkage
_citation.title
_citation.country
_citation.journal_abbrev
_citation.journal_volume
_citation.journal_issue
_citation.page_first
_citation.year
_citation.journal_id_ASTM
_citation.journal_id_ISSN
_citation.journal_id_CSD
_citation.book_title
_citation.book_publisher
_citation.book_id_ISBN
_citation.details

primary   yes
; THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN AT
    1.74 ANGSTROMS RESOLUTION
;
UK 'J.MOL.BIOL.                 '  175  ?     159   1984
'JMOBAK         ' '0022-2836                '  070 ? ? ? ?

    1       no
; STEREOCHEMISTRY OF IRON IN DEOXYHAEMOGLOBIN
;
UK 'NATURE                      '  295  ?     535   1982
'NATUAS         ' '0028-0836                '  006 ? ? ? ?

2       no
?   ? ?    2 ?     ? 1981 ? ?  986
; HAEMOGLOBIN AND MYOGLOBIN.
  ATLAS OF MOLECULAR
  STRUCTURES IN BIOLOGY
;
;  OXFORD UNIVERSITY PRESS
;
'0-19-854706-4            ' ?

An mmCIF file contains dozens of tables that are all "about" the same protein.

The opening lines of the reference section in the mmCIF file (which is just a flat representation of the collection of tables that completely describes a protein structure) describe what the fields in each upcoming row in the table will mean. Rows don't begin arbitrarily at character 1 and end at character 80; they may stretch through several "lines" in the printout or onscreen view of the data. Rows don't end until all their fields are filled; when information is missing (as in the previous example), the fields have to be filled with null characters, such as a question mark or a space.

In the protein database, the table of literature references that describes a particular structure is associated with a particular PDB ID. However, there are other tables associated with that PDB ID as well, and they have totally different kinds of rows from the reference table. The atomic positions that describe a protein structure are contained in a separate table with a completely different format:

loop_
_atom_site.label_seq_id
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_seq_id
_atom_site.label_alt_id
_atom_site.cartn_x
_atom_site.cartn_y
_atom_site.cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.label_entity_id
_atom_site.id
1
 ATOM  N  N    VAL A    1  .   6.204  16.869   4.854  7.00 49.05  .    1  1 1
 ATOM  C  CA   VAL A    1  .   6.913  17.759   4.607  6.00 43.14  .    1  2 1 
 ATOM  C  C    VAL A    1  .   8.504  17.378   4.797  6.00 24.80  .    1  3 1
 ATOM  O  O    VAL A    1  .   8.805  17.011   5.943  8.00 37.68  .    1  4 1 
 ATOM  C  CB   VAL A    1  .   6.369  19.044   5.810  6.00 72.12  .    1  5 1
 ATOM  C  CG1  VAL A    1  .   7.009  20.127   5.418  6.00 61.79  .    1  6 1
 ATOM  C  CG2  VAL A    1  .   5.246  18.533   5.681  6.00 80.12  .    1  7 2
 ATOM  N  N    LEU A    2  .   9.096  18.040   3.857  7.00 26.44  .    1  8 2
 ATOM  C  CA   LEU A    2  .  10.600  17.889   4.283  6.00 26.32  .    1  9 2
 ATOM  C  C    LEU A    2  .  11.265  19.184   5.297  6.00 32.96  .    1 10 2
 ATOM  O  O    LEU A    2  .  10.813  20.177   4.647  8.00 31.90  .    1 11 2
 ATOM  C  CB   LEU A    2  .  11.099  18.007   2.815  6.00 29.23  .    1 12 2
 ATOM  C  CG   LEU A    2  .  11.322  16.956   1.934  6.00 37.71  .    1 13 2
 ATOM  C  CD1  LEU A    2  .  11.468  15.596   2.337  6.00 39.10  .    1 14 2
 ATOM  C  CD2  LEU A    2  .  11.423  17.268    .300  6.00 37.47  .    1 15 

The values in the atom table are clearly related to the values in the reference table; they both contain information about the same PDB structure. However, the two types of data can't just be put together into one big table. It doesn't make sense to put the reference information into the same scheme of rows and columns the atom information goes into, either by tacking it on at the "bottom" of the table or by adding extra columns (although in flat files we are forced to do exactly that!). The two datatypes are related, but orthogonal to each other.

Anywhere in a set of information where it becomes impossible to sensibly tack rows or columns onto a table, a new table needs to be created.[*] Tables within a database may have interconnections only at the topmost level, such as the atom and reference information related to the same PDB file, or they may be more closely linked.

You may notice in the reference records two pages back that authors' names aren't listed. How can that be? Well, the answer is that they're in a separate table. Because each reference can have an arbitrary number of separate authors, that information can't just be tacked onto the reference table by adding a fixed number of extra rows or columns. So there's a separate table for authors' names:

loop_
_citation_author.citation_id
_citation_author.name
 primary   'Fermi, G.'
 primary   'Perutz, M.F.'
 primary   'Shaanan, B.'
 primary   'Fourme, R.'
   1       'Perutz, M.F.'
   1       'Hasnain, S.S.'
   1       'Duke, P.J.'
   1       'Sessler, J.L.'
   1        'Hahn, J.E.'
   2       'Fermi, G.'
   2       'Perutz, M.F.'
   3       'Perutz, M.F.'
   4       'TenEyck, L.F.'
   4       'Arnone, A.'
   5       'Fermi, G.'
   6       'Muirhead, H.'
   6       'Greer, J.'

This table is related to the previous reference table through the values in column 1, which match up with the citation IDs in the other reference table. To get from "Fermi, G." to "THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN AT 1.74 ANGSTROMS RESOLUTION" in this database, you connect through the citation ID, which specifies the relationship between the two entities.

Using an RDBMS may at first seem like an overthinking of what could be a pretty simple set of data to store. If you ever write programs that operate on the antiquated flat-file PDB format, though, you'll realize how useful it might be to unambiguously assign your data to tables in a relational database. Among other things, databases eliminate the need for complicated line-format statements and parsing operations that are required when using 80 character-formatted files.

The database schema

The network of tables and relationships between them that makes up a database is called the database schema . For a database to keep its utility over time, it's best to carefully develop the schema before you even think about beginning to populate the database. In the example later in this chapter, we develop a schema for a simple database.

Getting your brain around database schemas and tables can be a challenge without even coming up with your own schema. However, relational databases are the standard for large database operations, and understanding RDB concepts is necessary for anyone who wants to build her own. Before designing your own database, you should definitely consult a reference that covers relational databases rigorously.

Object-Oriented Databases

You'll hear the phrase object oriented in connection with both programming languages and databases. An object-oriented database system is a DBMS that is consistent with object-oriented programming principles. Some important characteristics of object-oriented databases are: they are designed to handle concurrent interactions by multiple clients; they can handle complex objects (beyond tables of character data); and they are persistent—that is, they survive the execution of a process. In practice, because of the popularity of object-oriented programming strategies, most of the major relational DBMSs are compatible with an object-oriented approach to some extent.

The practical upshot of the object-oriented approach in the database world is the emergence of DBMSs that are flexible enough to store more than just tables and to handle functions beyond those in a rigidly defined query-language vocabulary. Since object-oriented databases handle data as objects rather than as tables, an object-oriented database can provide access to everything from simple text-format data to images and video files within the same database. Object-oriented databases don't force the use of the SQL query language, but rather provide flexible bindings to programming languages. Many DBMSs are beginning to have both object and relational characteristics, but the giants of the DBMS world are still primarily relational DBMSs.

Database Software

Databases don't just happen: they're maintained by DBMSs. There are several DBMSs, some open source and some commercial. There are flat file indexing systems, RDBMSs, object DBMSs (ODBMSs), and object-relational hybrids. Which DBMS you use depends on what you can afford, how comfortable you are with software, and what you want to do.

Sequence Retrieval System

Even if you've decided to work with a flat file indexing and retrieval system, you don't need to reinvent the wheel. The Sequence Retrieval System (SRS) is a popular system for flat file management that has been extensively used in the biology community, both in corporate and academic settings. SRS was developed at EMBL specifically for use in molecular biology database applications, and is now available as a commercial product from Lion Bioscience, http://www.lionbioscience.com. It is still offered for free to researchers at academic institutions, along with extensive documentation (but no tech support). A common application of the SRS database is to maintain a local mirror of the major biological sequence databases. The current release is SRS 6.

SRS can be installed on SGI, Sun, Compaq, or Intel Linux systems. To maintain your own SRS database and mirror the major biological databases requires tens of gigabytes of disk space, so it's not something to be taken on lightly. SRS has built-in parsers that know how to read EMBL nucleotide database files, SWISS-PROT files, and TrEMBL files. It's also possible to integrate other databases into SRS by using SRS's Icarus language to develop additional databank modules. For an example of the variety of databases that can be integrated under an SRS flat file management system, you only have to look at the SDSC Biology Workbench. Until its most recent release, SRS was the DBMS used within the Biology Workbench, and supported nearly the full range of databases now integrated into the Workbench.

Oracle

Oracle is the 18-wheeler of the RDBMS world. It's an industry-standard, commercial product with extremely large capacity. It's also rapidly becoming a standard for federally funded research projects. Oracle has some object capacities as well as extensive relational capacities. Potential Oracle customers can now obtain a license to try Oracle for free from http://www.oracle.com. If you want to provide a large-scale data resource to the biology community, you may need an Oracle developer (or a bunch of them) to help you implement it.

PostgreSQL

PostgreSQL is a full-featured object-relational DBMS that supports user-defined datatypes and functions in addition to a broad set of SQL functions and types. PostgreSQL is an open source project, and the source code can be downloaded for free from http://www.postgresql.org, which also provides extensive online documentation for the DBMS. PostgreSQL can also be found in most standard Linux distributions. If you plan to create a database that contains data of unusual types and you need a great degree of flexibility to design future extensions to your database, PostgreSQL may meet your needs better than MySQL. PostgreSQL is somewhat limited in its capacity to handle large numbers of operations, relative to Oracle and other commercial DBMSs, but for midrange databases it's an excellent product.

Open Source Object DBMS

Several efforts to develop open source ODBMSs are underway as of this writing. One of the most high profile of these is the Ozone project (http://www.ozone-db.org). Ozone is completely implemented in Java and designed for Java developers; queries are implemented in the underlying language rather than in SQL. One emphasis in Ozone development is object persistence, the ability of the DBMS to straightforwardly save the states of a data object as it is affected by transactions with the database user. Like many ODBMSs, Ozone is in a relatively early stage of development and may not be particularly easy for a new user to understand. Unless you have a compelling reason to use object-oriented principles in developing your database, it's probably wise to stick with relational database models until object technology matures.

MySQL

MySQL is an open-source relational DBMS. It's relatively easy to set up and use, and it's available for both Unix and Windows operating systems. MySQL has a rich and complex set of features, and it's somewhat different from both PostgreSQL and Oracle, two other popular RDBMSs. Each system recognizes a different subset of SQL datatypes and functions, and none of them recognizes 100% of the possible types. MySQL sets lower limits on the number of operations allowed than PostgreSQL and Oracle do, in some cases, so it's considered suitable for small and medium-sized database applications, rather than for heavy-duty database projects. However, this isn't a hard and fast rule: it depends on what you plan to do with the data in your database. MySQL is strictly a relational DBMS, so if you plan to store unusual datatypes, it may not be the right DBMS for you. For most standard database applications, however, MySQL is an excellent starting point.

MySQL's developers claim that it can manage large databases faster than other RDBMSs. While their benchmarks seem to bear out this claim, we haven't independently evaluated it. What we can say is that it's possible to learn to use MySQL and have a rudimentary database up and running within a few hours to a few days, depending on the user's level of experience with Unix and SQL.

Introduction to SQL

As a practical matter, you are most likely to work either with specialized flat file database systems for biological data, like SRS, or with some kind of RDBMS. In order to work with an RDBMS, you need to learn something about SQL.

SQL, or Structured Query Language (usually pronounced "see-kwl" by those in the know, ess-que-ell by literalists, and "squeal" by jokers) is the language RDBMSs speak. SQL commands are issued within the context of a DBMS interface; you don't give a SQL command at the Unix command line. Instead, you give the command to the DBMS program, and the program interprets the command.

SQL commands can be passed to the DBMS by another program (for instance, a script that collects data from a web form) or hand-entered. Obviously, the first option is the ideal, especially for entering large numbers of records into a database; you don't want to do that by hand. We can't teach you all of the ins and outs of programming with SQL, however; in this section we'll just focus on the basic SQL commands and what they do. Later on, we'll show an example of a web-based program that can interact with a SQL database.

SQL commands read like stilted English with a very restricted vocabulary. If you remember diagramming sentences in high-school English class, figuring out subject-verb-object relationships and conditional clauses, SQL should seem fairly intuitive. The challenge is remembering the restrictions of vocabulary and syntax, and constructing queries so that your DBMS can understand them. A SQL statement might read something like this:[†]

SELECT program FROM software WHERE program LIKE 'blast'

This says "select the names of programs from the list of software where the name of the program is like blast." This is something you might want to do if you use a searchable database of bioinformatics software.

As mentioned above, all DBMSs aren't created equal. There is a SQL standard vocabulary, called SQL 92; however, most systems implement only parts of this standard. You need to study the documentation for the particular DBMS you're using so you don't confuse it by giving unrecognized commands.

SQL Datatypes

The notion of a datatype is simple to understand. A datatype is an adjective that describes the data stored in a particular column of a table. In general, data stored in a table can consist of either numeric values or character strings. SQL, however, defines a multitude of subtypes within these common datatypes, mostly variants that set different upper limits on the size of a text field or numerical field, but also special numeric types such as DATE or MONEY.

When you create tables in a database, you need to define the type of each column. This means you need to know from the beginning, as you are setting up your data model, what type of data will be contained in each column. You should also have a rough idea of the likely upper and lower limits for your data, so that you can select the smallest possible type to contain them. For instance, if you know that the integer values in a column in your table will never be greater than 255, you should use the smallest possible integer type, TINYINT, for that column, rather than making space for much larger values in your database when you won't actually need that space. On the other hand, if that value will eventually grow beyond 255, then you should choose a type that allows a broader range of values for that column. Setting up a relational database requires quite a bit of intelligent forethought.

Here are some of the most popular SQL types, all of which are supported in most major RDBMS programs:

INT

An integer number. Variations include TINYINT, SMALLINT, MEDIUMINT, and BIGINT. Each of these allows a different range of numerical values.

FLOAT

A floating-point number. Maximum value on the order of 3 E 38; minimum value on the order of 1.7 E-38.

REAL

A longer floating-point number. Maximum value on the order of 2 E 308; minimum value on the order of 2 E-308.

CHAR

A fixed-length text string. Values shorter than the fixed length are padded with spaces.

TEXT

A variable-length text string with a maximum value. Variations include TINYTEXT, MEDIUMTEXT, and LONGTEXT.

BLOB

A variable-length binary field with a maximum value. Variations include TINYBLOB, MEDIUMBLOB, and LONGBLOB. Just about anything can go in a binary field. The maximum size of a LONGBLOB is 4 GB. All sorts of interesting things, such as image data, can go into a binary field.

DECIMAL

A real number that is stored as a character string rather than as a numerical field.

DATE

A date value that stores the year, month, and day.

TIMESTAMP

A time value that updates every time the record is modified.

ENUM

A value that is one of a limited set of options and can be selected using either the option name or a numeric value that represents the name.

SET

A value that is one of a limited set of options.

SQL Commands

SQL has many commands, but it's probably most important for you to know how to create new tables, add data to them, and then search for data in your database. We'll introduce you briefly to the SQL CREATE, ALTER, INSERT, UPDATE, and SELECT commands, as they are implemented in MySQL. The references mentioned in the Bibliography contain full descriptions of the SQL commands available through MySQL.

Adding a new table to a database

New tables are created with the SQL CREATE statement. The syntax of the CREATE statement is simply:

CREATE TABLE tablename (columnname type [modifiers] columnname type 
[modifiers])

If you want to create a table of information about software packages, for the example database we discuss in this chapter, you can do as follows:

CREATE TABLE software_package 
       (packid INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        packname VARCHAR(100)                           
        packurl VARCHAR(255)                         
        function TEXT                        
        keyword ENUM                       
        os SET                         
        format SET                          
        archfile VARCHAR(255) 
)

This command tells MySQL to set up a table in which the first column is an automatically incrementing integer; that is, the DBMS automatically assigns a unique value to each entry you make. The second and third columns are variable-length character strings with preset maximum lengths, in which the name and URL of the software package will be entered. The fourth column is a text field that can contain up to 64 KB of text describing the software package. The fifth column allows you to choose one of 64 preset keywords to describe your software package; the sixth and seventh columns let you choose any number of values from a set of preset values to describe the operating systems the software will run under (e.g., mac, windows, linux) and the type of archive file available (e.g., binary, rpm, source, tar ). The final field is another variable character string that will contain the URL of the archive file.

Changing an existing table

If you create a table and you decide that it should look different than you originally planned, you can use the ALTER command to change it. To add another column to a table, the syntax is:

ALTER TABLE tablename ADD [COLUMN] (columnname type [modifiers])

Adding data to an existing table

The INSERT command adds a new row of data to a table. The syntax of the INSERT command is:

INSERT INTO table ( colname1, colname2, colname3 ) VALUES ( 'value1','value2','value3')

Altering existing data in a table

The UPDATE and REPLACE commands can modify an existing row in a table. Your user privileges must allow you to use UPDATE and REPLACE. These commands can take a WHERE clause, with syntax analogous to that of the SELECT command, so that you can specify under what conditions a record is updated.

Accessing Your Database with the SQLSELECT Command

The SQL SELECT command finds data in a table for you. In other words, SELECT is the command that makes the database useful once you have created it and populated it with data. It can be modified by a conditional clause that lets you determine under what conditions a record is selected from the table.

Choosing fields to select

The general syntax of the SELECT command is:

SELECT [fields] FROM [table] WHERE [clause]

To select all the fields in a particular table, the asterisk character can be used:

SELECT * FROM [table] WHERE [clause]

In this chapter's database example, if you want to select the software package name and software package URL from the software table, the SELECT command is:

SELECT packname, packurl FROM software

Using a WHERE clause to specify selection conditions

The WHERE clause allows you to specify conditions under which records are selected from a table. You can use standard operators, such as =, >=, etc., to set the conditions for your WHERE clause. MySQL also allows you to use the LIKE and NOT LIKE operators for pattern matching.

If you want to set up your SELECT statement to find only software for sequence alignment, it should look like this:

SELECT packname, packurl FROM software WHERE keyword = "sequence alignment";

If you want to find only software packages with names starting with the letter B, the SELECT statement looks like this:

SELECT packname, packurl FROM software WHERE packname LIKE "B%";

The % character is a wildcard character that represents any number of characters, so the software packages you select using this statement can have names of any length as long as the name starts with B.

Joining output from multiple tables

SELECT can also join two related tables. When we talk later about developing databases, you'll find that relationships between tables are created by replicating information called a primary key from one table as a foreign key in another table. If the foreign key in one table matches the primary key in another, the data in the two tables refers to the same record and can be joined to produce one set of output from SELECT. A MySQL SELECT statement for joining two tables might look like this:

SELECT FROM table1, table2 WHERE primarykey1=foreignkey2

For instance, we've already discussed creating one table that lists the names, URLs, and other details about the software packages listed in the database. In order to build the database properly, you have to have another table that lists information about the literature references that describe the functions of the software packages in the database.

What if you want to select only the names and URLs of software packages that were first described in the literature in 1998 or later? The names and URLs are found in the software table; the dates are found in the reference table. Here's the SQL:

SELECT packname, packurl, reference_date FROM software, reference 
WHERE software.package_id = reference.package_id 
AND reference_date >= "1998";

The variable package_id is the primary key from the software table, and it is replicated in the reference table to maintain a relationship between the details of software packages and the references that describe them. If the value of package_id is the same in both tables, the two rows being accessed are part of the same record in the database. Therefore, the first part of the WHERE clause is what joins the two tables. The second part of the WHERE clause (AND reference_date >= "1998") specifies the actual selection condition.

Different database-management systems implement different levels of join functionality, so you will have to check the specific documentation for your DBMS to see how joins work.

Installing the MySQL DBMS

To set up and maintain your own database, you need to have a database server installed on the machine on which the data will be stored.

MySQL is a lightweight relational DBMS that is fairly easy to install and run. We're going to use MySQL to set up the example database, so if you're interested in trying it out, be sure the MySQL server is installed on your machine. If you're using a Red Hat Linux distribution, this is ridiculously easy. If you didn't install MySQL when you set up your machine, simply use kpackage or gnorpm to select the MySQL components you want to install—the server, clients, and development tools. This will probably give you an older version of MySQL; to get the current version and install it easily, use the binary RPMs from the latest stable version at http://www.mysql.com. You'll also want to make sure the Apache web server and PHP support, available from http://www.apache.org, are installed. The next time you restart your machine after the install, the MySQL server daemon,[‡] mysqld, is started, MySQL privilege databases are initialized, and the PHP module is made available to your Apache server.

Setting Up MySQL

When you look at RDBMS software, you usually find you have a choice of setting up a client or a server. The server part of the program runs on the machine on which the data is actually stored. It runs as a daemon on Unix machines; that is, as a system process that is always on, listening for and responding to requests from clients. The MySQL server program is called mysqld. Figure 13-2 shows an example of a client/server architecture.

Client/server architecture

Figure 13-2. Client/server architecture

Clients are programs that connect to the server and request data. Clients can be located on the database server itself, but they also can be located on other machines on which mysqld isn't running and connect over the Internet to the main database.

The MySQL client programs include mysql, the main client that lets you do everything from designing databases to inserting records; mysqladmin, which performs selected administrative functions such as creating new databases and checking the status of the server; and client5. client5 is similar to mysql in that it allows interactive query processing, but for security purposes, it doesn't allow you to add and modify database records.

When we talk about the MySQL DBMS as a whole, we refer to it as MySQL. When we talk about a client program that's part of MySQL, we refer to it by its specific client name.

Using the mysql client program

The mysql program has only a few commands of its own; the commands that are primarily directed to the mysql program or the client5 program are SQL statements. When you are inside the mysql program, the program interprets any SQL statement you give to it as one continuous statement, until the terminating character ";" is read. Here are the mysql commands:

use

Takes a database name as its argument; allows you to change which database is in active use

status

Returns the status of the server

connect

Reconnects with the server

go

Sends a command to the MySQL server; also can be indicated by terminating a SQL statement with g or ;

help

Prints a complete list of mysql commands

Using the mysqladmin client program to set up MySQL

You can get a comprehensive listing of mysqladmin commands with the command:

mysqladmin --help  

Here are the commands you are likely to use frequently:

create

Takes a database name as its argument; creates a new database

drop

Takes a database name as its argument; deletes an entire database

reload

Reloads the grant tables

variables

Prints available variables that describe the MySQL installation

ping

Checks to see if the MySQL server is alive

shutdown

Shuts down the MySQL server on the local machine

Restarting the MySQL server

mysqladmin has an option for shutting down the server. But what about starting it up again? To start your MySQL server, use the Unix su command to become the MySQL administrator, whether that's user mysql or some other user ID. Then, start the MySQL server with the command:

safe_mysqld &

Securing Your MySQL Server

Your MySQL server isn't secure when you first install it from RPMs, although the databases are initialized. To secure your server, you should immediately set a root password for the MySQL installation. This can (and should) be different from your system root password. MySQL usernames and system usernames aren't connected, although server processes do need to run under a user ID that exists on your server. You need to use the mysql program directly to update the user grant table, the main table of permissions for MySQL users. To invoke the mysql program, give the command:

mysql -u root mysql

Your command prompt will change to mysql>, which indicates you are inside the mysql program until you quit using the quit command.

To update the grant tables, type:

UPDATE user SET Password=PASSWORD("your_password") WHERE User="root";

When you issue this command through the mysql program, you're giving a SQL command to update the table user in the database mysql. After you reset the root password, exit mysql and tell MySQL to reread the grant tables with the command:

mysqladmin -u root reload

Now you can reaccess the mysql program and other client programs only if you use the proper root password. To restart the mysql program on the mysql database, give the command:

mysql --user=root --password mysql

You'll be prompted for your password. If you enter the password on the command line, instead of allowing mysql to prompt you for the password, the password can become visible to other users (or hackers) of your system.

If you install MySQL from RPMs on a Linux system, during the installation the mysql user ID is added to your system. This user should own the MySQL data directory and its subdirectories. The MySQL daemon runs as a process started by system user mysql, and access to the database is controlled by that user. You can set the system password for user mysql using the Unix passwd command as root. To set the MySQL password for this user, you may need to use SQL commands to insert the user mysql into the grant tables. The SQL statement that creates the mysql user and grants it global access permissions for all of your databases is:

INSERT INTO user VALUES("localhost","mysql",PASSWORD("your_password"), 
"Y","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y","Y");}

For more on administration and security of MySQL databases, we suggest you consult the pertinent books listed in the Bibliography.

Setting Up the Data Directory

If you install MySQL from RPMs, your data directory is automatically located in /var/lib/mysql. When you set up your workstation, you may not have left much space on the /var partition. If you're going to be doing a lot with databases, you probably want to give the MySQL data directory some room to grow.

An easy way to do this is to relocate the data directory to a different partition and create a symbolic link from that directory to /var/lib/mysql. If you relocate the data directory this way, you don't have to change any MySQL configuration information.

First, choose a location for your data directory. You can, for example, create a directory /home/mysql/data. Then, shut down your MySQL daemon using:[§]

mysqladmin shutdown

Using the Unix mv command, move all the files in /var/lib/mysql to /home/mysql/data. Once the /var/lib/mysql directory is empty, use rmdir to remove it. cd to the /home/mysql directory and type:

chown -Rf mysql:mysql data 

This sets the proper file ownership for all the files in that directory. Finally, use ln -s to create a symbolic link between the /home/mysql/data directory and /var/lib/mysql. Then restart your MySQL server by typing:

safe_mysqld  &

You'll probably need to be logged in as the superuser to do this.

Creating a New Database

Once your MySQL server is installed and running, you need to create a new database and grant yourself the correct permissions to read and write to that database. You can do this as MySQL user mysql, unless you want to create a separate identity for yourself right now. We're going to make a database of bioinformatics resources on the Web, so you need to create a database called resourcedb. To do this, simply type:

mysqladmin --user=mysql --password create resourcedb

Then run mysql on the resourcedb database with the command:

mysql --user=mysql --password resourcedb

Database Design

The example we'll walk you through is a simple example of how to use MySQL to create a searchable database of bioinformatics software web sites.[‖] We'll also talk a little bit about a scripting language called PHP, which allows you to embed commands that let others access your database directly into an HTML file, and about other ways to access your database from the Web.

If you're looking for bioinformatics or computational biology software on the Web, there are several things you'll probably want to know about each item you find and several ways you'll want to query the database. You'll want to know the name of each item and have access to a description of what it does and the URL from which you can download it. You'll probably want to know the author of the item and what papers have been published about it. You may even want to have immediate access to a Medline link for each reference. You'll want to know what operating systems each item works under, and what format it's available in; you may even want a direct link to the archive file. You may also want to be able to search the database by keywords such as "sequence alignment" or "electrostatic potential."

That sounds pretty simple, right? You may be thinking that all that information would go nicely into one table, and a complicated RDBMS isn't needed to implement this kind of database. Figure 13-3 shows what that big table looks like.

The bioinformatics software DB as one big table

Figure 13-3. The bioinformatics software DB as one big table

However, if you look more closely, you'll see it's not really possible for even a simple bioinformatics software database to fit in one table. Remember, data in tables must be atomic ; that is, each cell must contain only one distinct item, not a list or a set.

If you think through the possibilities, you'll realize that there are several places where lists or sets might occur in a bioinformatics software database record: there might be multiple authors, and/or multiple publications describing the software; the software might be available for many different operating systems; and there might be more than one keyword used to describe each item.

On Entities and Attributes

Databases can contain two kinds of information: information that indicates an entity or thing that might have relationships with other things; and information that is purely descriptive of a single entity—attributes of that entity.

In our database example, the one thing we are sure of is that a software package is an entity. Let's begin designing the tables in this database by listing all the information associated with each software package:

Software package name
Software URL
Textual description of function
Descriptive keyword
Operating system
Software format
Archive filename
Reference
Author
Medline link

We may be able to think of more information about each software package in the database, but for the purposes of this example, we'll leave it at that.

Entities can be described by both attributes and relationships to other entities. If an entry in a database has no attributes and no relationships, it shouldn't be considered an entity. One item in our list of facts about each software package definitely has attributes; each reference has an author or authors, and each reference has a Medline link. Therefore, references should be a separate entity in the database. So we'll need at least two tables:

SoftwarePackage
----------------
Software package ID
Software package name
Software URL
Textual description of function
Descriptive keyword
Operating system
Software format
Archive filename

Reference
---------
Reference ID
Reference name
Reference year
Author
Medline link

We've included an "identifier" attribute in each table. Why? Imagine that there are two software packages named BLAST. Both do very different things. They need to be distinguished from each other in our database, but not by creating another table of "things named BLAST." The unique ID allows us to store information about two software packages named BLAST in the same table and distinguish them from each other.

Ideally, we want entities to have either one-to-one relationships or one-to-many relationships with each other. The relationship of references to software packages is a one-to-many relationship: each software package can be described by many references, but each reference describes only one software package (see Figure 13-4). Many-to-many relationships can't be sorted out by the RDBMS software, so they need to be eliminated from the data model before creating a database.

Relationship of package to reference

Figure 13-4. Relationship of package to reference

If you're observant, you'll notice that within the Reference table, there is a many-to-many relationship just waiting to happen. Each author can produce many references, and each reference can have many authors. The presence of that many-to-many relationship indicates that Author should probably be a separate entity, even though we haven't chosen to store any attributes about authors in our current data model. So we actually need a third table:

Reference
---------
Reference ID
Reference name
Medline link 

Author
---------
Author ID
Author Name

Even after we create a new table for the Author entity, though, the relationship between authors and references is still many-to-many. This can be resolved by creating a junction entity that has no purpose other than to resolve the many-to-many relationship between the two. The junction entity could be called AuthorRef, or any other arbitrary name. Its only attributes will be its unique identifier (primary key) and the foreign keys that establish its relationship with the other tables.

Creating a Database from Your Data Model

When you actually create your database, entities become tables. Every attribute becomes a column in the table, and the ID becomes the primarykey for that table. Relationships to information in other tables become foreign keys.

Before relationships are established, the four tables in our database contain the following information:

SoftwarePackage
----------------
Software package ID
Software package name
Software URL
Textual description of function
Descriptive keyword
Operating system
Software format
Archive filename

Reference
----------------
Reference ID
Reference name
Reference date
Medline link

AuthorRef
---------
AuthorRef ID

Author
---------
Author ID
Author Name

Each attribute is a column in the table, and each column must have a datatype. The primary keys can be integer values, but they can't be NULL or empty. The appropriate datatype for the primary key identifiers is thus INT_NOT_NULL; the rest of the fields can be defined as TEXT fields of one size or another.

Creating Relationships Between Tables

To store the relationships between tables in your database, you place the primary key from one table in a column in the other table; that column is called a foreignkey. In our example, the primary key of the SoftwarePackage table is also entered in a column in the Reference table, because there is one software package to many references. The primary key from the Reference table and the primary key from the Author table become foreign keys in the AuthorRef table; there are many AuthorRefs for each author, and many AuthorRefs for each reference.

Once you've worked out what information your tables will contain, and what format each column will be in, you have what is called a physical model of your database and you are ready to create it using SQL CREATE statements, as we demonstrated earlier.

Developing Web-Based Software That Interacts with Databases

The purpose of public biological databases is to allow the biology community to share data in a straightforward manner. Nothing is more straightforward than the Web. Therefore, it's almost a given in developing a database (especially with federal funding involved) that you will eventually think about how to make data available on the Web. There are several technologies that allow communication between web pages and databases. The oldest of these is called Common Gateway Interface (CGI) programming, but CGI is now being augmented by other technologies such as XML and PHP.

The world of web-based software development is a rapidly changing one, and it's not our job to detail all the available technologies in this book. However, you should be aware of what these technologies are and roughly how they work, because every time you make a request that directs a web server to process some information, you are using one of them.

If you want to set up your own web server and offer data-analysis services to other users, you need to use CGI scripts or web pages that incorporate XML or PHP code. After we give brief explanations of CGI and XML, we'll show you a couple of examples of how to use PHP commands in your web pages to access the example database we've just created.

CGI

A CGI program, or script, is a software application that resides on a web server. When the CGI program is called by a remote user of the web server, the application executes on the server and then passes information back to the remote user in the form of a web page, as shown in Figure 13-5. CGI programs are accessed using the Hypertext Transport Protocol (HTTP) just like normal HTML web pages. Unlike normal web pages, however, CGI scripts either live in a special directory (such as /cgi or /cgi-bin) within the server's web documents directory, or they have a special file extension such as .cgi. When the server receives an HTTP request, instead of just serving the CGI code to your browser as it does for a normal web page, the server executes the CGI program. CGI is a relatively mature technology and is supported by all the major web servers.

CGI programs usually consist of multiple sections (see Figure 13-5). First, there may be a section of the program that collects user input from a web form. This is followed by the section of the program that takes the user input and does something with it. The CGI program may contain the complete code to do the input processing, but it is more likely that the program formats the input appropriately and passes it to a separate program that exists on the server, then collects the output from that program when the run is completed. The final function of the CGI program is to return the output from the process that ran on the server to the user in the form of a web page, which may contain either textual output or links to downloadable results files, or both.

How a CGI program is executed

Figure 13-5. How a CGI program is executed

An example of a CGI program you might use is the BLAST server at NCBI. When you select "Basic BLAST search" from the NCBI BLAST home page, you'll notice that the URL of the new page actually points to a CGI script:

http://www.ncbi.nlm.nih.gov/blast/blast.cgi?Jform=0

The first part of the URL, up to the question mark, gives the directions to the CGI program. The second part of the URL is state information, which tells the CGI program what part of its functionality is needed. The state information in this particular URL is telling the BLAST CGI program to bring up an empty search form in which you can enter your sequence.

Once you click the "Submit" button, a new page appears. The new page lists your request ID and approximately how long the request will take to process. Behind the scenes, the CGI program has passed your request to the actual BLAST program, which runs on NCBI's server. When BLAST finishes running your request, the results are written out and labeled with the request ID the server assigned to you. The CGI program then looks for your results under that request ID.

After the search is run, you have the option of displaying your data. The URL still points to the BLAST CGI program, but the state information changes. The URL now looks like this:

http://www.ncbi.nlm.nih.gov/blast/blast.cgi?RID=965246273-2980-7926
&DESCRIPTIONS=100&ALIGNMENTS=50&ALIGNMENT_VIEW=0&&HTML=on&OVERVIEW=on
&REFRESH_DELAY=22

The state information that is being passed along in this URL tells the program which NCBI request ID (RID) to search for on the server and how the results should be displayed, information that you had the option of entering through the menus on the previous form. The new page that is displayed with this URL contains a listing of your BLAST results as well as links to other information at NCBI. The BLAST results and links were generated behind the scenes on the NCBI server and written to what appears to you as a normal web page (see Figure 13-6).

Processing a BLAST request at NCBI

Figure 13-6. Processing a BLAST request at NCBI

CGI programs produce a lot of the dynamic content currently available on the Web, although other methods for producing dynamic content are becoming popular.

XML

The eXtensible Markup Language, better known as XML, is a data-representation scheme that has attracted a great deal of attention in the last few years. Like the HTML language that describes web pages, XML is derived from the Standard Generalized Markup Language (SGML). HTML and XML define tags that are used to annotate a document. Tags are surrounded by angle brackets and use the syntax <tag>text</tag>. HTML tags specify a web page's structure and appearance. For example, the text <B>this is bold</B> is rendered in boldface.

XML tags, on the other hand, define a document's content. For example, in the text:

homologs of the sequence <gi>g7290345</gi>

the GenBank ID g7290345 is unambiguously identified because it is bracketed by <gi> tags. If you write a program that searches a document for GenBank IDs, it's easier to find them if they're explicitly labeled than if you specify a GenBank ID regular expression. Thus, XML lends structure to flat file data such that it can be read and written in a standard way.

The tags used in a given XML document are defined in a document type definition, or DTD. The DTD acts as a dictionary for the data format, specifying the elements that are present in a document and the values each attribute may have. The DTD can exist in its own file, or it can be part of the XML datafile itself. Because XML allows users to define their own tags in a DTD, it provides a rich and detailed description of information that would potentially end up in a glob of free text (for example, the REMARK fields in a PDB file). The downside of this descriptiveness is that records can rapidly become bloated with details, especially when representing complex data such as the results of microarray experiments.

The fact that XML can mark up data in flat files in a standard and uniform way is significant for working with biological data, which is often stored in flat files. For example, if you want to use data in the ASN.1 format described earlier in this book, you need an ASN.1 parser, which reads only ASN.1 data. By the same token, if you need to read in files in PDB format, you need a different parser for PDB files. However, if your data is marked up in XML, any XML parser can read the data into your program. Here is an example of an XML representation of PDB author data:

<!-- Simple PDB citation DTD -->
<!ELEMENT citation (author)+>
<!ELEMENT author (first-name, last-name)>
<!ELEMENT first-name (#PCDATA)>
<!ELEMENT last-name (#PCDATA)>

<!DOCTYPE pdbcite SYSTEM "pdbcite.dtd">
<citation>
   <author>
     <name>Fermi, G.</name>
     <citation_id>primary</citation_id>
   </author>
   <author>
     <name></name>
     <citation_id></citation_id>
   </author>
 ...
</citation>

A number of XML parsers are available for the programming languages commonly used in bioinformatics research, including Perl, Java, Python, and C++. There are two basic types of XML parser: nonvalidating and validating. Nonvalidating parsers read the XML file and ensure its overall syntactic correctness. Validating parsers, on the other hand, can guard against missing or incorrect values. By comparing the XML document against its DTD, a validating parser ensures that the markup of the data isn't only syntactically correct but that each tag or attribute is associated with appropriate values.

XML applications

Thanks to its flexibility and success in other domains, XML has been adopted as a data description language for some bioinformatics projects. XML has caught on particularly well in genome annotation: the Genome Annotation Markup Element (GAME-XML) DTD was developed by Suzanne Lewis and coworkers at the Berkeley Drosophila Genome Project to represent sequence features in genomes. XML is also the basis for the markup scheme proposed by Lincoln Stein, Sean Eddy, and Robin Dowell for the distributed annotation system, DAS. Some other example applications of XML include the Biopolymer Markup Language (BioML) sequence description format developed at Proteometrics, the Taxonomic Markup Language developed by Ronald Gilmour of the University at Albany for representing the topology of taxonomic trees, and the Chemical Markup Language (CML) for representing small molecule structures.

Information about these and other applications of XML in bioinformatics are available at the web page of the Bioxml group, the XML-specific arm of the Bioperl Project (http://www.bioxml.org). Additional information about XML and its associated technologies are available from the WWW Consortium (http://www.w3c.org).

PHP

PHP is a hypertext preprocessor module for your web server that allows it to read and interpret PHP code embedded in web pages. PHP code resembles, but isn't identical to, familiar programming languages such as Perl and C.

PHP runs on most web servers; see http://www.php.net for more information. Unlike some other dynamic content technologies out there (for instance, Microsoft's ASP), PHP is an open source project that is supported on multiple operating systems. PHP also has built-in support for interacting with MySQL, PostgreSQL, and Oracle databases.

When a web page that incorporates PHP code is requested from a web server, the server processes the PHP instructions in the page before passing it to the client. The page source appears to the client as standard HTML; the PHP code remains invisible to machines beyond the web server.

PHP commands can be interspersed with HTML code in just about any order that seems useful to the page designer, and the resulting HTML will appear at that point after the PHP code is processed. PHP also has the capacity to include external files, so you can write the code that controls the appearance of the web page in the main PHP file and place subroutines in separate files to be included. PHP pages are distinguished from standard HTML files by a .php extension.

Accessing MySQL databases with PHP

Accessing a MySQL database with PHP doesn't take much work. You need one line of code to connect to the database, another line to select which database to use, a query statement, and a statement that sends the data as HTML to the client's web browser. A simple example might look like this:

<?php
$link =@mysql_pconnect ("myhost.biotech.vt.edu","cgibas","password") or
exit (  );
mysql_select_db ("resourcedb") or exit (  );
$result = mysql_query ("SELECT program, url, institution FROM software
WHERE program = "BLAST") or exit (  );
while ($row = mysql_fetch_row ($result))
{
    print("<br>
");
    for ($i = 0; $i < mysql_num_fields ($result); $i++)
    {
       print ($row[$i]);
    } 
} 
mysql_free_result ($result);        
?>

The first line of code (< ? php) signals the start of a chunk of PHP code. The next step is to connect to the MySQL server with the specified name and password, or terminate the script if the connection is unsuccessful:

$link =@mysql_pconnect ("myhost.biotech.vt.edu","cgibas","password") or 
exit (  );

Now you request the database called resourcedb:

mysql_select_db ("resourcedb") or exit (  ); 

Next, issue a MySQL query that selects values from the program, URL, and institution fields in the software table when the program name is equal to "BLAST":

$result = mysql_query ("SELECT program, url, institution FROM software 
WHERE program = "BLAST") or exit (  ); 

Every time a row is read from the database, you break that row down into fields and assign them to the $row variable, then step through each of the fields in the row and print out the value of that field:[#]

while ($row = mysql_fetch_row ($result)) 
{
    print("<br>
");
    for ($i = 0; $i < mysql_num_fields ($result); $i++)
    {
       print ($row[$i]);
    }
}

Finally, release the results from memory when the query is completed:

mysql_free_result ($result);

The last line of code (? >) terminates the PHP portion of the web page.

Collecting information from a form with PHP

Obviously, this code would be more useful if you substituted a variable name for the word "BLAST," and created a little form that would let the user input a word to be assigned to that variable name. All of a sudden, instead of a little bit of PHP code that searches the database for BLAST servers, you have a crude search engine to find a user-specified program in the resourcedb database.

Forms are created in PHP using PHP print statements to produce the HTML code for the form. For example, to produce a pair of radio buttons, the PHP code looks like this:

print("<INPUT TYPE="radio" NAME="type" VALUE="Yes" CHECKED>Yes
");
print("<INPUT TYPE="radio" NAME="type" VALUE="No">No
"); 
                  

Other form features are implemented analogously. For more information about forms, collecting data using forms, and detailed examples of how to produce a PHP form, see the MySQL references in the Bibliography.

Web database programming isn't something you can learn in a few pages, but we hope we've convinced you that creating a MySQL database is something that you can do if needed, and that writing the PHP code to access it won't be that much harder than working with HTML and Perl. Rather than showing the full PHP code for the MySQL database example, we'll walk you through the important things the PHP code will need to do.

To interact with our example database, you want a PHP script that does several major tasks:

  1. Present a welcome page to the user. The page should allow the user the option of searching the database or adding a new entry to the database. Behind the scenes, that selection needs to be processed by the PHP script so that it subsequently presents the correct page.

  2. Present a query submission form to the user. The PHP code needs to build a useful form, then grab the data the user enters in the form and use it to build SQL SELECT statements.

  3. Present query results to the user. As matching records are found in the database, the program will have to format each one into a reasonably nice-looking piece of HTML code so that it displays in the user's web browser in a readable format.

  4. Present a form for adding a new entry. This assumes you have granted permissions for adding entries to the database to outside users and will require you to collect username and password information.

  5. Add the new entry to the database. This routine needs to take the information from the add form and actually use a SQL INSERT command to add it to the database.



[*] The technical term for the process of separating a complex data set into a collection of mutually orthogonal, related tables is normalization. For a rigorous discussion of relational database theory, see the pertinent references in the Bibliography.

[†] SQL commands don't have to appear in all capital letters; they're case-insensitive. But we'll write them in all capital letters in our examples, so that you can distinguish them easily from the names of files, variables, and databases. File, variable, and database names are case-sensitive in SQL, so if you name a database PeOPlE, you'll have to live with that.

[‡] System processes such as servers that run in the background on Unix systems are known as daemons.

[§] You also need to include -- user=mysql -- password on the mysqladmin command line, but from now on, we're going to assume you know that.

[‖] Don't run out and implement this on your machine just because we talked about how to do it. The Web is teeming with out-of-date collections of bioinformatics links (and other kinds of links), and unless you intend to be a responsible curator, no one really needs you to add to them.

[#] The way we've done this, it will be a rather ugly plain-text printout. Adding HTML table tags at various points in the command sequence results in much prettier output.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset