Chapter 9. Visualizing Protein Structures and Computing Structural Properties

Analysis of protein 3D structures is a more mature field than biological sequence analysis. The Protein Data Bank started distributing coordinates of macromolecular crystal structures in the early 1970s, and since that time, many research groups and companies have developed software to visualize and measure the properties of protein structures.

Visualization of structure and measurement of structural properties are important tools for molecular and structural biologists. Being able to "see" the 3D structure of a protein and analyze its shape in detail can suggest the location of catalytic sites and interaction sites, and can help identify targets for the site-directed mutagenesis studies that are so often used to arrive at a detailed characterization of a protein's functional chemistry.

Here are some recent applications of this type of approach in molecular biology:

  • Molecular modeling of an allergy-causing protein from mountain cedar pollen and subsequent identification of the region that causes allergic response

  • Characterization of the mutagenic active site in DNA reverse transcriptase from HIV; this site is thought to be responsible for the ability of HIV to mutate rapidly

  • Modeling of a DNA binding protein involved in Bloom syndrome, and characterization of the mutations that cause the disease

There are many specialized analysis programs in the protein structure literature, and we will not attempt to catalogue all these methods. Instead, we present an introduction to standard operations for analyzing and modeling protein structure, with examples of software for each purpose: visualization and plotting; geometric and surface property analysis; classification; analysis of intramolecular interactions and solvent interactions; and computation of some physicochemical properties.

For all-purpose molecular structure modeling, the easiest-to-use tools are still commercial packages such as MSI's Quanta and Insight, Tripos' SYBYL, and others. However, licensing for these packages, especially for multiple users, is quite expensive and they generally require specialized high-end hardware (such as SGI and IBM Unix workstations) to run. In this chapter, we again focus on software that can run on a standard desktop PC under Linux or within a web browser on any platform.

A Word About Protein Structure Data

Because protein structure analysis is a relatively old field, evolving earlier in the history of computers than sequence analysis, it has inherited some inconveniences. While many programs use the standard PDB format, others, especially molecular simulation software, expect input in slightly or significantly different forms. And because protein structure analysis software is older, many programs are written in the FORTRAN language and are very picky about data input formats. Data standardization at the PDB is excellent, but standardization at the individual software package level isn't as good. If you're going to be doing a lot of work with protein structure data, you may need to learn some programming so that you can convert structure files to alternate formats when necessary. We show an example of a simple structure file-format conversion in Chapter 12.

The Brookhaven PDB format is the protein structure data format that most structure-analysis programs use. This format met the needs of the protein structure field in the 1970s. Its rigidly structured 80-character lines made it both human-readable and compatible with FORTRAN programs. The format consists of two parts. The first is a header section that contains miscellaneous information about the structure: literature citations, resolution, crystallographic parameters, sequence, and sometimes secondary structure information. The second is a section that contains atom records. Atoms labeled ATOM are part of the protein chain, while atoms labeled HETATM (for heteroatom group) are part of cofactor molecules, substrates, ions, or other groups that aren't a covalently bound part of the protein chain. A detailed line-by-line description of the Brookhaven format is available from the RCSB PDB web site.
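Because the layout is fixed-column, atom records can be read by slicing each line at known character positions. The following Python sketch is illustrative only. The function name parse_atom_line and the sample records are our own inventions (the records are made up, not taken from a real PDB entry), and only a handful of fields are extracted. The column positions are the ones given in the published PDB format description.

```python
def parse_atom_line(line):
    """Extract a few fields from one ATOM or HETATM record.

    The PDB format description numbers columns from 1; Python slices are
    0-based and end-exclusive, so columns 31-38 become line[30:38].
    Returns None for any other record type.
    """
    record = line[0:6].strip()           # columns 1-6: record name
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "serial": int(line[6:11]),       # columns 7-11: atom serial number
        "name": line[12:16].strip(),     # columns 13-16: atom name
        "res_name": line[17:20].strip(), # columns 18-20: residue name
        "chain": line[21],               # column 22: chain identifier
        "res_seq": int(line[22:26]),     # columns 23-26: residue number
        "x": float(line[30:38]),         # columns 31-38: x in angstroms
        "y": float(line[38:46]),         # columns 39-46: y in angstroms
        "z": float(line[46:54]),         # columns 47-54: z in angstroms
    }


# Hypothetical sample records laid out in PDB columns.
sample_lines = [
    "REMARK   example header-style line, ignored by the parser",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "HETATM 1500 ZN    ZN A 201      10.000  20.500 -30.250  1.00 15.00          ZN",
]

atoms = [a for a in map(parse_atom_line, sample_lines) if a is not None]
for atom in atoms:
    print(atom["record"], atom["name"], atom["res_name"],
          atom["chain"], atom["res_seq"], atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than splitting on whitespace is deliberate: adjacent fields in real files can run together with no space between them, so whitespace splitting can misread a record.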

Protein structure files also are available from the PDB in a new format called mmCIF (the macromolecular Crystallographic Information File) and from NCBI in the ASN.1 file format. Both of these formats are highly parseable by computers, and if you are writing computer programs to analyze protein structures, they may be easier to use than the obsolete Brookhaven format. However, you'll need to consider that the user community is still attached to the Brookhaven format.

The Chemistry of Proteins

To work with protein sequence and structure, you need a working knowledge of protein chemistry—the kind of knowledge you'd probably have picked up in an undergraduate organic chemistry course. We'll provide you with a little of that vocabulary here, and you can find out more from the references listed in the Bibliography. If you already know what you need to know about protein chemistry, you can skip ahead to Section 9.3.

The reason you should have a basic knowledge of organic chemistry when studying protein structures is simple. Proteins often perform their functions using standard organic reaction mechanisms, mediated by amino acids and small organic molecules (cofactors) that bind to the protein, or by metal ions. To understand how the protein structure might catalyze a reaction, you need to understand enough about organic reaction mechanisms to develop a hypothesis about how the reaction might work, given the shape of the protein and the location of various amino acids.

Even in cases in which a catalytic mechanism isn't your main concern, chemistry comes into play. Protein association is often mediated by the electrostatic properties of the protein structure; interacting molecules can be drawn together over considerable distances by strong electrostatic potentials. Within protein structures, hydrogen bonds and other interatomic interactions confer structural stability. Interatomic interactions and molecular shapes are the basis of the specificity of intermolecular interactions—the interactions of proteins with other proteins or with small molecule substrates. You are likely to be concerned about molecular specificity in practical applications of biochemistry—designing small-molecule or peptide drugs, understanding the molecular basis of disease and immunity, or delving into the specific molecules involved in sending molecular signals between cells and through the body.

The tools in this chapter enable you to look at a protein structure, see what its features are, locate different types of amino acids and visualize specific subsets of the protein, measure distances and surface areas, and compute spatially variable properties such as solvent accessibility and electrostatic potentials. However, what you can do with those tools depends on your understanding of protein chemistry.

From 1D to 3D

How does the chemistry of a protein relate to its 1D sequence? In Chapter 8, we discussed techniques for detecting characteristic conserved patterns, called motifs, in families of protein sequences. We can find these sequence patterns in 1D data because although the 3D structure of a protein is complex, it is somehow determined by the invariant sequence of amino acids that makes up the protein. Motifs that are conserved in sequence often are related to important structural or functional features of a protein family, and those features often can be understood by their roles in the protein structure.

When amino acids come together in sequence to form a polymer, they do so by forming a peptide bond between the basic amino group and the acidic carboxyl group of each amino acid (Figure 9-1). This results in a long chain of amino acids that has a repeating backbone structure.

Figure 9-1. Peptide bond, peptide chain (chemical notation)

The variable group of each amino acid protrudes from the repeating backbone and is referred to in the protein structure business as a sidechain (Figure 9-2). Each of the 20 amino acid sidechains is chemically different from the others in some respect.

Figure 9-2. The amino acid sidechains (chemical notation)

The sidechains can be classified in many ways. Some are relatively large, while others are tiny or in one case nonexistent. Some have a positive or negative charge. Some are oily, or hydrophobic (water-fearing), meaning that it's energetically unfavorable for them to be solvated in water. Others are hydrophilic (water-loving), and they solvate easily in water. Some have bulky ringlike structures, while others are straight carbon chains. Some are acids, others are bases.

Amino acids are conserved through evolution at specific locations in a protein sequence because they are needed there, whether to stabilize the protein structure, to form a specific binding site, or to catalyze a reaction. You can detect that particular amino acids in a protein are conserved by looking at sequence data, but to develop a hypothesis about why they are conserved, it's helpful to examine the 3D protein structure.

Figure 9-3 shows the 20 amino acids classified into chemically similar groups. Note that many of the amino acids fall into more than one category. An amino acid sidechain can be both "nonpolar" and "basic," for instance, like lysine, which has a long aliphatic sidechain that terminates in an amino group. Because the relationship between chemical characteristics and amino acids isn't one-to-one, but rather many-to-many, it's not always simple to predict the effects of an amino acid substitution.

Figure 9-3. The amino acid sidechains (classification in a Venn diagram)

Interatomic forces aren't responsible only for specific interactions that form binding and interaction sites; they also are responsible for the formation of certain standard patterns that are consistently observed in protein structure. The amino acid backbone is sterically constrained—restricted from moving in certain ways because atoms will bump into each other—to follow only certain pathways. You may already be familiar with the alpha helix and beta sheet structures that commonly occur in protein structures; the reason that alpha helices and beta sheets are common is the steric restrictions on the protein backbone.

From the known structures of amino acids, Pauling and Corey first predicted the existence of alpha helices and beta sheets as a component of protein structure. Ramachandran first described exactly what range of conformations are available to amino acids in a peptide chain. Peptide chain conformation is simply described by the values of the dihedral angles in the protein backbone (i.e., the angle described by the four atoms surrounding the N-Cα bond and the angle described by the four atoms surrounding the Cα-C bond). These angles are referred to as Φ and Ψ, respectively. The chain isn't free to rotate around the third kind of bond in the protein backbone, the peptide bond, because it is a partial double bond and hence chemically constrained to be planar, so the values of Φ and Ψ for each amino acid provide a complete description of the protein backbone. A Ramachandran map is simply a plot of Φ versus Ψ for an entire protein structure. One means of evaluating a protein structure model is to compare its individual Ramachandran map with the general Ramachandran map of allowed values of Φ and Ψ.
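As a concrete illustration of how Φ and Ψ are obtained from atomic coordinates, the following Python sketch computes a signed dihedral angle from four atom positions. The function names and the test coordinates are our own and purely illustrative. For Φ of residue i, the four atoms would be C(i-1), N(i), Cα(i), and C(i); for Ψ, they would be N(i), Cα(i), C(i), and N(i+1). The sketch assumes the four points are distinct and that no three consecutive points are collinear.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p1, p2, p3, p4):
    """Signed dihedral angle, in degrees, about the p2-p3 bond."""
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n1 = _cross(b1, b2)   # normal to the plane through p1, p2, p3
    n2 = _cross(b2, b3)   # normal to the plane through p2, p3, p4
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Idealized test geometries with the central bond along the z axis.
p1, p2, p3 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
print(round(dihedral(p1, p2, p3, (1.0, 0.0, 1.0))))   # eclipsed (cis)
print(round(dihedral(p1, p2, p3, (0.0, 1.0, 1.0))))   # quarter turn
print(round(dihedral(p1, p2, p3, (-1.0, 0.0, 1.0))))  # anti (trans)
```

Using atan2 rather than acos keeps the sign of the angle, which matters because a Ramachandran map distinguishes, for example, Φ = -60 from Φ = +60.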

Figure 9-4 is a general Ramachandran map that shows the allowed combinations of Φ and Ψ values for amino acids in protein structures. The small shaded region in the lower left quadrant of the map is the standard conformation of an amino acid in an alpha helix. The larger shaded region in the upper left quadrant of the map is the standard conformation of an amino acid in a beta sheet, or extended structure.

Figure 9-4. Ramachandran map of allowed conformation for protein backbones

It's apparent from the Ramachandran map that steric interactions are very important determinants of the general features of protein structure. Steric interactions instantly eliminate a large fraction of possible conformations for proteins and leave relatively few options for how a compact structure can form from a linear chain of amino acids.

The sequence of amino acids in a protein is called its primary structure, and it is the most basic level of organization in a protein. Alpha helix and beta sheet structures, shown in Figure 9-5, are known collectively as secondary structures and are the next level of organization. Interactions between multiple secondary structure elements give rise to supersecondary structure and tertiary structure: helices and sheets contacting each other to form larger characteristic structures, which can be described by their topology.

Figure 9-5. Alpha helix and beta strand structures

To create a functional protein, the sequence of amino acids in the protein chain must give rise to the proper 3D fold for the protein, and it must also place individual amino acids at appropriate points on that scaffold to carry out the protein's chemistry. Finding ways to extract those chemical instructions from the sequences of known proteins, formulating them as rules, and using those rules to predict the structure of other proteins is one of the biggest open research problems in bioinformatics.

Interatomic Forces and Protein Structure

Since the form that a protein structure can take and its chemical characteristics are governed by interatomic interactions, it is important to have at least a basic understanding of the interatomic interactions that play a role in protein structure. Interactions between atoms are physically complicated and to describe them in detail would require a whole other book, which fortunately has already been written by someone else: see the Bibliography. What we hope to give you is a rudimentary knowledge of these forces, to help you understand why computer methods have been developed to measure and calculate particular structural properties of proteins.

Understanding these forces gives us a basis for designing evaluative and predictive methods. Threading methods rely on the ability to discriminate between an amino acid that is in a favorable chemical environment and one that isn't. Homology modeling and structure optimization methods rely on rules for spacing between atoms, bond lengths, bond angles, and other values. These rules can be derived from chemical experiments on small molecules or from the distribution of observed values in known protein structures. However these rules are constructed, though, they reflect energetically favorable interactions between atoms.

Covalent interactions

Covalent interactions are very short range (approximately 1 to 1.5 angstroms) and very strong; they are the forces that bind atoms together into a molecule. In covalent bonding, the atoms involved actually share electrons. Unlike other forces encountered in protein structures, covalent bonds actually change the nature of the atoms involved to some extent. Atoms involved in covalent bonds are no longer discrete entities; instead, they combine to form a new molecule.

The protein backbone, including the peptide bond that joins one amino acid to another, is held together by covalent bonds. Amino acids retain some of their chemical individuality within the protein structure, but formally they become part of a new molecule. Atoms within individual amino acid sidechains are also covalently bonded to each other. These covalent bonds place strong constraints on the distance between atoms in a protein structure.

Because covalent interactions are strongly constrained by physicochemical rules, an important part of the verification process for structural quality is making sure that bond lengths, bond angles, and dihedral angles don't vary dramatically from their allowed values. Covalent bond lengths are determined by the size and type of the atoms involved and by the number of electrons shared between atoms. The more electrons are shared, the shorter and stronger the bond. Bond angles are constrained by the structure of atomic orbitals. Dihedral angles, the angles of rotation of two bonded pairs of atoms with respect to each other around a central bond, are constrained primarily by steric hindrance. These chemical constraints are also used in macromolecular simulation, where they are associated with applied forces that keep the molecule in allowed conformations.

Hydrogen bonds

Hydrogen bonds arise when two polar groups interact. The two polar groups must be of specific types. One must be a proton donor, a chemical group in which a proton (hydrogen atom) is covalently bonded to a strongly electronegative atom such as oxygen. The bond between the proton and the electronegative atom is polarized, giving the proton a partial positive charge and the electronegative atom a partial negative charge. The other group must be a proton acceptor, an electronegative atom with a partial negative charge and no attached proton. The positively polarized proton in the first group is attracted to the negatively polarized second group, and the two form a bond that isn't covalent but is nonetheless much shorter and stronger than a normal nonbonded interaction. Hydrogen bonds are unusual among nonbonded and electrostatic interactions because they are strongly directional; they weaken if the angle described by the three atoms involved is too large or too small.

Hydrogen bond interactions are one of the most important stabilizing forces in protein structure. The protein backbone contains a proton donor, in its N-H group, and a proton acceptor, in its carbonyl oxygen, spaced at regular intervals along the chain (Figure 9-6). The interaction of these groups stabilizes the two major types of secondary structure, the alpha helix and the beta sheet (Figure 9-7). Therefore, some structure prediction methods attempt to use the presence of potential hydrogen bond pairs to improve the accuracy of predictions.

Figure 9-6. Proton donor and acceptor in the protein backbone

Figure 9-7. Hydrogen bonding in alpha helices and beta sheets

Hydrophobic and hydrophilic interactions

A much-discussed (and frequently misused) concept in protein structure analysis is that of the hydrophobic force. We've already mentioned in passing that amino acids can be classified as hydrophobic or hydrophilic. What exactly does this mean?

Proteins, except for those bound within cell membranes, always exist in aqueous solution. They constantly interact with water molecules. Water is a solvent that has some interesting properties, and these properties contribute to the stability of the compact globular structures that characterize cellular proteins.

Water is a polar molecule. Individual water molecules in liquid water can each form four hydrogen bonds with neighboring water molecules. Liquid water is an essentially uninterrupted lattice of hydrogen bonded molecules, as seen in Figure 9-8. This unusual property contributes to the high melting and boiling points of water, as well as to such properties as low compressibility and high surface tension. It also results in interesting interactions of water with soluble proteins.

Figure 9-8. Hydrogen bonding in water

A nonpolar molecule dissolved in water interrupts the regular hydrogen bond lattice of liquid water. Individual water molecules can reorient around a small nonpolar molecule to preserve their network of hydrogen bonds, but this reorientation has a cost in terms of free energy (which is how cost is measured in chemistry). The presence of a nonpolar solute forces water molecules into a more ordered conformation than they would ordinarily assume. Instead of being able to face any which way and rotate freely, water molecules near the surface of a nonpolar solute have to work around it and form a cage. This is entropically unfavorable.

The larger a nonpolar solute gets, the more water molecules need to reorient to accommodate it, and the higher the energy cost of solvating the molecule becomes. Of course, if the nonpolar solute has some polar groups on its surface, water molecules can use those groups as hydrogen bonding partners instead of other water molecules, and the water lattice is less disturbed. Globular proteins, which exist in aqueous solution even though they are composed substantially of nonpolar groups, must present a good hydrogen-bonding surface to the world. Hydrophilic amino acids are those whose sidechains offer hydrogen bonding partners to the surrounding medium, while hydrophobic amino acids' sidechains don't. The surface of a globular protein is usually 50% to 75% polar atoms, and deviations from this pattern can suggest binding or complexation sites.

Solvent accessibility and hydrophobicity play an important role in evaluating model structures. Threading methods for protein fold recognition use amino acid environments in evaluating models. When many hydrophobic amino acids are found in solvent-exposed structural environments or hydrophilic amino acids buried in the protein interior, it is considered unlikely that the protein model is folded correctly.

Charge-charge, charge-dipole, and dipole-dipole interactions

Unlike covalent bonds, the other important interactions in protein structure are nonspecific. They don't change the discrete nature of the interacting atoms. They involve no sharing of electrons. Covalently bonded atoms are married; noncovalently bonded atoms are just shacking up.

Several kinds of important forces can arise among polar and charged atoms. An ion is an atom that has a net positive or negative charge due to either a surplus or a deficit of electrons. Atoms that carry a positive ionic charge are attracted to atoms that carry a negative ionic charge, with a strength that depends on the size of the charges and the inverse of the distance between the atoms. In proteins, charge-charge interactions occur between the sidechains of acidic and basic amino acids that are negatively charged or positively charged due to loss or gain of a labile proton under normal physiological conditions. The charge-charge interactions between amino acids in a protein structure are called salt bridges, and they can contribute a significant stabilizing force to a protein structure.

There are other, weaker interactions that occur between charges and groups that don't carry a positive or negative ionic charge. Dipolar molecules are molecules like those involved in hydrogen bonds, in which one end of the molecule has a partial positive charge and the other end has a partial negative charge. The dipole of a molecule is essentially a vector that describes the magnitude of the polarization along a bond. Dipolar molecules can be strongly attracted to other partial charges or to ionic charges. Many amino acid sidechains, as well as the protein backbone, have a strongly dipolar character, so charge-dipole and dipole-dipole interactions play a substantial role in stabilization of protein structure.

Van der Waals forces

The van der Waals force is a nonspecific attractive force between molecules. This force is loosely analogous to gravity, in that it exists between every pair of nonbonded atoms. However, it doesn't arise from the mass of the atoms involved, but from the transient attractive forces between the instantaneous dipole moments of each atom, and it falls off steeply with distance (see Table 9-1). An individual van der Waals interaction is quite weak, but because van der Waals interactions are nonspecific and numerous, they play a significant role in protein folding and protein association.

Repulsive forces

Repulsive forces, or steric interactions, are very short range forces that increase sharply as atomic centers approach each other. The radius at which the repulsive force begins to increase sharply is called the van der Waals radius. It defines a spherical boundary around each atom center that the spherical boundary of another nonbonded atom can't cross. If two nonbonded atoms in a structure get into each other's personal space, the contact is energetically unfavorable. In real molecules, atoms stay out of each other's way. However, in models of molecules, whether derived from NMR or x-ray data or built from scratch, checking for van der Waals bumps between nonbonded atoms is an important part of the structure-refinement process.
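A bump check of this kind can be sketched in a few lines of Python. The example below is a minimal illustration, not the procedure used by any particular refinement package: the function name find_bumps, the 0.4-angstrom tolerance, and the test coordinates are all our own assumptions. The radii are commonly quoted van der Waals values for a few elements. The caller is expected to supply the index pairs that should be skipped, which in practice would include covalently bonded atoms and atoms separated by only two bonds.

```python
import math
from itertools import combinations

# Commonly quoted van der Waals radii in angstroms (a small subset).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def find_bumps(atoms, skip_pairs=frozenset(), tolerance=0.4):
    """Return (i, j, distance) for atom pairs that overlap too much.

    atoms      -- list of (element, (x, y, z)) tuples
    skip_pairs -- set of (i, j) index pairs with i < j to ignore
    tolerance  -- overlap allowed before a contact is reported; the
                  default is an arbitrary illustrative value
    """
    bumps = []
    for (i, (el_i, xyz_i)), (j, (el_j, xyz_j)) in combinations(enumerate(atoms), 2):
        if (i, j) in skip_pairs:
            continue
        distance = math.dist(xyz_i, xyz_j)   # requires Python 3.8 or later
        limit = VDW_RADII[el_i] + VDW_RADII[el_j] - tolerance
        if distance < limit:
            bumps.append((i, j, round(distance, 2)))
    return bumps


# Hypothetical coordinates: atoms 0 and 1 are bonded; atom 2 crowds atom 0.
test_atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("O", (0.0, 2.5, 0.0)),
    ("N", (10.0, 0.0, 0.0)),
]
print(find_bumps(test_atoms, skip_pairs={(0, 1)}))
```

This all-pairs loop is quadratic in the number of atoms, which is fine for a demonstration but slow for a whole protein; production tools typically use a spatial grid or neighbor list so that only nearby atoms are compared.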

Relative strength of interatomic forces

The interaction between atoms can be described by a pair potential, such as the Lennard-Jones potential (Figure 9-9), which includes both an attractive and a repulsive term. The form of the potential shows that atoms tend to repel each other at very short range (positive potential energy indicating an unfavorable interaction) but to attract each other at slightly longer range. The strength of the attraction decays with distance, depending on the forces modeled.

Figure 9-9. Plot of Lennard-Jones potential
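The shape of the curve is easy to reproduce numerically. The sketch below uses the common 12-6 form of the Lennard-Jones potential, V(r) = 4ε[(σ/r)^12 - (σ/r)^6], in which ε is the depth of the energy well and σ is the separation at which the potential crosses zero. The function name and the default values ε = 1 and σ = 1 (reduced units) are placeholders of our own, not parameters for any real atom type.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones pair potential at separation r.

    The (sigma/r)**12 term models short-range repulsion and the
    (sigma/r)**6 term models the longer-range attraction.
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# The minimum of the 12-6 form lies at r = 2**(1/6) * sigma.
r_min = 2.0 ** (1.0 / 6.0)

for r in (0.9, 1.0, r_min, 1.5, 2.0, 3.0):
    print(f"r = {r:.3f}  V = {lennard_jones(r):+.4f}")
```

The printed values are positive (unfavorable) inside r = σ, reach their most negative value of -ε at r = 2^(1/6)σ, and then rise back toward zero as the attraction decays with distance.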

When making inferences about structural stability or function based on intermolecular interactions, it is important to understand the relative strengths of these interactions, and how they scale with distance (Table 9-1).

Table 9-1. How Interatomic Forces Scale with Distance

Type of Bond                      Range of Interaction
Covalent                          Complicated short range
Hydrogen bond                     Roughly 1/r^2
Charge-charge                     Scales with 1/r
Charge-fixed dipole               Scales with 1/r^2
Charge-rotating dipole            Scales with 1/r^4
Fixed dipole-fixed dipole         Scales with 1/r^3
Rotating dipole-rotating dipole   Scales with 1/r^6
Charge-nonpolar                   Scales with 1/r^4
Dipole-nonpolar                   Scales with 1/r^6
Nonpolar-nonpolar                 Scales with 1/r^6

In Table 9-1, r represents the distance between two atoms in angstroms. Interactions that decrease in strength with 1/r are effective at a much longer range than those that decrease in strength with higher powers of 1/r. Covalent interactions and hydrogen bonds are strong, and very energetically significant at short distances. Charge-charge interactions have some of the longest-range effects; electrostatic effects on protein activity have been experimentally shown at distances of over 15 angstroms, a substantial range in molecular terms. A concentration of charges on a protein surface can create a powerful electrostatic steering effect that can attract ligand molecules or other proteins at even longer range. Charge-dipole interactions are also relatively strong. The effects of these interactions are modeled by computing electrostatic potentials and using the computed potentials as the basis for calculating other molecular properties such as binding constants (via Brownian dynamics) or pKa values.
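A quick calculation shows how much the exponent matters. The following sketch takes a few of the exponents from Table 9-1 and computes what fraction of an interaction's strength at 3 angstroms remains at 15 angstroms. The function name and the choice of the two distances are ours, made only for illustration; the calculation compares each interaction with itself at two distances and says nothing about the absolute strengths of different interaction types.

```python
# Exponent n for interactions that scale as 1/r**n (values from Table 9-1).
EXPONENTS = {
    "charge-charge": 1,
    "charge-fixed dipole": 2,
    "fixed dipole-fixed dipole": 3,
    "charge-nonpolar": 4,
    "nonpolar-nonpolar": 6,
}


def fraction_remaining(n, r_near, r_far):
    """Strength at r_far relative to strength at r_near for 1/r**n scaling."""
    return (r_near / r_far) ** n


for name, n in EXPONENTS.items():
    print(f"{name:28s} {fraction_remaining(n, 3.0, 15.0):.6f}")
```

Over this fivefold increase in distance, a charge-charge interaction keeps one fifth of its strength, while a nonpolar-nonpolar interaction keeps only (1/5)^6, or 0.000064, of its strength.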

On the other hand, interactions between noncharged and nonpolar atoms are very weak and effective only at short range. However, the effects of these interactions can be cumulative, stabilizing structure and making intermolecular associations more favorable. The effects of these interactions are addressed when you compute the size of intermolecular contact surfaces or enumerate interactions between neighboring atoms in a protein. In the remainder of this chapter, we discuss various methods for measuring and evaluating atomic structures of proteins, all of which can be used together to add to your understanding of protein chemistry.

Web-Based Protein Structure Tools

Now that we've reviewed the basics of protein chemistry, let's turn our attention to the tools. The most important source of information about protein structure is the PDB. In addition to being an entry point to the structural data itself, the PDB web site (http://www.rcsb.org/pdb) contains links to many tools you can apply to individual protein structures as you search the database. Information from the database is made available through the Protein Structure Explorer interface. For each protein, you can view the molecular structure using 3D display tools such as RasMol and the Java QuickPDB viewer. PDB files and file headers can be viewed as HTML and downloaded in a variety of formats. Links to the protein structure classification databases CATH, FSSP, and SCOP are provided, along with the tools CE and VAST, which search for structures based on structural alignment. Average geometric properties, including dihedral angles, bond angles, and bond lengths, can be displayed in tabular format with extremes and deviations noted. Sequences can be viewed and labeled according to secondary structure, and sequence information downloaded in FASTA format.

You can go directly to the page for a particular protein of interest by entering that protein's four-character PDB code in the Explore box on the PDB's main page. The PDB can also be searched using two different search tools, SearchLite and SearchFields. SearchLite is a simple search tool that allows you to enter one or more search terms separated by boolean operators into a single search field. SearchFields is a tool for advanced searches that provides a customizable search form that allows you to use separate keywords to search each PDB header field. You can modify the form by selecting checkboxes at the bottom of the form and regenerating the form. SearchFields supports options for searching a dozen of the most important fields in the PDB header, as well as crystallographic information. SearchFields also allows the database to be searched using FASTA for sequence comparison, as well as secondary structure features or short sequence features.

From the individual protein page generated by the Structure Explorer, the PDB provides a menu of links through which to connect to other tools. These features are still evolving rapidly. Table 9-2 provides a brief overview of the PDB protein page. We also encourage you to explore the PDB site regularly if you are interested in tools for protein structure analysis.

Table 9-2. PDB Summary Information

Page                     Description
Summary page             The Summary page shows important information from the PDB header, as well as the chain composition of the protein and chemical information about any ligands and cofactors.
View Structure           The View Structure page provides links to everything from static images to interactive protein views using VRML, RasMol, and the PDB's Protein Explorer tool.
Download/Display File    The Download page offers several options for downloading individual protein structures and headers in both classic PDB format and the new mmCIF format.
Structural Neighbors     The Structural Neighbors page links to manually curated protein classification databases, such as SCOP and CATH, as well as the automated protein structure comparison tools CE and VAST.
Geometry                 The Geometry page provides tabular views of bond length, bond angle, and dihedral angle data for the protein.
Other Sources            The Other Sources page is a rich catalog of links for each protein to everything from its SWISS-PROT accession code to literature references describing the structure. From this page, you can generate everything from domain analyses to structural quality reports to searches of genome catalogs and the NCBI Taxonomy database.
Sequence Details         The Sequence Details page shows the sequence of the protein and the location of its secondary structure features, as extracted from the crystallographic data. The sequences of the individual protein chains in a PDB entry are also available for download in FASTA format.

We'll discuss the specifics of some of the tools linked from the PDB web site in the upcoming sections. Again, as with any web-based tool, it's a good idea to learn as much as you can about the underlying algorithms before basing any conclusions on their results. Just because a method is endorsed by the PDB doesn't mean that it's 100% foolproof, or that you can interpret results without understanding the method.

Structure Visualization

Structure visualization was one of the first tools developed for structure analysis, and it is probably one of the first analyses you will want to perform. Protein structure data is stored as collections of x, y, z coordinates, but proteins can't be visualized simply by plotting those points. The connectivity between atoms in proteins has to be taken into account, and for the visualization to be effective, a virtual 3D environment, which provides the illusion of depth, needs to be created. Fortunately, all this was worked out in the 1970s and 1980s, and there are now a variety of free and commercial structure visualization tools available for every operating system.

Even with virtual 3D representation, protein structures are so complex that they are difficult to interpret visually. The human eye can interpret 3D solids, but has a difficult time with topologically complex 3D data sets. There are a number of conventional simplified representations of protein structure that allow you to see the overall topology of the protein without the confusion of atomic detail. In order to be useful, a protein structure visualization program needs to, at minimum, be able to display user-selected subsets of atoms with correct connectivity, draw standard cartoon representations of proteins such as ribbons and cylinders, and recolor subsets of a molecule according to a specified parameter.

Molecular Structure Viewers for Your Web Browser

One class of molecular structure viewers consists of lightweight applications that can be set up to work with your web browser. When properly configured, they display molecular data as you access it on the Web. RasMol and Cn3D are two of the most popular viewers.

RasMol

One of the most popular molecular structure visualization tools is RasMol. It is available for a wide range of operating systems, and it reads molecular structure files in the standard PDB format. RasMol 2.7.1, the most up-to-date version, can be downloaded from Bernstein and Sons (http://www.bernstein-plus-sons.com). Either source code or precompiled binary distributions can be downloaded.

RasMol comes in three display depths: 8-, 16-, and 32-bit. Eight-bit is the default, but if you have a high-resolution monitor, you may have to experiment and find out which executable is right for your system. You'll know you have a problem when you try to run RasMol and it complains that no appropriate display has been detected. Start with the 8-bit version, and work your way up.

If you plan to compile RasMol yourself, you need to get into the src directory and edit the Makefile to produce the appropriate version. To do this, open the Makefile with an ASCII text editor such as vim or Emacs and search for the variable DEPTHDEF. You should find something like this:

# DEPTHDEF = -DTHIRTYTWOBIT
DEPTHDEF = -DSIXTEENBIT
# DEPTHDEF = -DEIGHTBIT

In this example, DEPTHDEF has been defined as 16-bit.

The # character at the beginning of a line marks that line as a comment, which isn't read by the make program when it scans the Makefile. Lines of code can be skipped over by being commented out; that is, marked as a comment. Remove the # character in front of the depth definition you need to use, and add it to comment out the others. Comment characters vary from programming language to programming language, but the notion of a comment line is common to all standard languages.

You may also need to edit the rasmol.h file, according to the install instructions.

Once you have the proper RasMol executable, whether you download it or compile it yourself, you need to copy it into /usr/local/bin and copy the file rasmol.hlp into the directory /usr/local/lib/rasmol. Then, in your web browser's preferences, you need to add RasMol as an application. If you're using Netscape, the default browser on most Linux systems, go to the Preferences→Navigator→Applications menu, select New, and enter the following values into the dialog box:

Description:		Brookhaven PDB 
MIMEType:		chemical/x-pdb 
Suffixes:	.pdb 
Application:		/usr/local/bin/rasmol

You may also want to create a second entry for the MIME type chemical/x-ras.

When run from the command line, RasMol opens a single graphics display window with a black background. The molecule can be rotated in this window either directly with the mouse, or with the sliders on the bottom and right side of the window. This window has five pulldown menus. The File menu contains commands for opening molecular structure files. The Display menu contains commands for changing the molecular display style to formats including ball and stick, cartoons, and spacefill. These display commands execute quickly, so you can try each of them out to see the different standard molecular display formats. The Colours menu allows you to change the color scheme of the entire molecule, and the Options menu changes the display style, allowing you to display the molecule in stereo, turn the display of heteroatom groups or labels on and off, etc. The Export menu allows you to write the displayed image in common electronic image formats such as GIF, PostScript, and PPM, which can be edited later using standard image manipulation programs that come with most Linux distributions, such as GIMP.

When you import or save files in RasMol, you do it from the RasMol command line. In the shell window from which you start RasMol, the command prompt changes to RasMol >. Enter help commands at this command prompt to see the full range of RasMol commands, including commands for selecting subsets of atoms. If RasMol complains that it can't find its help file, create a symbolic link to /usr/local/lib/rasmol/rasmol.hlp in the directory in which you installed RasMol and/or the directory in which you are running it. The commands themselves allow you to create your own combinations of colors and structure display formats, including some not available from the menus; create interatomic distance monitors; and display some intramolecular interactions, such as hydrogen bonds and disulfide bridges.

Cn3D

Cn3D is an application from NCBI that can view protein structure files in NCBI ASN.1 format. If you use the NCBI databases frequently, you will also want to install this tool and set it up to work as an application in your browser.

To install Cn3D on a Linux workstation and set it up as a browser application, you simply need to download the Cn3D archive from NCBI, make a Cn3D directory on your own machine, move the archive into that directory, and extract it.

Then, in your web browser's application preferences, make the following new entry:

Description:		NCBI ASN.1 
MIMEType:		chemical/ncbi-asn1-binary 
Suffixes:	.prt 
Application:		/usr/local/cn3d/Cn3D

Cn3D opens two windows: a color structure viewer, in which a molecule can be rotated, colored according to different properties, and rendered in different display formats; and a sequence viewer, which allows you to view sequences and alignments corresponding to the displayed protein and to add graphics to the sequence display to highlight the location of secondary structure features.

SWISS-PDBViewer

The SWISS-PDBViewer is a relatively new 3D structure display and analysis tool that complements the services offered by the Swiss Institute of Bioinformatics. It can be used to prepare input for homology modeling using the SWISS-Model web server. However, it is also useful as a standalone visualization tool. The viewer incorporates many useful functions, including superimposition of structures, calculation of molecular surfaces and electrostatic potentials, high-quality rendering, analysis of torsion angles, creation of mutations to the structure, and much more. At the time of this writing, SWISS-PDBViewer is in a phase of rapid development; if interested, you should check the Swiss Institute of Bioinformatics web site for the current version and online documentation.

Standalone Modeling Packages

Heavy-duty molecular structure viewers tend to have many more features than web applications such as RasMol and Cn3D. The most popular examples are MolMol, MidasPlus, and VMD. These programs run on your desktop machine, and to use them you need copies of the PDB files you're interested in using already stored on your computer.

MolMol

If you have Cn3D and RasMol linked to your web browser, you are well-equipped to view any molecular structure on the fly. However, there are times when you need to do more extensive manipulations of a molecular structure. MolMol is a full-featured molecular structure visualization package that allows you to display molecules, edit structures, and compute molecular properties.

You run the MolMol program by issuing the command molmol from the command line. There are no command-line options. The program opens with one large window with a white background, and a separate smaller window, which contains sliders for x, y and z rotation and for changing depth and position of the clipping plane. The clipping plane controls the simulated depth of the display window and the point at which the display window intersects the molecular structure. Atom selection options are controlled from the menu bar to the right of the main window.

Like RasMol, MolMol has pulldown menus, but all its options are available from the pulldown menus, and there are substantially more of them. MolMol has a complete manual, which is distributed along with the software in HTML and several printable formats, so we will not discuss each command here in detail. Some MolMol features you may find useful, in addition to the standard molecular display functions, are the display of Ramachandran plots and contact maps, calculation and display of macromolecular surfaces, and display of qualitatively accurate electrostatic potentials.

MolMol is available as a binary distribution from ETH Zurich and is simple to install on a Linux workstation. Follow the directions provided, and you can't go wrong. While the MolMol interface isn't quite as slick as that of a commercial product like MSI's Quanta, it is an amazing value for the price. A couple of general tips: be sure to close dialog boxes and windows by clicking on their OK buttons or by selecting Quit from the menus, rather than by clicking the Kill Window button at the top-right corner. If the program seems to need to take its time to do something, don't click a lot of extra buttons or try to force it to close down—just wait. This will keep the program from hanging up your machine.

MidasPlus

MidasPlus is a near commercial-quality molecular modeling package available from the University of California at San Francisco. It provides many standard molecular display functions, as well as tools for measurement, limited modeling capabilities (for instance, the ability to substitute amino acids in the structure), and computation of molecular surfaces and electrostatics. The MidasPlus source code and executables for various platforms, including some Linux systems, are available from UCSF for a licensing fee of $350, which is much less than comparable commercial software packages. Your Linux workstation must be equipped with a good-quality 3D graphics card in order to support MidasPlus.

VMD

Another excellent package for creating molecular graphics is VMD, the Visual Molecular Dynamics program from the Theoretical Biophysics group at the University of Illinois. VMD was designed to visualize and animate trajectories from molecular dynamics simulations, but it can also produce quite nice visualizations of single molecules. VMD is available for Linux systems and has an easy-to-use, menu-driven graphical user interface.

Creating High-Quality Graphics with MolScript

Usage: molscript -in infile -[ options ] -out outfile
Usage: molauto -[ options ] infile > outfile

MolScript has a completely different purpose from the other visualization packages we have discussed. It is designed to produce high-quality graphics for print publication, as you can see in Figure 9-10. It can be configured to run from the command line and to produce PostScript, Raster3D, and VRML output only; it can also be configured to run interactively in its own window, using OpenGL, and to produce output in many additional image file formats.[*]

Figure 9-10. A sample image generated by molscript

Setting up interactive MolScript with OpenGL on a Linux workstation isn't straightforward; it requires the installation of Mesa (open source OpenGL) libraries and customization of the Makefile that comes with the distribution. However, the basic MolScript installation is quick and simple and can produce visually appealing line drawings of molecular structure cartoons in color or black and white, in a style that is uniquely elegant and appropriate for print media. To install the basic version of MolScript, simply follow the directions in the install file. Copy the resulting executables (molscript and molauto) to your /usr/local/bin directory or to another directory in your default path. Here's what molscript and molauto do:

molscript

The main MolScript program; generates images

molauto

The MolScript setup program; automatically generates a rudimentary MolScript input file from an input PDB file

MolScript takes two input files: a MolScript command file and a PDB coordinate file. Here's the MolScript input file that produced the images in Figure 9-10:

! MolScript v2.1 input file 
! generated by MolAuto v1.1.1
title "MYOGLOBIN  (FERRIC IRON - METMYOGLOBIN)"
plot
  read mol "1MBN.pdb";
  transform atom * by centre position atom *;
  set segments 2;
  set planecolour hsb 0.6667 1 1;
  coil from 1 to 3;
  set planecolour hsb 0.619 1 1;
  helix from 3 to 18;
  set planecolour hsb 0.5714 1 1;
  coil from 18 to 20;
  set planecolour hsb 0.5238 1 1;
  helix from 20 to 35;
  ...
  coil from 94 to 100;
  set planecolour hsb 0.1429 1 1;
  helix from 100 to 118;
  set planecolour hsb 0.09524 1 1;
  coil from 118 to 125;
  set planecolour hsb 0.04762 1 1;
  helix from 125 to 148;
  set planecolour hsb 0 1 1;
  coil from 148 to 153;

  set colourparts on;
  bonds in require residue 1 and type HEM;

end_plot

The MolScript scripting language is unique and not really based on any standard computer language. The only way to learn it is to decide what you want to do, and then work through the manual and examples until you can express it. The example just shown is a simple MolScript command file; it reads in a single molecule, centers it on the molecule's center of mass, defines the locations of the various secondary structure elements, and shades them through the spectrum from blue at the N-terminus (hue 0.6667) to red at the C-terminus (hue 0). MolScript can produce much more complex figures than this, however. MolScript plots can be scaled and multiple plots shown on a single page. Subsets of atoms in the molecule can be turned on, displayed in different formats, and custom colored. Labels can be added to figures.
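The hue values in the sample file appear to be evenly spaced between 0.6667 (blue) and 0 (red), one value per secondary structure segment. The short Python sketch below reproduces that spacing. It is our own illustration, not part of MolScript: the function name hue_ramp is hypothetical, and the count of 15 segments is inferred from the step size between the hues shown (part of the listing is elided), not read from the file.

```python
def hue_ramp(n_segments, start_hue=2.0 / 3.0, end_hue=0.0):
    """Return n_segments hue values spaced evenly from start_hue to end_hue.

    Hypothetical helper: it mimics the spacing seen in the sample input
    file and is not taken from the MolAuto source code.
    """
    if n_segments < 1:
        raise ValueError("need at least one segment")
    if n_segments == 1:
        return [start_hue]
    hues = []
    for i in range(n_segments):
        t = i / (n_segments - 1)          # 0.0 for the first segment, 1.0 for the last
        hues.append(start_hue * (1.0 - t) + end_hue * t)
    return hues


# Format to four significant digits, as in the sample file.
labels = [format(h, ".4g") for h in hue_ramp(15)]
for label in labels:
    print("set planecolour hsb", label, "1 1;")
```

If you hand-edit a MolScript file and add or remove segments, a helper like this can regenerate a smooth color ramp for the new number of segments.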

Fortunately, the molauto program automatically produces simple input files for the molscript program, which can help you get started using the MolScript command language. molauto does the most tedious part of input file setup for you—assigning helix, sheet, or coil drawing styles, and colors, to each segment of secondary structure. molauto has a variety of command-line options, which you can access by entering molauto -h. molauto reads input in the standard PDB file format, and writes to standard output unless a redirector is used.

The following are some of the most useful command line options for molauto:

-ss_pdb

Reads secondary structure assignments from the PDB file

-ss_hb

Uses hydrogen bonding patterns to assign secondary structure

-cylinder

Uses cylinders to indicate alpha helices

-stick

Renders cofactor molecules using a ball-and-stick representation

-nocolour

Leaves out the coloring commands

-nice

Improves the quality of the rendering, using more colors and segments

The output of the molauto program is an input for the main molscript program. Command-line options for molscript include:

-ps

Produces PostScript output

-v

Produces VRML output

-size width height

Changes the size of the output image
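If you plan to render many structures, it can be convenient to drive both programs from a script. The following Python sketch builds the two command lines from the usage lines and options listed in this section and runs them in sequence. The function names and the file naming scheme (.in and .ps suffixes) are our own assumptions, and the argument order simply follows the usage lines shown above; check it against the documentation for the version you have installed.

```python
import subprocess
from pathlib import Path


def molauto_command(pdb_file, options=("-nice",)):
    """Build the molauto command line: molauto -[options] infile."""
    return ["molauto", *options, str(pdb_file)]


def molscript_command(input_file, output_file, options=("-ps",)):
    """Build the molscript command line: molscript -in infile -[options] -out outfile."""
    return ["molscript", "-in", str(input_file), *options, "-out", str(output_file)]


def render(pdb_file, workdir="."):
    """Run molauto, then molscript, and return the path of the image file.

    Assumes both executables are on your PATH. The .in and .ps suffixes
    are a naming convention chosen for this sketch only.
    """
    workdir = Path(workdir)
    stem = Path(pdb_file).stem
    script_file = workdir / (stem + ".in")
    image_file = workdir / (stem + ".ps")
    # molauto writes to standard output, so redirect it into the script file.
    with open(script_file, "w") as handle:
        subprocess.run(molauto_command(pdb_file), stdout=handle, check=True)
    subprocess.run(molscript_command(script_file, image_file), check=True)
    return image_file
```

Calling render("1MBN.pdb") would then be equivalent to running molauto with redirection followed by molscript by hand.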

The default input files produced by molauto can be hand-edited to produce various effects. One important thing you might want to do (and can't do automatically unless you have installed the MolScript package with OpenGL support) is to rotate the molecular structure until you achieve a good view.

To rotate the molecule view using the noninteractive version of molscript, add the following lines to your molscript input file, replacing the line that currently reads:

transform atom * by centre position atom *; 

with:

transform atom * by centre position in amino-acids
                 by rotation x    0.0
                 by rotation y    0.0
                 by rotation z    0.0
                                    ;  !Be sure to include this semicolon.

After you generate your first version of the image, open it in a fast PostScript viewer such as gv. To change the view of the molecule, experiment with changing the values of x, y, and z rotation in your input file. Since molscript takes only seconds to run on any protein input file, you can make changes to the input file, save the file, and redisplay the new output several times until you like the view.
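Because finding a good view usually takes several tries, you may prefer to let a small script do the editing. The Python sketch below swaps the default transform line for the rotation form described above. The function name set_rotation is hypothetical, and the sketch assumes the input file contains the default transform line exactly as molauto wrote it.

```python
DEFAULT_TRANSFORM = "transform atom * by centre position atom *;"


def set_rotation(script_text, x=0.0, y=0.0, z=0.0):
    """Return script_text with the default transform line replaced by a
    transform that also rotates the molecule about x, y, and z (degrees).

    Raises ValueError if the default transform line is not found.
    """
    rotation_block = "\n".join([
        "  transform atom * by centre position in amino-acids",
        "                   by rotation x {:.1f}".format(x),
        "                   by rotation y {:.1f}".format(y),
        "                   by rotation z {:.1f}".format(z),
        "                   ;",
    ])
    new_lines = []
    replaced = False
    for line in script_text.splitlines():
        if not replaced and line.strip() == DEFAULT_TRANSFORM:
            new_lines.append(rotation_block)
            replaced = True
        else:
            new_lines.append(line)
    if not replaced:
        raise ValueError("default transform line not found")
    return "\n".join(new_lines) + "\n"
```

A typical session would read the molauto output into a string, call set_rotation with a new set of angles, write the result back to disk, rerun molscript, and reload the image in your PostScript viewer.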

Once generated, the molscript image file can be viewed, converted to other file formats, and edited using standard Unix image-manipulation tools. One program you can load when you install most major Linux distributions is GIMP, the freeware package similar to Adobe Photoshop.

Active Site Visualization with LIGPLOT

Usage: ligplot protein.pdb resid resid chain

Another useful tool for producing graphics for publication is the program LIGPLOT (http://www.biochem.ucl.ac.uk/bsm/ligplot/ligplot.html), which is available from the Structure and Modelling group at University College London (UCL). Given a molecular structure and a specific residue or heteroatom group within the structure as input, LIGPLOT automatically generates a 2D schematic drawing showing hydrogen bonds, interatomic contacts, and solvent accessibility. A sample of LIGPLOT output is shown in Figure 9-11.

Figure 9-11. A schematic diagram of ligands to the heme cofactor in cytochrome B5, generated with LIGPLOT

To install LIGPLOT on a Linux workstation, simply follow the directions in the README file.

In order for LIGPLOT to find its parameter files and helper programs correctly, you need to add some path information to your .cshrc file:

setenv ligdir /usr/local/ligplot
alias ligplot $ligdir'/ligplot.scr'
alias ligonly $ligdir'/ligonly.scr'
alias dimplot $ligdir'/dimplot.scr'
alias dimonly $ligdir'/dimonly.scr'
setenv hbdir /usr/local/hbplus
alias hbplus $hbdir'/hbplus' 

The values on the command line specify a residue range in a particular protein chain. The program doesn't have to display only interactions with ligands and prosthetic groups; it can also display the network of close interactions with any residue in a protein. This works best when the residue range selected is small.

dimplot

Usage: dimplot protein.pdb chain1 chain2
Usage: dimplot protein.pdb -d domain1 domain2

The dimplot program, a variant of LIGPLOT, displays interactions across an interface between two protein chains or domains. The domain variant works only if your PDB file labels proteins at the domain level of organization.

The painful part of installing the LIGPLOT, hbplus, and naccess programs on some Linux systems is, ironically, not the installation itself, but having the capability to decrypt the encrypted archives you get from UCL. The files are encrypted using the standard Unix crypt command. This sounds straightforward enough, but many Linux vendors don't include crypt in their distributions. In order to use crypt on your system, you may in fact need to reinstall the latest version of glibc-2.0. If you don't want to deal with this, request a decrypted copy of the LIGPLOT tar archive from the authors when you send in your license agreement.

Structure Classification

Protein structure classification is important because it gives you an entry point into the world of protein structure that is independent of sequence similarity. Proteins are grouped not by functional families, but according to what kind of secondary structure (alpha helix, beta sheet, or both) they have. Within those larger classes, subclasses are defined based on how the secondary structures in the protein are arranged.

The focus in protein classification is on finding proteins that have similar chemical architectures; it doesn't matter if their sequences are related. Over the years, we've learned from classification that there are far fewer unique protein folds than there are protein sequence families. Protein chemists often are interested in the information that can be extracted from broader structural classes of proteins, since analyzing that information can help them better understand how proteins fold.

Classification of protein structures into families is a nontrivial task. Proteins have many levels of structure: the primary structure, which is the 1D sequence; the secondary structure, which is composed of the regular substructures that the protein polymer forms due to steric and hydrogen bond interactions; the tertiary structure, which is the overall 3D structure of a single protein chain; and the quaternary structure, which is the arrangement of multiple chains into a larger complex. For proteins made up of more than one chain, the quaternary structure is required to form a functional protein. Structure classification involves developing a representation of how units of secondary structure come together to form domains, which are compact regions of structure within the larger protein structure. Dividing proteins into domains is another aspect of structure classification.

There isn't really a consensus as to how to classify protein structures quantitatively. Instead, structures end up in qualitatively named classes such as "greek key," "helix bundle," and "alpha-beta barrel." These fold classes are useful in that they draw attention to prominent structural features and create a frame of reference for classifying structure. However, qualitative classifications don't lend themselves to automated analysis, and such protein classification databases still require the involvement of expert curators.

If you're simply concerned with finding the close structural relatives of a published protein structure, there are a number of online classification databases in which existing structures have been annotated by a combination of automated analysis and input from protein structure experts. There are also automated tools for finding structural neighbors by structure alignment, though like any alignment method, these tools require you to understand the significance of comparison scores when analyzing results.

If you're interested in doing your own analysis of a protein structure, there are several structure classification processes and tools that might help.

Secondary Structure from Coordinates

Protein coordinate data sets don't automatically come labeled with alpha-helix and beta-sheet classifiers. Secondary structure features in the protein can be distinguished with reasonable certainty by their hydrogen bonding patterns and their backbone torsion angles.
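To make the idea of a backbone torsion angle concrete, here is a minimal Python sketch that computes the dihedral angle defined by four atom positions. It is a generic geometry routine written for illustration, not code taken from DSSP or STRIDE. Under the usual definitions, phi for residue i is the dihedral of the atoms C(i-1), N(i), CA(i), C(i), and psi is the dihedral of N(i), CA(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Return the signed dihedral angle (degrees) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    if norm == 0.0:
        raise ValueError("p1 and p2 coincide; the central bond has no direction")
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project the two outer bonds onto the plane perpendicular to the central bond.
    d0 = _dot(b0, b1)
    d2 = _dot(b2, b1)
    v = (b0[0] - d0 * b1[0], b0[1] - d0 * b1[1], b0[2] - d0 * b1[2])
    w = (b2[0] - d2 * b1[0], b2[1] - d2 * b1[1], b2[2] - d2 * b1[2])
    # The angle between the projections is the dihedral; atan2 keeps the sign.
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Four points forming a right-angle twist about the z axis.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Computing phi and psi for every residue and plotting one against the other gives the Ramachandran plot; residues in helices and in strands cluster in different regions of that plot, which is why torsion angles are useful for assigning secondary structure.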

The standard program for extracting secondary structure from coordinates is the DSSP program. DSSP analyzes the geometry and backbone hydrogen bonding partners of each residue in a known protein structure, producing a tabular output that includes residue numbering, sequence, hydrogen bonding, and geometry details. The DSSP database, and DSSP executables derived from the 1995 release of the program, are available from the European Bioinformatics Institute (EBI); these executables may still cause Y2K-related errors on some older Linux systems. Updated DSSP source code is available from Gerrit Vriend at the Center for Molecular and Biomolecular Informatics at the University of Nijmegen, Netherlands.

STRIDE

Usage: stride -[ options ] infile > outfile

An alternative to DSSP is the program STRIDE, offered in either web server or downloadable form at the European Molecular Biology Laboratory (EMBL, http://embl-heidelberg.de/stride/stride.html/). STRIDE compiles easily on a Linux machine. Create a directory for the program, move the tar archive into the directory, and extract. Compile the program with make.

Command-line options for STRIDE include:

-M molscript file

Produces a simple MolScript input

-h

Reports hydrogen bond information

-o

Reports secondary structure assignments only

A complete list of commands can be viewed by running STRIDE with no command-line options.

The STRIDE output format is in structured 78-character lines. The following example illustrates the hydrogen bond information output format:

ACC  ALA -  143  142 ->  TYR -  146  145  3.3  107.8  125.8   58.5   76.9  1MBN
ACC  ALA -  143  142 ->  LYS -  147  146  3.2  154.3  113.4    0.1   43.4  1MBN
DNR  ALA -  144  143 ->  LYS -  140  139  3.0  153.6  109.9   16.4   27.2  1MBN
ACC  ALA -  144  143 ->  GLU -  148  147  3.0  160.3  109.4   11.6    6.4  1MBN
DNR  LYS -  145  144 ->  ASP -  141  140  3.2  145.3  119.5    3.7   73.8  1MBN
ACC  LYS -  145  144 ->  LEU -  149  148  3.0  149.4  128.8    4.7   63.7  1MBN
DNR  TYR -  146  145 ->  ILE -  142  141  3.2  158.7  121.8   20.1   52.6  1MBN
DNR  TYR -  146  145 ->  ALA -  143  142  3.3  107.8  125.8   58.5   76.9  1MBN
ACC  TYR -  146  145 ->  GLY -  150  149  3.0  156.9   96.3   37.1   37.7  1MBN
ACC  TYR -  146  145 ->  TYR -  151  150  3.1  111.2  118.0    4.2   89.9  1MBN
DNR  LYS -  147  146 ->  ALA -  143  142  3.2  154.3  113.4    0.1   43.4  1MBN
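Records like these are easy to read into a program of your own. The Python sketch below splits one hydrogen bond line on whitespace and stores the fields in a named tuple. The field names are our own guesses based on the example lines shown here (for instance, that the first number after the chain identifier is the PDB residue number and the second is an ordinal position in the chain, and that the first numeric column is a distance); consult the STRIDE documentation for the authoritative definitions. The four remaining numeric columns are kept as an unnamed tuple.

```python
from typing import NamedTuple, Tuple


class HBondRecord(NamedTuple):
    role: str                 # "ACC" or "DNR", as printed by STRIDE
    res_name: str
    chain: str
    pdb_resnum: str           # kept as text; assumed to be the PDB residue number
    ordinal: int              # assumed to be the position in the chain
    partner_name: str
    partner_chain: str
    partner_pdb_resnum: str
    partner_ordinal: int
    distance: float           # assumed to be a donor-acceptor distance
    geometry: Tuple[float, ...]   # four remaining numeric columns, uninterpreted
    pdb_code: str


def parse_hbond_line(line):
    """Parse one ACC or DNR line of the form shown in the example above."""
    fields = line.split()
    if len(fields) != 16 or fields[0] not in ("ACC", "DNR") or fields[5] != "->":
        raise ValueError("not a hydrogen bond record: " + line.strip())
    return HBondRecord(
        role=fields[0],
        res_name=fields[1],
        chain=fields[2],
        pdb_resnum=fields[3],
        ordinal=int(fields[4]),
        partner_name=fields[6],
        partner_chain=fields[7],
        partner_pdb_resnum=fields[8],
        partner_ordinal=int(fields[9]),
        distance=float(fields[10]),
        geometry=tuple(float(value) for value in fields[11:15]),
        pdb_code=fields[15],
    )


example = "ACC  ALA -  143  142 ->  TYR -  146  145  3.3  107.8  125.8   58.5   76.9  1MBN"
record = parse_hbond_line(example)
print(record.res_name, record.pdb_resnum, "->", record.partner_name, record.partner_pdb_resnum)
```

Splitting on whitespace works for the lines shown here; a parser for production use should read fixed columns instead, since some fields could be blank or run together in other files.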

The STRIDE source code is well constructed and documented. It's an excellent example of how molecular geometry is analyzed. Each function, e.g., surface area calculation, torsion angle calculation, etc., lives in its own separate source file. If you want to understand many of the standard operations involved in analyzing geometric properties of proteins, we highly recommend reading the STRIDE source code.

Topology Cartoons

Topology cartoons are a 2D notation for depicting the topological arrangement of secondary structural elements in proteins. The cartoons can clarify the spatial relationships and connectivity between secondary structure elements in a protein. These relationships may not be easily seen in a 3D structure, even if only the structural backbone is displayed or a ribbon diagram is drawn. Software for generating your own cartoons may be found on the Protein topology page, http://www.sander.embl-ebi.ac.uk/tops/.

Topology cartoons, as illustrated in Figure 9-12, represent each secondary structural unit as a shape. Circles are helices, and triangles are beta strands. The beginning of the chain is marked with an N, the end with a C. Each element has a directionality, which can be deduced from the way the connecting segment is drawn. If the N-terminal connection is to the edge of the secondary structural element, that element is directed out of the plane of the drawing; if the N-terminal connection is to the center of the secondary structural element, it is directed back into the plane of the drawing.

Figure 9-12. A protein topology cartoon

TOPS

Usage: tops pdbcode

The TOPS program expects a file in DSSP format, generated from your protein of interest, as its input.

In order to compile the TOPS code on your own machine, you need Java support. Linux ports of Java are available from IBM and Blackdown at http://blackdown.org. The Blackdown version requires that you update to glibc2.1.2, but the IBM version installs easily under Red Hat 6.1 using GnoRPM (if you download RPMs, of course). Once the IBM JRE and JDK are installed, TOPS installs without any difficulty. To run the EditTOPS executable, which allows you to actually view and plot topology files, be sure that these environment variables are set correctly:

PATH

Includes /usr/jdk118/bin (or wherever you installed Java)

CLASSPATH

Includes the full path to TOPS.jar, in the directory where you installed the TOPS classes

TOPS_HOME

Where you installed TOPS

You can set these variables by writing a script called topssetup, which contains the following three lines, and placing it in your home directory. Before you try to run TOPS or EditTOPS, use source topssetup to set the environment variables correctly.

setenv PATH "/usr/sbin:/sbin:/usr/jdk118/bin:${PATH}:."
setenv CLASSPATH "/usr/local/Tops/classes/TOPS.jar:${CLASSPATH}"
setenv TOPS_HOME "/usr/local/Tops"

Topology patterns also have been implemented as data structures in web-based search tools that allow you to compare topologies of two structures or to search a protein database for structures of similar topology. These services are available from the EBI at http://www.ebi.ac.uk.

Classification Databases

Classification databases are taxonomies of protein structure, and they bear a strong resemblance to the morphology-based taxonomies developed by early biologists. Proteins that "look" grossly the same, in terms of shape and topology, are classified as more closely related than proteins that look substantially different. Protein structure types have whimsical names (like Greek key beta barrel) based on visual observation and comparison with familiar objects. The classification databases can be envisioned as trees with many branchings at each branch point, very similar to phylogenetic trees in concept.

SCOP

The Structural Classification of Proteins (SCOP, http://scop.mrc-lmb.cam.ac.uk/scop/) is a database maintained by the MRC Laboratory of Molecular Biology at Cambridge, United Kingdom. SCOP is extensively hand-curated, and tends to lag at least several months behind the PDB in terms of its content. SCOP is a simple, relatively low-tech resource composed of a hierarchy of HTML pages with links to still pictures of individual proteins and folds, as well as embedded links to structure files to be opened with RasMol or Chime plugins and links back to the PDB to download structures.

At the top level of SCOP, known proteins are generally grouped by their secondary structure characteristics into all-alpha, all-beta, coiled coil, small proteins with structural metal ions, and various types of mixed alpha-beta structures. These major types are called Classes within SCOP. The next layer of classification, the Fold level, is a mixture of topology and similarity to domains of known function: one fold can be called "globin-like" and the next "four helical up and down bundle." Beyond the Fold level, proteins are divided further into Superfamilies and Families. Superfamily and Family divisions may be purely functional, or they may also involve some structural difference.

CATH

CATH (http://www.biochem.ucl.ac.uk/bsm/cath_new/) is similar to SCOP in concept, but it divides up the PDB a little differently. In CATH, proteins are classified at the level of (C)lass, (A)rchitecture, (T)opology, and (H)omologous superfamily. The CATH interface is easily navigated, and it is an excellent resource for examining the variety of known protein structures. CATH can be searched by PDB code, and proteins can be displayed within the browser page. The CATH maintainers provide an excellent lexicon of protein structure description to give you a feel for the structural reality behind the somewhat whimsical protein family names. At the time of this writing, the CATH web interface is undergoing rapid revision and expansion of its capabilities, to include everything from structural assignments of uncharacterized genes that may fit into CATH classes, to new levels of classification hierarchy.

Unique protein structure data sets

The PDB is full of duplication. It's been estimated that out of the approximately 13,000 structures in the PDB at this time, only around 1,000 of them actually represent unique folds. This lack of uniqueness can bias predictive and analytical methods based on extraction of structural patterns and features from the protein database. Thus, there is a need to produce nonredundant subsets of the PDB and to select, from among groups of similar proteins, the best representative of each class. This is essentially a subset of the classification problem, and for a long time it was done based on manual examination and annotation of PDB data. But as the PDB has grown, automated methods for generating nonredundant data sets based on sequence comparison have emerged.

The process for generating such data sets is fairly standard, although the particular parameters differ. First, the PDB is culled to remove extremely short protein chains, chains of very poor resolution, and chains containing a large number of nonstandard residues. The PDB is then decomposed into individual chains, and the chains are sorted by various quality criteria. An all-against-all sequence comparison is done, and chains that don't differ sufficiently to meet a certain cutoff are removed, choosing the lowest-quality chain in a pair to be removed, until all the chains in the list meet the uniqueness criteria in a pairwise comparison. Finally, the removed chains are reintroduced and added back to the set if they don't violate the uniqueness criteria with any other chain in the final set.
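The selection procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chains are assumed to be already culled and pre-aligned to equal length, resolution stands in for the "quality criteria," and the names sequence_identity and select_nonredundant are hypothetical. Real pipelines run a proper pairwise sequence alignment instead.

```python
def sequence_identity(a, b):
    """Fraction of identical residues between two pre-aligned, equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    aligned = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not aligned:
        return 0.0
    return sum(1 for x, y in aligned if x == y) / len(aligned)


def select_nonredundant(chains, max_identity=0.9):
    """Greedy selection of a nonredundant chain list.

    chains: list of (chain_id, resolution_in_angstroms, aligned_sequence).
    Chains are visited from best to worst resolution (assumed quality measure);
    a chain is kept only if it is not too similar to any chain already kept.
    """
    ranked = sorted(chains, key=lambda chain: chain[1])
    kept = []
    for chain in ranked:
        if all(sequence_identity(chain[2], other[2]) <= max_identity for other in kept):
            kept.append(chain)
    return [chain[0] for chain in kept]


# Hypothetical chain identifiers and sequences, for illustration only.
example = [
    ("1abcA", 2.5, "ACDEFGHIKL"),
    ("2xyzA", 1.8, "ACDEFGHIKL"),
    ("3defB", 2.0, "MNPQRSTVWY"),
]
print(select_nonredundant(example))
```

Because the chains are visited in quality order, a rejected chain is always similar to a better chain that was kept, so this single pass collapses the removal and reintroduction steps into one.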

At this time, nonredundant data sets can be obtained from PDB Select, at EMBL, from NCBI, and from Dr. Roland Dunbrack at the Fox Chase Cancer Center. There is no software we know of that allows you to create a unique data set based on your own choice of parameters, although the groups mentioned may be willing to generate data sets by special request. A Perl script for creation of nonredundant databases from a sequence DB, called nrdb90.pl, is also available from EBI; however, it's hardcoded to produce a nonredundant set at the 90% sequence identity level. If you're intrepid, you can modify this script for your own purposes.

Structural Alignment

Recently, there have been many attempts to make protein-structure classification an automatic and quantitative process, rather than an expert-curated process. Overlaying and comparing structures is a 3D problem that is much more resource-intensive than comparing 1D sequence data. The automated structure comparison tools that exist, therefore, are available primarily as online tools for searching precomputed databases of structure comparisons.

Comparing Two Protein Structures

The most common parameter that expresses the difference between two protein structures is RMSD, or root mean squared deviation, in atomic positions between the two structures. RMSD can be computed as a function of all the atoms in a protein or as a function of some subset of the atoms, such as the protein backbone or the alpha-carbon positions only. Using a subset of the protein atoms is common, because it is likely that, when two protein structures are compared, they will not be identical to each other in sequence, and therefore the only atoms between which one-to-one comparisons in position can be made will be the backbone atoms.
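For N matched atom pairs, RMSD is the square root of the mean squared distance between corresponding atoms. A minimal sketch, assuming the coordinates are given as two equal-length lists of (x, y, z) tuples in angstroms and that the atom pairing has already been decided (the function name rmsd is our own):

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two lists of matched (x, y, z) positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Two alpha-carbon traces that differ by a rigid shift of 3 A in x and 4 A in y.
trace_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
trace_b = [(3.0, 4.0, 0.0), (6.8, 4.0, 0.0)]
print(rmsd(trace_a, trace_b))
```

Note that this function compares the coordinates exactly as given; it performs no superposition.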

This is the first context we've discussed in which the orientation of a molecular structure becomes important. Because protein structures are generally described in Cartesian coordinates, they essentially exist within a virtual space, and they come with a built-in orientation with respect to that space. RMSD is a function of the distance between atoms in one structure and the same atoms in another structure. Thus, if one molecule starts out in a different position with respect to the reference coordinate system than the other molecule, the RMSD between the two proteins will be large whether they are similar or not.

In order to compute meaningful RMSDs, the two structures under consideration must first be superimposed, insofar as that is possible. Superimposition of protein structures usually starts with a sequence comparison. The sequence comparison establishes the one-to-one relationships between pairs of atoms from which the RMSD is computed. Atom-to-atom relationships, for the purpose of structure comparison, may actually occur between residues that aren't in the same relative position in the amino acid sequence. Sequence insertions and deletions can push two sequences out of register with each other, while the core architecture of the two structures remains similar.

Once atom-to-atom relationships between two structures are established, the task of a superposition program is to achieve an optimal superposition between the two structures, that is, the superposition with the smallest possible RMSD. Because protein scaffolds, or cores, can be similar in topology without being identical, it isn't usually possible to achieve perfect overlap in all pairs of atoms in two structures that are being compared. Overlaying one pair of atoms perfectly may push another pair of atoms further apart. Superposition algorithms optimize the orientation and spatial position of the two molecules with respect to each other.
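To make the idea concrete, here is a sketch of one well-known way to find the minimum RMSD over all rigid-body motions: Horn's quaternion method. This is not necessarily the algorithm used by any program described in this chapter. Both structures are moved to a common centroid, and the best rotation corresponds to the largest eigenvalue of a 4x4 matrix built from the coordinates. The eigenvalue is estimated here by simple power iteration to stay within the standard library; the function names and the iteration count are our own choices.

```python
import math


def centered(coords):
    """Translate a list of (x, y, z) points so that their centroid is at the origin."""
    n = len(coords)
    cx, cy, cz = (sum(c[i] for c in coords) / n for i in range(3))
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]


def plain_rmsd(a, b):
    """RMSD of the coordinates as given, with no superposition."""
    total = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(total / len(a))


def superposed_rmsd(mobile, reference, iterations=500):
    """Minimum RMSD over all translations and rotations of `mobile`."""
    if len(mobile) != len(reference) or not mobile:
        raise ValueError("need two non-empty coordinate lists of equal length")
    p, q = centered(mobile), centered(reference)
    n = len(p)
    # Correlation matrix: s[a][b] = sum over atoms of p[a] * q[b].
    s = [[sum(pi[a] * qi[b] for pi, qi in zip(p, q)) for b in range(3)]
         for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Horn's symmetric 4x4 matrix; its largest eigenvalue gives the best overlap.
    nmat = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    gp = sum(x * x + y * y + z * z for x, y, z in p)
    gq = sum(x * x + y * y + z * z for x, y, z in q)
    # Shift by the Frobenius norm so every eigenvalue of the shifted matrix is
    # non-negative and power iteration converges to the largest one.
    shift = math.sqrt(sum(v * v for row in nmat for v in row))
    if shift == 0.0:
        return math.sqrt((gp + gq) / n)
    vec = [1.0, 0.5, 0.25, 0.125]
    for _ in range(iterations):
        new = [sum(nmat[r][c] * vec[c] for c in range(4)) + shift * vec[r]
               for r in range(4)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    largest = sum(vec[r] * nmat[r][c] * vec[c] for r in range(4) for c in range(4))
    return math.sqrt(max(0.0, (gp + gq - 2.0 * largest) / n))


# A small test shape, and a copy rotated 90 degrees about z and then shifted.
shape = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
moved = [(5.0 - y, -3.0 + x, 2.0 + z) for x, y, z in shape]
print(plain_rmsd(shape, moved), superposed_rmsd(shape, moved))
```

The unsuperposed RMSD of the two copies is several angstroms even though they are the same shape, while the superposed value is essentially zero.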

Figure 9-13 shows an optimal alignment between atomic structures of triosephosphate isomerase and beta-mannase, shown in Compare3D. The two structures are similar enough to be classified as structural neighbors, and their chain traces are relatively similar. However, their sequence identity is only 8.5%.

Figure 9-13. An optimal superposition of myoglobin and the 4 chain of hemoglobin, which are structural neighbors

Once optimal superpositions of all pairs of structures have been made, the RMSD values that are computed as a result can be compared with each other, because the structures have been moved to the same frame of reference before making the RMSD calculations.

ProFit

Usage: profit reference.pdb mobile.pdb

ProFit, developed by Andrew Martin at the University of Reading, United Kingdom, is an easy-to-use program for superimposing two protein structures. One protein is assigned by the user to be the reference structure, and the other protein is mobile with respect to the reference. ProFit outputs RMSD and can also write out coordinates for the superimposed proteins. ProFit allows the option of superimposing only selected regions of each protein so that domains can be examined independently. ProFit compiles and runs on any Unix workstation. ProFit may be downloaded from Andrew's web site (http://www.bioinf.org.uk/).

DALI Domain Dictionary

The DALI Domain Dictionary (DDD) at the EBI is based on an automatic classification of protein domains by structural similarity. Rather than using a human-designed classification scheme, DDD is constructed by clustering protein neighbors within an abstract fold space. Instead of working with whole proteins, DDD classifies structures based on compact, recurring structures (called domains) that may repeat themselves within, and among, different protein structures. The content of DDD may also be familiar to you as FSSP, the "Fold classification based on Structure-Structure alignment of Proteins" database.

DDD can be searched based on text keywords; it can also be viewed as a tree or a clickable graphical representation of fold space. Views of sequence data for conserved domains are available through the DDD interface, as well as connections to structural neighbors.

The superposition program (SUPPOS) that produces the structural alignments in DALI/FSSP is available within the WHAT IF software package of protein structure analysis tools, which is discussed in Section 9.7.1.2.

CE and CL

The Combinatorial Extension of the Optimal Path (CE) is a sophisticated automatic structure alignment algorithm that uses characteristics of local geometry to "seed" structural alignments and then joins these regions of local similarity into an optimal path for the full alignment. Dynamic programming can then optimize the alignment.

CE is available either as a web server or as source code from the San Diego Supercomputer Center. The web server allows you to upload files for pairwise comparison to each other or to proteins in the PDB, to compare a structure to all structures in the PDB, to compare a structure to a list of representative chains, and review alignments for specific protein families. CE also is fully integrated with the PDB's web site, and CE searches can be initiated directly from the web page generated for any protein you identify in a sequence search. Along with the source code, you can download a current, precomputed pairwise comparison database containing all structures in the PDB. If you're doing only a few comparisons, however, you probably won't even want to do this.

When using the CE server to compute similarities, there are several parameters that you can set, including cutoffs for percent sequence identity, percent of the alignment spanned by gaps, and percent length difference between two chains. You can also set an RMSD cutoff and a Z-score cutoff. The Z-score is a measure of the significance of an alignment relative to a random alignment, analogous to a BLAST E-value. A Z-score of 3.5 or above from CE usually indicates that two proteins have a similar fold.
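In general terms, a Z-score expresses how many standard deviations an observed score lies above the mean of a background distribution. The sketch below shows only that generic definition; how CE derives its own background statistics is not described here, and the function name and sample numbers are invented for illustration.

```python
import statistics


def z_score(score, background_scores):
    """Number of standard deviations by which `score` exceeds the background mean."""
    mean = statistics.mean(background_scores)
    spread = statistics.pstdev(background_scores)
    if spread == 0:
        raise ValueError("background scores have no spread")
    return (score - mean) / spread


# Made-up alignment scores for unrelated structure pairs, plus one observed score.
background = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(12, background))
```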

Along with CE, the SDSC offers the Compound Likeness (CL) server, a suite of tools for probabilistic comparison between protein structures. In CL, you select either an entire protein structure or a structure fragment to use as a probe for searching the PDB. Search features include bond length and angle parameters, surface polarity and accessibility, dihedral angles, secondary structure, shape, and predicted alpha helix and beta sheet coefficients. CL allows you to ask the question "what else is chemically similar to this protein (or fragment) that is of interest to me" and to define chemical similarity very broadly. A full tutorial on CL is available at the CL web site (http://cl.sdsc.edu/cl1.html/).

VAST

VAST is a pairwise structural alignment tool offered by NCBI. VAST reports slightly different parameters about structural comparison than CE does, and the underlying algorithm differs in significant respects. However, the results tend to be quite similar. VAST searches automatically allow you to view your superimposed protein structures in the Cn3D browser plug-in, with aligned sequences displayed in Cn3D as well. For practical purposes, either CE or VAST is sufficient to give you an idea of how two structures match up; if you are concerned about the algorithmic differences, both groups provide access to detailed explanations at their sites. Unlike CE, the VAST software doesn't appear to be available to download, so if you want to perform a large number of comparisons on your own server, CE may be preferable.

Structure Analysis

Geometric analysis of protein structures serves two main purposes. It is useful for verifying the chemical correctness of a protein structure, both as a means of deciding whether the structural model is ready to be submitted to the PDB and for analyzing existing structures. Geometric analysis also allows you to examine the internal contacts within a protein structure. Since protein function often depends on the interactions of amino acids that aren't adjacent in the protein sequence, contact analysis can provide insight into complex, nonsequential structural patterns in proteins.

Analyzing Structure Quality

Geometric analysis can show where a model developed from x-ray crystallography data or NMR data violates the laws of chemistry. As mentioned earlier, there are physical laws governing intermolecular interactions: nonbonded atoms can get only so close to each other because as two atoms are forced together beyond the boundary set by their van der Waals radius, the energetics of the contact become very unfavorable. These interactions limit not only the contacts between pairs of atoms in different parts of a protein chain, but also how freely atoms can rotate around the bonds that connect them. The structure of atomic orbitals and the nature of bonds between atoms place natural limits on the position of bonded atoms with respect to each other, so bond angles and dihedral angles are, in practice, restricted to a limited set of values. Tools for geometric analysis generally have been developed by crystallographers to show where their structural models violate these laws of nature; they also can be used by homology modelers or ab-initio structure modelers to evaluate the quality of a structural model.
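Dihedral angles such as the backbone phi and psi angles are defined by four consecutive atoms. As a sketch of the underlying arithmetic (not the code of any particular checking program), the following function returns the signed dihedral in degrees for four (x, y, z) positions; the helper names are our own.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond (0 = cis, 180 = trans)."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane of p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane of p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Central bond along z; the last atom is rotated a quarter turn from the first.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

For phi, the four atoms would be C of the previous residue, N, CA, and C; for psi, they would be N, CA, C, and N of the next residue.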

There are a variety of tools for analyzing structure quality. Some run as standalones; others are incorporated into more comprehensive structure analysis and simulation packages. A listing of the best of these tools can be found on the PDB web site.

PROCHECK

PROCHECK (http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html) is a popular software package for checking protein quality. It produces easily interpreted color PostScript plots describing a protein structure and can also compare two related protein structures. It runs on Unix systems and has also been ported to Windows.

Using PROCHECK requires that you set up several aliases in your .cshrc file. The aliases you need are:

setenv prodir /usr/local/procheck
alias procheck       '$prodir/procheck.scr'
alias procheck_comp  '$prodir/procheck_comp.scr'
alias procheck_nmr   '$prodir/procheck_nmr.scr'
alias proplot        '$prodir/proplot.scr'
alias proplot_comp   '$prodir/proplot_comp.scr'
alias proplot_nmr    '$prodir/proplot_nmr.scr'
alias aquapro        '$prodir/aquapro.scr'
alias gfac2pdb       '$prodir/gfac2pdb.scr'
alias viol2pdb       '$prodir/viol2pdb.scr'

The aliases are required by the various PROCHECK command scripts, so you can't just run PROCHECK by typing the full pathnames to each individual module. When you run PROCHECK or PROCOMP, the program you actually run is a command script that calls several other programs and scripts.

PROCHECK can be set up to produce several different kinds of output, either in color or black and white, by editing the procheck.prm file in the directory in which you are about to issue the procheck command. The parameters are edited by changing Y to N or vice versa at points in the procheck.prm file where those options are available. The file is self-documenting and easy to understand. The most important part of the file, for reference, is probably the portion in which you turn on or off the various types of plots that are available. The rest of the parameters in procheck.prm are mainly default color values for different types of plots.

Colour all plots?
-----------------  
Y <- Produce all plots in colour (Y/N)?

Which plots to produce
----------------------
Y	<-  1. Ramachandran plot (Y/N)?
N	<-  2. Gly & Pro Ramachandran plots (Y/N)?
N	<-  3. Chi1-Chi2 plots (Y/N)?
N	<-  4. Main-chain parameters (Y/N)?
N	<-  5. Side-chain parameters (Y/N)?
N	<-  6. Residue properties (Y/N)?
N	<-  7. Main-chain bond length distributions (Y/N)?
N	<-  8. Main-chain bond angle distributions (Y/N)?
N	<-  9. RMS distances from planarity (Y/N)?
N	<- 10. Distorted geometry plots (Y/N)?

Once you've edited the procheck.prm file to your satisfaction, run PROCHECK with the command procheck filename.pdb [chain] resolution. The resolution parameter causes your protein to be compared to a "reference protein of X angstrom resolution" in the PROCHECK output. This parameter isn't optional. The command line for PROCOMP requires a second protein filename and chain ID instead of the resolution parameter.

WHAT IF/ WHAT CHECK

WHAT IF is a multifunctional menu-driven molecular modeling package developed by Gert Vriend and now available through the University of Nijmegen. WHAT IF can calculate just about any property of proteins we discuss in this chapter, from solvent accessible surface area to pKa values to contacts to molecular dynamics using GROMOS. The full WHAT IF package is available to academic users for a small fee, and it is known to compile and run well on Linux systems.

WHAT CHECK provides access to a subset of WHAT IF structural quality checks. WHAT CHECK reports on stereochemistry, bond lengths, angles, and dihedrals, among other quantities. Complete WHAT CHECK reports for any protein in the PDB can be downloaded from the PDBREPORTS database at EMBL. WHAT CHECK also is available as part of the Biotech Validation Suite web server at EMBL, for use on models and on structures not already deposited in the PDB.

Intramolecular Interactions

Geometric analysis can also be useful in understanding a protein's fold and function. In this case, the geometry of interest isn't the chemical bonding interactions between atoms adjacent to each other in the protein chain, but the nonbonded interactions between atoms that are widely separated in the protein chain. The density of intramolecular contacts in the structural core of a domain may be quite different from the density of contacts in a region between two structural domains. Measuring this density over the whole protein may give clues as to the process by which a protein folds. The patterns of hydrogen bonds that hold a protein together may serve as an identifying signature for a protein fold. And contacts between certain chemically important residues in a protein may suggest hypotheses about the protein's catalytic mechanism or function. Protein engineers may want to examine the intramolecular contacts in a protein to determine where changes are least likely to disrupt the protein's structure.

Computing contacts with HBPLUS

Listings of intramolecular nonbonded interactions and hydrogen bonds can be computed using the standalone program HBPLUS, available from the Biomolecular Structure group at UCL. Obtaining the HBPLUS program and running it are straightforward, but because the results are produced as a single long text file, they require some scripted postprocessing to become useful. The LPC-CSU (Ligand-Protein Contacts/Contacts of Structural Units) server at the Weizmann Institute (Rehovot, Israel, http://bioinfo.weizmann.ac.il:8500/oca-bin/lpccsu/) can produce textual reports of important intra- and intermolecular contacts in any protein. Protein structures can be uploaded to the server from the user's machine or found on the server using their PDB ID codes.

Contact mapping and display functionality also can be found within the WHAT IF package. Two-dimensional contact maps are a standard feature of most molecular modeling packages. A 2D contact map is simply a plot of pairwise interactions between residues, where residue number within the protein is plotted on each axis and a dot (perhaps color-coded to indicate the contact distance) is drawn wherever residue X and residue Y come into close contact. Contact maps have distinct patterns that can help identify a protein's fold, and some efforts have even been made to predict contact maps for proteins of unknown structure based on their sequences and predicted secondary structure features.
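A contact map of the kind just described is easy to compute yourself. The sketch below assumes one representative point per residue (for example, the alpha-carbon position), an 8-angstrom distance cutoff, and a minimum sequence separation; all three are illustrative choices rather than fixed conventions, and the function names are our own.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=3):
    """Return the set of residue index pairs (i, j), i < j, closer than `cutoff`.

    coords: one (x, y, z) point per residue, in angstroms.
    Pairs closer than `min_separation` positions in the sequence are skipped,
    since chain neighbors are always near each other.
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts


def render(contacts, n):
    """Draw the symmetric map as text: '#' marks a contact, '.' marks none."""
    rows = []
    for i in range(n):
        rows.append("".join(
            "#" if (min(i, j), max(i, j)) in contacts else "." for j in range(n)))
    return "\n".join(rows)


# A fully extended five-residue trace with 3.8 A between consecutive points.
extended = [(3.8 * k, 0.0, 0.0) for k in range(5)]
pairs = contact_map(extended, cutoff=8.0, min_separation=2)
print(render(pairs, len(extended)))
```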

Solvent Accessibility and Interactions

Solvent-accessible surface calculations help you figure out which chemical groups are on the surface of a protein. Amino acids on the surface of a protein usually are the ones that determine how it interacts with other molecules, such as chemical substrates, ligands, other proteins, and receptors. If you know what the chemical surface of the protein looks like, you can use that information to help determine why one molecule binds to another, why an enzyme is specific for a particular substrate, or how the protein influences its environment in other ways.

Analytical shape calculations also help you describe the geometry of the protein surface. A lot of biochemistry textbooks describe intermolecular interactions in terms of locks and keys. Molecules fit together in geometrically specific ways, so the shape of the lock (e.g., the enzyme) has to complement the shape of the key (the substrate). The shape of a receptor on the cell surface has to complement the ligand it's supposed to respond to, or the cellular response isn't triggered. The immune system is a good example. In the immune response, the organism produces antibodies that attack antigens of a particular shape. This is why you can vaccinate an animal against a disease by injecting a sample of killed virus. The killed virus is shaped just like the live, deadly virus, but it can't harm the animal. Nonetheless, the animal develops antibodies that recognize the shape of the killed virus. Then, when the live virus comes along and tries to invade, the animal already has antibodies that are the right shape to attack the live virus.

So, for instance, if you want to design a new vaccine or engineer a protein that will carry out a particular reaction, or understand how two proteins in a metabolic pathway interact with each other, it's important to be able to measure the shape of the molecule.

The standard method for computing solvent accessibility is quite simple. Each atom in the molecular structure is represented by a sphere; there is a different sphere radius for each distinct atom type. The spheres surround the known atomic centers and are modeled by a collection of several hundred discrete points. To determine the solvent-accessible surface of the protein, solvent-accessibility calculators simulate a spherical "probe" with a radius equivalent to that of water (1.4 angstroms) rolling over the surface of the atomic spheres. The path of the center of the probe determines the solvent-accessible surface of the molecule. Because the probe (and hence, water molecules) can't fit into sharp crevices in the molecular surface, the computed solvent accessible surface is much smoother than the underlying molecular surface (Figure 9-14).

Figure 9-14. Determination of solvent accessibility by probe-rolling
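The point-counting idea behind this calculation can be sketched as follows, in the style usually attributed to Shrake and Rupley. Each atom's sphere is expanded by the probe radius and covered with test points; a point counts as accessible if it does not fall inside any other expanded sphere. The atomic radii, the number of points, and the function names are assumptions made for illustration, and the nested loops are far too slow for a real protein.

```python
import math


def sphere_points(n):
    """Roughly evenly spaced points on a unit sphere (golden-spiral layout)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        points.append((r * math.cos(k * golden), r * math.sin(k * golden), z))
    return points


def accessible_areas(atoms, probe=1.4, n_points=400):
    """Solvent-accessible area per atom, in square angstroms.

    atoms: list of (x, y, z, radius). The probe radius defaults to 1.4 A (water).
    """
    unit = sphere_points(n_points)
    areas = []
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe  # the probe center rides on this expanded sphere
        exposed = 0
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            buried = False
            for j, (xj, yj, zj, rj) in enumerate(atoms):
                if j == i:
                    continue
                if (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < (rj + probe) ** 2:
                    buried = True
                    break
            if not buried:
                exposed += 1
        areas.append(4.0 * math.pi * big_r ** 2 * exposed / n_points)
    return areas


# One isolated atom, then the same atom with a close neighbor shielding part of it.
isolated = accessible_areas([(0.0, 0.0, 0.0, 1.7)])
paired = accessible_areas([(0.0, 0.0, 0.0, 1.7), (2.0, 0.0, 0.0, 1.7)])
print(isolated[0], paired[0])
```

An isolated atom exposes its whole expanded sphere, while a close neighbor buries part of it; summing the per-atom areas over a residue gives the kind of per-residue accessibility that these programs report.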

Because proteins are dynamic entities rather than the rigid bodies assumed by solvent-accessibility calculations, it's likely that the interior of the molecule has more contact with solvent than can be computed using a probe-rolling algorithm. However, solvent-accessibility calculations can help develop an initial understanding of a protein molecule that will inform further experimentation. Accessibility calculations are one way of getting at the complex physicochemical properties of a protein; the nature of the protein surface affects its interaction with the surrounding media as well as with other proteins or substrates.

Computing Solvent Accessibility with naccess

Usage: naccess pdb_file [-p probe_size] [-r vdw_file] [-s stdfile] -[hwyc]

There are many programs for calculating solvent accessibility by probe-rolling. They are all straightforward and easy to use, requiring a standard PDB file as input and usually giving output in the form of a percentage of accessibility for each amino acid or atom in the protein. One popular program is naccess, which is available from the Biomolecular Structure and Modelling group at UCL. naccess can be used in combination with other programs developed by this group, such as HBPLUS (a program for computing nonbonded interactions and hydrogen bonds) and LIGPLOT, which we covered earlier. It also runs as a standalone. naccess is written in FORTRAN, so you'll need the g77 compiler installed on your machine to compile it.

naccess outputs two files, an .asa file with accessible surface areas for each atom in the molecule and an .rsa file with accessible surface areas and relative accessibilities for each amino acid. It handles both protein and nucleic acid molecules and produces reports of accessibilities for individual molecular chains as well as complete structures. The -h, -w, and -y flags cause the program to ignore hydrogen atoms, water molecules, or heteroatoms, respectively. Run with the -c option, naccess produces intermolecular contact areas rather than accessible areas.

SURFNET is a program developed by Roman Laskowski at UCL to manipulate solvent-accessibility information and display useful representations of surface features, clefts, cavities, and binding sites. SURFNET generates surface output in formats that can be displayed in molecular visualization programs, including RasMol.

Solvent Accessibility with Alpha Shapes

The Alpha Shapes software is a mathematically exact alternative to the standard Connolly surface method of computing solvent accessibility. Developed by the research group of mathematician Herbert Edelsbrunner at NCSA (http://www.alphashapes.org/alpha/), the Alpha Shapes software is a general-purpose program for modeling the surfaces of objects. A set of extensions to the Alpha Shapes software, specifically for analyzing protein molecules, is also available.

The Alpha Shapes method constructs what is called a simplicial complex or alpha complex of a structure, based on a rigorous geometrical decomposition of the space surrounding the collection of points that describes the structure. Once the alpha complex is constructed for a protein structure, algorithms for inclusion and exclusion can describe exactly the surface area or volume of the structure as well as cavities, clefts, and regions of contact. The main benefit of using the Alpha Shapes algorithm to compute protein shapes is that the software comes with a sophisticated visualization program called alvis, which can display such geometrical features as the interior shape of an ion channel or the cavities in the interior of a protein.

Several programs make up the Alpha Shapes distribution. These programs must be run in the proper sequence to correctly analyze molecular data:

pdb2alf

Translates a PDB file into an alpha datafile.

delcx

Computes the Delaunay complex of the molecule on the output from pdb2alf.

mkalf

Computes the alpha complex from the Delaunay complex computed by delcx.

VOLBL

Computes protein properties, using the alpha complex computed by mkalf and information from the original PDB file. Depending on which command-line options are used, VOLBL can compute cavities in the protein interior and space-filling models of the protein, as well as volumes of molecules and cavities. Multiple VOLBL runs can produce complementary data sets, which can be added or subtracted to determine contact areas and other molecular properties.

You can find usage details of each of these programs in a README file that accompanies the Alpha Shapes distribution, or by attempting to run the program with no arguments on the command line.

Using alvis to visualize your Alpha Shapes data can be quite interesting. To do this, you need output from delcx and mkalf, but not from VOLBL. To run alvis on a data set generated from molecule.PDB, where output files molecule.dt and molecule.alf are also present, enter alvis molecule. The visualizer opens with the convex hull of the molecule displayed. The standard atomic structure of the molecule can't be seen from within the current version of alvis, but you can compare your alvis view with another view of the same molecule (perhaps using RasMol or a similar molecular visualization program) side by side.

In the bottom left of the alvis control panel, you'll see a box containing a graph with three colored curves. This graph is called the alpha rank graph, and it can be used to select a desirable view of the molecule. Positioning your cursor at peaks, valleys, or intersections on these graphs gives the most interesting views of the molecular shape.

Using the Pocket Panel, available from the Gizmos pulldown in alvis, you can make selections that show voids, pockets, and difference areas in a protein. The online alvis tutorial at http://www.alphashapes.org describes in full the settings needed to view these features.

Along with the main Alpha Shapes programs, a number of utility scripts are provided that can postprocess VOLBL output to give specific information. These include:

aacount

Computes an itemized residue-wise contribution to area or volume from a VOLBL output file

aadiff

Computes residue-wise differences in accessible area between two models

aanonpolar

Outputs area or volume contributions from nonpolar atoms; aapolar does the same thing for polar groups

areadiff

Computes atom-wise differences in area between two files

Analogous scripts for computing volume differences are also included.

Analytical surface potentials based on Alpha Shapes can also be accessed with the CAST-P web server at the University of Illinois at Chicago. At the time of this writing, not all protein structures in the PDB are represented on CAST-P; the site is currently under development. However, it promises to be a useful analytical tool in the future. CAST-P features an integrated Java-based visualization of cavities in protein structures and the amino acids that are in contact with cavities.

Computing Physicochemical Properties

We've already discussed forces that control the interactions between individual atoms in a protein molecule. However, to understand intermolecular interactions, it may be more interesting to learn how all the atoms in a protein act together at a distance, to influence other proteins or ligands.

The electrostatic potential of an object describes the force that the object exerts on other nearby charged objects. The electrostatic potential of a protein molecule gives rise to long-range forces that can influence the behavior of other molecules in the environment at a range of up to 15 angstroms.

Electrostatic interactions within the macromolecule can also be important. Nearby charged groups within a protein may cause the pKa value (the pH at which an acidic or basic group is half protonated) of an amino acid to shift, creating the chemistry necessary for that molecule to perform its chemical function.
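The practical effect of a pKa shift can be seen with the Henderson-Hasselbalch relationship, which gives the fraction of a titrating group that carries its proton at a given pH. The pKa values in the example below are invented to illustrate a shift; they are not measurements for any particular residue.

```python
def fraction_protonated(ph, pka):
    """Fraction of a titrating group that is protonated at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# The same acidic group at pH 7, with a typical and with an upshifted pKa.
print(fraction_protonated(7.0, 4.0))  # almost fully deprotonated
print(fraction_protonated(7.0, 6.5))  # a substantial protonated population
```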

Macromolecular Electrostatics

A protein molecule can be thought of as a collection of charges in a dielectric medium. In the model that computes electrostatic potentials for protein molecules, each atom is represented as a point with a partial atomic charge. The solvent accessible surface of the protein forms the boundary between the interior medium of the protein and the exterior medium surrounding the protein.
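As a much-simplified illustration of the point-charge picture, the sketch below sums Coulomb's law over a set of partial charges in a single uniform dielectric. It deliberately ignores the dielectric boundary between protein interior and solvent that the model above depends on, so it should be read as a starting point and not as the method used by the programs described in this section. The constant 332.06 converts to kcal/mol per unit charge when distances are in angstroms; the function name and the sample charges are our own.

```python
import math

COULOMB_KCAL = 332.06  # kcal * angstrom / (mol * e^2)


def coulomb_potential(point, charges, dielectric=80.0):
    """Potential at `point`, in kcal/(mol*e), from point charges in a uniform medium.

    charges: list of (x, y, z, q) with q in units of the elementary charge.
    """
    total = 0.0
    for x, y, z, q in charges:
        r = math.dist(point, (x, y, z))
        if r == 0.0:
            raise ValueError("potential is undefined exactly at a point charge")
        total += q / r
    return COULOMB_KCAL * total / dielectric


# A unit positive charge seen from 10 A away, in vacuum and in a water-like medium.
unit_charge = [(0.0, 0.0, 0.0, 1.0)]
print(coulomb_potential((10.0, 0.0, 0.0), unit_charge, dielectric=1.0))
print(coulomb_potential((10.0, 0.0, 0.0), unit_charge, dielectric=80.0))
```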

Computing the electrostatic potential of a protein structure allows you to predict quantities such as individual amino acid pKa values, solvation energies, and approximations to intermolecular binding energies. If you are interested in protein modeling, macromolecular electrostatics is a topic that you may wish to explore further. Our review of the subject in the March 2000 issue of Methods provides an entry point to the molecular electrostatics literature.

The University of Houston Brownian Dynamics (UHBD) package is a state-of-the-art software package for computing macromolecular electrostatics. UHBD computes electrostatic potentials and can also use those potentials as parameters in subsequent Brownian Dynamics and Molecular Dynamics simulations. The most recent release of UHBD can be compiled on Linux systems and includes several control scripts that implement UHBD to calculate pKa values for individual titrating amino acids in the protein, as well as theoretical titration curves for the protein as a whole. UHBD is accessed by a scripting language; it requires a protein structure file and a command script as input. It also requires a file containing atomic partial charges for any amino acids and other atoms in the input structure. Detailed scripting examples are provided in the UHBD distribution, along with charge datafiles that allow the program to assign correct partial atomic charges to all but unusual atom groups.

UHBD, and other similar programs such as DelPhi—which overlaps only the electrostatics functionality of UHBD—use numerical approximations to solve the Poisson-Boltzmann equation for the large number of interacting charges that make up a protein. In the finite-difference approach used by UHBD, the irregularly spaced charges in a protein molecule are mapped onto a regular 3D grid, and the Poisson-Boltzmann equation is solved iteratively for each point on the grid until the solution converges to an electrostatic potential for each point.
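The grid-and-iterate idea can be shown on a toy problem. The sketch below relaxes the plain Poisson equation (no mobile-ion term, one uniform dielectric, zero potential on the box walls) on a small cubic grid using Gauss-Seidel sweeps. It is not UHBD's or DelPhi's solver; the reduced units, grid size, and function name are assumptions chosen to keep the example short.

```python
def solve_poisson(charges, size=9, spacing=1.0, dielectric=80.0,
                  tol=1e-6, max_iter=5000):
    """Relax the Poisson equation on a size^3 grid with zero-potential walls.

    charges: dict mapping interior grid nodes (i, j, k) to a charge q.
    Works in reduced units where laplacian(phi) = -rho / dielectric, so a
    charge q on one node contributes q / (dielectric * spacing) to that node.
    Returns the potential as a nested list phi[i][j][k].
    """
    phi = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    source = {node: q / (dielectric * spacing) for node, q in charges.items()}
    for _ in range(max_iter):
        biggest_change = 0.0
        for i in range(1, size - 1):
            for j in range(1, size - 1):
                for k in range(1, size - 1):
                    neighbours = (phi[i - 1][j][k] + phi[i + 1][j][k] +
                                  phi[i][j - 1][k] + phi[i][j + 1][k] +
                                  phi[i][j][k - 1] + phi[i][j][k + 1])
                    new = (neighbours + source.get((i, j, k), 0.0)) / 6.0
                    biggest_change = max(biggest_change, abs(new - phi[i][j][k]))
                    phi[i][j][k] = new
        if biggest_change < tol:  # converged: no node moved by more than tol
            break
    return phi


# One unit charge at the center of the box, in vacuum for easier-to-read numbers.
grid = solve_poisson({(4, 4, 4): 1.0}, dielectric=1.0)
print(grid[4][4][4], grid[5][4][4], grid[6][4][4])
```

A production solver adds the ionic term, assigns a different dielectric constant to grid regions inside and outside the molecular surface, and spreads each atomic charge over neighboring grid nodes.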

Visualization of Molecular Surfaces with Mapped Properties

Other than alvis, which doesn't truly display a molecular surface but rather a mathematically derived pseudosurface, there are several options for displaying the shapes of molecules. Most molecular modeling packages incorporate a molecular surface display feature and allow the surface to be colored according to chemical properties. However, the display schemes in programs not specifically designed for that purpose are too unsophisticated to handle data from macromolecular electrostatics calculations and other representations of physicochemical properties. An exception seems to be the SWISS-PDBViewer (discussed earlier), which can interpret data from external electrostatics calculations and analytical molecular surface calculations.

GRASP/GRASS

GRASP is a high-quality molecular surface visualization program developed by Barry Honig's group at Columbia University. GRASP can read electrostatic potential files and display them as features of a molecular surface, and has many other display options for creating really beautiful visual interpretations of electrostatic properties. Unfortunately, GRASP is available only for SGI IRIX workstations and there are no plans to make it available for other operating systems at this time.

If you're using a Mac or PC, some of GRASP's functionality can be accessed through the GRASS web interface at Columbia. However, this web interface relies heavily on an interface to either GRASP itself (on SGI workstations), the Chime browser plug-in, or a VRML viewer, all of which are still problematic or nonexistent if you're working on a Linux system. We have had some success using the vrmlview viewer with Netscape to visualize VRML models from the GRASS server, although the image quality is relatively low. To use vrmlview, download and install the program and then set your Netscape preferences to use the vrmlview executable to handle files with MIME type model/vrml. The "Handled By" entry in your Netscape applications list should read /usr/local/bin/vrmlview %s (or wherever you installed vrmlview).

The GRASS interface is straightforward and clickable. You can select from several molecular display options: CPK surface, molecular surface, ribbons, or a stick model. Then, a property can be chosen to be mapped onto the molecular graphics. Available properties include electrostatic potentials computed using GRASP's built-in FDPB solver, surface curvature, hydrophobicity, and amino acid variability within the protein's sequence family. GRASS doesn't implement the full functionality of GRASP, but many of the most useful features are available.

Structure Optimization

Protein structure optimization is the process of bringing a structure into agreement with some "ideal" set of geometric parameters. As mentioned earlier when we discussed structure quality checking, protein structural models sometimes violate the laws of chemistry. Placing atoms too close together causes unfavorable intramolecular contacts, or van der Waals bumps. Bond lengths, bond angles, and dihedral angles between atoms in the protein can also be "wrong"; that is, they can fall outside some normal range of values expected for that type of bond or angle.
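To make the idea of a van der Waals bump concrete, here is a minimal sketch of a contact check. It is not the method used by any particular quality-checking program. The radii are approximate published values used only for illustration, the 0.4 angstrom tolerance is an arbitrary choice, and the function name find_bumps and its input format are hypothetical. Real checkers also exclude atoms separated by two or three bonds and treat hydrogen-bonded pairs specially, which this sketch doesn't attempt.

```python
import math
from itertools import combinations

# Approximate van der Waals radii in angstroms (illustrative values only).
VDW_RADIUS = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def find_bumps(atoms, bonded=frozenset(), tolerance=0.4):
    """Return (name_a, name_b, distance) for nonbonded pairs that overlap.

    atoms: list of (name, element, (x, y, z)) with coordinates in angstroms.
    bonded: set of frozenset({name_a, name_b}) pairs to skip.
    tolerance: overlap in angstroms allowed before a contact counts as a bump.
    """
    bumps = []
    for (na, ea, pa), (nb, eb, pb) in combinations(atoms, 2):
        if frozenset((na, nb)) in bonded:
            continue  # covalently bonded atoms are expected to be close
        d = math.dist(pa, pb)
        if d < VDW_RADIUS[ea] + VDW_RADIUS[eb] - tolerance:
            bumps.append((na, nb, round(d, 2)))
    return bumps
```

The check compares every pair of atoms, which is fine for a handful of residues but slow for a whole protein; practical programs use a spatial grid or neighbor list to avoid testing distant pairs.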

Structure optimization is an important issue not just to developers of theoretical models, but to researchers who experimentally determine protein structures. All protein atomic coordinates are, in an important sense, structural models. Structure optimization tools have long been part of the x-ray crystallographer's toolkit. The process of optimization can be computationally intensive. Because all atoms in a protein structure are connected by bonds with rigidly fixed lengths, moving an atom in one part of the protein structure has wide-ranging effects on its neighbors. Often moving one part of the protein into a better configuration means moving another part of the protein into an unfavorable configuration. Optimization is, essentially, an iterative series of small changes designed to converge to the best overall result. There are many methods of optimization, which is its own subdiscipline within theoretical computer science.
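The "iterative series of small changes" can be sketched with a deliberately tiny example: four atoms on a line, joined by three bonds that each prefer a length of 1.5 units. This is a toy steepest-descent minimizer, not the algorithm of any crystallographic refinement package; the harmonic penalty, the force constant, the step size, and all function names are assumptions made for illustration.

```python
def bond_energy(xs, ideal=1.5, k=100.0):
    """Sum of harmonic penalties for consecutive atoms on a line."""
    return sum(k * ((b - a) - ideal) ** 2 for a, b in zip(xs, xs[1:]))

def gradient(xs, ideal=1.5, k=100.0):
    """Derivative of bond_energy with respect to each coordinate."""
    g = [0.0] * len(xs)
    for i in range(len(xs) - 1):
        stretch = (xs[i + 1] - xs[i]) - ideal
        # Each bond pulls on both of its atoms, in opposite directions.
        g[i] -= 2 * k * stretch
        g[i + 1] += 2 * k * stretch
    return g

def minimize(xs, step=0.001, tol=1e-8, max_iter=10000):
    """Steepest descent: repeatedly take a small step down the gradient."""
    xs = list(xs)
    energy = bond_energy(xs)
    for iteration in range(1, max_iter + 1):
        g = gradient(xs)
        xs = [x - step * gi for x, gi in zip(xs, g)]
        new_energy = bond_energy(xs)
        # Stop when a step no longer changes the energy appreciably.
        if abs(energy - new_energy) < tol:
            return xs, new_energy, iteration
        energy = new_energy
    return xs, energy, max_iter
```

Starting from minimize([0.0, 1.0, 3.4, 4.5]), the middle bond is far too long and its neighbors are too short. Correcting the middle bond moves atoms that belong to the other two bonds as well, so no bond can be fixed in isolation, and the coordinates settle only after many small steps. A real protein has thousands of such coupled terms, which is why optimization is expensive.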

You won't always need to know the particulars of optimization methods, but if you begin using structure optimization and molecular simulation methods frequently, you should be aware that your choice of optimization algorithms may be an issue. It's not always certain that optimization will provide you with a better structural model; if the method is based on incorrect structural rules, or if the rules are prioritized incorrectly, optimization can actually give you a worse model than you started with.

Informatics Plays a Role in Optimization

What are the "ideal" parameters or constraints used in optimization? In some cases, they are based entirely on chemical principles: bond lengths and angles determined by steric restrictions and nonbonded interactions described as Lennard-Jones potentials. In other cases, structural constraints are based on information derived from the database of known protein structures. If a particular amino acid in a particular sequence context always has the same conformation, that conformation can be assigned a higher probability than any other when the same amino acid is encountered again in the same context. Secondary structure prediction methods use an information-based approach to predicting likely conformations for the protein backbone. Optimization methods use information to refine atomic structures at the level of individual sidechain atoms once the backbone trace has been worked out.
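As a worked example of a chemically based term, the sketch below evaluates the common 12-6 form of the Lennard-Jones potential, E(r) = 4ε[(σ/r)^12 - (σ/r)^6], alongside a harmonic penalty of the kind often used for bond lengths and angles. The default values of epsilon and sigma are placeholders chosen for illustration and aren't taken from any published force field; the function names are ours.

```python
def lennard_jones(r, epsilon=0.1, sigma=3.4):
    """12-6 Lennard-Jones energy for two nonbonded atoms a distance r apart.

    epsilon is the depth of the energy well; sigma is the distance at which
    the energy crosses zero. Both defaults are illustrative placeholders.
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def harmonic(value, ideal, k):
    """Quadratic penalty for a bond length or angle that strays from ideal."""
    return k * (value - ideal) ** 2
```

With these defaults the energy is zero at r = sigma, reaches its minimum of -epsilon at r = 2^(1/6) × sigma (about 3.82), rises steeply as the atoms are pushed closer together, and decays toward zero at long range. That steep repulsive wall is what penalizes the van der Waals bumps described earlier.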

Rotamer Libraries

Rotamer libraries are parameter sets specifically for the optimization of sidechain positions in molecular model building. They are called rotamer libraries because they contain information about allowed rotations of the remote amino acid sidechain atoms around the Cα - Cβ bond, expressed as the allowed values of sidechain dihedral angles.

Because of steric constraints on bond rotation, amino acid sidechains in proteins can assume only a few conformations without unfavorable energetic consequences. Rotamer libraries can be derived using chemical bond and angle constraints, but they are more often developed by analyzing the conformations assumed by amino acid sidechains in known protein structures. Rotamer libraries can be either backbone-dependent or backbone-independent. Backbone-independent rotamer libraries classify all instances of a particular amino acid as part of the same set, even if one occurrence is within a beta sheet and another is within an alpha helix. Backbone-dependent rotamer libraries, on the other hand, further classify amino acids according to their occurrence in specific secondary structures.[†]
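The following sketch shows, in miniature, how a backbone-independent library might be tallied from observed structures. It computes a dihedral angle from four atom positions (for the first sidechain dihedral these would be N, Cα, Cβ, and the next sidechain atom), assigns the angle to the nearest of three staggered positions, and reports how often each position occurs. The three ideal angles are the usual staggered values; the function names are our own, and this is not how SCWRL or any published library is actually built.

```python
import math
from collections import Counter

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def nearest_rotamer(chi, ideals=(-60.0, 60.0, 180.0)):
    """Assign an angle to the closest staggered position, allowing wraparound."""
    def gap(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)
    return min(ideals, key=lambda ideal: gap(chi, ideal))

def rotamer_frequencies(chi_values):
    """Fraction of observed angles falling nearest each staggered position."""
    counts = Counter(nearest_rotamer(c) for c in chi_values)
    total = sum(counts.values())
    return {ideal: counts[ideal] / total for ideal in sorted(counts)}
```

A backbone-dependent library would keep a separate tally for each backbone class (helix, sheet, and so on) instead of a single pooled one, at the cost of fewer observations per tally.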

SCWRL, available from the Fred Cohen research group at UCSF, is a program that allows you to model sidechain conformations using a backbone-dependent rotamer library.

PDFs

The derivation of probability density functions (PDFs) is similar in concept to the development of rotamer libraries, although more mathematically rigorous. The essence of a PDF is that a mathematical function is developed to represent a distribution of discrete values. The discrete values that make up the distribution are harvested from occurrences of a situation in a representative database of samples. That mathematical function can then be used to evaluate and optimize (and in some cases even predict) the properties of future occurrences of the same situation.
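One simple way to turn a set of discrete observations into a smooth function is a Gaussian kernel density estimate, sketched below. This is only one of several possible constructions and isn't necessarily the one used by any of the programs or servers discussed in this chapter. The function name make_pdf and the bandwidth value are assumptions chosen for illustration.

```python
import math

def make_pdf(samples, bandwidth=0.2):
    """Build a probability density function from observed values.

    Each observation contributes a small Gaussian bump of width `bandwidth`;
    the returned function sums the bumps and is normalized to integrate to 1.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf
```

If the samples were interatomic distances harvested from a structure database, the returned function could be evaluated at the corresponding distance in a new model: a distance that falls where the density is high resembles what has been seen before, while one that falls where the density is near zero is a candidate for closer inspection.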

In protein modeling, PDFs have been used to describe intra- and inter-residue interatomic distances, as well as bond angles, dihedral angles, and other more spatially extensive regions of protein structure. Modeller, which is discussed in Chapter 10, uses a combination of bond angle and dihedral angle PDFs to optimize the protein structure models it builds. Modeller's internal OPTIMIZE routine can be used for PDF-based structure optimization.

The data from which PDFs are generated can be broken down into specific occurrences (for example, all contacts between Cβ in residue i and Cβ in residue i+4 when both residues are leucine), but again, there is a trade-off between the detail of the classification and the number of observations in each class. Distance PDFs for proteins have been used by several groups to evaluate and optimize protein structures. Most such work is still in its early stages, and software isn't yet available for public use.

Figure 9-15 shows a plot of a distance probability density function for tertiary interactions between sulfur atoms in cysteine residues generated from known protein structures. The function's peak near 2 angstroms corresponds to the high propensity with which the sulfur atoms form disulfide bridges between cysteine residues. These data are taken from the Biology Workbench at the San Diego Supercomputer Center (http://workbench.sdsc.edu/) and plotted using xmgr. Note that the Workbench PDFs make a distinction between cysteine residues participating in disulfide bridges (pictured here and referred to as CSS residues at the Workbench site) and those cysteines that don't participate in disulfide bonds (which the Workbench site calls CYS).

Figure 9-15. Interatomic distance probability density function

Structure evaluation based on PDFs is implemented in the Structure Tools section of the SDSC Biology Workbench (Figure 9-16). You can upload a PDB structure or a theoretical model and score the structure either on a residue-by-residue or an atom-by-atom basis. Scores can be displayed on a plot, where the Y-axis represents the relative probability of the region of structure that's being evaluated. This can be thought of in terms of the probability that a particular residue or atom is in the "correct" position, given what is known about other occurrences of that residue or atom in similar sequence environments. Regions with low probability are likely to be misfolded or poorly modeled. PDF probability scores can also be written out in a special PDB file, in place of the temperature factor values found in the original PDB file. These special PDB files can then be displayed using a visualization program such as RasMol or Chime. Coloring the molecule by temperature factor maps the PDF probability scores onto the molecular structure, highlighting regions of the structure that score poorly.
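Writing scores into the temperature factor field is easy to do yourself, because PDB coordinate records use fixed columns. The sketch below relies on the standard layout of ATOM and HETATM records (chain identifier in column 22, residue number in columns 23-26, temperature factor in columns 61-66). The function name, the per-residue score dictionary, and the default value are our own assumptions; this is not the Workbench's code, and scores must be scaled so that they fit in six characters with two decimal places.

```python
def write_scores_as_bfactors(pdb_lines, scores, default=0.0):
    """Return PDB lines with the temperature factor replaced by a score.

    pdb_lines: iterable of lines from a PDB file.
    scores: dict mapping (chain_id, residue_number) to a float score.
    default: value written for residues that have no score.
    """
    out = []
    for line in pdb_lines:
        line = line.rstrip("\n")
        if line.startswith(("ATOM", "HETATM")):
            chain = line[21]              # column 22
            res_seq = int(line[22:26])    # columns 23-26
            score = scores.get((chain, res_seq), default)
            line = line.ljust(66)         # make sure columns 61-66 exist
            line = line[:60] + f"{score:6.2f}" + line[66:]
        out.append(line)
    return out
```

The output can be saved to a new file and opened in a viewer that colors by temperature factor. Only coordinate records are changed; header and remark lines pass through untouched.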

Figure 9-16. Comparing PDF scores for an obsolete PDB structure (1B5C) and (1CY0) that superseded it

Protein Resource Databases

Several new databases containing information about protein structure and function, and designed for users of genome-level information, have recently emerged on the Web. Some of the most notable are GeneCensus, PRESAGE, and BIND. These databases are still relatively lightly populated, and have not yet taken on the central role in biological research that the PDB and NCBI have. However, these or similar resources will almost certainly soon become vital to molecular biology research.

GeneCensus

The GeneCensus project is a broad, sequence-based comparison of the protein content of several genomes. At the time of this writing, GeneCensus contains information from 20 genomes. GeneCensus currently can be searched with a PDB ID or an ORF ID to locate occurrences of specific protein folds in the GeneCensus genomes.

PRESAGE

PRESAGE is a database of information about experimental progress with the structures of various proteins. You can search PRESAGE with a TIGR ORF code, NCBI or SWISS-PROT accession number, and a number of other codes to find out if someone is attempting to isolate, crystallize, and solve the structure of that protein, and if so, how far along they are. PRESAGE is relatively new at press time; it currently contains only about 6,000 records and isn't guaranteed to be comprehensive. It can't be searched directly by BLAST or FASTA search with a sequence, so before checking PRESAGE for instances of an unknown sequence, you have to search for matching accession codes. However, in principle, the PRESAGE database promises to be useful, both for crystallographers and their collaborators, and for curious citizens of the molecular biology community who are wondering if the structure of their protein has been solved yet.

BIND

The Biomolecular Interaction Network Database (BIND) is another relatively new data repository offered by the Samuel Lunenfeld Research Institute. BIND was designed to be a central deposition site for known information about macromolecular interactions. BIND is a complex database, containing information about interactions between objects in the database, molecular complex information, and metabolic pathway information. The BIND format is designed to contain information about the experimental conditions under which the stored interactions were observed, as well as information about binding site location, biochemical activity, kinetics, and thermodynamics. BIND is still in its beta testing phase and contains only a few hundred interactions. However, BIND has been funded to hire indexers and is expected to grow rapidly in the near future. One interesting aspect of the BIND data entry process is that methods for automated reading of existing journal literature are being implemented to extract known interactions from their inconvenient location in dusty library stacks and make them more readily available to the public.

Putting It All Together

We can't give you a single recipe for using the techniques described in this chapter to characterize a protein. There are too many variables from system to system, and too much diversity in what you as a biologist might want to know. However, some common features of a structural modeling approach include:

  • Gathering useful structural and biophysical information about the system under study. Everything from site-directed mutagenesis to classic biochemical and biophysical studies may be useful.

  • Using multiple sequence analysis to analyze the protein as part of a related family. This may give insight into the location of functionally important residues and active sites.

  • Analyzing a crystal structure or theoretical model to identify the location of buried polar and charged residues, unusual hydrogen bonds, networks of structured solvent molecules, and other chemical features that may be involved in structural stability or function.

  • Analyzing a family of related proteins by superimposing or comparing their structures to identify common features.

  • Mapping identified properties—sites known to affect function if mutated, sites conserved in multiple sequences, etc.—onto the protein sequence and structure.

  • Visualizing the structure and interpreting the location of potentially important amino acids and sites.

  • Computing the molecular surface and characterizing possible substrate binding sites or molecular interaction regions by their shapes.

  • Computing electrostatic potentials and modeling electrostatic properties such as individual amino acid pKa values or molecular interaction energies. Unusually strong electrostatic potentials or unusual pKa values may indicate regions of catalytic importance.

Obviously, this type of analysis requires a real understanding of protein chemistry. We've identified the tools of structural biology for you, but you will decide how to put them to use. To help you toward that end, Table 9-3 contains a quick reference of molecular structure tools and how they are commonly used.

Table 9-3. Structure Tools and Techniques

What you do | Why you do it | What you use to do it
------------|---------------|----------------------
View molecular structure | Computer graphics are the only way to "see" a protein structure in detail | Browser plugins: RasMol, Cn3D, SWISS-PDBViewer; standalones: MolMol, MidasPlus, VMD
Create high-quality PostScript schematic diagrams and color graphics of proteins | For publication | MolScript
Create schematic diagrams of active sites | To help identify the structural components of the functional site; for publication | LIGPLOT
Structure classification | To identify relationships among proteins | CATH, SCOP
Secondary structure analysis | To extract recognizable features at the SS level, which aids in classification | DSSP, STRIDE
Topology analysis | To extract recognizable supersecondary motifs, which aids in classification | TOPS
Domain identification | To extract recognizable domains, which aids in classification | 3Dee
Unique structure database subsets | To eliminate bias in source data sets for knowledge-based modeling | PDBSelect, culled PDB databases
Structure alignment | To identify relationships among distantly related proteins that may have evolved beyond recognizable sequence similarity, while preserving structural similarity | CE, DALI, VAST
Molecular geometry analysis | To identify strained conformations or incorrectly represented regions in a structure model | PROCHECK, WHAT IF
Intramolecular contact analysis | To identify residue-residue interactions that may help identify active sites, structure-stabilizing features, etc. | CSU, HBPLUS
Solvent accessibility calculation | To identify amino acids that interact with a solvent | naccess, Alpha Shapes
Solvent modeling | To place a chemically realistic solvent shell around the molecule in preparation for some types of simulations; aids in understanding functional mechanism | HBUILD
Molecular surface visualization | To gain a visual understanding of molecular shape and chemical surface features | GRASP, GRASS server, SWISS-PDBViewer
Electrostatic potential calculation | To visualize the chemically important surface features of a protein, and as a preliminary step in pKa calculations, binding energy calculations, and Brownian dynamics simulations | UHBD, DelPhi
Protein pKa calculation | To model pH-dependent behavior of proteins, identify possible active sites, and identify residues in unusual chemical environments | UHBD, DelPhi



[*] The image in Figure 9-10 was contributed by Per J. Kraulis, from "MOLSCRIPT: A Program to Produce Both Detailed and Schematic Plots of Protein Structures," Journal of Applied Crystallography (1991), vol. 24, pp. 946-950.

[†] When rules for structure evaluation and optimization are derived from existing occurrences of patterns in a database, there is a trade-off between highly specific classification of occurrences and the size of the data set for each type of occurrence. The more data in the data set, the better the value of the rule is likely to be; however, the less specific the classification of occurrences, the less value the rule is likely to have for prediction. This is true not only of rotamer libraries but of PDFs and any other database-derived rules.
