Chapter 3. Setting Up Your Workstation

In this chapter, we discuss how to set up a workstation running the Linux operating system. Linux is a free, open source version of Unix that makes it possible to turn an ordinary PC into a powerful workstation. By configuring your system with Linux and other open source software, you can have access to a lot of powerful computational biology and bioinformatics tools at a low cost.

In writing this chapter, we encountered a bit of a paradox—in order to get around in Unix you need to have your computer set up, but in order to set up your computer you need to know a few things about Unix. If you don't have much experience with Unix, we strongly suggest that you look through Chapter 4 and Chapter 5 before you set up a Linux workstation of your own. If you're already familiar with the ins and outs of Unix, feel free to skip ahead to Chapter 6.

Working on a Unix System

You are probably accustomed to working with personal computers; you may be familiar with windows interfaces, word processors, and even some data-analysis packages. But if you want to use computers as a serious component in your research, you need to work on computer systems that run under Unix or related multiuser operating systems.

What Does an Operating System Do?

Computer hardware without an operating system is like a dead animal. It isn't going to react, it isn't going to function; it's just going to sit there and look at you with glassy eyes until it rots (or rusts). The operating system breathes life into the inert body of your computer. It handles the low-level processes that make the hardware work together and provides an environment in which you can run and develop programs. Most importantly, the operating system gives you convenient access to your files and programs.

Why Use Unix?

So if the operating system is something you're not supposed to notice, why worry about which one you're using? Why use Unix?

Unix is a powerful operating system for multiuser computer systems. It has been in existence for over 25 years, and during that time it has been used primarily in industry and academia, where networked systems and multiuser high-performance computers are required. Unix is optimized for tasks that are only fairly recent additions to personal-computer operating systems, or that still aren't available in some of them: networking with other computers, running multiple asynchronous tasks, retaining unique information about the work environment of each user, and protecting the information stored by individual users from other users of the system. Unix is also the operating system of the World Wide Web; the software that powers the Web was invented on Unix, and many if not most web sites are served from Unix machines.

Because Unix has been used extensively in universities, where much software for scientific data analysis is developed, you will find a lot of good-quality, interesting scientific software written for Unix systems. Computational biology and bioinformatics researchers are especially likely to have developed software for Unix, since until the mid-1990s the only workstations able to visualize protein structure data in real time were Silicon Graphics and Sun Unix workstations.

Unix is rich in commands and possibilities. Every distribution of Unix comes with a powerful set of built-in programs. Everything from networking software to word-processing software to electronic mail and news readers is already a part of Unix. Many other programs can be downloaded and installed on Unix systems for free.

It might seem that there's far too much to learn to make working on a Unix system practical. It's possible, however, to learn a subset of Unix and to become a productive Unix user without knowing or using every program and feature.

Different Flavors of Unix

Unix isn't a monolithic entity. Many different Unix operating systems are out there, some proprietary and some freely distributed. Most of the commands we present in this book work in the same way on any system you are likely to encounter.

Linux

Linux (LIH-nucks) is an open source version of Unix, named for its original developer, Linus Torvalds of the University of Helsinki in Finland. Originally undertaken as a one-man project to create a free Unix for personal computers, Linux has grown from a hobbyist project into a product that, for the first time, gives the average personal-computer user access to a Unix system.

In this book, we focus on Linux for three reasons. First, with the availability of Linux, Unix is cheap (or free, if you have the patience to download and install it). Second, under Linux, inexpensive PCs regarded as "obsolete" by Windows users become startlingly flexible and useful workstations. Linux can be configured to use far fewer system resources than mainstream personal-computer operating systems, so computers that have been outgrown by the ever-expanding system requirements of PC programs and operating systems can be given a new lease on life by being reconfigured to run Linux. Third, Linux is an excellent platform for developing software, so there's a rich library of tools available for computational biology and for research in general.

You may think that if you install Linux on your computer, you'll be pretty much on your own. It's a freeware operating system, after all. Won't you have to understand just about everything about Linux to get it configured correctly on your system? While this might have been true a few years ago, it isn't anymore. Hardware companies are starting to ship personal computers with Linux preinstalled as an alternative to the Microsoft operating systems, and a number of companies sell distributions of Linux at reasonable prices. Probably the best known of these is the Red Hat distribution. We should mention that we (the authors) run Red Hat Linux; most of our experience, and the examples in this book, are based on that distribution. If you purchase Linux from one of these companies, you get CDs that contain not only Linux but many other compatible free software tools, and you'll also have access to technical support for your installation.

Will Linux run on your computer?

Linux started out as a Unix-like operating system for PCs, but various Linux development projects now support nearly every available system architecture, including PCs of all types, Macintosh computers old and new, Silicon Graphics, Sun, Hewlett-Packard, and other high-end workstations and high-performance multiprocessor machines. So even if you're starting with a motley mix of old and new hardware, you can use Linux to create a multiworkstation network of compatible computers. See Section 3.2 for more information on installing and running Linux.

Other common flavors

There are many varieties (or "flavors") of Unix out there. The other common free implementation is the Berkeley Software Distribution (BSD), originally developed at the University of California, Berkeley. For the PC, there are a handful of commercial Unix implementations, such as The Santa Cruz Operation (SCO) Unix. Several workstation makers sell their own platform-specific Unix implementations with their computers, often with their own peculiarities and quirks. Most common among these are Solaris (Sun Microsystems), IRIX (Silicon Graphics), Digital Unix (Compaq), HP-UX (Hewlett-Packard), and AIX (IBM). This list isn't exhaustive, but it's probably representative of what you will find in most laboratories and computing centers.

Graphical Interfaces for Unix

Although Unix is a text-based operating system, you no longer have to experience it as a black screen full of glowing green or amber letters. Most Unix systems use a variant of the X Window System. The X Window System formats the screen environment and allows you to have multiple windows and applications open simultaneously. X windows are customizable so that you can use menu bars and other widgets much like PC operating systems. Individual Unix shells on the host machine as well as on networked machines are opened as windows, allowing you to exploit Unix's multitasking capabilities and to have many shells active simultaneously. In addition to Unix shells and tools, there are many applications that take advantage of the X system and use X windows as part of their graphical user interfaces, allowing these applications to be run while still giving access to the Unix command line.

The GNOME and KDE desktop environments, which are included in most major Linux distributions, make your Linux system look even more like a personal computer. Toolbars, visual file managers, and a configurable desktop replicate the feeling of a Windows or Mac work environment, except that you can also open a shell window and run Unix programs.

Setting Up a Linux Workstation

If you are already using an existing Unix/Linux system, feel free to skip this section and go directly to the next.

If you are used to working with Macintosh or PC operating systems, the simplest way to set up a Linux workstation or server is to go out and buy a PC that comes with Linux preinstalled. VA Linux, for example, offers a variety of Intel Pentium-based workstations and servers preconfigured with your choice of several of the most popular Linux distributions.

If you're looking for a complete, self-contained bioinformatics system, Iobion Systems (http://www.iobion.com) is developing Iobion, a ground-breaking bioinformatics network server appliance developed using open source technologies. Iobion is an Intel-based hardware system that comes preinstalled with Linux, Apache web server, a PostgreSQL relational database, the R statistical language, and a comprehensive suite of bioinformatics tools and databases. The system serves these scientific applications to web clients on a local intranet or over the Internet. The applications include tools for microarray data analysis complete with a microarray database, sequence analysis and annotation tools, local copies of the public sequence databases, a peer-to-peer networking tool for sharing biological data, and advanced biological lab tools. Iobion promotes and adheres to open standards in bioinformatics.

If you already have a PC, your next choice is to buy a prepackaged version of Linux, such as those offered by Red Hat, Debian, or SuSE. These prepackaged distributions have several advantages: they provide an easy-to-use graphical interface for installing Linux; the software they include is packed into package-manager archives (for Red Hat, the Red Hat Package Manager, or RPM, format) or similar easily extracted formats; and they often contain a large number of "extras" that are easier to install from the distribution disk with a package manager than they would be to install by hand.

That said, let's assume you've gone out and bought something like the current version of Red Hat. You'll be asked if you want to do a workstation installation, a server installation, or a custom installation. What do these choices mean?

Your Linux machine can easily be set up to do some things you may not be used to doing with a PC or Macintosh. You can set up a web server on your machine, and if you dig a little deeper into the manuals, you can find out how to give each user of your machine a place to create his own web page. You can set up an anonymous FTP server so that guests can FTP in to pick up copies of files you choose to make public. You can set up an NFS server to allow directories you choose to be mounted on other machines. These are just some options that set a server apart from a workstation.

If you are inexperienced in Unix administration, you probably want to set up your first Linux machine as a workstation. With a workstation setup, you can access the Internet, but your machine can't provide any services to outside users (and you aren't responsible for maintaining these services). If you're feeling more adventurous, you can do a custom installation. This allows you to pick and choose the system components you want, rather than taking everything the installer thinks you may want.

Installing Linux

We can't possibly tell you everything you need to know to install and run Linux. That's beyond the scope of this book. There are many excellent books on the market that cover all possible angles of installing and running Linux, and you can find a good selection in this book's Bibliography. In this section, we simply offer some advice on the more important aspects of installation.

System requirements

Linux runs on a range of PC hardware combinations, but not all possible combinations. There are certain minimum requirements. For optimum performance, your PC should have an 80486 processor or better. Most Linux users have systems that use Intel chips. If your system doesn't, you should be aware that while Linux does support a few non-Intel processors, there is less documentation to help you resolve potential problems on those systems.

For optimum performance your system should have at least 16 MB of RAM. If you're planning to run X, you should seriously consider installing more memory—perhaps 64 MB. X runs well on 16 MB, but it runs more quickly and allows you to open more windows if additional memory is available.

If you plan to use your Linux system as a workstation, you should have at least 600 MB of free disk space. If you want to use it as a server, you should allow 1.6 GB of free space. You can never have too much disk space, so if you are setting up a new system, we recommend buying the largest hard drive possible. You'll never regret it.

In most cases the installation utility that comes with your distribution can determine your system configuration automatically, but if it fails to do so, you must be prepared to supply the needed information. Table 3-1 lists the configuration information you need to start your installation.

Table 3-1. Configuration Information Needed to Install Linux

Device / Information needed

Hard drive(s)

  • The number, size, and type of each hard drive

  • Which hard drive is first, second, and so on

  • Which adapter type (IDE or SCSI) is used by each drive

  • For each IDE drive, whether the BIOS is set to LBA mode

RAM

  • The amount of installed RAM

CD-ROM drive(s)

  • Which adapter type (IDE, SCSI, other) is used by each drive

  • For each drive using a non-IDE, non-SCSI adapter, the make and model of the drive

SCSI adapter (if any)

  • The make and model of the card

Network adapter (if any)

  • The make and model of the card

Mouse

  • The type (serial, PS/2, or bus)

  • The protocol (Microsoft, Logitech, MouseMan, etc.)

  • The number of buttons

  • For a serial mouse, the serial port to which it's connected

Video adapter

  • The make and model of the card

  • The amount of video RAM

To obtain information, you may need to examine your system's BIOS settings or open the case and look at the installed hardware. Consult your system documentation or your system administrator to learn how to do so.

Here are three of the more popular Linux distributions:

  • Red Hat (http://www.redhat.com/support/hardware/)

  • Debian (http://www.debian.org/doc/FAQ/ch-compat.html)

  • SuSE (http://www.suse.com)

All have well-organized web sites with information about the hardware their distributions support. Once you've collected the information in Table 3-1, take a few minutes to check the appropriate web site to see if your particular PC hardware configuration is supported.

Partitioning your disk

Linux runs most efficiently on a partitioned hard drive. Partitioning is the process of dividing your disk up into several independent sections. Each partition on a hard drive is a separate filesystem. Files in one filesystem are to some extent protected from what goes on in other filesystems. If you download a bunch of huge image files, you can fill up only the partition in which your home directories live; you can't make the machine unusable by filling up all the available space for essential system functions. And if one partition gets corrupted, you can sometimes fix the problem without reformatting the entire drive and losing data stored in the other partitions.

When you start a Red Hat Linux installation, you need the Linux boot disk in your floppy drive and the Linux CD-ROM in your CD drive. When you turn the computer on, you almost immediately encounter an installation screen that offers several installation mode options. At the bottom of the screen, there is a boot: prompt. Generally, you should just hit the Enter key; however, if you're using a new model of computer, especially a laptop, you may want to type the word text and then press Enter to choose a text-mode installation, in case your video card isn't supported by the current Linux release.
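
For example, forcing a text-mode installation at the prompt looks like this:

boot: text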

Click through the next few screens, selecting the appropriate language and keyboard. You'll come to a point at which you're offered the option of selecting a GNOME workstation, a KDE workstation, a server, or a custom installation. At this point, you can just choose one of the single user workstation options, and you're essentially done. However, we suggest doing a custom installation to allow you greater control over what is installed on your computer and where it's installed.

If you have a single machine that's not going to be interacting with other machines on the network, you can probably get away with putting the entire Linux installation into one big filesystem, if that's what you want. But if you're setting up a machine that will, for instance, share software in its /usr/local directory with all the other machines in your lab, you'll want to do some creative partitioning.

On any given hard disk, you can have up to four primary partitions; one of these slots can instead hold an extended partition, within which you can create as many logical subpartitions as you like. Red Hat and other commercial Linux distributions have simple graphical interfaces that walk you through partitioning and formatting your hard disk. More advanced users can use the fdisk program to achieve precise partitioning. Refer to one of the "Learning Linux" books we recommend in the Bibliography for an in-depth discussion of partitioning and how to use the fdisk program.
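
As an illustration only (the sizes and mount points are ours, not a recommendation from any particular distribution), a partitioning scheme for a lab workstation might look something like this:

/boot         about 50 MB            kernel images and boot files
swap          about twice your RAM   swap space
/             1-2 GB                 the root filesystem: system software and configuration
/usr/local    as large as possible   locally installed software, shared with other machines
/home         whatever remains       users' home directories and data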

Selecting major package groupings

After you've set up partitions on your disk, chosen mount points for your partitions, and completed a few other configuration steps, you need to pick the packages to install.

First, go through the Package Group Selection list. You'll definitely need printer support; the X Window System; either the GNOME or KDE desktop (we like KDE); mail, web, and news tools; graphics manipulation tools; multimedia support; utilities; and networked workstation support. If you'll be installing software (and you will), you need a number of items in the development package group (C, FORTRAN, and other compilers come in handy, as do some development libraries). You may also want to install the Emacs text editor and the authoring/publishing tools. Depending on where you use your system from, you may need dial-up workstation support.

The rest of the package groups add server functionality to your machine. If you want your machine to function as a web server, add the web server package group. If you want to make some of the directories on your machine available for NFS mounting, choose the NFS server group. If you plan to create your own databases, you may want to set up your machine as a PostgreSQL server. Generally, if you have no idea what it is or how you'd use it, you probably don't need to install it at this point.

If you're concerned about running out of space on your machine, you can now sift through the contents of each package grouping and get rid of software you won't be using. For example, the "Mail, Web and News" package grouping contains many different types of software for reading email and newsgroups. Don't install it all; just pick your favorite package and get rid of the rest. (In case you're wondering what to choose, here's a hint: it's very easy to configure the Netscape browser to do all the mail and news reading you'll need.) If you're installing a Red Hat system, check under "Applications/Editors" and make sure you have the vim editor selected; in "Applications/Engineering," select gnuplot; and in "Applications/Publishing," select enscript. Don't worry if you skip something at the beginning and find you need it later; it's pretty easy to install additional packages afterward.

Other useful packages to add

Once you've done a basic Linux installation on your machine, you can add new packages easily using the kpackage command (if you're using the KDE desktop environment) or gnorpm (if you are using GNOME).

In order to compile some of the software we'll be discussing in the next few chapters, and to expand the functionality of your Linux workstation, you may want to install some of the following tools. The first set of tools is from the Red Hat Linux Power Tools CD:

R

A powerful system for statistical computation and graphics. It's based on S and consists of a high-level language and a runtime environment.

OpenGL/Mesa

A 3D graphics library (Mesa is an open source implementation of the OpenGL API) that improves the performance of some molecular visualization software.

LessTif

A widget set for application development. You might not use it directly, but it's used when you compile some of the software discussed later in this book. Install at least the main package and the client package.

Xbase

Another widget set.

MySQL

A database server for smaller data sets. It's useful if you're just starting to build your own databases.

octave

A MatLab-like high-level language for numerical computations.

xv

A multipurpose image-editing and conversion tool.

xemacs

A powerful X Windows-based editor with special extensions for editing source code.

plugger

A generic Netscape plug-in that supports many formats.

You can download from the Web and install the following tools:

JDK/JRE (http://java.sun.com)

A Java Development Kit and Java Runtime Environment are needed if you want to use Java-based tools such as the Jalview sequence editor we discuss in Chapter 4. They are freely available for Linux from IBM, Sun, and Blackdown (http://blackdown.org). Blackdown also offers a Java plug-in for Netscape, which is required to run some of the applications we discuss.

NCBI Toolkit (ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools/README.htm)

A software library for developers of biology applications. It's required in order to compile some software originating at NCBI.

StarOffice (http://www.staroffice.com)

A comprehensive office productivity package freely available from Sun Microsystems. It replaces most or all functionality of Microsoft Office and other familiar office-productivity packages.

How to Get Software Working

You've gone out and done the research and found a bioinformatics software package you want to install on your own computer. Now what do you do?

When you look for Unix software on the Web, you will find that it's distributed in a number of different formats. Each type of software distribution requires a different type of handling. Some are very simple to install, almost like installing software on a Mac or PC. On the other hand, some software is distributed in a rudimentary form that requires your active intervention to get it running. In order to get this software working, you may have to compile it by hand or even modify the directions that are sent to the compiler so that the program will work on your system. Compiling is the process of converting software from its human-readable form, source code, to a machine-readable executable form. A compiler is the program that performs this conversion.

Software that's difficult to install isn't necessarily bad software. It may be high-quality software from a research group that doesn't have the resources to produce an easy-to-use installation kit. While this is becoming less common, it's still common enough that you will need to know some things about compiling software.

Unix tar Archives

Software is often distributed as a tar archive, which is short for "tape archive." We discuss tar and other file-compression options in more detail in Chapter 5. Not coincidentally, these archives are one of the most common ways to distribute Unix software on the Internet. tar allows you to download one file that contains the complete image of the developer's working software installation and unpack it right back into the correct subdirectories. If tar is used with the p option, file permissions can even be preserved. This ensures that, if the developer has done a competent job of packing all the required files in the tar archive, you can compile the software relatively easily.

tar archives are often compressed further using either the Unix compress command (indicated by a .tar.Z extension) or with gzip (indicated by a .tar.gz or .tgz extension).
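
As a quick sketch (the archive name here is a placeholder), unpacking one of these distributions usually amounts to the following:

gunzip package.tar.gz        # use uncompress instead for a .tar.Z archive
tar xvpf package.tar         # x extracts, v lists files as they are unpacked, p preserves permissions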

Binary Distributions

Software can be distributed either as uncompiled source code or binaries. If you have a choice, and if you don't know any reason to do otherwise, choose the binary distribution. It will probably save you a lot of headaches.

Binary software distributions are precompiled and (at least in theory) ready to run on your machine. When you download software that is distributed in binary form, you will have a number of options to choose from. For example, the following listing shows the contents of the public FTP site for the BLAST sequence alignment software. There are several archives available, each for a different operating system; if you're going to run the software on a Linux workstation, download the file blast.linux.tar.Z.

README.bls               52 Kb    Wed Jan 26 18:45:00 2000  
blast.alphaOSF1.tar.Z    12756 Kb Wed Jan 26 18:40:00 2000     Unix Tape Archive 
blast.hpux11.tar.Z       11964 Kb Wed Jan 26 18:43:00 2000     Unix Tape Archive 
blast.linux.tar.Z        9334 Kb  Wed Jan 26 18:41:00 2000     Unix Tape Archive 
blast.sgi.tar.Z          14746 Kb Wed Jan 26 18:44:00 2000     Unix Tape Archive 
blast.solaris.tar.Z      12724 Kb Wed Jan 26 18:37:00 2000     Unix Tape Archive 
blast.solarisintel.tar.Z 10679 Kb Wed Jan 26 18:43:00 2000     Unix Tape Archive 
blastz.exe               3399 Kb  Wed Jan 26 18:44:00 2000     Binary Executable

Here are the basic binary installation steps (a sample command sequence follows the list):

  1. Download the correct binaries. Be sure to use binary mode when you download. Download and read the instructions (usually a README or INSTALL file).

  2. Follow the instructions.

  3. Make a new directory and move the archive into it, if necessary.

  4. Use uncompress (for *.Z files) or gunzip (for *.gz files) to uncompress the archive.

  5. Use tar tf to examine the contents of the archive and tar xvf to extract it.

  6. Run configuration and installation scripts, if present.

  7. Link the binary into a directory in your default path using ln -s, if necessary.
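
Putting these steps together for the blast.linux.tar.Z archive shown above, the command sequence might look something like this (the installation directory and the name of the linked executable, blastall, are just examples; check the README for the actual layout):

mkdir /usr/local/blast                     # make a new directory for the package
mv blast.linux.tar.Z /usr/local/blast      # move the downloaded archive into it
cd /usr/local/blast
uncompress blast.linux.tar.Z               # .Z archives are uncompressed with uncompress
tar tf blast.linux.tar                     # examine the contents of the archive first
tar xvf blast.linux.tar                    # then extract it
ln -s /usr/local/blast/blastall /usr/local/bin/blastall   # link a binary into your path, if necessary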

RPM Archives

RPM archives are a newer kind of Unix software distribution that has recently become popular. These archives can be unpacked using the command rpm. The Red Hat Package Manager program is included in Red Hat Linux distributions and is automatically installed on your machine when you install Linux. It can also be downloaded freely from http://www.rpm.org and used on any Linux or other Unix system. rpm maintains a database of installed software on your machine, simplifies installations and updates, and even allows you to create RPM archives of your own. RPM archives come in either source or binary form, but aside from the question of selecting the right binary, the installation is equally simple either way.

(As we introduce commands, we'll show you the format of the command line for each command, for example "Usage: man name", and describe the effects of some options we find most useful.)

Usage: rpm --[options] *.rpm

Here are the important rpm options (sample command lines follow the list):

rebuild

Builds a package from a source RPM

install

Installs a new package from a binary RPM

upgrade

Upgrades existing software

uninstall (or erase)

Removes an installed package

query

Checks to see if a package is installed

verify

Checks information about installed files in a package
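
To make the options above concrete, here is a rough sketch of some typical command lines; the package names are placeholders, and most of these options also have short forms (rpm -i, -U, -e, -q, and -V):

rpm --install package-1.0.i386.rpm     # install a new package from a binary RPM
rpm --upgrade package-1.1.i386.rpm     # upgrade an already-installed package
rpm --query package                    # check whether the package is installed
rpm --verify package                   # check the installed files against the package database
rpm --erase package                    # remove the installed package
rpm --rebuild package-1.0.src.rpm      # build a binary package from a source RPM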

GnoRPM

Recent versions of Linux that include the GNOME user interface also include an interactive installation tool called GnoRPM, which can be accessed from the System folder in the main GNOME menu. To install software from a CD-ROM with GnoRPM, simply insert and mount the CD-ROM and click the Install button; GnoRPM then presents a selectable list of every package on the CD-ROM that you haven't already installed. You can also update packages with GnoRPM, or uninstall them, in which case GnoRPM ensures that the entire package is cleanly removed from your system. GnoRPM informs you if there are package dependencies that require you to download code libraries or other software before completing the installation.

Source Distributions

Sometimes the correct binary isn't available for your system, there's no RPM archive, and you have no choice but to install from source code.

Source distributions can be easy or hard to install. The easy ones come with a configuration script, an install script, and a Makefile for your operating system that holds the instructions to the compiler.

An example of an easy-to-install package is the LessTif source code distribution. LessTif is an open source version of the OSF/Motif toolkit and window manager software. Motif was developed for high-end workstations and costs a few thousand dollars a year to license; LessTif supports many Motif applications (such as the multiple sequence alignment package ClustalX and the useful 2D plotting package Grace) for free. When the LessTif distribution is unpacked, it looks like this:

AUTHORS        KNOWN_BUGS      acconfig.h    configure    ltmain.sh
BUG-REPORTING  Makefile        acinclude.m4  configure.in make.out
COPYING        Makefile.am     aclocal.m4    doc          missing
COPYING.LIB    Makefile.in     clients       etc          mkinstalldirs
CREDITS        NEWS            config.cache  include      scripts
CURRENT_NOTES  NOTES           config.guess  install-sh   test
CVSMake        README          config.log    lib          test_build
ChangeLog      RELEASE-POLICY  config.status libtool 
INSTALL        TODO            config.sub    ltconfig

Configuration and installation of LessTif on a Linux workstation is a practically foolproof process. As the superuser, move the source tar archive to the /usr/local/src directory, then uncompress and extract the archive. Inside the directory that is created (lesstif or lesstif-0.89, for example), enter ./configure. The configuration script will take a while to run; when it's done, enter make. Compilation will take several minutes. At the end, edit the file /etc/ld.so.conf, add the line /usr/lesstif/lib, save the file, and then run ldconfig -v to make the shared LessTif libraries available on your machine.
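
Assuming the archive is named lesstif-0.89.tar.gz and unpacks into a directory called lesstif-0.89 (the exact names will vary from release to release), the whole procedure looks roughly like this:

mv lesstif-0.89.tar.gz /usr/local/src       # move the source archive into place
cd /usr/local/src
gunzip lesstif-0.89.tar.gz                  # uncompress the archive
tar xvf lesstif-0.89.tar                    # extract it
cd lesstif-0.89
./configure                                 # build Makefiles tailored to your system
make                                        # compile; this takes several minutes
echo /usr/lesstif/lib >> /etc/ld.so.conf    # tell the dynamic linker where the new libraries live
ldconfig -v                                 # rebuild the shared library cache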

Complex software such as LessTif is assembled from many different source code modules. The Makefile tells the compiler how to put them together into one large executable. Other programs are simple: they have only one source code file and no Makefile, and they are compiled with a one-line directive to the compiler. You should be able to tell which compiler to use by the extension on the program filename. C programs are often labeled *.c, FORTRAN programs *.f, etc. To compile a C program, enter gcc program.c -o program; for a FORTRAN program, the command is g77 program.f -o program. The manpages for the compilers, or the program's documentation (if there is any) should give you the form and possible arguments of the compiler command.

Compilers convert human-readable source code into machine-readable binaries. Each programming language has its own compilers and compiler instructions. Some compilers are free; others are commercial. The compilers you will encounter on Linux systems are gcc, the GNU Project C and C++ compiler, and g77, the GNU Project FORTRAN compiler.[*] In computational biology and bioinformatics, you are likely to encounter programs written in C, C++, FORTRAN, Perl, and Java. Use of other languages is relatively rare. Compilers or interpreters for all these languages are available in open source distributions.

Difficult-to-install programs come in many forms. One of the main problems you may encounter will be source code with dependencies on code libraries that aren't already installed on your machine. Be sure to check the documentation or the README file that comes with the software to determine whether additional code or libraries are required for the program to run properly.

An example of an undeniably useful program that is somewhat difficult to install is ClustalX, the X windows interface to the multiple sequence alignment program ClustalW. In order to install ClustalX successfully on a Linux workstation, you first need to install the NCBI Toolkit and its included Vibrant libraries. In order to create the Vibrant libraries, you need to install the LessTif libraries and to have XFree86 development libraries installed on your computer.

Here are the basic steps for installing any package from source code (an example of the simplest case follows the list):

  1. Download the source code distribution. Use binary mode for the transfer; compressed archives are binary files, and an ASCII-mode transfer will corrupt them.

  2. Download and read the instructions (usually a README or INSTALL file; sometimes you have to find it after you extract the archive).

  3. Make a new directory and move the archive into it, if necessary.

  4. Use uncompress (for *.Z files) or gunzip (for *.gz files) to uncompress the file.

  5. Extract the archive using tar xvf or as instructed.

  6. Follow the instructions (did we say that already?).

  7. Run the configuration script, if present.

  8. Run make if a Makefile is present.

  9. If a Makefile isn't present and all you see are *.f or *.c files, use gcc or g77 to compile them, as discussed earlier.

  10. Run the installation script, if present.

  11. Link the newly created binary executable into one of the binary-containing directories in your path using ln -s (this is usually part of the previous step, but if there is no installation script, you may need to create the link by hand).
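
For the simplest case in step 9, a program that consists of a single C source file (hello.c is just a placeholder name), compilation is a one-line affair:

gcc hello.c -o hello       # compile the C source into an executable named hello
./hello                    # run the new program from the current directory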

Perl Scripts

The Perl language is widely used to develop web applications and is a favorite of computational biologists. Perl programs (called scripts) have the extension *.pl (or *.cgi if they are web applications). Perl is an interpreted language; in other words, Perl programs don't have to be compiled in order to run. Instead, each command in a Perl script is sent to a program called the Perl interpreter, which executes the commands.[†]

To run Perl programs, you need to have the Perl interpreter installed on your machine. Most Linux distributions contain and automatically install Perl. The most recent version of Perl can always be obtained from http://www.perl.com, along with plenty of helpful information about how to use Perl in your own work. We discuss some of the basic elements of Perl in Chapter 12.
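
To check that Perl is available on your system and to run a script (myscript.pl here is a hypothetical name), you might do something like the following:

perl -v                  # print the version of the installed Perl interpreter
perl myscript.pl         # run a script explicitly through the interpreter
chmod +x myscript.pl     # or make the script executable...
./myscript.pl            # ...and run it directly, provided it begins with a #!/usr/bin/perl line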

Putting It in Your Path

When you give a command, the default path or lookup path is where the system expects to find the program (which is also known as the executable). To make life easier, you can link the binary executable created when you compile a program to a directory like /usr/local/bin, rather than typing the full pathname to the program every time you run it. If you're linking across filesystems, use the command ln -s (which we cover in Chapter 4) to link the command to a directory of executable files. Sometimes this results in the error "too many levels of symbolic links" when you try to run the program. In that case, you have to access the executable directly or use mv or cp to move the actual executable file into the default path. If you do this, be sure to also move any support files the program needs, or create a link to them in the directory in which the program is being run.
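
For instance, to link a newly compiled executable into /usr/local/bin and confirm that your shell can find it (the paths here are purely illustrative):

ln -s /usr/local/src/program/program /usr/local/bin/program
which program            # should now report /usr/local/bin/program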

Some software distributions automatically install their executables in an appropriate location. The command that usually does this is make install. Be sure to run this command after the program is compiled. For more information on symbolic linking, refer to one of the Unix references listed in the Bibliography, or consult your system administrator.

Sharing Software Among Multiple Users

Before you start installing software on a Unix system, one of the first things to do is to find out where shared software and data are stored on your machines. It's customary to install local applications in /usr/local, with executable files in /usr/local/bin. If /usr/local is set up as a separate partition on the system, it then becomes possible to upgrade the operating system without overwriting local software installations.

Maintaining a set of shared software is a good idea for any research group. Installation of a single standard version of a program or software package by the system administrator ensures that every group member will be using software that works in exactly the same way. This makes troubleshooting much easier and keeps results consistent. If one user has a problem running a version of a program that is used by everyone in the group, the troubleshooting focus can fall entirely on the user's input, without muddying the issue by trying to figure out whether a local version of the program was compiled correctly.

For the most part, it's unnecessary for each user of a program to have her own copy of that program residing in a personal directory. The main exception to this is if a user is actually modifying a program for her own use. Such modifications should not be applied to the public, standard version of the program until they have been thoroughly tested, and therefore the user who is modifying the program needs her own version of the program source and executable.

What Software Is Needed?

New computational biology software is always popping up, but through a couple of decades of collective experience, a consensus set of tools and methods has emerged. Many scientists are familiar with standard commercial packages for sequence analysis, such as GCG, and for protein structure analysis, such as Quanta or Insight. For beginners, these packages provide an integrated interface to a variety of tools.

Commercial software packages for sequence analysis integrate a number of functions, including mapping and fragment assembly, database searching, gene discovery, pairwise and multiple sequence analysis, motif identification, and evolutionary analysis. One caveat is that these software packages can be prohibitively expensive. It can be difficult, especially for educational institutions and research groups on a limited budget, to purchase commercial software and pay the annual costs for license maintenance (which can be in the many thousands of dollars).

A related cost issue is that many commercial software packages, especially those for macromolecular structure analysis, don't yet run on consumer PCs. These packages were originally developed for high-end workstations when these workstations were the only computers with sufficient graphics capability to display protein structures. Although these days most home computers have high-powered graphics cards, the makers of commercial molecular modeling software have been slow to keep up.

While commercial computational biology software packages can be excellent and easy to use, they often seem to lag at least a couple of years behind cutting-edge method development. The company that produces a commercial software package usually commits to only one method for each type of tool, buys it at a particular phase in its development cycle, focuses on turning it into a commercially viable product, and may not incorporate developments in the method into their package in a timely fashion, or at all.

On the other hand, while academic software is usually on the cutting edge, it can be poorly written and hard to install. Documentation (beyond the published paper that describes the software) may be nonexistent. Graphical user interfaces in academic software packages are often rudimentary, which can be aggravating for the beginning user.

With this book, we've taken the "science on a shoestring" approach. In Chapter 6, Chapter 7, Chapter 9, Chapter 10, and Chapter 11 we've compiled quick-reference tables of fundamental techniques and free software applications you can use to analyze your data. Hopefully, these will help you to know what you need to do, how to seek out the tools that do it, and how to put them both together in the way that best suits your needs. This approach keeps you independent of the vagaries of the software industry and in touch with the most current methods.



[*] The GNU project is a collaborative project of the Free Software Foundation to develop a completely open source Unix-like operating system. Linux systems are, formally, GNU/Linux systems, as they can be distributed under the terms of the GNU General Public License (GPL), the license developed by the GNU project.

[†] There is now a Perl compiler, which can optionally be used to create binary executables from Perl scripts. This can speed up execution.
